UC Berkeley UC Berkeley Electronic Theses and Dissertations

Title Networks of Splice Factor Regulation by Unproductive Splicing Coupled With NMD

Permalink https://escholarship.org/uc/item/4md923q7

Author Desai, Anna

Publication Date 2017

Peer reviewed|Thesis/dissertation

Networks of Splice Factor Regulation by Unproductive Splicing Coupled With NMD

by

Anna Maria Desai

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Comparative Biochemistry

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:
Professor Steven E. Brenner, Chair
Professor Donald Rio
Professor Lin He

Fall 2017

Abstract

Networks of Splice Factor Regulation by Unproductive Splicing Coupled With NMD

by

Anna Maria Desai

Doctor of Philosophy in Comparative Biochemistry

University of California, Berkeley

Professor Steven E. Brenner, Chair

Virtually all multi-exon genes undergo alternative splicing (AS) to generate multiple isoforms. Alternative splicing is regulated by splicing factors, such as the serine/arginine-rich (SR) protein family and the heterogeneous nuclear ribonucleoproteins (hnRNPs). Splicing factors are essential and highly conserved. Splicing factors have been shown to modulate alternative splicing of their own transcripts and of transcripts encoding other splicing factors. However, the extent of this regulation has not yet been determined. I hypothesize that the splicing factor network extends to many SR and hnRNP genes, and is regulated by alternative splicing coupled to the nonsense-mediated mRNA decay (NMD) surveillance pathway.

The NMD pathway has a role in preventing accumulation of erroneous transcripts with dominant negative phenotypes. During the pioneer round of translation, NMD recognizes mRNA transcripts with in-frame premature termination codons (PTCs) and degrades them. Generally, NMD is thought to play a protective role by degrading transcripts that may generate truncated proteins that can be non-functional or deleterious. The NMD pathway also has physiological targets: it impacts gene expression through alternative splicing coupled with NMD. In this mode of regulation, high levels of one splicing factor cause target pre-mRNAs to be spliced into unproductive isoforms and degraded, resulting in lower levels of the spliced RNAs. Interestingly, many splicing factors undergo this mode of regulation. For example, SR proteins SRSF1, SRSF2, SRSF3, and SRSF7 are known to auto-regulate their own expression by coupling alternative splicing and NMD. In addition, splice factors hnRNP L and PTB are regulated in the same manner. Evidence also exists that splicing factors cross-regulate each other via NMD. Since all 12 canonical human SR factors and many hnRNP factors have at least one isoform that contains an evolutionarily conserved in-frame PTC, it is possible that this mode of gene regulation extends to all SR splicing factors, many hnRNP factors, and even beyond, forming a regulatory network that is dependent upon NMD.

Approximately 18% of expressed genes are reported to be natural targets of NMD, yet it remains unclear why the cell would express mRNAs that are immediately degraded by the NMD pathway. It is especially intriguing that splicing factors, which are responsible for the entire proteomic diversity, are enriched in this pool of natural NMD targets. To date, there has been no comprehensive and systematic study of human splicing factors and their role in genome-wide gene regulation via NMD.

Regulation via alternative splicing coupled to NMD requires binding of a splicing factor to the regulated mRNA. CLIP-seq and related studies reveal that splicing factors bind abundantly to all transcripts of our selected 100 splicing factors. In collaboration with Arun Desai, I characterized the network of protein-RNA interactions between splicing factors. I find that splicing factors form a highly connected network, where 30-60% of all possible interactions between splicing factors and the transcripts encoding splicing factors are observed. Dr. Zhiqiang Hu and I compared the hierarchy of splicing factors to the hierarchy of transcription factors. Dr. Hu calculated hierarchies of transcription and splicing factors using ENCODE ChIP-seq and eCLIP data, applying a hierarchy metric described in Gerstein et al. (Nature 2012 489:91-100). Our limited data show that the hierarchy among splicing regulators is different from that of transcription factors. Gerstein et al. plot networks in 3 layers, with a top “executive” layer, a bottom under-regulation layer, and a middle layer in between. Unlike transcription factors, which concentrate at the extremes of the hierarchy metric, splicing factors form a hierarchical network that has a nearly uniform distribution of proteins across the hierarchy metric and thus a less clearly defined separation into the three distinct layers. Nearly all splicing factors that bind their own transcripts are found in the middle layer.

Dr. Courtney French, Dr. Hu, and I combined experimental data and a model for NMD mechanism to identify targets of NMD. I inhibited NMD in HeLa and GM12878 cells via knockdown of UPF1 and SMG6, two core NMD factors, and by exposure to cycloheximide (CHX). Dr. French and Dr. Hu performed RNA-seq data analysis for targets of NMD. We observed that NMD factor knockdown is likely a better method to identify NMD targets than the CHX treatment. We found that approximately 30% of NMD isoforms are shared between HeLa and GM12878, while the remainder are not substantially expressed in the other cell line.

CHAPTER 1 NETWORK OF SPLICE FACTOR REGULATION BY UNPRODUCTIVE SPLICING
    ABSTRACT
    INTRODUCTION
    RESULTS
    DISCUSSION
    MATERIALS AND METHODS
CHAPTER 2 TRANSCRIPTOME-WIDE IDENTIFICATION OF POTENTIAL RUST TARGETS REVEALS EXTENSIVE REDUNDANCY BETWEEN HELA AND GM12878
    ABSTRACT
    INTRODUCTION
    RESULTS
    DISCUSSION
    MATERIALS AND METHODS
CHAPTER 3 REFERENCES

List of Figures
Figure 1.1. Experimentally proven RUST network.
Figure 1.2. Splicing factor-mRNA interaction network.
Figure 1.3. Splicing factor-mRNA interaction network extended to 100 splicing regulators.
Figure 1.4. Comparison of hierarchies of TF-TF network and SF-SF network in K562 cell line.
Figure 1.5. Hierarchies of TF-TF network in GM12878 cell line and SF-SF network in HepG2 cell line.
Figure 1.6. All evaluated splicing factors bind transcripts of other splicing factors more prevalently than transcripts of other genes.
Figure 2.1. Experimental validation of SMG6 and UPF1 knockdown in HeLa and GM12878 cells by qPCR and western blots.
Figure 2.2. Validation of NMD inhibition through SRSF6 isoform expression alterations.
Figure 2.3. Impact of SMG6/UPF1 knockdown and CHX treatment on PTC50 and non-PTC50 isoforms.
Figure 2.4. Differentially expressed PTC50 and non-PTC50 isoforms upon NMD inhibition in HeLa and GM12878 cells.
Figure 2.5. Expression changes of genes which have previously been identified as NMD sensitive.
Figure 2.6. Expression changes of genes which have previously been identified as NMD sensitive.
Figure 2.7. Expression changes of genes which have previously been identified as NMD sensitive.
Figure 2.8. Gene expression alterations of NMD factors upon SMG6 knockdown in GM12878 and HeLa.
Figure 2.9. Major isoform expression alterations of NMD factors upon SMG6 knockdown in GM12878 and HeLa.
Figure 2.10. Comparison of NMD targeted isoforms identified in GM12878 and HeLa cells.
Figure 2.11. Comparison of expression of NMD targeted genes and isoforms in GM12878 and HeLa cells.
Figure 2.12. Detection of stringent cell line-specific NMD targets.
Figure 2.13. Example of mis-identified cell line specific NMD target.
Figure 2.14. Gene Ontology and KEGG pathway enrichment of genes with NMD targets shared in GM12878 and HeLa cells.
Figure 2.15. Examples of NMD target common to GM12878 and HeLa cells.
Figure 2.16. Significant difference between proportions of NMD targets and non-NMD targets expressed and identified in GM12878 and HeLa.
Figure 2.17. Gene Ontology Enrichment Analysis of cell line specific expressed NMD targets.
Figure 2.18. Examples of NMD target unique to GM12878 or HeLa cells.

List of Tables
Table 1.1. RUST regulations collected from literature.
Table 1.2. Literature collected eCLIP studies.
Table 1.3. Literature collected CLIP-seq and related studies.
Table 1.4. Overlap of RUST and CLIP.
Table 1.5. eCLIP data used to plot Figure 1.6.
Table 1.6. eCLIP accession codes.
Table 2.1. Experimental approaches employed to achieve strong transient gene knockdown in GM12878 cells.
Table 2.2. Spearman and Pearson correlation coefficients of isoform expression ratio (control/NMD inhibition) between different treatments in HeLa and GM12878 cells.
Table 2.3. In-depth investigation of the detected stringent cell line specific NMD targets.
Table 2.4. Summary of RNA-seq data for HeLa.
Table 2.5. Summary of RNA-seq data for GM12878.

Acknowledgements

My doctoral training would not have been possible without the support of many individuals. I would like to thank the faculty of the Comparative Biochemistry Department. Professor Jack Kirsch deserves a special thank you for the help and guidance he has given me. This dissertation would not have been possible without the supervision of Professor Steven Brenner. I am grateful for the support of Professor Fenyong Liu and Professor Lin He.

I would like to thank the faculty of the Molecular and Cell Biology Department for their tutelage. I am indebted to Professor Donald Rio who provided invaluable support.

I have been fortunate to work with my wonderful lab mates. I want to thank all members of the Brenner and Rio labs for providing me with mentorship and guidance.

In particular, I would like to thank Courtney French and James Lloyd for their analyses, discussions, and revisions. Special thank you to Zhiqiang Hu for data analyses and manuscript revisions. I would also like to thank Tiffany Cheng, Max Shatsky, and Eric Odell for making the lab an enjoyable place to work and learn.

Many people introduced me to various molecular biology techniques. Special thanks to Yeon Lee for teaching me how to electroporate cells, for sharing materials and protocols, and for offering technical help when needed. Thank you to Gang Wei for showing me how to extract RNA and prepare an RNA-seq library. Thank you to Chandani Limbad for showing me how a Western blot is done. Thank you to Amita Gorur for introducing me to microscopy. Thank you to Marilyn Kobayashi and Setsuko Wakao for teaching me the principles of qRT-PCR and for assistance with chemical supplies and protocols. I would like to thank the staff of the Cell Culture Facility for their technical support. I would like to single out Ann Fischer for introducing me to mammalian cell culture. Big thanks to Xiaozhu Zhang, Carissa Tasto, and Alison Killilea for providing me with happy cells.

I would like to thank Arun Desai for his help with the CLIP-seq data analyses.

I acknowledge Malina Desai and Andrew Sharo for help with proofreading the dissertation.

My training has been supported in large part by the NIH fellowship grant F31 GM 108462. My training was also supported by the UCB Molecular Biophysics Training grant. I also received support from the Kosciuszko Foundation Scholarship.

Chapter 1 Network of Splice Factor Regulation by Unproductive Splicing

Abstract

A large fraction of eukaryotic genes are alternatively spliced, and much of this splicing introduces premature termination codons that lead to nonsense-mediated mRNA decay (NMD). Regulation of gene expression through alternative splicing coupled to NMD has been reported for many splicing factors. Regulation via alternative splicing coupled to NMD requires binding of a splicing factor to the regulated mRNA. CLIP-seq and related studies reveal that splicing factors bind abundantly to all transcripts of our selected 100 splicing factors. I characterized the network of protein-RNA interactions between splicing factors. I find that splicing factors form a highly connected network, where 30-60% of all possible interactions between splicing factors and the transcripts encoding splicing factors are observed. Dr. Hu compared the hierarchy of splicing factors to the hierarchy of transcription factors. Dr. Hu calculated the hierarchies of transcription and splicing factors from ENCODE ChIP-seq and eCLIP data, using the hierarchy height metric presented in Gerstein et al. 2012 1. Our limited data show that the hierarchy among splicing regulators is different from that of transcription factors. The networks are plotted in 3 layers, with the top “executive” layer, the bottom under-regulation layer, and a middle layer in between. Unlike transcription factors, splicing factors form a hierarchical network that has a nearly uniform distribution with fewer factors acting at the extremes of the hierarchy. Nearly all splicing factors that bind their own transcripts group in the middle layer.

Introduction

Splicing is regulated by RNA binding proteins, including arginine/serine-rich (SR) regulators, heterogeneous nuclear ribonucleoproteins (hnRNPs), tissue-specific regulators, and other splicing factors. Splicing factors affect virtually all human multi-exon genes, which undergo alternative splicing 2,3 that considerably increases proteome diversity 4,5,6. Since splicing factors modulate constitutive and alternative splicing, they dynamically control the transcriptome during growth and development, in a cell-type and tissue-specific manner 7,8. For example, PTBP2 is essential for neuronal maturation 9,10, MBNL3 antagonizes muscle differentiation 11, and NOVA2 is a key regulator of angiogenesis 12. Splicing factors maintain homeostasis, but when unbalanced can cause disease 13,14,15,16. For example, mutations in NOVA or TARDBP result in severe neuronal pathogeneses 17,18,19,20.

Gene expression is regulated at multiple stages, from transcriptional initiation, through RNA processing, to post-translational modifications of the final protein product. Each stage of gene regulation is controlled by regulatory proteins. At the DNA level, transcription factors bind cis-regulatory elements to modulate gene expression 21,22,23,24, and this regulation is often achieved in a combinatorial fashion through regulatory networks. The architecture of these networks is hierarchical. Gerstein et al. 2012 1 employ the hierarchy height metric to measure the flow of information in a transcription factor network. They show that, using the hierarchy height metric, the transcription factor network can be separated into three levels: top-level ‘executive’ transcription factors that regulate many other factors, middle-level transcription factors that co-regulate targets to mitigate information-flow bottlenecks, and bottom-level ‘foreman’ factors that are mostly regulated. Gene regulation at the RNA level is controlled by splicing factors through binding to RNA cis-regulatory elements. It has been proposed that splicing factors also form hierarchical networks that are controlled by master splicing regulators, which are positioned at the top of a splicing cascade 25. However, this definition also requires that the master splicing regulator be obligatory for the proper differentiation or specification of a cell type and that, once a cell is committed to a lineage, the master splicing regulator be required for maintaining homeostasis 25.
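The hierarchy height metric can be illustrated with a short sketch. It assumes the definition h = (O - I) / (O + I), where O and I are the out-degree and in-degree of a factor in the regulatory network; this is my reading of the metric and should be checked against Gerstein et al. The function name `hierarchy_height`, the toy edge list, and the choice to ignore self-loops are hypothetical and are not taken from the analysis in this chapter.

```python
from collections import Counter


def hierarchy_height(edges):
    """Return {factor: h} with h = (out - in) / (out + in).

    `edges` is an iterable of (regulator, target) pairs.
    Self-loops are ignored here; that is an assumption of this sketch.
    """
    out_deg, in_deg = Counter(), Counter()
    for regulator, target in edges:
        if regulator == target:
            continue
        out_deg[regulator] += 1
        in_deg[target] += 1
    factors = set(out_deg) | set(in_deg)
    return {
        f: (out_deg[f] - in_deg[f]) / (out_deg[f] + in_deg[f])
        for f in factors
    }


# Toy network: A only regulates, D is only regulated, B and C do both.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"),
             ("C", "B"), ("B", "D"), ("C", "D")]
heights = hierarchy_height(toy_edges)
for factor in sorted(heights):
    print(factor, heights[factor])
```

Under this definition, a factor that only regulates others has h = 1 (top layer), a factor that is only regulated has h = -1 (bottom layer), and factors that both regulate and are regulated fall in between.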

Regulated unproductive splicing and translation (RUST) is a mechanism of gene expression regulation through alternative splicing coupled to NMD. Unproductive alternative splicing introduces a premature termination codon into a transcript, and the transcript is consequently degraded by the nonsense-mediated mRNA decay (NMD) pathway. Concurrently, the productive transcript decreases in abundance, resulting in reduction of the active gene product 26,27,28. Here, we stress a subtle difference between RUST and alternative splicing coupled to NMD (AS-NMD): RUST is a regulated process, whereas AS-NMD can occur in a constitutive manner. Many splicing factors are known to auto-regulate their own expression by binding their own pre-mRNA and promoting splicing that leads to expression of the unproductive transcript isoform 29,30,31,32,33,34. This negative feedback loop keeps the expression of many splicing factors tightly controlled, ensuring that gene expression does not exceed desired levels. Splicing factors are also known to control the expression of other splicing factors through RUST. For example, PTBP1 (PTB) is expressed in non-neuronal tissues and normally acts to repress the expression of the productive transcript isoform of PTBP2 (nPTB). This ensures that the neuronal splicing program initiated by PTBP2 is not activated in non-neuronal cells. In neuronal cells, PTBP1 is repressed through the actions of a microRNA 10, leading to elevated levels of the productive isoform of PTBP2 and of the neuronal splicing events it is responsible for.

To characterize the extent of the splicing factor regulation via RUST, to assess the architecture of the network, and to identify any potential master regulators, I collected known cross-regulatory events through RUST between splicing factors (Table 1.1). Based on reported RUST regulations among splicing factors, I assembled the experimentally proven RUST subnetwork. The RUST-based network reveals regulation between most examined factors, based mainly on small-scale studies. For regulation to occur, splicing factors are required to interact with mRNAs. To study the extent of splicing factor-mRNA (SF-mRNA) interactions, Arun Desai and I collected crosslinking-immunoprecipitation (CLIP) based studies, such as CLIP-seq, iCLIP, PAR-CLIP, and eCLIP. All CLIP protocols involve in vivo, covalent protein-RNA cross-linking followed by immunoprecipitation with antibodies against the protein of interest. RNA is then purified by proteinase digestion, reverse-transcribed into cDNA, and sequenced. CLIP-seq is the most common technique, with relatively easy sample preparation (as reviewed in 35). One advantage of CLIP-seq is that it reveals the location where the RNA binding protein interacts with RNA. The drawback of the CLIP-seq experiment, which might affect our study, is the low efficiency of UV-mediated crosslinking, which ranges between 1% and 5%. As a consequence, some interactions between a protein and its target RNA might not be captured, potentially leading to a less connected SF-mRNA interaction network.

PAR-CLIP (photoactivatable-ribonucleoside-enhanced crosslinking and immunoprecipitation) is based on incorporation of photoreactive ribonucleoside analogs into the RNA of living cells 36. These analogs cross-link to interacting proteins under UV light, marking the protein-RNA interaction site with mutations. These mutations are easily identified against reference sequences, providing single-nucleotide resolution. While PAR-CLIP has higher resolution and an improved signal-to-noise ratio when compared to CLIP-seq, it is restricted mainly to cell culture. Because of this restriction, PAR-CLIP might not reveal physiological protein-RNA interactions when highly aberrant cells are used. Our study does not make use of the single-nucleotide resolution; however, we aim to incorporate physiological targets of RNA binding proteins into the SF-RNA interaction network. Use of highly aberrant cell lines such as HeLa and HEK293 might identify SF-RNA interactions that are not present physiologically, leading to a network that might not exist in the physiological setting.

iCLIP (individual-nucleotide resolution CLIP) uses a 3’ exonuclease to degrade protein-bound RNA. This exonuclease digests the isolated RNA and stops at the cross-linked protein. An adapter is then ligated to this position. The RNA is reverse transcribed and sequenced, revealing the exact protein binding site in the RNA 37. iCLIP, just like CLIP-seq, can be used in most experimental systems and is not limited to cell culture. Researchers choose iCLIP when resolution is important but PAR-CLIP is not possible.

eCLIP (enhanced CLIP) incorporates modifications of the iCLIP method such as improvements in library preparation of RNA fragments, efficient 2-step adapter ligation which helps in eliminating PCR duplicates, and paired size-matched input control (SMInput) which helps eliminate background noise 38. eCLIP has been employed by the ENCODE consortium to assay 161 RNA binding proteins. All experiments have been performed in the laboratory of Professor Gene Yeo, using the same protocol and just two cell lines, either HepG2 or K562. This set of experiments provides the most uniform collection of SF-RNA interactions currently available.

Here, we present a literature-based, experimentally validated RUST network, which appears underexplored. We also present splicing factor-mRNA interaction networks based on eCLIP data. These networks contain approximately 30% of all possible interactions. We combined the eCLIP data with other publicly available CLIP-seq and related studies to determine the extent of interactions among splicing factors and their transcripts. This extensive network contains 30% of all possible interactions, hinting that regulation among splicing factors is likely prevalent. We used the eCLIP data to compare the hierarchy of the splicing-factor interaction network to the hierarchy observed for transcription factors. The limited data show that the hierarchy among splicing regulators is different from that of transcription factors. In contrast to transcription factors, splicing factors form a hierarchical network that has a nearly uniform distribution with fewer factors acting at the extremes of the hierarchy. Nearly all splicing factors that bind their own transcripts group in the middle layer.

Table 1.1 RUST regulations collected from literature

Source (splicing factor) | Transcript tested for NMD | Did the factor undergo CLIP-seq? | Regulation type | Is this auto-regulation? | Organism | Network symbol | Reference | Is interaction observed by CLIP? | Is genome-wide interaction correlated with regulation?
FUS | FUS | yes | repressing | yes | human | ^ | Zhou et al. PLoS Genet. Oct;9(10):e1003895 (2013) | yes | yes
RBM10 | RBM10 | yes | repressing | yes | human | ^ | Sun et al. Nucleic Acids Res. Jun 6. doi: 10.1093/nar/gkx508. [Epub ahead of print] (2017) | yes | yes
SRSF3 | SRSF3 | yes | repressing | yes | mouse | ^ | Jumaa et al. EMBO J. Aug 15;16(16):5077-85 (1997) | yes | yes
SRSF3 | SRSF3 | yes | repressing | yes | mouse | ^ | Änkö et al. Genome Biol. 13(3):R17 (2012) | yes | yes
SRSF2 | SRSF2 | yes | repressing | yes | human | ^ | Sureau et al. EMBO J. Apr 2;20(7):1785-96 (2001) | yes | yes
SRSF1 | SRSF1 | yes | repressing | yes | human | ^ | Sun et al. Nat Struct Mol Biol. Mar;17(3):306-12 (2010) | yes | yes
PTBP2 | PTBP2 | yes | repressing | yes | mouse | ^ | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) | yes | yes
HNRNPL | HNRNPL | yes | repressing | yes | human | ^ | Rossbach et al. Mol Cell Biol. Mar;29(6):1442-51 (2009) | yes | yes
HNRNPA2B1 | HNRNPA2B1 | yes | repressing | yes | human | ^ | McGlincy et al. BMC Genomics. Oct 14;11:565 (2010) | yes | yes
TIA1 | TIA1 | yes | repressing | yes | mouse | ^ | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) | yes | yes
RBM10 | RBM5 | yes | repressing | no | human | ^ | Sun et al. Nucleic Acids Res. Jun 6. doi: 10.1093/nar/gkx508. [Epub ahead of print] (2017) | yes | yes
SRSF3 | SRSF5 | yes | repressing | no | mouse | ^ | Änkö et al. Genome Biol. 13(3):R17 (2012) | yes | yes
SRSF3 | SRSF2 | yes | repressing | no | mouse | ^ | Änkö et al. Genome Biol. 13(3):R17 (2012) | yes | yes
SRSF3 | SRSF7 | yes | repressing | no | mouse | ^ | Änkö et al. Genome Biol. 13(3):R17 (2012) | yes | yes
RBFOX2 | SF1 | yes | repressing | no | mouse | ^ | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) | yes | yes
RBFOX2 | TRA2A | yes | repressing | no | mouse | ^ | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) | yes | yes
PTBP1 | PTBP2 | yes | repressing | no | human | ^ | Spellman et al. Mol Cell. Aug 3;27(3):420-34 (2007) | yes | yes
KHDRBS1 | SRSF1 | yes | repressing | no | human | ^ | Valacca et al. J Cell Biol. Oct 4;191(1):87-99 (2010) | yes | yes
HNRNPL | HNRNPLL | yes | repressing | no | human | ^ | Rossbach et al. Mol Cell Biol. Mar;29(6):1442-51 (2009) | yes | yes
HNRNPA1 | HNRNPA2B1 | yes | repressing | no | human | ^ | Bonomi et al. Nucleic Acids Res. Oct;41(18):8665-79 (2013) | yes | yes
ELAVL1 | PTBP2 | yes | repressing | no | human | ^ | Lebedeva et al. Mol. Cell 43, 340-352 (2011) | yes | yes
SRSF9 | SRSF1 | yes | no regulation observed | no | human |  | Sun et al. Nat Struct Mol Biol. Mar;17(3):306-12 (2010) | no | yes
SRSF4 | SRSF7 | yes | no regulation observed | no | mouse |  | Änkö et al. Genome Biol. 13(3):R17 (2012) | no | yes
SRSF4 | SRSF5 | yes | no regulation observed | no | mouse |  | Änkö et al. Genome Biol. 13(3):R17 (2012) | no | yes
SRSF4 | SRSF3 | yes | no regulation observed | no | mouse |  | Änkö et al. Genome Biol. 13(3):R17 (2012) | no | yes
SRSF4 | SRSF2 | yes | no regulation observed | no | mouse |  | Änkö et al. Genome Biol. 13(3):R17 (2012) | no | yes
SRSF4 | SRSF1 | yes | no regulation observed | no | human |  | Sun et al. Nat Struct Mol Biol. Mar;17(3):306-12 (2010) | no | yes
SRSF1 | SRSF3 | yes | activating | no | mouse | ↑ | Jumaa et al. EMBO J. Aug 15;16(16):5077-85 (1997) | yes | yes
RBFOX2 | TIA1 | yes | activating | no | mouse | ↑ | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) | yes | yes
RBFOX2 | PTBP2 | yes | activating | no | mouse | ↑ | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) | yes | yes
RBFOX2 | SRSF7 | yes | activating | no | mouse | ↑ | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) | yes | yes
RBFOX2 | SNRNP70 | yes | activating | no | mouse | ↑ | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) | yes | yes
PTBP2 | SRSF3 | yes | activating | no | human | ↑ | Guo et al. Sci Rep. Sep 29;5:14548 (2015) | yes | yes
FUS | SNRNP70 | yes | activating | no | mouse | ↑ | Nakaya et al. RNA. Apr;19(4):498-509 (2013) | yes | yes
PTBP1 | PTBP1 | yes | repressing | yes | human | ^ | Spellman et al. Mol Cell. Aug 3;27(3):420-34 (2007) | no | no
PTBP1 | PTBP1 | yes | repressing | yes | human | ^ | Wollerton et al. Mol Cell. Jan 16;13(1):91-100 (2004) | no | no
HNRNPL | SRSF3 | yes | repressing | no | human | ^ | Jia et al. Sci Rep. Nov 3;6:35976 (2016) | no | no
SRSF2 | SRSF1 | yes | no regulation observed | no | human |  | Sun et al. Nat Struct Mol Biol. Mar;17(3):306-12 (2010) | yes | no
SRSF1 | SRSF2 | yes | no regulation observed | no | human |  | Sureau et al. EMBO J. Apr 2;20(7):1785-96 (2001) | yes | no
PTBP1 | SRSF3 | yes | activating | no | human | ↑ | Guo et al. Sci Rep. Sep 29;5:14548 (2015) | no | no
TRA2B | TRA2B | NA | repressing | yes | human | ^ | Stoilov et al. Hum Mol Genet. Mar 1;13(5):509-24 (2004) | NA | NA
RBFOX3 | RBFOX2 | NA | repressing | no | mouse | ^ | Dredge et al. PLoS One. ;6(6):e21585 (2011) | NA | NA
SRSF8 | SRSF2 | NA | no regulation observed | no | human |  | Sureau et al. EMBO J. Apr 2;20(7):1785-96 (2001) | NA | NA
SRSF7 | SRSF2 | NA | no regulation observed | no | human |  | Sureau et al. EMBO J. Apr 2;20(7):1785-96 (2001) | NA | yes
SRSF6 | SRSF1 | NA | no regulation observed | no | human |  | Sun et al. Nat Struct Mol Biol. Mar;17(3):306-12 (2010) | NA | NA

Note: the column "All interaction correlated with regulation" also draws on interaction evidence from Cavaloc et al. 1999 and Stoilov et al. 2004.

Results

It has been reported that the expression of many splicing factors is regulated through regulated unproductive splicing and translation (RUST). I collected published regulation between splicing factors through RUST to examine its extent and interconnectedness (Table 1.1). I identified 19 publications reporting 19 splicing factors as targets of RUST in human and murine cells. The following factors are identified as targets of RUST: five SR factors (SRSF1-3, SRSF5, and SRSF7), three hnRNPs (hnRNPA2B1, L, and LL), and eleven other splicing regulators (RBFOX2, SF1, FUS, SNRNP70, TRA2A, TRA2B, TIA1, PTBP1, PTBP2, RBM10, and RBM5) (Figure 1.1). Eleven splicing factors have been shown to auto-regulate their own expression through RUST: SRSF1-3, PTBP1, PTBP2, hnRNPL, hnRNPA2B1, TIA1, TRA2B, FUS, and RBM10. All of the auto-regulatory RUST events are repressive. For example, PTBP1 binds its own pre-mRNA and promotes an alternative splicing change that leads to the exclusion of exon 11, creating a transcript that is consequently degraded by the NMD pathway 33. Fourteen targets of RUST are cross-regulated by other splicing factors: SRSF1-3, SRSF5, SRSF7, hnRNPLL, hnRNPA2B1, PTBP2, RBM5, RBFOX2, TIA1, TRA2A, SF1, and SNRNP70 (Table 1.1, Figure 1.1). Twelve of these targets have been shown to be regulated by a single splicing factor. For example, RBFOX2 has been shown to be regulated by RBFOX3. The caveat is that for the RUST regulation to be observed, the regulatory factor (in this example, RBFOX3) has to be overexpressed many times over its physiological levels, with concurrent inhibition of the NMD pathway 39. Just two targets of RUST have been shown to be regulated by more than one splicing factor: PTBP2 and SRSF3. In both cases, the observed cross-regulatory events include both activating and repressive regulation. For example, hnRNPL has been shown to repress the expression of SRSF3 via RUST 40, whereas SRSF1, PTBP1, and PTBP2 have been shown to activate the expression of SRSF3 41,42. In the activating mode of RUST, a splicing factor binds to a transcript and promotes alternative splicing changes that favor production of the productive transcript. It has been shown that the PTBP2 transcript is activated by RBFOX2 and repressed by PTBP1 and ELAVL1 34,32,43. Overall, out of the 21 cross-regulatory events reported, 8 are activating and 13 are repressive (Table 1.1 and Figure 1.1).

The collected RUST regulations reveal that, when tested, RUST is observed in 76% (34/45) of cases. This number is likely an overestimate. There is a bias towards publications reporting positive connections, and likely a bias towards factors that have important roles in splicing and biological processes. These observed regulations include 13 auto-regulations and 21 cross-regulations (Table 1.1, Figure 1.1). In most cases, RUST regulation has been determined on a small scale. For example, a splicing factor of choice is over-expressed or knocked down, and changes in alternative splicing of a limited number of transcripts are observed via semi-quantitative PCR, usually in an NMD-inhibited background. These non-genome-wide studies reported that SRSF3 represses expression of SRSF2, SRSF5, SRSF7, and itself via RUST 44,41. The RUST-mediated regulatory effects of just one splicing regulator, RBFOX2, have been assessed on a wider scale and have been confirmed for 6 splicing regulators 34. RBFOX2 has been shown to activate TIA1, PTBP2, SRSF7, and SNRNP70, and repress SF1 and TRA2A via RUST.
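The tallies quoted above can be re-derived from the rows of Table 1.1. The following is a minimal sketch: the tuples are my transcription of the source factor, tested transcript, and regulation type columns of the table, and the short outcome labels ("rep", "act", "none") and variable names are chosen here for illustration only.

```python
# Rows of Table 1.1 as (source factor, tested transcript, outcome).
# Outcome labels are shorthand chosen for this sketch:
# "rep" = repressing, "act" = activating, "none" = no regulation observed.
ROWS = [
    ("FUS", "FUS", "rep"), ("RBM10", "RBM10", "rep"), ("SRSF3", "SRSF3", "rep"),
    ("SRSF3", "SRSF3", "rep"), ("SRSF2", "SRSF2", "rep"), ("SRSF1", "SRSF1", "rep"),
    ("PTBP2", "PTBP2", "rep"), ("HNRNPL", "HNRNPL", "rep"), ("HNRNPA2B1", "HNRNPA2B1", "rep"),
    ("TIA1", "TIA1", "rep"), ("RBM10", "RBM5", "rep"), ("SRSF3", "SRSF5", "rep"),
    ("SRSF3", "SRSF2", "rep"), ("SRSF3", "SRSF7", "rep"), ("RBFOX2", "SF1", "rep"),
    ("RBFOX2", "TRA2A", "rep"), ("PTBP1", "PTBP2", "rep"), ("KHDRBS1", "SRSF1", "rep"),
    ("HNRNPL", "HNRNPLL", "rep"), ("HNRNPA1", "HNRNPA2B1", "rep"), ("ELAVL1", "PTBP2", "rep"),
    ("SRSF9", "SRSF1", "none"), ("SRSF4", "SRSF7", "none"), ("SRSF4", "SRSF5", "none"),
    ("SRSF4", "SRSF3", "none"), ("SRSF4", "SRSF2", "none"), ("SRSF4", "SRSF1", "none"),
    ("SRSF1", "SRSF3", "act"), ("RBFOX2", "TIA1", "act"), ("RBFOX2", "PTBP2", "act"),
    ("RBFOX2", "SRSF7", "act"), ("RBFOX2", "SNRNP70", "act"), ("PTBP2", "SRSF3", "act"),
    ("FUS", "SNRNP70", "act"), ("PTBP1", "PTBP1", "rep"), ("PTBP1", "PTBP1", "rep"),
    ("HNRNPL", "SRSF3", "rep"), ("SRSF2", "SRSF1", "none"), ("SRSF1", "SRSF2", "none"),
    ("PTBP1", "SRSF3", "act"), ("TRA2B", "TRA2B", "rep"), ("RBFOX3", "RBFOX2", "rep"),
    ("SRSF8", "SRSF2", "none"), ("SRSF7", "SRSF2", "none"), ("SRSF6", "SRSF1", "none"),
]

observed = [r for r in ROWS if r[2] != "none"]   # RUST was detected
auto = [r for r in observed if r[0] == r[1]]     # factor regulates itself
cross = [r for r in observed if r[0] != r[1]]    # factor regulates another

summary = {
    "tested": len(ROWS),
    "observed": len(observed),
    "percent_observed": round(100 * len(observed) / len(ROWS)),
    "auto_regulations": len(auto),
    "auto_regulated_factors": len({r[0] for r in auto}),
    "cross_regulations": len(cross),
    "cross_activating": sum(r[2] == "act" for r in cross),
    "cross_repressing": sum(r[2] == "rep" for r in cross),
    "rust_targets": len({r[1] for r in observed}),
    "network_nodes": len({factor for r in ROWS for factor in r[:2]}),
}
for key, value in summary.items():
    print(f"{key}: {value}")
```

Counting rows rather than distinct factors explains why 13 auto-regulatory events correspond to 11 auto-regulated factors: SRSF3 and PTBP1 were each reported in two publications.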

Figure 1.1. Experimentally proven RUST network. Each node represents a splicing factor and its transcript. Node color represents a family of splicing regulators as described in the legend. Red font represents a gene that has been shown to express an NMD-targeted transcript. Nodes colored with a darker shade of green represent splicing factors that associate with the spliceosome. Edges represent RUST regulations. A node at the source of the edge represents a splicing factor and a node at the target of the edge represents a target of RUST. Solid line edges represent presence of RUST. Dashed line edges represent RUST regulations that were tested but not observed. Edges ending with a ^ represent gene repression via RUST. For example, PTBP1 (PTB) represses the expression of the productive transcript isoform of PTBP2 (nPTB); this regulation is depicted as an edge that originates at the PTBP1 node and ends with a ^ in the PTBP2 node. Edges that end with an ↑ represent gene activation via RUST. Blue edges stand for RUST observed in humans and purple edges stand for RUST observed in mouse. This network consists of 27 nodes and 45 edges.

Given that a physical interaction between a splicing factor and its target RNA is a prerequisite for regulation, I examined the transcriptome-wide binding patterns of splicing factors to better understand the potential for regulation between these factors. I used the stringent set of interactions generated by eCLIP to reduce the number of false positive interactions and to keep the results consistent, given that these experiments were performed together as part of ENCODE 38,45,46. I collected 21 and 22 eCLIP experiments performed in HepG2 and K562 cells, respectively, to assess the extent of interactions among splicing factors (Table 1.2, Figure 1.2, Panels A and B). Of the 27 factors studied, 16 were assayed in both cell lines. The confident, stringent eCLIP binding peaks among the 21 and 22 splicing regulators were plotted in two separate network views: one for K562 and one for HepG2 cells. Overall, 31.6% (153 out of 484) of possible interactions are present in the K562 eCLIP network and 30.6% (135 out of 441) of possible interactions are present in the HepG2 eCLIP network (Figure 1.2, Panels A and B). On average, a splicing factor in the K562 eCLIP network has 11.5 neighbors out of 21 possible, whereas a splicing factor in the HepG2 eCLIP network has 10.9 neighbors out of 20 possible. There are seven self-loops in the K562 network and six self-loops in the HepG2 network, suggesting auto-regulation in roughly a third of these splicing factors. This high number of interactions indicates that RUST might be more prevalent than the experimentally confirmed regulations suggest.
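The percentages quoted for the two eCLIP networks are consistent with a directed network density in which self-loops count as possible edges, so a network of n factors has n × n possible interactions. The following is a minimal sketch of that arithmetic under this assumption; the function name `network_density` is illustrative and is not taken from the original analysis.

```python
def network_density(n_edges: int, n_nodes: int, allow_self_loops: bool = True) -> float:
    """Fraction of possible directed edges that are present in a network."""
    if allow_self_loops:
        possible = n_nodes * n_nodes          # every node may also bind its own transcript
    else:
        possible = n_nodes * (n_nodes - 1)
    return n_edges / possible


# Edge and node counts as quoted in the text for the two eCLIP networks.
k562_density = network_density(153, 22)    # 153 of 484 possible interactions
hepg2_density = network_density(135, 21)   # 135 of 441 possible interactions

print(f"K562:  {k562_density:.1%}")
print(f"HepG2: {hepg2_density:.1%}")
```

Counting self-loops as possible edges is what makes the denominators 484 and 441 rather than 462 and 420.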

Table 1.2. Literature collected eCLIP studies

| Splicing factor | Type of experiment | Organism (h-human, m-murine) | Biological material | Total number of CLIP-clusters | Number of genes bound (CLIP-clusters with annotated gene name) | Number of CLIP-clusters in the 100 SFs of interest | Percent of CLIP-clusters binding to SFs of interest | Number of SFs of interest bound (out of 100) | Percent of genes bound that are in the 100 SF gene list | Reference |
|---|---|---|---|---|---|---|---|---|---|---|
| FMR1 | eCLIP | h | K562 | 3743 | 1358 | 66 | 1.8% | 26 | 1.9% | ENCODE, Gene Yeo, UCSD |
| HNRNPUL1 | eCLIP | h | K562 | 150 | 66 | 12 | 8.0% | 1 | 1.5% | ENCODE, Gene Yeo, UCSD |
| KHSRP | eCLIP | h | K562 | 5023 | 1571 | 222 | 4.4% | 34 | 2.2% | ENCODE, Gene Yeo, UCSD |
| U2AF1 | eCLIP | h | K562 | 1119 | 701 | 66 | 5.9% | 31 | 4.4% | ENCODE, Gene Yeo, UCSD |
| DDX42 | eCLIP | h | K562 | 164 | 121 | 13 | 7.9% | 6 | 5.0% | ENCODE, Gene Yeo, UCSD |
| TRA2A | eCLIP | h | K562 | 1347 | 408 | 107 | 7.9% | 15 | 3.7% | ENCODE, Gene Yeo, UCSD |
| HNRNPA1 | eCLIP | h | K562 | 46 | 39 | 1 | 2.2% | 1 | 2.6% | ENCODE, Gene Yeo, UCSD |
| TARDBP | eCLIP | h | K562 | 3793 | 1694 | 67 | 1.8% | 24 | 1.4% | ENCODE, Gene Yeo, UCSD |
| HNRNPK | eCLIP | h | K562 | 1150 | 606 | 39 | 3.4% | 11 | 1.8% | ENCODE, Gene Yeo, UCSD |
| QKI | eCLIP | h | K562 | 1982 | 1226 | 45 | 2.3% | 22 | 1.8% | ENCODE, Gene Yeo, UCSD |
| PTBP1 | eCLIP | h | K562 | 3838 | 1865 | 53 | 1.4% | 14 | 0.8% | ENCODE, Gene Yeo, UCSD |
| HNRNPM | eCLIP | h | K562 | 3533 | 1108 | 89 | 2.5% | 20 | 1.8% | ENCODE, Gene Yeo, UCSD |
| DDX3X | eCLIP | h | K562 | 8422 | 4379 | 111 | 1.3% | 48 | 1.1% | ENCODE, Gene Yeo, UCSD |
| KHDRBS1 | eCLIP | h | K562 | 200 | 128 | 14 | 7.0% | 9 | 7.0% | ENCODE, Gene Yeo, UCSD |
| SRSF1 | eCLIP | h | K562 | 1167 | 601 | 55 | 4.7% | 26 | 4.3% | ENCODE, Gene Yeo, UCSD |
| ZRANB2 | eCLIP | h | K562 | 1381 | 733 | 85 | 6.2% | 30 | 4.1% | ENCODE, Gene Yeo, UCSD |
| SRSF7 | eCLIP | h | K562 | 426 | 241 | 25 | 5.9% | 17 | 7.1% | ENCODE, Gene Yeo, UCSD |
| U2AF2 | eCLIP | h | K562 | 2859 | 1384 | 158 | 5.5% | 45 | 3.3% | ENCODE, Gene Yeo, UCSD |
| SF3B4 | eCLIP | h | K562 | 7663 | 3023 | 211 | 2.8% | 47 | 1.6% | ENCODE, Gene Yeo, UCSD |
| HNRNPU | eCLIP | h | K562 | 75 | 45 | 1 | 1.3% | 1 | 2.2% | ENCODE, Gene Yeo, UCSD |
| TIA1 | eCLIP | h | K562 | 2988 | 1546 | 113 | 3.8% | 40 | 2.6% | ENCODE, Gene Yeo, UCSD |
| RBFOX2 | eCLIP | h | K562 | 1316 | 678 | 49 | 3.7% | 19 | 2.8% | ENCODE, Gene Yeo, UCSD |
| HNRNPUL1 | eCLIP | h | HepG2 | 300 | 49 | 12 | 4.0% | 2 | 4.1% | ENCODE, Gene Yeo, UCSD |
| U2AF1 | eCLIP | h | HepG2 | 786 | 403 | 38 | 4.8% | 19 | 4.7% | ENCODE, Gene Yeo, UCSD |
| TIA1 | eCLIP | h | HepG2 | 1907 | 926 | 95 | 5.0% | 34 | 3.7% | ENCODE, Gene Yeo, UCSD |
| TRA2A | eCLIP | h | HepG2 | 749 | 374 | 55 | 7.3% | 20 | 5.3% | ENCODE, Gene Yeo, UCSD |
| HNRNPA1 | eCLIP | h | HepG2 | 40 | 19 | 2 | 5.0% | 2 | 10.5% | ENCODE, Gene Yeo, UCSD |
| PCBP2 | eCLIP | h | HepG2 | 8284 | 3334 | 158 | 1.9% | 34 | 1.0% | ENCODE, Gene Yeo, UCSD |
| SRSF9 | eCLIP | h | HepG2 | 546 | 315 | 17 | 3.1% | 8 | 2.5% | ENCODE, Gene Yeo, UCSD |
| SF3A3 | eCLIP | h | HepG2 | 4019 | 2106 | 107 | 2.7% | 38 | 1.8% | ENCODE, Gene Yeo, UCSD |
| HNRNPC | eCLIP | h | HepG2 | 1507 | 803 | 45 | 3.0% | 22 | 2.7% | ENCODE, Gene Yeo, UCSD |
| RBFOX2 | eCLIP | h | HepG2 | 3832 | 1572 | 78 | 2.0% | 26 | 1.7% | ENCODE, Gene Yeo, UCSD |
| HNRNPK | eCLIP | h | HepG2 | 2282 | 1089 | 48 | 2.1% | 15 | 1.4% | ENCODE, Gene Yeo, UCSD |
| QKI | eCLIP | h | HepG2 | 4404 | 1896 | 78 | 1.8% | 26 | 1.4% | ENCODE, Gene Yeo, UCSD |
| PTBP1 | eCLIP | h | HepG2 | 4163 | 1951 | 54 | 1.3% | 14 | 0.7% | ENCODE, Gene Yeo, UCSD |
| HNRNPM | eCLIP | h | HepG2 | 4295 | 1316 | 55 | 1.3% | 19 | 1.4% | ENCODE, Gene Yeo, UCSD |
| DDX3X | eCLIP | h | HepG2 | 9644 | 4183 | 79 | 0.8% | 38 | 0.9% | ENCODE, Gene Yeo, UCSD |
| SFPQ | eCLIP | h | HepG2 | 295 | 106 | 1 | 0.3% | 1 | 0.9% | ENCODE, Gene Yeo, UCSD |
| SRSF1 | eCLIP | h | HepG2 | 1181 | 652 | 71 | 6.0% | 26 | 4.0% | ENCODE, Gene Yeo, UCSD |
| SRSF7 | eCLIP | h | HepG2 | 506 | 261 | 16 | 3.2% | 11 | 4.2% | ENCODE, Gene Yeo, UCSD |
| U2AF2 | eCLIP | h | HepG2 | 5724 | 2517 | 146 | 2.6% | 38 | 1.5% | ENCODE, Gene Yeo, UCSD |
| SF3B4 | eCLIP | h | HepG2 | 5076 | 2346 | 131 | 2.6% | 48 | 2.0% | ENCODE, Gene Yeo, UCSD |
| HNRNPU | eCLIP | h | HepG2 | 135 | 35 | 2 | 1.5% | 2 | 5.7% | ENCODE, Gene Yeo, UCSD |


Figure 1.2. Splicing factor-mRNA interaction network. (A-B) eCLIP based network. Description of nodes is the same as in Figure 1.1. Edges represent splicing factor-mRNA interactions. The node at the source of an edge represents a splicing factor that underwent CLIP-seq, and the node at the target of the edge, which ends with a dot, represents the mRNA with which the splicing factor interacts. For ease of visualization, multiple eCLIP binding peaks located in a given transcript are portrayed as a single edge between a splicing factor and its mRNA target. Self-edges are red. The K562 based network contains 22 nodes and 152 interactions; the HepG2 based network contains 21 nodes and 135 interactions. All networks are arranged in a degree sorted circle layout. The clustering coefficient of the K562 network is 0.46 and of the HepG2 network is 0.41. (C) The all-CLIP based network contains 44 nodes and 1153 edges, and its clustering coefficient is 0.67.

There is large variation in the number of CLIP binding peaks identified for each factor (Table 1.2). For example, genome-wide eCLIP of hnRNPA1 identified only 46 interactions in K562 cells, corresponding to 39 genes, of which only 1 is a splicing factor, and 40 interactions in HepG2 cells, corresponding to 19 genes, of which only 2 are splicing factors. While eCLIP has been shown to have high specificity 38, it may have low sensitivity. For example, eCLIP of hnRNPA1 did not capture its known interaction with its own transcript 47,48,49,50. The highest number of eCLIP binding peaks was identified for DDX3X: 8422 interactions in K562 cells, corresponding to 4379 genes, of which 48 are splicing factors, and 9644 interactions in HepG2 cells, corresponding to 4183 genes, of which 38 are splicing factors. For many assayed splicing factors, such as hnRNPA1, DDX3X, hnRNPM, and PTBP1, eCLIP yielded a similar number of binding peaks in the two cell lines (Table 1.2). For several factors, eCLIP identified roughly twice as many factor-mRNA interactions, or more, in HepG2 cells than in K562 cells. These factors include hnRNPK, hnRNPU, hnRNPUL1, QKI, RBFOX2, and U2AF2 (Table 1.2). In a few instances, splicing factors such as SF3B4, TIA1, TRA2A, and U2AF1 exhibit substantially fewer SF-RNA interactions in HepG2 cells than in K562 cells (Table 1.2). These observations indicate that many splicing factors have cell line specific targets and, as a consequence, might regulate different splicing factors in different cell lines.

To expand the potential extent of RUST regulation among splicing factors beyond eCLIP, I gathered confident splicing factor–mRNA interactions, as provided in the original publications, from an additional 46 CLIP experiments representing 34 splicing factors. These studies include 27 CLIP-seq, 11 iCLIP, and 8 PAR-CLIP experiments (for a detailed list see Table 1.3). Of these additional 46 studies, 31 were performed in human cells or tissues and 15 were conducted in mouse (Table 1.3). These SF-mRNA interactions were merged with the eCLIP data. The resulting splicing factor-mRNA interaction network (Figure 1.2, Panel C) contains 44 nodes and 1153 edges. The edges comprise nearly 60% of all possible interactions in the network (1153 present out of 1936 possible). Of these interactions, 29 are self-loops, which might be auto-regulatory. In this network, each node has on average 35.7 neighbors, and the clustering coefficient is 0.67. This highly connected network is twice as dense as the eCLIP-derived networks, further hinting that RUST might be prevalent. The all-CLIP network comes with caveats: the number of identified CLIP clusters varies widely even between experiments that cross-linked the same splicing factor to RNA (see Table 1.3 for details), the data analysis differs between studies, and by combining experiments from different cell lines and tissues, as well as murine and human interactions, we likely introduced many false positive SF-mRNA interactions. This network therefore represents a new upper limit on the amount of cross-regulation between splicing factors and their transcripts.
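The construction of the merged network can be sketched as follows: each CLIP peak is reduced to a (splicing factor, bound gene) pair, pairs whose target is not one of the splicing factors in the network are dropped, multiple peaks on the same transcript collapse into a single edge, and the edge sets from different experiments are united. This is a minimal sketch under those assumptions; the function names and the toy peaks are invented for illustration and do not reproduce the actual data.

```python
def collapse_peaks(peaks, factors):
    """Turn (factor, bound_gene) peak records into a set of directed SF -> SF edges.

    Peaks on genes outside the factor set are ignored, and repeated peaks on
    the same transcript collapse into one edge because edges live in a set.
    """
    return {(sf, gene) for sf, gene in peaks if sf in factors and gene in factors}


def merge_networks(*edge_sets):
    """Union of several edge sets (e.g. eCLIP plus other CLIP experiments)."""
    merged = set()
    for edges in edge_sets:
        merged |= edges
    return merged


def self_loops(edges):
    """Edges in which a factor binds its own transcript."""
    return {edge for edge in edges if edge[0] == edge[1]}


# Toy example with made-up peaks (not real data).
factors = {"SRSF1", "PTBP1", "TIA1"}
eclip_peaks = [("SRSF1", "TIA1"), ("SRSF1", "TIA1"), ("TIA1", "TIA1"), ("PTBP1", "ACTB")]
other_clip_peaks = [("PTBP1", "SRSF1"), ("SRSF1", "TIA1")]

network = merge_networks(
    collapse_peaks(eclip_peaks, factors),
    collapse_peaks(other_clip_peaks, factors),
)
print(sorted(network))
print("self-loops:", sorted(self_loops(network)))
print("density:", len(network) / len(factors) ** 2)
```

Representing edges as a set makes both the peak collapsing and the merge across experiments idempotent, which matters when the same interaction is reported by several studies.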

Table 1.3. Literature collected CLIP-seq and related studies

| Splicing factor | Type of experiment | Organism (h-human, m-murine) | Biological material | Total number of CLIP-clusters | Number of genes bound (CLIP-clusters annotated with gene name) | Total number of CLIP-clusters in the 100 SFs of interest | Percent of CLIP-clusters binding to SFs of interest | Number of SFs of interest bound (out of 100) | Percent of genes bound that are in the 100 SF gene list | Reference |
|---|---|---|---|---|---|---|---|---|---|---|
| SRSF1 | CLIP-seq | h | HEK293T | 23630 | 9185 | 437 | 1.8% | 75 | 0.8% | Sanford et al. Genome research 19, 381-394 (2009) |
| HNRNP A1 | iCLIP | h | HeLa | 40670 | 6784 | 830 | 2.0% | 72 | 1.1% | Bruun et al. BMC Biol. Jul 5;14:54 (2016) |
| HNRNP A1 | CLIP-seq | h | HEK293T | 2043 | 1265 | 110 | 5.4% | 46 | 3.6% | Huelga et al. Cell reports 1, 167-178 (2012) |
| HNRNP A2/B1 | CLIP-seq | h | HEK293T | 10691 | 4679 | 204 | 1.9% | 66 | 1.4% | Huelga et al. Cell reports 1, 167-178 (2012) |
| HNRNP C | iCLIP | h | HeLa | 438361 | 12435 | 5355 | 1.2% | 85 | 0.7% | Zarnack et al. Cell 152, 453-466 (2013) |
| HNRNP C rep1 | iCLIP | h | HeLa | 9627 | 3685 | 168 | 1.7% | 55 | 1.5% | Konig et al. Nat. Struct. Mol. Biol. 17, 909-915 (2010) |
| HNRNP C rep2 | iCLIP | h | HeLa | 7207 | 3210 | 131 | 1.8% | 55 | 1.7% | Konig et al. Nat. Struct. Mol. Biol. 17, 909-915 (2010) |
| HNRNP C rep3 | iCLIP | h | HeLa | 4031 | 2336 | 47 | 1.2% | 27 | 1.2% | Konig et al. Nat. Struct. Mol. Biol. 17, 909-915 (2010) |
| HNRNP F | CLIP-seq | h | HEK293T | 13618 | 5534 | 242 | 1.8% | 73 | 1.3% | Huelga et al. Cell reports 1, 167-178 (2012) |
| HNRNP H1 | CLIP-seq | h | HEK293T | 30829 | 8877 | 333 | 1.1% | 83 | 0.9% | Huelga et al. Cell reports 1, 167-178 (2012) |
| HNRNP L | CLIP-seq | h | JSL1 Jurkat cells rested | 40957 | 7825 | 505 | 1.2% | 78 | 1.0% | Shankarling et al. Mol Cell Biol. Jan;34(1):71-83 (2014) |
| HNRNP L | CLIP-seq | h | JSL1 Jurkat cells stimulated | 31782 | 7288 | 432 | 1.4% | 63 | 0.9% | Shankarling et al. Mol Cell Biol. Jan;34(1):71-83 (2014) |
| HNRNP L | CLIP-seq | h | primary CD4+ cells rested | 49021 | 11091 | 418 | 0.9% | 77 | 0.7% | Shankarling et al. Mol Cell Biol. Jan;34(1):71-83 (2014) |
| HNRNP L | CLIP-seq | h | primary CD4+ cells stimulated | 46558 | 10850 | 483 | 1.0% | 83 | 0.8% | Shankarling et al. Mol Cell Biol. Jan;34(1):71-83 (2014) |
| HNRNP M | CLIP-seq | h | HEK293T | 3592 | 1227 | 85 | 2.4% | 35 | 2.9% | Huelga et al. Cell Rep. 1, 167-178 (2012) |
| HNRNP U | CLIP-seq | h | HEK293T | 18203 | 6394 | 303 | 1.7% | 74 | 1.2% | Huelga et al. Cell Rep. 1, 167-178 (2012) |
| ELAVL1 | PAR-CLIP | h | HeLa | 32129 | 6875 | 600 | 1.9% | 78 | 1.1% | Lebedeva et al. Mol. Cell 43, 340-352 (2011) |
| FUS | CLIP-seq | h | brains (temporal lobe cortices) | 863 | 863 | 16 | 1.9% | 16 | 1.9% | Nakaya et al. RNA. Apr;19(4):498-509 (2013) |
| FUS | PAR-CLIP | h | HEK293T | 31794 | 6848 | 589 | 1.9% | 73 | 1.1% | Hoell et al. Nat. Struct. Mol. Biol. 18, 1428-1431 (2011) |
| RBFOX2 | CLIP-seq | h | hESC | 3546 | 1420 | 65 | 1.8% | 29 | 2.0% | Yeo et al. Nat. Struct. Mol. Biol. 16, 130-137 (2009) |
| RBM4 | PAR-CLIP | h | U87MG | 4182 | 4182 | 39 | 0.9% | 38 | 0.9% | Uniacke et al. Nature 486, 126-129 (2012) |
| SF1 | CLIP-seq | h | HeLa | 210 | 196 | 5 | 2.4% | 5 | 2.6% | Corioni et al. Nucleic acids research 39, 1868-1879 (2011) |
| TIA1 | iCLIP | h | HeLa | 21884 | 6936 | 558 | 2.5% | 81 | 1.2% | Wang et al. PLoS biology 8, e1000530 (2010) |
| TIAL1 | iCLIP | h | HeLa | 51751 | 9348 | 1180 | 2.3% | 86 | 0.9% | Wang et al. PLoS biology 8, e1000530 (2010) |
| U2AF2 | iCLIP | h | HeLa | 518795 | 13661 | 5232 | 1.0% | 83 | 0.6% | Zarnack et al. Cell 152, 453-466 (2013) |
| QKI | PAR-CLIP | h | HEK293 | 2534 | 1500 | 30 | 1.2% | 21 | 1.4% | Hafner et al. Cell. Apr 2;141(1):129-41 (2010) |
| FMR1 (isoform 1) | PAR-CLIP | h | HEK293 | 121903 | 6893 | 1607 | 1.3% | 81 | 1.2% | Ascano et al. Nature. Dec 20;492(7429):382-6 (2012) |
| FMR1 (isoform 7) | PAR-CLIP | h | HEK293 | 78495 | 8177 | 1056 | 1.3% | 83 | 1.0% | Ascano et al. Nature. Dec 20;492(7429):382-6 (2012) |
| ELAVL1 | PAR-CLIP | h | human cells | 1040 | 173 | 5 | 0.5% | 1 | 0.6% | Kishore et al. Nat Methods. May 15;8(7):559-64 (2011) |
| ELAVL1 | CLIP-Seq | h | human cells | 993 | 168 | 4 | 0.4% | 1 | 0.6% | Kishore et al. Nat Methods. May 15;8(7):559-64 (2011) |
| WTAP | PAR-CLIP | h | 293T cells expressing Flag-WTAP | 1068 | 551 | 11 | 1.0% | 11 | 2.0% | Ping et al. Cell Res. Feb;24(2):177-89 (2014) |
| SRSF1 | CLIP-Seq | m | mouse embryo fibroblasts (MEFs) | 50982 | 7424 | 726 | 1.4% | 77 | 1.0% | Pandit et al. Mol Cell. Apr 25;50(2):223-35 (2013) |
| SRSF2 | CLIP-Seq | m | mouse embryo fibroblasts (MEFs) | 56335 | 7354 | 827 | 1.5% | 82 | 1.1% | Pandit et al. Mol Cell. Apr 25;50(2):223-35 (2013) |
| SRSF3 | iCLIP | m | diploid mouse P19 cells | 2304 | 2304 | 46 | 2.0% | 46 | 2.0% | Änkö et al. Genome Biol. 13(3):R17 (2012) |
| SRSF4 | iCLIP | m | diploid mouse P19 cells | 1055 | 1055 | 14 | 1.3% | 14 | 1.3% | Änkö et al. Genome Biol. 13(3):R17 (2012) |
| MBNL1 | CLIP-Seq | m | brain | 3176 | 1142 | 51 | 1.6% | 14 | 1.2% | Wang et al. Cell. Aug 17;150(4):710-24 (2012) |
| MBNL1 | CLIP-Seq | m | heart | 644 | 380 | 9 | 1.4% | 8 | 2.1% | Wang et al. Cell. Aug 17;150(4):710-24 (2012) |
| MBNL1 | CLIP-Seq | m | muscle | 442 | 265 | 7 | 1.6% | 5 | 1.9% | Wang et al. Cell. Aug 17;150(4):710-24 (2012) |
| MBNL1 | CLIP-Seq | m | myoblasts | 24190 | 5158 | 398 | 1.6% | 67 | 1.3% | Wang et al. Cell. Aug 17;150(4):710-24 (2012) |
| NOVA1 | CLIP-Seq | m | mouse brain | 24482 | 3871 | 276 | 1.1% | 44 | 1.1% | Zhang et al. Science. Jul 23;329(5990):439-43 (2010) |
| PTBP2 | CLIP-Seq | m | mouse brain | 258398 | 11287 | 2195 | 0.8% | 88 | 0.8% | Zagore et al. Mol Cell Biol. Dec;35(23):4030-42 (2015) |
| NOVA2 | CLIP-Seq | m | mouse brain | 120 | 65 | 0 | 0.0% | 0 | 0.0% | Licatalosi et al. Nature. Nov 27;456(7221):464-9 (2008) |
| MBNL2 | CLIP-Seq | m | mouse hippocampi | 5200 | 2179 | 46 | 0.9% | 29 | 1.3% | Charizanis et al. Neuron Aug 9;75(3):437-50 (2012) |
| TARDBP | CLIP-Seq | m | adult mouse brain | 39960 | 5749 | 501 | 1.3% | 62 | 1.1% | Polymenidou et al. Nat Neurosci. Apr;14(4):459-68 (2011) |
| RBFOX2 (FHFOX2) | iCLIP | m | V6.5 ES cells | 35640 | 6990 | 330 | 0.9% | 79 | 1.1% | Jangi et al. Genes Dev. Mar 15;28(6):637-51 (2014) |
| FUS | CLIP-Seq | m | mouse neurons differentiated from embryonic stem cells | 156 | 156 | 4 | 2.6% | 3 | 1.9% | Nakaya et al. RNA. Apr;19(4):498-509 (2013) |

To determine whether transcripts of additional splicing factors, which have not undergone CLIP-seq studies, could be regulated by RUST, I selected an additional 56 alternative splicing regulators (for selection see Materials and Methods) and expanded the CLIP based network to include interactions among all 100 gathered splicing factors. This expanded network reveals that the transcripts of all 100 splicing factors are bound by another splicing regulator. The number of these incoming interactions varies, ranging from 4 for CELF3 to 35 for hnRNPH1 (Figure 1.3). In this network, many splicing factors interact with transcripts of over 80 splicing regulators. These factors include: SRSF2 (82), TIA1 (82), hnRNPH1 (83), SRSF1 (84), U2AF2 (85), TIAL1 (86), RBFOX2 (86), PTBP2 (88), hnRNPC (89), FMR1 (89), and hnRNPL (92). A handful of splicing regulators bind 10 or fewer splicing regulator transcripts. These factors include: SFPQ (1), hnRNPUL1 (2), SF1 (5), DDX42 (6), SRSF9 (8), and KHDRBS1 (9). This extended all-CLIP network contains 2180 interactions, and its clustering coefficient is 0.66. On average, a splicing factor has 36 neighbors, and many splicing factors express NMD-targeted transcripts (56, red font nodes in Figure 1.3). This highly connected network of 100 splicing factors raises the upper limit of possible RUST regulations.

Without more experimental data, it is impossible to determine what fraction of the splicing factor-mRNA interactions represent RUST. However, one can compare the experimentally proven RUST network to the available CLIP binding peaks to assess whether CLIP interactions could be used to infer regulation (Table 1.4). Overall, there is statistically significant agreement between the CLIP interactions and RUST (p = 0.0004, two-tailed Fisher's exact test). RUST is observed in 34 experiments and is supported by a CLIP interaction in 28 of them (82% agreement) (Table 1.4). More importantly, when RUST is not observed (11 instances), the CLIP interaction is absent in 7 (64% agreement), with only 2 false positives in which CLIP detects an interaction but RUST is not confirmed experimentally (Table 1.4). This overlap hints that approximately 60% of the presented CLIP based interactions (Figure 1.3) might be potential regulations via RUST.
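The reported significance can be checked with a small worked example. The sketch below implements a two-tailed Fisher's exact test from the hypergeometric distribution using only the standard library, and applies it to the "yes" and "no" rows of Table 1.4. Excluding the "not tested" row is an assumption made here; the original analysis does not state which cells entered the test. The function name is illustrative.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)   # small tolerance for float ties
    )


# Rows: CLIP yes / CLIP no; columns: RUST observed / RUST not observed (Table 1.4).
p_value = fisher_exact_two_tailed(28, 2, 4, 7)
print(f"p = {p_value:.4f}")
print(f"agreement when RUST observed:     {28 / 34:.0%}")
print(f"agreement when RUST not observed: {7 / 11:.0%}")
```

Under this assumption the test gives a p value of about 0.0004, in line with the value quoted in the text.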

Table 1.4. Overlap of RUST and CLIP

| CLIP | RUST observed | RUST not observed |
|---|---|---|
| yes | 28 | 2 |
| no | 4 | 7 |
| not tested | 2 | 2 |
| total | 34 | 11 |


Figure 1.3. Splicing factor-mRNA interaction network extended to 100 splicing regulators.

This network contains 2180 edges. The clustering coefficient is 0.66. Description of nodes is the same as in Figure 1.1. Description of the edges is the same as in Figure 1.2. Self-edges are gray.

It has been described that transcription factors form hierarchical networks 1,51,52,53. To determine whether splicing factors also form a hierarchical network, Dr. Hu used the same approach as Gerstein et al. (2012) 1, who applied a hierarchy height metric to the published ENCODE ChIP-Seq data for transcription factors. Dr. Hu applied the hierarchy height metric to the ENCODE eCLIP data of our selected splicing factors (Figure 1.4 and Figure 1.5). Based on the balance of incoming and outgoing binding events for a particular factor, each factor is assigned to one of three layers. The top layer holds “executive” regulators, whose hierarchy height falls in the top third of the possible range. The middle layer holds factors with values in the center of this range, and the bottom, “foreman” layer holds factors with values in the lower third of this range. Splicing factors, like transcription factors, can be grouped into all three levels of the hierarchy. However, the hierarchy among splicing regulators differs from the hierarchy among transcription factors. The distribution of transcription factors is sharply peaked at values of 1 and -1 (Figure 1.4 and Figure 1.5), and transcription factors group more distinctly into three layers (three hierarchy height peaks are seen in Figures 1.4 and 1.5), whereas splicing factors are spread uniformly across the entire “hierarchy height” axis. Interestingly, nearly all splicing factors that have auto-loops gather in the middle layer. The only exception is SF3B4, which is grouped in the top layer; however, its “hierarchy height” is near the border with the middle layer. Moreover, in the K562 eCLIP based network, two splicing factors, SRSF1 and PTBP1, are grouped in the middle layer and do not exhibit auto-loops, but their self-binding has been described elsewhere and was not detected by eCLIP.


Figure 1.4. Comparison of hierarchies of the TF-TF network and the SF-SF network in the K562 cell line. (A-B) The distributions of node hierarchies, defined as h = (O - I)/(O + I), where O represents the out-degree and I represents the in-degree, in the TF-TF network (A) and the SF-SF network (B). (C-D) The TF-TF network (C) and the SF-SF network (D) shown in a hierarchical manner. Nodes with h > 1/3 and h < -1/3 are on the top and bottom levels, respectively, with other nodes located in the middle. The grey lines show top-down interactions while red lines show bottom-up regulations. Node colors are as in Figure 1.2.


Figure 1.5. Hierarchies of the TF-TF network in the GM12878 cell line and the SF-SF network in the HepG2 cell line. (A-B) The distributions of node hierarchies, defined as h = (O - I)/(O + I), where O represents the out-degree and I represents the in-degree, in the TF-TF network in the GM12878 cell line (A) and the SF-SF network in the HepG2 cell line (B). (C-D) The TF-TF network in the GM12878 cell line (C) and the SF-SF network in the HepG2 cell line (D) shown in a hierarchical manner. Nodes with h > 1/3 and h < -1/3 are on the top and bottom levels, respectively, with other nodes located in the middle. The grey lines show top-down interactions while red lines show bottom-up regulations. Node colors are as in Figure 1.2.

It has been described in the literature that splicing factors regulate themselves and other splicing regulators. To determine whether RUST regulation might be more widespread among splicing regulators than among all other genes, we compared how frequently splicing factors bind all genes with how frequently they bind genes of other splicing factors (Table 1.5, Figure 1.6). The comparison reveals that, on average, a splicing factor binds transcripts of other splicing regulators five times more frequently than transcripts of all expressed genes (Table 1.5). While all splicing factors follow this trend, the range is wide. For example, SRSF7 and KHDRBS1 bind splicing factor genes 12 times more frequently than all other genes in K562 cells, whereas hnRNPU binds splicing factor genes 11 times more frequently than all other genes in HepG2 cells. hnRNPA1 also binds genes of other splicing factors preferentially over all other genes (17 times more frequently) in HepG2 cells; however, this result is based on a small overall number of eCLIP clusters identified for this particular factor. At the other end of the spectrum, factors such as PTBP1 and DDX3X show only a slight preference for binding splicing factor genes over all other genes. For example, the preference of PTBP1 is about 1.3 in both cell lines (Table 1.5). Interestingly, a combined plot for all regulators that underwent eCLIP (43 studies) reveals a continuum in the preference of splicing factors for binding genes of other splicing factors.

Figure 1.6. All evaluated splicing factors bind transcripts of other splicing factors more prevalently than transcripts of other genes. Plotted is the ratio of the number of splicing factors a protein binds to the number of all expressed splicing factors in a given cell line (as described in the legend), versus the ratio of the number of genes a splicing factor binds to the number of all expressed genes (FPKM > 1) in that cell line. The data are obtained from eCLIP experiments. The size of each circle represents the total number of CLIP clusters obtained by eCLIP. Names of some of the factors are noted on the plot. Details are included in Table 1.5.

Table 1.5. eCLIP data used to plot Figure 1.6.

| Splicing factor | eCLIP cell line | SFs bound (fraction of expressed SFs) | Genes bound (fraction of all expressed genes) | Total number of CLIP-clusters | Ratio: bound to SF / bound to all genes |
|---|---|---|---|---|---|
| SRSF7 | K562 | 0.20 | 0.02 | 426 | 12.13 |
| KHDRBS1 | K562 | 0.11 | 0.01 | 200 | 11.70 |
| DDX42 | K562 | 0.07 | 0.01 | 164 | 8.52 |
| U2AF1 | K562 | 0.37 | 0.05 | 1119 | 7.28 |
| ZRANB2 | K562 | 0.36 | 0.05 | 1381 | 7.19 |
| SRSF1 | K562 | 0.31 | 0.04 | 1167 | 7.12 |
| TRA2A | K562 | 0.18 | 0.03 | 1347 | 6.30 |
| U2AF2 | K562 | 0.54 | 0.11 | 2859 | 4.99 |
| RBFOX2 | K562 | 0.23 | 0.05 | 1316 | 4.73 |
| HNRNPA1 | K562 | 0.01 | 0.00 | 46 | 4.51 |
| TIA1 | K562 | 0.48 | 0.11 | 2988 | 4.28 |
| HNRNPU | K562 | 0.01 | 0.00 | 75 | 3.84 |
| KHSRP | K562 | 0.40 | 0.11 | 5023 | 3.54 |
| HNRNPUL1 | K562 | 0.01 | 0.00 | 150 | 3.49 |
| HNRNPK | K562 | 0.13 | 0.04 | 1150 | 3.21 |
| FMR1 | K562 | 0.31 | 0.10 | 3743 | 3.21 |
| QKI | K562 | 0.26 | 0.08 | 1982 | 3.09 |
| HNRNPM | K562 | 0.24 | 0.08 | 3533 | 3.05 |
| SF3B4 | K562 | 0.56 | 0.21 | 7663 | 2.64 |
| TARDBP | K562 | 0.29 | 0.12 | 3793 | 2.40 |
| DDX3X | K562 | 0.57 | 0.31 | 8422 | 1.83 |
| PTBP1 | K562 | 0.17 | 0.12 | 3838 | 1.35 |
| Average | K562 | | | | 5.02 |
| HNRNPA1 | HepG2 | 0.02 | 0.00 | 40 | 16.99 |
| HNRNPU | HepG2 | 0.02 | 0.00 | 135 | 10.55 |
| TRA2A | HepG2 | 0.24 | 0.03 | 749 | 8.69 |
| U2AF1 | HepG2 | 0.23 | 0.03 | 786 | 7.85 |
| HNRNPUL1 | HepG2 | 0.02 | 0.00 | 300 | 7.11 |
| SRSF7 | HepG2 | 0.13 | 0.02 | 506 | 6.98 |
| SRSF1 | HepG2 | 0.31 | 0.05 | 1181 | 6.51 |
| TIA1 | HepG2 | 0.41 | 0.07 | 1907 | 5.89 |
| HNRNPC | HepG2 | 0.27 | 0.06 | 1507 | 4.58 |
| SRSF9 | HepG2 | 0.10 | 0.02 | 546 | 4.15 |
| SF3B4 | HepG2 | 0.58 | 0.17 | 5076 | 3.37 |
| SF3A3 | HepG2 | 0.46 | 0.15 | 4019 | 3.07 |
| RBFOX2 | HepG2 | 0.31 | 0.11 | 3832 | 2.82 |
| U2AF2 | HepG2 | 0.46 | 0.18 | 5724 | 2.48 |
| HNRNPM | HepG2 | 0.23 | 0.09 | 4295 | 2.46 |
| QKI | HepG2 | 0.31 | 0.13 | 4404 | 2.38 |
| HNRNPK | HepG2 | 0.18 | 0.08 | 2282 | 2.37 |
| PCBP2 | HepG2 | 0.41 | 0.23 | 8284 | 1.77 |
| SFPQ | HepG2 | 0.01 | 0.01 | 295 | 1.59 |
| DDX3X | HepG2 | 0.46 | 0.31 | 9644 | 1.50 |
| PTBP1 | HepG2 | 0.17 | 0.13 | 4163 | 1.26 |
| Average | HepG2 | | | | 4.97 |

Discussion

Regulation of gene expression has been studied extensively at the transcription factor level, whereas regulation of gene expression by splicing factors has been studied to a lesser extent. Individual splicing factors have been shown to have crucial influence over cell fate and development; however, regulatory interactions among splicing factors and the existence of a potential hierarchy among splicing regulators have not been well established. It has emerged that expression of many genes, and in particular of many splicing factor genes, is regulated via RUST (Table 1.1). Here, we collected published regulation between splicing factors through RUST and presented it in the form of a network. This network seems sparse and is biased towards well studied splicing regulators. To regulate alternative splicing, splicing factors must interact with pre-mRNA. Hence, to establish the potential for RUST, we gathered publicly available splicing factor–mRNA interactions and presented them in the form of a splicing factor–mRNA interaction network for 100 selected splicing regulators (Figure 1.3). In line with the literature, this network is highly connected, with a connectivity density of 0.6, suggesting that the potential for RUST is high.

It has been proposed that there are master regulators among splicing factors; however, the term master regulator has been used to mean many things. It has been proposed that a splicing factor master regulator is a protein that maintains a specific cell lineage by acting directly upon genes rather than by regulating other splicing factors. For example, the distinction of a master regulator has been granted to SRSF6 for eye development in Drosophila 54. It has also been proposed that the entire family of serine/arginine rich splicing factors (SR proteins) are master splicing regulators, based on the finding that they are important splicing factors which, when disrupted, lead to developmental defects and disease 55. Other potential splicing factor master regulators have been reviewed in 25. Here, instead of searching for master regulators among splicing factors, we assess the hierarchy among splicing factors using a “hierarchy height” metric, as used by Gerstein et al., 2012 1, and compare it to the hierarchy among transcription factors. We find that the hierarchy among splicing factors differs from that of transcription factors. The “hierarchy height” metric splits factors into three layers based on the balance of incoming and outgoing interactions. Transcription factors fall into these three categories well, with the “executive” top layer factors (for which the share of outgoing interactions is high relative to all interactions) and the “foreman” bottom layer factors (for which the share of incoming interactions is high relative to all interactions). Factors in the middle layer have an approximately equal number of incoming and outgoing interactions. In contrast, our limited data show that splicing factors do not form such sharp layers. Instead, the transition between layers is less pronounced when the ratio of incoming to outgoing edges is considered. In addition, transcription factors with auto-interactions can be found in all three layers, whereas splicing factors with auto-interactions group nearly exclusively in the middle layer.

In the quest for a unified regulatory splicing factor network, we found that splicing factors bind transcripts of other splicing factors, on average, five times as frequently as transcripts of other genes. This finding is in line with published studies which describe abundant splicing factor binding to transcripts of other splicing factors, as reviewed in 56 and 57. Here, we compare a sizeable number of factors (27 that underwent eCLIP in either K562 or HepG2 cells) and find that the trend of splicing factors binding more frequently to transcripts of other splicing regulators applies to all splicing factors in this study (Table 1.5, Figure 1.6).

There are several potential limitations that could undermine our study. While CLIP is a powerful method that provides experimental evidence of interactions between an RNA binding protein and RNA, CLIP experiments suffer from low crosslinking efficiency 58 and might produce both false negative and false positive results. We compared the CLIP interactions to the RUST regulatory network and, surprisingly, found a high overlap with a low frequency of false positives and false negatives (Table 1.4). Therefore, despite its limitations, CLIP-seq is a relatively reliable method for predicting regulation among splicing factors. In addition to deficiencies inherent to the CLIP experiment, our 44 factor CLIP network combines data derived from a mix of tissues and cell lines of human and murine origin (Tables 1.2 and 1.3). As a consequence, the resulting network might be biologically irrelevant, and any derived hierarchy, or the lack of it, would be an artifact. Therefore, we used the 44 factor CLIP network only to estimate the overall extent of interactions and thus the potential for regulation. Not every interaction in it might be correct, but the network gives a sense of the magnitude of regulation among splicing factors.

Materials and Methods

Splicing factor regulatory network

Through an extensive literature search, I collected 45 published experimental results (Table 1.1) that tested whether overexpression or knockdown of a splicing factor changes the ratio of productive to unproductive transcripts of splicing factors through alternative splicing. The unproductive transcripts have been shown experimentally to be targets of the NMD pathway 29,30,39,59,42,40,33,32,60,34,41,44,43,61,31,62,63,64,65.

Selection of 100 alternative splicing factors

The selection of 100 alternative splicing regulators is derived from the NCBI database of genes (www.ncbi.nlm.nih.gov/gene/) and the GeneCards database (www..org). I also searched NCBI PubMed (www.ncbi.nlm.nih.gov/pubmed) abstracts for alternative splicing regulators. As a query, we paired the term “alternative splicing” with each RNA binding protein annotated with GO:0003723 in the Gene Ontology Consortium database (geneontology.org). The full list of selected splicing factors is presented in the form of a splicing factor–RNA interaction network (Figure 1.3).

Splicing factor interaction network

I collected 89 CLIP studies that used a splicing factor as the cross-linked protein; 74 were performed in human cells or tissues and 15 were conducted using murine biological material 66,67,50,68,69,70,43,60,71,72,73,74,75,68,76,77,78,79,80,81,82,83,84,85,86,87,34,88. The results are summarized in Tables 1.2 and 1.3. The splicing factor-RNA binding sites identified in each study were used to construct the networks presented in Figure 1.2 and Figure 1.3. When the original studies provided only genomic coordinates of CLIP-clusters, Arun Desai and I used the appropriate Ensembl GTF annotation files (hg19, GRCh38, mm9) to add the corresponding gene names. The collection of 89 CLIP studies includes 43 eCLIP studies, which were downloaded from the ENCODE website (www.encodeproject.org). For eCLIP data, Arun Desai downloaded two biological replicates (bed narrowPeak) for each splicing factor of interest (for accession codes see Table 1.4). The two replicates were merged. Overlapping eCLIP clusters were collapsed using a Yeo Lab Perl script (https://github.com/YeoLab/gscripts/perl_scripts/compress_l2foldenrpeak.fi.pl). The clusters were then filtered to retain significant clusters, defined as splicing factor-mRNA binding sites that are identified as significant (P < 0.05) by CLIPper and are eightfold enriched above SMInput 38.
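The final filtering step can be illustrated with a short sketch. It assumes that each narrowPeak line is tab-separated, that column 7 holds the log2 fold enrichment over SMInput, and that column 8 holds the -log10 p value, which is how ENCODE eCLIP peak files are commonly described; these column meanings should be checked against the files actually downloaded. The thresholds follow the text (P < 0.05 and eightfold enrichment). The function names and the two example lines are invented for illustration.

```python
import math

LOG2_FOLD_MIN = math.log2(8)          # eightfold enrichment over SMInput
NEG_LOG10_P_MIN = -math.log10(0.05)   # P < 0.05 expressed on the -log10 scale


def parse_peak(line: str) -> dict:
    """Parse one tab-separated narrowPeak line (column meanings are assumed)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "strand": fields[5],
        "log2_fold": float(fields[6]),     # assumed: log2 fold enrichment
        "neg_log10_p": float(fields[7]),   # assumed: -log10(p value)
    }


def is_significant(peak: dict) -> bool:
    """Keep peaks that pass both the enrichment and the p value threshold."""
    return peak["log2_fold"] >= LOG2_FOLD_MIN and peak["neg_log10_p"] > NEG_LOG10_P_MIN


# Two made-up peaks: the first passes both thresholds, the second is under-enriched.
example_lines = [
    "chr1\t1000\t1050\tpeak_a\t1000\t+\t4.2\t6.5\t-1\t-1",
    "chr1\t2000\t2040\tpeak_b\t200\t-\t1.1\t2.0\t-1\t-1",
]
kept = [p for p in map(parse_peak, example_lines) if is_significant(p)]
print([(p["chrom"], p["start"], p["end"]) for p in kept])
```

Expressing both thresholds on the scale stored in the file avoids converting every peak back to a raw p value or fold change.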

Cytoscape 3.4.0 89 was used to visualize and analyze networks.

Splicing and transcription factors hierarchical networks

The TF-gene networks for the GM12878 and K562 cell lines were downloaded from Gerstein et al.’s work 1 (http://encodenets.gersteinlab.org/). The TF-TF networks were then extracted from the TF-gene networks by requiring that both nodes of an edge are among the 119 transcription factors studied in Gerstein et al.’s work. We studied the hierarchies of these two TF-TF networks and our SF-SF networks with a method similar to that presented in Gerstein et al.’s work. In detail, the hierarchy (h) of each node is calculated as h = (O - I)/(O + I), where O represents the out-degree and I represents the in-degree. A high value means the node is in an “executive” position, while a low value means the node is under regulation. The networks were plotted in 3 layers: the top “executive” layer (h > 1/3), the bottom under-regulation layer (h < -1/3), and a middle layer. This analysis was performed by Dr. Hu.
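The hierarchy calculation can be illustrated with a short sketch. It is not the code used for the analysis; the node names are hypothetical placeholders, and the handling of values exactly at ±1/3 (assigned to the middle layer) follows the strict inequalities given above.

```python
from collections import defaultdict

def hierarchy_scores(edges):
    """Compute h = (O - I) / (O + I) for every node of a directed network.

    edges: iterable of (regulator, target) pairs. Every node that appears in
    an edge has O + I >= 1, so the denominator is never zero.
    """
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for regulator, target in edges:
        out_degree[regulator] += 1
        in_degree[target] += 1
    scores = {}
    for node in set(out_degree) | set(in_degree):
        o, i = out_degree[node], in_degree[node]
        scores[node] = (o - i) / (o + i)
    return scores

def layer(h):
    """Assign a node to the top, middle, or bottom layer from its h value."""
    if h > 1 / 3:
        return "top"
    if h < -1 / 3:
        return "bottom"
    return "middle"

# Hypothetical three-node network: A regulates B and C, B regulates C.
TOY_EDGES = [("A", "B"), ("A", "C"), ("B", "C")]

if __name__ == "__main__":
    for node, h in sorted(hierarchy_scores(TOY_EDGES).items()):
        print(node, h, layer(h))
```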

Assessment of splicing factor binding enrichment to transcripts of splicing factors

Expression data for HepG2 and K562 (ENCODE, whole cell, long poly A; experiment E-GEOD-26284) were downloaded through the EMBL-EBI Expression Atlas (https://www.ebi.ac.uk/gxa/experiments/E-GEOD-26284, accessed on December 12, 2016). Transcripts were sorted by expression, and only transcripts with an FPKM of 1 or above were included in our analysis. Arun Desai and I collected 12,887 genes for K562 and 12,692 for HepG2. There are 82 splicing factor genes expressed in HepG2 cells and 84 in K562 cells with an FPKM over 1.

To find the fraction of splicing factors bound by a given splicing regulator, Arun Desai and I calculated the ratio of the number of splicing factors a protein binds to the number of expressed splicing factors in a given cell line (K562 or HepG2). We compared this ratio to the ratio of the number of genes the splicing factor binds to the number of all expressed genes in the same cell line (Table 1.5, Figure 1.6).
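The two ratios being compared can be sketched as below. This is an illustrative outline under simple assumptions (gene-level FPKM values in a dictionary, binding targets as a set of gene names); the function names and toy gene identifiers are hypothetical.

```python
def expressed_genes(fpkm_by_gene, min_fpkm=1.0):
    """Keep genes whose FPKM is at or above the expression cutoff."""
    return {gene for gene, fpkm in fpkm_by_gene.items() if fpkm >= min_fpkm}

def binding_fractions(bound_genes, expressed, splicing_factor_genes):
    """Return (fraction of expressed splicing factors bound,
               fraction of all expressed genes bound)."""
    expressed = set(expressed)
    expressed_sf = expressed & set(splicing_factor_genes)
    bound = set(bound_genes) & expressed      # ignore targets below the cutoff
    frac_sf = len(bound & expressed_sf) / len(expressed_sf)
    frac_all = len(bound) / len(expressed)
    return frac_sf, frac_all

# Hypothetical toy data: five genes, three of which are splicing factors.
TOY_FPKM = {"g1": 5.0, "g2": 1.0, "g3": 0.4, "g4": 12.0, "g5": 2.5}
TOY_SF = {"g1", "g2", "g3"}
TOY_BOUND = {"g1", "g2", "g3"}

if __name__ == "__main__":
    expressed = expressed_genes(TOY_FPKM)
    print(binding_fractions(TOY_BOUND, expressed, TOY_SF))
```

A first ratio larger than the second suggests preferential binding to splicing factor transcripts; assessing whether the difference is significant would need a separate statistical test, which this sketch does not include.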

Table 1.6. eCLIP accession codes

HepG2 eCLIP                               K562 eCLIP
Splicing Factor   ENCODE Accession Code   Splicing Factor   ENCODE Accession Code
HNRNPUL1          ENCFF867UTV             FMR1              ENCFF473OKU
HNRNPUL1          ENCFF439LWA             FMR1              ENCFF464VRH
U2AF1             ENCFF937SDW             HNRNPUL1          ENCFF087TZL
U2AF1             ENCFF580XXM             HNRNPUL1          ENCFF399OSD
TIA1              ENCFF103OSX             KHSRP             ENCFF066PCT
TIA1              ENCFF354DYL             KHSRP             ENCFF512ZBC
TRA2A             ENCFF271DET             U2AF1             ENCFF584WFG
TRA2A             ENCFF443POQ             U2AF1             ENCFF840LLJ
HNRNPA1           ENCFF276NLJ             TIA1              ENCFF459AMC
HNRNPA1           ENCFF938FPK             TIA1              ENCFF827ZJE
PCBP2             ENCFF868AYD             DDX42             ENCFF248XLD
PCBP2             ENCFF877CWI             DDX42             ENCFF404VAY
SRSF9             ENCFF529OUY             TRA2A             ENCFF149BAF
SRSF9             ENCFF979SAC             TRA2A             ENCFF512DWR
SF3A3             ENCFF912RGT             HNRNPA1           ENCFF457MNN
SF3A3             ENCFF506HBD             HNRNPA1           ENCFF288SMV
HNRNPC            ENCFF264TJL             TARDBP            ENCFF781KGR
HNRNPC            ENCFF991KAG             TARDBP            ENCFF916MCG
RBFOX2            ENCFF433UTA             RBFOX2            ENCFF426AYZ
RBFOX2            ENCFF179HQN             RBFOX2            ENCFF474SIP
HNRNPK            ENCFF994SDB             HNRNPK            ENCFF655RBH
HNRNPK            ENCFF271ASO             HNRNPK            ENCFF951RJR
QKI               ENCFF714ZTU             QKI               ENCFF379OMP
QKI               ENCFF299VTZ             QKI               ENCFF342CVS
PTBP1             ENCFF059MNT             PTBP1             ENCFF196VDD
PTBP1             ENCFF173FHO             PTBP1             ENCFF166LLH
HNRNPM            ENCFF079OKS             HNRNPM            ENCFF675ZIU
HNRNPM            ENCFF746CAJ             HNRNPM            ENCFF571NRK
DDX3X             ENCFF637BHY             DDX3X             ENCFF483OBV
DDX3X             ENCFF397MAX             DDX3X             ENCFF912NCS
SFPQ              ENCFF937XDO             KHDRBS1           ENCFF308KNI
SFPQ              ENCFF128SFP             KHDRBS1           ENCFF406JXO
SRSF1             ENCFF806LMS             SRSF1             ENCFF647VSE
SRSF1             ENCFF977LDY             SRSF1             ENCFF548JES
SRSF7             ENCFF250WLO             ZRANB2            ENCFF927FBB
SRSF7             ENCFF130DJA             ZRANB2            ENCFF696QAI
U2AF2             ENCFF683PRA             SRSF7             ENCFF822SZT
U2AF2             ENCFF202TGG             SRSF7             ENCFF435RSV
SF3B4             ENCFF795NIS             U2AF2             ENCFF129DJK
SF3B4             ENCFF367PPA             U2AF2             ENCFF009ERS
HNRNPU            ENCFF335HFQ             SF3B4             ENCFF254LKZ
HNRNPU            ENCFF957ECH             SF3B4             ENCFF930OND
                                          HNRNPU            ENCFF759QAQ
                                          HNRNPU            ENCFF464SKW

Chapter 2 Transcriptome-wide identification of potential RUST targets reveals extensive redundancy between HeLa and GM12878

Abstract

Nonsense-mediated mRNA decay (NMD) is a mechanism that degrades transcripts with a premature termination codon. Recent data suggest that NMD is an important mechanism of global gene expression regulation that is specific to cells and tissues. Here, we describe identification of a confident set of NMD targets following the exon-junction complex (EJC) model of NMD. We performed transcriptome profiling of human cells, HeLa and GM12878, depleted of the NMD factors UPF1 and SMG6 or exposed to cycloheximide (CHX). We observe that NMD factor knockdown (KD) is likely a better method to identify NMD targets than the CHX treatment. Our experiments overwhelmingly reveal that isoforms which are NMD targets in one cell line are not expressed in the other cell line. We also find that approximately 30% of NMD targets are shared among these two cell lines.

Introduction

In the past few decades, nonsense-mediated mRNA decay (NMD) has been investigated to elucidate its molecular mechanism and to determine unifying features of RNA targets degraded by NMD. Many diverging models of the NMD pathway have emerged which, despite their differences, all agree that NMD requires a pioneer round of translation (as reviewed in 90), and hence translation inhibitors such as cycloheximide (CHX), emetine, and puromycin inhibit NMD 91. It is also well established that the NMD pathway requires a number of well-characterized trans-acting factors. One of the core NMD factors is UPF1, an ATP-dependent helicase, which is essential for NMD in all eukaryotes. UPF2, another core NMD factor, functions as a bridge that links UPF1 to UPF3, the third core NMD factor 92. UPF3 associates with exon junction complexes (EJCs), which are deposited on freshly spliced mRNAs 20 to 24 nucleotides upstream of exon-exon junctions (Le Hir et al. 2000). In addition, NMD depends on SMG1, a phosphatidylinositol 3-kinase-related kinase that phosphorylates UPF1. SMG1 associates with two other NMD regulatory factors: SMG8 and SMG9 93,94. Two additional NMD factors, SMG5 and SMG7, form a heterodimer that plays a role in UPF1 dephosphorylation 95. SMG6, the effector of the NMD pathway, cleaves transcripts in the vicinity of the NMD-triggering premature termination codons (PTCs) and exposes them to degradation 96,97. In addition to these well-characterized factors, a few additional components of the NMD pathway have been identified. Several of them, such as NBAS, DHX34, GNL2, SEC13, PNRC2, MOV10, and RUVBL1/2, are required by or affect human NMD (as reviewed in 98).

To study targets of NMD, it is imperative to inhibit the NMD pathway to reveal the hidden transcriptome. In mammalian cell culture, there are two general approaches to arrest the NMD pathway: chemical treatment with translation inhibitors or knockdown of essential NMD factors. Both approaches reveal transcripts that would be degraded by NMD. However, each approach has advantages and disadvantages. The main advantages of translation inhibitors are ease of use and rapid execution, which does not extend beyond several hours, allowing for rapid capture of NMD-targeted transcripts. The main disadvantage of this approach is cellular toxicity and induction of autophagy 99. The main advantage of targeted gene knockdown is its potentially higher specificity. However, this may be hampered by the fact that many NMD factors have been found to have secondary cellular roles (as described in 100). In addition, gene knockdown requires days to complete, during which the NMD-targeted transcriptome might be altered. Also, even when successful, gene knockdown never reduces the protein level to zero in the entire treated cell population, allowing the NMD pathway to remain active at a low level (as reviewed in 101).

Two NMD models have emerged: the first proposes that mRNAs with long 3’ UTRs are targeted by NMD; the second, the EJC model, states that mRNAs with premature termination codons more than 50 to 55 nucleotides upstream of an exon-exon junction (PTC50) are targeted by NMD. However, many mRNAs with long 3’ UTRs are not targeted by NMD, and many mRNA transcripts with PTC50s do not seem to increase in abundance when NMD is inhibited 102,103,104. Even though neither model can predict with certainty which transcript is an NMD target, the presence of a PTC50 has been shown to be predictive of 75% of NMD targets 105.
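The positional rule of the EJC model can be made concrete with a small sketch. It is an illustration only and not the annotation procedure used in this study. It assumes transcript coordinates in which exon lengths are listed 5’ to 3’, that distance is measured from the end of the stop codon to the last exon-exon junction, and a cutoff of 50 nucleotides; the function name and example numbers are hypothetical.

```python
from itertools import accumulate

def has_ptc50(exon_lengths, stop_codon_end, min_distance=50):
    """Return True if the stop codon lies more than min_distance nucleotides
    upstream of the last exon-exon junction.

    exon_lengths: exon lengths in transcript order (5' to 3').
    stop_codon_end: transcript coordinate just past the stop codon (0-based,
        half-open), an assumed convention for this sketch.
    """
    if len(exon_lengths) < 2:
        return False  # a single-exon transcript has no exon-exon junction
    # Cumulative exon ends; all but the last are junction positions.
    junctions = list(accumulate(exon_lengths))[:-1]
    last_junction = junctions[-1]
    return last_junction - stop_codon_end > min_distance

if __name__ == "__main__":
    exons = [200, 150, 300]   # junctions at transcript positions 200 and 350
    print(has_ptc50(exons, stop_codon_end=250))  # stop 100 nt upstream of last junction
    print(has_ptc50(exons, stop_codon_end=320))  # stop 30 nt upstream of last junction
    print(has_ptc50(exons, stop_codon_end=500))  # stop in the last exon
```

Under these assumptions, only the first call reports a PTC50, because the other two stop codons are either too close to the last junction or downstream of it.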

NMD has been shown to play a role in embryonic development and tissue-specific cell differentiation 106 (and as reviewed in 107). NMD efficiency varies by cell type 108,109,110. To study cell line-specific NMD targets and to obtain a confident set of NMD targets, researchers often select well-established cell lines such as HeLa. For over 60 years, cervical cancer-derived HeLa cells have been the workhorse of molecular biology; however, over the years, HeLa cells have accumulated numerous duplications and deletions that affect gene expression compared to the normal human genome 111. To study the targets of NMD, in addition to HeLa cells, I select GM12878 cells. Although GM12878 cells are immortalized via Epstein-Barr virus transformation, these lymphoblastoid cells are characterized by a normal karyotype. In addition to their normal karyotype, the advantage of using GM12878 cells is that they are the original cells of the HapMap project and are identified as a tier 1 cell line by the ENCODE project. The two cell lines also differ in other respects. HeLa cells are epithelial, grow as a monolayer, and are very easy to transfect, whereas the much smaller transformed B-lymphocytes are cultured in suspension, have a very high tendency to clump, and are difficult to transfect. Transient gene knockdown in HeLa cells has been reported numerous times 112. In contrast, a protocol for transient knockdown of genes in GM12878 cells was not available at the time the experiments in this study were conducted. Here I utilize established protocols to knock down SMG6 and UPF1 in HeLa cells. I also establish an optimized protocol to knock down SMG6 in GM12878 cells.

In this study, we aim to obtain a confident set of NMD targets derived from HeLa and GM12878 cells. We define confident NMD targets as transcripts that contain a PTC more than 50 nucleotides upstream from an EJC and that are upregulated at least two-fold in cells with NMD inhibited. In addition, the corresponding genes are required to express transcripts with normal stop codons (non-PTC50) that do not increase in abundance when NMD is inhibited. To increase the statistical power of discovery, Dr. Hu merges sequencing data from gene knockdown and translation inhibitor treatments in HeLa and GM12878. Dr. Hu and I also explore cell line-specific NMD targets.

Results

Gene knockdown efficiency is higher in HeLa than in GM12878 cells

To obtain a confident set of NMD targets, we aim to inhibit the NMD pathway in HeLa and GM12878 cells. In the past 60 years, HeLa cells have accrued a multitude of mutations, duplications, and chromosomal rearrangements, resulting in an aberrant karyotype 113,111. In contrast, GM12878 cells, which are EBV-transformed B-lymphocytes, are characterized by a normal karyotype. In addition to karyotype differences, HeLa cells are epithelial, grow as a monolayer, and are very easy to transfect, whereas the much smaller GM12878 cells grow in suspension, have a very high tendency to clump, and are difficult to transfect. To inhibit the NMD pathway, I selected two core NMD factors: UPF1 and SMG6. UPF1 has been targeted for silencing in many NMD-related studies 114,115,116,104,117,118. However, UPF1 also plays roles that are NMD independent, as summarized in 100, and hence NMD targets identified via UPF1 knockdown may include false positives. Therefore, I aim to knock down a second core NMD factor, SMG6, which has been identified as an effector of the NMD pathway; its knockdown likely distinguishes NMD-targeted transcripts from non-NMD-targeted transcripts more accurately than the UPF1 knockdown. However, SMG6 also has additional roles in replication and maintenance of chromosome ends 119. As a parallel method independent of gene knockdown, I apply translation inhibition via cycloheximide (CHX) 120,121. The drawback of cycloheximide is that, as an inhibitor of translation, it indiscriminately affects all RNA transcripts. This has a broad effect on the entire transcriptome and likely introduces noise during NMD-target identification. However, cycloheximide offers a fast approach and has been used widely in NMD-inhibition studies (see, for example, 122,123,124,125,126).
Here, we aim to discern the overlap in NMD targets identified via CHX treatment and UPF1 or SMG6 knockdown to find out if either technique is preferred in NMD target identification.

Table 2.1. Experimental approaches employed to achieve strong transient gene knockdown in GM12878 cells.

Transfected particle   Transfection method                                                           Transfection outcome
UPF1 shRNA             Lipofection: Lipofectamine LTX with Plus Reagent, with puromycin selection    Cell death in response to puromycin selection
UPF1 siRNA             Nanoparticle mediated transfection                                            Western Blot with anti-UPF1 antibody: no UPF1 knockdown detected
UPF1 siRNA             Lipofection: Lipofectamine RNAiMAX Reagent                                    Western Blot with anti-UPF1 antibody: no UPF1 knockdown detected
UPF1 siRNA             Lipofection: INTERFERin                                                       Western Blot with anti-UPF1 antibody: no UPF1 knockdown detected
GFP                    Electroporation (Solution R, Program Y-001)                                   Transfection efficiency below 30%
GFP                    Electroporation (Solution T, Program T-001)                                   Transfection efficiency below 30%
GFP                    Double electroporation (Solution V, Program U-009)                            Transfection efficiency at 60%
SMG6 siRNA             Double electroporation (Solution V, Program U-009)                            Western Blot with anti-UPF1 and anti-SMG6 antibody: UPF1 knockdown reaches 95% and SMG6 knockdown reaches 50%

For NMD factor knockdown, I employ different experimental techniques in HeLa and GM12878 cells. Transient gene knockdown in HeLa cells has been reported 112. To knock down NMD factors in HeLa cells, I use shRNAs specific against UPF1 and SMG6. The shRNA plasmids are a generous gift from Professor Muhlemann, and I follow the published protocol 127 (also described in Materials and Methods). A protocol for transient gene knockdown in GM12878 cells was not publicly available at the time of this study (a method was published by 128 after completion of the GM12878 transfection experiments). To obtain strong transient gene knockdown in GM12878 cells, I undertook several strategies and optimization steps (Table 2.1). I determined that lipid-mediated transfection, which was successfully applied to introduce shRNA-expressing plasmids into HeLa cells, does not perform well with GM12878 cells. Empirical tests confirm that two rounds of electroporation, twenty-four hours apart, successfully introduce siRNAs into these cells. This method results in a knockdown of UPF1 and SMG6 by over 90% as assessed by Western Blot and by nearly 50% as assessed by qPCR (Figure 2.1). Overall, the level of gene knockdown is lower for GM12878 cells due to the lack of antibiotic selection, a technique that was applied to HeLa cell culture to eliminate non-transfected cells. To validate the strength of the NMD pathway inhibition, I assess expression of an SRSF6 NMD-targeted isoform in NMD-inhibited cells and compare it to expression in control cells. In cells with inhibited NMD, this transcript is more abundant since it is not degraded by the NMD pathway (Figure 2.2). Interestingly, even though both UPF1 and SMG6 are knocked down to similar levels in GM12878 cells, only SMG6 knockdown successfully inhibits the NMD pathway. Hence, UPF1 knockdown in GM12878 cells is not considered in this study.
No such discrepancy is observed in HeLa cells; UPF1 and SMG6 are both knocked down to similar levels (>92%) as indicated by Western Blot (Figure 2.1). According to qPCR, UPF1 knockdown (90% average) is stronger than that of SMG6 (73% average) (Figure 2.1). Also, on average, inhibition of the NMD pathway, as verified by upregulation of the NMD-targeted SRSF6 transcript, is stronger in HeLa replicates with inhibited UPF1 than in replicates with inhibited SMG6 (Figure 2.2). I also verify that CHX treatment inhibits the NMD pathway; however, upregulation of the NMD-targeted SRSF6 isoform is, on average, lower when CHX is applied than when NMD factors are knocked down in both cell lines (Figure 2.2).

[Figure 2.1 image. Panel A: Western blots of UPF1 and SMG6 with actin as loading control. Knockdown levels shown per replicate: HeLa UPF1 94.8%, 95.1%, 94.3%, 92.1%; HeLa SMG6 92.9%, 98.9%, 96.5%, 96.1%; GM12878 UPF1 100%, 95.1%, 95.5%; GM12878 SMG6 95.0%, 94.7%, 96.8%. Panel B: expression change (knockdown / control) for UPF1 and SMG6 knockdown in HeLa and GM12878.]

Figure 2.1 Experimental validation of SMG6 and UPF1 knockdown in HeLa and GM12878 cells by qPCR and western blots. (A) Western Blot with antibodies specific to UPF1 and SMG6. Percent knockdown is calculated based on knockdown signal relative to control as presented side by side in the figure. Actin is used as a loading control. (B) qPCR results showing gene silencing efficiency of shRNA and siRNA sequences targeting UPF1 or SMG6 in HeLa and GM12878 cells. A non-specific shRNA or siRNA was used as control. qPCR results are normalized to TATA box binding protein (TBP). All replicates are shown as individual points and the average is marked as a horizontal line.


Figure 2.2. Validation of NMD inhibition through SRSF6 isoform expression alterations. (A) Schematic representation of the SRSF6 isoforms. SRSF6 can be spliced into productive (comprising blue exons) and unproductive (containing a yellow exon in addition to the blue ones) isoforms. Stop codons are marked as stop signs. The PTC50 stop codon is positioned in the yellow exon. Amplicons span exon junctions and are marked with arrows. They amplify the following portions of the SRSF6 gene: (a) amplifies non-PTC50 transcripts, (b) amplifies PTC50 transcripts, and (c) amplifies all SRSF6 transcripts. (B) qPCR results show the extent of SRSF6 upregulation in HeLa cells treated with shRNA against UPF1 and SMG6. The upregulation is compared to expression in cells treated with non-specific shRNA. GM12878 cells are treated with siRNA against SMG6 and UPF1. As a control, non-specific siRNA is used. HeLa and GM12878 cells are treated with cycloheximide (CHX), and a control is treated with DMSO. All qPCR results are normalized to TBP. All replicates are presented as individual points and the average is marked as a horizontal line.

NMD target identification with ANOVA-like strategy

Identifying confident NMD targets poses several challenges. Often, differential gene expression is used to identify such targets; when NMD is inhibited, transcripts that increase in abundance compared to the control condition are branded as NMD targets. However, this approach likely introduces noise, since, as mentioned above, many NMD factors and chemical agents that inhibit NMD have secondary effects that are NMD independent. Much research has been done to identify common characteristics of NMD-sensitive transcripts. Two prevailing models have been proposed. One model, the exon-junction complex (EJC) model, proposes that transcripts with a PTC more than 50 to 55 nucleotides upstream of an exon-exon junction are NMD targets 129. Another model, the long 3’ UTR model, claims that transcripts with long 3’ UTRs are targets of NMD 103. As more studies are completed, it is beginning to emerge that the EJC model, while not 100% predictive, contains a strong signal for NMD, whereas the long 3’ UTR model does not (C. French unpublished). Here, to identify a strict set of NMD targets, we define a confident NMD target as follows: (1) the transcript contains a PTC more than 50 to 55 nucleotides upstream from an exon-exon junction (PTC50), (2) the expression of this transcript is at least 2 times higher in NMD-inhibited cells than in control cells, (3) the corresponding gene expresses at least one transcript with a normal termination codon (non-PTC50) at FPKM > 0 whose expression does not increase above 1.2 times in cells with inhibited NMD. Following this definition, in GM12878 cells Dr. Hu identifies 1754 NMD targets when SMG6 is knocked down, and 2957 when cells undergo CHX treatment, among 29,543 expressed transcripts. In HeLa cells, Dr. Hu identifies 2406 NMD targets when SMG6 is knocked down, 2866 when UPF1 is knocked down, and 2595 when cells undergo CHX treatment, among 26,781 expressed transcripts (Figure 2.3, panel C).
Overall, gene knockdown and CHX identify a similar number of NMD targets.
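The three-part definition of a confident NMD target can be sketched as a simple filter. This is an illustrative outline, not the pipeline used by Dr. Hu. It assumes one record per isoform with a precomputed PTC50 flag and mean FPKM values in control and NMD-inhibited cells, and it interprets "FPKM > 0" as referring to the control condition; the record layout, function name, and toy values are hypothetical.

```python
def confident_nmd_targets(isoforms, up_fold=2.0, stable_fold=1.2):
    """Return IDs of isoforms meeting all three criteria.

    Each isoform is a dict with keys: id, gene, ptc50 (bool),
    control (FPKM), inhibited (FPKM).
    """
    # Criterion 3: genes with at least one expressed non-PTC50 isoform that
    # does not rise above stable_fold times its control level.
    genes_with_stable_isoform = {
        iso["gene"]
        for iso in isoforms
        if not iso["ptc50"]
        and iso["control"] > 0
        and iso["inhibited"] <= stable_fold * iso["control"]
    }
    return [
        iso["id"]
        for iso in isoforms
        if iso["ptc50"]                                    # criterion 1
        and iso["inhibited"] > 0
        and iso["inhibited"] >= up_fold * iso["control"]   # criterion 2
        and iso["gene"] in genes_with_stable_isoform       # criterion 3
    ]

# Hypothetical toy isoforms (FPKM values are made up for illustration).
TOY_ISOFORMS = [
    {"id": "A1", "gene": "geneA", "ptc50": True,  "control": 1.0,  "inhibited": 3.0},
    {"id": "A2", "gene": "geneA", "ptc50": False, "control": 10.0, "inhibited": 11.0},
    {"id": "B1", "gene": "geneB", "ptc50": True,  "control": 2.0,  "inhibited": 5.0},
    {"id": "B2", "gene": "geneB", "ptc50": False, "control": 10.0, "inhibited": 20.0},
    {"id": "C1", "gene": "geneC", "ptc50": True,  "control": 4.0,  "inhibited": 5.0},
    {"id": "C2", "gene": "geneC", "ptc50": False, "control": 5.0,  "inhibited": 5.0},
]

if __name__ == "__main__":
    print(confident_nmd_targets(TOY_ISOFORMS))
```

In the toy data, B1 is excluded because the only non-PTC50 isoform of its gene also doubles, and C1 is excluded because it rises only 1.25-fold. Requiring a stable non-PTC50 isoform is what separates a likely NMD effect from a gene-wide increase in expression.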

It is likely that knocking down UPF1 or SMG6, two core NMD factors, provides a more accurate identification of NMD targets than CHX treatment, due to the direct involvement of these factors in the NMD pathway. For a more comprehensive comparison, Dr. Hu examines the overlap of NMD targets (PTC50 isoforms) alongside the overlap of isoforms that we do not designate as PTC50 but which are upregulated when NMD is inhibited (we call these isoforms non-PTC50). The intersection of NMD targets identified by SMG6 KD and by CHX treatment in GM12878 cells is 382, which is 2.6 times higher than the overlap of upregulated non-PTC50 transcripts (Figure 2.3, panel C). The intersection of NMD targets identified by SMG6 KD and by CHX in HeLa cells is 476, which is 2.3 times higher than the overlap between upregulated non-PTC50 transcripts (Figure 2.3, panel C). Since SMG6 and UPF1 act in the same pathway, we expect greater overlap between NMD targets identified via knockdown of these two genes than between gene knockdown and CHX treatment. Indeed, in HeLa, this overlap is 1025 and is 1.4 times higher than the overlap between upregulated non-PTC50 transcripts (Figure 2.3, panel C). This greater overlap suggests that gene knockdown is a better method to identify NMD targets than CHX treatment.

To better compare the overlap of NMD targets between gene knockdown and CHX treatment, Dr. Hu analyzes the change in expression of all expressed transcripts upon NMD inhibition (Figure 2.3, panel A) and calculates correlation coefficients, both Spearman and Pearson, between compared conditions for PTC50 isoforms, all non-PTC50 isoforms, and all expressed isoforms (Table 2.2). For all comparisons, the Spearman coefficient is higher than the Pearson coefficient. The correlation is weak (Spearman 0.0 – 0.2) for non-PTC50 isoforms in HeLa and GM12878 when gene knockdown is compared to CHX. However, it increases to moderate (Spearman ~0.4) when only PTC50 isoforms are taken into consideration. The correlation is strong (Spearman 0.7) for non-PTC50 isoforms when UPF1 knockdown is compared to SMG6 knockdown and increases to very strong (Spearman 0.9) for PTC50 isoforms. It is worth noting that the weaker SMG6 knockdown in GM12878 than in HeLa is reflected in the weaker upregulation of PTC50 isoforms (red dots are skewed towards the y-axis) (Figure 2.3, panel A and B). The correlation coefficients indicate that CHX and NMD factor knockdown, while not strongly correlated, both result in NMD signal.
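For reference, the two coefficients can be computed as in the sketch below, written in plain Python so that each step is visible. It is not the code used for Table 2.2; the function names and the toy ratios are hypothetical. Spearman correlation is computed here as the Pearson correlation of average ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical expression ratios for five isoforms under two treatments.
TOY_X = [1.0, 2.0, 3.0, 4.0, 5.0]
TOY_Y = [1.0, 4.0, 9.0, 16.0, 100.0]   # monotonic in TOY_X, with one extreme value

if __name__ == "__main__":
    print("Pearson:", pearson(TOY_X, TOY_Y))
    print("Spearman:", spearman(TOY_X, TOY_Y))
```

In the toy data, the ranks agree perfectly while one extreme ratio lowers the Pearson coefficient. A similar sensitivity of Pearson correlation to extreme expression ratios may contribute to the consistently higher Spearman values reported above, although this was not tested here.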


Figure 2.3. Impact of SMG6/UPF1 knockdown and CHX treatment on PTC50 and non-PTC50 isoforms. (A) Scatter plot comparing all isoform expression ratios (control/NMD inhibition) in the following (from left to right): (1) SMG6 knockdown versus CHX treatment in GM12878 cells, (2) SMG6 knockdown versus CHX treatment in HeLa cells, and (3) SMG6 versus UPF1 knockdown in HeLa cells. Red points represent PTC50 isoforms, and black points represent non-PTC50 isoforms. Plots are truncated at a ratio of 2.0. (B) Density heatmap of isoform expression ratios (control/NMD inhibition) of PTC50 isoforms. The order from left to right is the same as displayed in panel A. (C) Venn diagrams presenting the overlap of upregulated non-PTC50 isoforms and PTC50 isoforms in GM12878 and HeLa obtained via UPF1 knockdown, SMG6 knockdown, or CHX treatment.

As mentioned, there is very strong correlation in expression changes between transcripts derived from HeLa UPF1 and HeLa SMG6 knockdowns (Spearman 0.9, Table 2.2). However, this result is skewed towards stronger correlation since the same control is used in both experiments (expression of isoforms from HeLa cells treated with non-specific shRNA). In contrast, the controls for gene knockdown and CHX treatment are derived from independent samples: the control for SMG6 knockdown is the expression of isoforms from cells treated with non-specific shRNA (HeLa) or siRNA (GM12878); the control for CHX treatment is the expression of isoforms from HeLa or GM12878 treated with DMSO. Thus, we expect much lower correlation between isoforms whose expression is quantified from gene knockdown and CHX treatment, compared to isoforms whose expression is quantified from gene knockdowns in the same cell line.

Table 2.2. Spearman and Pearson correlation coefficients of isoform expression ratio (control/NMD inhibition) between different treatments in HeLa and GM12878 cells.

                                                  non-PTC50            PTC                  Total
Comparison                                        Spearman  Pearson    Spearman  Pearson    Spearman  Pearson
(1) HeLa CHX treatment versus UPF1 knockdown      0.201     0.108      0.397     0.258      0.312     0.163
(2) HeLa CHX treatment versus SMG6 knockdown      0.153     0.068      0.405     0.267      0.29      0.133
(3) HeLa UPF1 knockdown versus SMG6 knockdown     0.65      0.388      0.869     0.592      0.738     0.444
(4) GM12878 CHX treatment versus SMG6 knockdown   0.134     0.079      0.415     0.237      0.278     0.109

Since both approaches, gene knockdown and CHX treatment, give moderate to strong NMD signal, Dr. Hu opts to combine these two methods to detect cell line-specific NMD targets and to obtain a confident set of NMD targets that is common to HeLa and GM12878.

The principal component analysis of the studied transcriptomes, split into the two cell lines, shows distinctiveness between different experimental treatments and considerable clustering among replicates in HeLa cells, and to a lesser extent in GM12878 cells (Figure 2.4, top panels). Expression analysis of PTC50 isoforms confirms that they are highly upregulated when NMD is inhibited, a trend that is not observed among non-PTC50 isoforms. For non-PTC50 isoforms, a similar number increases and decreases upon NMD inhibition (Figure 2.4, central and bottom panels).

Dr. Hu identified 1889 NMD isoforms (from 948 genes) in GM12878 cells and 943 isoforms (from 763 genes) in HeLa cells.


Figure 2.4. Differentially expressed PTC50 and non-PTC50 isoforms upon NMD inhibition in HeLa and GM12878 cells. (A) Principal component analysis based on isoform expression. (B) Histogram of PTC50 and non-PTC50 isoforms upon NMD inhibition. Blue curve represents non-PTC50 isoforms and red curve represents PTC50 isoforms. PTC50 isoforms are up-regulated. (C) Volcano plot depicting range of isoform expression fold-change (log2[expression ratio]) induced by NMD inhibition and their corresponding p-values (-log10[p-value]). PTC and non-PTC isoforms are marked in blue and red respectively. (D) Number of upregulated and downregulated isoforms in HeLa and GM12878 cells upon NMD inhibition. Gray bars represent all isoforms. Blue bars represent non-PTC50 isoforms and red bars represent PTC50 isoforms.

A number of mammalian genes have been shown to express NMD-targeted transcripts. Many of these genes have been used to validate NMD inhibition. It has been observed that when NMD is inhibited, these genes are upregulated 116,130,131,132,133,134. Here Dr. Hu and I select a number of genes that have been shown to be targets of NMD and show, through our RNA-seq data, that they are upregulated upon NMD inhibition (Figures 2.5, 2.6, 2.7). In general, all genes showed upregulation when NMD was inhibited in at least one experimental condition. Our selection contains many genes that express PTC50 isoforms; however, there are several that do not, such as ATF4, MAFF, GADD34, CHOP, and P8. In many instances, non-PTC50 isoforms are significantly upregulated when NMD is inhibited, as illustrated by CHOP and GADD34 (Figure 2.7). In a few instances, these isoforms are upregulated very strongly upon CHX treatment, and to a lesser extent after NMD factor knockdown, as illustrated by non-PTC50 isoforms of ATF3, GADD34, and even SRSF2 (Figures 2.5, 2.6, 2.7). Several genes respond differently to NMD inhibition in GM12878 and in HeLa cells. For example, PTC50 isoforms of RP9P are upregulated when NMD factors are knocked down in HeLa cells, but not in GM12878. More studies will be required to resolve these discrepancies.

[Figure 2.5 image. Panels: ATF3, ATF4, SMAD4, and SRSF2. For each gene, one plot shows mean expression (FPKM) per isoform and the other shows normalized expression per isoform (TCONS identifiers on the x-axis). Legend: GM12878 (SCR, SMG6 KD, DMSO, CHX); HeLa (SCR, SMG6 KD, UPF1 KD, DMSO, CHX).]

Figure 2.5. Expression changes of genes which have previously been identified as NMD sensitive. For each gene, the left panel shows the mean expression of expressed isoforms. PTC50 isoforms are displayed in red and decorated with a stop sign. Non-PTC50 isoforms are depicted in gray. The right panel shows the relative expression (normalized by the mean expression) of each isoform in all experimental conditions, which are labeled in different colors. The corresponding experimental and control conditions are located side by side. Two-sided Student's t-tests were used to detect any significant difference between experimental and control conditions. Data are presented as mean ± SD. *p < 0.05, **p < 0.01, and ***p < 0.001.

[Figure 2.6 image. Panels: SRSF3, MAFF, IRE1a (ERN1), BIP (HSPA5), and RP9P. For each gene, one plot shows mean expression (FPKM) per isoform and the other shows normalized expression per isoform (TCONS identifiers on the x-axis). Legend: GM12878 (SCR, SMG6 KD, DMSO, CHX); HeLa (SCR, SMG6 KD, UPF1 KD, DMSO, CHX).]

Figure 2.6. Expression changes of genes which have previously been identified as NMD sensitive. Legend same as in Figure 2.5.

[Figure 2.7 image. Panels: GADD34, CHOP, and PTBP1. For each gene, one plot shows mean expression (FPKM) per isoform and the other shows normalized expression per isoform (TCONS identifiers on the x-axis). Legend: GM12878 (SCR, SMG6 KD, DMSO, CHX); HeLa (SCR, SMG6 KD, UPF1 KD, DMSO, CHX).]

Figure 2.7. Expression changes of genes which have previously been identified as NMD sensitive. Legend same as in Figure 2.5.

It has been reported that many NMD factors, such as UPF2, SMG1, SMG5, SMG6, and SMG7, are up-regulated when UPF1 is knocked down in HeLa cells 135. In addition, it has been shown that UPF1 increases moderately when NMD is inhibited through knockdown of several NMD factors such as UPF2, UPF3b, SMG6, and SMG7 135. Here Dr. Hu and I validate that a SMG6 knockdown in HeLa and GM12878 cells significantly increases gene expression of UPF1, UPF2, SMG1, SMG5, and SMG7 (Figure 2.8). We also evaluate the change in gene expression of other NMD factors, such as SMG8, SMG9, UPF3A, UPF3B, NBAS, DHX34, GNL2, SEC13, RUVBL1/2, and MOV10. When SMG6 is knocked down, significant upregulation is observed for SMG9 and DHX34 (Figures 2.8 and 2.9). Modest upregulation is observed for UPF3A and NBAS. No change in gene expression is observed for SMG8, and a modest decrease is observed for SEC13. In addition, some gene expression changes differ between HeLa and GM12878 cells: expression of GNL2, RUVBL1, RUVBL2, and PNRC2 declines modestly in HeLa cells but does not change in GM12878 cells (Figure 2.9). This difference is likely due to the level of SMG6 knockdown, which is stronger in HeLa than in GM12878 cells (Figure 2.1). To find out whether these gene expression changes might be driven by transcriptional upregulation, we analyze, for each evaluated gene, how the transcript with a normal termination codon that is most highly expressed in the control condition changes upon SMG6 knockdown. The pattern of the transcript changes closely resembles the pattern of the gene expression changes (Tables in Figures 2.8 and 2.9).


Figure 2.8. Gene expression alterations of NMD factors upon SMG6 knockdown in GM12878 and HeLa. Significant difference upon SMG6 knockdown is calculated by one-way ANOVA with cell line as a covariate. Data are presented as mean ±SD. *p <0.05, ** p<0.01 and *** p<0.001.

Figure 2.9. Major isoform expression alterations of NMD factors upon SMG6 knockdown in GM12878 and HeLa. Instead of gene expression alteration (Figure 2.8), we examined the expression change of the major isoforms of each NMD factor. Significant difference upon SMG6 knockdown is calculated by one-way ANOVA with cell line as a covariate. Data are presented as mean ±SD. *p <0.05, ** p<0.01 and *** p<0.001.

Comparison of NMD targets between the two cell lines

To ascertain the differences in NMD targets between GM12878 and HeLa cells, Dr. Hu compared the overlap between 1889 NMD targeted isoforms (from 948 genes) identified in GM12878 cells and 943 NMD targeted isoforms (from 763 genes) identified in HeLa cells. The overlap is 313 isoforms and 335 genes (Figure 2.10). An NMD isoform expression heatmap reveals two patterns. In the majority of cases, isoforms that are NMD targets in one cell line are not expressed in the other cell line. In a minority of cases, isoforms that are NMD targets in one cell line are also upregulated in the other cell line, but below the 2x threshold that we use to qualify an isoform as an NMD target (Figure 2.10).

[Figure panels: (A) Gene Venn diagram: 613 genes in GM12878 only, 335 shared, 428 in HeLa only. Isoform Venn diagram: 813 isoforms in GM12878 only, 313 shared, 630 in HeLa only. (B) Heatmap of NMD-targeted transcript expression (log10 FPKM, color scale −2 to 2) in three blocks (N=313, N=813, N=630). Columns: CHX, DMSO, SMG6 KD, and SCR replicates 1 to 3 for GM12878 and 1 to 4 for HeLa.]

Figure 2.10. Comparison of NMD targeted isoforms identified in GM12878 and HeLa cells. (A) Overlap of NMD targeted genes and isoforms. (B) Heatmap of NMD targeted isoforms classified into three categories: (1) common NMD targets, (2) NMD targets unique to GM12878, (3) NMD targets unique to HeLa. Red color represents relative increase in abundance, blue color represents relative decrease. In each category, isoforms are ordered by their mean expression across all experimental conditions. Experimental conditions are indicated below the heatmap (three replicates for GM12878 and four replicates for HeLa).

A comparison of the expression of cell line-specific NMD isoforms with that of NMD isoforms common to HeLa and GM12878 cells reveals that cell line-specific NMD isoforms tend to be expressed at a lower level than shared NMD targeted isoforms (Figure 2.11). Filtering steps that (1) remove cell line-specific NMD isoforms which are upregulated in the other cell line above 1 FPKM, and (2) remove isoforms that are expressed at FPKM levels below 1 in NMD inhibited samples, result in 29 NMD targeted isoforms specific to GM12878 and 15 NMD targeted isoforms specific to HeLa cells (Figure 2.12). Close analysis of these cell line-specific NMD isoforms reveals that all of them have been classified as NMD targets in error, owing to noise in isoform quantification, too few RNA-seq reads, or wrong annotation (Table 2.3). For example, the PTC50 isoform of UAP1L1, supposedly a GM12878-specific NMD target, is likely misidentified due to a low number of reads at the beginning of the gene (Figure 2.13). In another example, a PTC50 isoform of BRD9 is supposedly a HeLa-specific NMD target, but a closer look at the genome browser shows a problem with the annotation: three out of 19 identified PTC50 isoforms are connected to the adjacent gene. In addition, there are not enough reads to properly quantify the particular PTC50 isoform that has been denoted as a HeLa-specific NMD target (highlighted in red in Figure 2.13). Hence, in HeLa and GM12878 cells, cell line-specific NMD targets that are expressed in both cell lines are likely noise.

Figure 2.11. Comparison of expression of NMD targeted genes and isoforms in GM12878 and HeLa cells. Common PTC50 genes/isoforms are presented in purple; PTC50 genes/isoforms specific to GM12878 are presented in green; and PTC50 genes/isoforms specific to HeLa are presented in yellow. Density plots for GM12878 are plotted along the x-axis and for HeLa along the y-axis. Density plots are coded in the same colors as above and black lines represent all expressed genes/isoforms either in HeLa or GM12878.

Figure 2.12. Detection of stringent cell line-specific NMD targets. Dr. Hu identified 1087 NMD-targeted isoforms in GM12878 cells and 916 confident NMD-targeted transcripts in HeLa cells. To detect more confident cell line-specific NMD targets, Dr. Hu filtered the 1087 and 916 isoforms by an additional two steps: (1) Filter out cell specific NMD isoforms that are also upregulated but do not exceed the threshold (expression ratio of 2) in the opposite cell line (expression ratio >1.2). (2) Filter out isoforms derived from cell line specific expression (FPKM <0.5 in the opposite cell line when NMD is inhibited).
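The two filtering steps in the Figure 2.12 legend can be sketched as follows. This is a minimal illustration, not the code used in the study: the `IsoformStats` structure, the function name, and the input layout are hypothetical, and only the thresholds (expression ratio 1.2 and FPKM 0.5) come from the legend.

```python
from dataclasses import dataclass


@dataclass
class IsoformStats:
    """Summary of one candidate isoform in the OPPOSITE cell line (hypothetical layout)."""
    ratio_other: float  # expression ratio (NMD inhibited / control) in the opposite cell line
    fpkm_other: float   # FPKM in the opposite cell line when NMD is inhibited


def stringent_specific(candidates, sub_threshold=1.2, min_fpkm=0.5):
    """Return IDs of candidate cell line-specific NMD isoforms that pass both filters.

    `candidates` maps isoform ID -> IsoformStats and is assumed to contain only
    isoforms already called as NMD targets in one cell line but not in the other.
    """
    kept = []
    for iso_id, stats in candidates.items():
        # Step 1: drop isoforms that are upregulated in the opposite cell line
        # (ratio > 1.2) even though they miss the two-fold cutoff there.
        if stats.ratio_other > sub_threshold:
            continue
        # Step 2: drop isoforms that are essentially not expressed in the
        # opposite cell line (FPKM < 0.5 when NMD is inhibited).
        if stats.fpkm_other < min_fpkm:
            continue
        kept.append(iso_id)
    return kept


# Toy input with made-up values, for illustration only.
toy = {
    "iso_a": IsoformStats(ratio_other=1.0, fpkm_other=2.0),
    "iso_b": IsoformStats(ratio_other=1.5, fpkm_other=3.0),
    "iso_c": IsoformStats(ratio_other=0.9, fpkm_other=0.1),
}
stringent = stringent_specific(toy)
```

In the toy input, `iso_b` is removed by step 1 and `iso_c` by step 2, so only `iso_a` remains as a stringent cell line-specific candidate.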


Figure 2.13. Examples of misidentified cell line-specific NMD targets. (A) The UAP1L1 gene is an example of a misidentified GM12878-specific NMD target due to limited read numbers near the 5' end, subsequently leading to inaccurate Expectation-Maximization isoform quantification. (B) The BRD9 gene is an example of a misidentified HeLa-specific NMD target due to wrong or incomplete isoform annotations. The splicing events differentiating PTC50 and non-PTC50 NMD isoforms are highlighted.

Table 2.3. In-depth investigation of the detected stringent cell line specific NMD targets.

GM12878
transcript_id    gene_id       gene name   reason
TCONS_00001278   XLOC_000265   SRRM1       noise in isoform quantification
TCONS_00002648   XLOC_000480   UROD        noise in isoform quantification
TCONS_00002670   XLOC_000485   AKR1A1      noise in isoform quantification
TCONS_00016278   XLOC_003220   NVL         not enough reads / wrong annotation
TCONS_00041471   XLOC_008295   SART3       noise in isoform quantification
TCONS_00042235   XLOC_008390   RSRC2       wrong annotation
TCONS_00044062   XLOC_008784   UPF3A       noise in isoform quantification
TCONS_00049857   XLOC_009994   CNIH1       noise in isoform quantification / not enough reads
TCONS_00051294   XLOC_010278   CINP        wrong annotation / noise in isoform quantification
TCONS_00056165   XLOC_011535   SPPL2A      not enough reads / wrong annotation
TCONS_00056171   XLOC_011535   SPPL2A      not enough reads / wrong annotation
TCONS_00064566   XLOC_013280   NAE1        wrong annotation
TCONS_00068918   XLOC_014186   DHX40       noise in isoform quantification / wrong annotation
TCONS_00074499   XLOC_015228   FAM104A     noise in isoform quantification / wrong annotation
TCONS_00083069   XLOC_016952   RPL23AP79   not enough reads / wrong annotation
TCONS_00101669   XLOC_020733   RTEL1       noise in isoform quantification
TCONS_00111145   XLOC_022621   NKTR        noise in isoform quantification / wrong annotation
TCONS_00112605   XLOC_022918   ABHD10      noise in isoform quantification
TCONS_00113988   XLOC_023248   MFN1        wrong annotation
TCONS_00121462   XLOC_024644   ENOPH1      wrong annotation
TCONS_00122219   XLOC_024779   HSPA4L      wrong annotation
TCONS_00132003   XLOC_026859   GIN1        noise in isoform quantification
TCONS_00136245   XLOC_027808   CDC5L       wrong annotation
TCONS_00137570   XLOC_028131   LTV1        noise in isoform quantification / wrong annotation
TCONS_00152599   XLOC_031326   OSBPL3      noise in isoform quantification / wrong annotation
TCONS_00156836   XLOC_032237   MAK16       wrong annotation
TCONS_00162039   XLOC_033427   KDM4C       not enough reads / noise in isoform quantification
TCONS_00165170   XLOC_034073   UAP1L1      not enough reads / wrong annotation
TCONS_00168962   XLOC_034907   FBXW5       noise in isoform quantification

HeLa
transcript_id    gene_id       gene name   reason
TCONS_00007537   XLOC_001499   PPP2R5A     not enough reads / noise in isoform quantification
TCONS_00025760   XLOC_005105   SLC39A13    noise in isoform quantification
TCONS_00027777   XLOC_005534   KIAA1731    noise in isoform quantification
TCONS_00029734   XLOC_005908   TMEM9B      wrong annotation
TCONS_00046711   XLOC_009354   PSMC6       wrong annotation
TCONS_00058230   XLOC_012081   VIMP        wrong annotation / noise in isoform quantification
TCONS_00077290   XLOC_015775   ABHD3       wrong annotation
TCONS_00111794   XLOC_022723   PARP3       not enough reads / noise in isoform quantification
TCONS_00120692   XLOC_024467   KLHL5       not enough reads / wrong annotation
TCONS_00128850   XLOC_026172   UBE2D2      wrong annotation / noise in isoform quantification
TCONS_00130243   XLOC_026458   BRD9        wrong annotation / noise in isoform quantification
TCONS_00131960   XLOC_026842   CHD1        wrong annotation
TCONS_00138183   XLOC_028230   C6orf120    wrong annotation
TCONS_00139903   XLOC_028681   SRPK1       noise in isoform quantification
TCONS_00170066   XLOC_035224   RBM10       wrong annotation


Shared NMD targets are associated with RNA processing

It has been reported that targets of NMD are associated with a broad range of cellular processes 136,137. The set of common NMD targets identified in our studies is significantly enriched in genes involved in biological processes that regulate RNA, such as mRNA processing, RNA splicing, and mRNA export from nucleus (Figure 2.14). Dr. Hu and I also observe enrichment in genes involved in protein biology, such as protein sumoylation and protein folding (Figure 2.14). The most enriched molecular function is poly(A) RNA binding and the most enriched cellular component is nucleoplasm (Figure 2.14). We observe enrichment in the following KEGG pathways: spliceosome and RNA transport.

[Figure plot: horizontal bar chart of enriched terms grouped by Biological Process, Molecular Function, Cellular Component, and KEGG. Top axis: gene count (0 to 120). Bottom axis: enrichment score, -log10(p-value) (0 to 8). Terms marked with an asterisk: mRNA splicing, via spliceosome; RNA splicing; mRNA processing; mRNA export from nucleus; poly(A) RNA binding; nucleoplasm; Spliceosome.]

Figure 2.14. Gene Ontology and KEGG pathway enrichment of genes with NMD targets shared in GM12878 and HeLa cells. Bar plot of enriched GO terms and KEGG pathways (p-value <0.05). Genes expressed in both cell lines were used as the background in the one-sided hypergeometric test. Enrichment p-values are shown in red and gene numbers in each term or pathway are displayed in blue. GO terms or pathways with FDR <0.05 are marked with red asterisks.

Sashimi plots of shared NMD targets reveal that the same molecular event is observed in NMD targets derived from the two different cell lines. For example, NMD targeted isoforms of RHBDD2 exhibit inclusion of an alternative exon harboring PTC50 in a representative SMG6 knockdown replicate obtained from HeLa and GM12878 cells, whereas the productive isoform of the RHBDD2 gene lacks this exon (Figure 2.15). The NMD targeted isoforms of RHBDD2 are expressed at low levels in cells with functioning NMD and increase in abundance when NMD is inhibited, as observed by an increase in junction reads supporting inclusion of the exon harboring PTC50. The RHBDD2 example is simple, however, and in many instances more than one transcript with a different NMD-promoting event is observed in HeLa and GM12878. For example, CBWD2 expresses five NMD-targeted isoforms (Figure 2.15). There are two different alternative exons that can be included in the NMD-targeted isoforms. Both events are observed in representative SMG6 knockdown replicates obtained from HeLa and GM12878 cells (Figure 2.15). As with RHBDD2, NMD isoforms of CBWD2 are expressed at low levels in cells with functioning NMD and are revealed in HeLa and GM12878 cells when SMG6 is knocked down (Figure 2.15). We observe the same NMD features in HeLa and GM12878, and hence, we can conclude that despite genomic aberrations, HeLa can be used to study NMD.


Figure 2.15. Examples of NMD targets common to GM12878 and HeLa cells. Sashimi plots depicting RNA sequencing reads and exon junction reads of (A) RHBDD2 and (B) CBWD2. Each sashimi plot is derived either from representative SMG6 KD or control RNA-seq replicates derived from GM12878 or HeLa cells. Isoform models are displayed below the sashimi plots and are colored black for non-PTC50 isoforms and red for PTC50 isoforms. Splicing events differentiating non-PTC50 and PTC50 isoforms are highlighted and enlarged.

Cell line-specific NMD targets are few and play a role in the biology of the assayed cell line

Our NMD target analysis reveals that very few PTC50 genes are expressed only in GM12878 (37 out of 2202 genes expressed only in GM12878) or only in HeLa (25 out of 2576 genes expressed only in HeLa) (Figure 2.16). Biological process GO enrichment shows that in GM12878 cells, these NMD targets are enriched in immune response and cell motility. The enriched molecular function GO category is receptor activity, and the enriched cellular component GO category is plasma membrane. GM12878 cells are B-lymphocytes 138 that participate in innate and adaptive immune response 139, and hence genes related to immune response correspond to the function of the cell. GO enrichment of HeLa-specific NMD targets is less abundant and less specific than that of GM12878. Enriched GO categories of the biological process include system development and organic acid biosynthesis. Enriched GO categories of the molecular function are lipid related, such as lipid binding and phospholipase activity (Figure 2.17).

[Figure panels: GM12878-specific NMD targeted genes: 466 among 9949 genes expressed in GM12878 and HeLa, and 37 among 2202 genes expressed only in GM12878 (p value = 2.6e-12). HeLa-specific NMD targeted genes: 190 among 9949 genes expressed in GM12878 and HeLa, and 25 among 2576 genes expressed only in HeLa (p value = 6.3e-4).]

Figure 2.16. Significant difference between proportions of NMD targets and non-NMD targets expressed and identified in GM12878 and HeLa. P-values are calculated by Fisher’s exact test.
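As a hedged illustration of the test named in the Figure 2.16 legend, the sketch below implements a two-sided Fisher's exact test on a 2x2 table using only the Python standard library. The way the counts are arranged into a table is an assumption: here the 9949, 2202, and 2576 gene counts are treated as totals that include the NMD-targeted genes, which may differ from the table used in the study, so the resulting p-values are not expected to reproduce the published ones exactly.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    The p-value is the sum of hypergeometric probabilities of all tables with
    the same margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x):
        # log-probability that the top-left cell equals x, given fixed margins
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = log_p(a)
    total = sum(exp(log_p(x)) for x in range(lo, hi + 1)
                if log_p(x) <= observed + 1e-7)  # small tolerance for float ties
    return min(1.0, total)


# Assumed layout: rows = gene group, columns = (NMD targeted, not NMD targeted).
p_gm12878 = fisher_exact_two_sided(466, 9949 - 466, 37, 2202 - 37)
p_hela = fisher_exact_two_sided(190, 9949 - 190, 25, 2576 - 25)
```

Working in log space with `lgamma` avoids overflow in the binomial coefficients for gene counts of this size.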

A closer look at the genome browser for cell line-specific NMD targets shows that they are indeed expressed in only one of the two cell lines. Two examples are provided, one for GM12878 and one for HeLa. The GM12878-specific PTC50 isoform includes an alternative exon with PTC50 (GPR132, Figure 2.18), while the two PTC50 isoforms specific to HeLa contain spliced introns in the 3' UTR, which make these isoforms likely substrates for NMD (NPTX1, Figure 2.18). Therefore, even though the cell line-specific NMD targets identified here are very few, different cell lines offer NMD targets that are unique to the studied system.

[Figure panels: (A) GM12878: horizontal bar chart of enriched terms grouped by Biological Process, Molecular Function, and Cellular Component. Top axis: gene count (0 to 25). Bottom axis: enrichment score, -log10(p-value) (0 to 8). Terms marked with an asterisk: immune response; plasma membrane; plasma membrane part. (B) HeLa: top axis: gene count (0 to 12); bottom axis: enrichment score, -log10(p-value) (0 to 4). Biological Process terms: system development; anatomical structure development; organic acid biosynthetic process; carboxylic acid biosynthetic process. Molecular Function terms: lipid binding; phospholipase activity; lipase activity; lysophospholipase activity; acetylgalactosaminyltransferase activity; carboxylic ester hydrolase activity.]

Figure 2.17. Gene Ontology Enrichment Analysis of cell line specific expressed NMD targets. (A) Enrichment of GM12878-specific expressed NMD targets. (B) Enrichment of HeLa-specific expressed NMD targets. Genes expressed in the corresponding cell line were used as the background in the one-sided hypergeometric test. GO terms with an enrichment p-value <0.05 are displayed and those with an FDR <0.05 are marked with red asterisks. Enrichment p-values are shown in red and gene numbers in each term or pathway are displayed in blue.


Figure 2.18. Examples of NMD targets unique to GM12878 or HeLa cells. Sashimi plots depicting RNA sequencing reads and exon junction reads of (A) GPR132, an NMD target unique to GM12878 and (B) NPTX1, an NMD target unique to HeLa. Each sashimi plot is derived either from representative SMG6 KD or control RNA-seq replicates derived from GM12878 or HeLa cells. Isoform models are displayed below the sashimi plots and are colored black for non-PTC50 isoforms and red for PTC50 isoforms. Splicing events differentiating non-PTC50 and PTC50 isoforms are highlighted and enlarged.

Discussion

Review of main findings

This study uses NMD factor knockdown and CHX treatment to inhibit NMD in two different cell lines, with three goals: (1) to establish whether NMD targets identified through NMD factor inhibition overlap with NMD targets obtained upon CHX treatment, (2) to identify a confident set of NMD targets, and (3) to identify and characterize cell line-specific NMD targets. Applying the EJC model to identify PTC50 isoforms in cells with inhibited NMD, we find that approximately 8.5% of all expressed isoforms in GM12878 and HeLa cells contain a PTC50 and are upregulated at least two-fold when NMD is inhibited either via NMD factor knockdown or CHX treatment in either GM12878 or HeLa cells. The overlap of NMD targets between a SMG6 knockdown and cells treated with CHX is approximately 19%, whereas the overlap of NMD targets identified in HeLa cells with SMG6 knocked down and in cells with UPF1 knocked down is nearly 40%. This much stronger overlap suggests that gene knockdown is likely a better method to identify NMD targets. However, we are unable to determine whether this overlap is more accurate or whether it is inflated artefactually by the common control used for both experiments. Nonetheless, upon NMD inhibition by NMD factor knockdown or CHX treatment, we observe dramatic upregulation of PTC50 isoforms; hence we conclude that both methods are capable of identifying targets of NMD. To establish a confident set of NMD targets, we select 313 PTC50 isoforms that have been identified upon SMG6 knockdown and CHX treatment in both HeLa and GM12878 cells. Finally, we discover that cell line-specific NMD isoforms have lower expression when compared to the confident set of NMD targets and are mainly expressed in the cell line in which they are found.

Comparison of CHX treatment and SMG6 knockdown

The NMD pathway degrades transcripts with a PTC. To identify and characterize this hidden transcriptome, the NMD pathway needs to be inhibited. Generally, there are two ways to inhibit the NMD pathway: either via knockdown/knockout of core NMD factors or via exposure to compounds that stall translation, such as CHX. Both methods have advantages and disadvantages. One disadvantage of gene knockdown/knockout is that it requires days, during which the transcriptome might change dramatically. One advantage is that the experiment can be performed in vivo if a conditional knockout is deployed, and thus NMD targets can be identified in organisms/tissues and during development rather than in vitro in selected cell lines. In contrast, chemical exposure is completed within hours and likely does not affect the transcriptome dramatically, but this type of experiment can only be performed in cell culture. We observe a similar number of PTC50 isoforms identified upon SMG6 knockdown and CHX treatment in both HeLa and GM12878. However, the PTC50 isoform overlap between these treatments is small (in the range of 20%). This low overlap becomes much higher (>50%) when we relax the PTC50 upregulation requirement from 2x to 1.2x and above. In the remaining pool of 50% PTC50 isoforms that are specific to a given condition, we observe a large number of isoforms that do not seem real but arise from problems that are inherent to the RNA-seq experiment and the bioinformatics tools. For example, we observe isoforms that span adjacent genes. We also see isoforms that are quantified at very low levels, with too few reads to support their existence, let alone their upregulation. In addition, our study suffers from low sequencing depth, which in general gives a signal but does not allow for detailed isoform analysis. Therefore, in this study, we are unable to conclusively explain why the other 50% of PTC50 isoforms behave differently upon SMG6 KD and CHX treatment.
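The effect of relaxing the upregulation requirement can be illustrated with a small sketch. The overlap definition used here is an assumption, since the text does not spell it out: an isoform is "called" if it reaches the strict cutoff in at least one treatment, and it counts as shared if it reaches the relaxed cutoff in both. The function name and the fold-change values are made up for illustration.

```python
def concordant_fraction(fc_a, fc_b, strict=2.0, relaxed=2.0):
    """Fraction of called isoforms that are upregulated in both treatments.

    fc_a, fc_b: dicts mapping isoform ID -> fold change (NMD inhibited / control)
    for two treatments. Isoforms missing from a dict are treated as fold change 0.
    With relaxed == strict this is the intersection over the union of the two
    target sets.
    """
    ids = set(fc_a) | set(fc_b)
    # Called: reaches the strict cutoff in at least one treatment.
    called = {i for i in ids
              if max(fc_a.get(i, 0.0), fc_b.get(i, 0.0)) >= strict}
    if not called:
        return 0.0
    # Shared: also reaches the relaxed cutoff in the other treatment.
    shared = {i for i in called
              if min(fc_a.get(i, 0.0), fc_b.get(i, 0.0)) >= relaxed}
    return len(shared) / len(called)


# Toy fold changes (hypothetical values).
smg6_kd = {"iso1": 3.0, "iso2": 2.5, "iso3": 1.1, "iso4": 2.2}
chx = {"iso1": 2.4, "iso2": 1.5, "iso3": 2.1, "iso4": 1.3}

strict_overlap = concordant_fraction(smg6_kd, chx, strict=2.0, relaxed=2.0)
relaxed_overlap = concordant_fraction(smg6_kd, chx, strict=2.0, relaxed=1.2)
```

In this toy example only `iso1` passes 2x in both treatments, while `iso2` and `iso4` pass 2x in one treatment and 1.2x in the other, so the overlap rises once the second cutoff is relaxed.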

Comparison of our strategy to that of other publications

In many NMD studies, targets of NMD are identified indirectly as isoforms that increase in abundance after NMD is inhibited. This upregulation is compared to control conditions. Generally, the fold increase of NMD targets ranges from 1.5 to 1.9 as compared to expression in control conditions; for examples see 135,116,117,115,114,113,112,137,140. A different approach is used by 116, who applied a meta-analysis approach in which they combine transcriptome profiling of knockdowns and rescues of three NMD factors, UPF1, SMG6, and SMG7, to identify NMD targets. A direct approach is used by 141, who identify NMD targets bound by SMG6, whereas 104 identify NMD targets bound by UPF1. In this study, we employ an indirect method of NMD target identification. We inhibit core NMD factors, UPF1 and SMG6, and independently expose cells to CHX. We apply all three treatments to HeLa and GM12878 cells to find a confident set of NMD targets that is common to all conditions tested. In addition to this varied experimental approach, we require that NMD targets harbor a premature termination codon located at least 50 to 55 nucleotides upstream of an exon junction complex. We also require that the NMD targeted gene expresses at least one isoform with a normal stop codon that does not increase in abundance when NMD is inhibited. Our method is in line with other approaches since it identifies ~10% of all expressed isoforms as NMD targets (other approaches range from 8% to 20%). However, due to the requirements imposed in addition to NMD target upregulation, NMD targets identified in our study do not overlap very well with NMD targets identified elsewhere.
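The positional requirement described above can be sketched as a simple check in transcript coordinates. This is an illustration under stated assumptions rather than the implementation used in the study: the function name is hypothetical, the distance is measured from the end of the stop codon to the last exon-exon junction, and the cutoff is set to 50 nt; other conventions exist.

```python
def is_ptc50(exon_lengths, stop_end, min_distance=50):
    """Return True if the stop codon lies at least `min_distance` nt upstream
    of the last exon-exon junction.

    exon_lengths: exon lengths in 5'-to-3' transcript order.
    stop_end: number of transcript nucleotides up to and including the last
              base of the stop codon (an assumed coordinate convention).
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no exon-exon junction.
        return False
    # Transcript position of the last junction = total length of all exons
    # except the final one.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_end >= min_distance


# Toy transcript with exons of 100, 200, and 150 nt; the last junction is at 300.
upstream_stop = is_ptc50([100, 200, 150], stop_end=240)   # 60 nt upstream
near_junction = is_ptc50([100, 200, 150], stop_end=260)   # 40 nt upstream
last_exon_stop = is_ptc50([100, 200, 150], stop_end=400)  # stop in the last exon
```

A stop codon in the last exon, or one closer than the cutoff to the last junction, is treated as a normal termination codon under this rule.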

Some of the known NMD targets do not show up in our data (they do not follow the PTC 50nt rule)

We require that all NMD targets identified in this study harbor PTC50. As a consequence, a number of isoforms that have been empirically established to increase when NMD is inhibited, and are therefore branded as NMD targets in many studies, yet do not harbor PTC50, are not among the NMD targets identified in our study. Notable examples are GADD34, CHOP (DDIT3), and ATF4. Our analysis identifies one isoform for GADD34, six isoforms for CHOP, out of which only two are expressed at reasonable levels, and five isoforms for ATF4, out of which three are expressed at reasonable levels and are presented in Figures 2.5 and 2.7. GADD34 seems to exhibit the highest upregulation when NMD is inhibited via CHX and a more muted response when NMD factors are knocked down in HeLa cells. There does not seem to be any significant increase in GM12878 cells when SMG6 is knocked down (Figure 2.7). The two representative isoforms of CHOP show strong upregulation in HeLa cells upon NMD factor knockdown and CHX treatment. In GM12878 cells the upregulation, while less pronounced, is detectable. Finally, the three representative isoforms of ATF4 show mixed behavior. The highest expressed representative isoform shows strong and significant upregulation when NMD is inhibited in GM12878 and HeLa cells under all 5 conditions tested. In contrast, the other two isoforms, which are expressed at much lower levels, seem to decrease in HeLa cells upon SMG6 KD or CHX treatment, and also decrease in GM12878 cells when treated with CHX, but increase in GM12878 upon SMG6 KD. These examples suggest that our method of NMD target identification misses obvious targets of NMD. However, a gene may be upregulated following NMD inhibition either because it is a direct NMD target, or due to secondary effects. Since the abovementioned genes do not harbor a PTC50, they might not be direct targets of NMD. In the course of our analysis, we did not evaluate upregulated isoforms for other known NMD-triggering features 142 such as a long 3′ untranslated region (3′UTR), or an upstream open reading frame (uORF).

Materials and methods

Cell culture

HeLa cells were cultured under standard conditions in DMEM with 0.1 mM non-essential amino acids (NEAA) and 10% fetal bovine serum (FBS). GM12878 cells were procured from NIGMS Human Genetic Cell Repository (Coriell Institute) and were cultured under standard conditions in RPMI 1640 with 15% FBS, NaPyr, and 1% HEPES.

Transfections

Transfection of HeLa cells

The shRNA transfection of HeLa cells was performed using Lipofectamine® LTX with Plus™ Reagent (Invitrogen) following the standard protocol, with shRNA against UPF1, SMG6, or a non-specific target. The shRNA plasmids were a generous gift from Professor Muhlemann 127. The two target sequences of human UPF1 are 5′-GAGAATCGCCTACTTCACT-3′ (pSUPERpuro-hUpf1/I) and 5′-GATGCAGTTCCGCTCCATT-3′ (pSUPERpuro-hUpf1/II), and the two target sequences of human SMG6 are 5′-GGGTCACAGTGCTGAAGTA-3′ (pSUPERpuro-hSmg6/I) and 5′-GCTGCAGGTTACTTACAAG-3′ (pSUPERpuro-hSmg6/II). The non-specific target sequence is 5′-ATTCTCCGAACGTGTCACG-3′ (pSUPERpuro-SCR). Briefly, 7 × 10^5 cells were seeded in a 3.5 cm plate, and on the following day they were transfected with 2 µg of each shRNA against a single target or with 4 µg of non-specific shRNA. After 24 h, cells were selected with 1.5 µg puromycin for 48 hours. Protein extracts and total RNA were prepared after 24 h of further incubation in antibiotic-free media.

Transfection of GM12878 cells

The siRNA transfection of GM12878 cells was performed using two rounds of electroporation with Nucleofector® V-Solution (Lonza) using the program U-009, each round 24 hours apart. The siRNAs were obtained from Ambion; UPF1: s11926, s11927, s11928; SMG6: s23488, s23489, s23490; and Silencer® Select Negative Control No. 1 siRNA. Briefly, 1 × 10^6 cells were seeded in a 3.5 cm plate, and the following day the cells were transfected with a combination of 6 µM siRNA of each silencer to a particular gene (UPF1 or SMG6) or with 18 µM negative control siRNA. After 24 h, cells were transfected again with a combination of 6 µM siRNA of each silencer to a particular gene (UPF1 or SMG6) or with 18 µM negative control siRNA. Protein extracts and total RNA were prepared after 48 h of further incubation.

Cycloheximide (CHX) treatment

Cells were incubated with 300 µg/ml of CHX diluted in 2 ml of cell culture media for three hours. Control samples were treated with 6 µl of DMSO in 2 ml of cell culture media.

RNA extraction

RNA extraction from HeLa cells

Cells were washed with PBS and incubated with 0.5 ml of 0.05% (1X) trypsin solution at 37ºC and 5% CO2 for 5 min. Trypsin was inactivated with serum, and cells were harvested by pelleting. The RNA was harvested using the Qiagen RNeasy Mini Kit following the standard protocol. DNA was digested with TURBO DNase (Ambion). The purity of the RNA was assessed on a Bioanalyzer (Agilent).

RNA extraction from GM12878 cells

Cells were harvested by pelleting, and total RNA was then extracted as for HeLa cells.

Western blot analysis

The cells used for harvesting proteins were derived from the same sample as the cells used for harvesting RNA. Cells were lysed in RIPA buffer. The whole cell lysate was denatured in SDS loading buffer and separated on discontinuous 4%/7.5% SDS-PAGE. Proteins were then transferred to the Immobilon®-FL PVDF transfer membrane (Merck Millipore) and probed with antibodies. As primary antibodies, monoclonal anti-β-actin mouse antibody (AM4302 Ambion) was used at a dilution of 1:10,000, polyclonal anti-UPF1 (Thermo Fisher) was used at a dilution of 1:1000, and mouse polyclonal anti-SMG6 (anti-EST1 ab87539 Abcam) was used at a dilution of 1:1000. For secondary antibodies, we used donkey anti-rabbit IRDye800CW (LiCor), donkey anti-mouse IRDye680RD (LiCor), IRDye800 conjugated anti-mouse IgG (Rockland), and Alexa Fluor 680 donkey anti-goat IgG (Invitrogen) at a dilution of 1:2,500 - 1:5,000. Signals were detected with the LiCor Odyssey.

RT-qPCR

RNA was converted into cDNA with the iScript cDNA Synthesis Kit (BioRad) following the standard protocol. Relative mRNA levels were measured by reverse transcription quantitative polymerase chain reaction (RT-qPCR) using Fast SYBR Green Master Mix (Applied Biosystems) following the standard protocol. The primer sequences are listed below:

Gene            Sequence
SRSF6 c region  5'-CCCCTGTGGTGTGAGGCGCGTGTTC-3'
                5'-CCTTCTCCCGGACGTTGTAGCTCAG-3'
SRSF6 a region  5'-GCTACGGAAGCCGCAGTGGTGGAGG-3'
                5'-ATCTTGCCAACTGCACCGACTAGAA-3'
SRSF6 b region  5'-GCTACGGAAGCCGCATGACCAATGG-3'
                5'-GGCCACAAAACACGCAAGGTAACAG-3'
TBP             5'-TGTTTCTTGGCGTGTGAAGATAACC-3'
                5'-AGAAACCCTTGCGCTGGAACTCGTC-3'
UPF1            5'-GAGAATCGCCTACTTCACT-3'
                5'-GATGCAGTTCCGCTCCATT-3'
SMG6            5'-GGGTCACAGTGCTGAAGTA-3'
                5'-GCTGCAGGTTACTTACAAG-3'

cDNA Library Preparation and RNA Sequencing

The cDNA library preparation and transcriptome sequencing were performed with the help of the Research Technology Support Facility Genomics Core at Michigan State University. The cDNA libraries were prepared from the total RNA samples using the Illumina TruSeq RNA sample prep kit (Illumina). cDNA was size selected for 500-550 bp inserts using Pippin Prep. All samples were pooled and sequenced on 2 lanes as 2 × 250 bp reads using the Rapid Run mode on an Illumina HiSeq 2500.

Customized isoform annotation

Dr. French first mapped the raw reads to the human genome (GRCh37 assembly) using HISAT2 v2.0.5 [143] with the “--rna-strandness RF --dta-cufflinks --sp 1000,1000” options. UCSC gene models [144] were provided to inform the known splice junctions. Alignment statistics are shown in Table 1A and Table 1B. Alignment results (SAM format) were converted to BAM format and then sorted using Samtools v1.2 [145] with the commands “samtools view -bS” and “samtools sort”, respectively. Isoforms for each condition were assembled using Stringtie v1.2.0 with UCSC gene models as guidance. Next, UCSC gene models and all customized gene models were merged using cuffmerge [146] with default parameters. We then removed isoforms with unsupported splice junctions. Specifically, we first merged the sorted alignment results of all experimental conditions using “samtools merge” with default parameters. Next, we assigned each splice junction a Shannon entropy score based on the offset and depth of spliced reads [147]. Isoforms with a low-quality junction (entropy score <1 and not present in the UCSC annotation) were then filtered out. Finally, Dr. French built the customized isoform annotation set for all of our experimental conditions.
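The junction filter described above can be illustrated with a minimal Python sketch. It assumes the entropy is computed over the distribution of spliced-read offsets at a junction, in the spirit of the cited approach; the input format (a list of per-read offsets) and the function names `junction_entropy` and `keep_junction` are hypothetical and are not taken from the pipeline actually used.

```python
import math
from collections import Counter


def junction_entropy(offsets):
    """Shannon entropy (in bits) of spliced-read offsets at one junction.

    offsets: one integer per spliced read giving where the read sits
    relative to the junction (hypothetical input format).
    """
    counts = Counter(offsets)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        p = n / total  # fraction of reads at this offset
        entropy -= p * math.log2(p)
    return entropy


def keep_junction(offsets, annotated, min_entropy=1.0):
    """Keep a junction if it is annotated or its entropy reaches the cutoff.

    Mirrors the rule in the text: a junction is low quality only when its
    entropy score is below 1 AND it is absent from the reference annotation.
    """
    return annotated or junction_entropy(offsets) >= min_entropy
```

A junction supported by many reads that all start at the same offset scores 0 and is dropped unless annotated, whereas reads spread over several offsets raise the score. This is why the score reflects both the depth and the positional diversity of the supporting reads.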

Table 2.4. Summary of RNA-seq data for HeLa.

HeLa
Treatment  Replicate  Run   Number of reads  Overall alignment rate
SCR        1          Run1  2511555          92.38%
SCR        1          Run2  2475531          92.51%
SCR        2          Run1  2587053          89.64%
SCR        2          Run2  2616941          89.48%
SCR        3          Run1  2216023          90.40%
SCR        3          Run2  2239681          90.20%
SCR        4          Run1  2053671          90.06%
SCR        4          Run2  2030422          90.27%
SMG6 KD    1          Run1  2945355          88.25%
SMG6 KD    1          Run2  2914863          88.53%
SMG6 KD    2          Run1  3158579          86.03%
SMG6 KD    2          Run2  3198695          85.73%
SMG6 KD    3          Run1  3184436          84.42%
SMG6 KD    3          Run2  3147247          84.72%
SMG6 KD    4          Run1  1743550          89.02%
SMG6 KD    4          Run2  1729641          89.33%
UPF1 KD    1          Run1  2587004          87.81%
UPF1 KD    1          Run2  2623817          87.63%
UPF1 KD    2          Run1  2181197          90.02%
UPF1 KD    2          Run2  2210661          89.91%
UPF1 KD    3          Run1  2393087          90.55%
UPF1 KD    3          Run2  2359714          90.71%
UPF1 KD    4          Run1  1878700          90.89%
UPF1 KD    4          Run2  1901548          90.74%
DMSO       1          Run1  2785926          89.94%
DMSO       1          Run2  2758295          90.24%
DMSO       2          Run1  3630457          89.41%
DMSO       2          Run2  3588442          89.69%
DMSO       3          Run1  2745474          89.29%
DMSO       3          Run2  2715730          89.57%
DMSO       4          Run1  714905           89.60%
DMSO       4          Run2  710300           89.90%
CHX        1          Run1  2837468          88.11%
CHX        1          Run2  2808581          88.27%
CHX        2          Run1  3142653          93.85%
CHX        2          Run2  3098858          93.96%
CHX        3          Run1  2673321          91.59%
CHX        3          Run2  2647566          91.79%
CHX        4          Run1  2570642          91.83%
CHX        4          Run2  2599495          91.19%

Table 2.5. Summary of RNA-seq data for GM12878.

GM12878
Treatment                Replicate  Run   Number of reads  Overall alignment rate
SCR paired with SMG6 KD  1          Run1  3250244          90.42%
SCR paired with SMG6 KD  1          Run2  3301611          90.31%
SCR paired with SMG6 KD  2          Run1  3177352          90.90%
SCR paired with SMG6 KD  2          Run2  3225135          90.84%
SCR paired with SMG6 KD  3          Run1  2753849          90.39%
SCR paired with SMG6 KD  3          Run2  2713672          90.48%
SMG6 KD                  1          Run1  3261742          88.30%
SMG6 KD                  1          Run2  3308832          88.23%
SMG6 KD                  2          Run1  3872583          90.14%
SMG6 KD                  2          Run2  3799149          90.19%
SMG6 KD                  3          Run1  3601323          91.10%
SMG6 KD                  3          Run2  3653715          91.05%
SCR paired with UPF1 KD  1          Run1  2757517          90.27%
SCR paired with UPF1 KD  1          Run2  2712886          90.36%
SCR paired with UPF1 KD  2          Run1  3090654          91.52%
SCR paired with UPF1 KD  2          Run2  3044170          91.60%
SCR paired with UPF1 KD  3          Run1  2543331          87.66%
SCR paired with UPF1 KD  3          Run2  2578471          87.55%
UPF1 KD                  1          Run1  2885293          87.44%
UPF1 KD                  1          Run2  2837685          87.53%
UPF1 KD                  2          Run1  3567143          90.76%
UPF1 KD                  2          Run2  3623950          90.67%
UPF1 KD                  3          Run1  3076742          88.45%
UPF1 KD                  3          Run2  3127115          88.39%
DMSO                     1          Run1  2884706          91.91%
DMSO                     1          Run2  2848129          92.07%
DMSO                     2          Run1  3238591          91.40%
DMSO                     2          Run2  3182258          91.48%
DMSO                     3          Run1  8710310          92.29%
DMSO                     3          Run2  8572653          92.38%
CHX                      1          Run1  1473743          90.93%
CHX                      1          Run2  1494456          90.74%
CHX                      2          Run1  2395242          62.96%
CHX                      2          Run2  2363591          63.22%
CHX                      3          Run1  2048370          91.43%
CHX                      3          Run2  2024038          91.54%

Annotation of PTC50 isoforms

Dr. French and Dr. Hu detected the coding sequence (CDS) for each isoform using a strategy described by Hansen et al. [148]. A premature stop codon (PTC50nt) is defined as a stop codon located at least 50 nucleotides upstream of the last exon-exon junction of the transcript. Isoforms with premature stop codons were defined as PTC50 isoforms. Gene expression, isoform expression, and isoform read counts were calculated using RSEM v1.3.0 [149]. Specifically, Dr. French first built the bowtie2 [150] alignment database for the customized transcriptome sequences, which were constructed from the customized isoform annotation and GRCh37 sequences. Next, we ran RSEM for all experimental conditions with the “--paired-end --bowtie2 --no-bam-output --estimate-rspd” options. Gene and isoform expression were measured in FPKM (Fragments Per Kilobase per Million reads) and then normalized across all experimental conditions and replicates using quantile normalization. The expected read counts that RSEM output for each isoform were used for subsequent differential expression analysis.
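The 50-nucleotide rule defined above can be written as a short Python sketch. The coordinate convention is an assumption made for illustration: positions are 1-based transcript coordinates, the stop codon is located by its last nucleotide, and distance is measured to the last nucleotide of the penultimate exon. The function name `is_ptc50` and its arguments are hypothetical and do not come from the annotation code used in this work.

```python
def is_ptc50(exon_lengths, stop_codon_end, min_distance=50):
    """Return True if the stop codon lies at least `min_distance` nt
    upstream of the last exon-exon junction.

    exon_lengths: exon lengths in 5'->3' transcript order.
    stop_codon_end: 1-based transcript coordinate of the last nucleotide
        of the stop codon (assumed convention).
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no exon-exon junction.
        return False
    # Transcript coordinate of the last nucleotide before the final junction.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_codon_end >= min_distance
```

For a transcript with exons of 100, 200, and 150 nt, the last junction follows position 300. A stop codon ending at position 200 is 100 nt upstream and qualifies, while one ending at position 260 is only 40 nt upstream and does not.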

Identification of NMD targets

Isoforms differentially expressed between NMD inhibition and normal conditions in HeLa or GM12878 cells were detected using edgeR [151] with treatment (CHX or SMG6 knockdown) as a covariate. In detail, we first removed isoforms with low expression, using a cutoff of FPKM 0.5 and a sample-count threshold equal to the number of NMD inhibition conditions (8 for HeLa and 6 for GM12878). The raw read counts were then normalized with the calcNormFactors function in edgeR. Model parameters were estimated with the estimateGLMCommonDisp, estimateGLMTrendedDisp, and estimateGLMTagwiseDisp functions, and p-values were calculated with the glmFit and glmLRT functions. NMD targets were defined as PTC50 isoforms meeting the following criteria: 1) p-value <0.05; 2) expression ratio >2; and 3) an expression ratio at least two-fold higher than that of any non-PTC50 isoform of the same gene, or an expression ratio below 1.2 for any non-PTC50 isoform of the same gene. This section was performed by Dr. French and Dr. Hu.
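The three criteria above can be sketched as a single Python predicate. This is an illustrative reading, not the code used in the analysis. In particular, it assumes that "any non-PTC50 isoform" in criterion 3 means every non-PTC50 isoform of the gene, and that a gene with no non-PTC50 isoform passes criterion 3 trivially. The function name `is_nmd_target` and its parameters are hypothetical.

```python
def is_nmd_target(p_value, ptc_ratio, non_ptc_ratios,
                  p_cutoff=0.05, min_ratio=2.0,
                  fold_margin=2.0, flat_cutoff=1.2):
    """Apply the three NMD-target criteria to one PTC50 isoform.

    p_value: differential-expression p-value for the PTC50 isoform.
    ptc_ratio: its expression ratio (NMD inhibition / normal condition).
    non_ptc_ratios: expression ratios of the non-PTC50 isoforms of the
        same gene (may be empty).
    """
    # Criteria 1 and 2: significant and more than two-fold up.
    if p_value >= p_cutoff or ptc_ratio <= min_ratio:
        return False
    # Criterion 3a (assumed reading): the PTC50 isoform's ratio is at least
    # two-fold higher than the ratio of every non-PTC50 isoform.
    outpaces = all(ptc_ratio >= fold_margin * r for r in non_ptc_ratios)
    # Criterion 3b (assumed reading): every non-PTC50 isoform is
    # essentially unchanged (ratio below 1.2).
    flat = all(r < flat_cutoff for r in non_ptc_ratios)
    # With an empty list both checks are vacuously true (assumption).
    return outpaces or flat
```

Criterion 3 guards against calling a PTC50 isoform an NMD target when the whole gene is simply upregulated by the treatment: the increase has to be specific to the PTC50 isoform, either relative to the other isoforms or because the other isoforms barely change.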

Enrichment analysis for genes with NMD targets

To assess the enrichment of Gene Ontology (GO) terms, we used DAVID 6.8 [152] (https://david.ncifcrf.gov/, accessed on or after September 18, 2017) and GOrilla [153,154] (http://cbl-gorilla.cs.technion.ac.il/, accessed on or after January 18, 2017). We used the mode that takes two unranked lists of genes (a target list and a background list). For each treatment, the target list was the set of confident NMD targets and the background list was the set of all expressed genes with an FPKM over 1. The P-value threshold was pre-set at 10⁻³. This section was performed by Dr. Hu and me.
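Target-versus-background enrichment of this kind is commonly scored with a hypergeometric tail probability. The Python sketch below shows that calculation for a single GO term, as an illustration only. It is not the exact statistic reported by DAVID or GOrilla, and the function name `hypergeom_enrichment_p` and its argument names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(N, B, n, b):
    """Probability of seeing at least b annotated genes in the target list.

    N: number of genes in the background list.
    B: number of background genes annotated with the GO term.
    n: number of genes in the target list (a subset of the background).
    b: number of target genes annotated with the GO term.
    """
    upper = min(n, B)
    total = comb(N, n)  # ways to choose the target list from the background
    # Sum P(X = k) for k = b .. upper under sampling without replacement.
    tail = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return tail / total
```

A term would then be called enriched when this probability falls below the chosen threshold (10⁻³ in the analysis above). In practice the values reported by the tools themselves should be used, because they apply their own corrections.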

Chapter 3 References

1. Gerstein, M. B. et al. Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).
2. Sveen, A., Kilpinen, S., Ruusulehto, A., Lothe, R. A. & Skotheim, R. I. Aberrant RNA splicing in cancer; expression changes and driver mutations of splicing factor genes. Oncogene (2015). doi:10.1038/onc.2015.318
3. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
4. Johnson, J. M. et al. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302, 2141–2144 (2003).
5. International Human Genome Sequencing Consortium et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
6. Nilsen, T. W. & Graveley, B. R. Expansion of the eukaryotic proteome by alternative splicing. Nature 463, 457–463 (2010).
7. Cieply, B. & Carstens, R. P. Functional roles of alternative splicing factors in human disease. Wiley Interdiscip. Rev. RNA 6, 311–326 (2015).
8. Pimentel, H. et al. A Dynamic Alternative Splicing Program Regulates Gene Expression In A Differentiation Stage-Specific Manner During Terminal Erythropoiesis. Blood 122, 3413–3413 (2013).
9. Li, Q. et al. The splicing regulator PTBP2 controls a program of embryonic splicing required for neuronal maturation. eLife 3, (2014).
10. Xue, Y. et al. Direct Conversion of Fibroblasts to Neurons by Reprogramming PTB-Regulated microRNA Circuits. Cell 152, 82–96 (2013).
11. Lee, K.-S. et al. RNA-binding protein Muscleblind-like 3 (MBNL3) disrupts myocyte enhancer factor 2 (Mef2) β-exon splicing. J. Biol. Chem. 285, 33779–33787 (2010).
12. Giampietro, C. et al. The alternative splicing factor Nova2 regulates vascular development and lumen formation. Nat. Commun. 6, 8479 (2015).
13. Yabas, M., Elliott, H. & Hoyne, G. F. The Role of Alternative Splicing in the Control of Immune Homeostasis and Cellular Differentiation. Int. J. Mol. Sci. 17, (2015).
14. Tazi, J., Bakkour, N. & Stamm, S. Alternative splicing and disease. Biochim. Biophys. Acta BBA - Mol. Basis Dis. 1792, 14–26 (2009).
15. Faustino, N. A. & Cooper, T. A. Pre-mRNA splicing and human disease. Genes Dev. 17, 419–437 (2003).
16. Singh, B. & Eyras, E. The role of alternative splicing in cancer. Transcription 8, 91–98 (2016).
17. Jensen, K. B. et al. Nova-1 regulates neuron-specific alternative splicing and is essential for neuronal viability. Neuron 25, 359–371 (2000).
18. Yang, Y. Y. L., Yin, G. L. & Darnell, R. B. The neuronal RNA-binding protein Nova-2 is implicated as the autoantigen targeted in POMA patients with dementia. Proc. Natl. Acad. Sci. U. S. A. 95, 13254–13259 (1998).
19. Cook, C., Zhang, Y.-J., Xu, Y., Dickson, D. W. & Petrucelli, L. TDP-43 in Neurodegenerative Disorders. Expert Opin. Biol. Ther. 8, 969–978 (2008).
20. Pesiridis, G. S., Lee, V. M.-Y. & Trojanowski, J. Q. Mutations in TDP-43 link glycine-rich domain functions to amyotrophic lateral sclerosis. Hum. Mol. Genet. 18, R156–R162 (2009).
21. Mitchell, P. J. & Tjian, R. Transcriptional Regulation in Mammalian Cells by Sequence-Specific DNA Binding Proteins. Science 245, 371–378 (1989).
22. Genomic footprinting and sequencing of human beta-globin locus. Tissue specificity and cell line artifact. - PubMed - NCBI. Available at: https://www.ncbi.nlm.nih.gov/pubmed?term=(((Reddy%5BAuthor%20-%20First%5D)%20AND%20(%221994%22%5BDate%20-%20Completion%5D%20%3A%20%221994%22%5BDate%20-%20Completion%5D)))%20AND%20globin%5BTitle%2FAbstract%5D. (Accessed: 22nd November 2017)
23. Struhl, K. Fundamentally different logic of gene regulation in eukaryotes and prokaryotes. Cell 98, 1–4 (1999).
24. Yamamoto, K. R. Steroid receptor regulated transcription of specific genes and gene networks. Annu. Rev. Genet. 19, 209–252 (1985).
25. Jangi, M. & Sharp, P. A. Building Robust Transcriptomes with Master Splicing Factors. Cell 159, 487–498 (2014).
26. Lareau, L. F., Inada, M., Green, R. E., Wengrod, J. C. & Brenner, S. E. Unproductive splicing of SR genes associated with highly conserved and ultraconserved DNA elements. Nature 446, 926–929 (2007).
27. Soergel, D. A. W., Lareau, L. F. & Brenner, S. E. Regulation of Gene Expression by Coupling of Alternative Splicing and NMD. (Landes Bioscience, 2013).
28. Lewis, B. P., Green, R. E. & Brenner, S. E. Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proc. Natl. Acad. Sci. U. S. A. 100, 189–192 (2003).
29. Sun, S., Zhang, Z., Sinha, R., Karni, R. & Krainer, A. R. SF2/ASF autoregulation involves multiple layers of post-transcriptional and translational control. Nat. Struct. Mol. Biol. 17, 306–312 (2010).
30. Sureau, A., Gattoni, R., Dooghe, Y., Stévenin, J. & Soret, J. SC35 autoregulates its expression by promoting splicing events that destabilize its mRNAs. EMBO J. 20, 1785–1796 (2001).
31. Rossbach, O. et al. Auto- and cross-regulation of the hnRNP L proteins by alternative splicing. Mol. Cell. Biol. 29, 1442–1451 (2009).
32. Spellman, R., Llorian, M. & Smith, C. W. J. Crossregulation and functional redundancy between the splicing regulator PTB and its paralogs nPTB and ROD1. Mol. Cell 27, 420–434 (2007).
33. Wollerton, M. C., Gooding, C., Wagner, E. J., Garcia-Blanco, M. A. & Smith, C. W. J. Autoregulation of Polypyrimidine Tract Binding Protein by Alternative Splicing Leading to Nonsense-Mediated Decay. Mol. Cell 13, 91–100 (2004).
34. Jangi, M., Boutz, P. L., Paul, P. & Sharp, P. A. Rbfox2 controls autoregulation in RNA-binding protein networks. Genes Dev. 28, 637–651 (2014).
35. Darnell, R. B. HITS-CLIP: panoramic views of protein-RNA regulation in living cells. Wiley Interdiscip. Rev. RNA 1, 266–286 (2010).
36. Spitzer, J. et al. PAR-CLIP (Photoactivatable Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation): a step-by-step protocol to the transcriptome-wide identification of binding sites of RNA-binding proteins. Methods Enzymol. 539, 113–161 (2014).
37. Huppertz, I. et al. iCLIP: protein-RNA interactions at nucleotide resolution. Methods San Diego Calif 65, 274–287 (2014).
38. Van Nostrand, E. L. et al. Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nat. Methods 13, 508–514 (2016).
39. Dredge, B. K. & Jensen, K. B. NeuN/Rbfox3 Nuclear and Cytoplasmic Isoforms Differentially Regulate Alternative Splicing and Nonsense-Mediated Decay of Rbfox2. PLoS ONE 6, e21585 (2011).
40. Jia, R. et al. HnRNP L is important for the expression of oncogene SRSF3 and oncogenic potential of oral squamous cell carcinoma cells. Sci. Rep. 6, 35976 (2016).
41. Jumaa, H. & Nielsen, P. J. The splicing factor SRp20 modifies splicing of its own mRNA and ASF/SF2 antagonizes this regulation. EMBO J. 16, 5077–5085 (1997).
42. Guo, J., Jia, J. & Jia, R. PTBP1 and PTBP2 impaired autoregulation of SRSF3 in cancer cells. Sci. Rep. 5, 14548 (2015).
43. Lebedeva, S. et al. Transcriptome-wide analysis of regulatory interactions of the RNA-binding protein HuR. Mol. Cell 43, 340–352 (2011).
44. Änkö, M.-L. et al. The RNA-binding landscapes of two SR proteins reveal unique functions and binding to diverse RNA classes. Genome Biol. 13, R17 (2012).
45. Consortium, T. E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
46. Sloan, C. A. et al. ENCODE data at the ENCODE portal. Nucleic Acids Res. 44, D726–732 (2016).
47. Chabot, B., Blanchette, M., Lapierre, I. & Branche, H. L. An intron element modulating 5’ splice site selection in the hnRNP A1 pre-mRNA interacts with hnRNP A1. Mol. Cell. Biol. 17, 1776–1786 (1997).
48. Blanchette, M. & Chabot, B. Modulation of exon skipping by high-affinity hnRNP A1-binding sites and by intron elements that repress splice site utilization. EMBO J. 18, 1939–1952 (1999).
49. Bruun, G. H. et al. Global identification of hnRNP A1 binding sites for SSO-based splicing modulation. BMC Biol. 14, (2016).
50. Huelga, S. C. et al. Integrative genome-wide analysis reveals cooperative regulation of alternative splicing by hnRNP proteins. Cell Rep. 1, 167–178 (2012).
51. A Hierarchical Network of Transcription Factors Governs Androgen Receptor-Dependent Prostate Cancer Growth. Available at: http://www.sciencedirect.com/science/article/pii/S1097276507003735. (Accessed: 20th November 2017)
52. Yu, H. & Gerstein, M. Genomic analysis of the hierarchical structure of regulatory networks. Proc. Natl. Acad. Sci. 103, 14724–14731 (2006).
53. Wang, Q. et al. A Hierarchical Network of Transcription Factors Governs Androgen Receptor-Dependent Prostate Cancer Growth. Mol. Cell 27, 380–392 (2007).
54. Fic, W., Juge, F., Soret, J. & Tazi, J. Eye development under the control of SRp55/B52-mediated alternative splicing of eyeless. PloS One 2, e253 (2007).
55. Long, J. C. & Caceres, J. F. The SR protein family of splicing factors: master regulators of gene expression. Biochem. J. 417, 15 (2009).
56. Chen, M. & Manley, J. L. Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat. Rev. Mol. Cell Biol. 10, 741–754 (2009).
57. Lee, Y. & Rio, D. C. Mechanisms and Regulation of Alternative Pre-mRNA Splicing. Annu. Rev. Biochem. 84, 291–323 (2015).
58. Li, X., Song, J. & Yi, C. Genome-wide Mapping of Cellular Protein–RNA Interactions Enabled by Chemical Crosslinking. Genomics Proteomics Bioinformatics 12, 72–78 (2014).
59. Stoilov, P., Daoud, R., Nayler, O. & Stamm, S. Human tra2-beta1 autoregulates its protein concentration by influencing alternative splicing of its pre-mRNA. Hum. Mol. Genet. 13, 509–524 (2004).
60. Nakaya, T., Alexiou, P., Maragkakis, M., Chang, A. & Mourelatos, Z. FUS regulates genes coding for RNA-binding proteins in neurons by binding to their highly conserved introns. RNA N. Y. N 19, 498–509 (2013).
61. Bonomi, S. et al. HnRNP A1 controls a splicing regulatory circuit promoting mesenchymal-to-epithelial transition. Nucleic Acids Res. 41, 8665–8679 (2013).
62. Valacca, C. et al. Sam68 regulates EMT through alternative splicing-activated nonsense-mediated mRNA decay of the SF2/ASF proto-oncogene. J. Cell Biol. 191, 87–99 (2010).
63. McGlincy, N. J. & Smith, C. W. J. Alternative splicing resulting in nonsense-mediated mRNA decay: what is the meaning of nonsense? Trends Biochem. Sci. 33, 385–393 (2008).
64. Zhou, Y., Liu, S., Liu, G., Oztürk, A. & Hicks, G. G. ALS-associated FUS mutations result in compromised FUS alternative splicing and autoregulation. PLoS Genet. 9, e1003895 (2013).
65. Sun, Y. et al. Autoregulation of RBM10 and cross-regulation of RBM10/RBM5 via alternative splicing-coupled nonsense-mediated decay. Nucleic Acids Res. (2017). doi:10.1093/nar/gkx508
66. Sanford, J. R. et al. Splicing factor SFRS1 recognizes a functionally diverse landscape of RNA transcripts. Genome Res. 19, 381–394 (2009).
67. Bruun, G. H. et al. Global identification of hnRNP A1 binding sites for SSO-based splicing modulation. BMC Biol. 14, (2016).
68. Zarnack, K. et al. Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell 152, 453–466 (2013).
69. Konig, J. et al. iCLIP - Transcriptome-wide Mapping of Protein-RNA Interactions with Individual Nucleotide Resolution. J. Vis. Exp. JoVE (2011). doi:10.3791/2638
70. Shankarling, G., Cole, B. S., Mallory, M. J. & Lynch, K. W. Transcriptome-wide RNA interaction profiling reveals physical and functional targets of hnRNP L in human T cells. Mol. Cell. Biol. 34, 71–83 (2014).
71. Hoell, J. I. et al. RNA targets of wild-type and mutant FET family proteins. Nat. Struct. Mol. Biol. 18, 1428–1431 (2011).
72. Yeo, G. W. et al. An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat. Struct. Mol. Biol. 16, 130–137 (2009).
73. Uniacke, J. et al. An oxygen-regulated switch in the protein synthesis machinery. Nature 486, 126–129 (2012).
74. Corioni, M., Antih, N., Tanackovic, G., Zavolan, M. & Krämer, A. Analysis of in situ pre-mRNA targets of human splicing factor SF1 reveals a function in alternative splicing. Nucleic Acids Res. 39, 1868–1879 (2011).
75. Wang, Z. et al. iCLIP Predicts the Dual Splicing Effects of TIA-RNA Interactions. PLOS Biol. 8, e1000530 (2010).
76. Hafner, M. et al. Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 141, 129–141 (2010).
77. Ascano, M. et al. FMRP targets distinct mRNA sequence elements to regulate protein expression. Nature 492, 382–386 (2012).
78. Kishore, S. et al. A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nat. Methods 8, 559–564 (2011).
79. Ping, X.-L. et al. Mammalian WTAP is a regulatory subunit of the RNA N6-methyladenosine methyltransferase. Cell Res. 24, 177–189 (2014).
80. Pandit, S. et al. Genome-wide Analysis Reveals SR Protein Cooperation and Competition in Regulated Splicing. Mol. Cell 50, 223–235 (2013).
81. Änkö, M.-L. et al. The RNA-binding landscapes of two SR proteins reveal unique functions and binding to diverse RNA classes. Genome Biol. 13, R17 (2012).
82. Wang, E. T. et al. Transcriptome-wide regulation of pre-mRNA splicing and mRNA localization by muscleblind proteins. Cell 150, 710–724 (2012).
83. Zhang, C. et al. Integrative modeling defines the Nova splicing-regulatory network and its combinatorial controls. Science 329, 439–443 (2010).
84. Zagore, L. L. et al. RNA Binding Protein Ptbp2 Is Essential for Male Germ Cell Development. Mol. Cell. Biol. 35, 4030–4042 (2015).
85. Licatalosi, D. D. et al. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature 456, 464–469 (2008).
86. Charizanis, K. et al. Muscleblind-like 2-mediated alternative splicing in the developing brain and dysregulation in myotonic dystrophy. Neuron 75, 437–450 (2012).
87. Polymenidou, M. et al. Long pre-mRNA depletion and RNA missplicing contribute to neuronal vulnerability from loss of TDP-43. Nat. Neurosci. 14, 459–468 (2011).
88. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
89. Shannon, P. et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 13, 2498–2504 (2003).
90. Karousis, E. D., Nasif, S. & Mühlemann, O. Nonsense-mediated mRNA decay: novel mechanistic insights and biological impact. Wiley Interdiscip. Rev. RNA (2016). doi:10.1002/wrna.1357
91. Carter, M. S. et al. A regulatory mechanism that detects premature nonsense codons in T-cell receptor transcripts in vivo is reversed by protein synthesis inhibitors in vitro. J. Biol. Chem. 270, 28995–29003 (1995).
92. Chamieh, H., Ballut, L., Bonneau, F. & Le Hir, H. NMD factors UPF2 and UPF3 bridge UPF1 to the exon junction complex and stimulate its RNA helicase activity. Nat. Struct. Mol. Biol. 15, 85–93 (2008).
93. Yamashita, A., Kashima, I. & Ohno, S. The role of SMG-1 in nonsense-mediated mRNA decay. Biochim. Biophys. Acta 1754, 305–315 (2005).
94. Yamashita, A. et al. SMG-8 and SMG-9, two novel subunits of the SMG-1 complex, regulate remodeling of the mRNA surveillance complex during nonsense-mediated mRNA decay. Genes Dev. 23, 1091–1105 (2009).
95. Ohnishi, T. et al. Phosphorylation of hUPF1 induces formation of mRNA surveillance complexes containing hSMG-5 and hSMG-7. Mol. Cell 12, 1187–1200 (2003).
96. Eberle, A. B., Lykke-Andersen, S., Mühlemann, O. & Jensen, T. H. SMG6 promotes endonucleolytic cleavage of nonsense mRNA in human cells. Nat. Struct. Mol. Biol. 16, 49–55 (2008).
97. Huntzinger, E., Kashima, I., Fauser, M., Saulière, J. & Izaurralde, E. SMG6 is the catalytic endonuclease that cleaves mRNAs containing nonsense codons in metazoan. RNA N. Y. N 14, 2609–2617 (2008).
98. Hug, N., Longman, D. & Cáceres, J. F. Mechanism and regulation of the nonsense-mediated decay pathway. Nucleic Acids Res. 44, 1483–1495 (2016).
99. Wengrod, J. et al. Inhibition of Nonsense-Mediated RNA Decay Activates Autophagy. Mol. Cell. Biol. 33, 2128–2135 (2013).
100. Isken, O. & Maquat, L. E. The multiple lives of NMD factors: balancing roles in gene and genome regulation. Nat. Rev. Genet. 9, 699–712 (2008).
101. Mocellin, S. & Provenzano, M. RNA interference: learning gene knock-down from cell physiology. J. Transl. Med. 2, 39 (2004).
102. Boehm, V., Haberman, N., Ottens, F., Ule, J. & Gehring, N. H. 3′ UTR Length and Messenger Ribonucleoprotein Composition Determine Endocleavage Efficiencies at Termination Codons. Cell Rep. 9, 555–568 (2014).
103. Hogg, J. R. & Goff, S. P. Upf1 senses 3’UTR length to potentiate mRNA decay. Cell 143, 379–389 (2010).
104. Hurt, J. A., Robertson, A. D. & Burge, C. B. Global analyses of UPF1 binding and function reveal expanded scope of nonsense-mediated mRNA decay. Genome Res. 23, 1636–1650 (2013).
105. Lindeboom, R. G. H., Supek, F. & Lehner, B. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat. Genet. 48, ng.3664 (2016).
106. Weischenfeldt, J. et al. NMD is essential for hematopoietic stem and progenitor cells and for eliminating by-products of programmed DNA rearrangements. Genes Dev. 22, 1381–1396 (2008).
107. Nasif, S., Contu, L. & Mühlemann, O. Beyond quality control: The role of nonsense-mediated mRNA decay (NMD) in regulating gene expression. Semin. Cell Dev. Biol. (2017). doi:10.1016/j.semcdb.2017.08.053
108. Linde, L., Boelz, S., Neu-Yilik, G., Kulozik, A. E. & Kerem, B. The efficiency of nonsense-mediated mRNA decay is an inherent character and varies among different cells. Eur. J. Hum. Genet. EJHG 15, 1156–1162 (2007).
109. Zetoune, A. B. et al. Comparison of nonsense-mediated mRNA decay efficiency in various murine tissues. BMC Genet. 9, 83 (2008).
110. Lou, C.-H. et al. Nonsense-Mediated RNA Decay Influences Human Embryonic Stem Cell Fate. Stem Cell Rep. 6, 844–857 (2016).
111. Frattini, A. et al. High variability of genomic instability and gene expression profiling in different HeLa clones. Sci. Rep. 5, 15377 (2015).
112. Nikcevic, G., Kovacevic-Grujicic, N. & Stevanovic, M. Improved transfection efficiency of cultured human cells. Cell Biol. Int. 27, 735–737 (2003).
113. Mittelman, D. & Wilson, J. H. The fractured genome of HeLa cells. Genome Biol. 14, 111 (2013).
114. McGlincy, N. J. et al. Expression proteomics of UPF1 knockdown in HeLa cells reveals autoregulation of hnRNP A2/B1 mediated by alternative splicing resulting in nonsense-mediated mRNA decay. BMC Genomics 11, 565 (2010).
115. Tani, H. et al. Identification of hundreds of novel UPF1 target transcripts by direct determination of whole transcriptome stability. Rna Biol. 9, 1370–1379 (2012).
116. Colombo, M., Karousis, E. D., Bourquin, J., Bruggmann, R. & Mühlemann, O. Transcriptome-wide identification of NMD-targeted human mRNAs reveals extensive redundancy between SMG6- and SMG7-mediated degradation pathways. RNA 23, 189–201 (2017).
117. Rufener, S. C. & Mühlemann, O. eIF4E-bound mRNPs are substrates for nonsense-mediated mRNA decay in mammalian cells. Nat. Struct. Mol. Biol. 20, nsmb.2576 (2013).
118. Feng, Q. et al. A feedback loop between nonsense-mediated decay and the retrogene DUX4 in facioscapulohumeral muscular dystrophy. eLife 4, (2015).
119. Sealey, D. C. F., Kostic, A. D., LeBel, C., Pryde, F. & Harrington, L. The TPR-containing domain within Est1 homologs exhibits species-specific roles in telomerase interaction and telomere length homeostasis. BMC Mol. Biol. 12, 45 (2011).
120. Carter, M. S. et al. A regulatory mechanism that detects premature nonsense codons in T-cell receptor transcripts in vivo is reversed by protein synthesis inhibitors in vitro. J. Biol. Chem. 270, 28995–29003 (1995).
121. Thermann, R. et al. Binary specification of nonsense codons by splicing and cytoplasmic translation. EMBO J. 17, 3484–3494 (1998).
122. Ishigaki, Y., Li, X., Serin, G. & Maquat, L. E. Evidence for a pioneer round of mRNA translation: mRNAs subject to nonsense-mediated decay in mammalian cells are bound by CBP80 and CBP20. Cell 106, 607–617 (2001).
123. Durand, S. et al. Inhibition of nonsense-mediated mRNA decay (NMD) by a new chemical molecule reveals the dynamic of NMD factors in P-bodies. J. Cell Biol. 178, 1145–1160 (2007).
124. Dang, Y. et al. Inhibition of Nonsense-mediated mRNA Decay by the Natural Product Pateamine A through Eukaryotic Initiation Factor 4AIII. J. Biol. Chem. 284, 23613–23621 (2009).
125. Carter, M. S. et al. A regulatory mechanism that detects premature nonsense codons in T-cell receptor transcripts in vivo is reversed by protein synthesis inhibitors in vitro. J. Biol. Chem. 270, 28995–29003 (1995).
126. Noensie, E. N. & Dietz, H. C. A strategy for disease gene identification through nonsense-mediated mRNA decay inhibition. Nat. Biotechnol. 19, 434–439 (2001).
127. Paillusson, A., Hirschi, N., Vallan, C., Azzalin, C. M. & Mühlemann, O. A GFP-based reporter system to monitor nonsense-mediated mRNA decay. Nucleic Acids Res. 33, e54 (2005).
128. Muller, R. Y., Hammond, M. C., Rio, D. C. & Lee, Y. J. An Efficient Method for Electroporation of Small Interfering RNAs into ENCODE Project Tier 1 GM12878 and K562 Cell Lines. J. Biomol. Tech. JBT 26, 142–149 (2015).
129. Le Hir, H., Gatfield, D., Izaurralde, E. & Moore, M. J. The exon–exon junction complex provides a binding platform for factors involved in mRNA export and nonsense-mediated mRNA decay. EMBO J. 20, 4987–4997 (2001).
130. Silva, A. L., Ribeiro, P., Inácio, Â., Liebhaber, S. A. & Romão, L. Proximity of the poly(A)-binding protein to a premature termination codon inhibits mammalian nonsense-mediated mRNA decay. RNA 14, 563–576 (2008).
131. Johnson, J. K., Waddell, N., kConFab Investigators & Chenevix-Trench, G. The application of nonsense-mediated mRNA decay inhibition to the identification of breast cancer susceptibility genes. BMC Cancer 12, 246 (2012).
132. Popp, M. W.-L. & Maquat, L. E. Organizing Principles of Mammalian Nonsense-Mediated mRNA Decay. Annu. Rev. Genet. 47, 139–165 (2013).
133. Wang, D. et al. Inhibition of Nonsense-Mediated RNA Decay by the Tumor Microenvironment Promotes Tumorigenesis. Mol. Cell. Biol. 31, 3670–80 (2011).
134. Martin, L. et al. Identification and Characterization of Small Molecules That Inhibit Nonsense-Mediated RNA Decay and Suppress Nonsense p53 Mutations. Cancer Res. 74, 3104–3113 (2014).
135. Yepiskoposyan, H., Aeschimann, F., Nilsson, D., Okoniewski, M. & Mühlemann, O. Autoregulation of the nonsense-mediated mRNA decay pathway in human cells. RNA N. Y. N 17, 2108–2118 (2011).
136. Rehwinkel, J., Letunic, I., Raes, J., Bork, P. & Izaurralde, E. Nonsense-mediated mRNA decay factors act in concert to regulate common mRNA targets. RNA 11, 1530–1544 (2005).
137. Mendell, J. T., Sharifi, N. A., Meyers, J. L., Martinez-Murillo, F. & Dietz, H. C. Nonsense surveillance regulates expression of diverse classes of mammalian transcripts and mutes genomic noise. Nat. Genet. 36, 1073–1078 (2004).
138. GM12878.
139. LeBien, T. W. & Tedder, T. F. B lymphocytes: how they develop and function. Blood 112, 1570–1580 (2008).
140. Rehwinkel, J., Raes, J. & Izaurralde, E. Nonsense-mediated mRNA decay: target genes and functional diversification of effectors. Trends Biochem. Sci. 31, 639–646 (2006).
141. Schmidt, S. A. et al. Identification of SMG6 cleavage sites and a preferred RNA cleavage motif by global analysis of endogenous NMD targets in human cells. Nucleic Acids Res. 43, 309–323 (2015).
142. Schweingruber, C., Rufener, S. C., Zünd, D., Yamashita, A. & Mühlemann, O. Nonsense-mediated mRNA decay — Mechanisms of substrate mRNA recognition and degradation in mammalian cells. Biochim. Biophys. Acta BBA - Gene Regul. Mech. doi:10.1016/j.bbagrm.2013.02.005
143. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
144. Tyner, C. et al. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 45, D626–D634 (2017).
145. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinforma. Oxf. Engl. 25, 2078–2079 (2009).
146. Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53 (2013).
147. Brooks, A. N. et al. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Res. 21, 193–202 (2011).
148. Hansen, K. D. et al. Genome-wide identification of alternative splice forms down-regulated by nonsense-mediated mRNA decay in Drosophila. PLoS Genet. 5, e1000525 (2009).
149. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
150. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
151. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinforma. Oxf. Engl. 26, 139–140 (2010).
152. Dennis, G. et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4, R60 (2003).
153. Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009).
154. Eden, E., Lipson, D., Yogev, S. & Yakhini, Z. Discovering Motifs in Ranked Lists of DNA Sequences. PLOS Comput. Biol. 3, e39 (2007).
