Identification of Novel Branch Points Reveals Insights into RNA Processing

by

Genevieve Michelle Gould

B.A. Molecular and Cell Biology with an emphasis in Genetics, Genomics, and Development University of California, Berkeley (2009)

Submitted to the Department of Biology in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Signature of Author ...... Department of Biology August 31, 2015

Certified by ...... Christopher B. Burge Professor of Biology Thesis Supervisor

Accepted by...... Michael Hemann Associate Professor of Biology Co-Chair, Biology Graduate Committee


Identification of Novel Branch Points Reveals Insights into RNA Processing

by

Genevieve Michelle Gould

Submitted to the Department of Biology on August 31, 2015 in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Biology

Abstract

Pre-mRNA splicing is a ubiquitous process necessary for the production of functional eukaryotic mRNAs. The branch point (BP) sequence is one of three key nucleotide sequences required for pre-mRNA splicing; however, in metazoa it has been less comprehensively studied than the 5' splice site (5'SS) and 3' splice site (3'SS) due to the relative difficulty of identifying each sequence element. 5'SS and 3'SS are readily identified by aligning spliced cDNAs, ESTs, or RNA-Seq reads to the genome, while lower throughput techniques such as primer extension are usually required to map BPs, with some exceptions. To understand how the BP affects splicing outcomes, we developed an experimental method to locate BPs on a genome-wide scale. Applying our method to Saccharomyces cerevisiae (S. cerevisiae), one of the only eukaryotes for which most BPs are known, allowed us to assess the sensitivity and specificity of our method. We enriched for RNA lariats by isolating RNA from debranching enzyme null yeast and purified circular RNAs (including lariats) from linear RNAs using a 2D PAGE gel. This was followed by a custom library preparation protocol that produced insert ends that identified the BP and 5'SS of individual lariats. Using this method, we located known BPs and discovered a substantial number of novel BPs both in annotated introns and other genomic regions. We attempted to verify these novel introns using RNA-seq and Lariat-seq and surprisingly observed considerable amounts of alternative splicing (AS) in S. cerevisiae beyond the previously known stress-regulated intron retention events and handful of alternative splice sites. Additionally, we observed several introns with 2 BPs and one intron with 3 BPs. In the LSM2 transcript, we showed alternative BP usage was associated with alternative splice site usage, where one of the mRNA isoforms contains a premature termination codon and leads to nonsense-mediated mRNA decay of the transcript. This suggests AS may control expression levels in yeast, as is known to be the case in metazoans. Preliminary application of our method to Drosophila melanogaster showed that recursive splicing, a phenomenon previously known to occur only in introns larger than 10 kb, occurs in a 383 nt intron.

Thesis supervisor: Christopher B. Burge
Title: Professor of Biology

Acknowledgements

I’d like to begin by thanking my advisor, Chris Burge, for allowing me to join his lab and pursue a risky project that let me combine my desire to perform both experimental and computational biology research. The Burge lab has been a great environment for me to learn and grow. Thank you Chris for being receptive to my requests over the years, agreeing to meet with me regularly to discuss my research and allowing me to present my findings at several scientific venues.

To my committee members, Phil Sharp and Tom RajBhandary, thank you for all of your helpful advice over the years. Also, Robin Reed, thank you for agreeing to serve on my thesis committee and for providing me with the HeLa Nuclear Extracts that were essential to the success of my research.

Next, thank you to all the members of the Burge lab, past and present, who have made the lab a great environment for doing research. I appreciate all you have taught me through sharing your own knowledge of techniques and through your efforts critiquing my presentations and writing over the years. Special thanks to Nicole for encouraging me to purify yeast DBR1, which was the key to getting my protocol to work, to Athma for patiently helping me learn R, Alex, Jason, Noah, Charles, Maria, Peter F. and Peter S. for teaching me new Python tricks, Matt for insightful suggestions on ways to plot data, Eric for initial ideas pertaining to my project, Jess for talking some sense into me when trying to get last minute experiments to work the night before group meeting, Reut for helpful conversations over her late-morning breakfast in the dry lab, Joe for always being upbeat and being a wonderfully motivated guy to work with, and to Jennifer, Dan, Caitlin, Razvan, Robin, Yarden, Albert, Rob, Vincent, Monica, Yevgenia, Chetan, Abby, Cassie, Ritu, Dima, Daniel, Phil, and Brad for making my time in the lab so memorable.

Thank you to my collaborators Boris, Yuchun, and Joe for countless conversations and questions; they have been some of the best parts of grad school.

I’d also like to thank all of my friends in the building, especially all of my 2nd and 3rd floor neighbors for making the lab a lively place to do science, providing moral support, and organizing fun extracurricular activities.

Thank you to my classmates. It’s been great bouncing ideas off of you and it has been comforting to know I always have good friends nearby. I believe the bonds we have formed will last a lifetime and I look forward to learning of everyone’s future accomplishments. Also, thank you to my BBS friends. It’s been fun to observe the differences between the MIT and Harvard Biology PhD programs over the years and it’s been wonderful having more friends in the area who understand the time requirements of research. Also thank you to my roommates, past and present, who have always been there for me when I needed to unwind at the end of the day.

Thanks to MIT’s extracurricular activities, I’ve been able to maintain a work-life balance. Thank you to the friendly staff and volunteers at the MIT Sailing Pavilion, members of the MIT Figure Skating Club, and volunteers at the MIT Rock Wall for creating positive outlets.

Thank you to my friends from home. Even though some of you admitted you probably wouldn’t understand what I was studying, you were always willing to give it a try and wanted to catch up anyway. Thank you to my college friends, especially the Cal Sailing Team, who still make the time to get together even though we are now scattered across the globe. And to those Cal Sailors whom I discuss scientific topics with from afar, I look forward to our future conversations about scientific breakthroughs, and what the general public thinks of them.

Thank you to Mike Eisen for allowing me to experience what computational biology was all about first hand. If I hadn’t worked in your lab, I wouldn’t have come to grad school. Thank you to my additional mentors outside the lab, Kim Hamad-Schifferli, Frank Solomon, and Alan Grossman, who have provided me with valuable advice over the years.

I would like to especially thank my best friend, Dr. Lauren Barclay, for always being there for me. As we both know, grad school can be trying at times, and having my best friend nearby, who was going through a PhD herself, was the best thing I could have asked for. Thanks for making time to catch up and getting me out of the lab to enjoy New England!

I’d like to thank my high school biology teacher for instilling in me my love of biology. Mr. Van Loo was an excellent teacher who really worked hard to make the subject matter he was teaching interesting and memorable. I’ll never forget when he dressed up a hockey player to demonstrate the Calvin Cycle, bringing a puck of “carbon” in the open “stomata” door to show us where the carbon went and what happened to it once it entered the “cell” classroom, or the time when he had a student volunteer stand on a chair, hold a couple of branches, and try, to no avail, to drink water through a long straw from a water bottle on the floor to demonstrate why transpiration was important for plants to transport water from their roots to their branches. He made biology fun and accessible. It was also through his course that I learned about the UC Davis Young Scholars Program and ended up having my first of many research experiences.

I’d like to thank my extended family in the Boston area that made Cambridge a home away from home for me. It’s been great spending time with you, especially since we lived so far apart while I was growing up. I’ve really enjoyed all of our great meals together, Red Sox games, trips to the Cape and other outings. Also, thank you for opening your home to me after the Boston Marathon bombing. A special thank you to the officers who protect MIT, especially Officer Sean Collier.

To my grandparents, thank you for always wanting to hear about my latest endeavors. To my “little” brother, thanks for being born after me, you would have been a tough act to follow. I’ve enjoyed all of our fun East Coast visits and appreciate all your advice over the years. Finally, I would like to thank my mom and dad for everything they have given me. Without them, there’s no way I would be where I am today. They have always been there for me, from the endless hours of practicing vocabulary words and spelling in elementary school, to coaching my soccer teams, to caring for me after injuries from said soccer, to taking me on unforgettable family vacations. You taught me to be persistent and it has definitely paid off. Thank you for your invaluable love and support, I couldn’t have done it without you.

-Genny Gould

Table of Contents

Abstract ...... 3
Acknowledgements ...... 4
Table of Contents ...... 6
Chapter 1: Introduction ...... 9
Overview ...... 10
Pre-mRNA splicing ...... 10
Spliceosomal splicing and self splicing ...... 10
Consequences of alternative splicing ...... 14
Unconventional intron removal ...... 14
Branch points ...... 15
Discovery ...... 15
BP identification ...... 16
BP characteristics: motifs and locations ...... 20
Functional roles of BPs: effects of location, mutations, and altered recognition ...... 22
Lariats ...... 24
Sources of lariats ...... 24
Lariat turnover: debranching ...... 25
RNAs processed from lariats ...... 26
Lariats versus circular RNAs ...... 27
Sequencing technologies ...... 29
Thesis overview ...... 30
References ...... 31
Chapter 2: Identification of New Branch Points and Unconventional Introns in Saccharomyces cerevisiae ...... 39
Abstract ...... 40
Introduction ...... 41
Results ...... 44
Branch-seq accurately identifies locations of 75% of expressed, annotated BPs ...... 44
Branch-seq identifies novel BP and associated 5'SS ...... 47
Over 100 additional introns and splice sites in the yeast genome ...... 52
New splice sites have distinctive features and conservation ...... 53
New splice sites have distinctive features and conservation ...... 56
AT-AC splice sites are used in yeast ...... 57
Multi-BP introns occur in at least twelve and can impact ...... 58
Changes in splicing among growth conditions ...... 61
Discussion ...... 64
Methods ...... 67
Data access ...... 89
Acknowledgements ...... 89
Author contributions ...... 89
Supplemental figures ...... 90
Tables ...... 101
References ...... 102
Chapter 3: Conclusions ...... 107
Implications ...... 108
Future directions ...... 110
BP sequencing approaches ...... 110
Advice for future development of BP sequencing approaches ...... 111
Additional applications of BP sequencing ...... 112
Final remarks ...... 113
References ...... 114
Appendix I: Branch-seq Protocol ...... 115
Part 1: Branch-seq protocol ...... 116
Pre-protocol steps: ...... 116
Branch-seq protocol: ...... 117
Part 2: Advice for future BP sequencing protocols ...... 124
Figures ...... 128
References ...... 132
Appendix II: Supplemental Tables to Chapter 2 ...... 133
Table II-S1. Branch-seq BP peaks paired 5'SS motifs ...... 134
Table II-S2. GEM-BP and winBP peaks ...... 135
Table II-S3. GTATGT motif frequency at 5'SS and generally in introns ...... 152
Table II-S4. Branch-seq CPMs ...... 153
Table II-S5. SacCer 3 coordinates of lariat junction reads ...... 164
Table II-S6. Novel splice junctions with entropy ≥ 2 bits ...... 166
Appendix III: BP Identification in Metazoans ...... 171
Abstract ...... 172
Introduction ...... 172
Methods ...... 173
Results ...... 175
Knockdown of ldbr does not result in a noticeable accumulation of lariat RNA ...... 175
Fly Branch-seq reads largely do not map to the fly genome ...... 177
Fly Branch-seq reads identify the first recursive splice site in a short intron ...... 178
Discussion ...... 182
Supplemental note ...... 183
Acknowledgments ...... 185
References ...... 185


Chapter 1: Introduction

Overview

In eukaryotes, most intron-containing pre-mRNAs require splicing in the nucleus before they are exported to the cytoplasm (Hocine, Singer, & Grünwald, 2010). Pre-mRNA splicing is the ubiquitous process by which intervening sequences, introns, are removed from pre-mRNAs and exonic sequences are joined together as part of the mRNA maturation process (Padgett, Konarska, Grabowski, Hardy, & Sharp, 1984). This is accomplished through two successive transesterification reactions that can produce constitutive or alternative splicing patterns. During alternative splicing, the exons of one pre-mRNA are joined together in different combinations to produce two or more distinct mRNAs, termed isoforms. mRNA isoforms may differ in their translation, stability, or localization (Hocine et al., 2010). These mRNA isoforms often code for different proteins, contributing greatly to the diversity of the proteome (Matlin, Clark, & Smith, 2005). Though the value of alternative splicing has been appreciated for some time, much remains to be understood about how splicing is regulated.

Pre-mRNA splicing

Spliceosomal splicing and self splicing

The branch point (BP) sequence is one of three key nucleotide sequences required for splicing of precursors to mRNAs. It is typically located near the 3' end of the intron, between the two other required sequences, the 5' splice site (5'SS) and the 3' splice site (3'SS), which identify the ends of the intron. All three of these sequences are absolutely required for spliceosome-mediated splicing because they participate in the chemistry of splicing.

The spliceosome is composed of an array of RNAs and proteins that assemble on the pre-mRNA in a stepwise manner. During assembly of the spliceosome, the 5'SS is recognized by the U1 small nuclear ribonucleoprotein (snRNP) through complementarity between the U1 small nuclear RNA (snRNA) and the 5'SS sequence. The BP is first recognized by the BP binding protein (BBP) (yeast) or splicing factor 1/mammalian BBP (SF1/mBBP) (mammals), and the polypyrimidine tract and 3'SS are recognized by U2AF2 and U2AF1, respectively (Fig. 1-1A). The U2 snRNP subsequently replaces SF1/BBP and the U2 snRNA base pairs with the BP sequence, forming a structure in which the BP nucleotide, typically an adenosine embedded inside the BP motif, is bulged from the RNA duplex (Langford & Gallwitz, 1983; Query, Moore, & Sharp, 1994; Wahl, Will, & Lührmann, 2009), preparing the RNA for the first transesterification reaction of splicing. The base pairing of the U2 snRNA with the BP is stabilized by the SF3a and SF3b complex components of the U2 snRNP (Gozani, Feld, & Reed, 1996). Next, the pre-assembled U4/U6.U5 tri-snRNP is recruited to the splicing complex and then the U1 and U4 snRNPs are released. Once this step occurs, the first splicing reaction creates an unusual 2'-5' RNA linkage between the 2' OH of the BP nucleotide and the 5'SS. This reaction results in the formation of a lariat structure attached to the downstream exon, leaving the upstream exon with a free 3' OH. The second transesterification reaction joins the two exons together and frees the lariat (Meyer, Plass, Pérez-Valle, Eyras, & Vilardell, 2011; Padgett et al., 1984). The lariat is rapidly debranched and degraded in most cases (Chapman & Boeke, 1991; Corvelo, Hallegger, Smith, & Eyras, 2010; Folco & Reed, 2014; Ruskin & Green, 1985), making BP identification difficult. In contrast, splice site identification is relatively straightforward because spliced alignment of cDNAs to the genome reveals the locations of splice sites.

Figure 1-1: Intron removal. (A) Two steps of splicing for spliceosome mediated splicing showing 5'SS, 3'SS, BP, U1 snRNP, U2 snRNP, BBP, and U2AF. Adapted from (Alberts et al., 2007). (B) Recursive splicing. Ratchet introns are involved in splicing of large Drosophila introns, including Ubx intron 1, kuz intron 3, and osp introns 1 and 2. Adapted from (Burnette, Miyamoto-Sato, Schaub, Conklin, & Lopez, 2005). (C) Nested intron splicing. Some mammalian introns including introns in the human gene EPB41 contain nested introns. Adapted from (Parra, Tan, Mohandas, & Conboy, 2008). 5'SS white dotted line. 3'SS grey dotted line.

Not all introns are removed by the major spliceosome. First, the minor spliceosome often splices out introns that have /AT 5'SS and AC/ 3'SS (where “/” represents the boundary with exonic sequence), though it has been shown that both the major and minor spliceosomes can splice introns with /GT-AG/ or /AT-AC/ termini (Dietrich, Incorvaia, & Padgett, 1997). The only snRNP shared between the major and minor spliceosome is the U5 snRNP, with the minor spliceosome containing the U11, U12, U4atac, and U6atac snRNPs that are functionally analogous to the major spliceosomal snRNPs described above (reviewed by (Patel & Steitz, 2003)). Second, introns may be removed by self splicing, as is the case in Group I and Group II introns. Group I introns use a free nucleotide, typically a guanosine, as the nucleophile for the first step of splicing. In contrast, Group II intron splicing is quite similar to that of spliceosomal introns in that the 2' OH of a nucleotide embedded in the intron sequence itself, often an adenosine, is used as the nucleophile in the first step of splicing (Bonen & Vogel, 2001). Additionally, Group II introns usually conform to a particular secondary structure that consists of an elaborate series of stem loops (Sharp, 1991). Group I and Group II introns can often self splice in vitro in the absence of proteins, but the efficiency may be augmented in vivo by specific proteins that are generally unrelated to spliceosomal proteins (Cech, 1990). A third type of intron removal occurs in eukaryotic and archaeal tRNAs, where introns can be removed by a series of RNA cleavage and ligation steps that differ from spliceosomal splicing and self splicing (reviewed by (Abelson, Trotta, & Li, 1998; Phizicky & Hopper, 2010)). Bacterial tRNA introns can be removed by self splicing (Biniszkiewicz, Cesnaviciene, & Shub, 1994; Kuhsel, Strickland, & Palmer, 1990; Reinhold-Hurek & Shub, 1992).

Consequences of alternative splicing

Alternative splicing of pre-mRNAs can have many different functional consequences at both the RNA and protein levels. For instance, SRP75 mRNA is destabilized by splicing in an extremely well conserved (“ultra-conserved”) exon (Lareau, Inada, Green, Wengrod, & Brenner, 2007; Ni et al., 2007). Localization of RNAs can be altered by splicing as well, as in the case of oskar in Drosophila, where splicing causes the mRNA to localize to the posterior pole of the oocyte, whereas the unspliced mRNA can be seen diffusely throughout the ooplasm (Hachet & Ephrussi, 2004). Similarly, splicing can change the localization of the protein, as in the case of Nop30, where splicing of the mRNA alters the C-terminus of the protein, changing the protein’s localization between the nucleus and cytoplasm (Stoss, Schwaiger, Cooper, & Stamm, 1999).

Alternative splicing is regulated at the tissue and organism level and is important for development. While gene expression is tissue-specific, alternative splicing is conserved in only a subset of tissues and is often organism- or lineage-specific (Barbosa-Morais et al., 2012; Merkin, Russell, Chen, & Burge, 2012). Additionally, the splicing of certain introns can contribute to the proper timing of gene expression that is critical for development, as in the case of Hes7 in mouse somite segmentation (Takashima, Ohtsuka, González, Miyachi, & Kageyama, 2011).

Unconventional intron removal

Though BPs are typically located near the 3' ends of introns, BPs located far away from the 3'SS have been observed in the cases of recursive splicing in flies and humans and nested intron splicing in humans (Burnette et al., 2005; Duff et al., 2015; Hatton, Subramaniam, & Lopez, 1998; Sibley et al., 2015). Recursive splicing is achieved by splicing the 5'SS to a sequence inside the intron that resembles a 3'SS immediately adjacent to a second 5'SS (Fig. 1-1B). Splicing continues in this fashion until the next exon is reached. To date, recursive splicing has only been observed in fly and human introns that are larger than 10 kbp in length (Burnette et al., 2005; Duff et al., 2014). In nested intron splicing, a central segment of a large intron is initially removed, followed by splicing of the remainder of the intron using the normal 5'SS and 3'SS (Fig. 1-1C) (Ott, Tamada, Bannai, Nakai, & Miyano, 2003; Parra et al., 2008).
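The defining feature of a recursive splice site described above, a 3'SS-like sequence immediately followed by a 5'SS-like sequence, can be illustrated with a minimal sketch. The function name and the example intron are hypothetical, and the search uses only the core AG/GT dinucleotides as an assumption; real ratchet points would also be expected to carry a polypyrimidine tract and a fuller 5'SS consensus, which this sketch does not check.

```python
import re


def find_ratchet_candidates(intron: str) -> list[int]:
    """Return 0-based indices of the G in each AG|GT junction.

    The index marks the first nucleotide of the 5'SS-like GT that
    immediately follows a 3'SS-like AG. Only the core dinucleotides
    are tested (an assumption made for illustration).
    """
    seq = intron.upper().replace("U", "T")
    # A lookahead is used so that overlapping occurrences are reported.
    return [m.start() + 2 for m in re.finditer(r"(?=AGGT)", seq)]


# Hypothetical intron: 5'SS, pyrimidine run, CAG|GTAAGT ratchet point, 3'SS.
example = "GTAAGT" + "T" * 7 + "CAG" + "GTAAGT" + "CCCC" + "TTTT" + "CAG"
print(find_ratchet_candidates(example))  # [16]
```

A scan like this would over-predict heavily in real introns, which is one reason experimental evidence such as lariat junction reads is needed to confirm recursive splicing.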

Branch points

Discovery

In 1982 Wallace and Edmonds discovered branched RNA that contained a 2' to 5' phosphodiester bond. They observed that branching occurred in the nuclear RNA fraction as opposed to the cytoplasmic RNA fraction and observed that the branched nucleotide is often an adenosine (Wallace & Edmonds, 1983).

Shortly thereafter, in 1983, the BP motif in budding yeast was proposed after a detailed deletion analysis of the 3' end of the actin intron identified a region of the intron near the 3'SS that was necessary for splicing. Comparison to the three other budding yeast introns sequenced at the time revealed the presence of the same TACTAAC motif near the 3' ends of all four introns (Langford & Gallwitz, 1983). After the sequence of the Saccharomyces cerevisiae (S. cerevisiae) genome was released in 1996, researchers sought to comprehensively identify yeast introns and test those predictions (Davis, 2000; Spingola, Grate, Haussler, & Ares, 1999). Additionally, BPs were computationally predicted in annotated yeast introns based on a combination of their unusually strong motif (Fig. 1-2A) and location relative to the 3'SS (Davis, 2000; Meyer et al., 2011). Computational BP predictions in S. cerevisiae have been limited to annotated introns; however, additional yeast introns are still being discovered today using genome-wide assays (Kawashima, Douglass, Gabunilas, Pellegrini, & Chanfreau, 2014; Z. Zhang, Hesselberth, & Fields, 2007). Thus, any BPs that fell outside of the intron annotations at the time would have been missed.

BP identification

Historically, BPs have been much more challenging to identify than splice sites in a high-throughput manner. Splice site identification can be accomplished by aligning a cDNA back to its parent genome to determine the missing intronic sequence. BPs, on the other hand, are best identified from lariat RNAs. The short half-lives and unusual branched structure of lariats require additional methods to pinpoint BP locations. Traditionally, BPs have been experimentally verified using more laborious low-throughput techniques such as primer extension, in vitro splicing, and RT-PCR across the lariat 5'SS-BP junction (Padgett et al., 1985; Vogel, Hess, & Börner, 1997; Wahl et al., 2009). To identify a BP using primer extension, a gene-specific primer is designed to prime reverse transcription (RT) starting in the 3' exon. RT often stops at the branched nucleotide, revealing the location of the BP based on the product size and sequence (Fig. 1-2B). In vitro splicing can be used to splice a gene of interest and typically is combined with mutagenesis experiments of the presumptive BP region, or primer extension on the splicing products, to locate the BP nucleotide. RT-PCR across the lariat 5'SS-BP junction identifies a BP by the juxtaposition of the 5'SS sequence to the BP sequence in the PCR product (Fig. 1-2B). Because RT rarely crosses the 5'SS-BP junction, gene-specific primers have traditionally been used to amplify such RT products for sequencing.

Figure 1-2: BP characteristics. (A) BP and SS motifs. Figure from (Lim & Burge, 2001). (B) Classical methods for experimental BP identification. (C) Re-splicing results in a BP inside of a CDS. Adapted from (Kameyama, Suzuki, & Mayeda, 2012). (D) Number of known 5'SS, 3'SS, BP based on estimates from Hg18 and (Gao, Masuda, Matsuura, & Ohno, 2008; Mercer et al., 2015; Taggart, DeSimone, Shih, Filloux, & Fairbrother, 2012). (E) Mutually exclusive splicing of α-tropomyosin as a result of unusual BP location near a 5'SS. (F) BP mutations in the XPC gene are associated with xeroderma pigmentosum (Khan et al., 2004).
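The juxtaposition produced by RT-PCR across the lariat junction can be made concrete with a small sketch. It assumes a sense-strand pre-mRNA string and 0-based coordinates; the function name, the `flank` parameter, and the toy sequence are hypothetical and are not taken from any protocol cited here.

```python
def lariat_junction(pre_mrna: str, five_ss: int, bp: int, flank: int = 10) -> str:
    """Return the sequence expected across a lariat 5'SS-BP junction.

    five_ss: 0-based index of the first intron nucleotide.
    bp:      0-based index of the branch nucleotide.
    The result is written in the sense orientation: intron sequence
    ending at the BP, followed directly by the start of the intron.
    """
    if not (0 <= five_ss < bp < len(pre_mrna)):
        raise ValueError("expected 0 <= five_ss < bp < len(pre_mrna)")
    # Up to `flank` nucleotides ending at the BP, never reaching into the exon.
    upstream = pre_mrna[max(five_ss, bp - flank + 1): bp + 1]
    # The first `flank` nucleotides of the intron, starting at the 5'SS.
    downstream = pre_mrna[five_ss: five_ss + flank]
    return upstream + downstream


# Toy pre-mRNA (hypothetical): exon, yeast-like intron, exon.
pre = "AAAC" + "GTATGT" + "CCCC" + "TACTAAC" + "TTTT" + "AG" + "GGG"
branch_a = pre.index("TACTAAC") + 5  # last A of TACTAAC
print(lariat_junction(pre, five_ss=4, bp=branch_a, flank=6))  # TACTAAGTATGT
```

Because the two halves are in inverted order relative to the genome, such a product cannot be aligned contiguously, which is the property that junction-spanning primers and split-read approaches exploit.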

Application of such techniques has allowed identification of dozens of human BPs and revealed discrepancies with computational BP predictions. In 2008, the first large scale experimental study identified ~100 human BPs using RT-PCR on 293T cell RNA (Gao et al., 2008). This approach targeted 52 introns using nested PCR primer pairs, similar to Figure 1-2B (top), and found that only 50% of their sequenced BPs agreed with those generated by a predictive algorithm, demonstrating the value of experimental BP validation (Gao et al., 2008). While BP prediction algorithms commonly use proximity to the 3'SS as a predictive feature, a number of studies have found examples of distant BPs located more than 100 nucleotides away from the 3'SS (Grossman et al., 1998; Hallegger, Sobala, & Smith, 2010). Additionally, BPs located very far from the 3'SS have been observed in the cases of recursive splicing and nested intron splicing (see above). Existing algorithms would also fail to predict a BP if it were located in a coding sequence (CDS), as occurs in re-splicing of specific mRNAs in cancer (Kameyama et al., 2012) (Fig. 1-2C). Distant BPs, unannotated introns, CDS BPs, and poor agreement between predictive algorithms and experimentally validated BP locations support the utility of an untargeted experimental approach to identify BPs genome-wide.

Alternative high-throughput approaches to identify BP locations have only been developed recently. All of these approaches have been made possible by recent advances in sequencing technologies that allow routine sequencing of millions of short, heterogeneous cDNA fragments. When these fragments, termed “reads”, are generated from mRNAs, the collection of reads yields information about relative gene expression and splicing levels. This type of data, known as RNA sequencing (RNA-seq), is generated by selecting poly(A) tailed RNAs or depleting ribosomal RNAs from total RNA in order to isolate mRNAs. Fragmentation of the mRNAs followed by random hexamer priming creates cDNA fragments for sequencing. Once sequenced, the short reads are aligned back to the reference genome for downstream computational analyses. Generally, a small fraction of the reads will not align, or “map”, to the genome. These unmapped reads arise from a combination of technical and biological sources.

In the last few years, new computational analyses of RNA-seq data have been used to identify BPs. In 2012, Taggart and colleagues identified split reads that cross the 5'SS to BP junction in existing RNA-seq data from reads that do not map to the genome contiguously. This approach resulted in the identification of ~900 human BPs (Taggart et al., 2012). A drawback of this approach is the extremely low efficiency: out of 1.2 billion reads analyzed, only 2,118 (0.0002%) crossed the 5'SS to BP junction. Increasing the fraction of lariat junction reads relative to total reads would make this split read mapping approach more appealing for global identification of BPs. The following year, Awan and colleagues developed a method that addressed this enrichment problem. Their method, Lariat-seq, specifically sequences lariat RNAs. Using Lariat-seq, they discovered novel introns and splicing events in Schizosaccharomyces pombe (S. pombe) and identified ~900 BPs using a variation of the split read mapping strategy originally developed by Taggart et al. (Awan, Manfredo, & Pleiss, 2013). A year later, Bitton and colleagues came up with a variation on the computational split read mapping algorithm to find BPs, termed LaSSO. Applied to the human dataset used by Taggart et al., LaSSO found a largely different set of BPs than the study by Taggart and colleagues (Bitton et al., 2014). These discrepancies indicate that more BP locations likely remain to be gleaned from existing RNA-seq datasets through further development of computational algorithms.
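The general idea behind split read mapping can be sketched in a few lines. This is a toy illustration under strong assumptions (exact string matching, a single known intron, no sequencing errors), not the algorithm of Taggart et al., Awan et al., or LaSSO; the function name, the `min_seg` parameter, and the example intron are hypothetical.

```python
from typing import Optional


def map_lariat_read(read: str, intron: str, min_seg: int = 8) -> Optional[int]:
    """Return the 0-based intron index of the inferred BP, or None.

    The read is split at every position that leaves at least `min_seg`
    nucleotides on each side. A split is accepted when the tail matches
    the 5' end of the intron exactly and the head occurs exactly once
    in the intron; the last nucleotide of the head is then the BP.
    """
    for split in range(min_seg, len(read) - min_seg + 1):
        head, tail = read[:split], read[split:]
        if not intron.startswith(tail):
            continue  # tail must begin at the 5'SS
        start = intron.find(head)
        if start == -1 or intron.find(head, start + 1) != -1:
            continue  # head absent or ambiguous
        return start + len(head) - 1
    return None


# Hypothetical intron with a TACTAAC motif; the branch A is at index 27.
intron = "GTATGTTCGATTGCAGCATCGATACTAACTTTCTTTTAG"
junction_read = intron[18:28] + intron[:10]
print(map_lariat_read(junction_read, intron))  # 27
print(map_lariat_read(intron[:20], intron))    # None (contiguous read)
```

Real implementations must additionally tolerate mismatches, search across all annotated or candidate 5'SS, and filter alignment artifacts, which may help explain why different algorithms recover different BP sets from the same data.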

More recently, a novel targeted BP sequencing approach found ~60,000 human BPs (Mercer et al., 2015). The success of this method was largely due to the strategies used to enrich for informative 5'SS to BP traversing reads. However, the targeted approach used in this study made use of oligonucleotide probes designed to map near annotated 5' and 3' ends of introns and thus was unlikely to find the unusual BPs discussed above. Based on the number of constitutive and alternative 5' and 3'SS known, it is certain that many tens of thousands of mammalian BPs remain to be discovered (Fig. 1-2D).

BP characteristics: motifs and locations

Years of study using both experimental and computational techniques have revealed consensus motifs of the three required splicing sequences. Among budding yeast, worms, flies, plants, and humans, yeast has the strongest BP motif (Fig. 1-2A). The BP motif is highly constrained in S. cerevisiae with ~90% of annotated BPs matching the TACTAAC motif perfectly (Spingola et al., 1999), contrasted with metazoans and plants which have a highly degenerate BP motif (yUnAy) (where y = C or U and n = any base) (Chapman & Boeke, 1991; Folco & Reed, 2014; Gao et al., 2008; Lim & Burge, 2001; Ruskin & Green, 1985). Budding yeast also contain the most information in their 5'SS motifs and the least information at their 3'SS compared to these other organisms (Fig. 1-2A). These differences contribute to the accuracy of splicing predictions across different organisms. Previous work found that Drosophila melanogaster and Caenorhabditis elegans short introns contain most of the information necessary for their recognition by the splicing machinery. S. cerevisiae introns also contain much of this information, but not enough to clearly identify the 3'SS, whereas human and plant introns do not contain enough information in their splice site motifs to accurately predict splicing outcomes (Lim & Burge, 2001).
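The notion of a motif "containing information" can be illustrated with a short calculation of Shannon information content in bits. This is a minimal sketch that assumes a uniform background over the four bases and applies no small-sample correction; the cited work may use a different background model, and the function name and example sites are hypothetical.

```python
import math
from collections import Counter


def information_content(sites: list[str]) -> float:
    """Total information (bits) of aligned, equal-length DNA sites.

    Each position contributes 2 - H bits, where H is the Shannon
    entropy of the observed base frequencies at that position and
    2 bits is the maximum for a four-letter alphabet.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


# A perfectly conserved 7-mer carries the maximum of 2 bits per position.
strong = ["TACTAAC"] * 4
# Hypothetical degenerate sites: only the T and A positions are fixed.
weak = ["CTAAC", "TTGAT", "CTCAC", "TTTAT"]
print(information_content(strong))  # 14.0
print(information_content(weak))
```

In this toy example the invariant yeast-like motif reaches the 14 bit maximum for seven positions, while the degenerate sites score lower because only some positions are conserved.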

Known BPs are typically located near the 3' ends of introns. In S. cerevisiae, BPs are often easily found 20-45 nt upstream of the 3'SS due to the strong BP motif and short intron size (Meyer et al., 2011). These properties have allowed computational prediction of a BP in every S. cerevisiae intron, but not in other organisms. Nevertheless, in 2010 a computational study predicted human BPs using sequence conservation, predicted U2 snRNA binding stability, and intronic position (Corvelo et al., 2010). This study found that BP strength and distance to the 3'SS correlate strongly with alternative splicing, suggesting a role for the BP in determining splicing outcomes. Interestingly, in budding yeast, when the BP to 3'SS distance is larger than ~45 nt there is typically secondary structure that reduces the effective distance between the BP and 3'SS (Meyer et al., 2011). For experimentally mapped human BPs, the BP tends to be close to the 3'SS (Gao et al., 2008; Mercer et al., 2015; Taggart et al., 2012), similar to the majority of yeast BPs.

Functional roles of BPs: effects of location, mutations, and altered recognition

While it is clear a BP is required for every splicing reaction, the degree to which BP selection determines alternative splicing outcomes has not been well studied. However, a few examples illustrate some of the functional consequences of BP usage. Work from our lab suggests that BP positioning plays a role in 3'SS selection for the special case where alternative 3'SS are 3 nucleotides apart, known as NAGNAGs (Bradley, Merkin, Lambert, & Burge, 2012). This work showed that the putative BP is located farther upstream in the intron when the upstream NAG is favored compared to the case when the downstream NAG is predominantly used. Additionally, steric effects have been shown to influence the outcome of splicing events, as in the case of α-tropomyosin. The BP upstream of the second mutually exclusive exon in α-tropomyosin is located very close to the 5'SS of the competing exon, preventing splicing of the intervening intron due to steric hindrance of splicing components (Smith & Nadal-Ginard, 1989) (Fig. 1-2E).

BP mutations can alter splicing events both in vivo and in vitro, implying constraint on what sequence can be selected as a BP in the intron. For instance, yeast splicing reporters with mutated BPs show greatly reduced levels of splicing (Rain, 1997; Vijayraghavan et al., 1986). Similarly, in cases of genetic diseases, BP mutations have been shown to cause exon skipping or intron retention, defined as the events where a single exon is alternatively spliced out of the mRNA or a single intron is included in the mRNA, respectively. BP mutations have been linked to disease phenotypes in Fish-eye disease, X-linked hydrocephalus, Ehlers-Danlos syndrome, hemophilia B, xeroderma pigmentosum, tuberous sclerosis, familial hypercholesterolemia, Niemann-Pick disease, extrapyramidal movement disorder, and allele-dependent production of soluble DQ (Královicová, Lei, & Vorechovský, 2006). More specifically, a familial case of xeroderma pigmentosum (XP), an autosomal recessive disease associated with a 1000-fold increase of skin cancer frequency, is caused by BP mutations in the XPC gene. These mutations cause exon skipping that creates a non-functional DNA repair enzyme (Khan et al., 2004) (Fig. 1-2F).

Mutations in core splicing factors that recognize the BP and the 3' end of the intron have recently been observed in many cancers. In several blood cancers SF3B, which is involved in BP recognition, and U2AF, which is involved in polypyrimidine tract and 3'SS recognition, have been observed to be hotspots of mutations (Hahn & Scott, 2012). Independent studies have identified SF3B1 among the top genes containing somatic mutations in chronic lymphocytic leukemia (CLL) samples (Quesada et al., 2012; Wan & Wu, 2013; L. Wang et al., 2011; X. Wu, Tschumper, & Jelinek, 2013). In secondary acute myeloid leukemia (sAML), recurrent mutations in U2AF1 have been identified (Graubert et al., 2012). These and other studies have documented changes in pre-mRNA splicing in mutant samples, implicating pre-mRNA splicing in myelodysplastic syndromes (MDS) (DeBoever et al., 2015). Interestingly, the anti-tumor splicing drugs Spliceostatin A (SSA) and E7107 have been shown to interfere with normal functions of SF3B. More specifically, SSA and E7107 disrupt proper recognition of the BP by the U2 snRNP and U2 snRNA, respectively, and alter the outcome of splicing (Corrionero, Miñana, & Valcárcel, 2011; Folco, Coil, & Reed, 2011). SF3B is also the binding target of Pladienolide B, another anti-tumor compound that inhibits splicing and is structurally similar to E7107 (Effenberger et al., 2014; Kotake et al., 2007).

Lariats

The location of the BP is defined by the unusual 2'-5' linkage between the BP nucleotide and the 5'SS present in lariat RNA, making lariats the key to identifying BP locations.

Sources of lariats

There are several different sources of RNA lariats. For one, lariats can be produced in vitro using a deoxyribozyme. In this case, an in vitro synthesized linear RNA is mixed with a partially complementary DNA oligo. Pairing of the RNA and DNA facilitates branch formation by positioning various parts of the RNA near each other spatially so that a nucleophilic attack can occur (Y. Wang & Silverman, 2005). Second, in vitro self-splicing of a Group II intron can be used to produce a lariat by placing an in vitro transcribed RNA under the correct temperature and buffer conditions (Costa, Fontaine, Loiseaux-de Goër, & Michel, 1997). Third, in vitro splicing using HeLa nuclear extracts can be used to produce lariats spliced by the spliceosome (Folco & Reed, 2014; Padgett, Hardy, & Sharp, 1983).

In the cases of self-splicing and in vitro splicing, lariat RNA is not the only product of the splicing reaction; ligated exons and splicing intermediates are also produced. To remove the linear RNA products and leave lariat RNA intact, an exonuclease, such as RNase R, can be added to the reaction (Suzuki, 2006). RNase R is a processive 3' to 5' exonuclease that requires 7 nt of single-stranded RNA at the 3' end of an RNA to initiate digestion of its substrates (Vincent & Deutscher, 2006). In an in vitro splicing reaction, most of what is left after treatment with RNase R will be lariat RNA (Fig. I-1).

Lariat turnover: debranching

The debranching enzyme, DBR1, rapidly linearizes lariat RNA in vivo so that the RNA can be degraded and the nucleotides can be recycled. The debranching enzyme was discovered in 1991 in a screen for factors required for Ty1 retrotransposition (Chapman & Boeke, 1991). The study found that the DBR1 gene is required for Ty1 transposition and inadvertently found DBR1 is required to debranch lariats. Characterization of the DBR1 gene revealed that one copy of DBR1 was sufficient to debranch the yeast actin lariat in vivo, but a homozygous dbr1 deletion resulted in accumulation of the lariat. Subsequently, DBR1 has been implicated in HIV replication (Ye, De Leon, Yokoyama, Naidu, & Camerini, 2005). In human cells, 80% knockdown of DBR1 did not significantly affect cell viability but did lead to a decrease in HIV cDNA and protein production.

Debranching enzyme is a highly conserved protein from yeast to human. It has been shown that ectopic expression of human DBR1 can complement S. cerevisiae and S. pombe dbr1 nulls (Kim et al., 2000). S. cerevisiae dbr1∆ mutants have slight growth defects (Chapman & Boeke, 1991) and S. pombe dbr1∆ mutants have filamentous growth defects (Mösch & Fink, 1997). Recently a homolog of DBR1, DRN1, has been reported to aid in the process of debranching lariats (Garrey et al., 2014).

Detailed studies of the debranching enzyme revealed its reaction condition requirements and target sequence preferences for fast debranching activity. DBR1 is optimally active between 30-37°C, prefers purines at the 2' position relative to the BP nucleotide, and requires more than an H or OH group at the 3' position to debranch a lariat (Nam et al., 1994; Ooi et al., 2001). Low concentrations of divalent cations will enhance the activity of DBR1. However, it is not always necessary to add cations to the debranching reaction, perhaps because the enzyme tightly binds two metal ions that may remain bound during DBR1 purification. Inhibitors of debranching include high concentrations of KCl, RNasin, and yeast tRNA (Ooi et al., 2001). Additionally, the catalytic residues of DBR1 have been identified (Findlay, Boyle, Hause, Klein, & Shendure, 2014; Khalid, Damha, Shuman, & Schwer, 2005).

RNAs processed from lariats

Many varieties of functional RNAs are derived from intronic lariat RNAs. For one, snoRNAs are often processed from debranched lariats by exonucleases (Bachellerie, Cavaillé, & Hüttenhofer, 2002; Kiss & Filipowicz, 1995; Tycowski, Shu, & Steitz, 1993). snoRNAs guide RNA modifications, including 2'-O methylation and pseudouridylation. Interestingly, the spacing between snoRNAs and BPs is critical for proper snoRNA processing (Hirose, Shu, & Steitz, 2003; Vincenti, De Chiara, Bozzoni, & Presutti, 2007).

Many additional types of RNAs are processed from introns, but the relevance of BP position in the maturation of these RNAs is less clear. For instance, sno-lncRNAs, long non-coding RNAs (lncRNAs) flanked on each end by a snoRNA, are processed out of introns, two of which have been implicated in the pathogenesis of Prader-Willi Syndrome (Yin et al., 2012). Additionally, microRNAs (miRNAs) can be processed out of debranched introns. These miRNAs/introns, called mirtrons, structurally look like pre-miRNAs once the lariat has been debranched, allowing the mirtron to enter the miRNA processing pathway without Drosha-mediated cleavage (Ruby, Jan, & Bartel, 2007). In another example, in the yeast Cryptococcus neoformans, stalled splicing coupled to lariat debranching has been shown to produce siRNAs that silence transposons (Dumesic & Madhani, 2014; Dumesic et al., 2013). As a final example, debranching is necessary for class switch recombination of the antibodies expressed by B cells because an intronic RNA processed from a lariat guides activation-induced cytidine deaminase (AID) to immunoglobulin switch region DNA (Zheng et al., 2015). BP position might affect the stability of these intron-derived RNAs. In S. cerevisiae, when lariats accumulate their tails are digested (Chapman & Boeke, 1991). If lariat tails are digested in metazoans as they are in S. cerevisiae, BP position could determine whether the intron-encoded RNA will be protected in a lariat loop or subject to degradation in a lariat tail.

Lariats versus circular RNAs

Despite the circular nature of the lariat loop, there are important differences between a lariat RNA and a circular RNA. Circular RNAs are defined as having only 3' to 5' linkages in the sugar-phosphate backbone, whereas lariat RNAs contain many 3' to 5' linkages but also a single 2'-5' linkage. The BP nucleotide in a lariat is attached to three other nucleotides, whereas every nucleotide in a truly circular RNA is attached to only two others. Topologically, lariats and circles seem quite similar, and the literature has misleadingly used the term “circular intronic RNA (ciRNA)” to describe lariat RNA (Y. Zhang et al., 2013), but to enzymes circles and lariats are quite different.

It is difficult for reverse transcriptase (RT) to traverse the 2'-5' RNA linkage at the BP, and the attempt often results in incorporation of a mismatched nucleotide at the BP or skipping of the BP base altogether (Bitton et al., 2014; Gao et al., 2008; Taggart et al., 2012).
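A cDNA that crosses the 2'-5' linkage reads, in the sense orientation, as intron sequence ending at the BP followed directly by the 5' end of the intron. The sketch below shows how such a read could in principle be used to locate a BP while tolerating a misread BP base. It is a simplified illustration under stated assumptions (exact matching, a single intron, sense-strand reads, one substituted base at the BP, no skipped base), not the mapping procedure of the cited studies; the function name, `min_anchor` parameter, and toy sequences are hypothetical.

```python
def locate_branch_point(intron, read, min_anchor=8):
    """Return the 0-based intron index of the putative BP, or None.

    The read is split into an upstream part (ending at the BP) and a
    downstream part that must match the 5' end of the intron exactly.
    The last base of the upstream part is the BP itself and is ignored
    when matching, because RT frequently misreads it.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        upstream, five_prime = read[:split], read[split:]
        if not intron.startswith(five_prime):
            continue
        body = upstream[:-1]          # drop the (possibly misread) BP base
        start = intron.find(body)     # first exact occurrence only
        if start != -1 and start + len(body) < len(intron):
            return start + len(body)  # position just after the matched body
    return None


# Toy intron (not a real sequence) with TACTAAC at positions 25-31.
toy_intron = "GTATGTAAGCCTGATCGGATTCGCATACTAACAATTTTCTAG"
# Read: bases 20-30 of the intron with the BP A misread as T, then the 5' end.
toy_read = "TCGCATACTAT" + "GTATGTAAGC"
print(locate_branch_point(toy_intron, toy_read))  # 30
```

A read taken from contiguous intron sequence returns None here, since no suffix of it matches the 5' end of the intron.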

Circular RNAs have known functions. Circular RNA sponge for miR-7 (ciRS-7) binds the miRNA miR-7 (Hansen et al., 2013a), affecting the expression of many oncogenes (Hansen, Kjems, & Damgaard, 2013b). Similarly, the circular Sry transcript in mouse (Capel et al., 1993), produced from “head to tail” splicing of the Sry pre-mRNA, has been shown to interact with miR-138 (Hansen et al., 2013a). Additionally, ciRNAs have been shown to regulate the expression level of their parent transcript (Y. Zhang et al., 2013).

A variety of techniques can be employed to prove that an RNA is circular (Jeck & Sharpless, 2014). First, circular and lariat RNAs are resistant to RNase R digestion. Because RNase R cannot digest all linear RNAs, digestion by RNase R should not be the only evidence used to prove that a given RNA is circular. Another way to distinguish circular RNA from linear RNA is to perform an RNase H digestion on the RNA with an oligo somewhere in the middle of the circle. After digestion, circular RNAs will be linearized and appear as only one band on a gel whereas linear RNAs will be broken into two smaller RNA fragments. A third option is to look for retarded mobility of circular RNAs in a gel, since their shape makes the circular RNAs appear to run slower than their linear counterparts (Chapman & Boeke, 1991). Additionally, DBR1 should be able to linearize lariat RNAs but not circular RNAs. Sequence confirmation around the lariat/circle can also help to prove that a molecule is circular, especially if the sequence traverses the circle multiple times.
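The expected outcome of the RNase H test can be written down as simple arithmetic. The sketch below is an idealized illustration: it assumes a single clean cut at the oligo position and ignores any nucleotides lost within the RNA:DNA hybrid. The function name and example lengths are hypothetical.

```python
def rnase_h_fragments(length, cut_site, circular):
    """Return expected fragment lengths (nt) after one oligo-directed cut.

    length   -- size of the RNA in nucleotides
    cut_site -- position of the cut, counted from the 5' end of a linear RNA
                (any position on a circle is equivalent)
    circular -- True for a covalently closed circle, False for a linear RNA
    """
    if not 0 < cut_site < length:
        raise ValueError("cut_site must fall strictly inside the RNA")
    if circular:
        return [length]  # one cut opens the circle: a single full-length band
    return sorted([cut_site, length - cut_site], reverse=True)  # two bands


print(rnase_h_fragments(500, 200, circular=True))   # [500]
print(rnase_h_fragments(500, 200, circular=False))  # [300, 200]
```

A single band at the full length is therefore consistent with a circle, while two bands whose sizes sum to the original length indicate a linear molecule.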

Sequencing technologies

Rapid developments in sequencing technology in the last decade have led to many advances in genomics research. The cost of sequencing is decreasing faster than Moore’s Law predicts (G. E. Moore, 1965), making quick adoption and wide use of sequencing technology feasible. This availability of sequencing technology has prompted the development of many assays to measure quantities and locations of nucleic acids in cells. In the area of RNA biology, techniques have been developed to measure gene expression levels, relative splice isoform abundance (Pan, Shai, Lee, Frey, & Blencowe, 2008; E. T. Wang et al., 2008), locations of ribosomes on mRNAs (Ingolia, Ghaemmaghami, Newman, & Weissman, 2009), poly(A) site locations (Jan, Friedman, Ruby, & Bartel, 2011; Spies, Burge, & Bartel, 2013), transcript initiation site locations (Arribere & Gilbert, 2013), sites of RNA modification (Carlile et al., 2014), and nascent RNAs (Core, Waterfall, & Lis, 2008; Khodor et al., 2011; Paulsen et al., 2014), just to name a few. These data are often generated in large volumes and raise new computational challenges for analysis.

Today there are many different kinds of sequencing instruments (Pareek, Smoczynski, & Tretyn, 2011) that provide different read lengths and depths of sequencing. Illumina sequencers produce shorter reads, typically in the range of 40-250 nt, and sequence millions of fragments per run. The Ion Torrent instrument uses an alternative method for reading DNA bases and generates similar read lengths to the Illumina platforms (Salipante et al., 2014). Other technologies, such as machines developed by PacBio and Oxford Nanopore, sequence fewer fragments per run but allow for long read sequencing, averaging 5-15 kb (Goodwin, Gurtowski, Ethe-Sayers, & Deshpande, 2015; PacBio, 2014), which will likely be important for accurately identifying full length mRNA isoforms and sequencing around circular RNAs to confirm their shape (Tilgner et al., 2015; You et al., 2015).

Thesis overview

When I began my PhD in 2009, many of the aforementioned sequencing technologies were in their infancy while others were becoming popular. Though there were several technical challenges to overcome, for the first time it seemed feasible to sequence BPs on a genome-wide scale. This combination of factors, along with my interest in how sequence elements contribute to gene regulation and my desire to perform both experimental and computational research, led me to study the role of the BP sequence in regulation of splicing. Chapter II of this thesis describes my findings regarding yeast BPs and novel splicing events in S. cerevisiae. Chapter III contains suggested future applications of BP sequencing methods. In the first half of Appendix I, I describe the Branch-seq protocol in detail, including secondary tips that will be helpful for anyone performing the protocol in the future. In the second half of Appendix I, I offer suggestions for further development of the current Branch-seq protocol and for development of alternative BP sequencing methods. Appendix II contains the supplemental tables from Chapter II. Finally, Appendix III describes my application of Branch-seq to metazoans, focusing on Drosophila, and the first report of recursive splicing in a short intron.

References

Abelson, J., Trotta, C. R., & Li, H. (1998). tRNA Splicing. Journal of Biological Chemistry, 273(21), 12685–12688. doi:10.1074/jbc.273.21.12685
Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., & Walter, P. (2007). Molecular Biology of the Cell. Garland Science.
Arribere, J. A., & Gilbert, W. V. (2013). Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing. Genome Research, 23(6), 977–987. doi:10.1101/gr.150342.112
Awan, A. R., Manfredo, A., & Pleiss, J. A. (2013). Lariat sequencing in a unicellular yeast identifies regulated alternative splicing of exons that are evolutionarily conserved with humans. Proceedings of the National Academy of Sciences of the United States of America, 110(31), 12762–12767. doi:10.1073/pnas.1218353110
Bachellerie, J.-P., Cavaillé, J., & Hüttenhofer, A. (2002). The expanding snoRNA world. Biochimie, 84(8), 775–790. doi:10.1016/S0300-9084(02)01402-5
Barbosa-Morais, N. L., Irimia, M., Pan, Q., Xiong, H. Y., Gueroussov, S., Lee, L. J., et al. (2012). The evolutionary landscape of alternative splicing in vertebrate species. Science (New York, N.Y.), 338(6114), 1587–1593. doi:10.1126/science.1230612
Biniszkiewicz, D., Cesnaviciene, E., & Shub, D. A. (1994). Self-splicing group I intron in cyanobacterial initiator methionine tRNA: evidence for lateral transfer of introns in bacteria. The EMBO Journal, 13(19), 4629.
Bitton, D. A., Rallis, C., Jeffares, D. C., Smith, G. C., Chen, Y. Y. C., Codlin, S., et al. (2014). LaSSO, a strategy for genome-wide mapping of intronic lariats and branch points using RNA-seq. Genome Research, 24(7), 1169–1179. doi:10.1101/gr.166819.113
Bonen, L., & Vogel, J. (2001). The ins and outs of group II introns. TRENDS in Genetics, 17(6), 322–331. doi:10.1016/S0168-9525(01)02324-1
Bradley, R. K., Merkin, J., Lambert, N. J., & Burge, C. B. (2012). Alternative Splicing of RNA Triplets Is Often Regulated and Accelerates Proteome Evolution. PLoS Biology, 10(1), e1001229. doi:10.1371/journal.pbio.1001229
Burnette, J. M., Miyamoto-Sato, E., Schaub, M. A., Conklin, J., & Lopez, A. J. (2005). Subdivision of large introns in Drosophila by recursive splicing at nonexonic elements. Genetics, 170(2), 661–674. doi:10.1534/genetics.104.039701
Capel, B., Swain, A., Nicolis, S., Hacker, A., Walter, M., Koopman, P., et al. (1993). Circular transcripts of the testis-determining gene Sry in adult mouse testis. Cell, 73(5), 1019–1030.
Carlile, T. M., Rojas-Duran, M. F., Zinshteyn, B., Shin, H., Bartoli, K. M., & Gilbert, W. V. (2014). Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature, 515(7525), 143–146. doi:10.1038/nature13802
Cech, T. R. (1990). Self-splicing of group I introns. Annual Review of Biochemistry, 59, 543–568. doi:10.1146/annurev.bi.59.070190.002551
Chapman, K. B., & Boeke, J. D. (1991). Isolation and characterization of the gene encoding yeast debranching enzyme. Cell, 65(3), 483–492. doi:10.1016/0092-8674(91)90466-C
Core, L. J., Waterfall, J. J., & Lis, J. T. (2008). Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science (New York, N.Y.), 322(5909), 1845–1848. doi:10.1126/science.1162228
Corrionero, A., Miñana, B., & Valcárcel, J. (2011). Reduced fidelity of branch point recognition and alternative splicing induced by the anti-tumor drug spliceostatin A. Genes & Development, 25(5), 445–459. doi:10.1101/gad.2014311
Corvelo, A., Hallegger, M., Smith, C. W. J., & Eyras, E. (2010). Genome-wide association between branch point properties and alternative splicing. PLoS Computational Biology, 6(11), e1001016. doi:10.1371/journal.pcbi.1001016
Costa, M., Fontaine, J. M., Loiseaux-de Goër, S., & Michel, F. (1997). A group II self-splicing intron from the brown alga Pylaiella littoralis is active at unusually low magnesium concentrations and forms populations of molecules with a uniform conformation. Journal of Molecular Biology, 274(3), 353–364.
Davis, C. A. (2000). Test of intron predictions reveals novel splice sites, alternatively spliced mRNAs and new introns in meiotically regulated genes of yeast. Nucleic Acids Research, 28(8), 1700–1706. doi:10.1093/nar/28.8.1700
DeBoever, C., Ghia, E. M., Shepard, P. J., Rassenti, L., Barrett, C. L., Jepsen, K., et al. (2015). Transcriptome Sequencing Reveals Potential Mechanism of Cryptic 3' Splice Site Selection in SF3B1-mutated Cancers. PLoS Computational Biology, 11(3), e1004105. doi:10.1371/journal.pcbi.1004105
Dietrich, R. C., Incorvaia, R., & Padgett, R. A. (1997). Terminal Intron Dinucleotide Sequences Do Not Distinguish between U2- and U12-Dependent Introns. Molecular Cell, 1(1), 151–160. doi:10.1016/S1097-2765(00)80016-7
Duff, M. O., Olson, S., Wei, X., Garrett, S. C., Osman, A., Bolisetty, M., et al. (2015). Genome-wide identification of zero nucleotide recursive splicing in Drosophila. Nature, 521(7552), 376–379. doi:10.1038/nature14475
Duff, M. O., Olson, S., Wei, X., Osman, A., Plocik, A., Bolisetty, M., et al. (2014). Genome-wide Identification of Zero Nucleotide Recursive Splicing in Drosophila. bioRxiv. doi:10.1101/006163
Dumesic, P. A., & Madhani, H. D. (2014). Recognizing the enemy within: licensing RNA-guided genome defense. Trends in Biochemical Sciences, 39(1), 25–34. doi:10.1016/j.tibs.2013.10.003
Dumesic, P. A., Natarajan, P., Chen, C., Drinnenberg, I. A., Schiller, B. J., Thompson, J., et al. (2013). Stalled spliceosomes are a signal for RNAi-mediated genome defense. Cell, 152(5), 957–968. doi:10.1016/j.cell.2013.01.046
Effenberger, K. A., Anderson, D. D., Bray, W. M., Prichard, B. E., Ma, N., Adams, M. S., et al. (2014). Coherence between cellular responses and in vitro splicing inhibition for the anti-tumor drug pladienolide B and its analogs. Journal of Biological Chemistry, 289(4), 1938–1947. doi:10.1074/jbc.M113.515536
Findlay, G. M., Boyle, E. A., Hause, R. J., Klein, J. C., & Shendure, J. (2014). Saturation editing of genomic regions by multiplex homology-directed repair. Nature, 513(7516), 120–123. doi:10.1038/nature13695
Folco, E. G., & Reed, R. (2014). In vitro systems for coupling RNAP II transcription to splicing and polyadenylation. Methods in Molecular Biology (Clifton, NJ), 1126, 169–177. doi:10.1007/978-1-62703-980-2_13
Folco, E. G., Coil, K. E., & Reed, R. (2011). The anti-tumor drug E7107 reveals an essential role for SF3b in remodeling U2 snRNP to expose the branch point-binding region. Genes & Development, 25(5), 440–444. doi:10.1101/gad.2009411
Gao, K., Masuda, A., Matsuura, T., & Ohno, K. (2008). Human branch point consensus sequence is yUnAy. Nucleic Acids Research, 36(7), 2257–2267. doi:10.1093/nar/gkn073
Garrey, S. M., Katolik, A., Prekeris, M., Li, X., York, K., Bernards, S., et al. (2014). A homolog of lariat-debranching enzyme modulates turnover of branched RNA. RNA (New York, N.Y.), 20(8), 1337–1348. doi:10.1261/rna.044602.114
Goodwin, S., Gurtowski, J., Ethe-Sayers, S., & Deshpande, P. (2015). Oxford Nanopore Sequencing and de novo Assembly of a Eukaryotic Genome. bioRxiv.
Gozani, O., Feld, R., & Reed, R. (1996). Evidence that sequence-independent binding of highly conserved U2 snRNP proteins upstream of the branch site is required for assembly of spliceosomal complex A. Genes & Development, 10(2), 233–243.
Graubert, T. A., Shen, D., Ding, L., Okeyo-Owuor, T., Lunn, C. L., Shao, J., et al. (2012). Recurrent mutations in the U2AF1 splicing factor in myelodysplastic syndromes. Nature Genetics, 44(1), 53–57. doi:10.1038/ng.1031
Grossman, J. S., Meyer, M. I., Wang, Y. C., Mulligan, G. J., Kobayashi, R., & Helfman, D. M. (1998). The use of antibodies to the polypyrimidine tract binding protein (PTB) to analyze the protein components that assemble on alternatively spliced pre-mRNAs that use distant branch points. RNA (New York, N.Y.), 4(6), 613–625.
Hachet, O., & Ephrussi, A. (2004). Splicing of oskar RNA in the nucleus is coupled to its cytoplasmic localization. Nature, 428(6986), 959–963. doi:10.1038/nature02521
Hahn, C. N., & Scott, H. S. (2012). Spliceosome mutations in hematopoietic malignancies. Nature Genetics, 44(1), 9–10. doi:10.1038/ng.1045
Hallegger, M., Sobala, A., & Smith, C. W. J. (2010). Four exons of the serotonin receptor 4 gene are associated with multiple distant branch points. RNA (New York, N.Y.), 16(4), 839–851. doi:10.1261/rna.2013110
Hansen, T. B., Jensen, T. I., Clausen, B. H., Bramsen, J. B., Finsen, B., Damgaard, C. K., & Kjems, J. (2013a). Natural RNA circles function as efficient microRNA sponges. Nature, 495(7441), 384–388. doi:10.1038/nature11993
Hansen, T. B., Kjems, J., & Damgaard, C. K. (2013b). Circular RNA and miR-7 in Cancer. Cancer Research, 73(18), 5609–5612. doi:10.1158/0008-5472.CAN-13-1568
Hatton, A. R., Subramaniam, V., & Lopez, A. J. (1998). Generation of alternative Ultrabithorax isoforms and stepwise removal of a large intron by resplicing at exon-exon junctions. Molecular Cell, 2(6), 787–796.
Hirose, T., Shu, M.-D., & Steitz, J. A. (2003). Splicing-dependent and -independent modes of assembly for intron-encoded box C/D snoRNPs in mammalian cells. Molecular Cell, 12(1), 113–123.
Hocine, S., Singer, R. H., & Grünwald, D. (2010). RNA processing and export. Cold Spring Harbor Perspectives in Biology, 2(12), a000752. doi:10.1101/cshperspect.a000752
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S., & Weissman, J. S. (2009). Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science (New York, N.Y.), 324(5924), 218–223. doi:10.1126/science.1168978
Jan, C. H., Friedman, R. C., Ruby, J. G., & Bartel, D. P. (2011). Formation, regulation and evolution of Caenorhabditis elegans 3'UTRs. Nature, 469(7328), 97–101. doi:10.1038/nature09616
Jeck, W. R., & Sharpless, N. E. (2014). Detecting and characterizing circular RNAs. Nature Biotechnology, 32(5), 453–461. doi:10.1038/nbt.2890
Kameyama, T., Suzuki, H., & Mayeda, A. (2012). Re-splicing of mature mRNA in cancer cells promotes activation of distant weak alternative splice sites. Nucleic Acids Research, 40(16), 7896–7906. doi:10.1093/nar/gks520
Kawashima, T., Douglass, S., Gabunilas, J., Pellegrini, M., & Chanfreau, G. F. (2014). Widespread use of non-productive alternative splice sites in Saccharomyces cerevisiae. PLoS Genetics, 10(4), e1004249. doi:10.1371/journal.pgen.1004249
Khalid, M. F., Damha, M. J., Shuman, S., & Schwer, B. (2005). Structure-function analysis of yeast RNA debranching enzyme (Dbr1), a manganese-dependent phosphodiesterase. Nucleic Acids Research, 33(19), 6349–6360. doi:10.1093/nar/gki934
Khan, S. G., Metin, A., Gozukara, E., Inui, H., Shahlavi, T., Muniz-Medina, V., et al. (2004). Two essential splice lariat branchpoint sequences in one intron in a xeroderma pigmentosum DNA repair gene: mutations result in reduced XPC mRNA levels that correlate with cancer risk. Human Molecular Genetics, 13(3), 343–352. doi:10.1093/hmg/ddh026
Khodor, Y. L., Rodriguez, J., Abruzzi, K. C., Tang, C.-H. A., Marr, M. T., & Rosbash, M. (2011). Nascent-seq indicates widespread cotranscriptional pre-mRNA splicing in Drosophila. Genes & Development, 25(23), 2502–2512. doi:10.1101/gad.178962.111
Kim, J.-W., Kim, H.-C., Kim, G.-M., Yang, J.-M., Boeke, J. D., & Nam, K. (2000). Human RNA lariat debranching enzyme cDNA complements the phenotypes of Saccharomyces cerevisiae dbr1 and Schizosaccharomyces pombe dbr1 mutants.
Kiss, T., & Filipowicz, W. (1995). Exonucleolytic processing of small nucleolar RNAs from pre-mRNA introns. Genes & Development, 9(11), 1411–1424.
Kotake, Y., Sagane, K., Owa, T., Mimori-Kiyosue, Y., Shimizu, H., Uesugi, M., et al. (2007). Splicing factor SF3b as a target of the antitumor natural product pladienolide. Nature Chemical Biology, 3(9), 570–575. doi:10.1038/nchembio.2007.16
Královicová, J., Lei, H., & Vorechovský, I. (2006). Phenotypic consequences of branch point substitutions. Human Mutation, 27(8), 803–813. doi:10.1002/humu.20362
Kuhsel, M., Strickland, R., & Palmer, J. (1990). An ancient group I intron shared by eubacteria and chloroplasts. Science (New York, N.Y.), 250(4987), 1570–1573. doi:10.1126/science.2125748
Langford, C. J., & Gallwitz, D. (1983). Evidence for an intron-contained sequence required for the splicing of yeast RNA polymerase II transcripts. Cell, 33(2), 519–527. doi:10.1016/0092-8674(83)90433-6
Lareau, L. F., Inada, M., Green, R. E., Wengrod, J. C., & Brenner, S. E. (2007). Unproductive splicing of SR genes associated with highly conserved and ultraconserved DNA elements. Nature, 446(7138), 926–929. doi:10.1038/nature05676
Lim, L. P., & Burge, C. B. (2001). A computational analysis of sequence features involved in recognition of short introns. Proceedings of the National Academy of Sciences of the United States of America, 98(20), 11193–11198. doi:10.1073/pnas.201407298
Matlin, A. J., Clark, F., & Smith, C. W. J. (2005). Understanding alternative splicing: towards a cellular code. Nature Reviews. Molecular Cell Biology, 6(5), 386–398. doi:10.1038/nrm1645
Mercer, T. R., Clark, M. B., Andersen, S. B., Brunck, M. E., Haerty, W., Crawford, J., et al. (2015). Genome-wide discovery of human splicing branchpoints. Genome Research, 25(2), 290–303. doi:10.1101/gr.182899.114
Merkin, J., Russell, C., Chen, P., & Burge, C. B. (2012). Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science (New York, N.Y.), 338(6114), 1593–1599. doi:10.1126/science.1228186
Meyer, M., Plass, M., Pérez-Valle, J., Eyras, E., & Vilardell, J. (2011). Deciphering 3'ss selection in the yeast genome reveals an RNA thermosensor that mediates alternative splicing. Molecular Cell, 43(6), 1033–1039. doi:10.1016/j.molcel.2011.07.030
Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics Magazine.
Mösch, H. U., & Fink, G. R. (1997). Dissection of filamentous growth by transposon mutagenesis in Saccharomyces cerevisiae. Genetics, 145(3), 671–684.
Nam, K., Hudson, R. H., Chapman, K. B., Ganeshan, K., Damha, M. J., & Boeke, J. D. (1994). Yeast lariat debranching enzyme. Substrate and sequence specificity. The Journal of Biological Chemistry, 269(32), 20613–20621.
Ni, J. Z., Grate, L., Donohue, J. P., Preston, C., Nobida, N., O'Brien, G., et al. (2007). Ultraconserved elements are associated with homeostatic control of splicing regulators by alternative splicing and nonsense-mediated decay. Genes & Development, 21(6), 708–718. doi:10.1101/gad.1525507
Ooi, S. L., Dann, C., Nam, K., Leahy, D. J., Damha, M. J., & Boeke, J. D. (2001). RNA lariat debranching enzyme. Methods in Enzymology, 342, 233–248.
Ott, S., Tamada, Y., Bannai, H., Nakai, K., & Miyano, S. (2003). Intrasplicing--analysis of long intron sequences. Pacific Symposium on Biocomputing, 339–350.
PacBio. (2014, October 15). New Chemistry Boosts Average Read Length to 10 kb – 15 kb for PacBio® RS II. Blog.Pacificbiosciences.com. Retrieved May 31, 2015, from http://blog.pacificbiosciences.com/2014/10/new-chemistry-boosts-average-read.html
Padgett, R. A., Hardy, S. F., & Sharp, P. A. (1983). Splicing of adenovirus RNA in a cell-free transcription system. Proceedings of the National Academy of Sciences of the United States of America, 80(17), 5230–5234.
Padgett, R. A., Konarska, M. M., Aebi, M., Hornig, H., Weissmann, C., & Sharp, P. A. (1985). Nonconsensus branch-site sequences in the in vitro splicing of transcripts of mutant rabbit beta-globin genes. Proceedings of the National Academy of Sciences of the United States of America, 82(24), 8349–8353.
Padgett, R. A., Konarska, M. M., Grabowski, P. J., Hardy, S. F., & Sharp, P. A. (1984). Lariat RNA's as intermediates and products in the splicing of messenger RNA precursors. Science (New York, N.Y.), 225(4665), 898–903.
Pan, Q., Shai, O., Lee, L. J., Frey, B. J., & Blencowe, B. J. (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics, 40(12), 1413–1415. doi:10.1038/ng.259
Pareek, C. S., Smoczynski, R., & Tretyn, A. (2011). Sequencing technologies and genome sequencing. Journal of Applied Genetics, 52(4), 413–435. doi:10.1007/s13353-011-0057-x
Parra, M. K., Tan, J. S., Mohandas, N., & Conboy, J. G. (2008). Intrasplicing coordinates alternative first exons with alternative splicing in the protein 4.1R gene. The EMBO Journal, 27(1), 122–131. doi:10.1038/sj.emboj.7601957
Patel, A. A., & Steitz, J. A. (2003). Splicing double: insights from the second spliceosome. Nature Reviews. Molecular Cell Biology, 4(12), 960–970. doi:10.1038/nrm1259
Paulsen, M. T., Veloso, A., Prasad, J., Bedi, K., Ljungman, E. A., Magnuson, B., et al. (2014). Use of Bru-Seq and BruChase-Seq for genome-wide assessment of the synthesis and stability of RNA. Methods (San Diego, Calif.), 67(1), 45–54. doi:10.1016/j.ymeth.2013.08.015
Phizicky, E. M., & Hopper, A. K. (2010). tRNA biology charges to the front. Genes & Development, 24(17), 1832–1860. doi:10.1101/gad.1956510
Query, C. C., Moore, M. J., & Sharp, P. A. (1994). Branch nucleophile selection in pre-mRNA splicing: evidence for the bulged duplex model. Genes & Development, 8(5), 587–597. doi:10.1101/gad.8.5.587
Quesada, V., Conde, L., Villamor, N., Ordóñez, G. R., Jares, P., Bassaganyas, L., et al. (2012). Exome sequencing identifies recurrent mutations of the splicing factor SF3B1 gene in chronic lymphocytic leukemia. Nature Genetics, 44(1), 47–52. doi:10.1038/ng.1032
Rain, J. C. (1997). In vivo commitment to splicing in yeast involves the nucleotide upstream from the branch site conserved sequence and the Mud2 protein. The EMBO Journal, 16(7), 1759–1771. doi:10.1093/emboj/16.7.1759
Reinhold-Hurek, B., & Shub, D. A. (1992). Self-splicing introns in tRNA genes of widely divergent bacteria. Nature, 357(6374), 173–176. doi:10.1038/357173a0
Ruby, J. G., Jan, C. H., & Bartel, D. P. (2007). Intronic microRNA precursors that bypass Drosha processing. Nature, 448(7149), 83–86. doi:10.1038/nature05983
Ruskin, B., & Green, M. R. (1985). An RNA processing activity that debranches RNA lariats. Science (New York, N.Y.), 229(4709), 135–140.
Salipante, S. J., Kawashima, T., Rosenthal, C., Hoogestraat, D. R., Cummings, L. A., Sengupta, D. J., et al. (2014). Performance comparison of Illumina and ion torrent next-generation sequencing platforms for 16S rRNA-based bacterial community profiling. Applied and Environmental Microbiology, 80(24), 7583–7591. doi:10.1128/AEM.02206-14
Sharp, P. A. (1991). "Five easy pieces". Science (New York, N.Y.), 254(5032), 663.
Sibley, C. R., Emmett, W., Blazquez, L., Faro, A., Haberman, N., Briese, M., et al. (2015). Recursive splicing in long vertebrate genes. Nature, 521(7552), 371–375. doi:10.1038/nature14466
Smith, C. W., & Nadal-Ginard, B. (1989). Mutually exclusive splicing of alpha-tropomyosin exons enforced by an unusual lariat branch point location: implications for constitutive splicing. Cell, 56(5), 749–758.
Spies, N., Burge, C. B., & Bartel, D. P. (2013). 3' UTR-isoform choice has limited influence on the stability and translational efficiency of most mRNAs in mouse fibroblasts. Genome Research, 23(12), 2078–2090. doi:10.1101/gr.156919.113
Spingola, M., Grate, L., Haussler, D., & Ares, M. (1999). Genome-wide bioinformatic and molecular analysis of introns in Saccharomyces cerevisiae. RNA (New York, N.Y.), 5(2), 221–234.
Stoss, O., Schwaiger, F. W., Cooper, T. A., & Stamm, S. (1999). Alternative splicing determines the intracellular localization of the novel nuclear protein Nop30 and its interaction with the splicing factor SRp30c. The Journal of Biological Chemistry, 274(16), 10951–10962.
Suzuki, H. (2006). Characterization of RNase R-digested cellular RNA source that consists of lariat and circular RNAs from pre-mRNA splicing. Nucleic Acids Research, 34(8), e63–e63. doi:10.1093/nar/gkl151
Taggart, A. J., DeSimone, A. M., Shih, J. S., Filloux, M. E., & Fairbrother, W. G. (2012). Large-scale mapping of branchpoints in human pre-mRNA transcripts in vivo. Nature Structural & Molecular Biology, 19(7), 719–721. doi:10.1038/nsmb.2327
Takashima, Y., Ohtsuka, T., González, A., Miyachi, H., & Kageyama, R. (2011). Intronic delay is essential for oscillatory expression in the segmentation clock. Proceedings of the National Academy of Sciences of the United States of America, 108(8), 3300–3305. doi:10.1073/pnas.1014418108
Tilgner, H., Jahanbani, F., Blauwkamp, T., Moshrefi, A., Jaeger, E., Chen, F., et al. (2015). Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events. Nature Biotechnology. doi:10.1038/nbt.3242
Tycowski, K. T., Shu, M. D., & Steitz, J. A. (1993). A small nucleolar RNA is processed from an intron of the human gene encoding ribosomal protein S3. Genes & Development, 7(7A), 1176–1190.
Vijayraghavan, U., Parker, R., Tamm, J., Iimura, Y., Rossi, J., Abelson, J., & Guthrie, C. (1986). Mutations in conserved intron sequences affect multiple steps in the yeast splicing pathway, particularly assembly of the spliceosome. The EMBO Journal, 5(7), 1683–1695.
Vincent, H. A., & Deutscher, M. P. (2006). Substrate recognition and catalysis by the exoribonuclease RNase R.
The Journal of Biological Chemistry, 281(40), 29769–29775. doi:10.1074/jbc.M606744200 Vincenti, S., De Chiara, V., Bozzoni, I., & Presutti, C. (2007). The position of yeast snoRNA- coding regions within host introns is essential for their biosynthesis and for efficient splicing of the host pre-mRNA. RNA (New York, N.Y.), 13(1), 138–150. doi:10.1261/rna.251907 Vogel, J., Hess, W. R., & Börner, T. (1997). Precise branch point mapping and quantification of splicing intermediates. Nucleic Acids Research, 25(10), 2030–2031. Wahl, M. C., Will, C. L., & Lührmann, R. (2009). The Spliceosome: Design Principles of a Dynamic RNP Machine. Cell, 136(4), 701–718. doi:10.1016/j.cell.2009.02.009 Wallace, J. C., & Edmonds, M. (1983). Polyadenylylated nuclear RNA contains branches. Proceedings of the National Academy of Sciences of the United States of America, 80(4), 950–954. Wan, Y., & Wu, C. J. (2013). SF3B1 mutations in chronic lymphocytic leukemia. Blood, 121(23), 4627–4634. doi:10.1182/blood-2013-02-427641 Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., et al. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470–476. doi:doi:10.1038/nature07509 Wang, L., Lawrence, M. S., Wan, Y., Stojanov, P., Sougnez, C., Stevenson, K., et al. (2011). SF3B1 and other novel cancer genes in chronic lymphocytic leukemia. The New England Journal of Medicine, 365(26), 2497–2506. doi:10.1056/NEJMoa1109016 Wang, Y., & Silverman, S. K. (2005). Efficient one-step synthesis of biologically related lariat RNAs by a deoxyribozyme. Angewandte Chemie (International Ed. in English), 44(36), 5863–5866. doi:10.1002/anie.200501643 Wu, X., Tschumper, R. C., & Jelinek, D. F. (2013). Genetic characterization of SF3B1 mutations in single chronic lymphocytic leukemia cells. Leukemia, 27(11), 2264–2267. doi:10.1038/leu.2013.155 Ye, Y., De Leon, J., Yokoyama, N., Naidu, Y., & Camerini, D. (2005). DBR1 siRNA inhibition of HIV-1 replication. 
Retrovirology, 2(1), 63. doi:10.1186/1742-4690-2-63 Yin, Q.-F., Yang, L., Zhang, Y., Xiang, J.-F., Wu, Y.-W., Carmichael, G. G., & Chen, L.-L. (2012). Long noncoding RNAs with snoRNA ends. Molecular Cell, 48(2), 219–230. doi:10.1016/j.molcel.2012.07.033 You, X., Vlatkovic, I., Babic, A., Will, T., Epstein, I., Tushev, G., et al. (2015). Neural circular RNAs are derived from synaptic genes and regulated by development and plasticity. Nature Neuroscience, 18(4), 603–610. doi:10.1038/nn.3975 37 Zhang, Y., Zhang, X.-O., Chen, T., Xiang, J.-F., Yin, Q.-F., Xing, Y.-H., et al. (2013). Circular intronic long noncoding RNAs. Molecular Cell, 51(6), 792–806. doi:10.1016/j.molcel.2013.08.017 Zhang, Z., Hesselberth, J. R., & Fields, S. (2007). Genome-wide identification of spliced introns using a tiling microarray. Genome Research, 17(4), 503–509. doi:10.1101/gr.6049107 Zheng, S., Vuong, B. Q., Vaidyanathan, B., Lin, J.-Y., Huang, F.-T., & Chaudhuri, J. (2015). Non- coding RNA Generated following Lariat Debranching Mediates Targeting of AID to DNA. Cell, 161(4), 762–773. doi:10.1016/j.cell.2015.03.020


Chapter 2: Identification of New Branch Points and Unconventional Introns in Saccharomyces cerevisiae

This research is currently under review at Genome Research.

Abstract

Spliced messages constitute one-fourth of expressed mRNAs in the yeast Saccharomyces cerevisiae, and most mRNAs in metazoans. Splicing requires 5' splice site (5'SS), branch point (BP), and 3' splice site (3'SS) elements, but the role of the BP in splicing control remains poorly understood because BP identification remains difficult. We developed a high-throughput method, Branch-seq, to map BP and 5'SS of isolated RNA lariats. Applied to S. cerevisiae, Branch-seq detected 76% of expressed, annotated BPs and identified a comparable number of novel BPs. We used RNA-seq to confirm associated 3'SS locations, identifying 136 novel splice junctions, including an AT-AC intron. We show that several yeast introns use two or even three different BPs, with effects on 3'SS choice, protein coding potential and regulation via nonsense-mediated mRNA decay (NMD), and find that some novel introns are regulated in response to environmental changes. Together, these findings reveal BP-based regulation and demonstrate unanticipated complexity of splicing in yeast.

Introduction

Pre-mRNA splicing is required for the expression of most eukaryotic genes and is often regulated. The first step of splicing involves selection of a specific base, usually an adenine, in the pre-mRNA as the BP nucleophile and formation of an unusual 2'-5' RNA linkage between the 2' OH of the BP and the 5'SS (Wahl, Will, & Lührmann, 2009). This step is followed by ligation of the two exons and freeing of the intron in the form of a branched lariat (Padgett, Konarska, Grabowski, Hardy, & Sharp, 1984). The lariat is rapidly debranched and degraded in most cases (Chapman & Boeke, 1991; Folco & Reed, 2014; Ruskin & Green, 1985), making BP identification difficult.

Current BP annotations suggest that yeast introns almost always have a single BP. However, those annotations are based on computational predictions and lack a comprehensive experimental basis (Meyer, Plass, Pérez-Valle, Eyras, & Vilardell, 2011; Spingola, Grate, Haussler, & Ares, 1999). While computational predictions of BP locations are sufficient in many cases, experimental knowledge of BP location is essential to understand the full repertoire of splicing decisions cells make. For instance, unusual BP placement is known to affect the outcome of splicing of mammalian alpha-tropomyosin. The BP upstream of the second mutually exclusive exon in alpha-tropomyosin is located very close to the 5'SS of the competing exon, preventing splicing of the intervening intron due to steric hindrance of splicing components (Smith & Nadal-Ginard, 1989). BP position can also affect usage of “NAGNAG” alternative 3'SS separated by 3 nt, a common type of alternative splicing (AS) in mammals (Bradley, Merkin, Lambert, & Burge, 2012).

Many types of AS involve regulated use of 3'SS (e.g., alternative 3'SS, exon skipping, mutually exclusive exons, alternative last exons, intron retention). In general, the relative contribution of BP recognition versus 3'SS recognition to each of these types of AS is unknown. In budding yeast, the BP is arguably more critical than the 3'SS to the first step of splicing because the 3'SS does not have to be identified by the splicing machinery until after the first step of splicing (Séraphin & Kandels-Lewis, 1993). In contrast, in metazoans (Aebi, Hornig, Padgett, Reiser, & Weissmann, 1986) and S. pombe (Reich, VanHoy, Porter, & Wise, 1992), recognition of the 3'SS often precedes the first step of splicing. A 3'SS without a BP is not sufficient for splicing of an intron, as evidenced by splicing reporters in yeast in which mutating the annotated BP motif greatly reduces splicing of the transcript (Rain, 1997; Vijayraghavan et al., 1986). Similarly, in humans, BP motif mutations can result in aberrant splicing or intron retention, which are associated with several diseases (Královicová, Lei, & Vorechovský, 2006).

Regulation of AS in yeast can occur in response to environmental cues. For example, amino acid starvation inhibits splicing of ribosomal protein genes, and exposure to other stresses can decrease or increase the splicing of different subsets of genes (Pleiss, Whitworth, Bergkessel, & Guthrie, 2007). In the case of PTC7, a serine/threonine phosphatase, AS responds to changes in the available carbon source by creating mRNA isoforms that code for unique proteins. One protein isoform localizes to the mitochondria and the other contains a transmembrane domain that causes it to localize to the nuclear envelope (Juneau, Nislow, & Davis, 2009). In mammals, AS is nearly universal and has many regulatory roles such as targeting proteins to different cellular compartments, altering transcription factor binding preferences, and influencing RNA stability (Pan, Shai, Lee, Frey, & Blencowe, 2008; E. T. Wang et al., 2008). One widespread regulator of RNA stability is nonsense-mediated mRNA decay (NMD), a pathway that degrades mRNAs that contain premature termination codons (PTCs). Several metazoan splicing factors autoregulate by altering splicing of transcripts from their own loci to shift toward increased production of unstable, NMD-targeted isoforms when protein levels are high (Sureau, 2001; Wollerton, Gooding, Wagner, Garcia-Blanco, & Smith, 2004). NMD also occurs in yeast (González, Wang, & Peltz, 2001), and can be coupled to splicing to regulate gene expression (Kawashima, Douglass, Gabunilas, Pellegrini, & Chanfreau, 2014).

Here we developed Branch-seq, a genome-wide technique to sequence lariat BPs and their associated 5'SS. We tested our method in S. cerevisiae, where every annotated intron has a confident BP prediction (Meyer et al., 2011; Spingola et al., 1999) allowing us to assess the accuracy and sensitivity of our method. Surprisingly, in addition to confirming the locations of most annotated BP, we also identified more than 200 novel BPs. This finding prompted us to further explore splicing patterns and regulatory consequences in yeast using additional genome-wide assays and data. These analyses uncovered unexpected complexities in yeast splicing, including introns with multiple BPs, an intron with AT-AC splice sites, and a gene that couples splicing to NMD for gene regulation, revealing a number of parallels to mammalian splicing that were not previously appreciated.

Results

Branch-seq accurately identifies locations of 75% of expressed, annotated BPs

Though the yeast genome sequence has been available since 1996 and studies have sought to comprehensively identify yeast introns and test those predictions (C. A. Davis, 2000; Spingola et al., 1999), genome-wide assays are still discovering additional yeast introns (Kawashima et al., 2014; Zhang, Hesselberth, & Fields, 2007). BP detection has lagged behind intron detection largely because of the short-lived nature and unique structure of lariat RNAs. BPs are typically verified using fairly laborious, low-throughput techniques such as primer extension, in vitro splicing, and RT-PCR across the lariat 5'SS-BP junction (Padgett et al., 1985; Vogel, Hess, & Börner, 1997), with alternative approaches developed only recently (Awan, Manfredo, & Pleiss, 2013; Bitton et al., 2014; Mercer et al., 2015; Taggart, DeSimone, Shih, Filloux, & Fairbrother, 2012). To date, budding yeast BPs have not been validated using a genome-wide approach.

To experimentally locate BPs, we developed Branch-seq, an untargeted, high-throughput method for identification of lariat BPs and their associated 5'SS. Initially, lariats were stabilized in vivo by deleting DBR1, which encodes the debranching enzyme that linearizes lariats in the default intron decay pathway (Chapman & Boeke, 1991). In the first step of Branch-seq, lariats were enriched from dbr1∆ total RNA using a denaturing two-dimensional (2D) polyacrylamide gel. Because the mobility of lariat RNAs (and other circular RNAs) is retarded to different extents at different gel densities compared to linear RNA, lariat and circular RNAs run in an arc above the diagonal produced by linear RNAs (Awan et al., 2013; Chapman & Boeke, 1991; Friedman & Brewer, 1995). A prominent off-diagonal arc was visible in 2D gel analysis of dbr1∆ RNA (Fig. 2-1A).

RNA was isolated from the top, middle and bottom portions of the 2D gel arc to enrich for lariats of different sizes (Fig. 2-S1A) and linearized using purified recombinant DBR1 enzyme. Following debranching, standard techniques were used to obtain libraries suitable for paired-end Illumina sequencing. This strategy yields read pairs in which the 3' mapping read corresponds to the BP, and the 5' mapping read identifies the associated 5'SS. The 3' ends mostly correspond to BPs rather than 3'SS because the lariat intermediate stabilized in dbr1∆ yeast is one in which the intron sequence 3' of the BP has been degraded (Fig. 2-S1B). To further characterize yeast introns, we performed a version of random hexamer-primed RNA-seq known as ‘Lariat-seq’ (Awan et al., 2013), again using RNA isolated from a 2D gel arc (steps 2L and 3L; discussed further below).

Branch-seq accurately identified annotated BP and 5'SS. Overall, ~60% of mappable reads corresponded to annotated introns, and ~75% of expressed yeast introns contained one or more read pairs. As an example, read pairs mapping to the intron of PCH2 are shown in Figure 2-1B. The 5' end reads (pink) predominantly began exactly at the annotated 5'SS, which matches the /GTATGT yeast consensus (with ‘/’ indicating the splice junction), while 3'-end reads predominantly began at the presumptive BP of this intron, a CACTAAC sequence near the intron 3' end (differing only at the first position, C instead of T, from the yeast BP consensus motif, TACTAAC, in which the BP nucleotide is the final A). A meta-analysis of all annotated 5'SS and BP confirmed this pattern, with a sharp peak of 5' end read starts at annotated 5'SS, and a similarly sharp peak of 3' end read starts at annotated BP (Fig. 2-1C).
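The meta-analysis of read starts described above can be illustrated with a minimal sketch. It assumes plus-strand coordinates only, and the function name, the `flank` parameter, and the toy coordinates are hypothetical and are not taken from the Branch-seq analysis code.

```python
from collections import Counter

def read_start_profile(read_starts, annotated_sites, flank=10):
    """Tally read-start offsets relative to annotated sites.

    Offset 0 means the read starts exactly at an annotated site
    (a 5'SS or BP nucleotide); negative offsets are upstream.
    """
    profile = Counter()
    for start in read_starts:
        for site in annotated_sites:
            offset = start - site
            if -flank <= offset <= flank:
                profile[offset] += 1
    return profile

# Toy data (hypothetical): two annotated BPs and six 3'-end read starts.
bp_sites = [1200, 5400]
three_prime_read_starts = [1200, 1200, 1198, 5400, 5401, 9000]
profile = read_start_profile(three_prime_read_starts, bp_sites)
for offset in sorted(profile):
    print(offset, profile[offset])
```

A sharp peak at offset 0 in such a profile is the pattern expected if read starts coincide with the annotated sites; reads far from any site (9000 in the toy data) do not contribute.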


Figure 2-1. Branch-seq accurately identifies BP locations on a genome-wide scale. (A) Schematic of the Branch-seq protocol. Steps labeled with “B” and “L” correspond to Branch-seq and Lariat-seq, respectively. (B) Branch-seq locates the annotated 5'SS (pink) and BP (blue) in the PCH2 intron (Robinson et al., 2011). Dashed lines show locations of 5'SS (GTATGT), BP (CACTAAC), and 3'SS (AG) sequences. Mismatches from consensus are underlined. BP nucleotide is red and bold. Mismatches in reads are indicated by small red, green, dark blue, and orange horizontal lines. Inset axes show read start locations for PCH2 intron 5'SS and BP reads, where position 0 is the 5'SS or BP nucleotide, respectively. (C) Meta 5'SS and BP read start plots as in (B) but for all annotated 5'SS and BP. Dotted vertical lines at +/- 2 nt. (D) Locations of BP peaks called by winBP and GEM-BP relative to annotated BP positions.

Additionally, Branch-seq finds one novel BP adjacent to an annotated BP, suggesting that the annotation needs to be changed (Fig. 2-S1F). A small secondary peak 2 bases upstream of the BP in the meta-analysis likely reflects shifted RT priming (Fig. 2-S1C-E and Supplemental Methods). These results support the utility of Branch-seq for systematic identification of yeast BP and associated 5'SS.

Branch-seq identifies novel BP and associated 5'SS

Application of two independent peak calling algorithms to Branch-seq data identified BP locations with high precision, yielding an unexpectedly large number of 268 “confident novel BPs” (cnBPs). First, we used a simple sliding window approach (winBP) to find peaks of high local read density without using any sequence information. Second, we adapted the existing GEM ChIP-seq peak caller (Guo, Mahony, & Gifford, 2012) to identify BP peaks in software called GEM-BP (Supplemental Methods). GEM-BP uses the sharply peaked distribution of read starts at BPs and strong BP motif in yeast to accurately call peaks. GEM-BP recovered 75% of expressed annotated BP within 3 nt of their annotated locations, while winBP identified 59% of expressed annotated BP, including two not found by GEM-BP, with somewhat lower precision (Table 1, Fig. 2-1D, Table II-S1).
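As an illustration of the sequence-independent sliding window idea only, the following sketch calls peaks from read-start positions. It is not the winBP implementation: the greedy windowing, the window width, and the read threshold are assumptions chosen for the example.

```python
from collections import Counter

def call_window_peaks(read_starts, window=5, min_reads=10):
    """Greedy sliding-window peak caller (simplified illustration).

    Windows are anchored at occupied positions; a window with at least
    `min_reads` read starts is reported as (summit_position, reads_in_window).
    """
    counts = Counter(read_starts)
    positions = sorted(counts)
    peaks = []
    i = 0
    while i < len(positions):
        left = positions[i]
        in_window = [p for p in positions[i:] if p < left + window]
        total = sum(counts[p] for p in in_window)
        if total >= min_reads:
            # Summit: position with the most read starts (leftmost on ties).
            summit = max(in_window, key=lambda p: (counts[p], -p))
            peaks.append((summit, total))
            i += len(in_window)  # skip past the called window
        else:
            i += 1
    return peaks

# Toy data (hypothetical): a dense cluster, a sparse site, and a single-position peak.
starts = [100] * 8 + [101] * 3 + [102] + [500] * 2 + [900] * 12
print(call_window_peaks(starts))
```

No motif information enters this calculation, which is the property that distinguishes a window-based caller from a motif-aware one such as GEM-BP.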

The BP motif is highly constrained in S. cerevisiae, with ~90% of annotated BP matching the TACTAAC motif perfectly (Spingola et al., 1999). Overall, GEM-BP peaks matched the consensus BP motif more frequently than winBP peaks, reflecting the use of a motif in the predictions by GEM-BP. To maximize sensitivity, the union of peaks called by both approaches was used, a set of 430 putative novel BP (Table II-S2). We generated a high confidence set of novel BP peaks for all downstream analyses, using the paired-end sequencing information from Branch-seq, which provides a built-in quality control for BP identification. Requiring presence of a typical 5'SS motif in the associated 5' end reads yielded a set of 268 cnBPs, with an estimated false discovery rate (FDR) of 1.1% (Fig. 2-2A, Table II-S3) (see Methods), nearly doubling the number of BPs in budding yeast. The remaining set of 162 putative novel BP with atypical 5'SS showed a modest preference for the /GTATGT consensus (Fig. 2-S2B), suggesting the presence of additional novel BPs, but was not pursued further.

Most of the 268 cnBPs were located in annotated exons, introns or UTRs, but almost one third were located outside of annotated transcripts, sometimes in regions antisense to annotated genes such as cryptic unstable transcripts (CUTs) and stable uncharacterized transcripts (SUTs) (Fig. 2-2B, 2-S2C). These observations suggest that Branch-seq can be used to extend annotation of genic as well as non-genic features in yeast. For example, in the second exon of the RPL30 gene we observed a substantial peak of more than one hundred Branch-seq reads at a variant BP motif, GGCTAAC, associated with a potential novel 5'SS, pointing to the presence of a second intron in this gene (Fig. 2-2C). The 5' end reads associated with the cnBP in RPL30 began with the sequence GTAAGT, just one mismatch from the yeast 5'SS consensus (Fig. 2-2C). As another example, we observed three distinct peaks of Branch-seq reads in the intron and second exon of the TDA5 gene. These peaks corresponded to the annotated BP (TACTAAC) and to two other sites downstream in the transcript, which were associated with motifs related to the BP motif by one (AACTAAC) or two (GTCTAAC) mismatches (Fig. 2-2D). All three of these peaks were paired with reads mapping to the annotated 5'SS. These data suggest that alternative BPs are used in splicing of this intron, likely yielding at least two or three different 3'SS.
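The mismatch counts quoted for these motifs can be checked with a short sketch. The helper name is hypothetical; positions are reported relative to the branch nucleotide (position 0), taken here to be the final A of TACTAAC.

```python
CONSENSUS = "TACTAAC"
BP_INDEX = 5  # index of the branch adenosine (final A) within TACTAAC

def bp_mismatches(motif, consensus=CONSENSUS, bp_index=BP_INDEX):
    """Return mismatch positions relative to the BP nucleotide (position 0)."""
    if len(motif) != len(consensus):
        raise ValueError("motif and consensus must have the same length")
    return [i - bp_index
            for i, (m, c) in enumerate(zip(motif.upper(), consensus))
            if m != c]

for motif in ("TACTAAC", "AACTAAC", "GTCTAAC", "GGCTAAC"):
    positions = bp_mismatches(motif)
    print(motif, len(positions), positions)
```

For the TDA5 motifs this gives one mismatch for AACTAAC and two for GTCTAAC, all upstream of the branch nucleotide; the RPL30 variant GGCTAAC also differs from the consensus at two positions.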


Figure 2-2. Branch-seq locates hundreds of novel BPs. (A) Number of annotated BP recovered by Branch-seq (light orange) compared to number of computationally predicted BP (dark orange) (Meyer et al., 2011). The cnBP (light green) are a subset of all novel BP (dark green). (B) Genomic locations of the 268 cnBP. (C-D) Novel BPs located in the CDS (C, D) and intron (D) of the RPL30 (C) and TDA5 (D) genes. Annotated TDA5 BP and 5'SS are blue. Potential AG 3'SS are depicted. 3'SS confirmed by entropy are indicated by an asterisk (C-D). Potential BP-3'SS pairing is indicated by matching colors (D). (E) Sequence motifs created by MEME of annotated BPs (left) and typical 5'SS cnBPs (middle) recovered by Branch-seq, and human BP motif (right) for comparison. Position 0 is the BP nucleotide.

Comprehensive BP sequencing allowed us to identify BP that deviate from the strict yeast consensus. Of the 268 cnBP, 51 were a perfect match to the TACTAAC consensus motif and the remaining 217 had up to 4 mismatches, yielding a more degenerate motif when aligned (Fig. 2-2E). Interestingly, the –1 position, which in yeast is considered to have a strong preference for “A”, appears to also tolerate “G”, as often seen in mammalian BP motifs (Mercer et al., 2015). The –5 to –3 positions are also more degenerate in the cnBP motif, and “T” seems to be tolerated at the –4 and –3 positions. Overall, the cnBP motif resembles known mammalian BP motifs of CTRAC or minimally TNA (R = A or G, N = any base). Surprisingly, these cnBP did not show a peak in conservation at the BP as is observed for annotated BPs, suggesting that these BPs are specific to S. cerevisiae (Fig. 2-S2A). The weaker cnBP motifs might reflect lower levels of splicing (Fig. 2-S3C, Table II-S4) or more frequent regulation of novel BPs than of annotated BPs.

Figure 2-3. Lariat-seq junction reads identify BP locations. (A) Schematic of lariat junction read mapping strategy. Green box indicates location of best 5'SS in lariat junction read. (B) Novel intron in BDF2 CDS is supported by Branch-seq reads (top, pink and blue as in Figure 2-1) and Lariat-seq junction reads (middle, 5'SS read fragments in dark green, BP read fragments in light green). Black boxes denote novel 5'SS and BP sequences identified by Branch-seq and Lariat-seq reads. (C) Summary of overlaps among novel BP identified by Lariat-seq junction reads, cnBP identified by Branch-seq, and novel splice junctions identified by RNA-seq.

We compared our approach for BP detection to a recently described approach that uses ‘lariat junction’ (LJ) reads that originate from reverse transcription across the 5'SS to BP junction of the lariat (Awan et al., 2013; Bitton et al., 2014; Taggart et al., 2012). For this purpose, we identified Lariat-seq reads that were composed of a pair of segments that mapped near each other, but in a discordant order, and used the ends of these read segment pairs to define 5'SS and BP locations (Fig. 2-3A, 2-S2D, Table II-S5). For example, we detected 23 reads that crossed 5'SS (GTAAGT) and BP (CACTAAC) of an unannotated intron in the BDF2 gene, which encodes a transcription factor (Fig. 2-3B). The yield of LJ reads was nearly two orders of magnitude higher in Lariat-seq data (450 per 10^6 reads) than in conventional RNA-seq data (5.5 per 10^6), confirming that Lariat-seq synergizes with the LJ approach (Awan et al., 2013; Taggart et al., 2012). These LJ reads confirmed 41 annotated BPs and 17 novel BPs (Table II-S5), several of which overlapped with cnBPs identified by Branch-seq (Fig. 2-3C). Differences in the novel BPs recovered by Lariat-seq and Branch-seq are in part due to the lariat sizes successfully recovered by each method (Fig. 2-S3D).
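The logic of defining 5'SS and BP locations from discordantly ordered read segments can be sketched as follows. This is an illustration under stated assumptions, not the mapping pipeline used here: it handles plus-strand alignments only, assumes the read is in RNA sense (so the segment ending at the BP comes first in the read and the segment starting at the 5'SS comes second), and uses a hypothetical `Segment` type and `max_span` cutoff.

```python
from typing import NamedTuple, Optional, Tuple

class Segment(NamedTuple):
    read_offset: int  # where the segment begins within the read
    ref_start: int    # 0-based genomic start of the aligned segment
    ref_end: int      # genomic end (exclusive)

def lariat_junction(first: Segment, second: Segment,
                    max_span: int = 2000) -> Optional[Tuple[int, int]]:
    """Return (five_ss, branch_point) if the two segments look like an LJ read.

    `first` must precede `second` in the read. In an LJ read the later
    segment maps upstream of the earlier one (discordant order).
    """
    if first.read_offset >= second.read_offset:
        return None  # segments not given in read order
    if second.ref_start >= first.ref_start:
        return None  # concordant order: ordinary linear or spliced read
    if first.ref_end - second.ref_start > max_span:
        return None  # segments do not map near each other
    five_ss = second.ref_start        # first intron nucleotide
    branch_point = first.ref_end - 1  # last nucleotide before the junction
    return five_ss, branch_point

# Toy example (hypothetical coordinates).
print(lariat_junction(Segment(0, 1450, 1480), Segment(30, 1000, 1020)))
```

In practice the candidate 5'SS end would also be checked against the expected 5'SS motif, as indicated by the green box in Fig. 2-3A; that step is omitted from the sketch.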

Over 100 additional introns and splice sites in the yeast genome

The unexpectedly large number of new BPs identified by our approaches prompted us to further explore yeast splicing patterns using RNA-seq, which is complementary to Branch-seq in that it identifies 3'SS as well as 5'SS. We hypothesized that some cnBP might be located inside unannotated introns that derive from spliced mRNAs that are quickly degraded by NMD. Therefore, we performed RNA-seq on a upf1∆ strain (which is defective for NMD), as well as wildtype (WT) and dbr1∆ strains, and used stringent criteria to define novel splice junctions from the data. Briefly, RNA-seq reads were mapped using TopHat, allowing novel junction discovery (Kim et al., 2013). After mapping, we required a minimum splice junction entropy of at least 2 bits (Graveley et al., 2011), which corresponds to uniform coverage of at least 4 distinct start positions, or more variable coverage of a larger number of positions (Fig. 2-S4A).
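The 2-bit threshold can be made concrete with a small sketch that computes the Shannon entropy of the start positions of reads spanning one junction. The function name and toy inputs are hypothetical; only the entropy definition is assumed.

```python
import math
from collections import Counter

def junction_entropy(read_starts):
    """Shannon entropy (bits) of read-start positions for one splice junction."""
    counts = Counter(read_starts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

uniform_four = [10, 11, 12, 13]               # 4 distinct starts, evenly covered
pileup = [10] * 100                           # many reads, one start position
skewed = [10] * 97 + [11, 12, 13]             # 4 starts, highly uneven coverage

print(junction_entropy(uniform_four))
print(junction_entropy(pileup))
print(junction_entropy(skewed))
```

Four evenly covered start positions give exactly log2(4) = 2 bits, whereas a junction supported by many reads that all share one start position scores 0 bits, so the filter rewards diversity of supporting reads rather than read count alone.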

This approach yielded 136 unannotated splice junctions, 38 of which were observed in a recent study (Kawashima et al., 2014) (Table II-S6). In all, 115 novel introns overlapped 88 annotated genes. Analysis of this set yielded a bias for ribosomal protein genes, a class which is enriched for annotated introns as well (Spingola et al., 1999). Comparing the locations of cnBPs defined by Branch-seq (n = 268) and novel introns defined by RNA-seq (n = 136), we observed a degree of overlap (n = 22) that was significant (binomial test, p<0.001) but relatively modest (Fig. 2-3C), likely because the two protocols have different biases in the RNAs they capture. We observed that novel splice junctions with RNA-seq entropy ≥2 bits are strongly biased toward shorter genes (Fig. 2-S4C). This trend likely reflects features of yeast messages, which have shorter poly(A) tails than in metazoa and can be stable in non-polyadenylated form, biasing recovery by poly(A)+ RNA-seq (Hu, Sweet, Chamnongpol, Baker, & Coller, 2009; Presnyak et al., 2015). Branch-seq also has a bias toward recovery of BPs from shorter introns (Fig. 2-S3A,B,D), but the associated introns are located in genes from all yeast length classes (Fig. 2-S4C).

The novel introns identified by RNA-seq exhibited several characteristics of known yeast introns. The distribution of their lengths mirrored that of known introns (Fig. 2-4A). The splice sites of novel introns resembled motifs of annotated introns, but showed more deviation from the consensus, especially at the +4 and +6 positions of the 5'SS, and lacked a polypyrimidine tract at the 3'SS (Fig. 2-S4B), consistent with a recent report (Kawashima et al., 2014). Presence of weaker splice site motifs suggested that splicing of these introns might be less efficient and/or regulated, making them more difficult to detect.

New splice sites have distinctive features and conservation

The novel splice junctions identified by RNA-seq were associated with distinct patterns of evolutionary sequence conservation across yeast species compared to annotated introns, suggesting a level of evolutionary constraint that is above background but well below that of annotated introns (Fig. 2-4B). As expected, conservation declined sharply downstream of the 5'SS and increased just upstream of the 3'SS for most annotated introns. For novel splice sites that fell inside of annotated coding sequences, a modest decline in conservation was observed after the 5'SS. This pattern might be expected for an intron that is spliced in some species but retained (or alternatively spliced) in others, giving rise to a conservation pattern that is intermediate between those of typical exons and introns. Novel pairs of splice sites located in annotated introns exhibited a different pattern, with low conservation overall (expected in introns), but with elevated conservation about 20 nt upstream of the 3'SS, in the vicinity of the predicted BP location (shaded yellow, Fig. 2-4B).

Figure 2-4. RNA-seq discovers additional novel introns. (A) Length distribution of annotated (blue) and novel (red) splice junctions. Novel splice junctions include any junction with entropy ≥2. (B) Conservation of splice sites for annotated splice sites (black) and novel splice sites located in annotated CDS (blue), introns (yellow), and outside of ORFs (green). Average predicted BP location for intronic 3'SS is denoted with dotted line, shading is +/- 1 standard deviation (only plotting -30 to +30 nt around the splice site). For 5'SS, annotated n=282, CDS n=14, intron n=19, intergenic n=18. For 3'SS, annotated n=282, CDS n=34, intron n=7, intergenic n=18. (C) Effect on coding length of ORFs from splicing out of novel introns. Predicted change to the coding sequence of REC107 (D) and RUB1 (E) after splicing out novel introns. Red arrow indicates location of RUB1 protein cleavage prior to its addition to substrates. (F) RT-PCR sequence (black) aligns to the annotated intron of RPL30 (light blue) (Kent, 2002; http://genome.ucsc.edu, SacCer3 genome assembly). Colored triangles represent splice sites. Grey, annotated splice sites; red, AT-AC 5'SS; orange, AT-AC 3'SS 1; green, AT-AC 3'SS 2. Depending on which AC/ 3'SS is used, the second AUG is either 104 nt or 170 nt downstream of the truncated main ORF. (G) WebLogo of published U2-dependent AT-AC intron 5'SS and 3'SS (Sheth et al., 2006). RPL30 AT-AC splice sites are shown below.

Often, splicing of the novel introns shortened the predicted protein sequence by at least 50% (Fig. 2-4C) as in the case of REC107 (Fig. 2-4D), a gene involved early on during meiotic recombination (Malone et al., 1991). Even for cases in which the splicing did not change the length of the ORF dramatically, the function of the protein might be altered, as is the case in RUB1, a ubiquitin-like protein (Fig. 2-4E). The predicted RUB1 protein resulting from the spliced mRNA would be shortened by just 7 residues, but with altered composition of the 20 C-terminal residues, including loss of the C-terminal glycine, which is used in ligation of RUB1 to its targets (Vierstra & Callis, 1999).

AT-AC splice sites are used in yeast

No /AT-AC/ introns have been reported in yeast. However, we identified RNA-seq splice junction reads supporting the splicing of a novel intron nested inside the annotated RPL30 intron, which had an /AT 5'SS that spliced to one of two different AC/ 3'SS, one of which had high entropy (> 3 bits). The unconventional AT-AC isoform with distal AC/ 3'SS was supported in the WT, dbr1∆, and upf1∆ RNA-seq datasets (Fig. 2-S4A), and we confirmed both novel AT-AC splice junctions by RT-PCR and sequencing (Fig. 2-4F, 2-S5).

By RNA-seq analysis, AT-AC splicing is much less abundant than the canonical isoform, representing ~1-2% of mRNAs from this highly-expressed gene locus (see supplemental text in Methods).

Yeast lack the distinct machinery of the minor spliceosome that splices most known /AT-AC/ introns in metazoans (Russell, Charette, Spencer, & Gray, 2006), and as expected, the 5'SS motif of the RPL30 intron bore no resemblance to the highly conserved /ATATCCTT consensus typical of animal and plant U12-type AT-AC introns. However, it also deviated quite substantially from the consensus of the few dozen major spliceosomal (‘U2-type’) AT-AC introns that are known in metazoans (Sheth et al., 2006), which have a very strong /ATAAGT consensus (Fig. 2-4G), raising questions about the mechanism of its splicing (see Discussion).

Multi-BP introns occur in at least twelve genes and can impact gene expression

Branch-seq revealed 11 unconventional introns that make use of two BPs (Fig. 2-5A) and one intron that uses three BPs (Fig. 2-2D). In about half of these cases, the novel BP is located in a long intron, but is much closer to the 5'SS than the annotated BP (Fig. 2-5A), consistent with preferential selection of small lariats by Branch-seq (Fig. 2-S3A,B,D). In one case a methylation guide snoRNA, snR18, is located between two BPs, and in two other cases a putative ORF occurs between BPs (Fig. 2-5B). In the snR18 case, use of the upstream BP would shift the intron-encoded RNA from the lariat loop to the lariat tail. Overall, for introns that use two BPs, the first BP tends to have a weaker motif than the downstream BP (Fig. 2-5C).

Branch-seq identified two BPs in the LSM2 gene (Fig. 2-5D), which encodes an Sm-like protein that has both nuclear and cytoplasmic functions and plays a role in RNA processing and turnover (Beggs, 2005). The novel BP in LSM2, AACTAAC, is upstream of the annotated BP and allows for splicing to a novel 3'SS, located between the annotated and novel BPs (Fig. 2-5D,E). The novel isoform, which was confirmed by RT-PCR and sequencing (Fig. 2-5E), contains a PTC in the newly included portion of the downstream exon, making it a potential NMD target (Fig. 2-5D). Isoform-specific primers used for qRT-PCR showed that the PTC isoform is up-regulated about 3-fold in upf1∆ yeast, with the annotated isoform remaining unchanged (Fig. 2-5E,F), implicating NMD in targeting of the novel isoform. Thus, it is likely that AS of the novel isoform is used to regulate the level of LSM2 message and protein.

Figure 2-5. Alternative BP usage reveals previously unknown nonsense-mediated mRNA decay splice isoform. (A) Distance from 5'SS to BP for first and second BP in introns that use two BPs. Red line: x=y. (B) Three genes from (A) where the novel BP (red) is located close to the 5'SS and far from the annotated BP (blue). Intronic transcript position shown below each intron, direction indicated with white arrows. (C) Motif of upstream BP (top) and downstream BP (bottom) for 11 introns that use two BPs. (D) Branch-seq read coverage from the top, middle, and bottom sections of the 2D gel arc (Figure 2-S1A) corresponds to usage of the canonical LSM2 BP (blue dotted line and circle) and a “new” BP (red dotted line and circle). Potential alternative 3'SS usage would insert a premature termination codon (octagon stop sign). (E) RT-PCR and subsequent sequencing confirmed the novel LSM2 PTC isoform. (F) qPCR verification that the LSM2 PTC isoform is up-regulated in upf1 null yeast.

Changes in splicing among growth conditions

To investigate the regulation of novel AS, we mapped RNA-seq data from 18 different environmental growth conditions to annotated and novel intron junctions to assess intron retention (Fig. 2-6A) (Waern & Snyder, 2013). MISO (Katz, Wang, Airoldi, & Burge, 2010) was used to quantify “percent spliced in” (PSI, representing the fraction of a gene’s mRNAs that retain the intron) across samples. Novel introns generally had high PSI values relative to previously annotated introns, with some exceptions (Fig. 2-6A). Even some novel splice sites that are poorly conserved in other yeast species, such as RPL43B and MTR2 (Fig. 2-S6), undergo splicing changes in response to environmental conditions.

Overall, we observed increased intron retention during stationary phase as compared to all other environmental conditions analyzed (Fig. 2-6A). Additionally, growth in salt or juice conditions each seem to have unique effects on the splicing profile. We also observed a group of genes that have substantial but not complete intron retention during most growth conditions (bottom of Fig. 2-6A).

Since AS is known to be more common in yeast during meiosis than in vegetative growth, we analyzed RNA-seq and ribosome footprint profiling data from a detailed meiosis time course (Brar et al., 2012) to determine if the splicing and translation of the novel splice isoforms that do not appear to be regulated under mitotic growth (Fig. 2-6A, black boxes) are regulated during the dramatic cellular transformation of meiosis. We observe that the YNL194C-YNL195C splice fusion transcript, whose splicing was previously confirmed (Miura et al., 2006), is rarely spliced during environmental stress conditions (Fig. 2-6A), but is differentially represented in the translatome during meiosis (Fig. 2-6B,C, Fig. 2-S7). Changes in translation of these novel isoforms during meiosis suggest involvement in meiotic progression, as in the case of YNL194C, an integral membrane protein required for sporulation (Young et al., 2002) (Fig. 2-6B,C). The other novel isoforms might be regulated under other conditions, or might represent new introns that have yet to evolve function.

Figure 2-6. Rare splicing of novel retained introns mirrors splicing patterns of known introns. Clustering of psi values calculated by MISO for retained introns in (A) RNA-seq of 18 environmental conditions (Waern & Snyder, 2013) and (B) Ribosome Footprint Profiling data in a meiosis time course (Brar et al., 2012). Black; retained introns. Purple; introns spliced out. Side bar: red; novel intron, blue; annotated intron. If psi value was confident in at least half of the samples, unknown psi values were replaced with the mean psi value in the confident conditions. *alternative splice site to annotated splice site. ** only one splice site overlaps gene ORF listed. ***antisense to an annotated transcript. ****intron likely in unannotated UTR of transcript. Alternative splice sites were considered as retained introns if the annotated introns were not detected with entropy≥2. YLL056C: 5'UTR supported by RNA-seq. IDP3: 5'SS inside ORF. RFU1 and RSB1: 3'SS inside ORF. (C) Sashimi plot (Katz et al., 2010) depiction of ribosome footprint profiling splice junction reads from (B) joining YNL194 and YNL195 transcripts at a few stages of meiosis.

Discussion

The Branch-seq approach introduced here allowed comprehensive identification of BPs and associated 5'SS from individual lariat RNAs with high precision (Fig. 2-1). These BPs revealed unexpected post-transcriptional regulatory capacity of the yeast genome. Examples included genes that make use of multiple BPs with various regulatory consequences. For example, alternative BP usage in the LSM2 pre-mRNA alters 3'SS usage, producing a message that is degraded by NMD, enabling regulation of expression level at the level of splicing (Fig. 2-5), analogous to regulatory strategies used by a number of metazoan splicing factors (Sureau, 2001; Wollerton et al., 2004). We also found that the EFB1 intron contains an alternative BP whose use shifts the location of the snoRNA snR18 between the lariat loop and the lariat tail (Fig. 2-5B), potentially impacting the relative production of mature mRNA and mature snoRNA as seen for other snoRNAs (Hirose et al., 2006).

Some novel introns were spliced at intermediate or even high levels (e.g., those in the MTR2, RPL22, and RPL43B genes), but most appear to be spliced at lower levels than annotated introns in standard growth conditions (Fig. 2-6A). Considering all introns, we observed a large increase in intron retention during stationary growth that affected most known introns, as well as some novel introns such as the one in RPL43B (Fig. 2-6A). Increased retention of a substantial subset of introns was observed in salt stress; however, most novel introns maintained relatively low levels of splicing across most of the environmental conditions examined. Nevertheless, some of the novel introns that exhibited low and constant splicing across all environmental conditions, like PDC1, exhibited large changes in splicing during meiosis (Fig. 2-6B). This observation suggests that some yeast introns may have specialized regulation that is only observed when considering a large number of cell states and conditions. Additionally, further analysis of different types of AS, including alternative 5'SS and 3'SS, may reveal regulation under the conditions we have examined or other conditions.

Defining novel splice junctions from RNA-seq led us to find and validate an intron with AT-AC splice site dinucleotides, nested inside the annotated intron of RPL30. To our knowledge, this is the first AT-AC splice site intron reported in S. cerevisiae. The relatively rare AT-AC introns that occur in metazoans are often spliced by the U12-dependent “minor” spliceosome (Russell et al., 2006). However, the extended 5'SS and BP motifs characteristic of U12-dependent introns are absent in this case, and no evidence for presence of U12 snRNA or related machinery has been found in S. cerevisiae, ruling out involvement of the minor spliceosome (Russell et al., 2006). The U2-dependent “major” spliceosome is also capable of splicing a small subset of AT-AC introns (Dietrich, Incorvaia, & Padgett, 1997). However, the RPL30 /AT 5'SS does not resemble typical /AT 5'SS spliced by the major spliceosome (Fig. 2-4G), leaving open the question of whether this intron is indeed spliced by the major spliceosome or by some other mechanism (e.g., a protein enzyme or an RNA-based self-splicing mechanism) (Kruger et al., 1982).

Recently, recurrent mutations in several core spliceosome components that recognize BPs and intron 3' ends, including U2 snRNP component SF3B1 and the U2AF1 and U2AF2 genes, have been observed in leukemias (Quesada et al., 2012), raising interest in understanding details of BP and 3'SS recognition. Branch-seq is a powerful method for detection of BPs in small lariats, and yeast have provided many insights into the inner workings of the spliceosome (Hossain & Johnson, 2014), making this a suitable system for application of Branch-seq to study BP regulation. Owing to the ease of creating double mutants in yeast, Branch-seq could be used to study the effects of perturbations of the core splicing machinery on BP selection by crossing the desired spliceosome mutant to dbr1∆ yeast. Applying this method to other organisms with small introns, such as other fungi, plants, or Drosophila, could aid in detection of novel introns or regulatory mechanisms, such as recursive splicing (Burnette, Miyamoto-Sato, Schaub, Conklin, & Lopez, 2005) or stalled splicing (Dumesic et al., 2013).

Methods

Yeast strains

Strains were grown in YPD (1% yeast extract, 2% peptone, 0.01% adenine hemisulfate, 2% dextrose) at 30°C with vigorous shaking unless otherwise noted. The null strains were obtained from the deletion collection. WT (S288C): BY4742 MATα his3Δ1 leu2Δ0 lys2Δ0 ura3Δ0. dbr1Δ: BY4742 MATα his3Δ1 leu2Δ0 lys2Δ0 ura3Δ0 YKL149C::KanMX4. upf1Δ: BY4742 MATα his3Δ1 leu2Δ0 lys2Δ0 ura3Δ0 YMR080C::Kan.

RNA isolation

RNA isolation for Lariat-seq and Branch-seq was performed as follows. Yeast were grown to OD600 0.94-0.98 and were collected by centrifugation at 7000 RPM for 5 min at 4°C. Media was poured off and yeast were washed twice in water and frozen at -80°C. Cells were thawed and transferred to tubes containing 2.8mm ceramic beads and 1mL Trizol (Life Technologies) was added to 1/10 cell pellet. An Omni Bead Ruptor was used to lyse the cells, twice for 20 seconds on ½ max speed and once for 10 seconds on max speed. Samples were incubated at room temp for 5 min, 1/5 volume of chloroform was added and mixed, samples were incubated at room temp for 2-3 min and were spun at max speed for 15 min at 4°C. The upper aqueous layer was transferred to a new tube and precipitated with ½ volume isopropanol. After 5min on ice, samples were spun at max speed, 4°C for 25 min. The RNA pellet was washed with 70% ethanol before storage at -80°C.

RNA isolation for RNA-seq was performed as follows. Overnight yeast cultures were grown in 5mL YPD media and were diluted in the morning into 50mL YPD and grown to log phase (OD600 0.5 to 1), spun down, and the pellets were frozen in liquid nitrogen. RNA was isolated as described (Clarkson, Gilbert, & Doudna, 2010). Pellets were resuspended in 1mL acid phenol and an equal volume of AES buffer (50mM NaAcetate pH 5.2, 10mM EDTA, 1% SDS) was added. In 2mL eppendorf tubes, samples were incubated at 65°C for 10 min with vortexing every minute. Samples were incubated on ice for 5 min and then transferred to a phaselock tube and one volume chloroform was added. After spinning, the top aqueous layer was transferred to a fresh phaselock tube and one volume of phenol:chloroform:isoamyl alcohol (25:24:1) was added, tubes were spun, one volume of chloroform was added, tubes were spun, and the aqueous layer was transferred to a fresh tube to be precipitated with 50uL 3M NaOAc (pH 5.5) and 550uL isopropanol. Samples were spun at max speed for 25 minutes at 4°C. The pellet was washed twice with 70% ethanol and resuspended in water.

2D PAGE Gels

For all 2D polyacrylamide gels, RNA was mixed with an equal volume of denaturing loading dye and heated at 80-95°C prior to loading. For the Branch-seq gels, ultra-pure SequaGel reagents from National Diagnostics were used to pour 6% (first dimension: D1) and 20% (second dimension: D2) gels. These gels were poured with 1.5mm spacers and ~20cm by ~32cm plates and a metal heat sink was used. D1 was poured the night before with 12 wells and stored at 4°C in saran wrap to maintain moisture. D1 was run at 15W for 1hr and 45 min, stained with SYBR Gold, and imaged on a Safe Light. D2 was poured while D1 was running using a comb with one large well. After removing the D2 comb, running buffer (TBE) was added to the well to aid in D1 gel insertion. A single lane of the D1 gel was cut out with a clean razor and slid into the D2 gel using tweezers and a razor blade, taking care to minimize the number of air bubbles at the interface between the D1 and D2 gels. Additional loading dye was added on top of the D1 gel slice in the D2 gel for easy visualization of running of the D2 gel. D2 was run at 30W for 6hr and 30 min. Gels were stained with SYBR Gold. Slices were excised, and in the case of Fig. 2-S1A were frozen at -20°C.

The gels used for Lariat-seq were pre-cast mini gels from Invitrogen (D1: 6%; D2: 10%), in which the wells were manually cut out to make one large well.

RNA was eluted from 2D gels in 12mL PAGE elution buffer (30 mM Tris-HCl (pH 7.5), 300 mM NaCl, and 3 mM EDTA) (Ooi et al., 2001) by rotation overnight at 4°C. RNA was precipitated with isopropanol and glycogen.

Debranching enzyme purification

S. cer. DBR1 cDNA was generated from WT S288C yeast and cloned into the pET151 expression vector from Invitrogen. Protein was expressed in Rosetta 2(DE3)pLysS competent cells grown in YT media at 37°C until they were induced with IPTG and grown at 18°C. Bacteria were lysed using Native Lysis Buffer (Qiagen). Protein was purified with a Ni-NTA column (Qiagen) and subsequently over an S200 column (Buffer: 125 mM KCl, 20mM HEPES pH 7.3, 1mM DTT, 10% glycerol). Protein was concentrated (final 50% glycerol) and flash frozen. Protein was tested for RNase activity and debranching activity on linear RNA and an in vitro-spliced lariat.

Isolation of in vitro-spliced Drosophila melanogaster lariat RNA

HeLa nuclear extracts for in vitro splicing were a kind gift from the Reed Lab (Folco, Lei, Hsu, & Reed, 2012). Coupled in vitro transcription and splicing were performed similarly to Folco and Reed (Folco & Reed, 2014), except without addition of α-amanitin, to obtain as many lariats as possible. Reactions were digested with RNase R (Epicentre) at 37°C for 1 hour to obtain radiolabeled FTZ lariats.

Debranching

Debranching was performed similarly to the protocol of Ooi et al. (2001). Briefly, RNA and debranching enzyme were incubated for 1 hour at 30°C in debranching buffer (5X debranching buffer: 100 mM HEPES, 625 mM KCl, 2.5 mM MgCl2, 5 mM DTT, 50% glycerol). Prior to debranching of the top, middle, and bottom fractions of lariats, radiolabeled FTZ lariat RNA was spiked into each sample to confirm debranching via gel electrophoresis. Samples were phenol chloroform extracted after debranching and ethanol precipitated.

PolyA tailing

Debranched lariat RNA was poly(A) tailed using E. coli poly(A) polymerase from NEB for 10 minutes at 37°C and subsequently phenol chloroform extracted and isopropanol precipitated.

Reverse transcription

Reverse transcription was performed using primer /5Phos/AGATCGGAAGAGCGTCGTGTAGGGAAAG/iSp18/CACTCA/iSp18/GTGACTGGAGTTCCTTGGCACCCGAGAATTCCA/TTTTTTTTTTTTTTTTTTTTVN (designed in collaboration with Yarden Katz (Katz et al., 2014)) incubated with SuperScriptIII RT (Invitrogen) for 30 min at 48°C. Subsequently 2.1 uL of 1M NaOH was added and samples were incubated at 98°C for 15 min. The RT primer is a modified version of the ribosome footprint profiling RT primer in which the 5' end of the RNA gets sequenced first and paired end, barcoded sequencing is possible (Ingolia, Ghaemmaghami, Newman, & Weissman, 2009).

The samples were then run on a 6% TBE-urea gel (Invitrogen) for 93 min at 200V to remove excess RT primer. Gels were stained with SYBR Gold and gel slices were excised where product was observed to run above the RT primer for the top, middle, and bottom lariat samples. Gel slices were shredded and DNA was eluted in 400uL PAGE elution buffer overnight (see 2D gels above). Gel was removed before precipitation using a NanoSep column.

Circularization

Circligase (Epicentre) was used to circularize the gel isolated RT products for 1 hour at 60°C and the enzyme was inactivated by heating at 80°C for 10 minutes.

PCR

Phusion high-fidelity polymerase (NEB) was used to amplify the circularized products. Illumina PCR primer 1.0

AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

was paired with Illumina barcode primers (RPI#s)

(RPI1) CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA

(RPI2) CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA

(RPI3) CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA

(RPI4) CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA

Samples were removed after 6, 8, 10 and 12 PCR cycles and run on an 8% TBE gel (Invitrogen) for 40 min at 200V. PCR products were gel isolated by shredding the gel through a hole poked with a needle in the bottom of a 0.5 mL eppendorf tube and eluted in 400uL PAGE elution buffer (see above) at 65°C, shaking at 1400RPM for one hour. Gel was removed with a NanoSep column and DNA was precipitated with isopropanol.

Oligonucleotide sequences © 2006-2008 Illumina, Inc. All rights reserved. http://epigenome.usc.edu/docs/resources/core_protocols/Illumina%20Sequence%20Information%20for%20Customers%20DEC2008.pdf

Sequencing

One Illumina MiSeq flow cell was sequenced at the MIT Bio Micro Center (November 2011). 5' end reads were 50 bases and 3' end reads were 250 bases. 3' end reads were sequenced with custom sequencing primer GTGACTGGAGTTCCTTGGCACCCGAGAATTCCATTTTTTTTTTTTTTTTTTTT to avoid sequencing the un-templated As added by the poly(A) tailing reaction. The 3' end sequencing primer was gel purified prior to use in sequencing (primer design might need to be changed for sequencing on other Illumina machines).

Branch-seq read mapping

Reads were trimmed to 30 by 30nt and mapped with Bowtie1 (Langmead, Trapnell, Pop, & Salzberg, 2009) (bowtie-1.0.0) using the following parameters: bowtie -S -m 1 -1 end1reads.fastq -2 end2reads.fastq. Branch-seq reads for each gel slice were mapped to the genome and then combined using samtools merge (samtools-0.1.7a) (Li et al., 2009).

Reads were initially mapped to SacCer2 (S288C_reference_genome_R61-1-1_20080605) and subsequently to SacCer3 (S288C_reference_genome_R64-1-1_20110203) downloaded from SGD (http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/).

Peak calling

Peak calling was done in SacCer2 and peak calls were converted to SacCer3 coordinates using the UCSC browser tool liftOver (http://genome.ucsc.edu/). Peaks were called using the combined reads from the top, middle, and bottom sections of the arc (see samtools merge above). For Figure 2-1D, if there were multiple peaks within 3nt of the annotated BP, the annotated BP was only counted once.

winBP peak calling

The sliding window approach was adapted from Arribere and Gilbert (Arribere & Gilbert, 2013) with some modifications. A 200nt region was taken starting at the 5' end of each chromosome. Average read coverage per nucleotide, α, for this region was calculated using only BP end (second end) reads and was required to be at least 0.1. A sliding window of 5 nt (196 of these windows/200nt region) within each 200nt region was used to reduce spurious calls in regions with uneven coverage. If coverage in the 5 nt sliding window was at least 12α, a peak was called. At least 1nt was required between reported peaks.

Peak calling was done for each strand, always in the 5' to 3' direction. The 200nt region was shifted 100nt down the chromosome, and the steps outlined above were repeated until reaching the end of the chromosome.

winBP recovered 58% (153/260) (Table 1) of annotated BPs in expressed genes.
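As an illustration only, the window test described above can be sketched in Python. This is not the code used for the analysis. The function name `call_window_peaks` and its arguments are hypothetical, "coverage in the window" is assumed to mean the mean depth over the 5 nt window, and the deepest position in a passing window is assumed to be the reported peak; the text does not specify either choice.

```python
from typing import List, Sequence


def call_window_peaks(
    coverage: Sequence[float],
    region_len: int = 200,   # region size from the text
    step: int = 100,         # regions are shifted by 100 nt
    window: int = 5,         # sliding window size
    min_alpha: float = 0.1,  # minimum mean coverage per nucleotide
    fold: float = 12.0,      # window must reach 12 * alpha
    min_gap: int = 1,        # nucleotides required between peaks
) -> List[int]:
    """Call peaks on one strand.

    `coverage` holds per-nucleotide BP-end read depth, already oriented
    5' to 3' for the strand being scanned (an assumption of this sketch).
    Returned positions are 0-based indices into `coverage`.
    """
    peaks: List[int] = []
    for start in range(0, len(coverage), step):
        region = list(coverage[start:start + region_len])
        if len(region) < window:
            continue
        alpha = sum(region) / len(region)  # mean coverage per nucleotide
        if alpha < min_alpha:
            continue  # region too sparsely covered to call peaks
        for w in range(len(region) - window + 1):
            win = region[w:w + window]
            # Assumption: window coverage is summarized by its mean.
            if sum(win) / window < fold * alpha:
                continue
            # Assumption: report the deepest position inside the window.
            pos = start + w + max(range(window), key=win.__getitem__)
            # Keep at least `min_gap` nucleotides between reported peaks.
            if all(abs(pos - p) > min_gap for p in peaks):
                peaks.append(pos)
    return sorted(peaks)
```

With the default settings a full 200 nt region yields 196 windows, matching the count given above. Overlapping regions can propose the same position twice, which the spacing check absorbs.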

GEM-BP peak calling

To discover BP events from the data, we extended the ChIP-seq and ChIP-exo peak caller GEM4 that calls events with high spatial resolution. Unlike other peak callers, GEM does not assume any specific distribution of reads, and therefore is flexible to adapt to a new data type by learning a data-specific empirical spatial read distribution. We used a +/-10bp window around the confident set of annotated BPs to learn the empirical read distribution (Fig. 2-1C) and used it for peak calling by GEM. To avoid including noisy reads from the non-BP strand, we modified GEM to perform single-strand peak calling and used only the 3' end (BP end) reads as input. As part of the integrated event finding and motif discovery process, GEM discovered the consensus BP motif TACTAAC, some variants that are similar to the consensus motif, and a poly A motif that represents technical artifacts resulting from the anchored oligo(dT) RT step of the protocol. To distinguish events associated with different motifs, we modified GEM to use multiple position weight matrix (PWM) motifs as the positional priors for event discovery. If a base position is matched by multiple motifs, GEM chooses the PWM model that has a more significant p-value to set the positional prior. For each called event, GEM computes an event shape score that quantifies the similarity of the event read distribution to the empirical read distribution. The event shape score is defined as the Pearson correlation of read count values across the +/-10bp bases between the called event and the empirical read distribution. The new functionalities of the GEM software, which we called GEM-BP, were implemented in version 2.6. The following parameters were used to analyze the Branch-seq data: --k 7 --a 2 --q 1 --bp --pp_pwm --not_update_model --nrf --nf.
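The event shape score defined above is a Pearson correlation over the 21 positions of the +/-10bp window. A minimal sketch of that calculation follows; it is not taken from the GEM source. The function name `event_shape_score` is hypothetical, and returning 0.0 for a flat profile is an assumption made here to avoid division by zero.

```python
from math import sqrt
from typing import Sequence


def event_shape_score(event_counts: Sequence[float],
                      empirical: Sequence[float]) -> float:
    """Pearson correlation between an event's read counts and the
    empirical read distribution over the same window (e.g. 21 positions
    for +/-10 bp around the event)."""
    if len(event_counts) != len(empirical) or len(empirical) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(empirical)
    mean_x = sum(event_counts) / n
    mean_y = sum(empirical) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(event_counts, empirical))
    var_x = sum((x - mean_x) ** 2 for x in event_counts)
    var_y = sum((y - mean_y) ** 2 for y in empirical)
    if var_x == 0 or var_y == 0:
        # Assumption of this sketch: a flat profile carries no shape
        # information, so it is scored as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)
```

Because the correlation is scale invariant, an event with the same shape as the empirical distribution scores close to 1 regardless of its read depth, which is why read count is kept as a separate feature.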

We then post-processed the GEM-BP event calls to discover BP events using a Random Forest classifier (Breiman, 2001) in the MATLAB software (MathWorks, 2012). The features for the Random Forest include GEM-BP event read count, event shape score, and the binary motif categorical variables. We used the GEM-BP calls that overlap with the annotated BPs as the positive training set, and those that overlap with the tRNA genes as the negative training set. The trained Random Forest classifier was then applied to all of the GEM-BP event calls to make the final BP event calls.

In total, GEM-BP discovered 546 BPs (Table 1), including 75% of expressed BPs (196/260) (Table 1) within 3 nt of their annotated locations (Fig. 2-1D). Of 546 GEM-BP predicted BPs, 47 (8.6%) had more than one mismatch from the BP consensus motif TACTAAC, compared to 74 (21.5%) of the 344 peaks identified by the winBP approach. These numbers indicate that the GEM-BP predictions are more biased toward consensus BPs, presumably because of its use of motif information and training on annotated BPs, which match the consensus very closely, information which is not used by the winBP approach. Thus, we used the union of predictions made by both peak callers for subsequent analyses.

Typical 5'SS filter for putative novel BPs

GEM-BP and winBP together called numerous unannotated BPs in the yeast genome; the union of their peak calls yielded 430 putative novel BP peaks in all (Table 1). To define a high confidence subset of putative novel BPs, the paired-end sequencing information from Branch-seq was used as a built-in quality control for BP identification. Branch-seq data contain strand-specific read pairs connecting the BPs and 5'SS. Authentic novel BPs resulting from splicing should be associated with a plausible 5'SS motif at the start of the associated 5' end reads, while artefactual BP peaks would not be expected to have such a motif.

For each BP, we took all BP end reads (3' end) within 5nt of the BP peak, accounting for strand. We obtained the paired 5'SS read for each BP read in this set and noted the location of the 5'SS read start. We calculated the mode position from all 5'SS read starts for that BP and looked at the 6mer motif at that position and at one position 3'. We considered 6mers that matched the yeast 5'SS consensus GTATGT perfectly or with at most one mismatch as ‘typical 5'SS motifs’, and all others as ‘atypical 5'SS motifs’.
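The filter just described can be sketched as follows. This is a simplified illustration, not the original analysis code. The names `classify_5ss` and `count_mismatches` are hypothetical, positions are assumed to be 0-based offsets into a sequence already oriented along the transcribed strand, and ties for the mode are assumed to resolve to the first position encountered.

```python
from collections import Counter
from typing import Sequence

CONSENSUS_5SS = "GTATGT"  # yeast 5'SS consensus given in the text


def count_mismatches(kmer: str, consensus: str = CONSENSUS_5SS) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(kmer, consensus))


def classify_5ss(read_starts: Sequence[int], sequence: str,
                 max_mismatches: int = 1) -> str:
    """Label the 5'SS paired with one BP as 'typical' or 'atypical'.

    `read_starts` are the start positions of the 5' end reads paired
    with BP end reads near the BP peak; `sequence` is the transcribed
    strand (both are assumptions of this sketch).
    """
    if not read_starts:
        raise ValueError("at least one 5' end read start is required")
    mode_pos = Counter(read_starts).most_common(1)[0][0]
    # Check the 6mer at the mode position and at one position 3' of it.
    for pos in (mode_pos, mode_pos + 1):
        sixmer = sequence[pos:pos + 6].upper()
        if len(sixmer) == 6 and count_mismatches(sixmer) <= max_mismatches:
            return "typical"
    return "atypical"
```

For example, `classify_5ss([4, 4, 5], "AAAAGTACGTCCCC")` inspects GTACGT, which has one mismatch from the consensus and is therefore labelled typical.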

Almost all (97%) annotated yeast introns in nuclear genes have typical 5'SS motifs by this definition (Table II-S3). Of the Branch-seq 3' end peaks that were associated with annotated BP, 76% (149/196) and 90% (138/153) had 5' end peaks at the annotated 5'SS for GEM-BP and winBP, respectively (Table II-S1). This result indicates that our approach can reliably and comprehensively map both the BP and 5'SS of introns, as intended.

After applying the typical 5'SS filter to the 430 putative novel BPs, 268 cnBP remained. This subset of 268 should be treated as highly confident and was used for all downstream analyses. We estimate the FDR for the set of 268 BPs to be 1.1%, which is the genomic background frequency of 6mers matching typical 5'SS motifs in the yeast nuclear genome (Table II-S3, see below).

As a note, the overlap between the GEM-BP and winBP cnBP was only 80 BPs (Table 1), further suggesting that the two methods have different strengths and weaknesses in their ability to call novel BPs and that there is benefit to using both methods.

Lariat-seq library prep

Reverse transcription was performed on 2D gel isolated lariat RNA using 1uL Random Hexamer Primers (3ug/uL) (Invitrogen) and SuperScript III reverse transcriptase (Invitrogen). RNA and primer mix was heated at 70°C for 10 minutes and then put on ice. 12 uL of Mix A (mix A: 4uL 5x 1st strand buffer, 2uL 100mM DTT, 1uL dNTPs (10mM), 4uL Actinomycin D [1mg/1mL], 1uL SuperaseIn (20U/uL)) was added to the RNA and primer. Then 1 uL of SSIII was added and the RT program was run: 25°C 10 minutes, 42°C 50 minutes, 70°C 15 minutes, 4°C hold. Sample volume was brought up to 200uL with water and then samples were phenol chloroform extracted and ethanol precipitated. Second strand synthesis was performed with DNA pol I and dUTP to make strand specific libraries. Next the samples underwent SPRI-TE (end repair, adenylation, adapter ligation, gel purification #1). Subsequently uracil digestion was performed with USER, and samples underwent PCR and gel purification before sequencing (1/30 of a HiSeq2000 lane).

RNA-seq library prep

RNA was isolated using the hot acid phenol method (see RNA isolation above) to ensure isolation of high quality RNA. All 6 samples (2 WT, 2 dbr1Δ, 2 upf1Δ) had RQN (quality) values of 8.8 or higher as measured on the Advanced Analytical machine. Strand specific libraries were prepared by the MIT Bio Micro Center using the TruSeq™ RNA Sample Prep Kit v2 (RS-122-2101 kit) through cDNA, after which LM-PCR was performed using the Beckman Coulter SPRIte system with a 200-400bp size cutoff. Samples were barcoded and all sequenced in one HiSeq2000 lane.

Genomic background frequency of 5'SS motif

One random position was selected in each of the 298 nuclear encoded intron containing genes in the SacCer3 genome annotation. The 6mer motif beginning at this location was scored for the number of mismatches from “GTATGT.” This was done 10 times to obtain 2980 simulated 5'SS in introns. 10 motifs had 0 mismatches and 24 motifs had 1 mismatch for an estimated FDR of 1.1% ((10+24)/2980) (Table II-S3).
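The sampling procedure above can be sketched in a few lines of Python. This is illustrative only and does not reproduce the original run: `background_rate` and its `gene_seqs` input are hypothetical, the random seed is arbitrary, and the sequences to sample from would need to be supplied from the SacCer3 annotation. The last line simply restates the arithmetic reported in the text.

```python
import random
from typing import Sequence

CONSENSUS_5SS = "GTATGT"


def count_mismatches(kmer: str, consensus: str = CONSENSUS_5SS) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(kmer, consensus))


def background_rate(gene_seqs: Sequence[str], rounds: int = 10,
                    max_mismatches: int = 1, seed: int = 0) -> float:
    """Fraction of randomly placed 6mers that look like a typical 5'SS.

    One random 6mer start is drawn per sequence per round, mirroring the
    one-position-per-gene, ten-round design described in the text.
    """
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    hits = 0
    total = 0
    for _ in range(rounds):
        for seq in gene_seqs:
            if len(seq) < 6:
                continue  # too short to hold a 6mer
            start = rng.randrange(len(seq) - 5)
            total += 1
            if count_mismatches(seq[start:start + 6].upper()) <= max_mismatches:
                hits += 1
    return hits / total if total else 0.0


# Arithmetic from the text: 10 perfect + 24 one-mismatch hits in 298 * 10 draws.
reported_fdr = (10 + 24) / (298 * 10)  # about 0.0114, i.e. 1.1%
```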

Lariat tails are largely absent in vivo

Lariat tails appear to be efficiently digested in vivo, as previously reported, evidenced by a dearth of Lariat-seq reads in the long lariat tail of UBC13 (Fig. 2-S1B). With Branch-seq we are able to see RT priming preferences based on the nucleotides left downstream of the BP nucleotide after digestion of the lariat tail. It appears 2 nt are generally left after the BP, resulting in RT priming peaks that begin at the +1 or +2 position relative to the BP (Fig 2-S1C) depending on the genomic sequence at those positions (Fig. 2-S1D-E). The peak at -2 relative to the BP is likely due to mispriming of RT (Fig. 2-S1D). See Fig. 2-S1 legend for more information. To our knowledge, this is the first report of the precise number of nucleotides downstream of the BP nucleotide left undigested in lariat tails from RNA isolated from dbr1∆ yeast.

Mapping lariat junction reads

Lariat junction reads were identified and aligned in four main steps:

1. Reads were attempted to be aligned to the S.cer. genome using the Bowtie

(version 1.0.0) read aligner and those aligning with fewer than 4 mismatches were omitted from further analysis.

2. Each unalignable read was split into two fragments such that each fragment was at least 12 bases long and the hexamer beginning the second fragment had maximum probability of being sampled from the S.cer. 5’ss position weight matrix. Reads for which this maximum probability was less than 0.01 were omitted from further analysis. The fragments will be referred to by their position at the 3’ or 5’ end of the original read moving forwards.

3. The fragment pairs were mapped to the S.cer. genome using the Bowtie read aligner allowing no mismatches. The fragments were required to map in an inverted order (3’ fragment upstream of 5’ fragment). The final base of each 5’ fragment, the putative BP nucleotide, was omitted from this alignment due to the prevalence of mismatches at this position.

4. For all fragment pairs with a valid alignment, the final base of each 5’ fragment was re-added. The aligned position of the 3’ end of the 5’ fragment was called as a BP and the aligned position of the 5’ end of the 3’ fragment was called as the corresponding 5’ss.
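Step 2 of the procedure above (choosing the split point whose downstream hexamer best matches the 5'SS position weight matrix) might look something like the following sketch. The PWM values here are made up for illustration via the hypothetical `make_pwm` helper; the actual S.cer. 5'SS matrix is not given in the text, and `best_split` is not the authors' implementation.

```python
def make_pwm(consensus="GTATGT", match=0.97, other=0.01):
    """Build a toy PWM (illustrative values only, not the real 5'SS matrix)."""
    return [{nt: (match if nt == c else other) for nt in "ACGT"}
            for c in consensus]


def best_split(read, pwm, min_len=12, min_prob=0.01):
    """Split a read into (5' fragment, 3' fragment), or return None.

    The split is placed so that the hexamer starting the 3' fragment has the
    highest probability under the PWM, with both fragments >= min_len bases.
    Reads whose best probability is below min_prob are rejected.
    """
    k = len(pwm)
    best_pos, best_p = None, 0.0
    for s in range(min_len, len(read) - min_len + 1):
        p = 1.0
        for pos, nt in enumerate(read[s:s + k]):
            p *= pwm[pos].get(nt, 0.0)  # unknown bases (e.g. N) score 0
        if p > best_p:
            best_pos, best_p = s, p
    if best_pos is None or best_p < min_prob:
        return None
    return read[:best_pos], read[best_pos:]


pwm = make_pwm()
read = "ACACACACACACAC" + "GTATGT" + "CCCCCCCC"
print(best_split(read, pwm))
```

The 12-base minimum and the 0.01 probability cutoff are taken from the text; everything else is an assumption for the sake of a runnable example.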

Skipping across lariat 5'SS-BP junctions

We found that reverse transcriptase often introduces short insertions and deletions when crossing a lariat junction. As a result, the 3' end of the 5' fragment of a lariat junction read does not always end directly at the BP. The frequency of these events was determined by comparing the BP location called by each lariat junction read to a known BP location, as annotated by Meyer et al., within 25 bp if one exists. Figure 2-S2D reports the distribution when allowing no mismatches, as used elsewhere in this paper. This criterion precluded observing insertion events, as they were found to always have the sequence UACUACU at the 3’ end of the 5’ fragment, resulting in mismatches in the last two positions when aligned to the BP consensus motif.

BP calling from lariat junction reads

To make precise BP calls from the lariat junction reads, we used a probabilistic model based on the observed skipping rates in introns with annotated BPs and a self-learned BP motif position weight matrix (PWM).

Reads were separated into clusters based on proximity of their downstream ends. The $i$-th cluster of reads is denoted by $R_i$. The distribution $P(B_i = x \mid R_i)$, where $B_i$ is a random variable indicating the location of the BP generating $R_i$, was computed using the proportionality

$P(B_i = x \mid R_i) \propto P(R_i \mid B_i = x) \, P(B_i = x).$

Assuming a uniform prior and that reads are independent given a BP, we rewrite this proportionality as

$P(B_i = x \mid R_i) \propto \prod_{r \in R_i} P(r \mid B_i = x).$

Note that $P(r \mid B_i = x)$ is simply the probability of observing a deletion of the size in read $r$ given $B_i = x$.
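As a rough illustration of the posterior described above, the sketch below computes the distribution over candidate BP positions for one cluster under a uniform prior. The deletion-size probabilities are invented for the example, and the convention that the deletion size equals the candidate position minus the read's aligned end (plus-strand coordinates) is an assumption, not something stated in the text.

```python
from math import prod  # available in Python 3.8+


def bp_posterior(read_ends, candidates, deletion_prob):
    """Posterior over BP positions for one read cluster (illustrative).

    read_ends:     aligned 3' end of the 5' fragment for each read
    candidates:    candidate BP positions x
    deletion_prob: dict mapping deletion size -> probability (assumed values)
    Deletion size is taken here to be x - read_end (an assumed convention).
    """
    likelihood = {
        x: prod(deletion_prob.get(x - end, 0.0) for end in read_ends)
        for x in candidates
    }
    total = sum(likelihood.values())
    if total == 0:
        raise ValueError("no candidate position explains all reads")
    return {x: value / total for x, value in likelihood.items()}


# Hypothetical skipping rates: most reads end exactly at the BP.
deletion_prob = {0: 0.80, 1: 0.15, 2: 0.05}
posterior = bp_posterior([100, 100, 99], range(99, 103), deletion_prob)
best = max(posterior, key=posterior.get)
print(best)
```

In this toy cluster two reads end at position 100 and one at 99, so position 100 receives nearly all of the posterior mass.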

An EM framework was used to learn a BP motif PWM, which was then used to improve precision. Beginning with an unbiased motif, the following protocol was repeated until the motif did not change between iterations:

1. Calculate $P(B_i = x \mid R_i, M)$, where $M$ is the current motif, by multiplying $P(B_i = x \mid R_i)$ and the probability that the motif implied by $B_i = x$ would be sampled from $M$, and then normalizing by the sum across each cluster.

2. Refine $M$ based on the updated distribution. For each nucleotide in all positions in $M$, start with a pseudocount of 1. For all possible $x$, in all clusters $i$, add $P(B_i = x \mid R_i, M)$ to the count for the nucleotide in the respective position, for each position in the motif. Normalize by dividing all counts by the number of clusters plus 4.
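The two-step iteration above can be sketched as a small EM loop. This is an illustrative reading of the protocol, not the released findbps code: the data layout (each cluster as a list of `(read_posterior, motif_sequence)` pairs, one per candidate position), the convergence tolerance, and the toy input are all assumptions.

```python
NUCS = "ACGT"


def motif_prob(seq, pwm):
    """Probability that seq would be sampled from the PWM."""
    p = 1.0
    for pos, nt in enumerate(seq):
        p *= pwm[pos][nt]
    return p


def e_step(cluster, pwm):
    """Step 1: weight each candidate by read posterior x motif probability."""
    weights = [p_read * motif_prob(seq, pwm) for p_read, seq in cluster]
    total = sum(weights)
    return [w / total for w in weights]


def m_step(clusters, posteriors, width):
    """Step 2: pseudocount of 1, add posteriors, divide by (#clusters + 4)."""
    counts = [{nt: 1.0 for nt in NUCS} for _ in range(width)]
    for cluster, post in zip(clusters, posteriors):
        for (_, seq), w in zip(cluster, post):
            for pos, nt in enumerate(seq):
                counts[pos][nt] += w
    denom = len(clusters) + 4
    return [{nt: c / denom for nt, c in col.items()} for col in counts]


def learn_motif(clusters, width, tol=1e-9, max_iter=1000):
    """Iterate from a uniform ('unbiased') motif until it stops changing."""
    pwm = [{nt: 0.25 for nt in NUCS} for _ in range(width)]
    for _ in range(max_iter):
        posteriors = [e_step(c, pwm) for c in clusters]
        new_pwm = m_step(clusters, posteriors, width)
        delta = max(abs(new_pwm[i][nt] - pwm[i][nt])
                    for i in range(width) for nt in NUCS)
        pwm = new_pwm
        if delta < tol:  # tolerance is an assumed stopping rule
            break
    return pwm


# Toy input: each cluster has two equally likely candidates, one of which
# implies the motif TACTAAC and one of which implies an unrelated 7-mer.
clusters = [
    [(0.5, "TACTAAC"), (0.5, "GGCATTG")],
    [(0.5, "TACTAAC"), (0.5, "CCGTAGT")],
    [(0.5, "TACTAAC"), (0.5, "AGTCGCA")],
]
pwm = learn_motif(clusters, width=7)
consensus = "".join(max(col, key=col.get) for col in pwm)
print(consensus)
```

Because the per-cluster posteriors sum to 1, dividing by the number of clusters plus 4 makes each PWM column sum to 1, which is presumably why that normalizer is used.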

Mapping RNA-seq reads for entropy calculations

60 x 60 bp reads (WT, upf1 null, and dbr1 null samples) were initially mapped with TopHat2 (Kim et al., 2013) (tophat-2.0.0.Linux_x86_64), giving TopHat no annotations and allowing it to discover novel splice junctions using the following parameters:

    tophat -i 20 -I 10000 -a 10 --segment-length 15 --bowtie1 SacCer3 end1.fastq end2.fastq

Each barcoded sample was mapped on its own, and additionally all samples were mapped together to find as many novel splice junctions as possible. A custom Bowtie index was created for all splice junctions found by TopHat by concatenating the 50 nt of sequence immediately before and after the junction to ensure the reads had at least a 10 nt overhang on each side of the junction. Bowtie1 was run with this custom index (genome + novel splice junctions) on each end of each sequencing library separately because paired-end reads would be able to map to this custom index with many 100nt fragments. Bowtie was run as follows:

    bowtie -S -m 1 -SacCer3_custom_index one_end_reads.fastq outfile.sam

Bowtie read mapping to the custom splice index was used to calculate the entropy of each splice junction, as done in Graveley et al. (2011), using the formula below over the positions around the junction where read starts may fall.

$p_i = \dfrac{\text{reads at offset } i}{\text{total reads in junction window}}$

$\text{Entropy} = -\sum_i p_i \log_2 p_i$
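A minimal sketch of this entropy calculation is shown below. The input format (a list with one read-start offset per junction-spanning read) and the function name are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def junction_entropy(read_offsets):
    """Shannon entropy (in bits) of read-start offsets across one junction.

    read_offsets: one entry per junction-spanning read, giving the offset at
    which that read starts within the junction window.
    """
    counts = Counter(read_offsets)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


print(junction_entropy([0, 1, 2, 3]))  # four distinct offsets -> 2.0
print(junction_entropy([5, 5, 5, 5]))  # all reads at one offset -> -0.0
```

A junction supported by reads stacked at a single offset (a typical PCR or mapping artifact) scores 0, while reads spread evenly over four offsets reach the entropy cutoff of 2 used elsewhere in this chapter.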

RPL30 AT-AC isoforms

These isoforms insert a stop codon early in the message, generating an upstream open reading frame (uORF). These isoforms might therefore be translated under specific conditions via uORF-mediated translational regulation (Hinnebusch, 2006), potentially producing a truncated protein comprising the C-terminal half of full length RPL30. RPL30 is known to regulate splicing and translation of transcripts from the RPL30 locus by binding to RNA secondary structure at the 5' end of the pre-mRNA or mRNA.

Conservation

PhastCons scores were downloaded from the UCSC genome browser (phastCons7way) for the novel BP and novel splice site analyses. For the novel splice site plots, the entire region surrounding the splice site in the figure had to fall into the region in question (i.e., intron or CDS). “Intergenic” refers to any region completely outside of a CDS or intron. For the BP conservation plot, only the location of the BP was considered for classifying the BP by location.

Protein length analysis

For all novel splice junctions with entropy at least 2 that overlap an annotated gene, the protein sequence of the resultant transcript was constructed. The length of each novel protein sequence was compared to the length of the annotated protein from the same gene and reported in figure 2-4C. When constructing the novel protein sequences, the following assumptions were followed:

1. In cases where a gene has multiple novel splice junctions, only one is considered at a time (i.e., if there are 3 novel splice junctions in one gene, three protein sequences are created).

2. All annotated introns are spliced out, except if they overlap the novel splice junction being considered at the time.

3. If a novel splice junction removes the annotated translation start site, the next available AUG is used.

MISO analysis of splicing

Retained intron annotations were created from all splice junctions with entropy ≥ 2. Retained introns were splice junctions detected in the WT, upf1 null, or dbr1 null samples that did not overlap any other splice junctions detected, annotated or novel. To build the RI MISO annotations, 200 nt flanking the intron was used as exonic sequence. MISO (misopy/0.4.6) was run. For Waern et al. data (downloaded from http://downloads.yeastgenome.org/published_datasets/Waern_2013_PMID_23390610/fastq/), --read-length = 76. For Brar et al. data (GEO accession number GSE34082), only reads of length 28-30 nt were used and --read-length was set to 29. Only footprints are shown for Brar et al. data because the total RNA libraries had few reads that fell into the 28-30 nt range. Prior to mapping Brar et al. data, poly(A) adaptor sequences were trimmed off of the reads using Cutadapt. Brar et al. and Waern et al. reads were mapped to the genome, defined splice junctions (UCSC, sacCer3), and novel splice junctions with entropy ≥ 2 in the WT, upf1 null, and dbr1 null RNA-seq (see above) using Tophat2. Summary tables from MISO output were generated for events with x=1, y=0, n=20, psi confidence = 0.5 (see “Using the read class counts” https://miso.readthedocs.org/en/fastmiso/). These were considered “confident” psi values (see below).

Clustering of PSI values

If an event had confident PSI values in at least half of the conditions, the missing PSI values were replaced with the mean PSI from the confident samples. Clustering was done with heatmap.2 in R (Warnes et al., 2015).
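The imputation rule described above might be expressed as in the following sketch. The clustering itself was done in R; this Python fragment only illustrates the filling-in step, and the function name and the use of `None` to mark a non-confident PSI are assumptions.

```python
def impute_psi(psi_values, min_fraction=0.5):
    """Fill missing PSI values for one event with the mean of confident ones.

    psi_values: one entry per condition; None marks a non-confident PSI.
    Returns the filled list, or None if fewer than min_fraction of the
    conditions have a confident value (the event is then left out).
    """
    confident = [v for v in psi_values if v is not None]
    if not psi_values or len(confident) < min_fraction * len(psi_values):
        return None
    mean_psi = sum(confident) / len(confident)
    return [mean_psi if v is None else v for v in psi_values]


print(impute_psi([0.25, None, 0.75, None]))  # [0.25, 0.5, 0.75, 0.5]
print(impute_psi([0.25, None, None, None]))  # None
```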

Cufflinks (RNA-seq FPKMs)

Cufflinks (Trapnell et al., 2012) (version 2.2.1) was used to calculate FPKMs for the RNA-seq data using the command:

    cuffdiff -o . --library-type fr-firststrand -u -N -b SacCer3.fsa saccharomyces_cerevisiae_R64-1-1_20110208.gff wt1.bam,wt1.bam dbr1-1.bam,dbr1-2.bam upf1-1.bam,upf1-2.bam

Branch-seq CPM calculations

Branch-seq CPMs were calculated using the formula CPM = F / (L × (M / 1,000,000)), where M is the total number of mapped reads, F is the number of strand-specific BP (3' end) reads within the L nucleotides centered on the BP peak, and L = 11 nt.
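As a small worked example of the CPM formula, the sketch below uses made-up read counts; only the formula and the 11 nt window come from the text.

```python
def branch_seq_cpm(reads_in_window, total_mapped_reads, window_nt=11):
    """CPM = F / (L * (M / 1,000,000)) with F, M, L as defined in the text."""
    return reads_in_window / (window_nt * (total_mapped_reads / 1_000_000))


# Hypothetical numbers: 22 BP reads in the 11 nt window, 2 million mapped reads.
print(branch_seq_cpm(22, 2_000_000))  # 1.0
```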

Genes with multiple BPs

5'SS-BP pairs from annotated introns with computationally predicted BPs (282) (Meyer et al., 2011) and all 268 cnBPs with typical 5'SS were considered in this analysis, for a total of 550 5'SS-BP pairs. Any overlapping 5'SS-BP pairs on the same strand were grouped into one “intron island.” For islands that contain 2 or more BPs, a BP motif with 2 or fewer mismatches from “TACTAAC” was required within 3 nt of the BP peak to keep the peak for downstream analyses. This yielded 11 intron islands that use 2 BPs and one intron island that uses 3 BPs. For the genes that use 2 BPs, the distance from the 5'SS to the BP is the distance from each BP to its paired 5'SS. BP1 is the more 5'SS-proximal BP in the intron island. Sequence logos were made with WebLogo (Crooks, Hon, Chandonia, & Brenner, 2004).
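The motif filter applied to multi-BP intron islands could be sketched as below. The text does not say which motif position is measured against the peak; this example assumes it is the branch adenosine (the sixth base of TACTAAC), and the function name and 0-based, strand-oriented coordinates are likewise assumptions.

```python
BP_MOTIF = "TACTAAC"
BRANCH_OFFSET = 5  # 0-based index of the branch A within TACTAAC (assumed anchor)


def has_bp_motif_near(seq, peak, max_mismatch=2, max_dist=3):
    """Check for a near-consensus BP motif close to a Branch-seq peak.

    seq:  strand-oriented sequence around the peak
    peak: 0-based index of the BP peak within seq
    Returns True if some 7-mer with <= max_mismatch mismatches from TACTAAC
    has its branch A within max_dist nt of the peak.
    """
    for start in range(len(seq) - len(BP_MOTIF) + 1):
        if abs(start + BRANCH_OFFSET - peak) > max_dist:
            continue
        kmer = seq[start:start + len(BP_MOTIF)]
        if sum(a != b for a, b in zip(kmer, BP_MOTIF)) <= max_mismatch:
            return True
    return False


seq = "GGGGG" + "TACTAAC" + "GGGGG"  # branch A sits at index 10
print(has_bp_motif_near(seq, 10))  # True
print(has_bp_motif_near(seq, 16))  # False
```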

Novel and annotated BP motifs

Sequences 15 nt up- and downstream of the BP peaks were submitted to MEME (Bailey et al., 2009) (Version 4.10.0) to generate sequence logos. Only BPs detected by Branch-seq are in the logos in Figure 2-2.

The human BP motif was generated using sequences 10 nt up- and downstream of the BP nt from the BPs annotated by Mercer et al. (2015). 1000 sequences were submitted to MEME (the maximum MEME accepts) to generate the motif.

LSM2 qPCR primer sequences

Actin primers:

ScerACT1_junct_F: ATGGATTCTGAGGTTGCTGCT

ScerACT1_mRNA_Rev: GGAGTCTTTTTGACCCATACCGA

LSM2 constitutive exon:

LSM2 qPCR Exon 2F constitutive: TAAAAAACGACATTGAAATAAAAGGTACA

LSM qPCR Exon 2R constitutive: TTCATCTGTGCATGATATGTTGTCTA

LSM2 novel 3'SS (PTC isoform):

LSM2 qPCR new 3’ss junction F: GTGGTCGTAGAGTCAAGTACTAAC

LSM qPCR Exon 2R constitutive: TTCATCTGTGCATGATATGTTGTCTA

LSM2 annotated 3'SS isoform:

LSM2 qPCR canonical (normal) 3’ss junction F: GTGGTCGTAGAGTTAAAAAACGAC

LSM qPCR Exon 2R constitutive: TTCATCTGTGCATGATATGTTGTCTA

RNA14 (NMD negative control):

GG10_for: ATGTCCAGCTCTACGACTCCTGAT

GG11_rev: GCGTATGACTCTTGAGTTTCCAAA (From Joshua Arribere (Arribere & Gilbert, 2013))

TCA17 (NMD positive control):

GG8_for: GCCTTGCTTCGTATCATTGATAGA

GG9_rev: CATCATCAGCTCCACTTAGGCTTT (From Joshua Arribere (Arribere & Gilbert, 2013))

RPL30 primer sequences

RT: SuperScript II protocol (Invitrogen)

GG13_YGL030W_rev: AAGCCAACTTTTGGTTGATAGA

PCR: Phusion (NEB)

GG14:YGL030W_5’end_for: agaccggagtgtttaagaacct

GG15:YGL030W_rev_ATACjunc: TAACTGGGGCctgttgaaat

SED1 primers

For Figure 2-S5B:

RT: Random hexamers (Invitrogen), following SuperScript II protocol (Invitrogen).

PCR: Phusion (NEB)

GG17:SED1_for: TACATCTTTGCCACCAAGCA

GG18:SED1_rev: TTTGGTGGTAGTGCCCTTAGA

For Figure 2-S5E: SED1 apparent RT artifact

Colony PCR was performed to put a T7 promoter onto the start of the SED1 sequence. The PCR product was gel extracted and used as a template for T7 in vitro transcription (Epicentre AmpliScribe™ T7-Flash™ Transcription Kit), DNA was digested, and the RNA product was cleaned via phenol-chloroform extraction. RNA was gel extracted using UV shadowing visualization. RT and PCR were performed as in Figure 2-S5B.

Scer_SED1_colony_Forward: TAATACGACTCACTATAGGGgacaagcaaaataaaatacgttcg

Scer_SED1_colony_Reverse: ttaaactacccctattgcttttaga

Plotting

Additional plots in this paper were made with ggplot2 (Wickham, 2009), IGV (Robinson et al., 2011), matplotlib, Pictogram, WebLogo, and MEME.

Data access

The data can be found under GEO accession number GSE68022.

GEM-BP (GEM 2.6) software for peak calling can be downloaded from http://cgs.csail.mit.edu/gem/versions.html

Code to find BPs from lariat junction reads can be downloaded from https://github.com/jpaggi/findbps

Acknowledgements

We’d like to thank Andy Berglund for initial ideas on the Branch-seq protocol and members of the Burge lab for helpful discussions. We thank the Reed lab for coupled in vitro splicing and translation reagents and protocols that were used in development of Branch-seq. We thank Yarden Katz for primer design assistance, Shijie Zhao for initial conservation analysis of novel splice sites, David Weinberg for personal communication, Josh Arribere for sporulating dbr1∆ yeast, the Sauer and King labs for assistance with protein purification, Thomas Hansen and Angela Brooks for helpful discussions, and MIT Bio Micro Center for sequencing assistance. This work was supported by an NIH training grant, by an NSF equipment grant (no. 0821391) and by a research grant from the NIH (C.B.B.).

Author contributions

GMG, ETW, and CBB designed the study. GMG performed the experiments. GMG and JMP analyzed the data. YG and DG contributed to analysis tools. BZ processed the Brar et al. data. GMG, JMP, YG, BZ, WVG, and CBB wrote the manuscript.

Supplemental figures

Figure 2-S1. Additional details pertaining to Branch-seq protocol. (A) Left: 2D gel used to isolate lariats from top, middle, and bottom sections of arc. Right: Top and bottom slices excised. D1: 6% TBE-urea. D2: 20% TBE-urea. (B) Read coverage (green) in UBC13 intron from Lariat-seq. Depletion of reads between BP and 3'SS indicates lariat tails are digested when lariats accumulate in dbr1∆ yeast (Chapman & Boeke, 1991). (C) Additional examples like inset in figure 2-1B of read start plots for BPs in 4 individual introns. The majority of reads are located at the +1 or +2 position on an intron-by-intron basis. (D) Hypothesis for predominant +1 vs. +2 read start position in individual introns. RNA sequence in black, question marks are unknown nucleotides after the BP. BP A in red. The RT primer, green, may prime at different locations and produce sequencing products (blue arrow) starting at different positions relative to the BP nucleotide. +1 sequencing is expected if the nucleotide after TACTAAC is an A because of the anchored oligo(dT) priming step in RT. Similarly, the +2 position is expected if the nucleotide after TACTAAC is C, G, or T. Sequencing at -2 is due to mis-priming of the anchored oligo(dT) primer over the terminal C of the BP motif. (E) Genomic sequence immediately downstream of annotated BPs (boxed) with maximum peak from (C) at +1, left, and +2, right, confirms hypothesis in (D). (F) Branch-seq reads in the EFM5 intron are shifted 5 nt from the annotated BP location (blue underline), corresponding to an AACTAAC BP (red underline).


Figure 2-S2. Further characterization of novel BPs. (A) Left: Novel BPs (blue) are not conserved compared to annotated BPs (red). Right: novel BPs from blue line in left plot broken down by genomic location. (B) 5'SS motif of 162 putative novel BPs with atypical 5'SS. (C) Novel BP overlapping YDL138W ORF (plus strand) comes from the minus strand, potentially from a longer form of the annotated CUT/SUT on the minus strand. Novel BP is confirmed by one Branch-seq read pair and several Lariat-seq junction reads. (D) RT sometimes skips over the BP nucleotide in Lariat-seq junction reads (see methods).


Figure 2-S3. Characteristics of lariats captured by Branch-seq. (A) Comparison of expression levels of lariats recovered in Branch-seq (combined top, middle, and bottom slices of arc) to expression of their parent mRNA in poly(A) selected RNA-seq. Only annotated BPs are plotted. (B) Same as (A) but with regression calculated for different lariat sizes, suggesting that Branch-seq read counts are semi-quantitative for lariat loops smaller than 100 nt. (C) Expression level of annotated and novel BPs recovered by Branch-seq. (D) Lariat loop lengths recovered by Branch-seq and Lariat-seq LJ reads.


Figure 2-S4. Novel introns confirmed by entropy resemble annotated introns but preferentially come from short transcripts. (A) Entropy of annotated (green) and novel (pink) splice junctions, separated by splice site motif AT/AC, GC/AG, GT/AG. A cutoff of entropy of 2 was used to define novel splice junctions (Graveley et al., 2011). (B) 5'SS and 3'SS motifs for annotated (top) and novel (bottom) splice sites. (C) Gene lengths (TSS to poly(A) site) (Pelechano, Wei, & Steinmetz, 2013) for genes containing novel BPs identified in Branch-seq and genes containing novel introns with entropy ≥ 2 identified in RNA-seq data.


Figure 2-S5. Experimental testing of AT-AC splice site introns. RT-PCR on total RNA to verify (A) RPL30 and (B) SED1 AT-AC splice sites. SED1 AT-AC splice site intron is located inside a long repeat (C) highlighted in green and (D) shown in a dot plot. (E) RT-PCR on in vitro transcribed full-length SED1 RNA. The presence of a product of the expected spliced size here suggests an RT artifact.


Figure 2-S6. Conservation of novel intron splice sites from isoforms that show splicing patterns similar to annotated introns. Arrows above each splice site indicate sequence direction. UCSC browser snapshots are shown for splice sites located outside of coding sequences.


Figure 2-S7. Translation of YNL194-YNL195C fusion transcript changes throughout meiosis time course. Sashimi plots depict reads in exons and reads spanning splice junction (numbered arcs) with PSI value shown to the right with confidence bounds (tie fighter plot). Plots are ordered by progression through meiosis time course from Brar et al. (Brar et al., 2012) for (A) ribosome footprint profiling data.

Tables

Table 1. Summary of BP peak calling analysis.

Peak Caller              winBP   GEM-BP   Overlap   Union
No. known BP             153     196      151       198
No. putative novel BP    191     350      111       430
No. of cnBP              126     222      80        268

References

Aebi, M., Hornig, H., Padgett, R. A., Reiser, J., & Weissmann, C. (1986). Sequence requirements for splicing of higher eukaryotic nuclear pre-mRNA. Cell, 47(4), 555–565. doi:10.1016/0092-8674(86)90620-3
Arribere, J. A., & Gilbert, W. V. (2013). Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing. Genome Research, 23(6), 977–987. doi:10.1101/gr.150342.112
Awan, A. R., Manfredo, A., & Pleiss, J. A. (2013). Lariat sequencing in a unicellular yeast identifies regulated alternative splicing of exons that are evolutionarily conserved with humans., 110(31), 12762–12767. doi:10.1073/pnas.1218353110
Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., Clementi, L., et al. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Research, 37(Web Server issue), W202–8. doi:10.1093/nar/gkp335
Beggs, J. D. (2005). Lsm proteins and RNA processing. Biochemical Society Transactions, 33(Pt 3), 433–438. doi:10.1042/BST0330433
Bitton, D. A., Rallis, C., Jeffares, D. C., Smith, G. C., Chen, Y. Y. C., Codlin, S., et al. (2014). LaSSO, a strategy for genome-wide mapping of intronic lariats and branch points using RNA-seq. Genome Research, 24(7), 1169–1179. doi:10.1101/gr.166819.113
Bradley, R. K., Merkin, J., Lambert, N. J., & Burge, C. B. (2012). Alternative Splicing of RNA Triplets Is Often Regulated and Accelerates Proteome Evolution, 10(1), e1001229. doi:10.1371/journal.pbio.1001229
Brar, G. A., Yassour, M., Friedman, N., Regev, A., Ingolia, N. T., & Weissman, J. S. (2012). High-resolution view of the yeast meiotic program revealed by ribosome profiling. Science (New York, NY), 335(6068), 552–557. doi:10.1126/science.1215110
Breiman, L. (2001). Random forests. Machine Learning.
Burnette, J. M., Miyamoto-Sato, E., Schaub, M. A., Conklin, J., & Lopez, A. J. (2005). Subdivision of large introns in Drosophila by recursive splicing at nonexonic elements. Genetics, 170(2), 661–674. doi:10.1534/genetics.104.039701
Chapman, K. B., & Boeke, J. D. (1991). Isolation and characterization of the gene encoding yeast debranching enzyme. Cell, 65(3), 483–492. doi:10.1016/0092-8674(91)90466-C
Clarkson, B. K., Gilbert, W. V., & Doudna, J. A. (2010). Functional overlap between eIF4G isoforms in Saccharomyces cerevisiae. PloS One, 5(2), e9114. doi:10.1371/journal.pone.0009114
Crooks, G. E., Hon, G., Chandonia, J.-M., & Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Research, 14(6), 1188–1190. doi:10.1101/gr.849004
Davis, C. A. (2000). Test of intron predictions reveals novel splice sites, alternatively spliced mRNAs and new introns in meiotically regulated genes of yeast. Nucleic Acids Research, 28(8), 1700–1706. doi:10.1093/nar/28.8.1700
Dietrich, R. C., Incorvaia, R., & Padgett, R. A. (1997). Terminal Intron Dinucleotide Sequences Do Not Distinguish between U2- and U12-Dependent Introns. Molecular Cell, 1(1), 151–160. doi:10.1016/S1097-2765(00)80016-7
Dumesic, P. A., Natarajan, P., Chen, C., Drinnenberg, I. A., Schiller, B. J., Thompson, J., et al. (2013). Stalled spliceosomes are a signal for RNAi-mediated genome defense. Cell, 152(5), 957–968. doi:10.1016/j.cell.2013.01.046
Folco, E. G., & Reed, R. (2014). In vitro systems for coupling RNAP II transcription to splicing and polyadenylation. Methods in Molecular Biology (Clifton, NJ), 1126, 169–177. doi:10.1007/978-1-62703-980-2_13
Folco, E. G., Lei, H., Hsu, J. L., & Reed, R. (2012). Small-scale nuclear extracts for functional assays of gene-expression machineries. Journal of Visualized Experiments : JoVE, (64). doi:10.3791/4140
Friedman, K. L., & Brewer, B. J. (1995). Analysis of replication intermediates by two-dimensional agarose gel electrophoresis. In Methods in Enzymology (Vol. 262, pp. 613–627). Elsevier. doi:10.1016/0076-6879(95)62048-6
González, C. I., Wang, W., & Peltz, S. W. (2001). Nonsense-mediated mRNA decay in Saccharomyces cerevisiae: a quality control mechanism that degrades transcripts harboring premature termination codons. Cold Spring Harbor Symposia on Quantitative Biology, 66, 321–328.
Graveley, B. R., Brooks, A. N., Carlson, J. W., Duff, M. O., Landolin, J. M., Yang, L., et al. (2011). The developmental transcriptome of Drosophila melanogaster. Nature, 471(7339), 473–479. doi:10.1038/nature09715
Guo, Y., Mahony, S., & Gifford, D. K. (2012). High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Computational Biology, 8(8), e1002638. doi:10.1371/journal.pcbi.1002638
Hinnebusch, A. G. (2006). Gene-specific translational control of the yeast GCN4 gene by phosphorylation of eukaryotic initiation factor 2. Molecular Microbiology, 10(2), 215–223. doi:10.1111/j.1365-2958.1993.tb01947.x
Hirose, T., Ideue, T., Nagai, M., Hagiwara, M., Shu, M.-D., & Steitz, J. A. (2006). A spliceosomal intron binding protein, IBP160, links position-dependent assembly of intron-encoded box C/D snoRNP to pre-mRNA splicing. Molecular Cell, 23(5), 673–684. doi:10.1016/j.molcel.2006.07.011
Hossain, M. A., & Johnson, T. L. (2014). Using yeast genetics to study splicing mechanisms. Methods in Molecular Biology (Clifton, N.J.), 1126, 285–298. doi:10.1007/978-1-62703-980-2_21
Hu, W., Sweet, T. J., Chamnongpol, S., Baker, K. E., & Coller, J. (2009). Co-translational mRNA decay in Saccharomyces cerevisiae. Nature, 461(7261), 225–229. doi:10.1038/nature08265
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S., & Weissman, J. S. (2009). Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science (New York, N.Y.), 324(5924), 218–223. doi:10.1126/science.1168978
Juneau, K., Nislow, C., & Davis, R. W. (2009). Alternative splicing of PTC7 in Saccharomyces cerevisiae determines protein localization. Genetics, 183(1), 185–194. doi:10.1534/genetics.109.105155
Katz, Y., Li, F., Lambert, N. J., Sokol, E. S., Tam, W.-L., Cheng, A. W., et al. (2014). Musashi proteins are post-transcriptional regulators of the epithelial-luminal cell state. eLife, 3, e03915. doi:10.7554/eLife.03915
Katz, Y., Wang, E. T., Airoldi, E. M., & Burge, C. B. (2010). Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nature Methods, 7(12), 1009–1015. doi:10.1038/nmeth.1528
Kawashima, T., Douglass, S., Gabunilas, J., Pellegrini, M., & Chanfreau, G. F. (2014). Widespread use of non-productive alternative splice sites in Saccharomyces cerevisiae. PLoS Genetics, 10(4), e1004249. doi:10.1371/journal.pgen.1004249
Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Research, 12(4), 656–664. doi:10.1101/gr.229202
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36. doi:10.1186/gb-2013-14-4-r36
Královicová, J., Lei, H., & Vorechovský, I. (2006). Phenotypic consequences of branch point substitutions. Human Mutation, 27(8), 803–813. doi:10.1002/humu.20362
Kruger, K., Grabowski, P. J., Zaug, A. J., Sands, J., Gottschling, D. E., & Cech, T. R. (1982). Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell, 31(1), 147–157.
Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the . Genome Biology, 10(3), R25. doi:10.1186/gb-2009-10-3-r25
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078–2079. doi:10.1093/bioinformatics/btp352
Malone, R. E., Bullard, S., Hermiston, M., Rieger, R., Cool, M., & Galbraith, A. (1991). Isolation of mutants defective in early steps of meiotic recombination in the yeast Saccharomyces cerevisiae., 128(1), 79–88.
MathWorks, I. (2012). MathWorks: MATLAB and Statistics Toolbox Release - Google Scholar.
Mercer, T. R., Clark, M. B., Andersen, S. B., Brunck, M. E., Haerty, W., Crawford, J., et al. (2015). Genome-wide discovery of human splicing branchpoints. Genome Research, 25(2), 290–303. doi:10.1101/gr.182899.114
Meyer, M., Plass, M., Pérez-Valle, J., Eyras, E., & Vilardell, J. (2011). Deciphering 3'ss selection in the yeast genome reveals an RNA thermosensor that mediates alternative splicing. Molecular Cell, 43(6), 1033–1039. doi:10.1016/j.molcel.2011.07.030
Miura, F., Kawaguchi, N., Sese, J., Toyoda, A., Hattori, M., Morishita, S., & Ito, T. (2006). A large-scale full-length cDNA analysis to explore the budding yeast transcriptome., 103(47), 17846–17851. doi:10.1073/pnas.0605645103
Ooi, S. L., Dann, C., Nam, K., Leahy, D. J., Damha, M. J., & Boeke, J. D. (2001). RNA lariat debranching enzyme. Methods in Enzymology, 342, 233–248.
Padgett, R. A., Konarska, M. M., Aebi, M., Hornig, H., Weissmann, C., & Sharp, P. A. (1985). Nonconsensus branch-site sequences in the in vitro splicing of transcripts of mutant rabbit beta-globin genes, 82(24), 8349–8353.
Padgett, R. A., Konarska, M. M., Grabowski, P. J., Hardy, S. F., & Sharp, P. A. (1984). Lariat RNA's as intermediates and products in the splicing of messenger RNA precursors. Science (New York, NY), 225(4665), 898–903.
Pan, Q., Shai, O., Lee, L. J., Frey, B. J., & Blencowe, B. J. (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics, 40(12), 1413–1415. doi:10.1038/ng.259
Pelechano, V., Wei, W., & Steinmetz, L. M. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling. Nature, 497(7447), 127–131. doi:10.1038/nature12121
Pleiss, J. A., Whitworth, G. B., Bergkessel, M., & Guthrie, C. (2007). Rapid, transcript-specific changes in splicing in response to environmental stress. Molecular Cell, 27(6), 928–937. doi:10.1016/j.molcel.2007.07.018
Presnyak, V., Alhusaini, N., Chen, Y.-H., Martin, S., Morris, N., Kline, N., et al. (2015). Codon optimality is a major determinant of mRNA stability. Cell, 160(6), 1111–1124. doi:10.1016/j.cell.2015.02.029
Quesada, V., Conde, L., Villamor, N., Ordóñez, G. R., Jares, P., Bassaganyas, L., et al. (2012). Exome sequencing identifies recurrent mutations of the splicing factor SF3B1 gene in chronic lymphocytic leukemia. Nature Genetics, 44(1), 47–52. doi:10.1038/ng.1032
Rain, J. C. (1997). In vivo commitment to splicing in yeast involves the nucleotide upstream from the branch site conserved sequence and the Mud2 protein. The EMBO Journal, 16(7), 1759–1771. doi:10.1093/emboj/16.7.1759
Reich, C. I., VanHoy, R. W., Porter, G. L., & Wise, J. A. (1992). Mutations at the 3′ splice site can be suppressed by compensatory base changes in U1 snRNA in fission yeast. Cell, 69(7), 1159–1169. doi:10.1016/0092-8674(92)90637-R
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. doi:10.1038/nbt.1754
Ruskin, B., & Green, M. R. (1985). An RNA processing activity that debranches RNA lariats. Science (New York, NY), 229(4709), 135–140.
Russell, A. G., Charette, J. M., Spencer, D. F., & Gray, M. W. (2006). An early evolutionary origin for the minor spliceosome. Nature, 443(7113), 863–866. doi:10.1038/nature05228
Séraphin, B., & Kandels-Lewis, S. (1993). 3′ splice site recognition in S. cerevisiae does not require base pairing with U1 snRNA. Cell, 73(4), 803–812. doi:10.1016/0092-8674(93)90258-R
Sheth, N., Roca, X., Hastings, M. L., Roeder, T., Krainer, A. R., & Sachidanandam, R. (2006). Comprehensive splice-site analysis using comparative genomics. Nucleic Acids Research, 34(14), 3955–3967. doi:10.1093/nar/gkl556
Smith, C. W., & Nadal-Ginard, B. (1989). Mutually exclusive splicing of alpha-tropomyosin exons enforced by an unusual lariat branch point location: implications for constitutive splicing. Cell, 56(5), 749–758.
Spingola, M., Grate, L., Haussler, D., & Ares, M. (1999). Genome-wide bioinformatic and molecular analysis of introns in Saccharomyces cerevisiae., 5(2), 221–234.
Sureau, A. (2001). SC35 autoregulates its expression by promoting splicing events that destabilize its mRNAs. The EMBO Journal, 20(7), 1785–1796. doi:10.1093/emboj/20.7.1785
Taggart, A. J., DeSimone, A. M., Shih, J. S., Filloux, M. E., & Fairbrother, W. G. (2012). Large-scale mapping of branchpoints in human pre-mRNA transcripts in vivo. Nature Structural & Molecular Biology, 19(7), 719–721. doi:10.1038/nsmb.2327
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562–578. doi:10.1038/nprot.2012.016
Vierstra, R. D., & Callis, J. (1999). Polypeptide tags, ubiquitous modifiers for plant protein regulation. Plant Molecular Biology, 41(4), 435–442.
Vijayraghavan, U., Parker, R., Tamm, J., Iimura, Y., Rossi, J., Abelson, J., & Guthrie, C. (1986). Mutations in conserved intron sequences affect multiple steps in the yeast splicing pathway, particularly assembly of the spliceosome. The EMBO Journal, 5(7), 1683–1695.
Vogel, J., Hess, W. R., & Börner, T. (1997). Precise branch point mapping and quantification of splicing intermediates. Nucleic Acids Research, 25(10), 2030–2031.
Waern, K., & Snyder, M. (2013). Extensive transcript diversity and novel upstream open reading frame regulation in yeast. G3 (Bethesda, Md.), 3(2), 343–352. doi:10.1534/g3.112.003640
Wahl, M. C., Will, C. L., & Lührmann, R. (2009). The Spliceosome: Design Principles of a Dynamic RNP Machine. Cell, 136(4), 701–718. doi:10.1016/j.cell.2009.02.009
Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., et al. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470–476. doi:10.1038/nature07509
Warnes, G. R., Ben Bolker, Bonebakker, L., Gentleman, R., Liaw, W. H. A., Lumley, T., et al. (2015). gplots: Various R Programming Tools for Plotting Data. R package version 2.16.0. CRAN.R-Project.org.
Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Retrieved from http://books.google.com/books?hl=en&lr=&id=bes-AAAAQBAJ&oi=fnd&pg=PR5&dq=H+Wickham+ggplot2+elegant+graphics+for+data+analysis+Springer+New+York+2009&ots=SA95Sz5RTU&sig=jfYAe6OOtsEgtMPKIuy6Z1qTYFA
Wollerton, M. C., Gooding, C., Wagner, E. J., Garcia-Blanco, M. A., & Smith, C. W. J. (2004). Autoregulation of Polypyrimidine Tract Binding Protein by Alternative Splicing Leading to Nonsense-Mediated Decay. Molecular Cell, 13(1), 91–100. doi:10.1016/S1097-2765(03)00502-1
Young, M. E., Karpova, T. S., Brugger, B., Moschenross, D. M., Wang, G. K., Schneiter, R., et al. (2002). The Sur7p Family Defines Novel Cortical Domains in Saccharomyces cerevisiae, Affects Sphingolipid Metabolism, and Is Involved in Sporulation. Molecular and Cellular Biology, 22(3), 927–934. doi:10.1128/MCB.22.3.927-934.2002
Zhang, Z., Hesselberth, J. R., & Fields, S. (2007). Genome-wide identification of spliced introns using a tiling microarray. Genome Research, 17(4), 503–509. doi:10.1101/gr.6049107


Chapter 3: Conclusions

Implications

Our discovery of hundreds of novel BPs by Branch-seq revealed an unexpected picture of the budding yeast transcriptome. The surprising number of “intergenic” BPs we found, together with other recent studies demonstrating that gene boundaries in yeast are still being refined, has revealed additional post-transcriptional gene regulation mechanisms and increased coding diversity in yeast (Arribere & Gilbert, 2013; Pelechano, Wei, & Steinmetz, 2013). Thus, the numbers of novel BPs that we report in UTRs and antisense transcripts are likely underestimates. Presumably all of the novel BPs fall inside introns, since splicing is the only known mechanism that creates RNA lariats with 2' branched structures.

To help understand the origins of these BPs, we identified over 100 novel introns in budding yeast using RNA-seq. However, most of the novel BPs fell outside of our novel introns and those introns recently identified by another study (Kawashima, Douglass, Gabunilas, Pellegrini, & Chanfreau, 2014). This persistent discrepancy calls for an explanation.

We propose that the small overlap of novel BPs and novel introns could be due to different technical biases in Branch-seq and RNA-seq library preparations or could result from identification of products of incomplete splicing. One way to assay for stalled splicing is to perform targeted RT-PCR to assess whether the completely spliced product is formed. In a more high-throughput approach, read density upstream of the 5'SS, relative to read density downstream of the 3'SS used to normalize for gene expression, can be compared in RNA-seq libraries that were poly(A)-selected versus rRNA-depleted. As long as the first exon is stable following the first step of splicing, it will be present in an rRNA-depleted library but will not be present in a poly(A)-selected library because it lacks a poly(A) tail. Thus, the expectation is that poly(A)-selected libraries will have lower read density upstream of the 5'SS than rRNA-depleted libraries in cases of incomplete splicing.
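The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis used in this thesis: the function names, the small pseudocount, and the example read densities are all hypothetical.

```python
def exon1_ratio(upstream_5ss_density, downstream_3ss_density, pseudocount=1e-9):
    """Read density upstream of the 5'SS, normalized to density downstream of the 3'SS."""
    return upstream_5ss_density / (downstream_3ss_density + pseudocount)


def stalled_splicing_score(polya, ribo_depleted, pseudocount=1e-9):
    """Normalized exon-1 density in the rRNA-depleted library over the poly(A)-selected one.

    Each argument is a (upstream_5ss_density, downstream_3ss_density) tuple.
    A score well above 1 is the pattern expected if a stable free first exon,
    which lacks a poly(A) tail, accumulates after the first step of splicing.
    """
    polya_ratio = exon1_ratio(*polya, pseudocount=pseudocount)
    ribo_ratio = exon1_ratio(*ribo_depleted, pseudocount=pseudocount)
    return ribo_ratio / (polya_ratio + pseudocount)


# Hypothetical read densities (reads per nt) for one intron-containing gene.
score = stalled_splicing_score(polya=(2.0, 10.0), ribo_depleted=(8.0, 10.0))
print(round(score, 2))
```

In practice the densities would come from read counts over fixed windows flanking each splice site, and a threshold on the score would need to be calibrated against introns known to splice to completion.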

Our identification of novel introns refined the budding yeast genome annotations and led us to validate the first AT-AC splice site intron in S. cerevisiae. The usage of AT-AC splice sites in budding yeast is puzzling because the minor spliceosome that usually splices introns with these splice site motifs in metazoans and plants is not present in S. cerevisiae. It is possible that this AT-AC intron is removed by the major U2 spliceosome, as is the case for one to two dozen other known introns in metazoans. It is also conceivable that this intron is removed by self-splicing. However, this intron is only 214 nt long, making it unclear whether it could adopt the 2D conformation typical of group II introns.

Our identification of introns that contain multiple BPs in yeast (Chapter 2) and fly (Appendix III) suggests that BP selection may impact post-transcriptional gene regulation. In yeast we showed that alternative BP usage affected 3'SS choice and RNA stability in the LSM2 gene. In addition to the effects of multiple BPs on alternative splicing, we hypothesize that the position of multiple BPs in one intron may impact the processing of intron-derived RNAs such as snoRNAs. In both yeast and fly, we observed alternative BP usage that alters snoRNA position in a lariat structure. In yeast we observed alternative BP usage that shifts the snoRNA from the loop of the lariat to the lariat tail. In a fly intron containing two snoRNAs, we observed alternative BPs that create two lariats, resulting in one snoRNA inside each lariat as opposed to two snoRNAs inside one larger lariat. The distance from snoRNAs to BPs is known to be constrained in multiple organisms for proper snoRNA processing (Hirose & Steitz, 2001; Huang, Chen, Zhou, Li, & Qu, 2007; Vincenti, De Chiara, Bozzoni, & Presutti, 2007), suggesting that changes in snoRNA-BP spacing that result from differential BP usage may impact snoRNA biogenesis.
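To make the loop-versus-tail distinction concrete, the following Python sketch classifies where a snoRNA falls relative to the lariat formed by a given BP. It is a hypothetical helper written for illustration only; the coordinate convention (0-based positions along the transcript, 5' to 3') and the example coordinates are assumptions, not values from the yeast or fly introns discussed above.

```python
def snorna_location(five_ss, bp, three_ss, sno_start, sno_end):
    """Classify a snoRNA relative to the lariat formed with a given BP.

    The lariat loop spans the 5'SS to the BP; the lariat tail spans the
    region downstream of the BP up to the 3'SS.
    """
    if not (five_ss < bp < three_ss):
        raise ValueError("expected five_ss < bp < three_ss")
    if five_ss <= sno_start and sno_end <= bp:
        return "loop"
    if bp < sno_start and sno_end <= three_ss:
        return "tail"
    return "spans_bp_or_outside_intron"


# Hypothetical intron (5'SS at 100, 3'SS at 500) hosting a snoRNA at 300-380.
for bp in (450, 250):
    print(bp, snorna_location(100, bp, 500, 300, 380))
```

With the downstream BP the snoRNA sits inside the loop, whereas the upstream BP leaves it in the tail, which is the kind of shift described for the yeast intron. The same function could also report the snoRNA-to-BP distance if spacing constraints are of interest.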

Future directions

BP sequencing approaches

The strength of the current Branch-seq protocol lies in its ability to detect BPs from short lariat loops. Thus, Branch-seq should be applicable to additional organisms, such as fly, worm, and plants, that have many short introns (Lim & Burge, 2001). Of those organisms, I have attempted to isolate lariat RNA from 2D gels for fly and worm samples where DBR1 has been knocked down or deleted, respectively (Fig. III-1A and III-S1A). The more promising of these two organisms is fly, because fly RNA produced a faint arc in a 2D gel and worm RNA did not, as shown in Appendix III. It would also be worthwhile to apply the current Branch-seq protocol to study splicing factor mutants in yeast. By applying Branch-seq to such samples, one could begin to dissect how different mutations affect BP selection and usage, which may elucidate mechanisms for splicing changes observed in those mutants.

To sequence BPs in the long introns that predominate in mammals, a targeted sequencing approach called CaptureSeq was recently adapted to enrich for lariat RNAs and produce reads that cross the 5'SS to BP junctions of individual lariats (Mercer et al., 2015). Capture-based approaches should be well suited to assay the same group of BPs across many different samples. This approach is useful in cases where one has a set of candidate regulated BPs. For instance, if a set of splicing changes are known in a disease-associated splicing factor mutant (e.g., SF3B1, U2AF1, or U2AF2 mutants in MDS), then the BPs of the affected introns could be targeted in patient samples using CaptureSeq to help understand the mechanism underlying the observed splicing changes. This approach is also attractive for studying BP usage across samples where interesting changes in splicing occur (see below). Another application of the CaptureSeq approach is to discover additional BPs, which requires large-scale design of capture probes throughout thousands of introns. A limitation of this design is that it will not recover BPs in unexpected genomic locations, such as unannotated introns. Our discovery of many novel BPs throughout the relatively small yeast genome emphasizes the utility of untargeted approaches like Branch-seq for BP identification. However, if intron annotations are more complete in metazoans, targeted approaches should identify a large fraction of novel mammalian BPs.

Advice for future development of BP sequencing approaches

There are two main topics discussed in the second half of Appendix I regarding BP sequencing methods: (1) further optimization of Branch-seq for sequencing BPs of longer introns and (2) suggestions for new BP sequencing strategies, including alternative lariat enrichment approaches. By adjusting how the adapters are added to the debranched lariats in Branch-seq, it may be possible to reliably sequence lariats larger than 100 nt. For any BP sequencing approach, isolation of lariat RNA is key. Some methods for lariat enrichment discussed in Appendix I include isolation of nuclear RNA, performing DBR1 co-immunoprecipitation, using cells null for DBR1 and its homolog DRN1 (Garrey et al., 2014), digesting linear RNAs, and/or isolating Y-shaped RNAs. These lariat enrichment strategies can be used in combination with Branch-seq or with future BP sequencing methods.

Additional applications of BP sequencing

Using the BP sequencing methods mentioned above, a wide range of BP-centric questions can be addressed. BP evolution can be studied by mapping BP locations in homologous introns across multiple organisms. Depending on the method used, distant BPs, recursive splicing, and nested intron splicing can be further characterized across organisms. BP usage coupled to 3'SS usage can be examined by performing targeted or untargeted BP sequencing in conjunction with RNA-seq under conditions in which interesting splicing changes occur, such as differentiation, epithelial-mesenchymal transition, and across different tissues. These datasets have the potential to demonstrate how often AS involves different BPs and the spacing of alternative 3'SS that use different BPs, and could provide further evidence on the extent to which distal BPs favor distal 3'SS usage in NAGNAGs (Bradley, Merkin, Lambert, & Burge, 2012). BP sequencing can also reveal global BP selection effects of anti-tumor drugs that disrupt proper BP recognition, such as spliceostatin A (SSA) and E7107 (Corrionero, Miñana, & Valcárcel, 2011; Folco, Coil, & Reed, 2011). For organisms where BP sequencing yields non-comprehensive data, the experimentally located BPs can be used to better train predictive BP algorithms (Friedman, 2006). Lastly, if DBR1 is present in the sample used to generate BP sequencing data, it may be possible to calculate lariat degradation rates by sequencing BPs in a time course after transcriptional inhibition. Such an experiment could identify unusually stable lariats that may function as sponges for other RNAs or proteins.

Final remarks

The number of known BP motif mutations that give rise to disease phenotypes is likely to grow as more human BPs are identified and as sequencing of disease samples becomes more prevalent. Whole exome and whole genome sequencing of patient samples will allow identification of BP mutations that lead to splicing defects, expanding our understanding of gene regulation in humans. Though the yeast genome is much more compact than the human genome, even in yeast our understanding of gene regulation is incomplete. The studies outlined in this thesis have shown that BP selection in yeast is more complex than previously known and that this complexity can dictate gene regulatory choices. For instance, we have shown that several yeast introns use multiple BPs, which can influence 3'SS choice and regulation via NMD. I hope the future application of BP sequencing to the outstanding questions detailed above will deepen our understanding of BP regulation and that the techniques described in this thesis will prove useful in those endeavors.

References

Arribere, J. A., & Gilbert, W. V. (2013). Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing. Genome Research, 23(6), 977–987. doi:10.1101/gr.150342.112
Bradley, R. K., Merkin, J., Lambert, N. J., & Burge, C. B. (2012). Alternative Splicing of RNA Triplets Is Often Regulated and Accelerates Proteome Evolution. PLoS Biology, 10(1), e1001229. doi:10.1371/journal.pbio.1001229
Corrionero, A., Miñana, B., & Valcárcel, J. (2011). Reduced fidelity of branch point recognition and alternative splicing induced by the anti-tumor drug spliceostatin A. Genes & Development, 25(5), 445–459. doi:10.1101/gad.2014311
Folco, E. G., Coil, K. E., & Reed, R. (2011). The anti-tumor drug E7107 reveals an essential role for SF3b in remodeling U2 snRNP to expose the branch point-binding region. Genes & Development, 25(5), 440–444. doi:10.1101/gad.2009411
Friedman, B. A. (2006). Evolution and specificity of ribonucleic acid splicing. In C. B. Burge. Massachusetts Institute of Technology. Retrieved from http://hdl.handle.net/1721.1/37139
Garrey, S. M., Katolik, A., Prekeris, M., Li, X., York, K., Bernards, S., et al. (2014). A homolog of lariat-debranching enzyme modulates turnover of branched RNA. RNA (New York, N.Y.), 20(8), 1337–1348. doi:10.1261/rna.044602.114
Hirose, T., & Steitz, J. A. (2001). Position within the host intron is critical for efficient processing of box C/D snoRNAs in mammalian cells. Proceedings of the National Academy of Sciences of the United States of America, 98(23), 12914–12919. doi:10.1073/pnas.231490998
Huang, Z.-P., Chen, C.-J., Zhou, H., Li, B.-B., & Qu, L.-H. (2007). A combined computational and experimental analysis of two families of snoRNA genes from Caenorhabditis elegans, revealing the expression and evolution pattern of snoRNAs in nematodes. Genomics, 89(4), 490–501. doi:10.1016/j.ygeno.2006.12.002
Kawashima, T., Douglass, S., Gabunilas, J., Pellegrini, M., & Chanfreau, G. F. (2014). Widespread use of non-productive alternative splice sites in Saccharomyces cerevisiae. PLoS Genetics, 10(4), e1004249. doi:10.1371/journal.pgen.1004249
Lim, L. P., & Burge, C. B. (2001). A computational analysis of sequence features involved in recognition of short introns. Proceedings of the National Academy of Sciences of the United States of America, 98(20), 11193–11198. doi:10.1073/pnas.201407298
Mercer, T. R., Clark, M. B., Andersen, S. B., Brunck, M. E., Haerty, W., Crawford, J., et al. (2015). Genome-wide discovery of human splicing branchpoints. Genome Research, 25(2), 290–303. doi:10.1101/gr.182899.114
Pelechano, V., Wei, W., & Steinmetz, L. M. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling. Nature, 497(7447), 127–131. doi:10.1038/nature12121
Vincenti, S., De Chiara, V., Bozzoni, I., & Presutti, C. (2007). The position of yeast snoRNA-coding regions within host introns is essential for their biosynthesis and for efficient splicing of the host pre-mRNA. RNA (New York, N.Y.), 13(1), 138–150. doi:10.1261/rna.251907


Appendix I: Branch-seq Protocol

Part 1: Branch-seq protocol

Suggestions:
1) Purify DBR1 before starting the Branch-seq protocol.
2) Make the in vitro spliced lariat close to when performing the debranching step so it will have enough radioactive signal for downstream steps.
3) Remember to radiolabel your RNA ladder before performing in vitro splicing.

Pre-protocol steps:

Debranching enzyme purification
• Clone S.cer. DBR1 using cDNA generated from WT s288c yeast into the pET151 expression vector from Invitrogen.
• Express protein in Rosetta 2(DE3)pLysS competent cells. Grow bacteria in YT media at 37°C until induction with IPTG, then grow at 18°C.
• Lyse bacteria using Native Lysis Buffer (Qiagen).
• Purify protein with a Ni-NTA column (Qiagen) and subsequently over an S200 column (Buffer: 125 mM KCl, 20 mM HEPES pH 7.3, 1 mM DTT, 10% glycerol).
• Concentrate protein (final 50% glycerol) and flash freeze.
• Test protein for RNase activity and debranching activity on an in vitro transcribed linear RNA (body labeled) and an in vitro spliced lariat, respectively.
• Note on debranching enzyme: I tried several sources of DBR1, including a commercially available (human) DBR1 from Abnova, recombinantly purified human DBR1, and recombinantly purified S.cer. DBR1. All three were capable of debranching an in vitro synthesized lariat (below), but the S.cer. DBR1 proved to be the most reliable, perhaps because it was easier to express higher levels of the yeast protein than the human DBR1, and because the commercially available human DBR1 was not sold for the purpose of debranching.

Production of in vitro spliced fly lariat
HeLa nuclear extracts for in vitro splicing were a kind gift from the Reed Lab. Coupled in vitro transcription and splicing were performed similarly to Folco and Reed (Folco & Reed, 2014), except that addition of α-amanitin was omitted to obtain as many lariats as possible. Reactions were digested with RNase R (Epicentre) at 37°C for 1 hour to obtain radiolabeled FTZ lariats. Note on RNase R digestion: U6 snRNA, which became radiolabeled during the coupled reaction, was not digested by RNase R, as seen in Fig. I-1A and B (denoted by arrow).

Notes: Adapted coupled in vitro transcription and splicing protocol:
• Prepare coupled transcription and splicing reaction:
  o 1 uL 12.5 mM ATP
  o 1 uL 0.5 M Creatine Phosphate (di-Tris salt)
  o 1 uL 80 mM MgCl2
  o 1 uL FTZ template DNA (final 200 ng/uL)
  o 6 uL alpha-UTP
  o 15 uL Nuclear Extract
• Note, I typically perform 2 reactions at the same time to produce more labeled lariats.
• Incubate at 30°C for one hour.
• Add 170 uL water and 5 uL Proteinase K.
• Incubate at 37°C for 15 min.
• Remove unincorporated α-UTP using RNeasy cleanup (Qiagen, typically use 4 columns).
• Elute in 40 uL water each. Pool RNA.
• RNase R digest the coupled reaction:
  o 20uL 10X RNase R buffer (Epicentre)
  o 3uL RNase R (It is advisable to keep track of the tube of RNase R used if you have multiple tubes, because there is a report of possible endonuclease contamination in some RNase R batches (Salzman, Gawad, Wang, Lacayo, & Brown, 2012)).
  o 80uL radiolabeled, cleaned, coupled reaction RNA
  o 97uL water
  o 200uL total
• Incubate at 37°C for 1 hour.
• Phenol/chloroform (pH 7.9) extract in a gel lock tube (Ref # 2302830, 5 Prime Phase Lock Gel Heavy 2 mL tube).
• Precipitate RNA:
  o 20uL sodium acetate (pH 5.5)
  o 500uL 100% EtOH
  o 2uL glycogen
• Resuspend in 10uL water.
• Run gel to confirm presence of lariat RNA:
  o 0.5uL RNA + 4uL water + 5uL 2X denaturing loading buffer.
  o Ladder: low range ssRNA (NEB Cat # N0364S).
  o Heat samples before loading gel.
  o 6% TBE urea gel, 200V, 50 min.
  o Expose to phosphorimager plate.
  o For example, see Fig. I-1A lanes 1, 6, 13, 23 and Fig. I-1B no SuperaseIn lane.

Branch-seq protocol:

S.cer. strain used: dbr1Δ:BY4742 Mat α his3Δ1 leu2Δ0 lys2Δ0 ura3Δ0
For a schematic of the protocol, see Figure 2-1A.

I. Isolate Total RNA - Trizol Isolation

• Grow yeast (750mL), collect by centrifugation at 7000 RPM for 5 min at 4°C.
• Wash yeast twice with water.
• Freeze yeast at -80°C (optional).
• Thaw cells. Add 10 mL water and transfer to 12 Omni Bead Ruptor compatible tubes containing 2.8mm ceramic beads. Spin at 7000 RPM for 5 min at 4°C, keep cell pellet.
• Add 1mL Trizol (Life Technologies) to each tube.
• Use an Omni Bead Ruptor to lyse the cells:
  o Homogenize twice for 20 seconds on ½ max speed.
  o Homogenize once for 10 seconds on max speed.
• Transfer to 2 15mL conicals (polypropylene plastic).
• Incubate samples at room temp for 5 min.
• Add 1/5 volume of chloroform (1.2mL/15mL conical) and mix.
• Incubate samples at room temp 2-3 min.
• Spin at max speed for 15 min at 4°C.
• Transfer upper aqueous layer to a new tube and precipitate with ½ volume isopropanol (6mL).
• Incubate on ice 5min.
• Spin 19000 RPM at 4°C for 25 min.
• Wash the RNA pellet with 70% ethanol, resuspend in 200uL EB (Qiagen), store at -80°C.

II. 2D PAGE Gel
Notes:
1) Gel reagents: Ultra-pure sequagel reagents from National Diagnostics.
2) Gel running: Use metal heat sink for all gels.
3) Suggest pouring D1 (first dimension) the evening before you want to run your 2D gel so it is polymerized and ready to use in the morning. This makes it more feasible to run the 2D gel in one day. Wrap the polymerized D1 gel in wet paper towels and saran wrap and store at 4°C overnight.
4) Suggest pouring D2 (second dimension) while D1 is running.
5) I found that linear acrylamide (and to a lesser extent glycoblue) inhibits debranching (Fig. I-1A), so I advise using only glycogen for any precipitation steps.
6) Different percentage 2D gels and different gel running times give altered separation of the lariat arc from the linear RNA diagonal and are better for isolating different size lariats (Fig. I-2).
Protocol:
• Pour D1:
  o 6% gel
  o 1.5 mm spacers
  o ~20 cm by ~32 cm glass plates
  o 12 well comb
• Use 100 ug total RNA. Mix with 2X denaturing loading dye and heat at 80-95°C prior to loading D1. If using a ladder, leave 1 empty lane on each side of the total RNA lane so the total RNA lane can be cleanly cut from the D1 gel.
• Run D1 at 15 W for 1 hr and 45 min.
• Pour D2:
  o 20% gel
  o 1.5 mm spacers
  o ~20 cm by ~32 cm glass plates
  o 1 well comb
• Stain D1 with SYBR Gold and image on a Safe Light.
• Cut out a single lane of the D1 gel as one long strip using a clean razor blade. Leave it on the imager while preparing the D2 gel.
• Prepare D2 gel: Remove comb, add running buffer (TBE) to well to aid in D1 gel insertion.
• Carefully slide D1 gel into D2 gel using tweezers and a razor blade. Avoid introducing air bubbles at the interface between the D1 and D2 gels.
• (Optional) Add loading dye on top of the D1 gel slice in the D2 gel to make it easy to follow the progress of the D2 gel.
• Run D2 at 30W for 6hr and 30 min.
• Stain D2 with SYBR Gold.
• Cut arc out of gel. (Optional) freeze gel at -20°C.
• Elute RNA in 12 mL of PAGE elution buffer (30 mM Tris-HCl (pH 7.5), 300 mM NaCl, and 3 mM EDTA) (Ooi et al., 2001) and rotate continuously overnight at 4°C.
• Precipitate RNA with isopropanol (13.5 mL) and glycogen (7 uL) at -20°C overnight. Spin at 19000 RPM, 25 min, 4°C. Wash with 500uL 70% EtOH, spin in a 1.5 mL epi 10 min at 17000 RPM. Dry pellet and resuspend in 10 uL water.

III. Debranching
Debranching was performed similarly to a published protocol (Ooi et al., 2001).
• Prepare 5X debranching buffer:
  o 100 mM HEPES
  o 625 mM KCl
  o 2.5 mM MgCl2
  o 5 mM DTT
  o 50% glycerol
• Debranching reaction:
  o 3 uL 5X debranching buffer
  o _____ uL radiolabeled FTZ lariat RNA (I used 1 uL of the 10 uL described in the pre-protocol section). This volume can be modified at the user’s discretion depending on the application and radioactive strength of the radiolabeled lariats. The following controls are recommended: (1) debranch the FTZ lariat RNA alone; (2) using FTZ lariat RNA, leave out DBR1; (3) spike FTZ lariat RNA into the samples you wish to debranch and run a small portion of them on a gel afterwards to confirm debranching.
  o 0.5uL recombinantly purified yeast DBR1
  o _____uL water
  o _____uL 2D RNA (I used 4uL)
  o Total volume = 15uL
• Incubate at 30°C for 1 hour. Do NOT use SuperaseIn (RNase inhibitor) because it causes gel-shift-like behavior that makes it difficult to determine whether debranching worked when running the debranched product on a diagnostic gel (Fig. I-1B). Addition of proteinase K prior to running the gel can resolve this issue if you want to use an RNase inhibitor.
• Phenol/chloroform extract the experimental samples, saving a small amount for the diagnostic gel (see next step):
  o Bring volume of debranching reaction to 200uL with water.
  o Phenol/chloroform extract (pH 7.9), gel lock tube.
  o Precipitate: 20uL sodium acetate, 500uL 100% EtOH, 2uL glycogen.
  o Resuspend in 22.5uL water for poly(A) tailing step (below).
• Run diagnostic debranching gel:
  o Add 2X denaturing loading buffer to controls and to small aliquots from experimental samples.
  o Heat samples, run on 10% TBE urea gel, 1hr 15min, 200V.
  o Expose to phosphorimager plate.
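The blank water volume and the final buffer concentrations in the debranching reaction follow from simple arithmetic, sketched below in Python. This is an illustrative helper rather than part of the protocol: the function name is hypothetical, the default volumes are the ones I report using above, and the HEPES stock is assumed to be 100 mM (5 times the 20 mM used in the S200 column buffer). The glycerol figure counts only the contribution from the 5X buffer, not the glycerol in the DBR1 storage buffer.

```python
# 5X debranching buffer; HEPES is assumed to be 100 mM (see note above).
BUFFER_5X = {
    "HEPES (mM)": 100.0,
    "KCl (mM)": 625.0,
    "MgCl2 (mM)": 2.5,
    "DTT (mM)": 5.0,
    "glycerol (%)": 50.0,
}


def debranching_setup(total_ul=15.0, buffer_ul=3.0, lariat_ul=1.0,
                      dbr1_ul=0.5, rna_2d_ul=4.0):
    """Return (water volume in uL, final buffer concentrations) for one reaction."""
    water_ul = total_ul - (buffer_ul + lariat_ul + dbr1_ul + rna_2d_ul)
    if water_ul < 0:
        raise ValueError("components exceed the total reaction volume")
    dilution = buffer_ul / total_ul
    final = {name: conc * dilution for name, conc in BUFFER_5X.items()}
    return water_ul, final


water_ul, final = debranching_setup()
print(f"water: {water_ul} uL")
for name, conc in final.items():
    print(f"{name}: {conc:.2f}")
```

Changing `lariat_ul` or `rna_2d_ul` recomputes the water volume so that the total stays at 15 uL and the buffer remains at 1X.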

IV. Poly(A) tail debranched lariats
The poly(A) tailing protocol is adapted from the Burge Lab’s ribosome footprint profiling protocol, originally developed by the Weissman Lab (Ingolia, Ghaemmaghami, Newman, & Weissman, 2009).

• Prepare RNA samples:
  o Suggest as a control 0.5uL piSPIKE RNA (from IDT) +/- poly(A) polymerase to confirm poly(A) tailing and RT steps.
  o 10 ng RNA: X ul
  o Water: to 22.5 ul
• Make tailing mix and enzyme mix and store on ice. Amounts listed per 1 sample:
  o 2X tailing mix (total 25 ul/sample, but only add 22.5 ul to each tube)
      10X PAP buffer: 5 ul
      10 mM ATP (1 mM ATP final conc): 5 ul
      RNase inhibitor (30 units) (SuperaseIn): 0.75 ul
      Water: 14.25 ul
  o Enzyme mix (total 5 ul/sample)
      2X tailing mix: 2.5 ul
      Water: 2 ul
      Poly(A) polymerase enzyme (5 U/ul): 0.5 ul
• Denature RNA samples 2 min at 80 ˚C.
• Place on ice.
• Add 22.5 ul 2X tailing mix.
• Add 5ul enzyme mix on ice (final volume 50 ul).
• Incubate at 37°C for 10 min.
• Quench reaction with 200uL 5mM EDTA.
• Add 250uL phenol/chloroform (pH 7.9), extract using gel lock tube.
• Precipitate RNA (1uL glycogen, 28uL sodium acetate, 300uL 100% isopropanol).
• Resuspend in 12uL 10 mM Tris pH 8.0.
• Run gel to confirm successful poly(A) tailing if you used the piSPIKE control:
  o ½ reaction (6uL) + 2X denaturing loading buffer (LB)
  o 1uL low range ssRNA ladder + 2X denaturing LB
  o Heat samples.
  o Run on 6% TBE urea gel, 200V, 30 min. Stain with SYBR Gold. Should see a smear above the piSPIKE RNA, confirming poly(A) tailing.
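As a check on the volumes in this section, the short Python sketch below recomputes the mix totals and the final ATP concentration in the 50 ul tailing reaction. It is only a worked example; the variable and function names are hypothetical, and it assumes the 2X tailing mix reaches the reaction both directly (22.5 ul) and through the enzyme mix (2.5 ul), as listed above.

```python
def diluted(stock_conc, stock_ul, total_ul):
    """Concentration after diluting stock_ul of a stock into total_ul."""
    return stock_conc * stock_ul / total_ul


# Component volumes (ul) per sample, as listed in the protocol.
tailing_mix = {"10X PAP buffer": 5.0, "10 mM ATP": 5.0,
               "RNase inhibitor": 0.75, "water": 14.25}
enzyme_mix = {"2X tailing mix": 2.5, "water": 2.0, "poly(A) polymerase": 0.5}

mix_total_ul = sum(tailing_mix.values())
enzyme_total_ul = sum(enzyme_mix.values())
atp_in_mix_mM = diluted(10.0, tailing_mix["10 mM ATP"], mix_total_ul)

rna_ul = 22.5
reaction_total_ul = rna_ul + 22.5 + enzyme_total_ul
tailing_mix_in_reaction_ul = 22.5 + enzyme_mix["2X tailing mix"]
atp_final_mM = diluted(atp_in_mix_mM, tailing_mix_in_reaction_ul, reaction_total_ul)

print(mix_total_ul, enzyme_total_ul, reaction_total_ul, atp_final_mM)
```

The totals agree with the 25 ul, 5 ul, and 50 ul stated in the protocol, and the ATP works out to the 1 mM final concentration noted next to the ATP entry.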

V. Reverse Transcription
• Prepare mix. Amounts listed per 1 sample:
  o 11.5 uL template RNA (debranched lariat or 6 uL piSPIKE control + 5.5 uL water)
  o 1 uL 10mM dNTP mix
  o 1 uL 25uM RT primer
• RT primer (barcode compatible): /5Phos/AGATCGGAAGAGCGTCGTGTAGGGAAAG/iSp18/CACTCA/iSp18/GTGACTGGAGTTCCTTGGCACCCGAGAATTCCA/TTTTTTTTTTTTTTTTTTTTVN (designed in collaboration with Yarden Katz (Folco & Reed, 2014; Katz et al., 2014)).
• Incubate at 65˚C for 5min.
• Place on ice.
• Master mix (1 reaction):
  o 4 uL 5X first strand buffer (comes with enzyme)
  o 0.5 uL SuperaseIn
  o 1 uL 0.1 M DTT
  o 1 uL SuperScript III Reverse Transcriptase (Invitrogen)
• Incubate at 48 ˚C for 30 min.
• Add 2.1 uL 1M NaOH.
• Incubate at 98˚C for 15 min.
• Add 2.1 uL 1M HCl.
• Prepare to load on gel. This allows for removal of excess RT primer:
  o Add 22.5 uL 2X denaturing LB to each RT reaction (debranched lariats and piSPIKE +/- poly(A) tailing controls).
  o Prepare primer standard for gel: 0.5 uL 25uM RT primer + 9.5 uL water + 10 uL 2X LB.
  o Ladders: 10 bp and 25 bp ladders.
  o Heat at 95˚C, 1 min.
  o On ice briefly.
  o Pre-run gel (10% TBE urea) briefly.
  o Load gel and run for 1 hour 33 min at 200V. Debranched lariats required 2 lanes/sample due to volumes.
  o Stain with SYBR Gold for 10 min.
• Excise RT product from gel. Product will appear as a smear, mostly larger than the piSPIKE RNA band (corresponds to a 5'SS to BP distance of 31 nt). Avoid excess RT primer that did not extend any RNA.
  o Put gel slices into 0.5mL epi. Can freeze to help elution.
  o (Optional) Elute piSPIKE RT DNA as a control.
• Elute RT products from gel:
  o Poke hole in bottom of 0.5 mL epi (after thawing).
  o Place 0.5 mL epi into 1.5 mL epi (used 2 tubes/sample because each sample was run in 2 lanes).
  o Spin gel through 0.5 mL epi at max speed, shredding gel.
  o Add 400uL PAGE elution buffer.
  o Elute at 65˚C for 1 hr shaking at 1400 RPM.
  o Remove shredded gel using a NanoSep column.
  o Precipitate DNA (450 uL isopropanol, 2 uL glycogen).
  o Resuspend in 15 uL 10 mM Tris pH 8.0.

VI. Circularization
• Prepare mix (volumes for 1 reaction):
  o 2 uL 10X Circligase buffer (comes with enzyme)
  o 1 uL 1 mM ATP
  o 1 uL 50 mM MnCl2
  o 15 uL DNA from RT
• Put into PCR tubes.
• Mix well.
• Add 1uL Circligase (Epicentre), mix well.
• Incubate at 60˚C for 60min.
• Incubate at 80˚C for 10 min.
• Hold at 4˚C; store at -80˚C or proceed directly to PCR.

VII. PCR
Illumina PCR primer 1.0
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
was paired with Illumina barcode primers (RPI#s):
(RPI1) CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
(RPI2) CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
(RPI3) CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
(RPI4) CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
Oligonucleotide sequences © 2006-2008 Illumina, Inc. All rights reserved. http://epigenome.usc.edu/docs/resources/core_protocols/Illumina%20Sequence%20Information%20for%20Customers%20DEC2008.pdf
• PCR master mix (volumes for 1 reaction, made 4.5X for 4 reactions/sample to remove samples after 6, 8, 10, or 12 PCR cycles):
  o 3.34 uL 5X HF buffer
  o 0.334 uL 10 mM dNTPs
  o 1.67 uL primer mix (5 uM each)
  o 11.7 uL water
  o 0.167 uL Phusion high-fidelity polymerase (NEB)
• To the 77.4 uL of 4.5X master mix add 4.5 uL of circularized template.
• Aliquot 16.7 uL reaction mix + template/PCR strip tube.
• Remove samples after 6, 8, 10, and 12 PCR cycles:
  o 98˚C, 30 sec (initial denature)
  o 98˚C, 10 sec (denature)
  o 68˚C, 10 sec (anneal)
  o 72˚C, 15 sec (extend)
• Run PCR on gel:
  o Add 3.4 uL 6X DNA LB to each PCR reaction.
  o Use 25 bp and 10 bp DNA ladders (0.5uL/each).
  o Run 12 well 8% TBE gel, 200V, 40 min.
  o Stain with SYBR Gold, 5min.
• Excise the bands (will appear as smears) larger than the no insert circularization product at 141nt. Some of this no insert product will exist, even when care was taken to avoid excision of the RT primer band in earlier gel excision steps.
  o Elute from gel using 0.5 mL epi and PAGE buffer (as above in the Reverse Transcription section).
  o Precipitate in 400 uL 100% isopropanol and 1 uL glycogen.
  o Resuspend final pellet in 13 uL 10 mM Tris pH 8.0.
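The 4.5X scaling of the PCR master mix described in this section can be reproduced with a short Python sketch. The helper below is hypothetical and is meant only as a convenience for rescaling; it assumes the half-reaction overage implied by making 4.5X for 4 reactions.

```python
# Per-reaction volumes (uL) as listed in the PCR master mix above.
PER_REACTION_UL = {
    "5X HF buffer": 3.34,
    "10 mM dNTPs": 0.334,
    "primer mix (5 uM each)": 1.67,
    "water": 11.7,
    "Phusion polymerase": 0.167,
}


def scale_master_mix(per_reaction_ul, n_reactions, overage=0.5):
    """Scale each component by (n_reactions + overage) to allow for pipetting loss."""
    factor = n_reactions + overage
    return {name: vol * factor for name, vol in per_reaction_ul.items()}


mix = scale_master_mix(PER_REACTION_UL, n_reactions=4)
total_ul = sum(mix.values())
for name, vol in mix.items():
    print(f"{name}: {vol:.2f} uL")
print(f"total: {total_ul:.1f} uL")
```

The total comes to about 77.4 uL, in line with the volume of 4.5X master mix stated above. Passing a different `n_reactions` would accommodate removing samples at more or fewer cycle numbers.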

VIII. Sequencing
One Illumina MiSeq flow cell was sequenced at the MIT Bio Micro Center (November 2011). 5' end reads were 50 bases and 3' end reads were 250 bases. 3' end reads were sequenced with custom sequencing primer GTGACTGGAGTTCCTTGGCACCCGAGAATTCCATTTTTTTTTTTTTTTTTTTT to avoid sequencing the un-templated As added by the poly(A) tailing reaction. The 3' end sequencing primer was gel purified prior to use in sequencing. Note: primer design might have to be changed for sequencing on other Illumina machines where custom primers can only be added for first end sequencing, not second end sequencing.

Part 2: Advice for future BP sequencing protocols

If one wanted to further optimize the current Branch-seq protocol now that several technical obstacles have been overcome, the obvious step to modify first is attachment of sequencing adapters to the RNA. The reason to adjust this step is that the DNA circularization likely loses large lariats. The evidence that supports this hypothesis comes from the Branch-seq data itself. I observed large introns that do not contain reads at their annotated BPs, but instead contain BP reads near their 5'SS at an intronic stretch of adenosines. This implies that large lariats were successfully eluted from the 2D gel and that a stretch of adenosines inside the lariat loop primed reverse transcription. Presumably short RNAs are more readily circularized than long RNAs, resulting in this arrangement of BP reads near the 5'SS, but not near the annotated BP. Thus, ligation of adapters onto the debranched RNA should be attempted instead of poly(A) tailing/circularization to alleviate the size bias. This approach should only be applied to samples with 5'SS to BP lengths suitable for clustering on flow cells used for sequencing (~1 kbp maximum for Illumina).
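The internal priming artifact described above suggests a simple computational flag for apparent BPs that sit immediately upstream of an intronic adenosine stretch. The Python sketch below is a hypothetical filter, not part of the Branch-seq analysis: the function name, the minimum run length of 6, the 10 nt window, and the example sequence are all assumptions chosen for illustration.

```python
def followed_by_a_run(intron_seq, candidate_bp_index, min_run=6, window=10):
    """Return True if a run of >= min_run As begins within `window` nt
    downstream of the candidate BP position (0-based index into intron_seq).

    Such a run could have been primed by the oligo(dT) RT primer, so the
    apparent BP may reflect internal priming rather than a true branch.
    """
    start = candidate_bp_index + 1
    downstream = intron_seq[start:start + window + min_run].upper()
    return "A" * min_run in downstream


# Hypothetical intron: 5'SS, spacer, an internal A stretch, then a TACTAAC branch site.
intron = "GTATGT" + "CCTTGGCC" + "AAAAAAAA" + "GCTTTACTAACTTTCAG"
apparent_bp = intron.index("AAAAAAAA") - 1   # last nt before the A stretch
annotated_bp = intron.index("TACTAAC") + 5   # branch A of TACTAAC
print(followed_by_a_run(intron, apparent_bp), followed_by_a_run(intron, annotated_bp))
```

Flagged positions would still need manual inspection, since a genuine BP can by chance lie near an A-rich region.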

Another option is to use CircLigase II instead of CircLigase I (both from Epicentre) to determine whether changing the enzyme mitigates the size bias. The protocol as it stands now works well for lariat loops 100 nt or shorter (Figure 2-S3B). It is useful to note that I was able to polyadenylate the in vitro spliced FTZ lariat, but was not able to ligate an adapter to it. However, since it is possible to debranch the lariats without adding any additional nucleotides to their 3' ends, the RNA can be linearized and thus should be a good substrate for ligation.

The remainder of this appendix provides advice for future BP sequencing endeavors. The first hurdle to overcome when sequencing BPs is isolating lariats. If possible, it is advisable to isolate only nuclear RNA in order to obtain a higher fraction of lariat RNA to linear RNA. This approach will miss cytoplasmic lariats, but the enrichment for lariat RNA should help avoid amplification of undesired RNAs. Strategies that allow lariats to accumulate to high levels in cells are advantageous as well, as demonstrated in Branch-seq (Fig. 2-1A). Double deletion of DBR1 and DRN1, an enzyme that has recently been implicated in debranching (Garrey et al., 2014; Salzman et al., 2012), may improve lariat yields relative to DBR1 single mutants. Based on my attempts to isolate lariats from multiple organisms, I hypothesize that additional lariat degradation pathways apart from DBR1-dependent debranching may exist. This hypothesis largely emerged from my lariat isolation attempts in worm and fly. In dbr1 null worm RNA I did not observe RNA running in a 2D gel arc (Fig. III-S1A). In RNA from fly cells, I did not observe a noticeable difference in 2D gel arc intensity between WT and DBR1 RNAi treated samples (Fig. III-1).

Alternative approaches to enrich for lariats that do not require 2D gels are attractive ways to isolate large amounts of lariat RNA. One option is to use a catalytically inactive version of DBR1 created by mutating DBR1 residues implicated in debranchase activity (Findlay, Boyle, Hause, Klein, & Shendure, 2014; Khalid, Damha, Shuman, & Schwer, 2005; Montemayor et al., 2014; Ooi et al., 2001). Introduction of this mutant enzyme into the organism could be used to co-immunoprecipitate (co-IP) DBR1 with lariat RNAs, with or without crosslinking (Ooi et al., 2001; Ule, Jensen, Mele, & Darnell, 2005). Digesting the proteins will leave intact lariat RNA. An analogous approach would be to use an antibody that recognizes branched RNAs (Ingolia et al., 2009; Reilly et al., 1990) for the co-IP instead of mutant DBR1, which would circumvent the need to create a catalytically inactive debranching enzyme. Another way to enrich for lariats would be to use a cocktail of exonucleases, including both 5' to 3' and 3' to 5' exonucleases, to digest all RNAs except circular and lariat RNAs. Using a thermostable RNase R as one of these enzymes, or a similar enzyme such as hDIS3L2 (Lubas et al., 2013), would allow heating of the RNA to disrupt secondary structure that might otherwise prevent RNase R digestion at conventional reaction temperatures.

Additional approaches could take advantage of the “Y”-shaped structure created by a cleavage event inside the lariat loop to isolate or sequence branched RNA. One option to enrich for branched RNAs would be to nick lariat loops with a limited RNase digestion, and then isolate “Y”-shaped RNAs that have one free 5' end and two free 3' ends. Ligation with a mix of two different 3' adapters should yield a population of “Y”-shaped RNAs containing both adapters. Sequential RNA pulldowns on each adapter should enrich for nicked lariat RNAs with two free 3' ends from which BPs can be sequenced. Alternatively, limited digestion to nick RNAs prior to 2D gel electrophoresis could allow large introns to be cleaved to produce smaller “Y”-shaped molecules that could be isolated using a 2D gel.

Limited digestion to produce “Y”-shaped RNAs could also be used after isolating a pure population of lariats to reduce the size of the lariat loop for long introns. Ligation after digestion could produce smaller lariat loops amenable to the current Branch-seq protocol (similar to Fig. I-3 and I-4A). This ligation step could alternatively be used to add adapters inside the lariat loop, forcing production of LJ reads (Chapter 2) that are informative for identifying BP position (similar to Fig. I-3 and I-4B).

A last approach could take advantage of the sequence 3' of the BP for isolation and sequencing of BPs. First, one could design capture probes to the 3'SS-exon boundaries to enrich for pre-mRNAs, followed by primer extension with a primer complementary to the 3'SS-exon junction. RNAs that have undergone the first but not the second step of splicing should produce an RT stop. As a control, primer extension following debranching should reduce RT stops. Note that RT only rarely reads through the BP nucleotide in its lariat form during primer extension. In a low-throughput experiment, I observed that RT inserts several cytosine nucleotides at the BP nucleotide in this case. This information could be used to identify rare sequences where RT does not stop at the BP.

Figures

Figure I-1: Debranching and gel-shift-like behaviors of lariats. Lariats were produced by the coupled in vitro splicing reaction followed by RNase R digestion. Contrast was adjusted and false colors were added using the Typhoon. (A) Using Abnova human DBR1, different co-precipitants inhibit debranching activity to various extents. The arrow points to undigested U6 snRNA. Repeated pairs of experiments are shown side by side. (B) Addition of SuperaseIn to lariat RNA results in shifted mobility of the lariat RNA. 10% TBE-urea gel. Samples were in denaturing loading buffer and were heated prior to gel loading.


Figure I-2: S. cerevisiae 2D gels of different densities in D2 show different separation of the arc from the diagonal. All D1: 6%. Samples are total yeast RNA unless otherwise noted. (A) D2: 10%. Small gels. (B) D2: 15%. The left gel was run using the low range ssRNA ladder. Small gels. (C) D2: 20%. Large gels.


[Figure I-3 schematic; step labels recovered from the figure:]
1. Isolation of total RNA (Trizol)
2. Depletion of rRNA (Ribo-Zero)
3. Digestion of linear RNA (RNase R)
4. Addition of poly(A) tail, biotin (PA polymerase, ATP, ATP-biotin)
5. Bead capture (streptavidin), partial digestion (RNase A)

Figure I-3: Common portion of the originally proposed Branch-seq protocol for lariat enrichment and capture. Dotted white line: 5'SS. Open circle: BP. Grey circle: biotin. Grey “Y”: streptavidin capture. Thin line: intron. Boxes: exons.


[Figure I-4 schematic; step labels recovered from the figure:]
(A) 1.1. RNA circularization (RNA ligase); 1.2. Debranching (Dbr1); 1.3. RT (adapter-polyT-VN primer); 1.4. Circularize & PCR (adapter primers); 1.5. Sequencing and analysis
(B) 2.1. Addition of adapters (RNA ligase); 2.2. RNA circularization (RNA ligase); 2.3. RT (adapter primer, specific RT enzyme); 2.4. PCR (adapter primers); 2.5. Sequencing and analysis

Figure I-4: Protocols for generation of Illumina libraries for BP and 5'SS sequencing from captured RNA lariats (continuation from Fig. I-3). (A) Branch-SeqV1, which uses debranching followed by anchored adapter-polyT priming to attach adapters. (B) Branch-SeqV2, which uses RNA ligation to attach adapters. Dotted white line: 5'SS. Long dotted grey line: sequencing primers. Short dotted grey line: path of RT. Open circle: BPS. Grey circle: biotin. Grey “Y”: streptavidin capture. Thin line: intron. Boxes: exons.

References

Findlay, G. M., Boyle, E. A., Hause, R. J., Klein, J. C., & Shendure, J. (2014). Saturation editing of genomic regions by multiplex homology-directed repair. Nature, 513(7516), 120–123. doi:10.1038/nature13695

Folco, E. G., & Reed, R. (2014). In vitro systems for coupling RNAP II transcription to splicing and polyadenylation. Methods in Molecular Biology (Clifton, NJ), 1126, 169–177. doi:10.1007/978-1-62703-980-2_13

Garrey, S. M., Katolik, A., Prekeris, M., Li, X., York, K., Bernards, S., et al. (2014). A homolog of lariat-debranching enzyme modulates turnover of branched RNA. RNA (New York, N.Y.), 20(8), 1337–1348. doi:10.1261/rna.044602.114

Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S., & Weissman, J. S. (2009). Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science (New York, N.Y.), 324(5924), 218–223. doi:10.1126/science.1168978

Katz, Y., Li, F., Lambert, N. J., Sokol, E. S., Tam, W.-L., Cheng, A. W., et al. (2014). Musashi proteins are post-transcriptional regulators of the epithelial-luminal cell state. eLife, 3, e03915. doi:10.7554/eLife.03915

Khalid, M. F., Damha, M. J., Shuman, S., & Schwer, B. (2005). Structure-function analysis of yeast RNA debranching enzyme (Dbr1), a manganese-dependent phosphodiesterase. Nucleic Acids Research, 33(19), 6349–6360. doi:10.1093/nar/gki934

Lubas, M., Damgaard, C. K., Tomecki, R., Cysewski, D., Jensen, T. H., & Dziembowski, A. (2013). Exonuclease hDIS3L2 specifies an exosome-independent 3'-5' degradation pathway of human cytoplasmic mRNA. The EMBO Journal, 32(13), 1855–1868. doi:10.1038/emboj.2013.135

Montemayor, E. J., Katolik, A., Clark, N. E., Taylor, A. B., Schuermann, J. P., Combs, D. J., et al. (2014). Structural basis of lariat RNA recognition by the intron debranching enzyme Dbr1. Nucleic Acids Research, 42(16), 10845–10855. doi:10.1093/nar/gku725

Ooi, S. L., Dann, C., Nam, K., Leahy, D. J., Damha, M. J., & Boeke, J. D. (2001). RNA lariat debranching enzyme. Methods in Enzymology, 342, 233–248.

Reilly, J. D., Freeman, S. K., Melhem, R. F., Kierzek, R., Caruthers, M. H., Edmonds, M., & Munns, T. W. (1990). Antibodies specific for branched ribonucleic acids. Analytical Biochemistry, 185(1), 125–130.

Salzman, J., Gawad, C., Wang, P. L., Lacayo, N., & Brown, P. O. (2012). Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PloS One, 7(2), e30733. doi:10.1371/journal.pone.0030733

Ule, J., Jensen, K., Mele, A., & Darnell, R. B. (2005). CLIP: a method for identifying protein-RNA interaction sites in living cells. Methods (San Diego, Calif.), 37(4), 376–386. doi:10.1016/j.ymeth.2005.07.018


Appendix II: Supplemental Tables to Chapter 2

Table II-S1. Branch-seq BP peaks paired 5'SS motifs.

No. mismatches from /GTATGT       0     1     2     3     4     5     6   All   0 or 1 mut   % 0 or 1 mut
153 annotated BPs: winBP        103    35     7     7     1     0     0   153          138          90.20
191 putative novel BPs: winBP    64    62    28    13    18     5     1   191          126          65.97
196 annotated BPs: GEM-BP       105    44    24    16     7     0     0   196          149          76.02
350 putative novel BPs: GEM-BP   61   161    78    34    12     3     1   350          222          63.43
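The mismatch counts and the "% 0 or 1 mut" column in Table II-S1 can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a mismatch is counted position by position against the six-nucleotide GTATGT consensus (a Hamming distance), which is how the table appears to be tabulated.

```python
from typing import Sequence

CONSENSUS_5SS = "GTATGT"  # consensus 5'SS motif named in the table header


def count_mismatches(motif: str, consensus: str = CONSENSUS_5SS) -> int:
    """Count positions where motif differs from the consensus (Hamming distance)."""
    if len(motif) != len(consensus):
        raise ValueError("motif and consensus must have the same length")
    return sum(a != b for a, b in zip(motif.upper(), consensus))


def percent_zero_or_one(counts: Sequence[int]) -> float:
    """Percentage of peaks with 0 or 1 mismatches.

    counts[i] is the number of peaks whose 5'SS has i mismatches.
    """
    return 100.0 * (counts[0] + counts[1]) / sum(counts)


# Row "153 annotated BPs: winBP" from Table II-S1
winbp_annotated = [103, 35, 7, 7, 1, 0, 0]
print(count_mismatches("GTAAGT"))                      # one mismatch (position 4)
print(f"{percent_zero_or_one(winbp_annotated):.2f}")   # 90.20, as in the table
```

Applying `percent_zero_or_one` to the other three rows gives 65.97, 76.02, and 63.43, in agreement with the table.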

Table II-S2. GEM-BP and winBP peaks

5ss_bp_pair is in the format chr:nt1:nt2:strand, where nt1 < nt2.
On the plus strand, nt1 is the 5'SS and nt2 is the BP.
On the reverse strand, nt1 is the BP and nt2 is the 5'SS.
In general 1 = true and 0 = false, except for 5ss_mm, which is the number of mismatches at the 5'SS from GTATGT.
u_a and u_n are the union of annotated and novel BPs, respectively, of both peak callers.
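For readers who want to work with the table programmatically, the strand rule in the legend can be applied as follows. This is a minimal sketch, not part of the original analysis: the `SsBpPair` type and `parse_5ss_bp_pair` function are hypothetical names, and the code simply follows the legend (nt1 is the 5'SS on the plus strand and the BP on the reverse strand) without checking that nt1 < nt2.

```python
from typing import NamedTuple


class SsBpPair(NamedTuple):
    chrom: str
    five_ss: int  # coordinate of the 5'SS
    bp: int       # coordinate of the branch point
    strand: str


def parse_5ss_bp_pair(pair: str) -> SsBpPair:
    """Split a chr:nt1:nt2:strand string and assign 5'SS and BP by strand."""
    chrom, nt1_text, nt2_text, strand = pair.split(":")
    nt1, nt2 = int(nt1_text), int(nt2_text)
    if strand == "+":
        # Plus strand: nt1 is the 5'SS and nt2 is the BP
        return SsBpPair(chrom, nt1, nt2, strand)
    if strand == "-":
        # Reverse strand: nt1 is the BP and nt2 is the 5'SS
        return SsBpPair(chrom, nt2, nt1, strand)
    raise ValueError(f"unexpected strand: {strand!r}")


print(parse_5ss_bp_pair("chrXII:856988:857038:+"))
print(parse_5ss_bp_pair("chrXII:564467:564515:-"))
```

Comparing the parsed `bp` field with the bp_nt column is a quick consistency check on each row.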

chr bp_nt strand 5ss_bp_pair winBP GEM-BP bp_anno 5ss_anno 5ss_mm 5ss u_a u_n cnBP

chrXII 564467 - chrXII:564467:564515:- 1 1 1 1 1 GTAAGT 1 0 0
chrIV 1238769 - chrIV:1238769:1238817:- 1 1 1 1 0 GTATGT 1 0 0
chrXII 857038 + chrXII:856988:857038:+ 1 1 1 0 2 GTAAGC 1 0 0
chrII 479389 + chrII:479340:479389:+ 1 1 1 1 0 GTATGT 1 0 0
chrXII 694444 + chrXII:694385:694444:+ 1 1 1 1 0 GTATGT 1 0 0
chrXIII 666961 - chrXIII:666961:667016:- 1 1 1 1 0 GTATGT 1 0 0
chrV 307801 + chrV:307743:307801:+ 1 1 1 1 0 GTATGT 1 0 0
chrX 365853 + chrX:365780:365853:+ 1 1 1 1 0 GTATGT 1 0 0
chrIV 1319751 - chrIV:1319751:1319809:- 1 1 1 1 0 GTATGT 1 0 0
chrII 407047 - chrII:407047:407116:- 1 1 1 1 0 GTATGT 1 0 0
chrIV 1103871 + chrIV:1103808:1103871:+ 1 1 1 1 0 GTATGT 1 0 0
chrIX 166484 + chrIX:166432:166484:+ 1 1 1 1 0 GTATGT 1 0 0
chrVII 157245 - chrVII:157245:157288:- 1 1 1 1 0 GTATGT 1 0 0
chrII 142787 - chrII:142787:142849:- 1 1 1 1 0 GTATGT 1 0 0
chrV 548611 + chrV:548548:548611:+ 1 1 1 1 0 GTATGT 1 0 0
chrIV 337596 + chrIV:337525:337596:+ 1 1 1 1 1 GTAAGT 1 0 0
chrXI 625594 + chrXI:625544:625594:+ 1 1 1 1 1 GTAAGT 1 0 0
chrV 159013 - chrV:159013:159086:- 1 1 1 1 0 GTATGT 1 0 0
chrXIV 48377 + chrXIV:48293:48377:+ 1 1 1 1 0 GTATGT 1 0 0
chrXIII 140113 - chrXIII:140113:140183:- 1 1 1 1 0 GTATGT 1 0 0
chrIV 267780 + chrIV:267726:267780:+ 1 1 1 1 0 GTATGT 1 0 0
chrX 469190 - chrX:469190:469256:- 1 1 1 1 2 GTTCGT 1 0 0
chrXVI 883453 + chrXVI:883384:883453:+ 1 1 1 1 0 GTATGT 1 0 0
chrIV 1266790 - chrIV:1266790:1266854:- 1 1 1 1 0 GTATGT 1 0 0
chrXII 286498 - chrXII:286498:286557:- 1 1 1 1 0 GTATGT 1 0 0
chrIV 1319627 - chrIV:1319627:1319690:- 1 1 1 1 0 GTATGT 1 0 0
chrIV 1237581 + chrIV:1237527:1237581:+ 1 1 1 0 1 GTTTGT 1 0 0
chrXVI 218711 + chrXVI:218646:218711:+ 1 1 1 1 0 GTATGT 1 0 0
chrI 87439 + chrI:87389:87439:+ 1 1 1 1 1 GTAAGT 1 0 0
chrIV 1450485 - chrIV:1450485:1450533:- 1 1 1 0 3 GTTGTT 1 0 0
chrXII 766086 - chrXII:766086:766129:- 1 1 1 1 0 GTATGT 1 0 0
chrII 170771 + chrII:170680:170771:+ 1 1 1 1 0 GTATGT 1 0 0
chrIV 254999 - chrIV:254999:255044:- 1 1 1 1 0 GTATGT 1 0 0

chrVI 63915 - chrVI:63915:63972:- 1 1 1 1 0 GTATGT 1 0 0
chrII 47071 - chrII:47071:47143:- 1 1 1 1 0 GTATGT 1 0 0
chrVIII 129611 + chrVIII:129523:129611:+ 1 1 1 1 0 GTATGT 1 0 0
chrXIII 206162 + chrXIII:206098:206162:+ 1 1 1 1 0 GTATGT 1 0 0
chrIV 239421 - chrIV:239421:239509:- 1 1 1 1 0 GTATGT 1 0 0
chrIII 173137 - chrIII:173137:173194:- 1 1 1 1 0 GTATGT 1 0 0
chrVII 62173 + chrVII:62132:62173:+ 1 1 1 1 0 GTATGT 1 0 0
chrXIII 854879 + chrXIII:854816:854879:+ 1 1 1 1 0 GTATGT 1 0 0
chrX 435302 + chrX:435222:435302:+ 1 1 1 1 0 GTATGT 1 0 0
chrII 462255 + chrII:462204:462255:+ 1 1 1 1 1 GTATGA 1 0 0
chrV 308040 + chrV:307953:308040:+ 1 1 1 1 0 GTATGT 1 0 0
chrXII 786667 + chrXII:786616:786667:+ 1 1 1 1 0 GTATGT 1 0 0
chrXVI 174036 + chrXVI:173971:174036:+ 1 1 1 0 3 GTTTTA 1 0 0
chrIV 733713 - chrIV:733713:733773:- 1 1 1 1 0 GTATGT 1 0 0
chrXV 92476 - chrXV:92476:92521:- 1 1 1 0 3 GTAATG 1 0 0
chrXVI 911328 + chrXVI:911273:911328:+ 1 1 1 1 0 GTATGT 1 0 0
chrXI 618096 - chrXI:618096:618168:- 1 1 1 0 1 GTTTGT 1 0 0
chrVIII 189778 - chrVIII:189778:189843:- 1 1 1 1 0 GTATGT 1 0 0
chrIV 65358 + chrIV:65308:65358:+ 1 1 1 1 0 GTATGT 1 0 0
chrXVI 492958 - chrXVI:492958:493018:- 1 1 1 1 0 GTATGT 1 0 0
chrII 726947 - chrII:726947:727006:- 1 1 1 1 0 GTATGT 1 0 0
chrXVI 833779 + chrXVI:833690:833779:+ 1 1 1 1 0 GTATGT 1 0 0
chrII 653429 + chrII:653364:653429:+ 1 1 1 1 0 GTATGT 1 0 0
chrII 186375 - chrII:186375:186430:- 1 1 1 1 0 GTATGT 1 0 0
chrX 74178 + chrX:74112:74178:+ 1 1 1 0 2 GGTTGT 1 0 0
chrIV 715265 - chrIV:715265:715356:- 1 1 1 1 0 GTATGT 1 0 0
chrVII 346854 - chrVII:346854:346896:- 1 1 1 1 0 GTATGT 1 0 0
chrV 131853 + chrV:131777:131853:+ 1 1 1 1 0 GTATGT 1 0 0
chrXII 744219 + chrXII:744156:744219:+ 1 1 1 1 1 GTACGT 1 0 0
chrXII 1024631 + chrXII:1024570:1024631:+ 1 1 1 1 1 GTAAGT 1 0 0
chrXIV 545341 + chrXIV:545293:545341:+ 1 1 1 1 1 GTAAGT 1 0 0
chrXIV 609837 + chrXIV:609792:609837:+ 1 1 1 1 1 GTAAGT 1 0 0
chrII 462472 + chrII:462424:462472:+ 1 1 1 1 0 GTATGT 1 0 0
chrX 387380 - chrX:387380:387430:- 1 1 1 1 0 GTATGT 1 0 0
chrIV 399468 + chrIV:399360:399468:+ 1 1 1 1 0 GTATGT 1 0 0
chrII 602175 + chrII:602099:602175:+ 1 1 1 1 0 GTATGT 1 0 0
chrIII 101646 - chrIII:101646:101700:- 1 1 1 1 0 GTATGT 1 0 0
chrXIV 534915 - chrXIV:534915:534966:- 1 1 1 1 1 GTATGC 1 0 0
chrVI 203304 - chrVI:203304:203374:- 1 1 1 1 0 GTATGT 1 0 0
chrVIII 251233 + chrVIII:251158:251233:+ 1 1 1 1 0 GTATGT 1 0 0
chrXII 987191 + chrXII:987139:987191:+ 1 1 1 1 0 GTATGT 1 0 0
chrXV 900802 - chrXV:900802:900850:- 1 1 1 0 2 GTAATT 1 0 0
chrVII 497394 - chrVII:497394:497462:- 1 1 1 1 0 GTATGT 1 0 0
chrXI 447376 - chrXI:447376:447453:- 1 1 1 1 0 GTATGT 1 0 0
chrXV 242453 - chrXV:242453:242503:- 1 1 1 1 0 GTATGT 1 0 0
chrXV 240976 - chrXV:240976:241024:- 1 1 1 1 1 GTAAGT 1 0 0

chrVI 221291 - chrVI:221291:221402:- 1 1 1 1 0 GTATGT 1 0 0
chrXI 449611 - chrXI:449611:449663:- 1 1 1 1 0 GTATGT 1 0 0
chrXIV 366096 + chrXIV:366038:366096:+ 1 1 1 1 0 GTATGT 1 0 0
chrXIV 351021 + chrXIV:350960:351021:+ 1 1 1 1 0 GTATGT 1 0 0
chrIII 107089 + chrIII:107034:107089:+ 1 1 1 1 0 GTATGT 1 0 0
chrII 125234 + chrII:125158:125234:+ 1 1 1 1 0 GTATGT 1 0 0
chrV 148254 + chrV:148194:148254:+ 1 1 1 1 1 GTATGC 1 0 0
chrXIII 721231 - chrXIII:721231:721344:- 1 1 1 1 0 GTATGT 1 0 0
chrIII 177952 - chrIII:177952:178031:- 1 1 1 0 1 GTATAT 1 0 0
chrXIII 123777 - chrXIII:123777:123824:- 1 0 1 0 3 ATCTCT 1 0 0
chrIV 569665 - chrIV:569665:569722:- 1 1 1 1 0 GTATGT 1 0 0
chrVIII 498731 - chrVIII:498731:498786:- 1 1 1 1 0 GTATGT 1 0 0
chrXI 437526 + chrXI:437481:437526:+ 1 1 1 1 0 GTATGT 1 0 0
chrXIII 337886 + chrXIII:337817:337886:+ 1 1 1 1 1 GTACGT 1 0 0
chrXI 83061 + chrXI:83004:83061:+ 1 1 1 1 0 GTATGT 1 0 0
chrVIII 107875 + chrVIII:107827:107875:+ 1 1 1 1 1 GTAAGT 1 0 0
chrXII 327295 - chrXII:327295:327400:- 1 1 1 1 0 GTATGT 1 0 0
chrV 166786 - chrV:166786:166873:- 1 1 1 1 0 GTATGT 1 0 0
chrII 110451 - chrII:110451:110507:- 1 1 1 1 1 GTAAGT 1 0 0
chrVI 242044 + chrVI:241997:242044:+ 1 1 1 1 1 GTAAGT 1 0 0
chrVII 497961 - chrVII:497961:498003:- 1 0 1 1 1 GTAAGT 1 0 0
chrVII 556282 + chrVII:556232:556282:+ 1 1 1 0 3 GTGGAT 1 0 0
chrX 50351 - chrX:50351:50411:- 1 1 1 1 0 GTATGT 1 0 0
chrIII 107255 + chrIII:107192:107255:+ 1 1 1 1 0 GTATGT 1 0 0
chrII 679973 - chrII:679973:680034:- 1 1 1 1 0 GTATGT 1 0 0
chrIV 1212941 + chrIV:1212871:1212941:+ 1 1 1 1 0 GTATGT 1 0 0
chrXIV 494551 - chrXIV:494551:494633:- 1 1 1 0 3 GTAATG 1 0 0
chrVIII 298418 - chrVIII:298418:298487:- 1 1 1 1 0 GTATGT 1 0 0
chrIX 155276 + chrIX:155220:155276:+ 1 1 1 1 1 GCATGT 1 0 0
chrXIII 211547 + chrXIII:211502:211547:+ 1 1 1 0 4 TTTTAA 1 0 0
chrVIII 255689 - chrVIII:255689:255752:- 1 1 1 1 0 GTATGT 1 0 0
chrII 606615 + chrII:606567:606615:+ 1 1 1 0 1 TTATGT 1 0 0
chrVII 73034 - chrVII:73034:73136:- 1 1 1 1 1 GTACGT 1 0 0
chrVIII 354926 + chrVIII:354868:354926:+ 1 1 1 1 0 GTATGT 1 0 0
chrXVI 623662 + chrXVI:623575:623662:+ 1 1 1 1 0 GTATGT 1 0 0
chrIX 47743 + chrIX:47699:47743:+ 1 1 1 1 1 GTAAGT 1 0 0
chrXVI 407002 + chrXVI:406952:407002:+ 1 1 1 0 1 GTGTGT 1 0 0
chrXIV 145191 - chrXIV:145191:145255:- 1 1 1 1 0 GTATGT 1 0 0
chrXIV 557672 + chrXIV:557612:557672:+ 1 1 1 1 0 GTATGT 1 0 0
chrIV 579963 + chrIV:579894:579963:+ 1 1 1 0 1 GTATGG 1 0 0
chrXVI 96189 - chrXVI:96189:96233:- 1 1 1 1 0 GTATGT 1 0 0
chrV 348217 - chrV:348217:348272:- 1 1 1 1 0 GTATGT 1 0 0
chrVII 946407 + chrVII:946331:946407:+ 1 1 1 1 1 GTACGT 1 0 0
chrIX 317136 + chrIX:317062:317136:+ 1 1 1 0 2 TTCTGT 1 0 0
chrII 333365 + chrII:333319:333365:+ 1 1 1 0 2 GTAGGA 1 0 0

chrXIV 380726 - chrXIV:380726:380783:- 1 1 1 1 0 GTATGT 1 0 0
chrXVI 412958 + chrXVI:412887:412958:+ 1 1 1 0 2 TAATGT 1 0 0
chrIX 232012 - chrIX:232012:232070:- 1 1 1 0 3 GTACTG 1 0 0
chrXII 40353 - chrXII:40353:40399:- 1 1 1 1 1 GTATGC 1 0 0
chrIV 431423 - chrIV:431423:431470:- 1 1 1 1 1 GTACGT 1 0 0
chrXII 766205 - chrXII:766205:766249:- 1 1 1 1 0 GTATGT 1 0 0
chrXVI 305374 + chrXVI:305306:305374:+ 1 1 1 1 0 GTATGT 1 0 0
chrIV 1073346 - chrIV:1073346:1073398:- 1 1 1 1 1 GTATGC 1 0 0
chrXIII 82343 + chrXIII:82291:82343:+ 1 1 1 1 0 GTATGT 1 0 0
chrXII 398583 + chrXII:398534:398583:+ 1 1 1 1 0 GTATGT 1 0 0
chrIV 458048 - chrIV:458048:458095:- 1 1 1 1 0 GTATGT 1 0 0
chrXIV 185553 + chrXIV:185493:185553:+ 1 1 1 1 0 GTATGT 1 0 0
chrXI 430163 - chrXI:430163:430239:- 1 1 1 1 0 GTATGT 1 0 0
chrXVI 729414 - chrXVI:729414:729479:- 1 1 1 1 0 GTATGT 1 0 0
chrXVI 678219 - chrXVI:678219:678276:- 1 1 1 1 0 GTATGT 1 0 0
chrXII 548706 - chrXII:548706:548765:- 1 1 1 1 0 GTATGT 1 0 0
chrX 396512 - chrX:396512:396565:- 1 1 1 1 0 GTATGT 1 0 0
chrVIII 315810 - chrVIII:315810:315861:- 1 1 1 1 0 GTATGT 1 0 0
chrVIII 187613 - chrVIII:187613:187669:- 1 1 1 1 0 GTATGT 1 0 0
chrIII 111588 - chrIII:111588:111632:- 1 1 1 1 1 GTATGC 1 0 0
chrII 110932 + chrII:110882:110932:+ 1 1 1 1 1 GTATGC 1 0 0
chrXIII 225264 - chrXIII:225264:225338:- 1 1 1 1 1 GTACGT 1 0 0
chrXIII 652797 - chrXIII:652797:652846:- 1 1 1 1 1 GTTTGT 1 0 0
chrXII 250899 - chrXII:250899:250948:- 1 1 1 1 1 GTATGG 1 0 0
chrI 151022 - chrI:151022:151098:- 1 1 1 1 0 GTATGT 1 0 0
chrXIII 537527 + chrXIII:537448:537527:+ 1 1 1 1 0 GTATGT 1 0 0
chrXIII 99301 - chrXIII:99301:99375:- 1 1 1 1 0 GTATGT 1 0 0
chrVII 543708 + chrVII:543643:543708:+ 1 1 1 1 0 GTATGT 1 0 0
chrII 366534 - chrII:366534:366582:- 1 1 1 1 0 GTATGT 1 0 0
chrII 426524 - chrII:426524:426624:- 0 1 1 0 1 GTAAGT 1 0 0
chrVIII 104751 + chrVIII:104684:104751:+ 0 1 1 0 2 ACATGT 1 0 0
chrXIII 223421 - chrXIII:223421:223496:- 0 1 1 0 2 GTACGC 1 0 0
chrV 269803 - chrV:269803:269865:- 0 1 1 0 3 GTGTTA 1 0 0
chrX 608540 + chrX:608476:608540:+ 0 1 1 0 1 GTAGGT 1 0 0
chrXIII 753788 - chrXIII:753788:753832:- 0 1 1 0 4 ATAAAA 1 0 0
chrXIII 517842 + chrXIII:517791:517842:+ 0 1 1 0 3 GTTAGC 1 0 0
chrVIII 138274 - chrVIII:138274:138323:- 0 1 1 0 1 GTCTGT 1 0 0
chrIV 307375 - chrIV:307375:307457:- 0 1 1 0 2 GGAAGT 1 0 0
chrVI 64615 - chrVI:64615:64682:- 0 1 1 0 4 GCAGTA 1 0 0
chrXIV 62407 - chrXIV:62407:62472:- 0 1 1 0 3 AGTTGT 1 0 0
chrX 702871 - chrX:702871:702937:- 0 1 1 0 1 TTATGT 1 0 0
chrXV 93868 - chrXV:93868:93914:- 0 1 1 0 2 GTACGG 1 0 0
chrVIII 262372 - chrVIII:262372:262442:- 0 1 1 1 0 GTATGT 1 0 0
chrXIII 425113 + chrXIII:425020:425113:+ 0 1 1 0 3 TTTTCT 1 0 0
chrII 168770 + chrII:168696:168770:+ 0 1 1 0 2 GTAATT 1 0 0

chrXVI 345483 - chrXVI:345483:345574:- 0 1 1 0 4 TAACGG 1 0 0
chrXII 242652 + chrXII:242595:242652:+ 0 1 1 0 2 GTACTT 1 0 0
chrXIII 732835 + chrXIII:732773:732835:+ 0 1 1 0 2 GTATTA 1 0 0
chrV 397225 + chrV:397147:397225:+ 0 1 1 0 3 TGTTGT 1 0 0
chrIV 630016 + chrIV:629904:630016:+ 0 1 1 1 0 GTATGT 1 0 0
chrXIII 499898 - chrXIII:499898:499948:- 0 1 1 0 1 TTATGT 1 0 0
chrVII 311486 + chrVII:311402:311486:+ 0 1 1 0 2 GGTTGT 1 0 0
chrXV 778910 - chrXV:778910:778959:- 0 1 1 0 4 GTCGAA 1 0 0
chrVIII 382356 - chrVIII:382356:382410:- 0 1 1 0 4 ATACAA 1 0 0
chrV 433162 + chrV:433110:433162:+ 0 1 1 0 3 TTCTCT 1 0 0
chrIV 230262 + chrIV:230202:230262:+ 0 1 1 0 1 GTATTT 1 0 0
chrIV 491873 + chrIV:491792:491873:+ 0 1 1 0 3 ATTTTT 1 0 0
chrXIV 623270 + chrXIV:623196:623270:+ 0 1 1 0 3 GTACAG 1 0 0
chrXVI 76014 - chrXVI:76014:76071:- 0 1 1 0 2 GTCTTT 1 0 0
chrXII 522984 + chrXII:522893:522984:+ 0 1 1 0 1 GTATGC 1 0 0
chrVII 435728 + chrVII:435686:435728:+ 0 1 1 0 2 TTAAGT 1 0 0
chrXVI 405426 + chrXVI:405372:405426:+ 0 1 1 0 3 ATTTAT 1 0 0
chrV 239667 - chrV:239667:239710:- 0 1 1 1 1 GTACGT 1 0 0
chrXI 283016 + chrXI:282962:283016:+ 0 1 1 0 2 GGAAGT 1 0 0
chrXIV 415872 + chrXIV:415827:415872:+ 0 1 1 0 2 GCATGA 1 0 0
chrXV 867516 + chrXV:867439:867516:+ 0 1 1 0 3 GTAAAA 1 0 0
chrX 76268 + chrX:76213:76268:+ 0 1 1 0 2 GTATTG 1 0 0
chrIX 225834 - chrIX:225834:225896:- 0 1 1 1 1 GTATAT 1 0 0
chrXII 263540 + chrXII:263478:263540:+ 0 1 1 0 4 GCAGTA 1 0 0
chrXIV 331803 + chrXIV:331744:331803:+ 0 1 1 0 2 ATATGC 1 0 0
chrIX 348380 - chrIX:348380:348491:- 0 1 1 1 1 GTATGA 1 0 0
chrXIII 651593 + chrXIII:651524:651593:+ 0 1 1 0 2 GTACAT 1 0 0
chrIV 217970 + chrIV:217907:217970:+ 0 1 1 0 2 GTATTA 1 0 0
chrXVI 654534 + chrXVI:654471:654534:+ 0 1 1 0 2 GTATAA 1 0 0
chrXVI 335968 + chrXVI:335905:335968:+ 1 0 0 0 0 GTATGT 0 1 1
chrIV 929251 + chrIV:929206:929251:+ 1 0 0 0 1 GTATGA 0 0 0
chrX 264006 - chrX:264006:264058:- 1 0 0 0 2 GTGCGT 0 0 0
chrV 305682 + chrV:305624:305682:+ 1 0 0 0 0 GTATGT 0 0 0
chrXII 609447 - chrXII:609447:609498:- 1 1 0 0 1 GTATGA 0 1 1
chrIV 331276 + chrIV:331190:331276:+ 1 0 0 0 1 GTAAGT 0 0 0
chrIV 392679 + chrIV:392591:392679:+ 1 0 0 0 0 GTATGT 0 0 0
chrV 374670 - chrV:374670:374721:- 1 0 0 0 1 GTTTGT 0 0 0
chrXII 818678 + chrXII:818645:818678:+ 1 0 0 0 3 GTGGCT 0 1 0
chrVII 543691 + chrVII:543643:543691:+ 1 0 0 1 0 GTATGT 0 1 1
chrV 151105 + chrV:151038:151105:+ 1 0 0 0 0 GTATGT 0 1 1
chrXII 522714 + chrXII:522672:522714:+ 1 0 0 1 0 GTATGT 0 0 0
chrXV 349520 - chrXV:349520:349598:- 1 0 0 0 0 GTATGT 0 1 1
chrII 306856 - chrII:306856:306902:- 1 0 0 0 2 GTATCA 0 0 0
chrXI 225773 - chrXI:225773:225852:- 1 0 0 0 1 GTAAGT 0 0 0
chrII 643000 - chrII:643000:643073:- 1 0 0 0 3 GCTCGT 0 1 0
chrIX 54025 + chrIX:53941:54025:+ 1 0 0 0 0 GTATGT 0 0 0

chrIV 268933 - chrIV:268933:269000:- 1 1 0 0 1 GTACGT 0 1 1
chrII 170731 + chrII:170680:170731:+ 1 0 0 1 0 GTATGT 0 1 1
chrX 172450 - chrX:172450:172499:- 1 1 0 0 4 AGTTCT 0 1 0
chrVII 594137 - chrVII:594137:594202:- 1 0 0 0 0 GTATGT 0 0 0
chrXIV 413438 - chrXIV:413438:413482:- 1 0 0 0 4 CGGCGT 0 1 0
chrXII 327338 - chrXII:327338:327400:- 1 0 0 0 4 CCATTA 0 1 0
chrV 540383 - chrV:540383:540436:- 1 0 0 0 0 GTATGT 0 1 1
chrXV 117635 - chrXV:117635:117684:- 1 0 0 0 1 GTATGC 0 1 1
chrXI 231492 + chrXI:231447:231492:+ 1 1 0 0 1 GTATGA 0 1 1
chrXVI 883586 - chrXVI:883586:883649:- 1 0 0 0 1 GTATAT 0 0 0
chrXII 605300 - chrXII:605300:605434:- 1 0 0 0 3 GCTCGT 0 1 0
chrXVI 481182 - chrXVI:481182:481238:- 1 0 0 0 1 GTACGT 0 0 0
chrIV 768399 - chrIV:768399:768453:- 1 0 0 0 1 ATATGT 0 0 0
chrIX 127052 - chrIX:127052:127135:- 1 0 0 0 1 GTATAT 0 0 0
chrIII 228654 + chrIII:228610:228654:+ 1 0 0 0 1 GTATGA 0 0 0
chrVII 497408 - chrVII:497408:497462:- 1 0 0 1 0 GTATGT 0 1 1
chrXVI 656520 - chrXVI:656520:656572:- 1 1 0 0 2 GTAGTT 0 1 0
chrII 221024 + chrII:220982:221024:+ 1 0 0 0 1 GTATGA 0 1 1
chrXII 382387 + chrXII:382302:382387:+ 1 0 0 0 0 GTATGT 0 1 1
chrIII 42034 + chrIII:41999:42034:+ 1 0 0 0 3 CTAATT 0 1 0
chrII 443734 - chrII:443734:443827:- 1 1 0 0 0 GTATGT 0 1 1
chrX 234005 + chrX:233962:234005:+ 1 0 0 0 4 CTGGCT 0 1 0
chrXII 50295 + chrXII:50224:50295:+ 1 0 0 0 0 GTATGT 0 0 0
chrXVI 445519 - chrXVI:445519:445573:- 1 1 0 0 1 GTGTGT 0 1 1
chrII 341037 - chrII:341037:341094:- 1 0 0 0 0 GTATGT 0 0 0
chrXIII 480618 - chrXIII:480618:480665:- 1 0 0 0 4 ATGACT 0 1 0
chrIX 261764 + chrIX:261723:261764:+ 1 0 0 0 1 GTATGA 0 0 0
chrXV 791703 - chrXV:791703:791753:- 1 1 0 0 2 TTACGT 0 1 0
chrII 291715 - chrII:291715:291771:- 1 0 0 0 1 GTAAGT 0 1 1
chrIV 1145148 + chrIV:1145107:1145148:+ 1 0 0 0 1 GTACGT 0 1 1
chrXIV 583910 - chrXIV:583910:583971:- 1 1 0 0 0 GTATGT 0 1 1
chrV 258656 + chrV:258593:258656:+ 1 0 0 0 2 GTAGGA 0 0 0
chrIV 235105 + chrIV:235039:235105:+ 1 0 0 0 1 GTATGA 0 0 0
chrIV 1490516 - chrIV:1490516:1490571:- 1 0 0 0 0 GTATGT 0 0 0
chrXI 272946 - chrXI:272946:272992:- 1 0 0 0 0 GTATGT 0 0 0
chrXI 67767 - chrXI:67767:67805:- 1 0 0 0 4 GACTAA 0 1 0
chrXII 491543 + chrXII:491497:491543:+ 1 0 0 0 1 GTATAT 0 0 0
chrXV 448694 + chrXV:448667:448694:+ 1 0 0 0 3 TAATAT 0 1 0
chrXIV 429574 - chrXIV:429574:429637:- 1 0 0 0 0 GTATGT 0 0 0
chrIV 122157 + chrIV:122079:122157:+ 1 0 0 1 0 GTATGT 0 0 0
chrV 166806 - chrV:166806:166873:- 1 0 0 1 0 GTATGT 0 1 1
chrXVI 281450 - chrXVI:281450:281502:- 1 0 0 1 0 GTATGT 0 1 1
chrVIII 148769 - chrVIII:148769:148814:- 1 1 0 0 1 GTATGC 0 1 1
chrXIII 652808 - chrXIII:652808:652846:- 1 0 0 1 1 GTTTGT 0 1 1
chrVII 143935 - chrVII:143935:143984:- 1 0 0 0 0 GTATGT 0 1 1
chrVII 472352 - chrVII:472352:472397:- 1 0 0 0 2 GTGAGT 0 1 0

chrXIV 763544 + chrXIV:763498:763544:+ 1 0 0 0 3 GTACAG 0 0 0
chrVII 555885 + chrVII:555835:555885:+ 1 0 0 1 0 GTATGT 0 1 1
chrXI 74700 + chrXI:74657:74700:+ 1 0 0 0 4 GCGACT 0 1 0
chrII 142813 - chrII:142813:142850:- 1 0 0 0 3 CTAAGA 0 1 0
chrII 342755 + chrII:342697:342755:+ 1 0 0 0 1 GTGTGT 0 0 0
chrXII 286519 - chrXII:286519:286557:- 1 0 0 1 0 GTATGT 0 1 1
chrIV 992896 + chrIV:992862:992896:+ 1 0 0 0 4 CCATCG 0 1 0
chrVII 574775 + chrVII:574705:574775:+ 1 0 0 0 1 GTATTT 0 0 0
chrXII 83742 - chrXII:83742:83803:- 1 1 0 0 3 GTTGTT 0 1 0
chrIII 123571 - chrIII:123571:123637:- 1 0 0 0 3 GTCTAG 0 1 0
chrIV 107156 + chrIV:107107:107156:+ 1 0 0 0 1 GTAGGT 0 1 1
chrIV 22303 - chrIV:22303:22355:- 1 0 0 0 1 GTAAGT 0 1 1
chrVII 439462 + chrVII:439387:439462:+ 1 0 0 0 1 GTAAGT 0 1 1
chrXIII 182669 + chrXIII:182594:182669:+ 1 0 0 0 0 GTATGT 0 1 1
chrI 152002 - chrI:152002:152055:- 1 0 0 0 2 GTATAC 0 0 0
chrXV 720497 + chrXV:720444:720497:+ 1 0 0 0 0 GTATGT 0 0 0
chrXVI 729395 - chrXVI:729395:729479:- 1 0 0 1 0 GTATGT 0 1 1
chrIX 166501 + chrIX:166432:166501:+ 1 0 0 1 0 GTATGT 0 1 1
chrIV 1256999 - chrIV:1256999:1257056:- 1 0 0 0 4 GGTTAG 0 1 0
chrVII 167428 + chrVII:167361:167428:+ 1 0 0 1 0 GTATGT 0 0 0
chrXIV 427176 + chrXIV:427106:427176:+ 1 0 0 0 1 GTATGA 0 0 0
chrXII 987202 + chrXII:987139:987202:+ 1 0 0 1 0 GTATGT 0 0 0
chrXIV 174520 - chrXIV:174520:174572:- 1 0 0 0 2 GTATCG 0 0 0
chrXV 845020 + chrXV:844959:845020:+ 1 0 0 0 0 GTATGT 0 0 0
chrXVI 218695 + chrXVI:218646:218695:+ 1 0 0 1 0 GTATGT 0 1 1
chrVIII 189806 - chrVIII:189806:189844:- 1 0 0 0 4 GCTGAT 0 1 0
chrVII 253204 - chrVII:253204:253253:- 1 0 0 0 1 GTATGA 0 0 0
chrIX 128117 + chrIX:128061:128117:+ 1 0 0 0 2 TTTTGT 0 0 0
chrXVI 937537 - chrXVI:937537:937617:- 1 0 0 0 0 GTATGT 0 1 1
chrIV 456687 - chrIV:456687:456757:- 1 0 0 0 1 GTAAGT 0 0 0
chrXI 193009 - chrXI:193009:193071:- 1 0 0 0 0 GTATGT 0 0 0
chrXI 93382 - chrXI:93382:93470:- 1 0 0 1 1 GTACGT 0 0 0
chrIV 655244 + chrIV:655202:655244:+ 1 0 0 0 1 GTATGC 0 1 1
chrV 292571 - chrV:292571:292618:- 1 1 0 0 2 GTACAT 0 1 0
chrXIV 380699 - chrXIV:380699:380783:- 1 0 0 1 0 GTATGT 0 0 0
chrI 142338 + chrI:142256:142338:+ 1 0 0 1 0 GTATGT 0 0 0
chrII 606346 + chrII:606276:606346:+ 1 0 0 1 0 GTATGT 0 1 1
chrXII 744185 + chrXII:744185:744221:+ 1 0 0 0 4 ATAATC 0 1 0
chrXV 505980 + chrXV:505939:505980:+ 1 0 0 1 0 GTATGT 0 1 1
chrX 71474 + chrX:71426:71474:+ 1 0 0 0 2 GTTGGT 0 0 0
chrVIII 406673 - chrVIII:406673:406756:- 1 0 0 0 1 GCATGT 0 1 1
chrIV 451430 - chrIV:451430:451488:- 1 0 0 0 1 GTATGG 0 1 1
chrVII 561204 - chrVII:561204:561267:- 1 0 0 0 2 GGTTGT 0 0 0
chrVI 176326 - chrVI:176326:176385:- 1 0 0 0 1 GTATGG 0 0 0
chrVII 700757 + chrVII:700717:700757:+ 1 0 0 0 4 TCCTGA 0 1 0
chrXIV 373657 + chrXIV:373596:373657:+ 1 0 0 0 1 GTATGA 0 0 0

chrXV 518763 - chrXV:518763:518826:- 1 0 0 0 1 ATATGT 0 0 0
chrIX 387166 + chrIX:387136:387166:+ 1 0 0 0 5 CCAACA 0 1 0
chrXIII 290869 + chrXIII:290835:290869:+ 1 0 0 0 4 CACCGT 0 1 0
chrXII 982467 - chrXII:982467:982535:- 1 0 0 0 0 GTATGT 0 0 0
chrX 570579 - chrX:570579:570633:- 1 0 0 0 3 GTAGTC 0 0 0
chrIV 715306 - chrIV:715264:715306:- 1 0 0 0 2 GTTAGT 0 1 0
chrV 517899 + chrV:517847:517899:+ 1 0 0 0 0 GTATGT 0 0 0
chrX 31819 + chrX:31768:31819:+ 1 0 0 0 0 GTATGT 0 0 0
chrX 435281 + chrX:435222:435281:+ 1 0 0 1 0 GTATGT 0 1 1
chrXI 468921 - chrXI:468921:468975:- 1 0 0 0 1 GTATGC 0 1 1
chrVIII 508825 + chrVIII:508795:508825:+ 1 0 0 0 6 CCGGCC 0 1 0
chrII 115534 + chrII:115478:115534:+ 1 0 0 0 0 GTATGT 0 1 1
chrXII 898621 + chrXII:898549:898621:+ 1 0 0 0 2 GTATAA 0 0 0
chrXIII 647110 + chrXIII:647059:647110:+ 1 0 0 0 1 GCATGT 0 0 0
chrIV 392638 + chrIV:392591:392638:+ 1 0 0 0 0 GTATGT 0 1 1
chrXIII 822588 - chrXIII:822588:822625:- 1 0 0 0 5 ATCAAA 0 1 0
chrXVI 777582 - chrXVI:777582:777639:- 1 0 0 0 1 GTAAGT 0 1 1
chrVIII 129590 + chrVIII:129523:129590:+ 1 0 0 1 0 GTATGT 0 1 1
chrXIII 425094 + chrXIII:424997:425094:+ 1 0 0 1 0 GTATGT 0 1 1
chrII 592709 - chrII:592709:592763:- 1 0 0 1 0 GTATGT 0 1 1
chrVII 443012 - chrVII:443012:443066:- 1 1 0 0 2 GTCAGT 0 1 0
chrVIII 255706 - chrVIII:255706:255753:- 1 0 0 0 3 GTCCAT 0 1 0
chrXV 733353 - chrXV:733353:733417:- 1 0 0 0 1 GCATGT 0 0 0
chrX 538617 + chrX:538579:538617:+ 1 0 0 0 5 TCCTAA 0 1 0
chrXIII 559828 + chrXIII:559782:559828:+ 1 0 0 0 0 GTATGT 0 0 0
chrXVI 146507 - chrXVI:146507:146597:- 1 1 0 0 1 GTATGA 0 1 1
chrXI 408145 + chrXI:408101:408145:+ 1 0 0 0 1 GTAAGT 0 0 0
chrXV 423675 - chrXV:423675:423735:- 1 1 0 0 1 GTATGG 0 1 1
chrV 336934 + chrV:336866:336934:+ 1 0 0 0 1 GTATGA 0 0 0
chrIX 183435 - chrIX:183435:183500:- 1 0 0 0 4 CCCAGT 0 1 0
chrXI 166477 + chrXI:166405:166477:+ 1 0 0 1 1 GTACGT 0 0 0
chrXI 96757 + chrXI:96692:96757:+ 1 0 0 0 1 GTAGGT 0 1 1
chrXV 594353 - chrXV:594353:594426:- 1 0 0 0 5 GCGCAA 0 1 0
chrII 331390 - chrII:331390:331438:- 1 0 0 0 1 GTATGA 0 0 0
chrX 517545 - chrX:517545:517607:- 1 1 0 0 2 GTACGC 0 1 0
chrXVI 939905 - chrXVI:939905:939951:- 1 0 0 0 3 GTACCG 0 0 0
chrXVI 602284 - chrXVI:602284:602338:- 1 0 0 0 0 GTATGT 0 1 1
chrIV 437849 + chrIV:437796:437849:+ 1 0 0 0 4 AAAGAT 0 1 0
chrXIII 551141 - chrXIII:551141:551202:- 1 0 0 1 0 GTATGT 0 0 0
chrVII 859434 - chrVII:859434:859478:- 1 0 0 1 0 GTATGT 0 1 1
chrXV 506800 - chrXV:506800:506870:- 1 0 0 0 1 GTATTT 0 0 0
chrIV 1266846 + chrIV:1266789:1266846:+ 1 0 0 0 2 GTTAGT 0 0 0
chrXI 490069 + chrXI:489994:490069:+ 1 0 0 0 0 GTATGT 0 0 0

chrII 13949 - chrII:13949:14027:- 1 1 0 0 0 GTATGT 0 1 1
chrXVI 231558 + chrXVI:231499:231558:+ 1 0 0 0 2 GTAAGA 0 0 0
chrXVI 593080 - chrXVI:593080:593139:- 1 0 0 0 2 GTATAA 0 0 0
chrXIII 832410 + chrXIII:832362:832410:+ 1 0 0 0 2 GTAAAT 0 0 0
chrX 632996 - chrX:632996:633051:- 1 1 0 0 1 GTATGC 0 1 1
chrVII 35284 - chrVII:35284:35351:- 1 0 0 0 2 GTAAGG 0 0 0
chrVI 96257 + chrVI:96184:96257:+ 1 0 0 0 1 GTATTT 0 0 0
chrXI 302664 + chrXI:302572:302664:+ 1 0 0 0 4 CTCAAT 0 1 0
chrIV 1411914 - chrIV:1411914:1411963:- 1 1 0 0 2 ACATGT 0 1 0
chrV 362828 + chrV:362729:362828:+ 1 0 0 1 0 GTATGT 0 0 0
chrVII 627186 - chrVII:627186:627225:- 1 0 0 0 4 AACTGA 0 1 0
chrVII 436362 + chrVII:436318:436362:+ 1 0 0 0 0 GTATGT 0 0 0
chrXVI 115267 + chrXVI:115219:115267:+ 1 0 0 1 0 GTATGT 0 0 0
chrIV 438275 + chrIV:438208:438275:+ 1 0 0 0 1 GTAAGT 0 1 1
chrXVI 76164 - chrXVI:76164:76223:- 1 0 0 1 0 GTATGT 0 1 1
chrXIII 608719 + chrXIII:608660:608719:+ 1 0 0 0 2 GTATTA 0 1 0
chrXV 930113 + chrXV:930063:930113:+ 1 0 0 0 1 GTAAGT 0 1 1
chrXII 155740 - chrXII:155740:155813:- 1 0 0 0 1 GTAAGT 0 0 0
chrX 432390 - chrX:432390:432464:- 1 0 0 0 2 GTATAC 0 0 0
chrV 423912 + chrV:423821:423912:+ 1 0 0 0 3 GTAGAG 0 0 0
chrIX 317159 + chrIX:317107:317159:+ 1 0 0 0 5 ATGAAA 0 1 0
chrIV 676355 + chrIV:676270:676355:+ 1 0 0 0 1 GTACGT 0 0 0
chrXVI 739748 - chrXVI:739748:739794:- 1 0 0 0 2 GTAAGC 0 0 0
chrXI 94081 - chrXI:94081:94154:- 1 1 0 0 0 GTATGT 0 1 1
chrV 201996 + chrV:201953:201996:+ 1 0 0 0 0 GTATGT 0 0 0
chrIV 540580 + chrIV:540538:540580:+ 1 0 0 0 1 GTAAGT 0 0 0
chrX 422629 + chrX:422549:422629:+ 1 0 0 0 1 GTATGA 0 0 0
chrIX 134038 + chrIX:133971:134038:+ 1 0 0 0 1 GTAAGT 0 0 0
chrIX 155301 + chrIX:155220:155301:+ 1 0 0 1 1 GCATGT 0 1 1
chrXIV 611577 + chrXIV:611503:611577:+ 1 0 0 0 0 GTATGT 0 0 0
chrXV 1062512 - chrXV:1062512:1062560:- 1 1 0 0 1 GTATGC 0 1 1
chrVII 148658 - chrVII:148658:148721:- 1 0 0 0 2 TTACGT 0 0 0
chrXVI 584022 - chrXVI:584022:584078:- 1 0 0 0 2 GTATTG 0 0 0
chrVII 383552 + chrVII:383489:383552:+ 1 0 0 0 0 GTATGT 0 0 0
chrIV 232835 + chrIV:232747:232835:+ 1 0 0 0 1 GTATAT 0 0 0
chrIV 1114332 + chrIV:1114289:1114332:+ 1 0 0 0 0 GTATGT 0 0 0
chrXI 290610 - chrXI:290610:290705:- 1 1 0 0 1 GTAGGT 0 1 1
chrXV 325375 - chrXV:325375:325450:- 1 1 0 0 2 GTATCG 0 1 0
chrVIII 236985 + chrVIII:236958:236985:+ 1 0 0 0 4 TTAAAA 0 1 0
chrV 336936 + chrV:336866:336936:+ 0 1 0 0 1 GTATGA 0 1 1
chrXI 262172 + chrXI:262138:262172:+ 0 1 0 0 3 AGAAGT 0 1 0
chrI 78464 + chrI:78419:78464:+ 0 1 0 0 2 GTGTGG 0 1 0
chrXII 491545 + chrXII:491497:491545:+ 0 1 0 0 1 GTATAT 0 1 1
chrXV 518764 - chrXV:518764:518826:- 0 1 0 0 1 ATATGT 0 1 1
chrXII 366443 + chrXII:366367:366443:+ 0 1 0 0 2 GTATCG 0 1 0
chrVII 188025 - chrVII:188025:188070:- 0 1 0 0 3 GTACCG 0 1 0
chrII 425495 - chrII:425495:425538:- 0 1 0 0 2 CTAAGT 0 1 0
chrVII 772104 + chrVII:772019:772104:+ 0 1 0 0 0 GTATGT 0 1 1
chrXII 982469 - chrXII:982469:982535:- 0 1 0 0 0 GTATGT 0 1 1

chrXVI 685775 + chrXVI:685691:685775:+ 0 1 0 0 1 GTAGGT 0 1 1
chrIV 1021386 + chrIV:1021323:1021386:+ 0 1 0 0 1 ATATGT 0 1 1
chrXIV 224799 + chrXIV:224764:224799:+ 0 1 0 0 3 TTAAGA 0 1 0
chrVII 423867 - chrVII:423867:423911:- 0 1 0 0 1 GTATGA 0 1 1
chrXIII 491153 + chrXIII:491103:491153:+ 0 1 0 0 3 GTACTG 0 1 0
chrXIV 373658 + chrXIV:373596:373658:+ 0 1 0 0 1 GTATGA 0 1 1

chrVII 951678 - chrVII:951678:951729:- 0 1 0 0 2 GTATAC 0 1 0

chrX 712053 - chrX:712053:712106:- 0 1 0 0 1 GTTTGT 0 1 1
chrIV 1407177 + chrIV:1407133:1407177:+ 0 1 0 0 2 GTATAC 0 1 0
chrXV 1030463 - chrXV:1030463:1030507:- 0 1 0 0 4 ATCTAA 0 1 0
chrI 134871 - chrI:134871:134944:- 0 1 0 0 1 GTAAGT 0 1 1
chrXV 1003994 - chrXV:1003994:1004066:- 0 1 0 0 2 GTGAGT 0 1 0
chrXIII 887491 + chrXIII:887427:887491:+ 0 1 0 0 3 CGTTGT 0 1 0

chrV 258657 + chrV:258593:258657:+ 0 1 0 0 2 GTAGGA 0 1 0

chrVII 504217 - chrVII:504217:504268:- 0 1 0 0 1 GTATGA 0 1 1
chrVIII 442922 + chrVIII:442871:442922:+ 0 1 0 0 1 GTATAT 0 1 1

chrIV 235107 + chrIV:235039:235107:+ 0 1 0 0 1 GTATGA 0 1 1

chrV 124638 - chrV:124638:124687:- 0 1 0 0 0 GTATGT 0 1 1

chrXI 313867 + chrXI:313796:313867:+ 0 1 0 0 2 GTAGAT 0 1 0

chrX 209415 + chrX:209358:209415:+ 0 1 0 0 1 GTATGG 0 1 1

chrXV 437796 - chrXV:437796:437846:- 0 1 0 0 1 GTATAT 0 1 1

chrXII 918954 - chrXII:918954:919021:- 0 1 0 0 1 ATATGT 0 1 1

chrII 431594 - chrII:431594:431650:- 0 1 0 0 1 GTAGGT 0 1 1

chrII 419959 + chrII:419915:419959:+ 0 1 0 0 2 TTATGG 0 1 0

chrVII 345543 - chrVII:345543:345614:- 0 1 0 0 0 GTATGT 0 1 1

chrVI 131142 - chrVI:131142:131201:- 0 1 0 0 2 GTTAGT 0 1 0

chrVII 380660 - chrVII:380660:380732:- 0 1 0 0 1 GTATTT 0 1 1

chrVII 167430 + chrVII:167361:167430:+ 0 1 0 1 0 GTATGT 0 1 1

chrII 342756 + chrII:342697:342756:+ 0 1 0 0 1 GTGTGT 0 1 1

chrVII 875660 + chrVII:875581:875660:+ 0 1 0 0 1 GCATGT 0 1 1
chrXIV 598312 + chrXIV:598259:598312:+ 0 1 0 0 2 GTTTGA 0 1 0
chrXVI 339245 + chrXVI:339174:339245:+ 0 1 0 0 1 GTATAT 0 1 1
chrXIII 647112 + chrXIII:647059:647112:+ 0 1 0 0 1 GCATGT 0 1 1

chrI 30552 - chrI:30552:30626:- 0 1 0 0 3 AAGTGT 0 1 0

chrXI 633804 - chrXI:633804:633874:- 0 1 0 0 1 GTCTGT 0 1 1

chrII 206284 + chrII:206192:206284:+ 0 1 0 0 0 GTATGT 0 1 1

chrIV 653423 + chrIV:653365:653423:+ 0 1 0 0 3 TTACGA 0 1 0
chrXVI 196570 - chrXVI:196570:196616:- 0 1 0 0 1 GTATGG 0 1 1
chrV 362830 + chrV:362729:362830:+ 0 1 0 1 0 GTATGT 0 1 1
chrVIII 491866 - chrVIII:491866:491919:- 0 1 0 0 0 GTATGT 0 1 1

chrVII 937274 + chrVII:937216:937274:+ 0 1 0 0 2 GTAAGA 0 1 0

chrVI 103130 - chrVI:103130:103173:- 0 1 0 0 2 TTTTGT 0 1 0

chrXI 225774 - chrXI:225774:225852:- 0 1 0 0 1 GTAAGT 0 1 1
chrXVI 579895 - chrXVI:579895:579965:- 0 1 0 0 1 GTATGA 0 1 1

chrX 345273 + chrX:345230:345273:+ 0 1 0 0 3 GTAGAG 0 1 0

chrII 792451 - chrII:792451:792495:- 0 1 0 0 2 GTACTT 0 1 0
chrXII 827561 - chrXII:827561:827606:- 0 1 0 0 4 CGACTT 0 1 0
chrXII 988409 + chrXII:988329:988409:+ 0 1 0 0 2 GTAAGC 0 1 0
chrIX 363683 + chrIX:363652:363683:+ 0 1 0 0 4 TAACGC 0 1 0
chrIV 188133 - chrIV:188133:188194:- 0 1 0 0 1 GTATGA 0 1 1
chrIX 54026 + chrIX:53941:54026:+ 0 1 0 0 0 GTATGT 0 1 1
chrX 422631 + chrX:422549:422631:+ 0 1 0 0 1 GTATGA 0 1 1
chrXVI 513415 - chrXVI:513415:513466:- 0 1 0 0 1 GTATGG 0 1 1
chrIV 1181080 + chrIV:1181036:1181080:+ 0 1 0 0 3 AGATCT 0 1 0
chrVII 59556 + chrVII:59483:59556:+ 0 1 0 0 1 GTATGG 0 1 1
chrX 432391 - chrX:432391:432464:- 0 1 0 0 2 GTATAC 0 1 0
chrVII 1083747 - chrVII:1083747:1083798:- 0 1 0 0 2 GTATAA 0 1 0
chrI 142343 + chrI:142256:142343:+ 0 1 0 1 0 GTATGT 0 1 1
chrXV 630885 - chrXV:630885:630936:- 0 1 0 0 2 GTATAG 0 1 0
chrXII 382366 + chrXII:382302:382366:+ 0 1 0 0 0 GTATGT 0 1 1
chrIII 27689 - chrIII:27689:27745:- 0 1 0 0 3 GTAATG 0 1 0
chrX 328112 - chrX:328112:328177:- 0 1 0 0 2 GTAAGC 0 1 0
chrVII 16566 + chrVII:16513:16566:+ 0 1 0 0 4 ATTAAT 0 1 0
chrXIII 273520 - chrXIII:273520:273577:- 0 1 0 0 1 GTGTGT 0 1 1
chrV 423913 + chrV:423821:423913:+ 0 1 0 0 3 GTAGAG 0 1 0
chrXV 506801 - chrXV:506801:506870:- 0 1 0 0 1 GTATTT 0 1 1
chrVIII 372283 + chrVIII:372191:372283:+ 0 1 0 0 1 GTTTGT 0 1 1
chrII 519067 - chrII:519067:519131:- 0 1 0 0 1 GTAAGT 0 1 1
chrIV 392681 + chrIV:392591:392681:+ 0 1 0 0 0 GTATGT 0 1 1
chrI 31425 - chrI:31425:31464:- 0 1 0 0 3 ATAACT 0 1 0
chrV 561245 + chrV:561200:561245:+ 0 1 0 0 0 GTATGT 0 1 1
chrXI 465717 - chrXI:465717:465764:- 0 1 0 0 2 GTAGGA 0 1 0
chrIV 721691 - chrIV:721691:721739:- 0 1 0 0 1 GTTTGT 0 1 1
chrXIV 724878 + chrXIV:724794:724878:+ 0 1 0 0 3 GGTAGT 0 1 0
chrVII 1017560 - chrVII:1017560:1017603:- 0 1 0 0 5 CCACCA 0 1 0
chrIV 1490517 - chrIV:1490517:1490571:- 0 1 0 0 0 GTATGT 0 1 1
chrX 455748 + chrX:455690:455748:+ 0 1 0 0 1 GTACGT 0 1 1
chrIX 134040 + chrIX:133971:134040:+ 0 1 0 0 1 GTAAGT 0 1 1
chrX 31820 + chrX:31768:31820:+ 0 1 0 0 0 GTATGT 0 1 1
chrII 89157 - chrII:89157:89219:- 0 1 0 0 4 CAAAGC 0 1 0
chrVIII 335607 - chrVIII:335607:335675:- 0 1 0 0 1 GTACGT 0 1 1
chrIV 273737 + chrIV:273662:273737:+ 0 1 0 0 2 GTCCGT 0 1 0
chrXV 418633 + chrXV:418565:418633:+ 0 1 0 0 1 GTATGA 0 1 1
chrXII 316897 + chrXII:316853:316897:+ 0 1 0 0 1 GTATGA 0 1 1
chrX 349225 + chrX:349181:349225:+ 0 1 0 0 0 GTATGT 0 1 1
chrXIII 242968 - chrXIII:242968:243056:- 0 1 0 0 1 GTAAGT 0 1 1
chrIV 248590 + chrIV:248513:248590:+ 0 1 0 0 2 GTCAGT 0 1 0
chrVII 682969 - chrVII:682969:683069:- 0 1 0 0 0 GTATGT 0 1 1
chrXV 87614 + chrXV:87554:87614:+ 0 1 0 0 1 GTATCT 0 1 1
chrII 331391 - chrII:331391:331438:- 0 1 0 0 1 GTATGA 0 1 1
chrIV 1519505 - chrIV:1519505:1519576:- 0 1 0 0 1 GTATGC 0 1 1
chrIV 540585 + chrIV:540538:540585:+ 0 1 0 0 1 GTAAGT 0 1 1

chrVII 166586 + chrVII:166543:166586:+ 0 1 0 0 3 ATACGC 0 1 0
chrXIV 429575 - chrXIV:429575:429637:- 0 1 0 0 0 GTATGT 0 1 1
chrXIII 204536 - chrXIII:204536:204588:- 0 1 0 0 1 GTATGA 0 1 1
chrIV 122159 + chrIV:122079:122159:+ 0 1 0 1 0 GTATGT 0 1 1
chrXVI 339305 + chrXVI:339230:339305:+ 0 1 0 0 1 GTAAGT 0 1 1
chrIX 225116 + chrIX:225035:225116:+ 0 1 0 0 1 GTATGA 0 1 1
chrXIV 429599 - chrXIV:429599:429637:- 0 1 0 0 0 GTATGT 0 1 1
chrXI 285868 + chrXI:285816:285868:+ 0 1 0 0 1 GTATGA 0 1 1
chrIV 1081204 + chrIV:1081142:1081204:+ 0 1 0 0 1 GTATGA 0 1 1
chrVII 1052638 - chrVII:1052638:1052724:- 0 1 0 0 1 GTATGG 0 1 1
chrIV 1080238 - chrIV:1080238:1080322:- 0 1 0 0 1 GTATAT 0 1 1
chrXIV 369425 - chrXIV:369425:369506:- 0 1 0 0 0 GTATGT 0 1 1
chrIII 17806 + chrIII:17761:17806:+ 0 1 0 0 2 GTAGGC 0 1 0
chrIV 1266848 + chrIV:1266789:1266848:+ 0 1 0 0 2 GTTAGT 0 1 0
chrIV 451995 - chrIV:451995:452077:- 0 1 0 0 2 GTCGGT 0 1 0
chrXIII 649541 - chrXIII:649541:649630:- 0 1 0 0 1 GTATGG 0 1 1
chrXI 93384 - chrXI:93384:93470:- 0 1 0 1 1 GTACGT 0 1 1
chrXIII 552476 + chrXIII:552425:552476:+ 0 1 0 0 3 ATTTTT 0 1 0
chrXIII 894575 - chrXIII:894575:894616:- 0 1 0 0 4 CAATCA 0 1 0

chrVII 383554 + chrVII:383489:383554:+ 0 1 0 0 0 GTATGT 0 1 1

chrIV 698041 - chrIV:698041:698092:- 0 1 0 0 1 GTACGT 0 1 1
chrVII 1002936 + chrVII:1002886:1002936:+ 0 1 0 0 1 GCATGT 0 1 1
chrVIII 400568 - chrVIII:400568:400651:- 0 1 0 0 0 GTATGT 0 1 1
chrIV 299959 - chrIV:299959:300001:- 0 1 0 0 2 GTTTGG 0 1 0
chrXVI 800193 + chrXVI:800147:800193:+ 0 1 0 0 5 AACTTC 0 1 0
chrXII 155741 - chrXII:155741:155813:- 0 1 0 0 1 GTAAGT 0 1 1
chrXIII 832412 + chrXIII:832362:832412:+ 0 1 0 0 2 GTAAAT 0 1 0
chrXIV 64388 - chrXIV:64388:64425:- 0 1 0 0 3 TTTTCT 0 1 0
chrIV 1114334 + chrIV:1114289:1114334:+ 0 1 0 0 0 GTATGT 0 1 1
chrXIII 114128 - chrXIII:114128:114189:- 0 1 0 0 3 TTAAGA 0 1 0

chrX 165022 + chrX:164962:165022:+ 0 1 0 0 2 GTGGGT 0 1 0

chrII 341038 - chrII:341038:341094:- 0 1 0 0 0 GTATGT 0 1 1

chrXII 898623 + chrXII:898549:898623:+ 0 1 0 0 2 GTATAA 0 1 0

chrVII 607018 - chrVII:607018:607091:- 0 1 0 0 1 GTATGG 0 1 1

chrXV 819159 + chrXV:819077:819159:+ 0 1 0 0 1 GTCTGT 0 1 1
chrXIII 559830 + chrXIII:559782:559830:+ 0 1 0 0 0 GTATGT 0 1 1
chrXVI 745391 + chrXVI:745342:745391:+ 0 1 0 0 2 TAATGT 0 1 0
chrXIV 611579 + chrXIV:611503:611579:+ 0 1 0 0 0 GTATGT 0 1 1

chrIV 885148 + chrIV:885101:885148:+ 0 1 0 0 2 GTATCA 0 1 0

chrVI 96258 + chrVI:96184:96258:+ 0 1 0 0 1 GTATTT 0 1 1
chrXVI 937231 - chrXVI:937231:937285:- 0 1 0 0 2 GTATAC 0 1 0
chrVII 414313 + chrVII:414250:414313:+ 0 1 0 0 1 GTATGG 0 1 1
chrXVI 617122 + chrXVI:617066:617122:+ 0 1 0 0 1 GTAAGT 0 1 1
chrXIV 561150 - chrXIV:561150:561218:- 0 1 0 0 3 AAATAT 0 1 0

chrXII 791911 - chrXII:791911:791964:- 0 1 0 0 0 GTATGT 0 1 1
chrXI 191089 - chrXI:191089:191173:- 0 1 0 0 1 GTAAGT 0 1 1
chrXVI 115269 + chrXVI:115219:115269:+ 0 1 0 1 0 GTATGT 0 1 1
chrXVI 841159 + chrXVI:841109:841159:+ 0 1 0 0 2 GAAAGT 0 1 0
chrVII 962303 + chrVII:962229:962303:+ 0 1 0 0 0 GTATGT 0 1 1
chrXII 90638 + chrXII:90568:90638:+ 0 1 0 0 1 GTATGC 0 1 1
chrXVI 593081 - chrXVI:593081:593139:- 0 1 0 0 2 GTATAA 0 1 0
chrXIV 250194 - chrXIV:250194:250238:- 0 1 0 0 1 GTATAT 0 1 1
chrIV 676357 + chrIV:676270:676357:+ 0 1 0 0 1 GTACGT 0 1 1
chrVIII 107814 + chrVIII:107766:107814:+ 0 1 0 0 2 GCAAGT 0 1 0
chrIV 657890 + chrIV:657846:657890:+ 0 1 0 0 2 GTATCA 0 1 0
chrIV 331277 + chrIV:331190:331277:+ 0 1 0 0 1 GTAAGT 0 1 1
chrXV 527237 + chrXV:527186:527237:+ 0 1 0 0 1 GTATGA 0 1 1
chrXIII 56618 + chrXIII:56557:56618:+ 0 1 0 0 1 GTATTT 0 1 1
chrXII 333652 + chrXII:333603:333652:+ 0 1 0 0 3 GTAAAG 0 1 0
chrXVI 739749 - chrXVI:739749:739794:- 0 1 0 0 2 GTAAGC 0 1 0
chrXII 331859 - chrXII:331859:331922:- 0 1 0 0 0 GTATGT 0 1 1
chrXIV 174521 - chrXIV:174521:174572:- 0 1 0 0 2 GTATCG 0 1 0
chrVIII 97941 - chrVIII:97941:98036:- 0 1 0 0 1 GTACGT 0 1 1
chrVII 427172 - chrVII:427172:427221:- 0 1 0 0 0 GTATGT 0 1 1
chrXV 836398 - chrXV:836398:836448:- 0 1 0 0 2 ATATTT 0 1 0
chrXIII 38609 - chrXIII:38609:38680:- 0 1 0 0 2 GTATCG 0 1 0
chrXIV 17324 - chrXIV:17324:17361:- 0 1 0 0 3 CCAAGT 0 1 0
chrXIV 324785 - chrXIV:324785:324827:- 0 1 0 0 5 AGGTAG 0 1 0
chrVII 878175 - chrVII:878175:878242:- 0 1 0 0 3 TTAAGA 0 1 0
chrXII 928301 - chrXII:928301:928370:- 0 1 0 0 1 GTATAT 0 1 1
chrXI 552231 + chrXI:552176:552231:+ 0 1 0 0 2 GTAAAT 0 1 0
chrX 506239 + chrX:506193:506239:+ 0 1 0 0 1 GTATGA 0 1 1
chrIV 130285 - chrIV:130285:130359:- 0 1 0 0 2 GTATCG 0 1 0
chrV 7557 - chrV:7557:7625:- 0 1 0 0 2 GTATAA 0 1 0
chrV 517901 + chrV:517847:517901:+ 0 1 0 0 0 GTATGT 0 1 1
chrXII 907075 + chrXII:907032:907075:+ 0 1 0 0 3 GTCAGG 0 1 0
chrI 157809 - chrI:157809:157862:- 0 1 0 0 2 GTATAC 0 1 0
chrXI 166500 + chrXI:166452:166500:+ 0 1 0 0 3 GTACTC 0 1 0
chrXV 605429 + chrXV:605379:605429:+ 0 1 0 0 1 GTACGT 0 1 1
chrXVI 717057 - chrXVI:717057:717144:- 0 1 0 0 1 GTATGC 0 1 1
chrII 76231 + chrII:76189:76231:+ 0 1 0 0 1 GTCTGT 0 1 1
chrXIII 910098 - chrXIII:910098:910130:- 0 1 0 0 2 CGATGT 0 1 0
chrXI 649449 + chrXI:649382:649449:+ 0 1 0 0 1 GTATGA 0 1 1
chrXIII 551142 - chrXIII:551142:551202:- 0 1 0 1 0 GTATGT 0 1 1
chrIV 232837 + chrIV:232747:232837:+ 0 1 0 0 1 GTATAT 0 1 1
chrXIII 653956 - chrXIII:653956:654018:- 0 1 0 0 0 GTATGT 0 1 1
chrXVI 590144 - chrXVI:590144:590200:- 0 1 0 0 1 ATATGT 0 1 1
chrVII 51958 + chrVII:51882:51958:+ 0 1 0 0 0 GTATGT 0 1 1
chrIII 221009 + chrIII:220979:221009:+ 0 1 0 0 4 TATTCT 0 1 0
chrXII 50297 + chrXII:50224:50297:+ 0 1 0 0 0 GTATGT 0 1 1

chrXIII 637453 + chrXIII:637403:637453:+ 0 1 0 0 1 GTGTGT 0 1 1
chrXIV 273231 + chrXIV:273189:273231:+ 0 1 0 0 1 GTACGT 0 1 1

chrVII 77520 + chrVII:77467:77520:+ 0 1 0 0 2 GTATCA 0 1 0

chrV 61194 + chrV:61112:61194:+ 0 1 0 0 1 GTACGT 0 1 1

chrVI 176327 - chrVI:176327:176385:- 0 1 0 0 1 GTATGG 0 1 1
chrXVI 883587 - chrXVI:883587:883649:- 0 1 0 0 1 GTATAT 0 1 1
chrIX 40245 + chrIX:40178:40245:+ 0 1 0 0 3 TCAAGT 0 1 0
chrXIV 273584 - chrXIV:273584:273650:- 0 1 0 0 2 GTATAA 0 1 0

chrVII 148659 - chrVII:148659:148721:- 0 1 0 0 2 TTACGT 0 1 0

chrXI 193010 - chrXI:193010:193071:- 0 1 0 0 0 GTATGT 0 1 1

chrVII 140633 + chrVII:140554:140633:+ 0 1 0 0 1 GTATCT 0 1 1

chrIV 929253 + chrIV:929206:929253:+ 0 1 0 0 1 GTATGA 0 1 1

chrVII 253205 - chrVII:253205:253253:- 0 1 0 0 1 GTATGA 0 1 1

chrXI 408146 + chrXI:408101:408146:+ 0 1 0 0 1 GTAAGT 0 1 1

chrXV 720499 + chrXV:720444:720499:+ 0 1 0 0 0 GTATGT 0 1 1

chrXI 446830 - chrXI:446830:446885:- 0 1 0 0 1 GTATGC 0 1 1

chrIV 381064 - chrIV:381064:381143:- 0 1 0 0 1 GTATCT 0 1 1

chrVII 293422 + chrVII:293335:293422:+ 0 1 0 0 1 GTATTT 0 1 1

chrXI 158626 - chrXI:158626:158679:- 0 1 0 0 1 ATATGT 0 1 1

chrII 167927 + chrII:167855:167927:+ 0 1 0 0 1 GTAAGT 0 1 1
chrXIV 217113 - chrXIV:217113:217193:- 0 1 0 0 0 GTATGT 0 1 1

chrVI 137938 + chrVI:137848:137938:+ 0 1 0 0 4 ATGGAT 0 1 0

chrV 201997 + chrV:201953:201997:+ 0 1 0 0 0 GTATGT 0 1 1

chrIV 333709 + chrIV:333639:333709:+ 0 1 0 0 1 GTATGC 0 1 1

chrV 435437 - chrV:435437:435498:- 0 1 0 0 2 GTAAAT 0 1 0

chrVII 35287 - chrVII:35287:35351:- 0 1 0 0 2 GTAAGG 0 1 0

chrXII 823170 + chrXII:823108:823170:+ 0 1 0 0 1 GTAAGT 0 1 1

chrII 756982 - chrII:756982:757037:- 0 1 0 0 1 GTAGGT 0 1 1
chrXVI 560470 + chrXVI:560413:560470:+ 0 1 0 0 1 GTAAGT 0 1 1

chrXII 987203 + chrXII:987139:987203:+ 0 1 0 1 0 GTATGT 0 1 1

chrX 237005 + chrX:236962:237005:+ 0 1 0 0 1 GTATGC 0 1 1

chrV 487859 + chrV:487816:487859:+ 0 1 0 0 1 GTATGA 0 1 1

chrX 633064 - chrX:633064:633144:- 0 1 0 0 2 GTAAGC 0 1 0

chrXV 249913 + chrXV:249872:249913:+ 0 1 0 0 3 ATCGGT 0 1 0

chrVI 268722 + chrVI:268674:268722:+ 0 1 0 0 1 GTATGC 0 1 1

chrX 526803 - chrX:526803:526879:- 0 1 0 0 2 TAATGT 0 1 0
chrXVI 616651 + chrXVI:616593:616651:+ 0 1 0 0 0 GTATGT 0 1 1

chrVII 561205 - chrVII:561205:561267:- 0 1 0 0 2 GGTTGT 0 1 0

chrI 126071 + chrI:125985:126071:+ 0 1 0 0 1 GTATGG 0 1 1

chrXII 75835 - chrXII:75835:75882:- 0 1 0 0 3 GTAAAC 0 1 0
chrXIII 845438 - chrXIII:845438:845541:- 0 1 0 0 1 TTATGT 0 1 1

chrIV 456688 - chrIV:456688:456757:- 0 1 0 0 1 GTAAGT 0 1 1

chrXI 166478 + chrXI:166405:166478:+ 0 1 0 1 1 GTACGT 0 1 1
chrXIV 349280 - chrXIV:349280:349355:- 0 1 0 0 1 GTATGG 0 1 1

chrVI 32028 - chrVI:32028:32077:- 0 1 0 0 1 GTAAGT 0 1 1

chrIX 127053 - chrIX:127053:127135:- 0 1 0 0 1 GTATAT 0 1 1

chrIII 84709 + chrIII:84665:84709:+ 0 1 0 0 1 GTCTGT 0 1 1
chrXVI 159662 + chrXVI:159585:159662:+ 0 1 0 0 1 GTAAGT 0 1 1
chrVII 574777 + chrVII:574705:574777:+ 0 1 0 0 1 GTATTT 0 1 1
chrXIII 810474 - chrXIII:810474:810549:- 0 1 0 0 1 GTATGA 0 1 1
chrXIV 227320 - chrXIV:227320:227353:- 0 1 0 0 4 GTGATA 0 1 0
chrIV 104615 + chrIV:104542:104615:+ 0 1 0 0 1 GTATGA 0 1 1
chrXV 845022 + chrXV:844959:845022:+ 0 1 0 0 0 GTATGT 0 1 1
chrIII 169189 - chrIII:169189:169268:- 0 1 0 0 1 GTATAT 0 1 1
chrXVI 584023 - chrXVI:584023:584078:- 0 1 0 0 2 GTATTG 0 1 0
chrX 71475 + chrX:71426:71475:+ 0 1 0 0 2 GTTGGT 0 1 0
chrVII 658296 + chrVII:658204:658296:+ 0 1 0 0 1 GTTTGT 0 1 1
chrXIV 359077 - chrXIV:359077:359147:- 0 1 0 0 1 GTATAT 0 1 1
chrV 203928 + chrV:203855:203928:+ 0 1 0 0 1 GTATGG 0 1 1
chrXIV 69244 + chrXIV:69184:69244:+ 0 1 0 0 0 GTATGT 0 1 1
chrXVI 490809 - chrXVI:490809:490857:- 0 1 0 0 1 GTAAGT 0 1 1
chrXVI 824941 - chrXVI:824941:825021:- 0 1 0 0 2 GTGCGT 0 1 0
chrXI 272947 - chrXI:272947:272992:- 0 1 0 0 0 GTATGT 0 1 1
chrIII 228656 + chrIII:228610:228656:+ 0 1 0 0 1 GTATGA 0 1 1
chrXI 38909 + chrXI:38829:38909:+ 0 1 0 0 4 AGATAG 0 1 0
chrXVI 101222 + chrXVI:101139:101222:+ 0 1 0 0 1 GTAAGT 0 1 1
chrXIII 301516 - chrXIII:301516:301586:- 0 1 0 0 1 GTAAGT 0 1 1
chrVII 792327 - chrVII:792327:792385:- 0 1 0 0 1 GTATCT 0 1 1
chrXV 733355 - chrXV:733355:733417:- 0 1 0 0 1 GCATGT 0 1 1
chrXVI 231560 + chrXVI:231499:231560:+ 0 1 0 0 2 GTAAGA 0 1 0
chrXII 783405 - chrXII:783405:783492:- 0 1 0 0 2 TAATGT 0 1 0
chrVII 593472 + chrVII:593395:593472:+ 0 1 0 0 1 GTAAGT 0 1 1
chrIII 290580 + chrIII:290522:290580:+ 0 1 0 0 1 GTATTT 0 1 1
chrVII 262616 - chrVII:262616:262696:- 0 1 0 0 0 GTATGT 0 1 1
chrXIV 427177 + chrXIV:427106:427177:+ 0 1 0 0 1 GTATGA 0 1 1
chrXIV 380702 - chrXIV:380702:380783:- 0 1 0 1 0 GTATGT 0 1 1
chrII 49795 + chrII:49765:49795:+ 0 1 0 0 3 GAACGC 0 1 0
chrIII 119534 - chrIII:119534:119586:- 0 1 0 0 3 GTACTG 0 1 0
chrXII 522715 + chrXII:522672:522715:+ 0 1 0 1 0 GTATGT 0 1 1
chrV 160514 + chrV:160465:160514:+ 0 1 0 0 1 GTATAT 0 1 1
chrXIV 394037 + chrXIV:393996:394037:+ 0 1 0 0 1 GTATGC 0 1 1
chrII 306857 - chrII:306857:306902:- 0 1 0 0 2 GTATCA 0 1 0
chrVII 914790 - chrVII:914790:914879:- 0 1 0 0 1 GTAAGT 0 1 1
chrXIV 767744 - chrXIV:767744:767788:- 0 1 0 0 2 GTACGC 0 1 0
chrIV 768400 - chrIV:768400:768453:- 0 1 0 0 1 ATATGT 0 1 1
chrII 45935 + chrII:45875:45935:+ 0 1 0 0 2 GCATGC 0 1 0
chrVIII 115532 + chrVIII:115489:115532:+ 0 1 0 0 2 GTATAA 0 1 0
chrIII 120379 + chrIII:120290:120379:+ 0 1 0 0 1 GTAAGT 0 1 1
chrXII 965102 + chrXII:965038:965102:+ 0 1 0 0 1 GTATAT 0 1 1
chrIV 1333316 - chrIV:1333316:1333401:- 0 1 0 0 1 GTATGC 0 1 1
chrIX 261766 + chrIX:261723:261766:+ 0 1 0 0 1 GTATGA 0 1 1

chrXI 490071 + chrXI:489994:490071:+ 0 1 0 0 0 GTATGT 0 1 1

chrXV 139434 + chrXV:139377:139434:+ 0 1 0 0 1 GTAAGT 0 1 1

chrIX 334600 - chrIX:334600:334643:- 0 1 0 0 1 GTTTGT 0 1 1

chrX 119609 - chrX:119609:119650:- 0 1 0 0 3 GAGTTT 0 1 0
chrXIV 726914 - chrXIV:726914:726978:- 0 1 0 0 1 GTCTGT 0 1 1
chrXIV 401581 + chrXIV:401504:401581:+ 0 1 0 0 2 GTATCG 0 1 0

chrXII 380684 + chrXII:380627:380684:+ 0 1 0 0 0 GTATGT 0 1 1

chrXV 754237 - chrXV:754237:754290:- 0 1 0 0 1 GTAAGT 0 1 1
chrIV 1111887 + chrIV:1111836:1111887:+ 0 1 0 0 0 GTATGT 0 1 1
chrXIV 763546 + chrXIV:763498:763546:+ 0 1 0 0 3 GTACAG 0 1 0
chrXVI 939906 - chrXVI:939906:939951:- 0 1 0 0 3 GTACCG 0 1 0
chrXV 1059709 - chrXV:1059709:1059794:- 0 1 0 0 1 GTAAGT 0 1 1

chrV 374671 - chrV:374671:374721:- 0 1 0 0 1 GTTTGT 0 1 1

chrX 649521 - chrX:649521:649569:- 0 1 0 0 1 GCATGT 0 1 1

chrX 570580 - chrX:570580:570633:- 0 1 0 0 3 GTAGTC 0 1 0
chrXVI 481183 - chrXVI:481183:481238:- 0 1 0 0 1 GTACGT 0 1 1
chrXIV 545602 + chrXIV:545556:545602:+ 0 1 0 0 2 GTAACT 0 1 0

chrVII 314262 - chrVII:314262:314298:- 0 1 0 0 2 ATATAT 0 1 0

chrXII 232370 + chrXII:232297:232370:+ 0 1 0 0 1 GTAAGT 0 1 1

chrXV 674437 - chrXV:674437:674504:- 0 1 0 0 1 GTATGA 0 1 1
chrIV 1355140 - chrIV:1355140:1355228:- 0 1 0 0 3 GTAGAG 0 1 0

chrV 305684 + chrV:305624:305684:+ 0 1 0 0 0 GTATGT 0 1 1

chrIX 360694 - chrIX:360694:360737:- 0 1 0 0 1 GCATGT 0 1 1
chrXIII 112660 - chrXIII:112660:112715:- 0 1 0 0 1 GTATGA 0 1 1
chrIV 1453412 + chrIV:1453370:1453412:+ 0 1 0 0 2 GTTGGT 0 1 0

chrVII 730063 + chrVII:730017:730063:+ 0 1 0 0 1 GTATGG 0 1 1

chrVII 594140 - chrVII:594140:594202:- 0 1 0 0 0 GTATGT 0 1 1

chrXII 987240 + chrXII:987139:987240:+ 0 1 0 1 0 GTATGT 0 1 1

chrVI 109969 - chrVI:109969:110026:- 0 1 0 0 1 GTAAGT 0 1 1

chrVII 293817 + chrVII:293737:293817:+ 0 1 0 0 1 GTATGG 0 1 1

chrIV 130328 + chrIV:130289:130328:+ 0 1 0 0 1 GTATGC 0 1 1
chrXIV 573917 - chrXIV:573917:573973:- 0 1 0 0 2 GTTTGA 0 1 0

chrX 535286 - chrX:535286:535343:- 0 1 0 0 1 GTAAGT 0 1 1

chrIV 247245 - chrIV:247245:247300:- 0 1 0 0 1 GTATAT 0 1 1

chrII 290470 - chrII:290470:290528:- 0 1 0 0 0 GTATGT 0 1 1

chrVII 34393 - chrVII:34393:34444:- 0 1 0 0 4 ACCGGT 0 1 0

chrX 264007 - chrX:264007:264058:- 0 1 0 0 2 GTGCGT 0 1 0
chrXIII 493957 + chrXIII:493909:493957:+ 0 1 0 0 1 GTTTGT 0 1 1
chrI 152003 - chrI:152003:152055:- 0 1 0 0 2 GTATAC 0 1 0
chrXIII 701813 + chrXIII:701740:701813:+ 0 1 0 0 1 GTAGGT 0 1 1
chrIX 128118 + chrIX:128061:128118:+ 0 1 0 0 2 TTTTGT 0 1 0
chrXIII 727505 + chrXIII:727436:727505:+ 0 1 0 0 6 TCCCAA 0 1 0

chrII 372030 + chrII:371971:372030:+ 0 1 0 0 2 GAACGT 0 1 0

chrVII 436364 + chrVII:436318:436364:+ 0 1 0 0 0 GTATGT 0 1 1

totals 198 430 268


Table II-S3. GTATGT motif frequency at 5'SS and generally in introns.

No. mismatches from /GTATGT   0    1    2    3    4    5    6    total   0 or 1 mut   % 0 or 1 mut
Annotated 5'SS                216  73   6    18   7    7    3    330     289          87.58
Annotated 5'SS, no chrM       216  73   6    1    2    0    0    298     289          96.98
Arbitrary intron positions    10   24   130  471  982  973  390  2980    34           1.1
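The mismatch counts in Table II-S3 can be illustrated with a minimal Python sketch. It assumes that "mismatches" means the Hamming distance between a 6-nt sequence and GTATGT, and that the last column is the percentage of sites with 0 or 1 mismatches; the function names are hypothetical and are not taken from the analysis code used for this thesis.

```python
CONSENSUS = "GTATGT"  # reference 5'SS hexamer named in Table II-S3


def mismatches(hexamer, consensus=CONSENSUS):
    """Count positions at which hexamer differs from the consensus.

    Assumption: a "mismatch" is a per-position difference (Hamming distance).
    """
    if len(hexamer) != len(consensus):
        raise ValueError("hexamer and consensus must have the same length")
    return sum(a != b for a, b in zip(hexamer.upper(), consensus))


def percent_zero_or_one(counts_by_mismatch):
    """Percent of sites with 0 or 1 mismatches.

    counts_by_mismatch[i] is the number of sites with i mismatches.
    """
    total = sum(counts_by_mismatch)
    return 100.0 * (counts_by_mismatch[0] + counts_by_mismatch[1]) / total


# Counts for 0..6 mismatches, copied from the three rows of Table II-S3.
annotated = [216, 73, 6, 18, 7, 7, 3]
annotated_no_chrM = [216, 73, 6, 1, 2, 0, 0]
arbitrary = [10, 24, 130, 471, 982, 973, 390]

for label, row in [("Annotated 5'SS", annotated),
                   ("Annotated 5'SS, no chrM", annotated_no_chrM),
                   ("Arbitrary intron positions", arbitrary)]:
    print(label, sum(row), round(percent_zero_or_one(row), 2))
```

Under these assumptions the row totals (330, 298, 2980) and the percentages agree with the table values to the precision shown.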

Table II-S4. Branch-seq CPMs.

The format of the bp_id is chromosome:bp_nucleotide:strand_of_bp.

Branch_seq_cpm is the counts per million calculated from all data (top, middle, and bottom slices of the gel arc combined).

If the gene_name is “not_in_intron_or_TIF”, the BP did not fall inside a gene according to Pelechano et al. (2013).

The bp_type is “annotated” if the peak fell within 3 nt of an annotated BP location according to Meyer et al. (2011).
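The column definitions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used to build Table II-S4: the function and class names are hypothetical, "within 3 nt" is taken to be inclusive and to require the same chromosome and strand, and every peak that is not "annotated" is given the label "cnBP" simply because that is the only other bp_type value appearing in the table.

```python
from typing import NamedTuple


class BranchPoint(NamedTuple):
    chrom: str
    position: int
    strand: str


def parse_bp_id(bp_id):
    """Split a bp_id of the form chromosome:bp_nucleotide:strand_of_bp."""
    chrom, position, strand = bp_id.split(":")
    if strand not in ("+", "-"):
        raise ValueError("unexpected strand: " + strand)
    return BranchPoint(chrom, int(position), strand)


def counts_per_million(count, total_count):
    """Scale a raw count to counts per million (CPM)."""
    return count * 1_000_000 / total_count


def classify_bp(bp, annotated_bps, window=3):
    """Return "annotated" if bp lies within `window` nt of an annotated BP.

    Assumptions: the window is inclusive and the annotated BP must be on the
    same chromosome and strand; all other peaks are labeled "cnBP".
    """
    for known in annotated_bps:
        if (known.chrom == bp.chrom and known.strand == bp.strand
                and abs(known.position - bp.position) <= window):
            return "annotated"
    return "cnBP"


# Usage with a hypothetical annotated position and two hypothetical peaks.
known_bps = [parse_bp_id("chrVII:497394:-")]
print(classify_bp(parse_bp_id("chrVII:497396:-"), known_bps))  # 2 nt away
print(classify_bp(parse_bp_id("chrVII:497408:-"), known_bps))  # 14 nt away
```

Representing the bp_id as a named tuple keeps the chromosome, position, and strand comparisons explicit, which matters because a position match on the opposite strand should not count as the same branch point under the assumption made here.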

bp_id branch_seq_cpm gene_name bp_type
chrVII:497961:- 476.7266362 YGR001C annotated
chrXIII:123777:- 1.126520145 YML073C annotated
chrV:433162:+ 0.292060778 YER133W annotated
chrV:308040:+ 205.9028487 YER074W-A annotated
chrV:307801:+ 248.1264927 YER074W-A annotated
chrV:269803:- 0.083445937 YER056C-A annotated
chrV:348217:- 221.7575767 YER093C-A annotated
chrV:148254:+ 301.9908448 YEL003W annotated
chrV:239667:- 0.75101343 YER044C-A annotated
chrV:131853:+ 160.3830903 YEL012W annotated
chrV:159013:- 674.4517831 YER003C annotated
chrV:548611:+ 1.376857955 YER179W annotated
chrV:166786:- 251.2557153 YER007C-A annotated
chrV:397225:+ 0.333783747 YER117W annotated
chrII:462255:+ 274.036456 YBR111W-A annotated
chrII:479389:+ 247.9596008 YBR119W annotated
chrII:602175:+ 2.711992942 YBR186W annotated
chrII:606615:+ 0.917905303 YBR191W annotated
chrII:653429:+ 196.306566 YBR215W annotated
chrII:170771:+ 29.95709126 YBL026W annotated
chrII:110932:+ 198.5178833 YBL059W annotated
chrII:333365:+ 7.092904617 YBR048W annotated
chrII:110451:- 375.590161 YBL059C-A annotated
chrII:186375:- 513.1507876 YBL018C annotated
chrII:462472:+ 79.1901939 YBR111W-A annotated
chrII:142787:- 241.9932163 YBL040C annotated
chrII:47071:- 44.47668425 YBL091C-A annotated
chrII:168770:+ 0.166891873 YBL027W annotated
chrII:426524:- 0.458952652 YBR090C annotated
chrII:125234:+ 112.1930619 YBL050W annotated
chrII:366534:- 129.0491411 YBR062C annotated
chrII:726947:- 120.9966082 YBR255C-A annotated
chrII:679973:- 102.6802251 YBR230C annotated
chrII:407047:- 2316.334033 YBR082C annotated
chrVI:64615:- 0.458952652 YFL034C-A annotated
chrVI:221291:- 3.00405372 YFR031C-A annotated
chrVI:242044:+ 349.5967517 YFR045W annotated
chrVI:63915:- 332.3651658 YFL034C-B annotated
chrVI:203304:- 787.2289666 YFR024C-A annotated
chrXIV:62407:- 0.166891873 YNL302C annotated
chrXIV:534915:- 362.1970881 YNL050C annotated
chrXIV:494551:- 3.50472934 YNL069C annotated
chrXIV:623270:+ 0.083445937 YNL004W annotated
chrXIV:351021:+ 438.2163364 YNL147W annotated
chrXIV:545341:+ 1186.726388 YNL044W annotated
chrXIV:48377:+ 87.78512538 YNL312W annotated
chrXIV:331803:+ 0.50067562 YNL162W annotated
chrXIV:185553:+ 134.3062351 YNL246W annotated
chrXIV:145191:- 172.9417038 YNL265C annotated
chrXIV:609837:+ 1.25168905 YNL012W annotated
chrXIV:366096:+ 225.1371371 YNL138W-A annotated
chrXIV:380726:- 537.4335551 YNL130C annotated
chrXIV:415872:+ 0.292060778 YNL112W annotated
chrXIV:557672:+ 334.659929 YNL038W annotated
chrXVI:96189:- 218.8369689 YPL241C annotated
chrXVI:492958:- 403.210766 YPL031C annotated
chrXVI:623662:+ 391.1528281 YPR028W annotated
chrXVI:729414:- 126.3788711 YPR098C annotated
chrXVI:407002:+ 9.679728654 YPL079W annotated
chrXVI:305374:+ 165.9322451 YPL129W annotated
chrXVI:883453:+ 202.1477816 YPR170W-B annotated
chrXVI:678219:- 565.4713899 YPR063C annotated
chrXVI:345483:- 0.083445937 YPL109C annotated
chrXVI:76014:- 0.166891873 YPL249C-A annotated
chrXVI:218711:+ 254.3849379 YPL175W annotated
chrXVI:654534:+ 0.292060778 YPR043W annotated
chrXVI:405426:+ 0.083445937 YPL081W annotated
chrXVI:911328:+ 1047.663735 YPR187W annotated
chrXVI:833779:+ 136.8930591 YPR153W annotated
chrXVI:412958:+ 28.87229409 YPL075W annotated
chrXVI:174036:+ 0.792736398 YPL198W annotated
chrXI:625594:+ 630.6009434 YKR095W-A annotated
chrXI:447376:- 229.267711 YKR004C annotated
chrXI:430163:- 132.1366407 YKL006C-A annotated
chrXI:283016:+ 0.292060778 YKL081W annotated
chrXI:437526:+ 592.6330422 YKL002W annotated
chrXI:449611:- 5.131925105 YKR005C annotated
chrXI:83061:+ 319.4727685 YKL190W annotated

chrXI:618096:- 15.64611313 YKR094C annotated
chrVII:543708:+ 129.6332626 YGR029W annotated
chrVII:946407:+ 1.00135124 YGR225W annotated
chrVII:497394:- 401.9590769 YGR001C annotated
chrVII:157245:- 3.212668562 YGL183C annotated
chrVII:435728:+ 0.166891873 YGL033W annotated
chrVII:346854:- 32.08496265 YGL087C annotated
chrVII:73034:- 19.65151809 YGL226C-A annotated
chrVII:311486:+ 0.667567493 YGL103W annotated
chrVII:556282:+ 0.625844525 YGR034W annotated
chrVII:62173:+ 317.0945594 YGL232W annotated
chrX:702871:- 0.166891873 YJR145C annotated
chrX:387380:- 414.5594134 YJL031C annotated
chrX:435302:+ 207.3214297 YJL001W annotated
chrX:76268:+ 0.083445937 YJL189W annotated
chrX:74178:+ 31.33394922 YJL191W annotated
chrX:365853:+ 441.0117753 YJL041W annotated
chrX:608540:+ 0.792736398 YJR094W-A annotated
chrX:396512:- 329.3193891 YJL024C annotated
chrX:469190:- 15.47922125 YJR021C annotated
chrX:50351:- 77.02059955 YJL205C annotated
chrXV:93868:- 0.083445937 YOL120C annotated
chrXV:240976:- 344.9655022 YOL048C annotated
chrXV:778910:- 0.083445937 YOR234C annotated
chrXV:867516:+ 0.542398588 YOR293W annotated
chrXV:242453:- 19.60979512 YOL047C annotated
chrXV:92476:- 51.44441996 YOL121C annotated
chrXV:900802:- 1.585472797 YOR312C annotated
chrIX:166484:+ 603.3141221 YIL106W annotated
chrIX:232012:- 1.293412018 YIL069C annotated
chrIX:225834:- 0.083445937 YIL073C annotated
chrIX:317136:+ 0.542398588 YIL018W annotated
chrIX:47743:+ 547.1132838 YIL156W-B annotated
chrIX:155276:+ 275.329868 YIL111W annotated
chrIX:348380:- 0.50067562 YIL004C annotated
chrXII:522984:+ 0.292060778 YLR185W annotated
chrXII:766205:- 535.8063594 YLR316C annotated
chrXII:564467:- 31.12533438 YLR211C annotated
chrXII:548706:- 175.1530211 YLR199C annotated
chrXII:1024631:+ 1.919256543 YLR445W annotated
chrXII:398583:+ 89.28715224 YLR128W annotated
chrXII:286498:- 113.0275212 YLR078C annotated
chrXII:786667:+ 10.84797177 YLR329W annotated
chrXII:857038:+ 81.40151122 YLR367W annotated
chrXII:40353:- 2850.179413 YLL050C annotated
chrXII:694444:+ 310.5023304 YLR275W annotated
chrXII:250899:- 1.710641702 YLR054C annotated

chrXII:327295:- 10.51418802 YLR093C annotated
chrXII:744219:+ 171.6900147 YLR306W annotated
chrXII:766086:- 482.1506221 YLR316C annotated
chrXII:987191:+ 193.8031879 YLR426W annotated
chrXII:242652:+ 0.584121557 YLR048W annotated
chrXII:263540:+ 0.166891873 YLR061W annotated
chrXIII:337886:+ 835.6276098 YMR033W annotated
chrXIII:666961:- 225.4291979 YMR201C annotated
chrXIII:223421:- 0.292060778 YML026C annotated
chrXIII:499898:- 0.208614842 YMR116C annotated
chrXIII:732835:+ 0.083445937 YMR230W annotated
chrXIII:651593:+ 0.166891873 YMR194W annotated
chrXIII:854879:+ 223.6768332 YMR292W annotated
chrXIII:425113:+ 0.083445937 YMR079W annotated
chrXIII:82343:+ 217.2932191 YML094W annotated
chrXIII:206162:+ 118.4515071 YML036W annotated
chrXIII:140113:- 99.59272542 YML067C annotated
chrXIII:537527:+ 1.543749828 YMR133W annotated
chrXIII:517842:+ 0.083445937 YMR125W annotated
chrXIII:721231:- 1.460303892 YMR225C annotated
chrXIII:753788:- 0.125168905 YMR242C annotated
chrXIII:652797:- 114.9885007 YMR194C-B annotated
chrXIII:211547:+ 59.9559055 YML034W annotated
chrXIII:99301:- 437.2984311 YML085C annotated
chrXIII:225264:- 634.5646254 YML025C annotated
chrIII:111588:- 118.2428923 YCL002C annotated
chrIII:177952:- 1.919256543 YCR031C annotated
chrIII:173137:- 618.2926678 YCR028C-A annotated
chrIII:101646:- 93.66806391 YCL012C annotated
chrIII:107089:+ 297.6099331 YCL005W-A annotated
chrIII:107255:+ 352.4339135 YCL005W-A annotated
chrIV:1212941:+ 276.3312193 YDR367W annotated
chrIV:1266790:- 144.945592 YDR397C annotated
chrIV:1237581:+ 3.379560435 YDR381W annotated
chrIV:1103871:+ 150.8285305 YDR318W annotated
chrIV:1073346:- 150.9119765 YDR305C annotated
chrIV:1450485:- 1.960979512 YDR500C annotated
chrIV:491873:+ 0.166891873 YDR025W annotated
chrIV:65358:+ 205.9862947 YDL219W annotated
chrIV:1238769:- 150.7868076 YDR381C-A annotated
chrIV:1319751:- 113.277859 YDR424C annotated
chrIV:254999:- 439.1342417 YDL115C annotated
chrIV:267780:+ 101.5537049 YDL108W annotated
chrIV:239421:- 415.2687038 YDL125C annotated
chrIV:217970:+ 0.417229683 YDL136W annotated
chrIV:431423:- 1294.872322 YDL012C annotated
chrIV:337596:+ 331.6975983 YDL064W annotated

chrIV:715265:- 166.4746437 YDR129C annotated
chrIV:579963:+ 20.65286933 YDR064W annotated
chrIV:399468:+ 10.38901912 YDL029W annotated
chrIV:630016:+ 0.709290462 YDR092W annotated
chrIV:733713:- 307.7903374 YDR139C annotated
chrIV:569665:- 153.7491383 YDR059C annotated
chrIV:458048:- 576.6531454 YDR005C annotated
chrIV:307375:- 0.25033781 YDL083C annotated
chrIV:230262:+ 0.083445937 YDL130W annotated
chrIV:1319627:- 157.2538677 YDR424C annotated
chrI:151022:- 100.4271848 YAL001C annotated
chrI:87439:+ 179.6173787 YAL030W annotated
chrVIII:354926:+ 255.67835 YHR123W annotated
chrVIII:129611:+ 89.12026036 YHR012W annotated
chrVIII:187613:- 1389.917244 YHR039C-A annotated
chrVIII:498731:- 30.45776688 YHR199C-A annotated
chrVIII:382356:- 0.083445937 YHR141C annotated
chrVIII:298418:- 198.976836 YHR097C annotated
chrVIII:104751:+ 0.292060778 YHL001W annotated
chrVIII:255689:- 528.1293332 YHR077C annotated
chrVIII:107875:+ 674.0345535 YHR001W-A annotated
chrVIII:315810:- 286.3864547 YHR101C annotated
chrVIII:189778:- 151.7881588 YHR041C annotated
chrVIII:251233:+ 92.41637486 YHR076W annotated
chrVIII:262372:- 0.083445937 YHR079C-A annotated
chrVIII:138274:- 0.083445937 YHR016C annotated
chrV:151105:+ 0.625844525 not_in_intron_or_TIF cnBP
chrV:166806:- 0.625844525 YER007C-A cnBP
chrV:540383:- 1.084797177 YER175C cnBP
chrII:115534:+ 15.47922125 YBL056W cnBP
chrII:170731:+ 24.44965944 YBL026W cnBP
chrII:221024:+ 3.087499657 not_in_intron_or_TIF cnBP
chrII:606346:+ 0.709290462 YBR191W cnBP
chrII:291715:- 1.00135124 YBR025C cnBP
chrII:592709:- 6.425337124 YBR181C cnBP
chrXVI:218695:+ 0.876182335 YPL175W cnBP
chrXVI:335968:+ 1.335134987 YPL114W cnBP
chrXVI:76164:- 0.667567493 YPL249C-A cnBP
chrXVI:281450:- 0.667567493 snR17b cnBP
chrXVI:602284:- 4.839864327 not_in_intron_or_TIF cnBP
chrXVI:729395:- 1.460303892 YPR098C cnBP
chrXVI:777582:- 0.834459367 not_in_intron_or_TIF cnBP
chrXVI:937537:- 1.25168905 not_in_intron_or_TIF cnBP
chrXI:96757:+ 2.795438878 YKL184W cnBP
chrXI:468921:- 0.917905303 YKR015C cnBP
chrVII:439462:+ 5.25709401 YGL030W cnBP
chrVII:543691:+ 2.419932163 YGR029W cnBP

chrVII:555885:+ 0.542398588 YGR034W cnBP
chrVII:143935:- 6.174999314 not_in_intron_or_TIF cnBP
chrVII:497408:- 1.543749828 YGR001C cnBP
chrVII:859434:- 5.799492599 YGR183C cnBP
chrXIII:182669:+ 0.959628272 YML046W cnBP
chrXIII:425094:+ 0.375506715 YMR079W cnBP
chrXIII:652808:- 3.50472934 YMR194C-B cnBP
chrXV:505980:+ 4.965033232 YOR096W cnBP
chrXV:930113:+ 2.086148417 YOR326W cnBP
chrXV:117635:- 0.792736398 not_in_intron_or_TIF cnBP
chrXV:349520:- 1.209966082 not_in_intron_or_TIF cnBP
chrIX:155301:+ 1.084797177 YIL111W cnBP
chrIX:166501:+ 0.709290462 YIL106W cnBP
chrXII:382387:+ 3.713344182 YLR116W cnBP
chrXII:286519:- 3.212668562 YLR078C cnBP
chrX:435281:+ 1.376857955 YJL001W cnBP
chrIV:107156:+ 1.710641702 YDL195W cnBP
chrIV:392638:+ 0.50067562 not_in_intron_or_TIF cnBP
chrIV:438275:+ 1.75236467 YDL007W cnBP
chrIV:655244:+ 1.960979512 YDR100W cnBP
chrIV:1145148:+ 30.37432095 YDR336W cnBP
chrIV:22303:- 3.75506715 not_in_intron_or_TIF cnBP
chrIV:451430:- 4.130573865 YDR001C cnBP
chrVIII:129590:+ 0.75101343 YHR012W cnBP
chrVIII:406673:- 0.959628272 not_in_intron_or_TIF cnBP
chrV:61194:+ 0.166891873 not_in_intron_or_TIF cnBP
chrV:124638:- 0.667567493 YEL016C cnBP
chrV:160514:+ 0.375506715 YER004W, YER005W cnBP
chrV:201997:+ 46.43766376 YER023W cnBP
chrV:203928:+ 0.333783747 YER024W cnBP
chrV:305684:+ 9.26249897 YER073W cnBP
chrV:336936:+ 1.543749828 YER088W-B cnBP
chrV:362830:+ 1.75236467 YER102W cnBP
chrV:374671:- 2.461655132 YER107C cnBP
chrV:487859:+ 0.625844525 not_in_intron_or_TIF cnBP
chrV:517901:+ 395.3668479 YER167W cnBP
chrV:561245:+ 0.083445937 not_in_intron_or_TIF cnBP
chrII:13949:- 0.959628272 not_in_intron_or_TIF cnBP
chrII:76231:+ 0.75101343 not_in_intron_or_TIF cnBP
chrII:167927:+ 0.166891873 not_in_intron_or_TIF cnBP
chrII:206284:+ 0.125168905 not_in_intron_or_TIF cnBP
chrII:290470:- 0.083445937 YBR025C cnBP
chrII:331391:- 5.090202137 YBR046C cnBP
chrII:341038:- 0.75101343 not_in_intron_or_TIF cnBP
chrII:342756:+ 5.50743182 not_in_intron_or_TIF cnBP
chrII:431594:- 0.50067562 not_in_intron_or_TIF cnBP
chrII:443734:- 32.29357749 YBR101C cnBP

158 chrII:519067:- 0.292060778 not_in_intron_or_TIF cnBP chrII:756982:- 0.25033781 not_in_intron_or_TIF cnBP chrVI:32028:- 0.625844525 not_in_intron_or_TIF cnBP chrVI:96258:+ 1.209966082 YFL021W cnBP chrVI:109969:- 0.292060778 not_in_intron_or_TIF cnBP chrVI:176327:- 3.75506715 YFR015C cnBP chrVI:268722:+ 0.375506715 not_in_intron_or_TIF cnBP chrXIV:69244:+ 0.834459367 YNL298W cnBP chrXIV:217113:- 0.542398588 YNL231C cnBP chrXIV:250194:- 0.083445937 YNL211C cnBP chrXIV:273231:+ 0.125168905 not_in_intron_or_TIF cnBP chrXIV:349280:- 0.125168905 YNL149C cnBP chrXIV:359077:- 0.083445937 not_in_intron_or_TIF cnBP chrXIV:369425:- 0.125168905 YNL137C cnBP chrXIV:373658:+ 2.086148417 not_in_intron_or_TIF cnBP chrXIV:380702:- 16.6057414 YNL130C cnBP chrXIV:394037:+ 0.584121557 YNL124W cnBP chrXIV:427177:+ 1.710641702 not_in_intron_or_TIF cnBP chrXIV:429575:- 8.886992255 not_in_intron_or_TIF cnBP chrXIV:429599:- 0.458952652 not_in_intron_or_TIF cnBP chrXIV:583910:- 3.546452309 YNL025C cnBP chrXIV:611579:+ 4.464357612 YNL012W cnBP chrXIV:726914:- 0.125168905 not_in_intron_or_TIF cnBP chrXVI:101222:+ 0.208614842 YPL237W cnBP chrXVI:115269:+ 65.7553981 YPL230W cnBP chrXVI:146507:- 10.9314177 not_in_intron_or_TIF cnBP chrXVI:159662:+ 0.208614842 YPL208W cnBP chrXVI:196570:- 0.125168905 YPL184C cnBP chrXVI:339245:+ 0.083445937 not_in_intron_or_TIF cnBP chrXVI:339305:+ 0.083445937 not_in_intron_or_TIF cnBP chrXVI:445519:- 3.629898245 not_in_intron_or_TIF cnBP chrXVI:481183:- 4.00540496 YPL037C cnBP chrXVI:490809:- 0.333783747 YPL032C cnBP chrXVI:513415:- 0.125168905 YPL020C cnBP chrXVI:560470:+ 0.50067562 not_in_intron_or_TIF cnBP chrXVI:579895:- 0.333783747 YPR010C cnBP chrXVI:590144:- 0.125168905 YPR015C cnBP chrXVI:616651:+ 0.125168905 not_in_intron_or_TIF cnBP chrXVI:617122:+ 0.125168905 not_in_intron_or_TIF cnBP chrXVI:685775:+ 0.083445937 YPR070W cnBP chrXVI:717057:- 0.417229683 YPR091C cnBP chrXVI:883587:- 2.044425448 not_in_intron_or_TIF cnBP chrXI:93384:- 13.39307284 
YKL186C cnBP chrXI:94081:- 5.215371042 not_in_intron_or_TIF cnBP chrXI:158626:- 0.083445937 not_in_intron_or_TIF cnBP chrXI:166478:+ 4.464357612 YKL150W cnBP chrXI:191089:- 0.083445937 YKL134C cnBP

159 chrXI:193010:- 38.80236055 YKL133C cnBP chrXI:225774:- 3.671621214 not_in_intron_or_TIF cnBP chrXI:231492:+ 0.458952652 YKL109W cnBP chrXI:272947:- 0.834459367 not_in_intron_or_TIF cnBP chrXI:285868:+ 0.625844525 YKL080W cnBP chrXI:290610:- 0.959628272 not_in_intron_or_TIF cnBP chrXI:408146:+ 25.74307146 YKL015W cnBP chrXI:446830:- 0.292060778 YKR004C cnBP chrXI:490071:+ 6.717397902 not_in_intron_or_TIF cnBP chrXI:633804:- 0.208614842 YKR098C cnBP chrXI:649449:+ 0.083445937 not_in_intron_or_TIF cnBP chrVII:51958:+ 0.375506715 YGL238W cnBP chrVII:59556:+ 0.125168905 YGL233W cnBP chrVII:140633:+ 0.083445937 not_in_intron_or_TIF cnBP chrVII:167430:+ 6.00810744 YGL178W cnBP chrVII:253205:- 95.17009077 YGL136C cnBP chrVII:262616:- 0.083445937 not_in_intron_or_TIF cnBP chrVII:293422:+ 0.083445937 YGL114W cnBP chrVII:293817:+ 0.75101343 YGL114W cnBP chrVII:345543:- 0.125168905 not_in_intron_or_TIF cnBP chrVII:380660:- 0.083445937 YGL065C cnBP chrVII:383554:+ 94.46080031 YGL063W cnBP chrVII:414313:+ 0.041722968 YGL045W cnBP chrVII:423867:- 0.333783747 not_in_intron_or_TIF cnBP chrVII:427172:- 0.375506715 YGL037C cnBP chrVII:436364:+ 39.9288807 not_in_intron_or_TIF cnBP chrVII:504217:- 0.125168905 not_in_intron_or_TIF cnBP chrVII:574777:+ 1.919256543 not_in_intron_or_TIF cnBP chrVII:593472:+ 0.292060778 not_in_intron_or_TIF cnBP chrVII:594140:- 1.043074208 not_in_intron_or_TIF cnBP chrVII:607018:- 0.375506715 not_in_intron_or_TIF cnBP chrVII:658296:+ 0.083445937 YGR089W cnBP chrVII:682969:- 0.083445937 not_in_intron_or_TIF cnBP chrVII:730063:+ 0.083445937 not_in_intron_or_TIF cnBP chrVII:772104:+ 0.417229683 YGR141W cnBP chrVII:792327:- 0.625844525 YGR150C cnBP chrVII:875660:+ 0.125168905 not_in_intron_or_TIF cnBP chrVII:914790:- 0.667567493 YGR210C cnBP chrVII:962303:+ 0.667567493 not_in_intron_or_TIF cnBP chrVII:1002936:+ 0.166891873 not_in_intron_or_TIF cnBP chrVII:1052638:- 0.333783747 not_in_intron_or_TIF cnBP chrXIII:56618:+ 0.25033781 YML106W cnBP 
chrXIII:112660:- 0.208614842 YML076C cnBP chrXIII:204536:- 0.584121557 not_in_intron_or_TIF cnBP chrXIII:242968:- 0.50067562 YML015C cnBP chrXIII:273520:- 0.458952652 not_in_intron_or_TIF cnBP chrXIII:301516:- 0.417229683 YMR015C cnBP

160 chrXIII:493957:+ 0.125168905 not_in_intron_or_TIF cnBP chrXIII:551142:- 6.049830409 YMR142C cnBP chrXIII:559830:+ 128.5484654 not_in_intron_or_TIF cnBP chrXIII:637453:+ 0.667567493 YMR189W cnBP chrXIII:647112:+ 1.168243113 YMR192W cnBP chrXIII:649541:- 0.125168905 not_in_intron_or_TIF cnBP chrXIII:653956:- 0.166891873 not_in_intron_or_TIF cnBP chrXIII:701813:+ 0.166891873 YMR217W cnBP chrXIII:810474:- 0.292060778 YMR272C cnBP chrXIII:845438:- 0.083445937 not_in_intron_or_TIF cnBP chrXV:87614:+ 0.667567493 YOL123W cnBP chrXV:139434:+ 0.208614842 not_in_intron_or_TIF cnBP chrXV:418633:+ 0.125168905 not_in_intron_or_TIF cnBP chrXV:423675:- 1.293412018 YOR049C cnBP chrXV:437796:- 0.458952652 not_in_intron_or_TIF cnBP chrXV:506801:- 16.68918733 YOR097C cnBP chrXV:518764:- 2.711992942 not_in_intron_or_TIF cnBP chrXV:527237:+ 0.417229683 YOR109W cnBP chrXV:605429:+ 0.292060778 not_in_intron_or_TIF cnBP chrXV:674437:- 0.083445937 YOR180C cnBP chrXV:720499:+ 2.75371591 not_in_intron_or_TIF cnBP chrXV:733355:- 7.76047211 YOR207C cnBP chrXV:754237:- 0.166891873 not_in_intron_or_TIF cnBP chrXV:819159:+ 0.125168905 YOR264W cnBP chrXV:845022:+ 13.05928909 not_in_intron_or_TIF cnBP chrXV:1059709:- 0.125168905 not_in_intron_or_TIF cnBP chrXV:1062512:- 1.877533575 not_in_intron_or_TIF cnBP chrIX:54026:+ 0.834459367 YIL153W cnBP chrIX:127053:- 1.126520145 not_in_intron_or_TIF cnBP chrIX:134040:+ 8.636654445 YIL121W cnBP chrIX:225116:+ 0.166891873 not_in_intron_or_TIF cnBP chrIX:261766:+ 1.710641702 not_in_intron_or_TIF cnBP chrIX:334600:- 0.208614842 not_in_intron_or_TIF cnBP chrIX:360694:- 0.166891873 not_in_intron_or_TIF cnBP chrXII:50297:+ 1.877533575 YLL043W cnBP chrXII:90638:+ 0.25033781 YLL026W cnBP chrXII:155741:- 0.792736398 YLR002C cnBP chrXII:232370:+ 0.166891873 not_in_intron_or_TIF cnBP chrXII:316897:+ 0.292060778 YLR088W cnBP chrXII:331859:- 0.584121557 YLR095C cnBP chrXII:380684:+ 0.458952652 YLR115W cnBP chrXII:382366:+ 1.418580923 YLR116W cnBP chrXII:491545:+ 
3.838513087 not_in_intron_or_TIF cnBP chrXII:522715:+ 14.56131595 YLR185W cnBP chrXII:609447:- 7.885641015 YLR233C cnBP chrXII:791911:- 0.083445937 not_in_intron_or_TIF cnBP chrXII:823170:+ 0.292060778 not_in_intron_or_TIF cnBP

161 chrXII:918954:- 0.458952652 YLR398C cnBP chrXII:928301:- 0.667567493 not_in_intron_or_TIF cnBP chrXII:965102:+ 0.083445937 not_in_intron_or_TIF cnBP chrXII:982469:- 71.84695147 not_in_intron_or_TIF cnBP chrXII:987203:+ 7.969086952 YLR426W cnBP chrXII:987240:+ 0.542398588 YLR426W cnBP chrIII:84709:+ 0.083445937 not_in_intron_or_TIF cnBP chrIII:120379:+ 0.125168905 not_in_intron_or_TIF cnBP chrIII:169189:- 0.208614842 not_in_intron_or_TIF cnBP chrIII:228656:+ 0.834459367 YCR063W cnBP chrIII:290580:+ 0.166891873 YCR095W-A cnBP chrX:31820:+ 12.18310675 YJL213W cnBP chrX:209415:+ 0.333783747 YJL111W cnBP chrX:237005:+ 0.25033781 YJL100W cnBP chrX:349225:+ 0.50067562 not_in_intron_or_TIF cnBP chrX:422631:+ 4.631249485 not_in_intron_or_TIF cnBP chrX:455748:+ 0.083445937 not_in_intron_or_TIF cnBP chrX:506239:+ 0.125168905 not_in_intron_or_TIF cnBP chrX:535286:- 0.125168905 not_in_intron_or_TIF cnBP chrX:632996:- 59.1631691 YJR109C cnBP chrX:649521:- 0.125168905 not_in_intron_or_TIF cnBP chrX:712053:- 0.125168905 not_in_intron_or_TIF cnBP chrIV:104615:+ 0.625844525 not_in_intron_or_TIF cnBP chrIV:122159:+ 465.4614348 YDL189W cnBP chrIV:130328:+ 0.208614842 YDL185W cnBP chrIV:188133:- 0.166891873 YDL148C cnBP chrIV:232837:+ 1.00135124 YDL128W cnBP chrIV:235107:+ 4.631249485 YDL127W cnBP chrIV:247245:- 0.125168905 YDL119C cnBP chrIV:268933:- 2.127871385 not_in_intron_or_TIF cnBP chrIV:331277:+ 31.87634781 YDL070W cnBP chrIV:333709:+ 0.25033781 not_in_intron_or_TIF cnBP chrIV:381064:- 0.166891873 YDL040C cnBP chrIV:392681:+ 0.584121557 not_in_intron_or_TIF cnBP chrIV:456688:- 1.376857955 YDR005C cnBP chrIV:540585:+ 1.293412018 YDR041W cnBP chrIV:676357:+ 1.335134987 YDR110W cnBP chrIV:698041:- 0.125168905 YDR123C cnBP chrIV:721691:- 0.125168905 not_in_intron_or_TIF cnBP chrIV:768400:- 1.50202686 not_in_intron_or_TIF cnBP chrIV:929253:+ 2.086148417 YDR232W cnBP chrIV:1021386:+ 0.125168905 YDR280W cnBP chrIV:1080238:- 0.125168905 YDR309C cnBP chrIV:1081204:+ 0.50067562 
not_in_intron_or_TIF cnBP chrIV:1111887:+ 0.458952652 not_in_intron_or_TIF cnBP chrIV:1114334:+ 2.378209195 not_in_intron_or_TIF cnBP chrIV:1333316:- 0.125168905 YDR435C cnBP

162 chrIV:1490517:- 1.084797177 not_in_intron_or_TIF cnBP chrIV:1519505:- 0.125168905 YDR541C cnBP chrI:126071:+ 0.166891873 YAL016W cnBP chrI:134871:- 0.083445937 YAL010C cnBP chrI:142343:+ 4.297465739 YAL003W cnBP chrVIII:97941:- 0.166891873 YHL007C cnBP chrVIII:148769:- 5.674323694 not_in_intron_or_TIF cnBP chrVIII:335607:- 0.75101343 YHR112C cnBP chrVIII:372283:+ 0.292060778 YHR134W cnBP chrVIII:400568:- 0.25033781 YHR151C cnBP chrVIII:442922:+ 0.083445937 YHR169W cnBP chrVIII:491866:- 0.083445937 not_in_intron_or_TIF cnBP

Table II-S5. sacCer3 coordinates of lariat junction reads

List of BPs detected through lariat junction (LJ) reads from the Lariat-seq data. BP positions were determined from the locations of the 3'-most ends of LJ reads and from sequence information, as described in the methods. The 'reads_at_dist_to_bp' field gives the number of LJ reads ending at various distances from the reported BP; the first entry in the list is the count at distance zero. The final two fields are 1 if the 5'SS was previously annotated in the SGD annotations or the BP was previously annotated by Meyer et al., respectively, and 0 otherwise.
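As an illustration of the field layout described above, the following is a minimal Python sketch for reading one row of this table. The class and function names (`LariatJunctionRow`, `parse_row`) are hypothetical and are not part of the original analysis; the sketch assumes whitespace-separated fields in the order given by the table header.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LariatJunctionRow:
    """One row of the lariat junction table (hypothetical container)."""
    strand: str
    chrom: str
    five_ss: int
    bp: int
    reads_at_dist_to_bp: List[int]
    bp_seq: str
    anno_5ss: bool
    anno_bp: bool


def parse_row(line: str) -> LariatJunctionRow:
    """Split a whitespace-separated row into typed fields."""
    strand, chrom, five_ss, bp, reads, bp_seq, a5, abp = line.split()
    return LariatJunctionRow(
        strand=strand,
        chrom=chrom,
        five_ss=int(five_ss),
        bp=int(bp),
        # Comma-separated read counts; index 0 is the count at distance zero.
        reads_at_dist_to_bp=[int(x) for x in reads.split(",")],
        bp_seq=bp_seq,
        anno_5ss=(a5 == "1"),
        anno_bp=(abp == "1"),
    )


row = parse_row("- chrIV 239509 239421 4,9,353 AATACTAACAAT 1 1")
total_reads = sum(row.reads_at_dist_to_bp)   # all LJ reads assigned to this BP
reads_at_bp = row.reads_at_dist_to_bp[0]     # reads ending exactly at the BP
ss_to_bp = abs(row.bp - row.five_ss)         # coordinate distance, 5'SS to BP
```

The coordinate distance computed here is only the difference between the two listed positions; whether it equals the lariat loop length exactly depends on coordinate conventions that this table does not state.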

strand chrom 5'ss bp reads_at_dist_to_bp bp_seq anno_5'ss anno_bp
- chrII 47146 47074 2,0,2 GTTACTAATATG 1 1
+ chrII 125155 125231 49 ATTACTAACATT 1 1
+ chrII 170677 170768 0,11 AGTACTAACGTT 1 1
- chrIII 178213 177956 1,6 GATACTAACAAC 1 1
+ chrIV 122078 122159 19 CGTACTAACAAC 1 0
- chrIV 215384 215274 19 TTTACTAACGAG 0 0
+ chrIV 230020 230262 122,12,5 TTTACTAACAAA 1 1
- chrIV 239509 239421 4,9,353 AATACTAACAAT 1 1
+ chrIV 331189 331277 23 ATCACTAACCTG 0 0
+ chrIV 399362 399470 100,5,0,0,0,0,3 AATACTAACCAT 1 1
- chrIV 715358 715267 109 TATACTAACAAA 1 1
- chrIX 99385 99153 1,6,33 AATACTAACAAA 1 1
+ chrIX 225875 225994 2 TTTACTAATATT 0 0
+ chrIX 317018 317138 5 TTTACTAACAGG 0 1
- chrIX 348494 348383 1,16 TTTACTAACTAT 1 1
- chrV 166874 166787 4 ATTACTAACATC 1 1
- chrV 248671 248563 2 CATTCTAACATT 0 0
+ chrV 362733 362835 2 AATTCTAACGCA 1 0
- chrV 505180 505049 0,0,2 AGTACTAACCAG 0 0
- chrVI 221414 221303 30,0,0,0,0,0,0,0,0,8 TATACTAACAGA 1 1
- chrVII 73137 73035 10 ATTACTAACAAG 1 1
+ chrVII 249887 250015 35 GTTACTAACAGG 1 1
- chrVII 497458 497390 0,0,5 CTTACTAACTGT 1 1
- chrVII 1061028 1060825 10 GATACTAACTTT 0 0
+ chrVII 1084883 1085006 14 CTTACTAACTGA 1 1
+ chrVIII 129529 129617 2 AATACTAACATA 1 1
- chrVIII 138408 138281 5,65 TGTACTAACAAC 1 1
+ chrVIII 251156 251231 0,1 GAGACTAACTTT 1 1
- chrVIII 505516 505289 15 TTTACTAACAAG 1 1
+ chrX 73797 74179 2 TTTACTAACAAC 1 1
+ chrX 236903 237010 3 ATCACTGACATA 0 0
+ chrX 435228 435309 2 ATTACTAACTAA 1 1
+ chrX 608307 608548 32,3,3 TTTACTAACAAA 1 1
- chrX 703054 702881 5 TTTACTAACGAG 1 1
- chrXI 355283 355153 0,4 CATACTTACAGT 0 0
- chrXI 430596 430520 3,5 GCTACTAACTAT 1 1
- chrXII 327399 327294 5 TAGACTAACGTT 1 1
+ chrXII 382301 382388 6 TTTACTTACTAG 0 0
- chrXII 707892 707769 0,0,2 CGTACTGACATT 0 0
- chrXIII 23658 23500 3,30 TTTACTAACAGT 1 1
+ chrXIII 236592 236788 72,2,1 GCCACTAACAAT 1 1
- chrXIII 243056 242969 0,0,4 AATACTGACAAT 0 0
+ chrXIII 424998 425114 255,10,6 TTTACTAACAAA 1 1
- chrXIII 500151 499899 2,42 CTTACTAACAAA 1 1
- chrXIII 721345 721232 313 AATACTAACAGC 1 1
+ chrXIV 48294 48378 10,0,3 ATTACTAACAAT 1 1
- chrXIV 237531 237419 0,0,9 TTTACTGACCTA 0 0
- chrXIV 380781 380701 11 ATTACTAATCTG 1 0
+ chrXIV 502164 502269 0,0,10 GATACTGACTAT 0 0
+ chrXIV 616067 616229 4,0,4 TCAACTTACTGT 0 0
- chrXV 552874 552738 27,23,51 ATTACTAACTGG 1 1
+ chrXV 780121 780265 604,27,100 TTGACTAACACA 1 1
- chrXVI 76223 76014 6 GTTACTAACATA 1 1
- chrXVI 281503 281386 18 TTGACTAACACA 1 1
- chrXVI 582701 582570 76,0,0,0,0,0,0,0,0,0,0,10,59 TATACTAACAAA 1 1
+ chrXVI 623578 623665 481,39,30 TATACTAACAAG 1 1
+ chrXVI 833694 833783 18 AAAACTAACAAT 1 1
+ chrXVI 943051 943174 6,0,1,0,5 CTTACTAACTGA 1 1

Table II-S6. Novel splice junctions with entropy ≥ 2 bits.

The junction field is the chromosome, first splice site coordinate, second splice site coordinate, and strand, joined by colons. The first splice site is always the one with the lower chromosomal coordinate, so if the strand is “+” the first SS is the 5'SS and if the strand is “-” the first SS is the 3'SS. a_SS1 and a_SS2 are 0 for unannotated splice sites and 1 for annotated splice sites. a_SS_pair is 1 if this pair of splice sites is annotated as a splice junction in the existing SGD gff annotations. k_SS1, k_SS2, and k_SS_pair are set to 0 if not annotated and not found by Kawashima et al.; 1 if annotated by Kawashima et al.; 2 if not annotated but occurring in a Kawashima et al. junction from WT or UPF1 null; and 3 if not annotated but occurring in a Kawashima et al. junction not from WT or UPF1 null. intron_containing_gene is 1 if the junction is from an intron-containing gene and 0 otherwise.
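The strand-dependent reading of the junction field can be sketched in Python as follows. This is an illustrative helper only; the names `Junction` and `parse_junction` are hypothetical, and the example strings simply follow the colon-joined format described above.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    """A splice junction with its 5'SS and 3'SS resolved by strand."""
    chrom: str
    five_ss: int
    three_ss: int
    strand: str


def parse_junction(field: str) -> Junction:
    """Parse 'chrom:ss1:ss2:strand', where ss1 is the lower coordinate."""
    chrom, ss1, ss2, strand = field.split(":")
    first, second = int(ss1), int(ss2)
    if strand == "+":
        # On the plus strand the lower coordinate is the 5'SS.
        return Junction(chrom, first, second, strand)
    if strand == "-":
        # On the minus strand the lower coordinate is the 3'SS.
        return Junction(chrom, second, first, strand)
    raise ValueError(f"unexpected strand: {strand!r}")


minus_example = parse_junction("chrVII:72983:73137:-")
plus_example = parse_junction("chrX:74112:74204:+")
```

Resolving the 5'SS and 3'SS once at parse time avoids repeating the strand check wherever the coordinates are used later.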

junction a_SS1 a_SS2 a_SS_pair strand entropy 5ss_long 3ss_long k_SS1 k_SS2 k_SS_pair intron_containing_gene
chrXVI:717048:717146:- 0 0 0 - 2.61 GTATGCTTTT CGCTATAAAG 0 0 0 0
chrVII:72983:73137:- 0 1 0 - 4.21 GTACGTTGCC CTTAAGAAAG 2 1 2 1
chrXIV:494321:494973:- 0 1 0 - 3.24 GTACGTAAAA CTCCATCTAG 3 1 3 1
chrIV:1401795:1402184:+ 0 1 0 + 3.66 GTTGGTACGT ATTATCGTAG 2 1 2 1
chrXVI:406645:407019:+ 1 0 0 + 3.54 GTATGTCCAT AATTTAAAAG 1 2 0 1
chrXI:618373:618526:- 1 0 0 - 4.19 GTTTGTTTGT TTTTGTACAG 1 2 0 1
chrV:131775:131880:+ 1 0 0 + 3.50 GTATGTTTGA AACTTCAAAG 1 2 0 1
chrXIV:443658:444171:- 0 1 0 - 2.32 GTATGTTAAA AACCATCTAG 0 1 0 1
chrIV:601387:601496:+ 0 0 0 + 3.73 ATTGACTATC AACCAACCAC 0 0 0 0
chrII:110218:110505:- 0 1 0 - 2.86 GTAAGTATCC ATTGAGGAAG 3 1 3 1
chrXIV:557609:557698:+ 1 0 0 + 2.16 GTATGTATTC ATTTGCCCAG 1 2 0 1
chrII:426515:426630:- 2 0 0 - 3.88 GTAAGTCAGG ATTAACTTAG 1 2 0 1
chrXII:242320:242690:+ 1 0 0 + 2.00 GTATGTACAC GTCCACCAAG 1 2 0 1
chrII:393179:393507:+ 1 0 0 + 4.78 GTATGTACAC TTTATCTAAG 1 2 0 1
chrV:423823:423951:+ 0 1 0 + 2.85 GTAGAGGCAA TAATTTTTAG 0 1 0 0
chrX:608481:608581:+ 0 1 0 + 2.72 GTAGGTCCAC TTTTTTGCAG 0 1 0 1
chrXIV:331451:331837:+ 0 1 0 + 3.38 GTATAATCTG TTCTATTTAG 2 1 2 1
chrXVI:138724:138868:+ 1 0 0 + 4.10 GTATGTTATC GTACAGTCAG 1 3 0 1
chrVII:436311:436374:+ 0 0 0 + 3.90 GTATGTATTT ATCTTTACAG 0 0 0 1
chrXVI:412311:412995:+ 0 0 0 + 3.23 GTATGGAGTT ATTGGAACAG 2 2 2 1
chrIV:308521:308792:+ 0 1 0 + 2.48 GCATGCATAA ATTATATCAG 0 1 0 1
chrIII:107032:107304:+ 1 0 0 + 2.25 GTATGTGTCA AGAAGTACAG 1 0 0 1
chrII:60188:60697:- 0 1 0 - 2.12 GTATGTTAAA AACTAGTTAG 2 1 2 1
chrXVI:412255:412995:+ 0 0 0 + 2.95 GTATGGTATG ATTGGAACAG 2 2 2 1
chrXIII:23359:23654:- 1 0 0 - 3.03 GTATGCGTTC TTTACAACAG 1 2 0 1
chrXII:28461:28834:- 0 0 0 - 2.25 GTATGACACA GGATATTAAG 0 0 0 0
chrIV:1359968:1360373:+ 1 0 0 + 4.10 GTATGTTTAT TATCAATAAG 1 2 0 1
chrXIII:551950:552507:+ 1 0 0 + 4.13 GTATGTTTTC TTTTGGTAAG 1 2 0 1
chrXIV:185490:185566:+ 1 0 0 + 3.78 GTATGTAGGA ATGCACTTAG 1 2 0 1
chrII:168647:168808:+ 0 1 0 + 2.59 GTACGTGTCT TTTTTCACAG 0 1 0 1
chrXIII:914403:914648:- 0 0 0 - 2.16 GTAAGTAAGT ATGTGGTCAG 0 0 0 0
chrXIII:424996:425124:+ 1 0 0 + 4.52 GTATGTTGTT CAAACACAAG 1 2 0 1
chrIV:216156:216512:+ 1 0 0 + 2.92 GTATGTAACG AGGAATTAAG 1 0 0 0
chrVII:311227:311526:+ 0 1 0 + 3.45 GTACTCTTCC TTTTGTACAG 3 1 3 1
chrXVI:271302:271896:+ 0 0 0 + 2.42 GTAAGTATGA GTTCATCAAG 0 0 0 0
chrVII:365508:365985:- 0 1 0 - 3.19 GTATGTATAC TTGACCCCAG 3 1 3 1
chrXI:551679:552043:+ 1 0 0 + 3.03 GTATGTTCGA CTACCAACAG 1 2 0 1
chrXII:242320:243162:+ 1 0 0 + 2.48 GTATGTACAC TGGTCGACAG 1 3 0 1
chrVII:555829:556109:+ 1 0 0 + 2.99 GTATGTTTGG CCGGCTTTAG 1 2 0 1
chrVII:497364:497999:- 1 1 0 - 2.32 GTAAGTACAG TATAAAATAG 1 1 0 1
chrXI:155270:155636:+ 1 0 0 + 4.46 GTATGTTTAC AAGTACGAAG 1 2 0 1
chrVIII:498706:498786:- 0 1 0 - 3.90 GTATGTCACC GAAACAACAG 2 1 2 1
chrVII:253184:253248:- 0 0 0 - 4.80 GTATGAACCC CTTATTTTAG 0 0 0 0
chrIV:417220:417626:- 0 0 0 - 3.29 GTATGTATTT ATTGTTATAG 0 0 0 0
chrIV:629904:630056:+ 1 0 0 + 2.66 GTATGTTCAA TTTTGTCCAG 1 3 0 1
chrIV:601314:601470:+ 0 0 0 + 3.43 ATACTACTTA CACTGAGTAC 0 0 0 0
chrII:691965:692125:+ 0 0 0 + 2.46 GTATGGAAAC AGTTTAGAAG 0 0 0 0
chrXII:233641:233885:- 0 0 0 - 3.45 GTATGTCTTG ACCGCCCCAG 0 0 0 0
chrXI:437836:437925:+ 1 0 0 + 2.90 GTATGTTGTT CTTTAGAAAG 1 2 0 1
chrXII:766185:766249:- 0 1 0 - 3.20 GTATGTATCT CATATATAAG 3 1 3 1
chrXIV:557573:557684:+ 0 1 0 + 2.75 GTACGTAAAT AACAATGCAG 2 1 2 1
chrVII:534472:534781:- 1 0 0 - 2.94 GTATCTATAA TTTATCATAG 1 2 0 0
chrXII:242320:242826:+ 1 0 0 + 2.29 GTATGTACAC TCTCTTCCAG 1 2 0 1
chrII:393179:393669:+ 1 0 0 + 2.52 GTATGTACAC AGTATCCAAG 1 0 0 1
chrII:691965:692133:+ 0 0 0 + 3.87 GTATGGAAAC AGGTAAGCAG 0 0 0 0
chrX:74112:74204:+ 0 1 0 + 4.44 GTTGTTAGAT TCATGTATAG 2 1 2 1
chrII:653367:653524:+ 1 0 0 + 3.61 GTATGTTCTG ACGCAAACAG 1 3 0 1
chrIV:399360:399495:+ 1 0 0 + 2.67 GTATGTTGTT CCTTGATCAG 1 2 0 1
chrI:128520:129021:- 0 0 0 - 3.96 GTATGGATGT GCAACAGCAG 0 0 0 0
chrXIV:616065:616412:+ 0 0 0 + 2.97 GTATGTGCAA ACTATCAAAG 0 0 0 0
chrIV:733684:733775:- 0 1 0 - 2.97 GTATGTTCAT AAATTAACAG 3 1 3 1
chrXII:281426:281628:- 0 0 0 - 2.83 GTAGGTCATG AACTATCTAG 0 0 0 0
chrII:142752:142846:- 0 1 0 - 4.03 GTATGTTACT GACCATCAAG 2 1 2 1
chrXIV:185490:185578:+ 1 0 0 + 3.96 GTATGTAGGA AGTGCGGTAG 1 2 0 1
chrIV:306804:307073:- 0 0 0 - 3.46 GTACGTTGAC GTTAATTTAG 0 0 0 1
chrI:128523:129021:- 0 0 0 - 3.11 GTATGGATGT CCAGCAACAG 0 0 0 0
chrXIII:557827:558001:- 0 0 0 - 2.64 GTAAGATCAG AGTTCTTAAG 0 0 0 0
chrXVI:729352:729481:- 0 1 0 - 2.72 GTATGTACAG TCCTATGTAG 2 1 2 1
chrVII:62130:62196:+ 1 0 0 + 3.86 GTATGTCTGT TAGTTTAAAG 1 2 0 1
chrIV:307015:307765:- 0 1 0 - 2.62 GTATGTTAAA CTTACGACAG 0 1 0 1
chrIV:1406991:1407231:+ 0 0 0 + 2.16 GTATGTTACG ACTCACTTAG 0 0 0 0
chrX:649513:649657:- 0 0 0 - 2.24 GTAAGTTATG ATACAAAAAG 0 0 0 0
chrXII:982457:982538:- 0 0 0 - 2.78 GTATGTATGA GTTTGAGTAG 0 0 0 0
chrXVI:673747:674376:+ 0 0 0 + 2.50 GTATGTCTGA TGAACAAAAG 0 0 0 0
chrXII:856568:857057:+ 0 1 0 + 3.18 GCATGGTATG CCTTATTTAG 2 1 2 1
chrXV:867396:867586:+ 0 1 0 + 3.88 GTATGAAATA TTTTATACAG 2 1 2 1
chrVI:223439:223727:- 1 0 0 - 3.05 GCAGGTAGCC TCTTCTCCAG 1 3 0 0
chrVI:64352:64920:- 0 1 0 - 3.28 GTACGTTAAT TTTGAAGAAG 2 1 2 1
chrII:604513:604930:+ 1 0 0 + 2.55 GTACGTATTA TTTTTAGGAG 1 0 0 1
chrVII:439096:439323:+ 0 1 0 + 4.16 GTATAACATG ATTTCAACAG 2 1 2 1
chrVII:439098:439313:+ 0 0 0 + 3.80 ATAACATGAT TATCGTTTAC 2 2 2 1
chrXIV:62360:62923:- 0 1 0 - 3.59 GTACGTATAA GACGTTGCAG 2 1 2 1
chrXV:505936:505995:+ 1 0 0 + 3.34 GTATGTTATT AGGAAAAAAG 1 2 0 1
chrII:170619:170804:+ 0 1 0 + 4.07 GTATGCTTTT TTCCACTCAG 0 1 0 1
chrVIII:505242:505516:- 0 1 0 - 3.08 GTATGTTGAT ATATAGAAAG 2 1 2 1
chrII:342697:342789:+ 0 0 0 + 3.63 GTGTGTTAGT TATAATCAAG 0 0 0 0
chrXIV:272386:273601:- 0 0 0 - 2.28 GTTTGTGTAC ATTTCGACAG 0 0 0 0
chrII:170675:170757:+ 1 0 0 + 3.87 GTATGTTCAT TTGAAAAAAG 1 2 0 1
chrXIV:728553:728921:+ 0 0 0 + 3.35 GTAAGTATTT GCCAGCACAG 0 0 0 0
chrXIII:559781:560157:+ 0 0 0 + 5.14 GTATGTCTGT ATTTTTGAAG 0 0 0 0
chrII:565749:566935:+ 0 0 0 + 2.25 ATGGATTTTT CTGTACGGAC 0 0 0 0
chrXII:898547:898645:+ 0 1 0 + 3.51 GTATAAAAAA ATTATTGCAG 2 1 2 0
chrVII:439380:439479:+ 0 0 0 + 4.35 GTAAGTACAC TTTGAGAAAG 2 2 2 1
chrXII:457115:466254:- 0 0 0 - 2.92 GTTAAAAAGC GTTGTTGCAG 0 0 0 0
chrIV:340809:341183:- 1 0 0 - 3.82 GTATGCAGAA TTGGTTATAG 1 0 0 0
chrII:60207:60697:- 0 1 0 - 2.65 GTATGTTAAA CGATATTGAG 3 1 3 1
chrXI:109576:109890:+ 1 0 0 + 2.82 GTATGTCAAG CAGGCTAAAG 1 2 0 1
chrXV:423656:423734:- 0 0 0 - 2.85 GTATGGTACC TGTCGTACAG 0 0 0 0
chrIX:317016:317171:+ 0 1 0 + 4.39 GTATGAGAAT ATTTAAACAG 2 1 2 1
chrV:362911:363092:+ 0 1 0 + 2.37 GTACTGCAAT ATTAAAATAG 3 1 3 0
chrXIII:551950:552510:+ 1 0 0 + 2.16 GTATGTTTTC TGGTAAGAAG 1 0 0 1
chrIII:101604:101700:- 0 1 0 - 2.59 GTATGTATAT TGCGTTCAAG 3 1 3 1
chrII:168553:168808:+ 0 1 0 + 3.45 GTTGGTAGCA TTTTTCACAG 3 1 3 1
chrXI:93365:93465:- 0 1 0 - 5.06 GTACGTATAA GTTTTTATAG 2 1 2 0
chrIV:216156:216521:+ 1 0 0 + 3.08 GTATGTAACG GATTTCATAG 1 0 0 0
chrXII:855876:856427:+ 1 0 0 + 4.37 GTATGTGGAC CAAAACATAG 1 0 0 0
chrXI:437836:437913:+ 1 0 0 + 2.99 GTATGTTGTT AGAATCAAAG 1 2 0 1
chrXVI:795028:795377:+ 1 0 0 + 2.66 GTATGTACAA TTGTCAAAAG 1 2 0 1
chrXIV:494523:494632:- 1 0 0 - 2.12 GTAATGGTAA TTTATTATAG 1 0 0 1
chrIV:491687:491898:+ 0 1 0 + 4.33 GCATGTTTAT CTTTTTTTAG 2 1 2 1
chrXV:778858:779252:- 1 0 0 - 4.31 GTATGAATAT ATTTTGATAG 1 3 0 1
chrX:469182:469405:- 1 0 0 - 3.10 GCAGGTAAAC TGTAGTATAG 1 2 2 1
chrVIII:148115:148508:- 1 0 0 - 4.02 GTATGCGTTT TTCATTACAG 1 2 0 1
chrVIII:103616:103856:- 0 0 0 - 3.11 GTAAGGTGAG GTAACTTGAG 0 0 0 0
chrVII:383484:383580:+ 0 0 0 + 3.31 GTATGTATGA ATGGTAGTAG 0 0 0 0
chrXVI:795133:795394:+ 0 1 0 + 3.90 GTAAGGGAGA TTTAAAACAG 2 1 2 1
chrXV:349496:349598:- 0 0 0 - 2.73 GTATGTTTTT TTACTTTTAG 0 0 0 0
chrXIV:144846:145254:- 0 1 0 - 3.20 GTATGTTTAT AAGCTATCAG 2 1 2 1
chrXII:242320:242775:+ 1 0 0 + 3.33 GTATGTACAC TGGCTGCTAG 1 0 0 1
chrXI:625900:625986:+ 1 0 0 + 3.18 GTAAGTAGAA AGAACTAAAG 1 2 0 1
chrVII:62130:62183:+ 1 0 0 + 3.91 GTATGTCTGT CGATCAAAAG 1 2 0 1
chrVII:920663:921129:+ 1 0 0 + 2.65 GTATGTTATA GTTCACCAAG 1 2 0 1
chrXII:522889:523028:+ 0 1 0 + 3.42 GTATGCCTGA TTTTAAATAG 3 1 3 1
chrVII:365525:365969:- 1 0 0 - 2.82 GTCTATTTTA ATTATTACAG 1 2 0 1
chrXIV:62369:62923:- 0 1 0 - 2.50 GTACGTATAA TCAATTAGAG 0 1 0 1
chrIV:122076:122167:+ 1 0 0 + 2.11 GTATGTTGAA AACAACTAAG 1 0 0 0
chrIV:306804:307765:- 0 1 0 - 3.76 GTATGTTAAA GTTAATTTAG 0 1 0 1
chrXVI:303560:303624:+ 0 0 0 + 2.00 ATTGGTTTGC TGCTTCTGAC 0 0 0 0
chrXV:505936:506252:+ 1 0 0 + 2.34 GTATGTTATT CATTTACAAG 1 0 0 1
chrIV:655202:655272:+ 0 0 0 + 3.06 GTATGCTTCC TGATAATCAG 0 0 0 0
chrXII:987140:987349:+ 1 0 0 + 3.45 GTATGTAAAG AATGGCATAG 1 2 0 1
chrIV:122076:122194:+ 1 0 0 + 2.00 GTATGTTGAA AGTATATAAG 1 0 0 0
chrXVI:243488:244025:- 0 0 0 - 3.55 GTATGTTTCT TACTTAGAAG 0 0 0 0
chrII:168599:168808:+ 0 1 0 + 2.69 GTACGAATTG TTTTTCACAG 0 1 0 1
chrXV:373898:374122:+ 0 0 0 + 2.48 GTAAGCATTC ATCCTATAAG 0 0 0 0
chrXIII:666921:667017:- 0 1 0 - 2.04 GTATGTGTGA AGGCTAACAG 2 1 2 1


Appendix III: BP Identification in Metazoans

Abstract

Pre-mRNA splicing removes introns from transcripts to produce mature mRNAs and is required to generate the full complement of protein diversity observed in cells. It has been shown in fly and human that introns longer than 10 Kb can be removed by sequential splicing reactions, a process known as recursive splicing. We have discovered the first endogenous case of recursive splicing in a short intron, only 383 nt long. By performing Branch-seq to locate fly branch points (BPs), we observed that the second intron of RPL13A is spliced out via two lariats. These lariats have a recursive splice site motif located between them, as expected for a recursively spliced intron. The recursive splicing event places the two snoRNAs in the second intron of RPL13A into distinct lariats, perhaps impacting the processing of these snoRNAs.

Introduction

Recursive splicing is the process by which an intron is removed in two or more separate segments. It involves a 5' splice site (5'SS) that splices to a 3' splice site (3'SS) juxtaposed to another 5'SS located inside the intron. The first splicing reaction regenerates a 5'SS so that another splicing reaction can occur. Recursive splicing was first discovered in the Ubx gene in Drosophila melanogaster (Hatton, Subramaniam, & Lopez, 1998; Langmead, Trapnell, Pop, & Salzberg, 2009) and has subsequently been detected in additional fly genes and in human genes (Burnette, Miyamoto-Sato, Schaub, Conklin, & Lopez, 2005; Duff et al., 2015; Kelly et al., 2015; Sibley et al., 2015) through a combination of computational predictions and analyses of experimental data.

We performed Branch-seq to study the impact of BP regulation on the outcome of splicing decisions in fly. Drosophila melanogaster introns are on average 81 nt long (Lim & Burge, 2001), and Branch-seq works best on lariat loops less than 100 nt in length. This combination made fly an ideal organism for BP sequencing using Branch-seq.

An additional advantage of Branch-seq is that it is untargeted. Computational BP prediction methods rely on the proximity of BP motifs to annotated 3'SS to predict BP locations, whereas Branch-seq's experimental approach can locate BPs at arbitrary intronic locations. Here we report the first example of recursive splicing inside a short intron.

Methods

Cell culture

S2R+ cells were a kind gift from Dr. Jessica Hurt. Cells were grown at room temperature in the dark in Schneider's Drosophila Medium (modified with L-Glutamine, VWR cat# CA12001-982) with 10% FBS (heat inactivated) and Penicillin-Streptomycin.

ldbr RNAi

Cells were grown for 3-5 days after application of dsRNA against debranching enzyme (ldbr). The dsRNA constructs DRSC10933 and DRSC36280 were obtained from the Drosophila RNAi Screening Center (DRSC) and amplified according to their protocols (http://www.flyrnai.org/DRSC-PRO.html).

ldbr qPCR

RNA was isolated from S2R+ cells using Trizol (Invitrogen). Reverse transcription was performed with random hexamer primers and PCR was performed with the primers below:
actin_L: TCTGGGTATGATCTGGACGA
actin_R: CAGACCATCCTTGAACGACA
ldbr_L: ACGACACCATAGAGGGCATC
ldbr_R: CCACTGTAGTATTTGTAAAAGGAGCA
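The quantification method is not specified here. Purely as an illustration, knockdown efficiency from qPCR data of this kind is commonly estimated with the comparative Ct (2^-ΔΔCt) approach, using actin as the reference transcript. The Python sketch below assumes that approach and roughly 100% primer efficiency; the function name and all Ct values are hypothetical and are not measurements from this work.

```python
def relative_expression(ct_target_kd: float, ct_ref_kd: float,
                        ct_target_wt: float, ct_ref_wt: float) -> float:
    """Fold change of the target in knockdown vs. WT by 2^-ddCt.

    Assumes both primer pairs amplify with ~100% efficiency
    (an assumption, not something reported in the text).
    """
    delta_kd = ct_target_kd - ct_ref_kd   # normalize knockdown to reference
    delta_wt = ct_target_wt - ct_ref_wt   # normalize WT to reference
    return 2.0 ** -(delta_kd - delta_wt)


# Hypothetical Ct values: ldbr crosses threshold 3 cycles later after RNAi,
# while the actin reference is unchanged.
fold = relative_expression(ct_target_kd=27.0, ct_ref_kd=18.0,
                           ct_target_wt=24.0, ct_ref_wt=18.0)
knockdown_percent = (1.0 - fold) * 100.0
```

Under these assumptions a shift of three cycles corresponds to an eightfold reduction, and a 90% knockdown would correspond to a shift of about 3.3 cycles.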

Branch-seq

Branch-seq was performed as in Chapter 2 with the following modifications. For the 2D gel: D1=6%, D2=20%. The arc was isolated as one sample (not split into top, middle, and bottom). In the first Branch-seq experiment, the WT arc, WT “tRNA”, ldbr knockdown (KD) arc, and KD dot were sequenced. In the second Branch-seq experiment, the KD arc was sequenced and a band corresponding to smaller material was cut from the RT gel to try to increase the percentage of reads mapping to the fly genome.

2D Gels

Worm 2D gels were run under the same conditions as the fly and yeast (Chapter 2) 2D gels. Mouse 2D gel conditions: D1 was 6% and run at 100 V for 2 hr, D2 was 10% and run at 200 V for 90 min. Mouse D1 and D2 gels were small, pre-cast, denaturing PAGE gels from Invitrogen.

Sequencing

Branch-seq libraries were sequenced (150 nt by 150 nt paired-end reads) on the MiSeq in the MIT BioMicro Center and on the Reddien Lab MiSeq. Reads were mapped as in Chapter 2 using Bowtie to dm3 (Langmead et al., 2009).

Results

Knockdown of ldbr does not result in a noticeable accumulation of lariat RNA

Knockdown of ldbr by RNAi in Drosophila S2R+ cells did not result in a noticeable accumulation of 2D arc RNA (Fig. III-1A) despite a 90% reduction in ldbr RNA levels (Fig. III-1B). Similarly, an arc was not visible in RNA from worms that were null for ldbr (Fig. III-S1A and see supplemental note). This was surprising since we observed a striking difference in 2D arc intensity in WT versus dbr1∆ yeast RNA (Fig. 2-1A). However, yeast heterozygous for DBR1 do not accumulate lariats (Chapman & Boeke, 1991; Hatton et al., 1998), so it is possible that even low levels of debranching enzyme are sufficient to debranch all available lariats. Additionally, if ldbr protein has a long half-life, ldbr RNA levels would not be a good proxy for the amount of debranching activity in the sample.


Figure III-1: ldbr knockdown by RNAi in S2R+ cells does not cause a noticeable accumulation of lariat RNA. (A) 2D gels of RNA from WT cells (left) and cells knocked down for ldbr (right). (B) qPCR quantification of ldbr knockdown efficiency.

Though we did not observe an abundant accumulation of lariat RNA in the RNAi treated S2R+ cell RNA, we did observe a faint arc in both WT and knockdown 2D gels (Fig. III-1A). Using that arc material we proceeded with Branch-seq library preparation. Branch-seq was performed twice in an attempt to increase the quality of the data (see methods).

Fly Branch-seq reads largely do not map to the fly genome

Unfortunately, a very low fraction of the total reads mapped uniquely to the fly genome. In the second Branch-seq sample, an order of magnitude more reads mapped uniquely to the fly genome than in the first sample, but this fraction was still only 3%. Mapping only the first 30 nt (bases 1-30) or second 30 nt (bases 31-60) of the reads did not improve the mapping statistics (Fig. III-2A). The reads did not appear to come from phiX (spike-in control for sequencing), E. coli, mouse, or human (Fig. III-2A). BLASTing (Altschul, Gish, Miller, Myers, & Lipman, 1990; Burnette et al., 2005; Duff et al., 2015; Kelly et al., 2015; Sibley et al., 2015) the reads suggested homology to snRNAs, but the hits came from species such as dolphin, which we do not work with in the lab.
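The windowed mapping test described above (bases 1-30 and bases 31-60) amounts to slicing each read and its quality string before alignment. The following Python sketch shows one way to produce such windows from FASTQ records; the helper names and the example record are hypothetical, and the original analysis may have trimmed reads differently, for example with the aligner's own trimming options.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (name, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def window(record: FastqRecord, start: int, end: int) -> FastqRecord:
    """Keep bases start..end-1 (0-based) of the read and its qualities."""
    name, seq, qual = record
    return name, seq[start:end], qual[start:end]


# Hypothetical 80 nt read used only to exercise the helpers.
example = ["@read1\n", "ACGT" * 20 + "\n", "+\n", "I" * 80 + "\n"]
records = list(parse_fastq(example))
first_30 = [window(r, 0, 30) for r in records]    # bases 1-30
second_30 = [window(r, 30, 60) for r in records]  # bases 31-60
```

Slicing the sequence and quality strings together keeps each trimmed record valid FASTQ, so either window can be passed to the aligner unchanged.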

In all, we obtained 96,185 reads that did map to the fly genome. Many of these reads map to introns and correctly identify annotated 5'SS. Surprisingly, the BP end reads are often very close to annotated 3'SS. In the case of CG9796 the BP end reads end at a typical BP motif (Fig. III-2B), but in most cases the BP reads are located a few nucleotides from the 3'SS, making it appear that the tail of the lariat was not digested (Fig. III-2C). Additionally, Branch-seq seems to have sequenced some snoRNAs (Fig. III-3A, top). The second sequencing experiment selected for shorter RNA fragments than the first experiment, explaining the differences observed in the first and second sequencing runs.

Fly Branch-seq reads identify the first recursive splice site in a short intron

Surprisingly, the Branch-seq reads identify a recursive splice site in the second intron of RPL13A, which is only 383 nucleotides long (Fig. III-3A). The recursive splice site contains the AG|GT sequence typical of recursive splice sites, where the “|” represents the location of the 3'SS to 5'SS boundary. As is the case in most of the other introns observed, the BP reads map very close to the 3'SS. Interestingly, the recursive splice site is located in between two snoRNAs in the RPL13A intron. The UCSC Genome Browser depicts this intron as having an alternative 3'SS (Fig. III-3B), which coincides with the location of the recursive splice site.
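To make the AG|GT notation concrete, the Python sketch below lists every position in a sequence where an AG dinucleotide is immediately followed by GT, reporting the 0-based index of the “|” boundary. The function name and the test sequence are hypothetical; the sequence is not the RPL13A intron. A bare AGGT match is only the minimal motif, so hits of this kind are candidates rather than evidence of recursive splicing.

```python
from typing import List


def find_ag_gt_boundaries(seq: str) -> List[int]:
    """Return 0-based boundary indices i such that seq[i-2:i] == 'AG'
    and seq[i:i+2] == 'GT' (that is, the position of '|' in AG|GT)."""
    seq = seq.upper()
    boundaries = []
    start = seq.find("AGGT")
    while start != -1:
        boundaries.append(start + 2)  # boundary sits between AG and GT
        start = seq.find("AGGT", start + 1)
    return boundaries


# Hypothetical sequence with two AG|GT occurrences.
candidates = find_ag_gt_boundaries("TTTCAGGTAAGTCCAGGTTT")
```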


Figure III-2: Rare Branch-seq reads that map to the fly genome often map to introns. (A) Read mapping statistics for first (left) and second (right) Branch-seq experiments. Reads that only map to one genomic location are blue. Reads for which the first 30 nt mapped uniquely to the genome were used for downstream analyses. (B) Reads that map to CG9796 identify the annotated 5'SS and a BP with a typical BP motif (boxed). Note CG9796 is on the reverse strand. 5'SS reads in pink, BP reads in blue. A single nucleotide polymorphism can be observed as the solid vertical blue stripe in the 5'SS reads. Dotted lines show 5'SS and BP read ends. (C) Reads that map to introns often map to the annotated 5'SS and to the annotated 3'SS with lower accuracy (dotted line for 3' end reads). Example genes shown: CG17836, B52, and InR. All reads are 30 nt (serve as scale bars).


Figure III-3: Branch-seq identifies a recursive splice site in a short intron. (A) The first Branch-seq experiment identified the AG|GT ratchet site in the RPL13A intron. As in Figure III-2C, the BP reads are very close to the 3'SS. The second experiment sequenced the snoRNAs inside the intron. (B) UCSC Genome Browser screenshot of the annotated gene structure of RPL13A, which depicts an alternative 3'SS in the ratchet intron (Kent et al., 2002; Lim & Burge, 2001; http://genome.ucsc.edu).

Discussion

Our discovery of recursive splicing in RPL13A is the first report of recursive splicing in an intron <10 Kb in length. Studies to date have only computationally predicted and subsequently found experimental evidence of recursive splice sites in introns ≥10 Kb. The recursive splicing in RPL13A might be used to regulate the levels of the two snoRNAs inside the intron, since snoRNA placement inside of introns, specifically snoRNA to BP distance, is known to be important for snoRNA processing. Further experiments are needed to determine if these snoRNAs are located in the lariat loops or lariat tails. Additionally, the recursive splicing may dictate 3'SS choice, but further experiments are needed to determine if the entire 383 nt intron can be removed by splicing that produces a single lariat.

To find additional fly BPs using Branch-seq, it would be advantageous to produce higher quality Branch-seq data. First, to increase the number of reads mapping to the fly genome, it would be preferable to start with more lariat RNA. Isolation of nuclear RNA should increase the proportion of lariat RNA to linear RNA, as lariats are presumably more abundant in the nucleus than in the cytoplasm. Additionally, deleting ldbr (and any homologs of ldbr), rather than using RNAi, should increase the half-life of lariats, allowing more lariat RNA to be captured. Second, since the BP reads in the existing data are located in very close proximity to the 3'SS, RNase R should be used to digest the lariat tails. RNase R treatment may also remove contaminating snoRNAs from the Branch-seq samples.

Supplemental note

Debranching enzyme depletion in worms and mouse does not result in an arc on 2D gel

WT and debranching enzyme null worm RNA did not show any striking differences when run on a 2D gel (Fig. III-S1A). This observation leaves two main possibilities: either lariat RNA did not accumulate in the deletion strain, or lariats did accumulate but did not migrate differently from linear RNA in the 2D gel because worm introns are very small, with an average size of 60 nt (Lim & Burge, 2001). Similarly, despite good knockdown of debranching enzyme using shRNAs in mouse embryonic stem cells (mESCs), no lariat arc was easily discernible (Fig. III-S1B). However, the mouse RNA should be run on larger, higher percentage 2D gels because mouse introns are much larger than fly, worm, or yeast introns.


Figure III-S1: 2D gels for worm and mESC RNA. (A) 2D gels on worm total RNA. Gel running conditions are the same as for fly. (B) Knockdown of DBR1 by shRNAs and 2D gels on mESC total RNA samples. D1: 6%, D2: 10%.

Acknowledgments

We thank Anna Corrionero Saiz for the worm RNA and Paul Boutz for the mESC RNA and knockdown quantification. We thank Jessica Hurt, Kerry Kelley, Karen Traverse, Mary-Lou Pardue, Frank Mason, Ky Lowenhaupt, Iva Kronja, and Jessica Von Stetina for training and supplies related to fly cell culture. We thank the MIT Bio Micro Center and the Reddien Lab at MIT for MiSeq sequencing.

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. doi:10.1016/S0022-2836(05)80360-2

Burnette, J. M., Miyamoto-Sato, E., Schaub, M. A., Conklin, J., & Lopez, A. J. (2005). Subdivision of large introns in Drosophila by recursive splicing at nonexonic elements. Genetics, 170(2), 661–674. doi:10.1534/genetics.104.039701

Chapman, K. B., & Boeke, J. D. (1991). Isolation and characterization of the gene encoding yeast debranching enzyme. Cell, 65(3), 483–492. doi:10.1016/0092-8674(91)90466-C

Duff, M. O., Olson, S., Wei, X., Garrett, S. C., Osman, A., Bolisetty, M., et al. (2015). Genome-wide identification of zero nucleotide recursive splicing in Drosophila. Nature, 521(7552), 376–379. doi:10.1038/nature14475

Hatton, A. R., Subramaniam, V., & Lopez, A. J. (1998). Generation of alternative Ultrabithorax isoforms and stepwise removal of a large intron by resplicing at exon-exon junctions. Molecular Cell, 2(6), 787–796.

Kelly, S., Georgomanolis, T., Zirkel, A., Diermeier, S., O'Reilly, D., Murphy, S., et al. (2015). Splicing of many human genes involves sites embedded within introns. Nucleic Acids Research. doi:10.1093/nar/gkv386

Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., & Haussler, D. (2002). The human genome browser at UCSC. Genome Research, 12(6), 996–1006. doi:10.1101/gr.229102

Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25. doi:10.1186/gb-2009-10-3-r25

Lim, L. P., & Burge, C. B. (2001). A computational analysis of sequence features involved in recognition of short introns. Proceedings of the National Academy of Sciences, 98(20), 11193–11198. doi:10.1073/pnas.201407298

Sibley, C. R., Emmett, W., Blazquez, L., Faro, A., Haberman, N., Briese, M., et al. (2015). Recursive splicing in long vertebrate genes. Nature, 521(7552), 371–375. doi:10.1038/nature14466
