bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Ancient exapted transposable elements drive nuclear localisation of lncRNAs

2

3 Joana Carlevaro-Fita1,2,4

4 Taisia Polidori1,2,4

5 Monalisa Das1,2,4

6 Carmen Navarro3

7 Rory Johnson1,2*

8 9 10 11 1. Department of Biomedical Research (DBMR), University of Bern, 3008 Bern, Switzerland

12 2. Department of Medical Oncology, Inselspital, University Hospital and University of Bern, 3010 Bern, 13 Switzerland

14 3. Department of Computer Science and Artificial Intelligence, University of Granada, Spain

15 4. Equal contribution 16

17 *Correspondence: [email protected]

18 Keywords: ; long noncoding RNA; lncRNA; evolution; exaptation.

19

20

1

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

21 Abstract

22

23 The sequence domains underlying long noncoding RNA (lncRNA) activities, including their 24 characteristic nuclear localisation, remain largely unknown. One hypothetical source of such 25 domains are neofunctionalised fragments of transposable elements (TEs), otherwise known as RIDLs 26 (Repeat Insertion Domains of Long Noncoding RNA). We here present a transcriptome-wide map of 27 putative RIDLs in human, using evidence from insertion frequency, strand bias and evolutionary 28 conservation of sequence and structure. In the exons of GENCODE v21 lncRNAs, we identify a total 29 of 5374 RIDLs in 3566 gene loci. RIDL-containing lncRNAs are over-represented in functionally- 30 validated and disease-associated genes. By integrating this RIDL map with global lncRNA 31 subcellular localisation data, we uncover a cell-type-invariant and dose-dependent relationship 32 between LINE2 and MIR elements and nuclear-cytoplasmic distribution. Experimental validations 33 establish causality, showing that these RIDLs drive the nuclear localisation of their host lncRNA. 34 Together these data represent the first map of TE-derived functional domains, and suggest a global 35 role for TEs as nuclear localisation signals of lncRNAs.

2

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

36 Introduction

37 The human genome contains many thousands of long noncoding RNAs, of which at least a fraction 38 are likely to have evolutionarily-selected biological functions (Ulitsky and Bartel 2013). Our current 39 thinking is that, similar to proteins, lncRNAs functions are encoded in primary sequence through 40 “domains”, or discrete elements that mediate specific aspects of lncRNA activity, such as molecular 41 interactions or subcellular localisation (Guttman and Rinn 2012; Rory Johnson and Guigó 2014; Mercer 42 and Mattick 2013). Mapping these domains is thus a key step towards the understanding and prediction of 43 lncRNA function in healthy and diseased cells.

44 One possible source of lncRNA domains are transposable elements (TEs) (Rory Johnson and Guigó 45 2014). TEs have been a major contributor to genomic evolution through the insertion and 46 neofunctionalisation of sequence fragments – a process known as exaptation (Bourque 2009)(Feschotte 47 2008). This process has contributed to the evolution of diverse features in genomic DNA, including 48 transcriptional regulatory motifs (Bourque et al. 2008; R. Johnson et al. 2006), microRNAs (Roberts, 49 Cardin, and Borchert 2014), gene promoters (Faulkner et al. n.d.; Huda et al. 2011), and splice sites (Lev- 50 Maor et al. 2003; Sela et al. 2007).

51 We recently proposed that TEs may contribute pre-formed functional domains to lncRNAs. We 52 termed these “RIDLs” – Repeat Insertion Domains of Long noncoding RNAs (Rory Johnson and Guigó 53 2014). As RNA, TEs interact with a rich variety of proteins, meaning that in the context of lncRNA they 54 could plausibly act as protein-docking sites (Blackwell et al. n.d.). Diverse evidence also points to repetitive 55 sequences forming intermolecular Watson-Crick RNA:RNA and RNA:DNA hybrids (Gong and Maquat 56 2011; Holdt et al. 2013; Rory Johnson and Guigó 2014). However, it is likely that bona fide RIDLs represent 57 a small minority of the many exonic TEs, with the remainder being phenotypically-neutral “passengers”.

58 A small but growing number of RIDLs have been described, reviewed in (Rory Johnson and Guigó 59 2014). These are found in lncRNAs with clearly demonstrated functions, including the X-chromosome 60 silencing transcript XIST (Elisaphenko et al. 2008), the oncogene ANRIL (Holdt et al. 2013) and the 61 regulatory antisense UchlAS (Carrieri et al. 2012). In each case, domains of repetitive origin are necessary 62 for a defined function: the structured A-repeat of XIST, of retroviral origin, recruits the PRC2 silencing 63 complex (Elisaphenko et al. 2008); Watson-Crick hybridisation between RNA and DNA Alu elements 64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense mRNA (Carrieri et al. 2012). In parallel, a number of transcriptome-wide maps of lncRNA- 66 linked TEs have been presented, showing how TEs have contributed extensively to lncRNA gene evolution 67 (Kapusta et al. 2013; D. Kelley and Rinn 2012)(Schmitt et al. 2016)(Hezroni et al. 2015). However, there 68 has been no attempt so far to create passenger-free maps of RIDLs with evidence for functionality in the 69 context of mature lncRNA molecules.

70 Subcellular localisation, and the domains controlling it, exert a powerful influence over lncRNA 71 functions (reviewed in (L.-L. Chen 2016)). For example, transcriptional regulatory lncRNAs must be 72 located in the nucleus and chromatin, while those regulating microRNAs or translation must be present in 73 the cytoplasm (K. Zhang et al. 2014). This is reflected in diverse patterns of lncRNA localization (Cabili et 74 al. 2015; Mas-Ponte et al. 2017b). If lessons learned from mRNA are also valid for lncRNAs, then short 75 sequence motifs binding to RNA binding proteins (RBPs) will be an important mechanism (Martin and 76 Ephrussi 2009). This was recently demonstrated for the BORG lncRNA, where a pentameric motif was

3

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

77 shown to mediate nuclear retention (B. Zhang et al. 2014). Most interestingly, another study implicated an 78 inverted pair of Alu elements in nuclear retention of lincRNA-P21 (Chillón and Pyle 2016). This raises the 79 possibility that RIDLs may function as regulators of lncRNA localisation.

80 In the present study, we create a human transcriptome-wide catalogue of putative RIDLs. We show 81 that lncRNAs carrying these RIDLs are enriched for functional genes. Finally, we provide in silico and 82 experimental evidence that certain RIDL types, derived from ancient transposable elements, confer a potent 83 nuclear localisation on their host transcripts.

84

85

4

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

86 Materials and Methods

87 All operations were carried out on human genome version GRCh38/hg38, unless stated otherwise.

88

89 Exonic TE curation

90 RepeatMasker annotations were downloaded from the UCSC Genome Browser (version hg38) on 91 December 31st 2014, and GENCODE version 21 lncRNA GTF annotations were downloaded from 92 www.gencodegenes.org (Harrow et al. 2012). We developed a pipeline, called Transposon.Profile, to 93 annotate exonic and intronic TEs of a given gene annotation. Transposon.Profile is largely based on 94 BEDTools’ intersect and merge functionalities(Quinlan and Hall 2010). Transposon.Profile merges the 95 exons of all transcripts belonging to the given gene annotation, henceforth referred to as “exons”. The set 96 of introns was curated by subtracting the merged exonic sequences from the full gene spans, and only 97 retaining those introns that belonged to a single gene. These intronic regions were assigned the strand of 98 the host gene.

99 The RepeatMasker annotation file was then compared to these exons, and intersections are recorded 100 and then classified into one of 6 categories: TSS (transcription start site), overlapping the first exonic 101 nucleotide of the first exon; splice acceptor, overlapping exon 5’ end; splice donor, overlapping exon 3’ 102 end; internal, residing within an exon and not overlapping any intronic sequence; encompassing, where an 103 entire exon lies within the TE; TTS (transcription termination site), overlapping the last nucleotide of the 104 last exon. In every case, the TEs are separated by strand relative to the host gene: + where both gene and 105 TE are annotated on the same strand, otherwise -. The result is the “Exonic TE Annotation” (Supplementary 106 File S1).

107

108 RIDL identification

109 Using this Exonic TE Annnotation, we identified the subset of those individual TEs with evidence 110 for functionality. For certain analysis, an equivalent Intronic TE Annotation was also employed, being the 111 Transposon.Profile output for the equivalent intron annotation described above. Three different types of 112 evidence were used: enrichment, strand bias and evolutionary conservation.

113 In enrichment analysis, the exon/intron ratio of the fraction of nucleotide coverage by each repeat 114 type was calculated. Any repeat type that has >2-fold exon/intron ratio was considered as a candidate. All 115 exonic TE instances belonging to such TE types are defined as RIDLs.

116 In strand bias analysis, a subset of Exonic TE Annotation was used, being the set of non-splice 117 junction crossing TE instances (“noSJ”). This additional filter was employed to guard against false positive 118 enrichments for TEs known to provide splice sites (Lev-Maor et al. 2003; Sela et al. 2007). For all TE 119 instances, the “relative strand” was calculated: positive, if the annotated TE strand matches that of the host 120 transcript; negative, if not. Then for every TE type, the ratio of relative strand sense/antisense was 121 calculated. Statistical significance was calculated empirically: entire gene structures were randomly re- 122 positioned in the genome using bedtools shuffle, and the intersection with the entire RepeatMasker 123 annotation was re-calculated. For each iteration, sense/antisense ratios were calculated for all TE types. A 124 TE type was considered to have significant strand bias, if its true ratio exceeded (positively or negatively)

5

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

125 all of 1000 simulations. All exonic instances of these TE types that also have the appropriate sense or 126 antisense strand orientation are defined as RIDLs.

127 In evolutionary analysis, four different annotations of evolutionarily-conserved regions were 128 treated similarly, using unfiltered Exonic TE Annotations. Primate, Placental Mammal and Vertebrate 129 PhastCons elements based on 46-way alignments were downloaded as BED files from UCSC Genome 130 Browser(Siepel et al. 2005)s, while the ECS conserved regions from obtained from Supplementary Data of 131 Smith et al (Smith et al. 2013). For each TE type we calculated the exonic/intronic conservation ratio. To 132 do this we used IntersectBED (Quinlan and Hall 2010) to overlap exonic locations with TEs, and calculate 133 the total number of nucleotides overlapping. We performed a similar operation for intronic regions. Finally, 134 we calculated the following ratio:

135 rei=(nt of overlap between conserved elements and exons)/(nt of overlap between conserved elements and 136 introns)

137 To estimate the background, the exonic and intronic BED files were positionally randomized 1000 times

138 using bedtools shuffle, each time recalculating rei. We considered to be significantly conserved those TE

139 types where the true rei was greater or less than every one of 1000 randomised rei values. All exonic instances 140 of these TE types that also intersect the appropriate evolutionary conserved element are defined as RIDLs.

141 All RIDL predictions were then merged using mergeBED and any instances with length <10 nt 142 were discarded. The outcome, a BED format file with coordinates for hg38, is found in Supplementary File 143 S2.

144

145 RIDL orthology analysis

146 In order to assess evolutionary history of RIDLs, we used chained alignments of human to chimp 147 (hg19ToPanTro4), macaque (hg19ToRheMac3), mouse (hg19ToMm10), rat (hg19ToRn5), cow 148 (hg19ToBosTau7). Due to availability of chain files, RIDL coordinates were first converted from hg38 to 149 hg19. Orthology was defined by LiftOver utility used at default settings.

150

151 Comparing RIDL-carrying lncRNAs versus other lncRNAs

152 In order to test for functional enrichment amongst lncRNAs hosting RIDLs, we tested for statistical 153 enrichment of the following traits in RIDL- carrying lncRNAs compared to other lncRNAs (see below) by 154 Fisher's exact test: 155 A) Functionally-characterised lncRNAs: lncRNAs from GENCODE v21 that are present in lncRNAdb 156 (Quek et al. 2015). 157 B) Disease-associated genes: lncRNAs from GENCODE v21 that are present in at least in one of the 158 following databases or public sets: LncRNADisease (G. Chen et al. 2013), Lnc2Cancer (Ning et al. 159 2016), Cancer LncRNA Census (CLC) (Carlevaro-fita et al. bioRxiv doi: 10.1101/152769) 160 C) GWAS SNPs: We collected SNPs from the NHGRI-EBI Catalog of published genome-wide 161 association studies (Hindorff et al. 2009; Welter et al. 2014) (https://www.ebi.ac.uk/gwas/home). We 162 intersected its coordinates with lncRNA exons and promoters (promoters defined as 1000nt upstream 163 and 500nt downstream of GENCODE-annotated TSS).

6

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

164 For defining a comparable set of “other lncRNAs” we sampled from the rest of GENCODE v21 165 lncRNAs a set of lncRNAs matching RIDL-carrying lncRNAs exonic length distribution. We performed 166 the subsampling using matchDistribution script: https://github.com/julienlag/matchDistribution.

167

168 Subcellular localisation analysis

169 Processed RNAseq data from human cell fractions were obtained from ENCODE in the form of RPKM 170 (reads per kilobase per million mapped reads) quantified against the GENCODE v19 annotation(Djebali et 171 al. 2012b; Mas-Ponte et al. 2017a). Only transcripts common to both the v21 and v19 annotations were 172 considered. For the following analysis only one transcript per gene was considered, defined as the one with 173 largest number of exons. Cytoplasmic/nuclear ratio expression for each transcript was defined as 174 (cytoplasm polyA+ RPKM)/(nuclear polyA+ RPKM), and only transcripts having non-zero values in both 175 were considered. For each RIDL type and cell type in turn, the cytoplasmic/nuclear ratio distribution of 176 RIDL-containing to non-RIDL-containing lncRNAs was compared using Wilcox test. Only RIDLs having 177 at least 3 host expressed transcripts in at least one cell type were tested. Resulting p-values were globally 178 adjusted to False Discovery Rate using the Benjamini-Hochberg method (Benjamini and Hochberg 1995).

179

180 Cell lines and reagents

181 Human cervical cancer cell line HeLa and human lung cancer cell line A549 were cultured in Dulbecco’s 182 Modified Eagle’s medium (Sigma-Aldrich, # D5671) supplemented with 10% FBS and 1% penicillin

183 streptomycin at 37ºC and 5 % CO2. Anti-GAPDH antibody (Sigma-Aldrich, # G9545) and anti-histone H3 184 antibody (Abcam, # ab24834) were used for Western blot analysis.

185

186 Gene synthesis and cloning of lncRNAs

187 The three lncRNA sequences (RP11-5407, Linc00173, RP4-806M20.4) containing wild-type RIDLs, 188 or corresponding mutated versions where RIDL sequence has been randomised (“Mutant”), were 189 synthesized commercially (BioCat GmbH). The sequences were cloned into pcDNA 3.1 (+) vector within

190 the NheI and XhoI restriction enzyme sites. The clones were checked by restriction digestion and Sanger 191 sequencing. The sequence of the wild type and mutant clones are provided in Supplementary File S3.

192

193 Sub-cellular fractionation

194 Each pair of the wild type and the corresponding mutant lncRNA clones were transfected 195 independently in separate wells of a 6 well plate. Transfections were carried out with 2 g of total plasmid 196 DNA in each well using Lipofectamine 2000. 48 h post-transfection, cells from each well were harvested, 197 pooled and re-seeded into a 10 cm dish and allowed to grow till 100% confluence. The nuclear and 198 cytoplasmic fractionation was carried out as described previously (Suzuki et al. 2010) with minor 199 modifications. In brief, cells from 10 cm dishes were harvested by scraping and washed with 1X ice cold 200 PBS. For fractionation, cell pellet was re-suspended in 900 μl of ice-cold 0.1% NP40 in PBS and triturated 201 7 times using a p1000 micropipette. 300 μl of the cell lysate was saved as the whole cell lysate. The

7

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

202 remaining 600 μl of the cell lysate was centrifuged for 30 sec on a table top centrifuge and the supernatant 203 was collected as “cytoplasmic fraction”. 300 μl from the cytoplasmic supernatant was kept for RNA 204 isolation and the remaining 300 μl was saved for protein analysis by western blot. The pellet containing the 205 intact nuclei was washed with 1 ml of 0.1% NP40 in PBS. The nuclear pellet was re-suspended in 200 μl 206 1X PBS and subjected to a quick sonication of 3 pulses with 2 sec ON-2 sec OFF to lyse the nuclei and 207 prepare the “nuclear fraction”. 100 μl of nuclear fraction was saved for RNA isolation and the remaining 208 100 μl was kept for western blot.

209

210 RNA isolation and real time PCR

211 The RNA from each nuclear and cytoplasmic fraction was isolated using Quick-RNA MiniPrep kit 212 (ZYMO Research, # R1055). The RNAs were subjected to on-column DNAse I treatment and clean up 213 using the manufacturer’s protocol. For A549 samples, additional units of DNase were employed, due to 214 residual signal in –RT samples. The RNA from each fraction was converted to cDNA using GoScript 215 reverse transcriptase (Promega, # A5001) and random hexamer primers. The expression of each of the 216 individual transcripts was quantified by qRT-PCR (Applied Biosystems® 7500 Real-Time) using indicated 217 primers (Supplementary File S4) and GoTaq qPCR master mix (Promega, # A6001). Human GAPDH 218 mRNA and MALAT1 lncRNA were used as cytoplasmic and nuclear markers, respectively.

219

220 Western Blotting

221 The protein concentration of each of the fractions was determined, and equal amounts of protein (50 222 μg) from whole cell lysate, cytoplasmic fraction, and nuclear fraction were resolved on 12 % Tris-glycine 223 SDS-polyacrylamide gels and transferred onto polyvinylidene fluoride (PVDF) membranes (VWR, # 224 1060029). Membranes were blocked with 5% skimmed milk and incubated overnight at 4ºC with anti- 225 GAPDH antibody as a cytoplasmic marker and anti p-histone H3 antibody as nuclear marker. Membranes 226 were washed with PBS-T (1X PBS with 0.1 % Tween 20) followed by incubation with HRP-conjugated 227 anti-rabbit or anti-mouse secondary antibodies respectively. The bands were detected using SuperSignal™ 228 West Pico chemiluminescent substrate (Thermo Scientific, # 34077).

8

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

229 Results

230 The objective of this study was to create a map of repeat insertion domains of long noncoding 231 RNAs (RIDLs) and link them to lncRNA functions. We previously hypothesised that RIDLs could function 232 by mediating interactions between host lncRNA and DNA, RNA or protein molecules (Rory Johnson and 233 Guigó 2014) (Figure 1A). However, any attempt to map RIDLs is confronted with the problem that they 234 will likely represent a small minority amongst many non-functional “passenger” transposable elements 235 (TEs) residing in lncRNA exons (Figure 1B). Therefore, it is necessary to identify RIDLs by some signature 236 of selection in lncRNA exons. In this study we use three such signatures: exonic enrichment, strand bias, 237 and evolutionary conservation (Figure 1B).

238

239 A map of exonic TEs in Gencode v21 lncRNAs

240 Our first aim was to create a comprehensive map of TEs within the exons of GENCODE v21 human 241 lncRNAs (Figure 2A). Altogether 5,520,018 distinct TE insertions were intersected against 48684 exons 242 from 26414 transcripts of 15877 GENCODE v21 lncRNA genes, resulting in 46474 exonic TE insertions 243 in lncRNA (Figure 1B). Altogether 13121 lncRNA genes (82.6%) carry at least one exonic TE fragment in 244 their mature transcript.

245 We defined TEs residing in lncRNA introns to represent neutral background, since they do not 246 contribute to functionality of the mature processed transcript, but nevertheless should model any large-scale 247 genomic biases in TE abundance. Thus we also created a control TE intersection dataset with 31,004 248 Gencode lncRNA introns, resulting in 562,640 intron-overlapping TE fragments (Figure 2A). This analysis 249 is likely to be conservative, since some intronic sequences could contribute to unidentified lncRNA exons.

250 Comparing intronic and exonic TE data, we see that lncRNA exons are slightly depleted for TE 251 insertions: 29.2% of exonic nucleotides are of TE origin, compared to 43.4% of intronic nucleotides (Figure 252 2B), similar to previous studies (Kapusta et al. 2013). This may reflect selection against disruption of 253 lncRNA transcripts by TEs, and hence is evidence for general lncRNA functionality.

254

255 Contribution of TEs to lncRNA gene structures

256 TEs have contributed widely to both coding and noncoding gene structures by the insertion of 257 elements such as promoters, splice sites and termination sites (Sela et al. 2007). We next classified inserted 258 TEs by their contribution to lncRNA gene structure (Figure 2C,D). It should be borne in mind that this 259 analysis is dependent on the accuracy of underlying GENCODE annotations, which are often incomplete 260 at 5’ and 3’ ends (J. Lagarde, B. Uszczyńska et al, Manuscript In Press). Altogether 4993 (18.9%) transcripts 261 are promoted from within a TE, most often those of Alu and ERVL-MaLR classes (Figure 2E). 7497 262 (28.4%) lncRNA transcripts are terminated by a TE, most commonly by Alu, ERVL-MaLR and LINE1 263 classes. 8494 lncRNA splice sites (32.2%) are of TE origin, and 2681 entire exons are fully contributed by 264 TEs (10.1%) (Figure 2E). These observations support several known contributions of TEs to gene structural 265 features (Sela et al. 2007). Nevertheless, the most frequent case is represented by 22031 TEs that lie 266 completely within an exon and do not overlap any splice junction (“inside”).

267

9

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

268 Evidence for selection on certain exonic TE types

269 This exonic TE map represents the starting point for the identification of RIDLs, defined as the 270 subset of TEs with evidence for functionality in the context of mature lncRNAs. In this and subsequent 271 analyses, TEs are grouped by type as defined by RepeatMasker. We utilise three distinct sources of evidence 272 for selection on TEs: exonic enrichment, strand bias and evolutionary conservation (Figure 1B).

273 We first asked whether particular TE types are more frequently found within lncRNA exons, 274 compared to the background represented by intronic sequence (D. Kelley and Rinn 2012). Thus we 275 calculated the ratio of exonic / intronic sequence coverage by TEs (Figure 3A). We found signals of 276 enrichment, >2-fold, for numerous repeat types, including classes (HERVE-int, 277 HERVK9-int, HERV3-int, LTR12D) in addition to others such as ALR/Alpha, BSR/Beta and REP522. 278 Surprisingly, quite a number of simple repeats are also enriched in lncRNA, including GC-rich repeats. A 279 weaker but more generalized trend of enrichment is also observed for various MLT repeat classes. These 280 findings are consistent with previous analyses by Kelley and Rinn (D. Kelley and Rinn 2012).

281 Interestingly, presently-active LINE1 elements are depleted in lncRNA exons (Figure 3A). It is 282 possible that this reflects selection against disruption to normal gene expression, where numerous weak 283 polyadenylation signals lead to premature transcription termination when the LINE1 element lies on the 284 same strand as the overlapping gene (Perepelitsa-Belancio and Deininger 2003). Another explanation may 285 be low transcriptional processivity exhibited by the LINE1 ORF2 in the sense strand, leading to reduced 286 gene expression and consequent negative selection (Perepelitsa-Belancio and Deininger 2003).

287 In a similar way, we investigated whether exonic TEs display a strand preference relative to host 288 lncRNA (Rory Johnson and Guigó 2014). We took care to avoid a major source of bias: as shown above, 289 many TSS and splice sites of lncRNA are contributed by TE, and such cases would lead to clear strand bias 290 in the resulting exonic TE sequences. To avoid this, we removed any TE insertions that overlap an exon- 291 intron boundary. We calculated the relative strand overlap of all remaining TEs in lncRNA exons. Statistical 292 significance was assessed by randomisation (see Materials and Methods) (Figure 3B). We carried out the 293 same operation for lncRNA introns (Supplementary Figure S1). In lncRNA exons, a number of TE types 294 are enriched in either sense or antisense. They are dominated by LINE1 family members, possibly for the 295 reasons mentioned below. Other significantly enriched TE types include LTR78, MLT1B, and MIRc 296 (Figure 3B).

297 In introns, we observed a widespread depletion of same-strand TE insertions (Supplementary 298 Figure S1). This effect is modest yet statistically significant, and is especially true for LINE1 elements, 299 possibly for reasons mentioned above. In contrast, almost no TE types were significantly enriched on the 300 sense strand in introns.

301 To test for TE type-specific conservation, we turned to two sets of predictions of evolutionarily- 302 conserved elements. First, the widely-used PhastCons conserved elements, based on phylogenetic hidden 303 Markov model (Siepel et al. 2005) calculated on primate, placental mammal and vertebrate alignments; 304 second, the more recently published “Evoluationarily Conserved Structures” (ECS) set of conserved RNA 305 structures (Smith et al. 2013). Importantly, the PhastCons regions are defined based on sequence 306 conservation alone, while the ECS are defined by phylogenetic analysis of RNA structure evolution.

10

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

307 To look for evidence of evolutionary conservation on exonic TEs, we calculated the frequency of 308 their overlap with evolutionarily conserved genomic elements compared to intronic TEs of the same type. 309 To assess statistical significance, we again used positional randomisation (see inset in Figure 3C). This 310 pipeline was applied independently to the PhastCons (Figure C3, Supplementary Figure S1B,C,D) and ECS 311 (Supplementary Figure S1) data. The majority of TEs do not exhibit significantly different conservation 312 when found in exons or introns (Grey points). Depending on conservation data used, the method detects 313 significant conservation for 17 to 40 TE types (Figure 3C). We also found a small number of TEs with high 314 evolutionary rates specifically in lncRNA exons, namely the young AluSz, AluSx and AluJb (PhastCons) 315 and L1M4c and AluSx1 (ECS) (coloured red in Figure 3C and Supplementary Figure S1).

316

317 An annotation of RIDLs

318 We next combined all TE classes with evidence of functionality into a draft annotation of RIDLs, 319 enriched for functional TE elements. This annotation combined altogether 99 TE types with at least one 320 types of selection evidence. For each TE / evidence pair, only those TE instances satisfying that evidence 321 were included. In other words, if MIRb elements were found to be associated with vertebrate PhastCons 322 elements, then only those instances of exonic MIRb elements overlapping such a PhastCons element would 323 be included in the RIDL annotation, and all other exonic MIRs would be excluded. An example is CCAT1 324 lncRNA oncogene: it carries three exonic MIR elements, of which one is defined as a RIDL based on its 325 overlapping a PhastCons element (Figure 4A).

326 After removing redundancy, the final RIDL annotation consists of 5374 elements, located within 327 3566 distinct lncRNA genes (Figure 3D). The most predominant TE families are MIR and L2 repeats, 328 representing 2329 and 1143 RIDLs (Figure 4B). The majority of both are defined based on evolutionary 329 evidence (Figure 4B). In contrast, RIDLs composed by ERV1, low complexity, satellite and simple repeats 330 families are more frequently identified due to exonic enrichment (Figure 4B). The entire RIDL annotation 331 is available in Supplementary File S2.

332 We also examined the evolutionary history of RIDLs. Using 6-mammal alignments, their depth of 333 evolutionary conservation could be inferred (Supplementary Figure S2). 12% of instances appear to be 334 Great Ape-specific, with no orthologous sequence beyond chimpanzee. 47% are primate-specific, while the 335 remaining 40% are identified in at least one non-primate mammal. The wide timeframe for appearance of 336 RIDLs is consistent with the wide diversity of TE types, from ancient MIR elements to presently-active 337 LINE1 (Jurka, Zietkiewicz, and Labuda 1995; Konkel, Walker, and Batzer 2010; Smith et al. 2013).

338 Instances of genomic TE insertions typically represent a fragment of the full consensus sequence. 339 We hypothesised that particular regions of the TE consensus will be important for RIDL activity, 340 introducing selection for these regions that would distinguish them from unselected, intronic copies. To test 341 this, we compared insertion profiles of RIDLs to intronic instances, for each TE type, and used the 342 correlation coefficient (CC) as a quantitative measure of similarity (Figure 4C and Supplementary File S5). 343 For 17 cases, a CC<0.9 points to possible selective forces acting on RIDL insertions. An example is the 344 macrosatellite SST1 repeat where RIDL copies in 41 lncRNAs show a strong preference inclusion of the 345 3’ end, in contrast to the general 5’ preference observed in introns (Figure 4C). This suggests a possible 346 functional relevance for the 1000-1500 nt region of the SST1 consensus.

11

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

347

348 RIDL-carrying lncRNAs are enriched for functions and disease roles

349 We next looked for evidence to support the RIDL annotation by investigating the properties of 350 RIDL-carrying lncRNAs. We first asked whether RIDLs are randomly distributed amongst lncRNAs, or 351 else clustered in a smaller number of genes. Figure 4D shows that the latter is the case, with a significant 352 deviation of RIDLs from a random distribution. These lncRNAs carry a mean of 1.15 RIDLs / kb of exonic 353 sequence (median: 0.84 RIDLs/kb) (Supplementary Figure S3).

354 Are RIDL-lncRNAs more likely to be functional? To address this, we compared lncRNA genes 355 carrying one or more RIDLs, to a length-matched set of non-RIDL lncRNAs (Supplementary Figure S4). 356 RIDL-lncRNAs are (1) over-represented in the reference database for functional lncRNAs, lncRNAdb 357 (Quek et al. 2015) (Figure 4E) and (2) are significantly enriched for lncRNAs associated with cancer and 358 other diseases (Figure 4F). Finally, we found that RIDL-lncRNAs are enriched for trait/disease-associated 359 SNPs in both their promoter and exonic regions (Figure 4G).

360 In addition to CCAT1 (Figure 4A) (Nissan et al. 2012) are a number of deeply-studied RIDL- 361 containing genes. XIST, the X-chromosome silencing RNA contains seven internal RIDL elements. As we 362 pointed out previously (Rory Johnson and Guigó 2014), these include an intriguing array of four similar 363 pairs of MIRc / L2b repeats. The prostate cancer-associated UCA1 gene has a transcript isoform promoted 364 from an LTR7c, as well as an additional internal RIDL, thereby making a potential link between cancer 365 gene regulation and RIDLs. The TUG1 gene, involved in neuronal differentiation, contains highly 366 evolutionarily-conserved RIDLs including Charlie15k and MLT1K elements (Rory Johnson and Guigó 367 2014). Other RIDL-containing lncRNAs include MEG3, MEG9, SNHG5, ANRIL, NEAT1, CARMEN1 and 368 SOX2OT. LINC01206, located adjacent to SOX2OT, also contains numerous RIDLs. A full list of RIDL- 369 carrying lncRNAs can be found in Supplementary File S6.

370

371 Correlation between RIDLs and subcellular localisation of host transcript

372 The location of a lncRNA within the cell is of key importance to its molecular function (Cabili et 373 al. 2015; Derrien et al. 2012)(Mas-Ponte et al. 2017a). We next asked whether RIDLs might define lncRNA 374 localisation (Hacisuleyman et al. 2016; B. Zhang et al. 2014)(Chillón and Pyle 2016) (Figure 5A). 375 Beginning with normalised subcellular RNAseq data from the LncATLAS database (Mas-Ponte et al. 376 2017a), based on 10 ENCODE cell lines (Djebali et al. 2012a), we comprehensively tested each of the 99 377 RIDL types for association with localisation of their host transcript.

378 After correcting for multiple hypothesis testing, this approach identified four distinct TEs that 379 significantly correlate with nuclear localisation: L1PA16, L2b, MIRb and MIRc (Figure 5B). For example, 380 44 lncRNAs carrying L2b RIDLs have 6.9-fold higher relative nuclear/cytoplasmic ratio in IMR90 cells, 381 and this tendency is observed in six difference cell types (Figure 5B,C). Furthermore, the degree of nuclear 382 localisation increases in lncRNAs as a function of the number of RIDLs they carry (Figure 5D). We also 383 found a significant relationship between GC-rich elements and cytoplasmic enrichment across three 384 independent cell samples. The GC-RIDL containing lncRNAs have between 2 and 2.3-fold higher relative 385 expression in cytoplasm in these cells (Supplementary Figure S5). These data suggest that certain types of 386 RIDL may regulate the subcellular localisation of lncRNAs.

12

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

387

388 RIDLs play a causative role in lncRNA nuclear localisation

389 To more directly test whether RIDLs play a causative role in nuclear localisation, we selected three 390 lncRNAs, based on: (i) presence of L2b, MIRb and MIRc RIDLs; (ii) moderate expression; (iii) nuclear 391 localisation, as inferred from RNAseq (Figure 6A,B and Supplementary Figure S6). Nuclear localisation 392 of these candidates was validated using qRTPCR (Figure 6C and Supplementary Figure S7).

393 We designed an assay to compare the localisation of transfected lncRNAs carrying wild-type 394 RIDLs, and mutated versions where RIDL sequence has been randomised (“Mutant”) (Figure 6D, full 395 sequences available in Supplementary File 3). Wild-type and Mutant lncRNAs were overexpressed in 396 cultured cells and their localisation evaluated by fractionation. Overexpressed copies were many fold 397 greater than endogenous lncRNAs (Supplementary Figure S8). Fractionation quality was verified by 398 Western blotting (Figure 6E) and qRT-PCR (Figure 6F), and we found negligible levels of DNA 399 contamination in cDNA samples (Supplementary Figure S9).

400 Comparison of Wild-Type and Mutant lncRNAs demonstrated a potent and consistent impact of 401 RIDLs on nuclear-cytoplasmic localisation: for all three candidates, loss of RIDL sequence resulted in 402 relocalisation of the host transcript from nucleus to cytoplasm (Figure 6F). Overexpressed Wild-Type 403 lncRNA copies had slightly lower levels of nuclear enrichment compared to endogenous transcripts, and 404 the two types were distinguished by primers matching a plasmid-encoded region (Supplementary File S4). 405 We repeated these experiments in another cell line, A549, and observed similar effects (Supplementary 406 Figure S7). These results demonstrate a role for L2b, MIRb and MIRc elements in nuclear localisation of 407 lncRNAs.

408

13

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

409 Discussion

410 Recent years have seen a rapid increase in the number of annotated lncRNAs. However our 411 understanding of their molecular functions, and how such functions are encoded in primary RNA 412 sequences, lag far behind. Two recent conceptual developments offer hope for resolving the sequence- 413 function code of lncRNAs: First, the idea that the subcellular localisation of lncRNAs is a readily 414 quantifiable characteristic that holds important clues to function; Second, that the abundant transposable 415 element (TE) content of lncRNAs may contribute to functionality.

416 In this study we have linked these two ideas, by showing that TEs can drive the nuclear localisation 417 of lncRNAs. A global correlation analysis of TEs and RNA localisation data revealed a handful of TEs, 418 most notably LINE2b, MIRb and MIRc, strongly correlate with the degree of nuclear retention of their host 419 transcripts. This correlation is observed in multiple cell types, and scales with the number of TEs present. 420 A causative link was established experimentally, confirming that the indicated TEs have a two- to four-fold 421 effect on transcript localisation.

422 These data are consistent with our previously-published hypothesis that exonic TE elements may 423 function as lncRNA domains. In this “RIDL hypothesis”, transposable elements are co-opted by natural 424 selection to form “Repeat Insertion Domains of LncRNA”, that is, fragments of sequence that confer 425 adaptive advantage through some change in the activity of their host lncRNA. We proposed that RIDLs 426 may serve as binding sites for proteins or other nucleic acids, and indeed a growing body of evidence 427 supports this (Rory Johnson and Guigó 2014). In the context of localisation, RIDLs could mediate nuclear 428 retention due to their described interactions with nuclear proteins (D. R. Kelley et al. 2014), although it is 429 also conceivable that the effect occurs through hybridisation to complementary repeats in genomic DNA.

430 The localisation RIDLs discovered – MIR and LINE2 - are both ancient and contemporaneous, 431 being active prior to the mammalian radiation (Cordaux and Batzer n.d.). Both have previously been 432 associated with acquired roles in the context of genomic DNA (Jjingo et al. 2014)(R. Johnson et al. 2006). 433 Although the evolutionary history of lncRNA remains an active area of research and accurate dating of 434 lncRNA gene birth is challenging, it appears that the majority of human lncRNAs were born after the 435 mammalian radiation (Hezroni et al. 2015)(Hezroni et al. 2017)(Necsulea et al. 2014)(Washietl, Kellis, and 436 Garber 2014). This would mean that MIR and LINE2 RIDLs were pre-existing sequences that were exapted 437 by newly-born lncRNAs. This corresponds to the “latent” exaptation model proposed by Feschotte and 438 colleagues(Chuong, Elde, and Feschotte 2017). Given that nuclear retention is at odds with the primary 439 needs of TE RNAs to be exported to the cytoplasm, we propose that the observed nuclear localisation 440 activity is a more modern feature of L2b/MIR RIDLs, which is unrelated to their original roles. However it 441 is also possible that for other cases the reverse could be true – a pre-existing lncRNA exapts a newly- 442 inserted TE. Our use of evolutionary conservation to identify RIDLs likely biases our analysis against the 443 discovery of modern TE families, such as Alus.

444 This study marks a step in the ongoing efforts to map the domains of lncRNAs. Previous studies 445 have utilised a variety of approaches, from integrating experimental protein-binding data (Van Nostrand et 446 al. 2016)(Hu et al. 2017)(Li et al. 2014), to evolutionarily conserved segments (Smith et al. 2013)(Seemann 447 et al. 2017). Previous maps of TEs have highlighted their profound roles in lncRNA gene evolution 448 (Kapusta et al. 2013)(D. Kelley and Rinn n.d.)(Hezroni et al. 2015). However the present RIDL annotation 449 stands apart in attempting to identify the subset of TEs with evidence for selection. We hope that the present

14

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

450 RIDL map will prove a resource for future studies to better understand functional domains of lncRNAs. 451 Although various evidence suggests that the RIDL annotation is a useful enrichment group of functional 452 TE elements, it most likely contains substantial false positive and false negative rates that will be improved 453 in future.

454 This study, and the concept of RIDLs in general, may help to address two longstanding questions 455 in the lncRNA field.

456 How does lncRNA primary sequence evolve new functions so rapidly? The insertion of TE-derived “pre- 457 fabricated” sequence domains, with protein-, RNA- or DNA-binding capacity, confers rapid acquisition of 458 new molecular activities. There is a variety of evidence that TE sequences contain abundant, latent activity 459 of this type (reviewed in (Rory Johnson and Guigó 2014)).

460 Why do lncRNAs have nuclear-restricted localisation? Although they are readily detected in the cytoplasm, 461 it is nevertheless true that lncRNAs tend to have higher nuclear/cytoplasmic ratios compared to mRNAs 462 (Clark et al. 2012)(Ulitsky and Bartel 2013)(Derrien et al. 2012)(Mas-Ponte et al. 2017a). This is true across 463 various human and mouse cell types. Although this may partially be explained by decreased stability 464 (Mukherjee et al. 2017), it is likely that RNA sequence motifs also contribute to nuclear localisation 465 (Chillón and Pyle 2016)(B. Zhang et al. 2014). Here we show that this is the case, and that the enrichment 466 of certain RIDL types in lncRNA mature sequences is likely to be a major contributor to lncRNA nuclear 467 retention.

468 In summary therefore, we have made available a first annotation of selected RIDLs in lncRNAs, 469 and described a new paradigm for TE-derived fragments as drivers of nuclear localisation in lncRNAs.

470

15

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

471 Figure Legends

472

473 Figure 1: Repeat insertion domains of lncRNAs (RIDLs).

474 (A) In the “Repeat Insertion Domain of LncRNAs” (RIDL) model, exonically-inserted fragments of 475 transposable elements contain pre-formed protein-binding (red), RNA-binding (green) or DNA- 476 binding (blue) activities, that contribute to functionality of the host lncRNA (black). RIDLs are 477 likely to be a small minority of exonic TEs, coexisting with large numbers of non-functional 478 “passengers” (grey). 479 (B) RIDLs (dark orange arrows) will be distinguished from passenger TEs by signals of selection, 480 including: (1) simple enrichment in exons; (2) a preference for residing on a particular strand 481 relative to the host transcript; (3) evolutionary conservation. Selection might be identified by 482 comparing exonic TEs to a neutral population, for example those residing in lncRNA introns (light 483 coloured arrows).

484

485 Figure 2: An exonic transposable element annotation with the GENCODE v21 lncRNA catalogue.

486 (A) Statistics for the exonic TE annotation process using GENCODE v21 lncRNAs. 487 (B) The fraction of nucleotides overlapped by TEs for lncRNA exons and introns, and the whole 488 genome. 489 (C) Overview of the annotation process. The exons of all transcripts within a lncRNA gene annotation 490 are merged. Merged exons are intersected with the RepeatMasker TE annotation. Intersecting TEs 491 are classified into one of six categories (bottom panel) according to the gene structure with which 492 they intersect, and the relative strand of the TE with respect to the gene. 493 (D) Summary of classification breakdown for exonic TE annotation. 494 (E) Classification of TE classes in exonic TE annotation. Numbers indicate instances of each type.

495

496 Figure 3: Evidence for selection on transposable elements in lncRNA exons.

497 (A) Figure shows, for every TE type, the enrichment of per nucleotide coverage in exons compared to 498 introns (y axis) and overall exonic nucleotide coverage (x axis). Enriched TE types (at a 2-fold 499 cutoff) are shown in green. 500 (B) As for (A), but this time the y axis records the ratio of nucleotide coverage in sense vs antisense 501 configuration. “Sense” here is defined as sense of TE annotation relative to the overlapping exon. 502 Similar results for lncRNA introns may be found in Supplementary Figure 1. Significantly-enriched 503 TE types are shown in green. Statistical significance was estimated by a randomisation procedure, 504 and significance is defined at an uncorrected empirical p-value < 0.001 (See Material and Methods). 505 (C) As for (A), but here the y axis records the ratio of per-nucleotide overlap by PhastCons mammalian- 506 conserved elements for exons vs introns. Similar results for three other measures of evolutionary 507 conservation may be found in Supplementary Figure 1. Significantly-enriched TE types are shown 508 in green. Statistical significance was estimated by a randomisation procedure, and significance is 509 defined at an uncorrected empirical p-value < 0.001 (See Material and Methods). An example of

16

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

510 significance estimation is shown in the inset: the distribution shows the exonic/intronic 511 conservation ratio for 1000 simulations. The green arrow shows the true value, in this case for 512 MLT1A0 type. 513 (D) Summary of TE types with evidence of exonic selection. Six distinct evidence types are shown in 514 rows, and TE types in columns. On the right are summary statistics for (i) the number of unique TE 515 types identified by each method, and (ii) the number of instances of exonic TEs from each type 516 with appropriate selection evidence. The latter are henceforth defined as “RIDLs”.

517

518 Figure 4: Annotated RIDLs and RIDL-lncRNAs

519 (A) Example of a RIDL-lncRNA gene: CCAT1. Of note is that although several exonic TE instances 520 are identified (grey), including three separate MIR elements, only one is defined a RIDL (orange) 521 due to overlap of a conserved element. 522 (B) Breakdown of RIDL instances by TE family and evidence sources. 523 (C) Insertion profile of SST1 RIDLs (blue) and intronic insertions (red). x axis shows the entire 524 consensus sequence of SST1. y axis indicates the frequency with which each nucleotide position 525 is present in the aggregate of all insertions. “CC”: Spearman correlation coefficient of the two 526 profiles. “RIDLs” / “Intronic TEs” indicate the numbers of individual insertions considered for 527 RIDLs / intronic insertions, respectively. 528 (D) Number of lncRNAs (y axis) carrying the indicated number of RIDL (x axis) given the true 529 distribution (black) and randomized distribution (red). The 95% confidence interval was computed 530 empirically, by randomly shuffling RIDLs across the entire lncRNA annotation. 531 (E) The number of RIDL-lncRNAs, and a length-matched set of non-RIDL lncRNAs, which are 532 present in the lncRNAdb database of functional lncRNAs (Quek et al. 2015). Numbers denote 533 gene counts. Significance was calculated using Fisher’s exact test. 534 (F) As for (E) for lncRNAs present in disease- and cancer-associated lncRNA databases (see Materials 535 and Methods). 536 (G) As for (E) for lncRNAs containing at least one trait/disease-associated SNP in their promoter 537 regions (darker colors) or exonic regions (lighter colors) (see Materials and Methods).

538 Figure 5: Correlation between RIDLs and host lncRNA nuclear-cytoplasmic localisation.

539 (A) We hypothesized that RIDLs may govern the subcellular localization of host lncRNAs. 540 (B) Results of an in silico screen of RIDL types (rows) against lncRNA nuclear/cytoplasmic 541 localisation in human cell lines (columns). For every RIDL type (rows) in every cell type 542 (columns), the localisation of RIDL-containing lncRNAs was compared to all other detected 543 lncRNAs. Significantly-different pairs are coloured (Benjamini-Hochberg corrected p-value < 544 0.01; Wilcoxon test). Colour scale indicates the nuclear-cytoplasmic ratio mean of RIDL-lncRNAs. 545 Numbers in cells indicate the number of considered RIDL-lncRNAs. 546 (C) The nuclear-cytoplasmic localization of lncRNAs carrying L2b RIDLs in IMR90 cells. Blue 547 indicates lncRNAs carrying ≥1 L2b RIDLs, grey indicates all other detected lncRNAs (“not”). 548 Dashed lines represent medians. Significance was calculated using Wilcoxon test. 549 (D) The nuclear-cytoplasmic ratio of lncRNAs broken down by the number of significant RIDLs that 550 they carry (L1PA16, L2b, MIRb, MIRc). Correlation coefficient (Rho) between number of RIDLs

17

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

551 and nuclear-cytoplasmic ratio of lncRNAs and the corresponding p-value (P) are indicated in the 552 plot. In each box, upper value indicates the number of lncRNAs, and lower value the median. 553 554 Figure 6: Disruption of RIDLs results in lncRNA relocalisation from nucleus to cytoplasm. 555 (A) Structures of candidate RIDL-lncRNAs. Orange indicates RIDL positions. For each RIDL, 556 numbers indicate the position within the TE consensus, and its orientation with respect to the 557 lncRNA is indicated by arrows (“>” for same strand, “<” for opposite strand). 558 (B) Experimental design 559 (C) Expression of the three lncRNA candidates as inferred from HeLa RNAseq (Djebali et al. 560 2012c). 561 (D) Nuclear/cytoplasmic localisation of endogenous candidate lncRNA copies in wild-type HeLa 562 cells, as measured by qRT-PCR. 563 (E) The purity of HeLa subcellular fractions was assessed by Western blotting against specific 564 markers. GAPDH / Histone H3 proteins are used as cytoplasmic / nuclear markers, respectively. 565 (F) Nuclear/cytoplasmic localisation of transfected candidate lncRNAs. GAPDH/MALAT1 are 566 used as cytoplasmic/nuclear controls, respectively. N indicates the number of biological 567 replicates, and error bars represent standard error of the mean. P-values for paired t-test (1 tail) 568 are shown.

569

18

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

570 Supplementary Figure Legends

571

572 Supplementary Figure S1: (A) Figure shows, for every TE type, the ratio of intronic nucleotide coverage 573 in sense vs antisense configuration. “Sense” here is defined as sense of TE annotation relative to the sense 574 of the overlapping intron. Significantly-enriched TE types are shown in green. Statistical significance was 575 estimated by a randomisation procedure, and significance is defined at an uncorrected empirical p-value < 576 0.001 (See Material and Methods). (B-D) As for (A), but y axis records the ratio of nucleotide overlap in 577 exons vs introns by PhastCons Primate-conserved elements, PhastCons placental-conserved elements, 578 PhastCons vertebrate-conserved elements and Evolutionarily conserved structure, respectively.

579 Supplementary Figure S2: Evolutionary analysis using 6-mammal alignments. Rows represent human 580 RIDLs. In each column (6 different species) they are coloured in red when they are conserved in the given 581 species and otherwise in light yellow.

582 Supplementary Figure S3: Density distribution of RIDL-carrying lncRNAs based on the number of 583 RIDLs per kb of exons that they contain. Dotted line indicates median.

584 Supplementary Figure S4: Creating an exon-length-matched sample of non-RIDL lncRNAs. (A) Exonic 585 length distribution of RIDL-carrying and non-RIDL GENCODE v21 lncRNAs (“others”). (B) Exonic 586 length distribution of RIDL-carrying lncRNAs and a length-matched sample of non-RIDL lncRNAs. The 587 latter were used for all comparisons at gene level.

588 Supplementary Figure S5: Cytoplasmic-nuclear localization in IMR90. Yellow represents lncRNAs 589 carrying one or more copies of GC-rich RIDL elements, grey represents all other detected lncRNAs. Dashed 590 lines indicate the median of each group. 591 Supplementary Figure S6: Barplot generated by LncATLAS webserver (Mas-Ponte et al. 2017b). Bars 592 indicate log2-transformed cytoplasmic/nuclear expression ratios (CN RCI) in 15 cell lines for three 593 candidate genes. Colours represent the total nuclear expression of each gene in each cell line. 594 Supplementary Figure S7: Experimental data for A549 cells. (A) Expression data for three candidate 595 lncRNAs calculated using RNAseq data (Djebali et al. 2012a). (B) Validation of nuclear/cytoplasmic 596 localisation of candidate genes in wild-type cells by qRT-PCR. (C) Validation of subcellular fractionation 597 by Western blot, using GAPDH/Histone H3 proteins as cytoplasmic/nuclear markers, respectively. (D) 598 Nuclear/cytoplasmic localisation of transfected candidate lncRNAs.

599 Supplementary Figure S8: Estimating overexpression of transfected lncRNA candidates. Data are shown 600 individually for each of four biological replicates. y axis represents the fold change of transfected gene level 601 compared to the endogeneous level estimated from wild-type cells.

602 Supplementary Figure 9: DNA contamination check in RNA Isolated from different subcellular fractions. 603 Total RNA was isolated from different subcellular fractions using RNA isolation kit as described in 604 Materials and Methods. Equal amount of RNAs with (+RT) and without (-RT) reverse transcription, were 605 subjected to PCR (40 cycles) with an exonic primer. The reaction products were electrophoresed on a 1.5 606 % agarose gel stained with SYBR safe. The T, C, N represent the RNA isolated from total, cytoplasmic and 607 nuclear fractions respectively.

608

19

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

609 Supplementary Files

610

611 Supplementary File S1: File is in BED format with coordinates for human genome assembly GRCh38 / 612 hg38. The file contains the coordinates of genomic nucleotides that overlap both TEs and merged exons of 613 lncRNA genes.

614 The 4th “name” column contains information on the TE (from RepeatMasker annotation) in comma- 615 separated format:

616 Chromosome of TE annotation, start of TE annotation, end of TE annotation, strand of TE annotation, TE 617 name, TE class, TE family, left in repeat, end in repeat, begin in repeat, hypothetical start if TE were full 618 length, hypothetical end if TE were full length.

619 In the 5th “score” column is indicated the TE classification:

620 +1 : tss.plus ; +2 : splice_acc.plus ; +3 : encompass.plus ; +4 : splice_don.plus; +5 : tts.plus ; +6 : 621 inside.plus. The sign (+-) denotes whether the annotated strand of the TE lies on the same/opposite strand 622 with respect to that of the overlapping lncRNA.

623 Supplementary File S2: File is in BED format with coordinates for human genome assembly GRCh38 / 624 hg38. It was created by merging a subset of the exonic TEs from Supplementary File S1, and thus some 625 lines may be composed of merged features.

626 The 4th “name” column contains information on the TE (from RepeatMasker annotation) in comma- 627 separated format:

628 Chromosome of TE annotation, start of TE annotation, end of TE annotation, strand of TE annotation, TE 629 name, TE class, TE family, left in repeat, end in repeat, begin in repeat, hypothetical start if TE were full 630 length, hypothetical end if TE were full length.

631 Supplementary File S3: File contains wild type and mutant synthesized plasmid sequences. Coloured 632 sequences indicate wild-type / scrambled RIDL sequences.

633 Supplementary File S4: Primer sequences and calculations of their efficiency.

634 Supplementary File S5: PDF images of insertion profiles of all RIDLs (blue) and intronic insertions (red). 635 x axis shows the entire consensus sequence of a given TE type. y axis indicates the frequency with which 636 each nucleotide position is present in the aggregate of all insertions. “CC”: Spearman correlation coefficient 637 of the two profiles. “Data” / “Reference” indicate the numbers of individual insertions considered for RIDLs 638 / intronic insertions, respectively.

639 Supplementary File S6: List of GENCODE v21 lncRNAs carrying one or more RIDLs. First column 640 contains gene ID and second column the number of RIDLs present in the lncRNA.

641

20

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

642 Acknowledgements

643 We acknowledge Deborah Re (DBMR), Silvia Roesselet (DBMR) and Marianne Zahn (Inselspital) 644 for administrative support. We wish to thank Roderic Guigo (CRG), Marc Friedlaender (SciLife Lab) and 645 Marta Mele (Harvard) for many helpful discussions and Roberta Esposito (DBMR) for help and advice 646 regarding experiments and their analysis. Samir Ounzain (CHUV) also contributed suggestions regarding 647 experimental design. Julien Lagarde (CRG) kindly provided help in gene sampling analysis. CN is 648 supported by grant TIN-2013-41990-R from the Spanish Ministry of Economy, Industry and 649 Competitiveness (MINECO). This research was supported by the NCCR RNA & Disease funded by the 650 Swiss National Science Foundation. 651 652

21

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

653 References 654 655 Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and 656 Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B 657 (Methodological) 57: 289–300. https://www.jstor.org/stable/2346101 (August 25, 2017). 658 Blackwell, Benjamin J et al. “Protein Interactions with piALU RNA Indicates Putative Participation of 659 retroRNA in the Cell Cycle, DNA Repair and Chromatin Assembly.” 660 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3383447/pdf/mge-2-26.pdf (August 25, 2017). 661 Bourque, Guillaume et al. 2008. “Evolution of the Mammalian Transcription Factor Binding Repertoire 662 via Transposable Elements.” Genome research 18(11): 1752–62. 663 http://www.ncbi.nlm.nih.gov/pubmed/18682548 (August 25, 2017). 664 ———. 2009. “Transposable Elements in Gene Regulation and in the Evolution of Vertebrate Genomes.” 665 Current Opinion in & Development 19(6): 607–12. 666 http://www.sciencedirect.com/science/article/pii/S0959437X09001725 (August 25, 2017). 667 Cabili, Moran N et al. 2015. “Localization and Abundance Analysis of Human lncRNAs at Single-Cell 668 and Single-Molecule Resolution.” Genome biology 16(1): 20. 669 http://genomebiology.com/2015/16/1/20 (January 30, 2015). 670 Carrieri, Claudia et al. 2012. “Long Non-Coding Antisense RNA Controls Uchl1 Translation through an 671 Embedded SINEB2 Repeat.” Nature 491(7424): 454–57. 672 http://www.ncbi.nlm.nih.gov/pubmed/23064229 (February 10, 2015). 673 Chen, G. et al. 2013. “LncRNADisease: A Database for Long-Non-Coding RNA-Associated Diseases.” 674 Nucleic Acids Research 41(D1): D983–86. http://www.ncbi.nlm.nih.gov/pubmed/23175614 675 (December 24, 2016). 676 Chen, Ling-Ling. 2016. “Linking Long Noncoding RNA Localization and Function.” Trends in 677 biochemical sciences. http://www.ncbi.nlm.nih.gov/pubmed/27499234 (August 25, 2016). 678 Chillón, Isabel, and Anna M. Pyle. 2016. “ Alu Elements in the Human lincRNA-p21 679 Adopt a Conserved Secondary Structure That Regulates RNA Function.” Nucleic Acids Research: 680 gkw599. http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkw599 (September 5, 2016). 681 Chuong, Edward B, Nels C Elde, and Cédric Feschotte. 2017. “Regulatory Activities of Transposable 682 Elements: From Conflicts to Benefits.” 683 https://www.nature.com/nrg/journal/v18/n2/pdf/nrg.2016.139.pdf (August 25, 2017). 684 Clark, Michael B et al. 2012. “Genome-Wide Analysis of Long Noncoding RNA Stability.” Genome 685 research 22(5): 885–98. http://genome.cshlp.org/content/22/5/885.long (June 3, 2014). 686 Cordaux, Richard, and Mark A Batzer. “The Impact of on Human Genome Evolution.” 687 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2884099/pdf/nihms-201920.pdf (August 29, 2017). 688 Derrien, Thomas et al. 2012. “The GENCODE v7 Catalog of Human Long Noncoding RNAs: Analysis 689 of Their Gene Structure, Evolution, and Expression.” Genome research 22(9): 1775–89. 690 http://genome.cshlp.org/content/22/9/1775.long (May 23, 2014). 691 Djebali, Sarah et al. 2012a. “Landscape of Transcription in Human Cells.” Nature 489(7414): 101–8. 692 http://www.nature.com/doifinder/10.1038/nature11233 (April 20, 2017). 693 ———. 2012b. “Landscape of Transcription in Human Cells.” Nature 489(7414): 101–8. 694 http://www.ncbi.nlm.nih.gov/pubmed/22955620 (May 11, 2017).

22

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

695 ———. 2012c. “Landscape of Transcription in Human Cells.” Nature 489(7414): 101–8. 696 http://www.ncbi.nlm.nih.gov/pubmed/22955620 (July 17, 2017). 697 Elisaphenko, Eugeny A et al. 2008. “A Dual Origin of the Xist Gene from a Protein-Coding Gene and a 698 Set of Transposable Elements.” 699 http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0002521&type=printable 700 (August 25, 2017). 701 Faulkner, Geoffrey J et al. “The Regulated Transcriptome of Mammalian Cells.” 702 https://www.nature.com/ng/journal/v41/n5/pdf/ng.368.pdf (August 25, 2017). 703 Feschotte, Cédric. 2008. “Transposable Elements and the Evolution of Regulatory Networks.” Nature 704 reviews. Genetics 9(5): 397–405. http://www.ncbi.nlm.nih.gov/pubmed/18368054 (September 1, 705 2017). 706 Gong, Chenguang, and Lynne E Maquat. 2011. “lncRNAs Transactivate STAU1-Mediated mRNA Decay 707 by Duplexing with 3’ UTRs via Alu Elements.” Nature 470(7333): 284–88. 708 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3073508&tool=pmcentrez&rendertype= 709 abstract (March 6, 2015). 710 Guttman, Mitchell, and John L. Rinn. 2012. “Modular Regulatory Principles of Large Non-Coding 711 RNAs.” Nature 482(7385): 339–46. http://www.nature.com/doifinder/10.1038/nature10887 712 (January 6, 2017). 713 Hacisuleyman, Ezgi et al. 2016. “Function and Evolution of Local Repeats in the Firre Locus.” Nature 714 Communications 7: 11021. http://www.nature.com/doifinder/10.1038/ncomms11021 (November 20, 715 2016). 716 Harrow, J. et al. 2012. “GENCODE: The Reference Human Genome Annotation for The ENCODE 717 Project.” Genome Research 22(9): 1760–74. http://genome.cshlp.org/cgi/doi/10.1101/gr.135350.111 718 (November 20, 2016). 719 Hezroni, Hadas et al. 2015. “Principles of Long Noncoding RNA Evolution Derived from Direct 720 Comparison of Transcriptomes in 17 Species.” Cell Reports 11(7): 1110–22. 721 http://linkinghub.elsevier.com/retrieve/pii/S2211124715004106 (September 1, 2017). 722 ———. 2017. “A Subset of Conserved Mammalian Long Non-Coding RNAs Are Fossils of Ancestral 723 Protein-Coding Genes.” Genome Biology 18(1): 162. 724 http://www.ncbi.nlm.nih.gov/pubmed/28854954 (September 2, 2017). 725 Hindorff, Lucia A et al. 2009. “Potential Etiologic and Functional Implications of Genome-Wide 726 Association Loci for Human Diseases and Traits.” Proceedings of the National Academy of Sciences 727 of the United States of America 106(23): 9362–67. http://www.ncbi.nlm.nih.gov/pubmed/19474294 728 (August 30, 2017). 729 Holdt, Lesca M et al. 2013. “Alu Elements in ANRIL Non-Coding RNA at Chromosome 9p21 Modulate 730 Atherogenic Cell Functions through Trans-Regulation of Gene Networks.” PLoS Genet 9(7). 731 http://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1003588&type=printable 732 (August 25, 2017). 733 Hu, Boqin et al. 2017. “POSTAR: A Platform for Exploring Post-Transcriptional Regulation Coordinated 734 by RNA-Binding Proteins.” Nucleic Acids Research 45(D1): D104–14. 735 https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw888 (September 2, 2017). 736 Huda, Ahsan, Nathan J. Bowen, Andrew B. Conley, and I. King Jordan. 2011. “Epigenetic Regulation of 737 Transposable Element Derived Human Gene Promoters.” Gene 475(1): 39–48.

23

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

738 http://www.sciencedirect.com/science/article/pii/S0378111910004762 (August 27, 2017). 739 Jjingo, Daudi et al. 2014. “Mammalian-Wide (MIR)-Derived Enhancers and the 740 Regulation of Human Gene Expression.” Mobile DNA 5(1): 14. 741 http://mobilednajournal.biomedcentral.com/articles/10.1186/1759-8753-5-14 (September 2, 2017). 742 Johnson, R. et al. 2006. “Identification of the REST Regulon Reveals Extensive Transposable Element- 743 Mediated Binding Site Duplication.” Nucleic Acids Research 34(14): 3862–77. 744 http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkl525 (August 25, 2017). 745 Johnson, Rory, and Roderic Guigó. 2014. “The RIDL Hypothesis: Transposable Elements as Functional 746 Domains of Long Noncoding RNAs.” RNA (New York, N.Y.) 20(7): 959–76. 747 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4114693&tool=pmcentrez&rendertype= 748 abstract (April 29, 2015). 749 Jurka, J, E Zietkiewicz, and D Labuda. 1995. “Ubiquitous Mammalian-Wide Interspersed Repeats (MIRs) 750 Are Molecular Fossils from the Mesozoic Era.” Nucleic acids research 23(1): 170–75. 751 http://www.ncbi.nlm.nih.gov/pubmed/7870583 (August 29, 2017). 752 Kapusta, Aurélie et al. 2013. “Transposable Elements Are Major Contributors to the Origin, 753 Diversification, and Regulation of Vertebrate Long Noncoding RNAs.” ed. Hopi E. Hoekstra. PLoS 754 genetics 9(4): e1003470. http://dx.plos.org/10.1371/journal.pgen.1003470 (August 25, 2017). 755 Kelley, David R, David G Hendrickson, Danielle Tenen, and John L Rinn. 2014. “Transposable Elements 756 Modulate Human RNA Abundance and Splicing via Specific RNA-Protein Interactions.” Genome 757 Biology 15(12): 537. http://www.ncbi.nlm.nih.gov/pubmed/25572935 (September 2, 2017). 758 Kelley, David, and John Rinn. 2012. “Transposable Elements Reveal a Stem Cell-Specific Class of Long 759 Noncoding RNAs.” Genome Biology 13(11): R107. 760 http://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-11-r107 (August 25, 2017). 761 ———. “Transposable Elements Reveal a Stem Cell-Specific Class of Long Noncoding RNAs.” 762 https://genomebiology.biomedcentral.com/track/pdf/10.1186/gb-2012-13-11- 763 r107?site=genomebiology.biomedcentral.com (August 25, 2017). 764 Konkel, Miriam K, Jerilyn A Walker, and Mark A Batzer. 2010. “LINEs and SINEs of Primate 765 Evolution.” Evolutionary anthropology 19(6): 236–49. 766 http://www.ncbi.nlm.nih.gov/pubmed/25147443 (August 29, 2017). 767 Lev-Maor, Galit, Rotem Sorek, Noam Shomron, and Gil Ast. 2003. “The Birth of an Alternatively 768 Spliced Exon: 3’ Splice-Site Selection in Alu Exons.” Science 300(5623). 769 http://science.sciencemag.org/content/300/5623/1288/tab-pdf (August 25, 2017). 770 Li, Jun-Hao et al. 2014. “starBase v2.0: Decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA 771 Interaction Networks from Large-Scale CLIP-Seq Data.” Nucleic Acids Research 42(D1): D92–97. 772 https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkt1248 (September 2, 2017). 773 Martin, Kelsey C, and Anne Ephrussi. 2009. “mRNA Localization: Gene Expression in the Spatial 774 Dimension.” Cell 136(4): 719–30. http://www.ncbi.nlm.nih.gov/pubmed/19239891 (August 29, 775 2017). 776 Mas-Ponte, David et al. 2017a. “LncATLAS Database for Subcellular Localisation of Long Noncoding 777 RNAs.” RNA (New York, N.Y.): rna.060814.117. 778 http://rnajournal.cshlp.org/lookup/doi/10.1261/rna.060814.117 (April 10, 2017). 779 ———. 2017b. “LncATLAS Database for Subcellular Localization of Long Noncoding RNAs.” RNA

24

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

780 23(7): 1080–87. http://rnajournal.cshlp.org/lookup/doi/10.1261/rna.060814.117 (July 19, 2017). 781 Mercer, Tim R, and John S Mattick. 2013. “Structure and Function of Long Noncoding RNAs in 782 Epigenetic Regulation.” Nature Publishing Group 20. 783 https://www.nature.com/nsmb/journal/v20/n3/pdf/nsmb.2480.pdf (August 25, 2017). 784 Mukherjee, Neelanjan et al. 2017. “Integrative Classification of Human Coding and Noncoding Genes 785 through RNA Metabolism Profiles.” Nature structural & molecular biology 24(1): 86–96. 786 http://www.nature.com/doifinder/10.1038/nsmb.3325 (September 2, 2017). 787 Necsulea, Anamaria et al. 2014. “The Evolution of lncRNA Repertoires and Expression Patterns in 788 Tetrapods.” Nature 505(7485): 635–40. http://www.ncbi.nlm.nih.gov/pubmed/24463510 (May 11, 789 2017). 790 Ning, Shangwei et al. 2016. “Lnc2Cancer: A Manually Curated Database of Experimentally Supported 791 lncRNAs Associated with Various Human Cancers.” Nucleic Acids Research 44(D1): D980–85. 792 https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkv1094 (August 30, 2017). 793 Nissan, Aviram et al. 2012. “Colon Cancer Associated Transcript-1: A Novel RNA Expressed in 794 Malignant and Pre-Malignant Human Tissues.” International Journal of Cancer 130(7): 1598–1606. 795 http://www.ncbi.nlm.nih.gov/pubmed/21547902 (August 29, 2017). 796 Van Nostrand, Eric L et al. 2016. “Robust Transcriptome-Wide Discovery of RNA-Binding Protein 797 Binding Sites with Enhanced CLIP (eCLIP).” Nature Methods 13(6): 508–14. 798 http://www.ncbi.nlm.nih.gov/pubmed/27018577 (September 2, 2017). 799 Perepelitsa-Belancio, Victoria, and Prescott Deininger. 2003. “RNA Truncation by Premature 800 Polyadenylation Attenuates Human Mobile Element Activity.” Nature Genetics 35(4): 363–66. 801 http://www.nature.com/doifinder/10.1038/ng1269 (August 25, 2017). 802 Quek, Xiu Cheng et al. 2015. “lncRNAdb v2.0: Expanding the Reference Database for Functional Long 803 Noncoding RNAs.” Nucleic acids research 43(Database issue): D168-73. 804 https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gku988 (December 24, 2016). 805 Quinlan, Aaron R, and Ira M Hall. 2010. “BEDTools: A Flexible Suite of Utilities for Comparing 806 Genomic Features.” Bioinformatics (Oxford, England) 26(6): 841–42. 807 http://www.ncbi.nlm.nih.gov/pubmed/20110278 (August 25, 2017). 808 Roberts, Justin T, Sara E Cardin, and Glen M Borchert. 2014. “Burgeoning Evidence Indicates That 809 microRNAs Were Initially Formed from Transposable Element Sequences.” Mobile Genetic 810 Elements 4(3): e29255. http://www.tandfonline.com/doi/abs/10.4161/mge.29255 (August 25, 2017). 811 Schmitt, Adam M. et al. 2016. “Long Noncoding RNAs in Cancer Pathways.” Cancer Cell 29(4): 452– 812 63. http://linkinghub.elsevier.com/retrieve/pii/S1535610816300927 (June 28, 2016). 813 Seemann, Stefan E. et al. 2017. “The Identification and Functional Annotation of RNA Structures 814 Conserved in Vertebrates.” Genome Research 27(8): 1371–83. 815 http://www.ncbi.nlm.nih.gov/pubmed/28487280 (September 2, 2017). 816 Sela, Noa et al. 2007. “Comparative Analysis of Transposed Element Insertion within Human and Mouse 817 Genomes Reveals Alu’s Unique Role in Shaping the Human Transcriptome.” 8(6). 818 https://genomebiology.biomedcentral.com/track/pdf/10.1186/gb-2007-8-6- 819 r127?site=genomebiology.biomedcentral.com (August 25, 2017). 820 Siepel, Adam et al. 2005. “Evolutionarily Conserved Elements in Vertebrate, Insect, Worm, and Yeast 821 Genomes.” Genome research 15(8): 1034–50. http://www.ncbi.nlm.nih.gov/pubmed/16024819

25

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

822 (August 25, 2017). 823 Smith, M. A., T. Gesell, P. F. Stadler, and J. S. Mattick. 2013. “Widespread Purifying Selection on RNA 824 Structure in Mammals.” Nucleic Acids Research 41(17): 8220–36. 825 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3783177&tool=pmcentrez&rendertype= 826 abstract (April 20, 2015). 827 Suzuki, Keiko et al. 2010. “REAP: A Two Minute Cell Fractionation Method.” BMC Research Notes 828 3(1): 294. http://www.ncbi.nlm.nih.gov/pubmed/21067583 (August 28, 2017). 829 Ulitsky, Igor, and David P. Bartel. 2013. “lincRNAs: Genomics, Evolution, and Mechanisms.” Cell 830 154(1): 26–46. http://linkinghub.elsevier.com/retrieve/pii/S0092867413007599 (November 20, 831 2016). 832 Washietl, Stefan, Manolis Kellis, and Manuel Garber. 2014. “Evolutionary Dynamics and Tissue 833 Specificity of Human Long Noncoding RNAs in Six Mammals.” Genome research 24(4): 616–28. 834 http://www.ncbi.nlm.nih.gov/pubmed/24429298 (May 25, 2014). 835 Welter, Danielle et al. 2014. “The NHGRI GWAS Catalog, a Curated Resource of SNP-Trait 836 Associations.” Nucleic Acids Research 42(D1): D1001–6. https://academic.oup.com/nar/article- 837 lookup/doi/10.1093/nar/gkt1229 (August 30, 2017). 838 Zhang, B. et al. 2014. “A Novel RNA Motif Mediates the Strict Nuclear Localization of a Long 839 Noncoding RNA.” Molecular and Cellular Biology 34(12): 2318–29. 840 http://mcb.asm.org/cgi/doi/10.1128/MCB.01673-13 (November 20, 2016). 841 Zhang, Kun et al. 2014. “The Ways of Action of Long Non-Coding RNAs in Cytoplasm and Nucleus.” 842 Gene 547(1): 1–9. http://www.ncbi.nlm.nih.gov/pubmed/24967943 (March 17, 2015). 843

26

bioRxiv preprint doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was A not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Passenger LncRNA TEs RNA Protein DNA RIDLs

B

Exonic Enrichment

Strand Bias

Evolutionary Conservation A B 5520018 0.4 bioRxiv preprint doi: https://doi.org/10.1101/189753RepeatMasker ; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed0.3 without permission. DNA LINE 48684 46474 Low complexity 23975 samesense LTR merged exons exonic overlap 22499 antisense 0.2 Satellite Simple repeat 26414 lncRNA SINE transcripts Gencode 21 0.1 31004 562640 273264 samesense Fraction Overlapped introns intronic overlap 289376 antisense 0.0

C D genome RepeatMasker lncrna.exonslncrna.introns

20000 Transcripts 15000

10000

Merged Number Gene 5000

TSS Donor 0 Acceptor Inside TSS TTS Exonic TE Exonic Annotation Encompassing Donor Inside TTS Acceptor Encompass E Alu 262 691 235 1017 367 304 15 625 913 454 1795 1903 1500 CR1 27 32 28 58 44 15 10 10 36 50 172 177 1000 ERV1 402 250 196 67 116 112 116 75 507 133 336 272 500 ERVK 37 11 20 10 25 4 5 4 56 9 21 13 0 ERVL 219 137 175 95 98 70 122 62 295 112 307 316 ERVL−MaLR 634 227 415 217 214 83 358 150 790 194 790 551

amily L1 315 525 167 324 243 176 135 335 710 763 1292 1276 L2 185 327 333 174 138 234 77 111 288 218 1161 1127 Low_complexity 47 31 10 5 13 0 1 0 43 7 347 293

Repeat F MIR 209 309 153 629 372 332 53 151 378 278 1905 1749 Simple_repeat 126 131 10 21 21 44 0 0 118 116 1846 1620 TcMar−Tigger 68 43 52 69 59 28 58 15 117 83 203 190 hAT−Charlie 117 89 119 124 136 66 52 15 249 173 628 630 hAT−Tip100 25 24 28 36 18 15 11 14 37 41 125 112 TTS+ TTS− TSS+ TSS− Inside+ Inside− SpliceAcc+ SpliceAcc− SpliceDon+ SpliceDon− Encompass+ Encompass− A B Strand Bias ExonicbioRxiv preprint Enrichment doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 4 L1PA5 a Enriched L1MA9 L1MC2 LTR78 L1PA3 a L1PA15 (GCC)n MLT1B Not significant L1PA16 (GGC)n 2 L1PB1 REP522 a HERVE−int Depleted MIRc LTR7C AluSx1 MER51E 0 L1M5 L1MB4 L1PA4 2 AluYc L1MCa L1PA7 L1MC1 L1MA6 G−rich L1ME2 HERV3−int Tigger2 L1MA2 Charlie2b −2 LTR12D MER92−int HERVK9−int overage L1M4b L1ME3D MLT1A0 SST1 BSR/Beta ALR/Alpha MLT1F LTR12F −4 MER49 50 60 (CTC)n MLT1F1 MER61ALTR33B MLT1J2 MLT1F2 40 MER77 MIR1_Amn

Log2 ratio sense / antisense overlap MER113 MamRep4096 Charlie1 MER51A FLAM_C MLT1J Charlie16a 30 MLT1G1 Charlie8 MLT1LMLT1H L1ME3DFLAM_AMLT1J1 LTR12C MER8 MER102c MLT1C 3.5 4.0 4.5 5.0 5.5 FRAM

MLT1G3 Frequency OldhAT1 MLT2FFAMLTR67B Tigger2aMLT1H2 MLT1H1MER52A 20 LFSINE_Vert LOR1−intLTR1A2 MER104 MER102b MLT1E1AMamRTE1LTR78 AmnSINE1LTR85c LTR12 MLT1N2LTR7 MER5A1 MLT1B MamRep137MER91A MLT1G MLT1A MER58C MER1AMLT2A2 LTR33 MLT1K Total exonic overlap MARNA MLT1E3MER41−intMER21AMER57A−int MER117 LTR16C 1000 Randomisations Charlie15a Charlie4zMER46CMER44A MLT2B2 L1ME4cMLT1A1 0 MADE2 MER63A LTR16 MLT2D MSTC L3b MER5CMER96BMER81MER112 Tigger7 LTR49−intMER103CSVA_D MER20 MLT1A0 LTR88bTigger4b LTR16E2 MLT1D 0 10 MER102aPlat_L3 Charlie4a MLT1OMER41B MLT1I (Log10 nt) Alu MER45A ERV3−16A3_I−int MER5A Tigger19a MER20BMamTip2 MLT1E2 Charlie5 MER5BAluJr4 Tigger19b LTR16A1 LTR16AMLT2B1 GA−rich L1MC4 L3AluSp Charlie18a MER4A1LTR50 MSTB1LTR8 L1MC5aAluSq MamSINE1 (A)nLTR78BMER21B L1MB5 MamRep434 (TC)nTigger15a MER2MSTD AluSc LTR104_MamMLT1MMER94 MER33Charlie1aAluSg7 AluSx4 L1ME4aAluSg −4 −2 0 2 4 Charlie1bLTR16A2LTR16E1LTR79MER58B(GT)nMLT2B3 L1MB4L1ME4b AluSz6 MER57−intMER53 MER3 L1P2MSTB L4_A_Mam(T)n(TG)n MER21C AluSq2 AluYh3MamRep605MER115Tigger13a MER1BTHE1A L1ME3GL1MC5 AluYm1L1M4a1 AluSc8L1ME3A (CT)n MER4D1 MER50 MER4−intL1ME2z AluSx3L1MB7 L4_B_MamCharlie2bHERV16−intAluYcMER61−int Tigger2 L1MD3 A−rich L1ME3 Log2 ratio exon/intron overlap MER30 L1MB2 L1M5 L2 MamGypsy2−I (CA)n MER58A HERVH−int MER39 (AC)n L1ME3B MSTA L1MCERVL−E−int THE1D MER41AL1MCcTigger3aAluSc5 L1ME1 L4_C_Mam L1ME5 ERVL−B4−int L1MC4a HAL1bMLT2B4 L1MDL1ME3Cz L1M2 MER63B Arthur1BCharlie7MLT2A1 L1MB8L1M4 AluSg4 L1MB3L1MEf (TTTC)n L1ME3E L1MD2 THE1B L1MDa L1MEh L1M3 L1PREC2 Arthur1 L1M4c Tigger3b L1ME2L1MEgL1MC1HAL1 Placental Mammal PhastCons C HAL1MEL1MEi THE1C Tigger1 MER50 L1M6 L1MEd L1M4b L1MC3 MLT2B4 MLT1A0 L1M7 (AT)n L1MA5L1MA6 L1MA9 LTR12C L1MD1 L1MA5A L1MA7 4 L1MCa L1PA15 MLT1H1L1MC3L1MC1 L1PA14 L1MA4 L1MEc L1MA5A L1MC2L1PA17L1PA11 (TA)n AluSc8 L1MA8 MER63A L1MD2 −2 L1MA2 THE1D Log2 Ratio Exonic vs Intronic C L1P3 GA−rich (AT)n L1MA3 L1MEg L1MA10 L1M1 AluSc L1PA13 (T)n AluSp L1MA4 MLT1K L1PA5 2 A−rich L1ME3A L1ME1 AluJr (TA)n L1PA16 L2b L1PA2 MIR L1PA8 L1PA3 L1PA10 L1PA7 L1MA1 0 L1PA4

xon / intron overlap AluSq2 AluJb MLT1H MLT1C AluSx1 AluSz −2 AluSx3 3.0 3.5 4.0 4.5 5.0

Total Exonic Coverage (Log10 nt) −4 L1MEc AluSg

ratio e Log2 ratio 3.5 4.0 4.5 5.0 5.5 Total exonic overlap (Log10 nt) D 10.07.55.0 2.5 Log2 Enrichment Types Individual Exonic Enrichment 20 941 Strand Bias 14 555 Primate Conserved 32 292 Placental Conserved 32 610 Vertebrate Conserved 40 905 Structure Conserved 17 4416

A2 L2a L2b L2c (A)n MIR (T)n T1J2 AluJr (AT)n TR33 TR78 MIR3 MIRb MIRc T2B4 (TA)n AluJbAluJo AluSc AluSpAluSq AluSz (CT)n L1MD TR7C SST1 A−rich FRAM L1MEf L1P L1PA3L1PA5 L L G−rich L1PB1 TR12F L MLT1B MLT1F MLT1K (AAC)n AluSc8AluSg4 AluSq2AluSx1 L1MA4L1MA5 L1MA7L1MA9L1MC1L1MC2L1MC3L1MC4 L1MD1L1MD2L1MD3L1ME1 L1MEg TR12CTR12D MER49MER50 MLT1CMLT1D THE1D (AATA)n (CCG)n(CGC)n(CGG)n (GCC)n(GGC)n L1PA15L1PA16 L L L MER1A ML (TTTA)n GA−rich L1MA5A L1ME3AL1ME4b TR16E1 MLT1A0MLT1A1 MLT1H1 ML REP522 (TTTC)n L MER41A MER51E MER61A MER63A Tigger3a BSR/Beta Charlie15a ALR/Alpha HERV3−intHERVE−int MER57−int MER61−int MER92−int HERVK9−int MER57A−int ERVL−B4−int Total 99 5374 A hg38 5 kb Scale 127,219,000 127,218,000 127,217,000 127,216,000 127,215,000 127,214,000 127,213,000 127,212,000 127,211,000 127,210,000 127,209,000 chr8 bioRxiv preprint doi: https://doi.org/10.1101/189753Basic Gene Annotation; this versionSet from GENCODE posted September Version 22 (Ensembl 16, 2017. 79) The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. CCAT1 Merged Exons Exonic TE Annotation SINE-MIR DNA-hAT-Charlie SINE-MIR Simple repeat RIDL SINE-Alu SINE-MIR RIDLs Annotation RIDL Multiz Alignment & Conservation (7 Species) candidate.RIDL.merged.oldIDs.hg38 7 Vert. El Primates Multiz Alignment & Conservation (20 Species) 20-way El

B C D RIDLs per gene Evidence 60 Conservation SST1 CC: −0.18 shuffled values 2000 RIDLs: 41 true value Conservation + Enrichment ● 95% confidence Conservation + Same Strand Intronic TEs: 93 Enrichment 1500 Enrichment + Same Strand 40 TSIX Same Strand mir-29c

#RIDL 1000 LINC00861 XIST Count TPT1-AS1 ● 20 LINC00963 500 ● 0.0 0.4 0 500 1000 1500 ● ● 0 0 ● ● Base Inclusion Frequency Position in Repeat Consensus (nt)

L1 L2 4 5 6 7 8 9 Alu MIR telo 10 centrERV1ERVKERVL Satellite Number of RIDLs per gene ERVL−MaLRhAT−Charlie hAT−Blackjack Simple_repeatTcMar−Tigger E Low_complexity F G Functionally-characterised lncRNAs Disease-associated lncRNAs GWAS SNP overlap 2.5 RIDL-lncRNAs (n=3566) RIDL-lncRNAs (n=3566) 10.0 RIDL-lncRNAs (n=3566) Other lncRNAs (n=2068) Other lncRNAs (n=2068) Other lncRNAs (n=2068) a Promoters 6 P = 0.03 a Exons 2.0 P = 0.01 7.5 172 P = 0.0001

1.5 48 200 P = 0.04 4 75 5.0 1.0 131 % of genes % of genes % of genes 69 13 2 55 2.5 0.5

0.0 0 0.0 TE Type L1ME3A GC_rich MLT1A0 LTR12D MLT1J2 L1PA16 C B D A GA.rich AT_rich MER49 L1MD3 CT.rich bioRxiv preprint LTR7C AluSx MIRb MIR3 MIRc MIR L2b L2a L2c

GM12878A549 101 13 13 94 62 81 52 48 54 11 97 10 25 HeLaS3 6 6 3 3 9 7 RIDL xlncRNAs 112 doi: 16 10 91 52 86 53 55 67 79 11 21 8 6 4 6 8 3 7 7 not certifiedbypeerreview)istheauthor/funder.Allrightsreserved.Noreuseallowedwithoutpermission. HepG2 https://doi.org/10.1101/189753 14 10 90 95 53 89 55 45 57 91 11 18 HUVEC 6 9 5 8 9 6 16 86 99 62 76 52 48 57 80 10 22 8 7 6 4 3 7 3 7 8

Cell Type

IMR90 10 12 91 95 59 67 38 52 53 76 12 10 18 7 5 3 8 3 6 13 12 84 77 53 63 42 44 45 11 78 12 10 17 K562 6 4 3 5 19 75 85 48 71 42 46 50 71 18

MCF7 6 8 6 3 4 9 6 7 7 vs 106 118 103 21 12 63 87 66 60 69 13 12 11 29 7 6 4 3 NHEK 8 SKNSH 18 12 73 79 55 61 46 46 42 10 72 20 8 5 5 6 3 6 8 7 108 114 28 11 73 80 62 60 59 13 99 10 12 28 6 5 5 5 3 6 ; All otherlncRNAs this versionpostedSeptember16,2017. Nucl/Cyto Ratio −1 0 1 2

Cyto Nucl

Log2 Nucl/Cyto Ratio Cumulative Frequency − 0.00 0.25 0.50 0.75 1.00 10 5 0 5 N − 0.75 695 6 4 3 2 1 0 P=6.7e-6 Log2 Nucl/CytoRatio 2128 not 44 L2b Number ofRIDLs 1.89 266 − The copyrightholderforthispreprint(whichwas 3 L2b inIMR90 Cyto IMR90 3.09 35 0 Nucl 5.12 C Rho=0.2 3 3 P =0 6.57 1 6

Cyto Nucl A Chr20 hg38 200 bases 58,817,800 58,817,700 58,817,600 58,817,500 58,817,400 58,817,300 58,817,200 58,817,100 Position in bioRxiv preprintRP4-806M20.4 doi: https://doi.org/10.1101/189753; this version posted September 16, 2017. The copyright3147-3372 holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. consensus MIR296 L2b >

hg38 Chr1 2 kb 911,000 911,500 912,000 912,500 913,000 913,500 914,000 914,500 915,000 915,500 3096-3275 RP11-54O7.1 137-219 3294-3371 MIRb > L2b > L2b <

Chr12 1 kb hg38 116,533,000 116,533,500 116,534,000 116,534,500 116,535,000 116,535,500

LINC00173 31-184 57-206 MIRb > MIRc >

B C

Whole cell WT endogenous 6 Cytoplasm 7.5 Nuclear control HeLa qRT-PCR HeLa RNAseq Nucleus Cytoplasmic contol 4 5.0

2.5 2 Expression (FPKM)

Log2 Nucl/Cyto Ratio 0.0 0

MALAT1 GAPDH LINC00173 LINC00173 RP4-806M20.4 RP11-54O7.1 RP4-806M20.4RP11-54O7.1

Transfect Fractionate D RIDL Wild Type N C Randomised Mutant

N=4 F P = 0.002 P = 0.01 P = 0.006 E 4

GAPDH 2

H3

Log2 Nucl/Cyto Ratio 0 WT transfected HeLa Mutant (randomised) Nuclear control −2 Cytoplasmic contol

LINC00173 RP4-806M20.4 RP11-54O7.1 MALAT1 / GAPDH