bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Binding specificities of human RNA binding towards structured and linear 2 RNA sequences 3 4 Arttu Jolma1,#, Jilin Zhang1,#, Estefania Mondragón4,#, Teemu Kivioja2, Yimeng Yin1, 5 Fangjie Zhu1, Quaid Morris5,6,7,8, Timothy R. Hughes5,6, Louis James Maher III4 and Jussi 6 Taipale1,2,3,* 7 8 9 AUTHOR AFFILIATIONS 10 11 1Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, Sweden 12 2Genome-Scale Biology Program, University of Helsinki, Helsinki, Finland 13 3Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom 14 4Department of Biochemistry and Molecular Biology and Mayo Clinic Graduate School of 15 Biomedical Sciences, Mayo Clinic College of Medicine and Science, Rochester, USA 16 5Department of Molecular Genetics, University of Toronto, Toronto, Canada 17 6Donnelly Centre, University of Toronto, Toronto, Canada 18 7Edward S Rogers Sr Department of Electrical and Computer Engineering, University of 19 Toronto, Toronto, Canada 20 8Department of Computer Science, University of Toronto, Toronto, Canada 21 #Authors contributed equally 22 *Correspondence: [email protected] 23 24 25 SUMMARY 26 27 Sequence specific RNA-binding proteins (RBPs) control many important 28 processes affecting expression. They regulate RNA metabolism at multiple 29 levels, by affecting splicing of nascent transcripts, RNA folding, base modification, 30 transport, localization, translation and stability. Despite their central role in most 31 aspects of RNA metabolism and function, most RBP binding specificities remain 32 unknown or incompletely defined. To address this, we have assembled a genome- 33 scale collection of RBPs and their RNA binding domains (RBDs), and assessed their 34 specificities using high throughput RNA-SELEX (HTR-SELEX). Approximately 70% 35 of RBPs for which we obtained a motif bound to short linear sequences, whereas 36 ~30% preferred structured motifs folding into stem-loops. We also found that 37 many RBPs can bind to multiple distinctly different motifs. Analysis of the matches 38 of the motifs on human genomic sequences suggested novel roles for many RBPs in 39 regulation of splicing, and revealed RBPs that are likely to control specific classes 40 of transcripts. Global analysis of the motifs also revealed an enrichment of G and U 41 nucleotides. Masking of G and U by RBPs is expected to increase the specificity of 42 RNA folding, as both G and U can pair to two other RNA bases via canonical Watson- 43 Crick or G-U base pairs. The collection containing 145 high resolution binding 44 specificity models for 86 RBPs is the largest systematic resource for the analysis of 45 human RBPs, and will greatly facilitate future analysis of the various biological 46 roles of this important class of proteins. 47 48 49

1

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

50 INTRODUCTION 51 52 The abundance of and RNA molecules in a cell depends both on their rates 53 of production and degradation. These rates are determined directly or indirectly by the 54 sequence of DNA. The transcription rate of RNA and the rate of degradation of proteins is 55 determined by DNA and protein sequences, respectively. However, most regulatory steps 56 that control are influenced by the sequence of the RNA itself. These 57 processes include RNA splicing, localization, stability, and translation, all of which can be 58 regulated by RNA-binding proteins (RBPs) that specifically recognize short RNA 59 sequence elements (Glisovic et al., 2008). 60 RBPs can recognize their target sites using two mechanisms: they can form direct 61 contacts to the RNA bases of an unfolded RNA chain, and/or recognise folded RNA- 62 structures (Loughlin et al., 2009). These two recognition modes are not mutually 63 exclusive, and the same RBP can combine both mechanisms in recognition of its target 64 sequence. The RBPs that bind to unfolded target sequences generally bind to each base 65 independently of other bases, and their specificity can thus be well explained by a simple 66 position weight matrix (PWM) model. However, recognition of a folded RNA-sequence 67 leads to strong positional interdependencies between different bases due to base pairing. 68 In addition to the canonical Watson-Crick base pairs G:C and A:U, double-stranded RNA 69 commonly contains also G:U base pairs, and can also accommodate other non-canonical 70 base pairing configurations in specific structural contexts (Varani and McClain, 2000). 71 It has been estimated that the encodes approximately 1500 72 proteins that can associate with RNA (Gerstberger et al., 2014). Only some of the RBPs 73 are thought to be sequence specific. Many RNA-binding proteins bind only a single RNA 74 species (e.g. ribosomal proteins), or serve a structural role in ribonucleoprotein 75 complexes or the spliceosome. As RNA can fold to complex three-dimensional structures, 76 defining what constitutes an RBP is not simple. In this work, we have focused on RBPs 77 that are likely to bind to short sequence elements analogously to sequence-specific DNA 78 binding transcription factors. The number of such RBPs can be estimated based on the 79 number of proteins containing one or more canonical RNA-binding protein domains. The 80 total number is likely to be ~400 RBPs (Cook et al., 2011; Dominguez et al., 2018; Ray et 81 al., 2013). The major families of RBPs contain canonical RNA-binding protein domains 82 (RBDs) such as the RNA recognition motif (RRM), CCCH , K homology (KH) and 83 cold shock domain (CSD). In addition, smaller number of proteins bind RNA using La, 84 HEXIM, PUF, THUMP, YTH, SAM and TRIM-NHL domains. In addition, many “non- 85 canonical” RBPs that do not contain any of the currently known RBDs have been reported 86 to specifically bind to RNA (see, for example (Gerstberger et al., 2014)). 87 Various methods have been developed to determine the binding positions and 88 specificities of RNA binding proteins. Methods that use crosslinking of RNA to proteins 89 followed by immunoprecipitation and then massively parallel sequencing (CLIP-seq or 90 HITS-CLIP, reviewed in (Darnell, 2010) and PAR-CLIP (Hafner et al., 2010) can determine 91 RNA positions bound by RBPs in vivo, whereas other methods such as SELEX (Tuerk and 92 Gold, 1990), RNA Bind-n-Seq (Dominguez et al., 2018; Lambert et al., 2015) and 93 RNAcompete (Ray et al., 2009) can determine motifs bound by RBPs in vitro. Most high- 94 resolution models derived to date have been determined using RNAcompete or RNA 95 Bind-n-Seq. These methods have been used to analyze large numbers of RBPs from 96 multiple species, including generation of models for 137 human RBPs (Dominguez et al., 97 2018; Ray et al., 2013).

2

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

98 The CISBP-RNA database (Ray et al., 2013) (Database Build 0.6) currently lists 99 total of 392 high-confidence RBPs in human, but contains high-resolution specificity 100 models for only 100 of them (Ray et al., 2013). The Encyclopedia of DNA Elements 101 (ENCODE) database that contains human RNA Bind-n-Seq data, in turn, has models for 78 102 RBPs (Dominguez et al., 2018). In addition, a literature curation based database RBPDB 103 (The database of RNA-binding protein specificities) (Cook et al., 2011) contains 104 experimental data for 133 human RBPs, but mostly contains individual target- or 105 consensus sites, and only has high resolution models for 39 RBPs. Thus, despite the 106 central importance of RBPs in fundamental cellular processes, the precise sequence 107 elements bound by most RBPs remain to be determined. To address this problem, we 108 have in this work developed high-throughput RNA SELEX (HTR-SELEX) and used it to 109 determine binding specificities of human RNA binding proteins. Our analysis suggests 110 that many RBPs prefer to bind structured RNA motifs, and can associate with several 111 distinct sequences. The distribution of motif matches in the genome indicates that many 112 RBPs have central roles in regulation of RNA metabolism and activity in cells. 113 114

3

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

115 RESULTS 116 117 118 Identification of RNA-binding motifs using HTR-SELEX 119 120 To identify binding specificities of human RBPs, we established a collection of 121 canonical and non-canonical full-length RBPs and RNA binding domains. The full-length 122 constructs representing 819 putative RBPs were picked from the Orfeome 3.1 and 8.1 123 collections (Lamesch et al., 2007) based on annotations of the CisBP database for 124 conventional RBPs (Ray et al., 2013) and Gerstberger et al. (Gerstberger et al., 2014) for 125 additional unconventional RBPs. The 293 constructs designed to cover all canonical RBDs 126 within 156 human RBPs were synthesized based on Interpro defined calls 127 from ENSEMBL v76. Most RBD constructs contained all RBDs of a given protein with 15 128 amino-acids of flanking sequence (see Table S1 for details). For some very large RBPs, 129 constructs were also made that contained only a subset of their RBDs. Taken together our 130 clone collection covered 942 distinct proteins. The RBPs were expressed in E.coli as 131 fusion proteins with thioredoxin, incorporating an N-terminal hexahistidine and a C- 132 terminal SBP tag (Jolma et al., 2015). 133 To identify RNA sequences that bind to the proteins, we subjected the proteins to 134 HTR-SELEX (Figure 1A). In HTR-SELEX, a 40 bp random DNA sequence containing a 135 sample index and 5’ and 3’ primer binding sequences is transcribed into RNA using T7 136 RNA polymerase, and incubated with the individual proteins in the presence of RNase 137 inhibitors, followed by capture of the proteins using metal-affinity resin. After washing 138 and RNA recovery, a DNA primer is annealed to the RNA, followed by amplification of the 139 bound sequences using a reverse-transcription polymerase chain reaction (RT-PCR) 140 using primers that regenerate the T7 RNA polymerase promoter. The entire process is 141 repeated up to a total of four selection cycles. The amplified DNA is then sequenced, 142 followed by identification of motifs using the Autoseed pipeline (Nitta et al., 2015) 143 modified to analyze only the transcribed strand. Compared to previous methods such as 144 RNAcompete, HTR-SELEX uses a selection library with very high sequence complexity, 145 allowing identification of long RNA binding preferences. 146 The analysis resulted in generation of 145 binding specificity models for 86 RBPs. 147 Most of the results (66 RBPs) were replicated in a second HTR-SELEX experiment. The 148 success rate of our experiments was ~ 22% for the canonical RBPs, whereas the fraction 149 of the successful non-canonical RBPs was much lower (~ 1.3%; Table S1). Comparison 150 of our data with a previous dataset generated using RNAcompete (Ray et al., 2013) and 151 RNA Bind-n-Seq (Dominguez et al., 2018) and to older data that has been compiled in the 152 RBPDB-database (Cook et al., 2011) revealed that the specificities were generally 153 consistent with the previous findings (Figures S1 and S2). HTR-SELEX resulted in 154 generation of a larger number of motifs than the previous systematic studies, and 155 revealed the specificities of 38 RBPs whose high-resolution specificities were not 156 previously known (Figure 1B). Median coverage per RBD family was 24 % (Figure 1C). 157 Compared to the motifs from previous studies, the motifs generated with HTR-SELEX 158 were also wider, and had a higher information content (Figure S3), most likely due to the 159 fact that the sequences are selected from a more complex library in HTR-SELEX (see also 160 (Yin et al., 2017)). The median width and information contents of the models were 10 161 bases and 10 bits, respectively. 162 163

4

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

164 Some RBPs bind to RNA as dimers 165 166 Analysis of enriched sequences revealed that 31% of RBPs could bind to a site 167 containing a direct repeat of the same sequence (Figure S4). Such direct repeats were 168 also bound by RBPs having only a single RBD, suggesting that some RBPs could form 169 homodimers, or interact to form a homodimer when bound to RNA (Table S2). The gap 170 between the direct repeats was generally short, with a median gap of 5 nucleotides 171 (Figure S4). To determine whether the HTR-SELEX identified gap length preferences 172 were also observed in sites bound in vivo, we compared our data against existing in vivo 173 data for five RBPs for which high quality PAR-CLIP and HITS-CLIP derived data was 174 available from previous studies (Farazi et al., 2014; Hafner et al., 2010; Weyn- 175 Vanhentenryck et al., 2014), and found that preferred spacing identified in HTR-SELEX 176 was in most cases also observed in the in vivo data (Figure S5). 177 178 179 Recognition of RNA structures by RBPs 180 181 Unlike double-stranded DNA, whose structure is relatively independent of 182 sequence, RNA folds into complex, highly sequence-dependent three dimensional 183 structures. To analyze whether RBP binding depends on RNA secondary structure, we 184 identified characteristic patterns of dsRNA formation by identifying correlations between 185 all two base positions either within the motif or in its flanking regions, using a measure 186 described in Nitta et al., (Nitta et al., 2015) that is defined by the difference between 187 observed count of combinations of a given set of two bases and their expected count 188 based on a model that assumes independence of the positions (Figure 2A). The vast 189 majority of the observed deviations from the independence assumption were consistent 190 with the formation of an RNA stem-loop structure (example in Figure 2B). In addition, 191 we identified one RBP, LARP6, that bound to a predicted internal loop embedded in a 192 double-stranded RNA stem (Figure 2C, Figure S6). This binding specificity is consistent 193 with the earlier observation that LARP6 binds to stem-loops with internal loops found in 194 mRNAs encoding the collagen proteins COL1A1, COL1A2 and COL3A1 (Cai et al., 2010) 195 (Figure S6). 196 In total, 69% (59 of 86) of RBPs recognized linear sequence motifs that did not 197 appear to have a preference for a specific RNA secondary structure. The remaining 31% 198 (27 of 86) of RBPs could bind at least one structured motif (Figure 2D); this group 199 included several known structure-specific RBPs, such as RC3H1, RC3H2 (Leppek et al., 200 2013), RBMY1E, RBMY1F, RBMY1J (Skrisovska et al., 2007) and HNRNPA1 (Chen et al., 201 2016; Orenstein et al., 2018). A total of 15 RBPs bound exclusively to structured motifs, 202 whereas 12 RBPs could bind to both structured and unstructured motifs. The median 203 length of the stem region observed in all motifs was 5 bp, and the loops were between 3 204 and 15 bases long, with a median length of 11 (Figure 2E). Of the RBP families, KH and 205 HEXIM motifs we found were linear, whereas some proteins from RRM, CSD, Zinc finger 206 and LA-domain families could bind to both structured and unstructured sites (Figure S7). 207 To model RBP binding to stem-loop structures, we developed a simple stem-loop 208 model (SLM; Figure 2B). This model describes the loop as a position independent model 209 (PWM), and the stem by a nucleotide pair model where the frequency of each 210 combination of two bases at the paired positions is recorded. In addition, we developed 211 two different visualizations of the model, a T-shaped motif that describes the 212 mononucleotide distribution for the whole model, and the frequency of each set of bases

5

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

213 at the paired positions by thickness of edges between the bases (Figure 3), and a simple 214 shaded PWM where the stem part is indicated by a gray background where the darkness 215 of the background indicates the fraction of bases that pair with each other using Watson- 216 Crick or G:U base pairs (Figure 3). On average, the SLM increased the information content 217 of the motifs by 4.2 bits (Figure S8). A more detailed analysis of the number of paired 218 bases within 10 bp from the seed sequence of MKRN1 revealed, as expected from the 219 correlation structure, that >80% of individual sequence reads had more than four paired 220 bases, compared to ~15% for the control RBP (ZRANB2) for which a structured motif was 221 not identified (Figure 3). 222 223 224 Classification of RBP motifs 225 To analyze the motif collection globally, we developed PWM and SLM models for 226 all RBPs. To compare the motifs, we determined their similarity using SSTAT (Pape et al., 227 2008). To simplify the analysis, PWM models were used for this comparison even for 228 RBPs that bound to the structured motifs. We then used the dominating set method 229 (Jolma et al., 2013) to identify distinctly different motifs (Figure S9). Comparison of the 230 motifs revealed that in general, the specificities of evolutionarily related RBPs were 231 similar (Figure 4 and Figure S9). For the largest family, RRM, a total of 96 motifs were 232 represented by 47 specificity classes, whereas the smaller families CCCH, KH, CSD, and 233 HEXIM were represented by 9, 10, 6 and 1 classes, representing 17, 11, 7 and 2 individual 234 motifs, respectively (Figure S9). 235 Analysis of the dinucleotide content of all motifs revealed unexpected differences 236 in occurrence of distinct dinucleotides within the PWMs. The dinucleotides GG, GU, UG 237 and UU were far more common than other dinucleotides (Figure 4G; fold change 2.75; p 238 < 0.00225; t-test). This suggests that G and U bases are most commonly bound by RBPs. 239 This effect was not due to the presence of stem structures in the motifs, as the 240 unstructured motifs were also enriched in G and U. The masking of G and U bases by 241 protein binding may assist in folding of RNA to defined structures, as G and U bases have 242 lower specificity in base-pairing than C and A, due to the presence of the non-Watson- 243 Crick G:U base pairs in RNA. 244 Most RBPs bound to only one motif. However, 41 RBPs could bind to multiple 245 different motifs (Figure 5). In five cases, the differences between the primary and 246 secondary motif could be explained by a difference in spacing between the two half-sites. 247 In 12 cases, one of the motifs was structured, and the other linear. In addition, in eight 248 RBPs the primary and secondary motifs represented two different structured motifs, 249 where the loop length or the loop sequence varied (Figure 5). In addition, for four RBPs, 250 we recovered more than two different motifs. The most complex binding specificity we 251 identified belonged to LARP6 (Figures 5 and S10), which could bind to multiple simple 252 linear motifs, multiple dimeric motifs, and the internal loop-structure described above. 253 254 Conservation and occurrence of motif matches 255 256 We next analyzed the enrichment of the motif occurrences in different classes of 257 human transcripts. The normalized density of motifs for each factor at both strands of 258 DNA was evaluated for transcription start sites (TSSs), splice donor and acceptor sites, 259 and translational start and stop positions (see Figure S11 and Supplemental Data S1 260 for full data). This analysis revealed that many RBP recognition motifs were enriched at

6

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

261 splice junctions. The most enriched linear motif in splice donor sites was ZRANB2, a 262 known regulator of (Figure 6A) (Loughlin et al., 2009). Analysis of 263 matches to structured motifs revealed even stronger enrichment of motifs for ZC3H12A, 264 B and C to splice donor sites (Figure 6A). These results suggest a novel role for ZC3H12 265 proteins in regulation of splicing. The motifs for both ZRANB2 and ZC3H12 protein 266 factors were similar but not identical to the canonical splice donor consensus sequence 267 (ag | GURagu) that is recognized by the spliceosome, suggesting that these proteins may 268 act by binding to a subset of splice donor sites. 269 Analysis of splice acceptor sites also revealed that motifs for known components 270 of the spliceosome, such as RBM28 (Damianov et al., 2006), were enriched in introns and 271 depleted in exons. Several motifs were also enriched at the splice junction, including the 272 known regulators of splicing IGF2BP1 and ZFR (Supplemental Data S1) (Haque et al., 273 2018; Huang et al., 2018). In addition, we found several motifs that mapped to the 5’ of 274 the splice junction, including some known splicing factors such as QKI (Hayakawa-Yano 275 et al., 2017) and ELAVL1 (Bakheet et al., 2018), and some factors such as DAZL, CELF1 276 and BOLL for which a role in splicing has to our knowledge not been reported (Figure 6A 277 and Supplemental Data S1) (Rosario et al., 2017; Xia et al., 2017). 278 To determine whether the identified binding motifs for RBPs are biologically 279 important, we analyzed the conservation of the motif matches in mammalian genomic 280 sequences close to splice junctions. This analysis revealed strong conservation of several 281 classes of motifs in the transcripts (Figure 6B), indicating that many of the genomic 282 sequences matching the motifs are under purifying selection. 283 Both matches to ZRANB2 and ZC3H12 motifs were also enriched in 5’ regions of 284 the sense-strands of known transcripts, but not on the corresponding anti-sense strands. 285 In addition, no enrichment was detected in the potential transcripts that would originate 286 from the same promoters and extend in a direction opposite to that of the mRNAs (Figure 287 6C). These results suggest that ZRANB2 and ZC3H12 motifs could have a role in 288 differentiating between two RNA species that originate from bidirectional promoters (the 289 sense mRNA transcript and a potential anti-sense non-coding RNA transcript originating 290 from the same bidirectional promoter). 291 To identify biological roles of the motifs, we also used Enrichment 292 analysis to identify motifs that were enriched in specific types of mRNAs. This analysis 293 revealed that many RBP motifs are specifically enriched in particular classes of 294 transcripts. For example, we found that MEX3B motifs were enriched in involved 295 in type I interferon-mediated signaling pathway (Figure 6D). Taken together, our 296 analysis indicates that RBP motifs are biologically relevant, as matches to the motifs are 297 conserved, and occur specifically in genomic features and in transcripts having specific 298 biological roles. 299 300 301 302

7

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

303 DISCUSSION 304 305 In this work, we have determined the RNA-binding specificities of a large 306 collection of human RNA-binding proteins. The tested proteins included both proteins 307 with canonical RNA binding domains and putative RBPs identified experimentally 308 (Gerstberger et al., 2014; Ray et al., 2013). The method used for analysis involved 309 selection of RNA ligands from a collection of random 40 nucleotide sequences. Compared 310 to previous analyses of RNA-binding proteins, the HTR-SELEX method allows 311 identification of structured motifs, and motifs that are relatively high in information 312 content. The method can identify simple sequence motifs or structured , provided 313 that their information content is less than ~40 bits. However, due to the limit on 314 information content, and requirement of relatively high-affinity binding, the method does 315 not generally identify highly structured RNAs that in principle could bind to almost any 316 protein. Consistent with this, most binding models that we could identify were for 317 proteins containing canonical RBPs. 318 Motifs were identified for a total of 86 RBPs. Interestingly, a large fraction of all 319 RBPs (47%) could bind to multiple distinctly different motifs. The fraction is much higher 320 than that observed for double-stranded DNA binding transcription factors, suggesting 321 that sequence recognition and/or individual binding domain arrangement on single- 322 stranded RNA can be more flexible than on dsDNA. Analysis of the mononucleotide 323 content of all the models also revealed a striking bias towards recognition of G and U over 324 C and A. This may reflect the fact that formation of RNA structures is largely based on 325 base pairing, and that G and U are less specific in their base pairings that C and A. Thus, 326 RBPs that mask G and U bases increase the overall specificity of RNA folding in cells. 327 Similar to proteins, depending on sequence, single-stranded nucleic acids may fold 328 into complex and stable structures, or remain largely disordered. Most RBPs preferred 329 short linear RNA motifs, suggesting that they recognize RNA motifs found in unstructured 330 or single-stranded regions. However, approximately 31% of all RBPs preferred at least 331 one structured motif. The vast majority of the structures that they recognized were 332 simple stem-loops, with relatively short stems, and loops of 3-15 bases. Most of the base 333 specificity of the motifs was found in the loop region, with only one or few positions in 334 the stem displaying specificity beyond that caused by the paired bases. This is consistent 335 with the structure of fully-paired double-stranded RNA where edge hydrogen- 336 bonding information is largely inaccessible in the deep and narrow major groove. In 337 addition, we identified one RBP that bound to a more complex structure. LARP6 338 recognized an internal loop structure where two base-paired regions were linked by an 339 uneven number of unpaired bases. 340 Compared to TFs, which display complex dimerization patterns when bound to 341 DNA, RBPs displayed simpler dimer spacing patterns. This is likely due to the fact that the 342 backbone of a single-stranded nucleic acid has rotatable bonds. Thus, cooperativity 343 between two RBDs requires that they bind to relatively closely spaced motifs. 344 Analysis of the biological roles of the RBP motif matches indicated that many motif 345 matches were conserved, and specifically located at genomic features such as splice 346 junctions. In particular, our analysis suggested a new role for ZC3H12, BOLL and DAZL 347 proteins in regulating alternative splicing, and MEX3B in binding to type I interferon- 348 regulated genes. In particular, the binding of the anti-viral cytoplasmic ZC3H12 proteins 349 (Habacher and Ciosk, 2017; Lin et al., 2013) to splice junctions may have a role in their 350 anti-viral activity, as endogenous cytoplasmic mRNAs are depleted of splice donor

8

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

351 sequences. As a large number of novel motifs were generated in the study, we expect that 352 many other RBPs will have specific roles in particular biological functions. 353 Our results represent the largest single systematic study of human RNA-binding 354 proteins to date. This class of proteins is known to have major roles in RNA metabolism, 355 splicing and gene expression. However, the precise roles of RBPs in these biological 356 processes are poorly understood, and in general the field has been severely understudied. 357 The generated resource will greatly facilitate research in this important area. 358 359

9

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

360 ACKNOWLEDGEMENTS 361 362 We thank Drs. Minna Taipale and Bernhard Schmierer for the critical review of the 363 manuscript as well as Sandra Augsten, Lijuan Hu and Anna Zetterlund for the technical 364 assistance. The work was supported by a travel and project grant support (E.M., L.J.M.) 365 from the Mayo Clinic – Karolinska Institutet collaboration partnership as well as the Knut 366 and Alice Wallenberg Foundation (KAW 2013.0088) and the Swedish Research Council 367 (Postdoctoral grant, 2016-00158). 368 369 370 AUTHOR CONTRIBUTIONS 371 372 J.T., A.J and L.J.M. designed the experiments; A.J., E.M. and Y.Y. performed the 373 SELEX experiments; A.J., J.Z., J.T., T.K., T.R.H., Q.M. and F.Z. analyzed the data; J.T., A.J. and 374 J.Z. wrote the manuscript. 375 376 377 DATA AVAILABILITY 378 379 All next generation sequencing data have been deposited to European Nucleotide 380 Archive (ENA) under Accession PRJEB25907. All computer programs and scripts used are 381 either published or available upon request. Requests for materials should be addressed 382 to J.T. ([email protected]). 383 384 385 DECLARATION OF INTERESTS 386 387 The authors declare no competing interests. 388 389

10

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

390 REFERENCES 391 392 Bakheet, T., Hitti, E., Al-Saif, M., Moghrabi, W.N., and Khabar, K.S.A. (2018). The AU-rich 393 element landscape across human transcriptome reveals a large proportion in introns and 394 regulation by ELAVL1/HuR. Biochim Biophys Acta 1861, 167-177. 395 Cai, L., Fritz, D., Stefanovic, L., and Stefanovic, B. (2010). Binding of LARP6 to the 396 conserved 5' stem-loop regulates translation of mRNAs encoding type I collagen. J Mol 397 Biol 395, 309-326. 398 Chen, Y., Zubovic, L., Yang, F., Godin, K., Pavelitz, T., Castellanos, J., Macchi, P., and Varani, 399 G. (2016). Rbfox proteins regulate microRNA biogenesis by sequence-specific binding to 400 their precursors and target downstream Dicer. Nucleic Acids Res 44, 4381-4395. 401 Cook, K.B., Kazan, H., Zuberi, K., Morris, Q., and Hughes, T.R. (2011). RBPDB: a database of 402 RNA-binding specificities. Nucleic Acids Research 39, D301-D308. 403 Damianov, A., Kann, M., Lane, W.S., and Bindereif, A. (2006). Human RBM28 protein is a 404 specific nucleolar component of the spliceosomal . Biol Chem 387, 1455-1460. 405 Darnell, R.B. (2010). HITS-CLIP: panoramic views of protein-RNA regulation in living cells. 406 Wiley interdisciplinary reviews. RNA 1, 266-286. 407 Dominguez, D., Freese, P., Alexis, M.S., Su, A., Hochman, M., Palden, T., Bazile, C., Lambert, 408 N.J., Van Nostrand, E.L., Pratt, G.A., et al. (2018). Sequence, Structure, and Context 409 Preferences of Human RNA Binding Proteins. Mol Cell 70, 854-867 e859. 410 Farazi, T.A., Leonhardt, C.S., Mukherjee, N., Mihailovic, A., Li, S., Max, K.E., Meyer, C., Yamaji, 411 M., Cekan, P., Jacobs, N.C., et al. (2014). Identification of the RNA recognition element of 412 the RBPMS family of RNA-binding proteins and their transcriptome-wide mRNA targets. 413 RNA (New York, N.Y.) 20, 1090-1102. 414 Gerstberger, S., Hafner, M., and Tuschl, T. (2014). A census of human RNA-binding 415 proteins. Nature reviews. Genetics 15, 829-845. 416 Glisovic, T., Bachorik, J.L., Yong, J., and Dreyfuss, G. (2008). RNA-binding proteins and 417 post-transcriptional gene regulation. FEBS letters 582, 1977-1986. 418 Habacher, C., and Ciosk, R. (2017). ZC3H12A/MCPIP1/Regnase-1-related endonucleases: 419 An evolutionary perspective on molecular mechanisms and biological functions. 420 Bioessays 39. 421 Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, 422 A., Ascano, M., Jr., Jungkamp, A.C., Munschauer, M., et al. (2010). Transcriptome-wide 423 identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 141, 424 129-141. 425 Haque, N., Ouda, R., Chen, C., Ozato, K., and Hogg, J.R. (2018). ZFR coordinates crosstalk 426 between RNA decay and transcription in innate immunity. Nat Commun 9, 1145. 427 Hayakawa-Yano, Y., Suyama, S., Nogami, M., Yugami, M., Koya, I., Furukawa, T., Zhou, L., 428 Abe, M., Sakimura, K., Takebayashi, H., et al. (2017). An RNA-binding protein, Qki5, 429 regulates embryonic neural stem cells through pre-mRNA processing in cell adhesion 430 signaling. Genes Dev 31, 1910-1925. 431 Huang, H., Weng, H., Sun, W., Qin, X., Shi, H., Wu, H., Zhao, B.S., Mesquita, A., Liu, C., Yuan, 432 C.L., et al. (2018). Recognition of RNA N(6)-methyladenosine by IGF2BP proteins 433 enhances mRNA stability and translation. Nat Cell Biol 20, 285-295. 434 Jolma, A., Yan, J., Whitington, T., Toivonen, J., Nitta, K.R., Rastas, P., Morgunova, E., Enge, 435 M., Taipale, M., Wei, G., et al. (2013). DNA-binding specificities of human transcription 436 factors. Cell 152, 327-339.

11

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

437 Jolma, A., Yin, Y., Nitta, K.R., Dave, K., Popov, A., Taipale, M., Enge, M., Kivioja, T., Morgunova, 438 E., and Taipale, J. (2015). DNA-dependent formation of transcription factor pairs alters 439 their binding specificity. Nature 527, 384-388. 440 Korhonen, J.H., Palin, K., Taipale, J., and Ukkonen, E. (2017). Fast motif matching revisited: 441 high-order PWMs, SNPs and indels. Bioinformatics 33, 514-521. 442 Lambert, N.J., Robertson, A.D., and Burge, C.B. (2015). RNA Bind-n-Seq: Measuring the 443 Binding Affinity Landscape of RNA-Binding Proteins. Methods in enzymology 558, 465- 444 493. 445 Lamesch, P., Li, N., Milstein, S., Fan, C., Hao, T., Szabo, G., Hu, Z., Venkatesan, K., Bethel, G., 446 Martin, P., et al. (2007). hORFeome v3.1: a resource of human open reading frames 447 representing over 10,000 human genes. Genomics 89, 307-315. 448 Leppek, K., Schott, J., Reitter, S., Poetz, F., Hammond, M.C., and Stoecklin, G. (2013). Roquin 449 promotes constitutive mRNA decay via a conserved class of stem-loop recognition motifs. 450 Cell 153, 869-881. 451 Lin, R.J., Chien, H.L., Lin, S.Y., Chang, B.L., Yu, H.P., Tang, W.C., and Lin, Y.L. (2013). MCPIP1 452 ribonuclease exhibits broad-spectrum antiviral effects through viral RNA binding and 453 degradation. Nucleic Acids Res 41, 3314-3326. 454 Lorenz, R., Bernhart, S.H., Honer Zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P.F., and 455 Hofacker, I.L. (2011). ViennaRNA Package 2.0. Algorithms Mol Biol 6, 26. 456 Loughlin, F.E., Mansfield, R.E., Vaz, P.M., McGrath, A.P., Setiyaputra, S., Gamsjaeger, R., 457 Chen, E.S., Morris, B.J., Guss, J.M., and Mackay, J.P. (2009). The zinc fingers of the SR-like 458 protein ZRANB2 are single-stranded RNA-binding domains that recognize 5' splice site- 459 like sequences. Proc Natl Acad Sci U S A 106, 5581-5586. 460 Loytynoja, A., and Goldman, N. (2005). An algorithm for progressive multiple alignment 461 of sequences with insertions. Proc Natl Acad Sci U S A 102, 10557-10562. 462 Mathews, D.H. (2014). RNA Secondary Structure Analysis Using RNAstructure. Curr 463 Protoc Bioinformatics 46, 12 16 11-25. 464 Nitta, K.R., Jolma, A., Yin, Y., Morgunova, E., Kivioja, T., Akhtar, J., Hens, K., Toivonen, J., 465 Deplancke, B., Furlong, E.E., et al. (2015). Conservation of transcription factor binding 466 specificities across 600 million years of bilateria evolution. Elife 4. 467 Orenstein, Y., Ohler, U., and Berger, B. (2018). Finding RNA structure in the unstructured 468 RBPome. BMC Genomics 19, 154. 469 Pape, U.J., Rahmann, S., and Vingron, M. (2008). Natural similarity measures between 470 position frequency matrices with an application to clustering. Bioinformatics 24, 350-357. 471 Ray, D., Kazan, H., Chan, E.T., Pena Castillo, L., Chaudhry, S., Talukder, S., Blencowe, B.J., 472 Morris, Q., and Hughes, T.R. (2009). Rapid and systematic analysis of the RNA recognition 473 specificities of RNA-binding proteins. Nat Biotechnol 27, 667-670. 474 Ray, D., Kazan, H., Cook, K.B., Weirauch, M.T., Najafabadi, H.S., Li, X., Gueroussov, S., Albu, 475 M., Zheng, H., Yang, A., et al. (2013). A compendium of RNA-binding motifs for decoding 476 gene regulation. Nature 499, 172-177. 477 Rosario, R., Childs, A.J., and Anderson, R.A. (2017). RNA-binding proteins in human 478 oogenesis: Balancing differentiation and self-renewal in the female fetal germline. Stem 479 Cell Res 21, 193-201. 480 Skrisovska, L., Bourgeois, C.F., Stefl, R., Grellscheid, S.N., Kister, L., Wenter, P., Elliott, D.J., 481 Stevenin, J., and Allain, F.H. (2007). The testis-specific human protein RBMY recognizes 482 RNA through a novel mode of interaction. EMBO Rep 8, 372-379. 483 Tuerk, C., and Gold, L. (1990). Systematic evolution of ligands by exponential enrichment: 484 RNA ligands to bacteriophage T4 DNA polymerase. Science (New York, N.Y.) 249, 505- 485 510.

12

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

486 Varani, G., and McClain, W.H. (2000). The G x U wobble base pair. A fundamental building 487 block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep 488 1, 18-23. 489 Weyn-Vanhentenryck, S.M., Mele, A., Yan, Q., Sun, S., Farny, N., Zhang, Z., Xue, C., Herre, M., 490 Silver, P.A., Zhang, M.Q., et al. (2014). HITS-CLIP and integrative modeling define the 491 Rbfox splicing-regulatory network linked to brain development and autism. Cell reports 492 6, 1139-1152. 493 Xia, H., Chen, D., Wu, Q., Wu, G., Zhou, Y., Zhang, Y., and Zhang, L. (2017). CELF1 494 preferentially binds to exon-intron boundary and regulates alternative splicing in HeLa 495 cells. Biochim Biophys Acta 1860, 911-921. 496 Yin, Y., Morgunova, E., Jolma, A., Kaasinen, E., Sahu, B., Khund-Sayeed, S., Das, P.K., 497 Kivioja, T., Dave, K., Zhong, F., et al. (2017). Impact of cytosine methylation on DNA 498 binding specificities of human transcription factors. Science (New York, N.Y.) 356.

13

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

499 FIGURE LEGENDS 500 501 Figure 1. RNA HT-SELEX protocol and data-analysis 502 (A) Schematic illustration of the HTR-SELEX process. RBD or full-length RBPs expressed 503 in E.coli as HIS6-tagged fusion proteins (top left) were purified and incubated with 504 barcoded RNA selection ligands. RNA ligands bound by the proteins were recovered by 505 RT-PCR, followed by in vitro transcription to generate the RNA for the next cycle of SELEX 506 (left middle). The procedure was repeated at least three times and the ligands recovered 507 from the selection cycles were subjected to Illumina sequencing (left bottom) with data 508 analysis to generate binding specificity models (right). 509 (B) Comparison of the number of RBPs with motifs derived in the present study (HTR- 510 SELEX), RNAcompete (Ray et al., 2013), RNA Bind-n-Seq (RBNS) (Dominguez et al., 2018) 511 and RBPDB database (Cook et al., 2011). Note that our analysis revealed motifs for 38 512 RBPs for which a motif was not previously known. 513 (C) Distribution of RBP with motifs classified by structural family of the RBDs. 514 515 Figure 2. Detection of linear or structured RNA binding models 516 (A) ZRANB2 binds to a linear RNA motif. The motif of ZRANB2 is shown below the 517 triangular correlation heatmap. The heatmap illustrates deviation of the observed 518 nucleotide distributions from those predicted by a mononucleotide model where bases 519 are independent. 520 (B) MKRN1 binds preferentially to a stem-loop. Note a diagonal series of red tiles (boxed) 521 that indicates pairs of bases whose distribution deviates from the independence 522 assumption. These bases are shaded in the motif below the triangle. The interdependency 523 occurs between bases that are at the same distance from the center of the motif, 524 consistent with formation of a stem-loop structure. Right top: A RNAfold-predicted stem- 525 loop structure for a sequence that was highly enriched in the experiment. 526 (C) LARP6 binds to a complex internal loop RNA structure. The left panel indicates the 527 dinucleotide dependencies with the heatmap on top representing the preferred spacing 528 length between base pairing sequences of stem 1, whereas the right panel presents a 529 predicted structure of the bound RNA. The dashed line in the structure denotes the 530 internal base pair. 531 (D) Fraction of RBPs with linear and structured binding specificities. 532 (E) Length distribution of stem and loop for the structured motifs. 533 534 Figure 3. Comparison between linear PWM and stem loop (SLM) models. 535 Left: Visualization of the stem loop models. A T-shape model (top) shows a horizontal 536 loop and a vertical stem where the frequency of each base combination is shown. Bases 537 are aligned so that Watson-Crick base pairs orient horizontally. Pie-charts show 538 frequency of Watson-Crick (green) and G-U base pairs (light green) at each position of 539 the predicted stem. A linear visualization (bottom) where the base pairing frequency is 540 indicated by the darkness of gray shading is also shown. 541 Middle: Schematic description of the scoring process for the SLM. All possible alignment 542 positions between an 8-mer with a 4 base gap in the middle and the model are searched 543 in order to find the aligned position with the best score. When the 8-mer overlaps both 544 bases of a SLM-predicted base-pair, the score for the paired position (red tiles connected 545 by black lines) is derived from the SLM base-pair score. In cases where the kmer aligns to 546 only one base of the SLM base-pair, the score for the position (black) is derived from the 547 mononucleotide matrix.

14

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

548 Right: RNA secondary structure prediction analysis using RNAfold reveals that sequences 549 flanking MKRN1 loop sequence form base pairs (top), whereas bases on the flanks of 550 ZRANB2 matches (bottom) are mostly unpaired. 551 552 Figure 4. Comparison between the obtained motifs 553 Similar RBPs bind to similar motifs. Motifs were classified into six major categories based 554 on structural class of the RBPs (A to H). Dendrograms are based on amino-acid alignment 555 using PRANK. Within RRM family, RBPs with a distinct number of RRMs were grouped 556 and aligned separately. Motifs shown are the primary motif for each RBP. Asterisks 557 indicate a stem-loop structured motif, with the gray shading showing the strength of the 558 base pairing at the corresponding position. Two asterisks indicate that the RBP can bind 559 to a structured secondary motif. Only families with more than one representative RHT- 560 SELEX motif are shown. 561 (G) RBPs commonly prefer sequences with G or U nucleotides. Frequencies of all 562 mononucleotides (left) and dinucleotides (right) across all of the RBP motifs. Note that G 563 and U are overrepresented. 564 565 Figure 5. Many RBPs can recognize more than one motif 566 Pie chart (top) indicates fraction of RBPs that recognize more than one motif. Primary 567 (left) and secondary (right) motifs are shown, classified according to the RBP structural 568 family. Right column indicates whether the RBP binds to two linear motifs (+), two 569 structured motifs (red circle) or both types of motifs (green circle). 570 571 Figure 6. RBP motif matches are conserved and enriched in distinct sequence 572 features 573 (A) Strong enrichment of RBP motif matches at or near the splicing donor and acceptor 574 sites. Mononucleotide frequencies at splice donor and acceptor sites are shown on top, 575 above the gene schematic. Left: meta-plots indicate the enrichment of ZRANB2 and 576 ZC3H12C motif matches at splice donor sites. Right: enrichment of BOLL and DAZL at 577 splice acceptor sites. Blue dots indicate the number of matches in the sense strand at each 578 base position; black line indicates the locally weighted smoothing (LOESS) curve in 10 579 base sliding windows. Corresponding values for the anti-sense strand are shown as light 580 blue dots and dotted black line, respectively. 581 (B) The conservation of motif binding sites in sense vs. antisense strand. For each feature 582 type (acceptor, donor site) and binding motif, thousand highest affinity sites in total were 583 selected from the hundred base windows centered at the features. The total number the 584 conserved sites (x-axis) is plotted against odds ratio of conserved vs. non-conserved site 585 being in sense strand (y-axis). Those motifs for which conservation was significantly 586 associated with sense strand (one-sided Fisher’s exact test) are shown in green. The five 587 motifs with smallest p-values are named. 588 (C) Enrichment of ZRANB2 and ZC3H12C motif matches near transcription start sites 589 (TSS). Note that matches are only enriched on the sense strand downstream of the TSS. 590 (D) Gene Ontology enrichment of MEX3B motif matches. The top 100 genes with highest 591 motif-matching score density were used to conduct the Gene Ontology enrichment 592 analysis. The enriched GO terms were simplified by their similarity (cutoff=0.5). The 593 fraction of genes and their counts in the GO categories are also shown (Gene Ratio, Count, 594 respectively).

15

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

595 MATERIALS AND METHODS 596 597 598 Clone collection, cloning and protein production 599 600 Clones were either collected from the human Orfeome 3.1 and 8.1 clone libraries 601 (full length clones) or ordered as synthetic genes from Genscript (eRBP constructs). As in 602 our previous work (Jolma et al., 2013), protein-coding synthetic genes or full length ORFs 603 were cloned into pETG20A-SBP to create an E.coli expression vector that allows the RBP 604 or RBD cDNAs to be fused N-terminally to Thioredoxin+6XHis and C-terminally to SBP- 605 tags. Fusion proteins were then expressed in the Rosetta P3 DE LysS E.coli strain 606 (Novagen) using an autoinduction protocol (Jolma et al., 2015). All constructs described 607 in Table S1 were expressed and subjected to HTR-SELEX, regardless of protein level 608 expressed. Protein production was assessed in parallel by 96-well SDS-PAGE (ePage, 609 Invitrogen). The success rate of protein production was dependent on the size of the 610 proteins, with most small RBDs expressing well in E.coli. Significantly lower yield of 611 protein was observed for full-length proteins larger than 50 kDa. 612 After HIS-tag based IMAC purification, glycerol was added to a final concentration 613 of 10%. Samples were split to single-use aliquots with approximately 200 ng RBP in a 5µl 614 volume and frozen at -80°C. 615 616 617 Selection library generation 618 619 To produce a library of RNA sequences for selection (selection ligands), we first 620 constructed dsDNA templates by combining three oligonucleotides together in a three 621 cycle PCR reaction (Phusion, NEB). For information about the barcoded ligand design, see 622 Table S1. The ligand design was similar to that used in our previous work analyzing TF 623 binding specificities in dsDNA (Jolma et al., 2013) except for the addition of a T7 RNA 624 polymerase promoter in the constant flanking regions of the ligand. RNA was expressed 625 from the DNA-templates using T7 in vitro transcription (Ampliscribe T7 High Yield 626 Transcription Kit, Epicentre or Megascript-kit Ambion) according to manufacturer’s 627 instructions, after which the DNA-template was digested using RNAse-free DNAse I 628 (Epicentre) or the TURBO-DNAse supplied with the Megascript-kit. All RNA-production 629 steps included RiboGuard RNAse-inhibitor (Epicentre). 630 Two different approaches were used to facilitate the folding of RNA molecules. In 631 the protocol used in experiments where the batch identifier starts with letters “EM”, RNA- 632 ligands were heated to +70°C followed by gradual, slow cooling to allow the RNA to fold 633 into minimal energy structures, whereas in batches “AAG” and “AAH” RNA transcription 634 was not followed by such folding protocol. The rationale was that spontaneous co- 635 transcriptional RNA-folding may better reflect folded RNA structures in the in vivo 636 context. In almost all of the cases where the same RBPs were tested with both of the 637 protocols the results were highly similar. 638 639 640 HTR-SELEX assay 641 642 Selection reactions were performed as follows: ~200ng of RBP was mixed on ice 643 with ~1µg of the RNA selection ligands to yield approximate 1:5 molar ratio of protein to

16

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

644 ligand in 20µl of Promega buffer (50 mM NaCl, 1 mM MgCl2, 0.5 mM Na2EDTA and 4% 645 glycerol in 50 mM Tris-Cl, pH 7.5). The complexity of the initial DNA library is 646 approximately 1012 DNA molecules with 40 bp random sequence (~20 molecules of each 647 20 bp sequence on the top strand). The upper limit of detection of sequence features of 648 HTR-SELEX is thus around 40 bits of information content. 649 The reaction was incubated for 15 minutes at +37°C followed by additional 15 650 minutes at room temperature in 96-well microwell plates (4-titude, USA), after which the 651 reaction was combined with 50 µl of 1:50 diluted paramagnetic HIS-tag beads (His Mag 652 Sepharose excel, GE-Healthcare) that had been blocked and equilibrated into the binding 653 buffer supplemented with 0.1% Tween 20 and 0.1µg/µl of BSA (Molecular Biology Grade, 654 NEB). Protein-RNA complexes were then incubated with the magnetic beads on a shaker 655 for further two hours, after which the unbound ligands were separated from the bound 656 beads through washing with a Biotek 405CW plate washer fitted with a magnetic 657 platform. After the washes, the beads were suspended in heat elution buffer (0.5 µM RT- 658 primer, 1 mM EDTA and 0.1% Tween20 in 10 mM Tris-Cl buffer, pH 7) and heated for 5 659 minutes at 70°C followed by cooling on ice to denature the proteins and anneal the 660 reverse transcription primer to the recovered RNA library, followed by reverse 661 transcription and PCR amplification of the ligands. The efficiency of the selection process 662 was evaluated by running a qPCR reaction on parallel with the standard PCR reaction. 663 664 665 Sequencing and generation of motifs 666 667 PCR products from RNA libraries (indexed by bar-codes) were pooled together, 668 purified using a PCR-purification kit (Qiagen) and sequenced using Illumina HiSeq 2000 669 (55 bp single reads). Data was de-multiplexed, and initial data analysis performed using 670 the Autoseed algorithm (Nitta et al., 2015) that was further adapted to RNA analysis by 671 taking into account only the transcribed strand and designating uracil rather than 672 thymine. Autoseed identifies gapped and ungapped kmers that represent local maximal 673 counts relative to similar sequences within their Huddinge neighborhood (Nitta et al., 674 2015). It then generates a draft motif using each such kmer as a seed. This initial set of 675 motifs is then refined manually to identify the final seeds (Table S2), to remove artifacts 676 due to selection bottlenecks and common “aptamer” motifs that are enriched by the HTR- 677 SELEX process itself, and motifs that are very similar to each other. To assess initial data, 678 we compared the deduced motifs to known motifs, to replicate experiments and 679 experiments performed with paralogous proteins. Individual results that were not 680 supported by replicate or prior experimental data were deemed inconclusive and were 681 not included in the final dataset. Draft models were manually curated (AJ, JT, QM, TRH) 682 to identify successful experiments, and final models were generated using the seeds 683 indicated in Table S2. 684 Autoseed detected more than one seed for many RBPs. Up to four seeds were used 685 to generate a maximum of two unstructured and two structured motifs. Of these, the 686 motif with largest number of seed matches using the multinomial setting indicated on 687 Table S2 was designated the primary motif. The motif with the second largest number of 688 matches was designated the secondary motif. The counts of the motifs represent the 689 prevalence of the corresponding motifs in the sequence pool (Table S2). Only these 690 primary and secondary motifs were included in further analyses. Such additional motifs 691 are shown for LARP6 in Figure S10.

17

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

692 To find RBPs that bind to dimeric motifs, we visually examined the PWMs to find 693 direct repeat pattern of three or more base positions, with or without a gap between them 694 (see Table S2). The presence of such repetitive pattern could be either due to dimeric 695 binding, or the presence of two RBDs that bind to similar sequences in the same protein. 696 To identify structured motifs, we visually investigated the correlation diagrams 697 for each seed to find motifs that displayed the diagonal pattern evident in Figure 2B. The 698 plots display effect size and maximal sampling error, and show the deviation of 699 nucleotide pair distribution from what is expected from the distribution of the individual 700 nucleotides. For each structured motif, SLM models (Table S3) were built from 701 sequences matching the indicated seeds; a multinomial 2 setting was used to prevent the 702 paired bases from influencing each other. Specifically, when the number of occurrence of 703 each pair of bases was counted at the base-paired positions, neither of the paired bases 704 was used to identify the sequences that were analyzed. The SLMs were visualized either 705 as the T-shaped logo (Figure 3) or as a PWM type logo where the bases that constitute 706 the stem were shaded based on the total fraction of A:U, G:C and G:U base pairs. 707 For analysis of RNA structure in Figures 2 and S6, sequences matching the regular 708 expression NNNNCAGU[17N]AGGCNNN or sequences of the three human collagen gene 709 transcripts (From 5’ untranslated and the beginning the coding sequence, the start codon 710 is marked with bold typeface: COL1A1 -CCACAAAGAGUCUACAUGUCUAGGGUCUAG- 711 ACAUGUUCAGCUUUGUGG; COL1A2- CACAAGGAGUCUGCAUGUCUAAGUGCUAGA- 712 CAUGCUCAGCUUUGUG and COL3A1 - CCACAAAGAGUCUACAUGGGUCAUGUUCAG- 713 CUUUGUGG) were analyzed using “RNAstructure” software (Mathews, 2014) through the 714 web-interface 715 in:http://rna.urmc.rochester.edu/RNAstructureWeb/Servers/Fold/Fold.html using 716 default settings. All structures are based on the program’s minimum energy structure 717 prediction. For analysis in Figure 3, we extracted all sequences that matched the binding 718 sequences of MKRN1 and ZRANB2 (GUAAAKUGUAG and NNNGGUAAGGUNN, 719 respectively; N denotes a weakly specified base) flanked with ten bases on both sides 720 from the cycle four of HTR-SELEX. Subsequently, we predicted their secondary structures 721 using the program RNAfold (Vienna RNA package; (Lorenz et al., 2011)) followed by 722 counting the predicted secondary structure at each base position in the best reported 723 model for each sequence. For both RBPs, the most common secondary structure for the 724 bases within the defined part of the consensus (GUAAAKUGUAG and GGUAAGGU) was the 725 fully single stranded state (82% and 30% of all predicted structures, respectively). To 726 estimate the secondary structure at the flanks, the number of paired bases formed 727 between the two flanks were identified for each sequence. Fraction of sequences with 728 specific number of paired bases are shown in Figure 3. 729 730 731 Motif mapping 732 733 To gain insight into the function of the RBPs, we mapped each motif to the whole 734 human genome (hg38). We applied different strategies for the linear and the stem-loop 735 motifs. For the linear motifs, we identified the motif matches with MOODS (Korhonen et 736 al., 2017) with the following parameter setting: --best-hits 300000 --no-snps. For the 737 stem-loop motifs, we implemented a novel method to score sequences against the SLMs. 738 The source code is available on GitHub: https://github.com/zhjilin/rmap. 739 We identified the 300,000 best scored matches in the genome, and further 740 included any matches that had the same score as the match with the lowest score, leading

18

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

741 to at least 300,000 matches for each motif. The matches were then intersected with the 742 annotated features from the ENSEMBL database (hg38, version 91), including the splicing 743 donor (DONOR), splicing acceptor (ACCEPTOR), the translation start codon 744 (STARTcodon), the translation stop codon (STOPcodon) and the transcription starting 745 site (TSS). The above features were filtered in order to remove short introns (<50bp), and 746 features with non-intact or non-canonical start codon or stop codon. The filtered features 747 were further extended 1kb both upstream and downstream in order to place the feature 748 in the centre of all the intervals. The motif matches overlapping the features were counted 749 using BEDTOOLS (version 2.15.0) and normalized by the total number of genomic 750 matches for the corresponding motif. 751 752 753 Motif comparisons and GO analysis 754 755 To assess the similarity between publicly available motifs and our HTR-SELEX 756 data, we aligned the motifs as described in (Jolma et al., 2015) (Figure S1). To determine 757 whether RBPs with similar RBDs recognize and bind to similar targets, we compared the 758 sequences of the RBDs and their motifs. First, the RBPs were classified based on the type 759 and number of RBDs. For each class, we then extracted the amino-acid sequence of the 760 RBPs starting from the first amino acid of the first RBD and ending at the last amino acid 761 of the last RBD. We also confirmed the annotation of the RBDs by querying each amino 762 acid sequence against that SMART database, and annotated the exact coordinates of the 763 domains through the web-tools: http://smart.embl-heidelberg.de and 764 http://smart.embl-heidelberg.de/smart/batch.pl. Sequence similarities and trees were 765 built using PRANK (Loytynoja and Goldman, 2005) (parameters: -d, -o, -showtree). The 766 structure of the tree representing the similarity of the domain sequence was visualized 767 using R (version 3.3.1). 768 For identification of classes of transcripts that are enriched in motif matches for 769 each RBP, we extracted the top 100 transcripts according to the score density of each RBP 770 motif. These 100 transcripts were compared to the whole transcriptome to conduct the 771 GO enrichment analysis for each motif using the R package ClusterProfiler (version 3.0.5). 772 To analyze conservation of motif matches, sites recognized by each motif were 773 searched from both strands of 100 bp windows centered at the features of interest 774 (acceptor, donor sites) using the MOODS program (version 1.0.2.1). For each motif and 775 feature type, 1000 highest affinity sites were selected for further analysis regardless of 776 the matching strand. Whether the evolutionary conservation of the high affinity sites was 777 explained by the motifs was tested using program SiPhy (version 0.5, task 16, 778 seedMinScore 0) and multiz100way multiple alignments of 99 vertebrate species to 779 human (downloaded from UCSC genome browser, version hg38). A site was marked as 780 being conserved according to the motif if its SiPhy score was positive meaning that the 781 aligned bases at the site were better explained by the motif than by a neutral evolutionary 782 model (hg38.phastCons100way.mod obtained from UCSC genome browser). Two motifs 783 were excluded from the analysis because the number of high affinity sites that could be 784 evaluated by SiPhy was too small. The hypothesis that the motif sites in the sense strand 785 were more likely to be conserved than sites in the antisense strand was tested against the 786 null hypothesis that there was no association between site strand and conservation using 787 Fisher’s exact test (one-sided). The P values given by the tests for individual motifs were 788 corrected for multiple testing using Holm’s method. 789

19

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

790 791 Mutual information calculation 792 793 The global pattern of motifs across the features tested was analyzed by calculating 794 the mutual information (MI) between 3-mer distributions at two non-overlapping 795 positions of the aligned RNA sequences. MI can be used for such analysis, because if a 796 binding event contacts two continuous or spaced 3-bp wide positions of the sequences at 797 the same time, the 3-mer distributions at these two positions will be correlated. Such 798 biased joint distribution can then be detected as an increase in MI between the positions. 799 Specifically, MI between two non-overlapping positions (pos1, pos2) was 800 estimated using the observed frequencies of a 3-mer pair (3+3-mer), and of its 801 constituent 3-mers at both positions:

802 where P(3+3-mer) is the observed probability of the 3-mer pair (i.e. gapped or ungapped 803 6 mer). Ppos1(3-mer) and Ppos2(3-mer), respectively, are the marginal probabilities of the 804 constitutive 3-mers at position 1 and position 2. The sum is over all possible 3-mer pairs. 805 To focus on RBPs that specifically bind to a few closely related sequences, such as 806 RBPs with well-defined motifs, it is possible to filter out most background non-specific 807 bindings (e.g., selection on the shape of RNA backbone) by restricting the MI calculation, 808 to consider only the most enriched 3-mer pairs for each two non-overlapping positions. 809 Such enriched 3-mer pair based mutual information (E-MI) is calculated by 810 summing MI over top-10 most enriched 3-mer pairs. 811

812 813 814 815 816

20

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

817 SUPPLEMENTAL INFORMATION 818 819 Figure S1. The similarity of motifs between HTR-SELEX and the curated dataset. In 820 total, 33 motifs from the CISBP-RNA database (RNAcompete) were collected to compare 821 with the HTR-SELEX derived motifs (both primary and secondary motifs). Motifs were 822 presented and organized according to their protein structure family by descending 823 similarity score. Higher score indicates higher similarity between motifs. 824 825 Figure S2. Motif comparison between HTR-SELEX and RNA Bind-n-Seq. The primary 826 and secondary motifs recovered from both experiments were presented, where in the 827 RBNS the primary motifs were determined by the most abundant 5-mers in the kmer 828 population. 829 830 831 Figure S3. Higher information content and wider width distribution of HTR-SELEX 832 motifs. The available PWMs generated by RNAcompete and SELEX were collected from 833 the CISBP-RNA database for comparison. The per base information was calculated for 834 every individual position in the PWM. The overall information content of each motif is the 835 sum of all positions in the PWM. The width of each motif was generated by counting the 836 number of position in the corresponding PWM. 837 838 Figure S4. RBPs with multimeric binding sites. About one third of RBPs (31.3%, left) 839 bind to the sequence as homodimers where two identical half-sites are separated by a 840 spacing sequence. The distribution of spacing preference of all RBPs is shown (right). The 841 discontinuous distribution of the spacing length is due to the small sample size. 842 843 Figure S5. Spacing preferences between dimeric binding sites are consistent in 844 different assays. (A) For four RBPs, the same seeds were used in different assays to 845 detect the spacing preferences. The heatmaps represent the spacing information 846 extracted from HTR-SELEX, PAR-CLIP and HITS-CLIP. The results are consistent between 847 HTR-SELEX (top row) and PAR-CLIP or HITS-CLIP (bottom row). (B) Pie charts show the 848 percentage of reads containing the indicated occurrence of the motifs CAC in RBPMS 849 binding motifs as determined by HTR-SELEX (left) or PAR-CLIP (middle). Occurrence of 850 the motif CCA is also shown (right). 851 852 Figure S6. Known binding motifs of LARP6. 853 The left three structures were generated using the sequences enriched in HTR-SELEX. 854 The right three structures illustrate the predicted structures of known collagen RNA 855 sequences. The dash-line indicates the internal base pair. The number labels the position 856 of the base in the RNA sequence. 857 858 Figure S7. RBP families with and without structural specificity 859 The count of RBPs recognizing structured and unstructured binding motifs in each 860 protein structure family. 861 862 Figure S8. Information content correlation between the SLM and the mono- 863 nucleotide PWM. Left. Information content correlation per base. Right. Overall

21

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

864 information content correlation. In general, the SLM yielded higher per base information 865 content due to the base pairing in the stem. 866 867 Figure S9. Dominating set of HTR-SELEX motifs. Cystoscope (Version 3.2.1) was used 868 to visualize the dominating set on top of the relationship map between motifs with a 869 cutoff of 5e-6 for similarity, calculated by SSTAT (see the method part). Motifs in the 870 dominating set are labeled in red. 871 872 Figure S10. Various binding specificities detected for LARP6. 873 LARP6 is able to recognise and bind to distinct sequences through different strategies 874 besides binding to the internal loop structure. (A) Short and long linear motifs (B) 875 unstructured motifs with gaps. The heatmap shows the preference of spacing between 876 two half sites. 877 878 Figure S11. The mutual information (MI) meta-plots around the splicing donor and 879 acceptor sites. The splice donor and acceptor sites are placed in the centre of the 147nts 880 sequence. The detected signals close to the donor and acceptor sites are shown in red. 881 The enriched 3-mer pair based mutual information (E-MI) are also shown. 882 883 Table S1. Sequence information of proteins and DNA. 884 Table S2. PWMs of the linear motifs. 885 Table S3. PWMs of the structured motifs 886 Table S4. Dependency matrices of paired bases for the structured motifs. 887 888 Supplemental Data S1. Meta-plots of the motif matches enrichment at splice donor, 889 acceptor, TSS, start and stop codon. 890

22

bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was Jolma et al.not certified201 by9 peer Figure review) is the author/funder. 1 All rights reserved. No reuse allowed without permission.

A HT-SELEX process ~950 RBP clones Synthetic ligands

E.coli RBPs RNAs HEXIM1

≥ x 3 CARHSP1

CELF1

Library preparation ZC3H12B

THUMPD1 Library Sequencing Motif discovery QKI 100 Reads 50 Watson-Crick & G-U 0 base-pairing (%) B C Comparison to other datasets RBPs based on protein domain

RBPs protein class HTR-SELEX RNA Bind−n−Seq SELEX Previous studies 200 no motif 33 11

150 10 3 2 100 23 4 2 38 RBP Counts 50 5 1 11 143 RNAcompete 15 HTR−SELEX 0

Znf KH S1 La RRM CSD YTH SAM PUF Other CCCH RanBP HEXIM THUMP RBP Family bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was

Jolma et al.not 2019certified by peer Figure review) is the author/funder.2 All rights reserved. No reuse allowed without permission.

U G

G

A B A U

Detection of structured motifs A

Detection of linear motifs A A

G

U

C G C

G U

A A

U

G

U

G

C C

G U

G G

C C

G G

U

G

U

G

G

U

G

G G Y G U A A A K U G U A G 0.0% A G G U A A G G U A 0.0% Y A G U G A 35.7% G 23.3% A U A A K A U 71.5% G 46.7% G G U U A 107.2% A 70.0% G max error: 17.3% max error: 21.6%

ZRANB2 MKRN1

C A Recognition of bulged stem-loop structures A G 0 10 20 30 G G CAGUNAGGC G U C G Stem 2 G C G C Probability >= 99% A U U 99% > Probability >= 95% U 95% > Probability >= 90% U 90% > Probability >= 80% A 80% > Probability >= 70% G Int. bp 70% > Probability >= 60% Stem 2 G 60% > Probability >= 50% A 50% > Probability Internal basepair C G G C Stem 1 Stem 1 1.0 U G Occurence of the C G A most common spacing 0.1 A

D RBPs with structured motifs E Length distribution of stem and loop

structured Stem linear Loop 15

31.4% 10 Counts

5 68.6% 0 4 8 12 16 Length bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure 3 MKRN1 T shape illustration of stem-loop model 3 Paired 4 Paired bases 2 or less bases Watson-Crick 1% paired bases 0% UG + GU 5 Paired 4% others position of scoring matrix bases 10 Paired 7% bases 4% 0 12345678 9 10 11 12 13 14 15 9 Paired 6 Paired Center bases bases 20% gapped kmer 15%

7 Paired 8 Paired bases bases 23% 26%

ZRANB2 8 Paired 7 Paired 9 Paired bases bases 6 Paired bases 2% 3% bases 1% 4% 5 Paired bases 10 Paired 3% base scored with mononucleotide matrix bases 0% 4 Paired base scored with SLM base-pair score bases base without overlap with the score matrix 2% gap in the kmer 3 Paired Motif with stem shaded indicates base-pair bases 0% 100 2 or less 50 Watson-Crick & G-U paired bases 0 base-pairing (%) 85% bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was Jolma et al.not 201certified by9 peer Figure review) is the author/funder.4 All rights reserved. No reuse allowed without permission. A LARP7 RRM BZNF ZCRB1 ZC3H12A MCM3AP C3H1 * ZC3H12C * HNRNPCL1 ** 1 domain ZC3H12B HNRNPC * ** RC3H2 * RALY * RC3H1 RNPC3 * ** 2 domains ZFP36 RBPMS ZC3H8 RBPMS2 3 domains ZC3H10 RBMS3 * 4 domains MKRN1 * RBM42 ZCRB1 1 C2HC BOLL RBM4 * DAZL * 1 C2H2 ZFR 1 domain DAZ4 * 2 RBZ ZRANB2 DAZ3 * C CSTF2 KH QKI SNRNP70 KHDRBS1 RBMY1F * 1 domain KHDRBS2 RBMY1J * KHDRBS3 RBMY1E * MEX3D CELF4 MEX3C RBM38 2 domains MEX3B RBM24 NOVA2 RBFOX3 * RBFOX1 3 domains IGF2BP1 * PCBP1 RBM6 * ESRP1 D SART3 CSD YBX1 * SNRPB2 CSDA SNRPA YBX2 CSTF2T CARHSP1 TARDBP E DAZAP1 HEXIM MSI2 HEXIM1 MSI1 HNRNPA0 HEXIM2 2 domains HNRNPA3 F PUF HNRNPA1 * PUM1 HNRNPA1L2 DAZ1 * PUM2 RBM4 * 100 RBM4B * 50 RBM14 Watson-Crick & G-U base-pairing (%) 0 RBMS2 * RBMS1 * ELAVL4 G mono−nucleotide di−nucleotides CELF3 AA CELF1 AC AG HNRNPL 15.2% 16.2% AU CA ELAVL3 CC A CG ELAVL1 C CU RBM46 G GA 3 domains U 11.2% GC HNRPLL 29.6% 14.1% GG 38.9% GU PABPC5 13.9% UA UC 4 domains RBM28 UG UU bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without100 permission. Jolma et al. 2019 Figure 5 50 Watson-Crick & G-U base-pairing (%) 52.3% 0 Primary motif only ● structured + structured 47.7% With secondary motif ● structured + linear Family Protein Primary Secondary + linear+ linear DAZ1 ● DAZ4 ● DAZL ● HNRNPA0 + HNRNPC ● HNRNPCL1 ● MSI2 + PABPC5 + RBM28 + RBM4 ● RBM4B ● RRM RBM6 ● RBMS1 ● RBMS2 ● RBMS3 ● RBMY1E ● RBMY1F ● RBMY1J ● RBPMS + RBPMS2 + RNPC3 ● SNRNP70 + SNRPA + SNRPB2 + TARDBP + ZCRB1 + RC3H2 ● ZC3H12A ZNF C3H1 ● ZC3H12B ● ZC3H12C ● ZNF 2xC3H1 ZC3H10 + ZNF 3xC3H1 ZFP36 + ZNF C2H2 ZFR_4 + ZNF RBZ ZRANB2 + CARHSP1 + CSD YBX1 ● YBX2 + PUF PUM2 + THUMP THUMPD1 + KH MEX3B + La LARP6 + bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was Jolma et al.not 2019 certified by peerFigure review) is the 6author/funder. All rights reserved. No reuse allowed without permission. A B Donors

exon intron intron exon ZC3H12C

ZRANB2 BOLL antisense sense antisense sense antisense sense antisense sense

odds ratio ZRANB2 0.04 0.0015 DAZAP1 SNRNP70 IGF2BP1 0.03 0.0010 0 5 10 15 0.02 0 200 400 600 800 1000 0.0005 conserved of all tested sites 0.01 normalized count normalized count 0.00 0.0000 -500 -250 0 250 500 -500 -250 0 250 500 position position Acceptors RBM4 ZC3H12C DAZL RNPC3 antisense sense antisense sense antisense sense antisense sense

0.15 0.0015 ZRANB2 odds ratio ZRANB2 0.10 0.0010 ESRP1_3 12345 6 0.05 0.0005 normalized count normalized count 0 200 400 600 800 1000 0.00 0.0000 conserved of all tested sites -500 -250 0 250 500 -500 -250 0 250 500 position position C D MEX3B

promoter TSS

ZRANB2 antisense sense antisense sense humoral immune response

1e 04

response to exogenous dsRNA Count 5e 05 3 5 normalized count 7 0e+00 9 -500 -250 0 250 500 11 position log10(p.adjust) -5.0 ZC3H12C -7.5 defense response to virus -10.0 antisense sense antisense sense blood coagulation -12.5

0.00020 positive regulation of peptidyl-serine phosphorylation of STAT protein natural killer cell activation involved in immune response 0.00015 0.00010

0.00005 normalized count 0.00000 -500 -250 0 250 position

regulation of type I interferon-mediated signaling pathway fever generation 0.04 0.08 0.12 0.16 regulation of leukocyte migration positive regulation of response to external stimulus Gene Ratio bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was Jolma et al.not certified 201 by9 peer Figure review) is the author/funder. S1 All rights reserved. No reuse allowed without permission.

100 HTR−SELEX RNAcompete Similarit ore 50 Watson-Crick & G-U 0 base-pairing (%) Primary Secondary Primary Secondary

RBFOX1 10.17 RBM42 9.4 SNRPA 7.98 7.91 RBM4 7.96 3.17 PABPC5 7.92 4.71 RRM SNRNP70 7.8 7.07 ZCRB1 7.68 7.66 HNRNPA1L2 7.17 HNRNPCL1 6.85 HNRNPA1 6.7 DAZAP1 5.39 HNRNPC 5.1 RBM24 4.88 CELF4 4.41 PCBP1 4.26 ELAVL1 3.67 MSI1 3.45 HNRNPL 2.77 RBMS3 1.38 2.17 RBM46 1.1 RBMS1 0.96 3.96 RBM28 − 0.21 RALY − RBM6 − − SART3 −

ZNF ZC3H10 3.42 5.41 TARDBP 3.88 3.66

QKI 9.86 KHDRBS1 6 KH KHDRBS2 4.24 KHDRBS3 3.25

YBX1 5.06 6.16 S YBX2 7.26 3.01 bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S2

100 HTR−SELEX RBNS 50 Watson-Crick & G-U 0 base-pairing (%) Primary Secondary Primary Secondary

BOLL CELF1 DAZ3 DAZAP1 ELAVL4 ESRP1 HNRNPA0 HNRNPC HNRNPCL1 HNRNPL IGF2BP1 KHDRBS2 KHDRBS3 MSI1 PCBP1 PUM1 RBFOX3 RBM24 RBM4 RBM4B RBM6 RBMS2 RBMS3 RC3H1 SNRPA TARDBP ZCRB1 ZFP36 bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S3

Per base information content Overall information content 200 HTR−SELEX RNAcompete+SELEX HTR−SELEX RNAcompete+SELEX

30 150

20 100 Counts Counts

50 10

0 0 0.0 0.5 1.0 1.5 2.0 10 20 30 Bits(bin=0.1) Bits(bin=1)

Motif width 100 HTR−SELEX RNAcompete+SELEX

75

50 Counts

25

0 5 10 15 20 25 Width bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S4 Spacing of dimeric motifs 8 RBPs with dimeric binding specificities

Dimer Monomer 6

31.3% 4 Counts

2 68.7%

0 5 10 15 Spacing bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S5 A 1.0 PUM2 UGUA-N-UGUA 0 5 10 15 20 Occurence of the HTR-SELEX most common spacing Hafner et al. 2010 0.1 PAR-CLIP

IGF2BP1 CA[GU]-N-CA[GU] 0 5 10 15 20 HTR-SELEX Hafner et al. 2010 PAR-CLIP

RBFOX1 GCAUG-N-GCAUG 0 5 10 15 20 HTR-SELEX Vanhentenryck et al. 2014 HITS-CLIP

RBPMS CAC-N-CAC 0 5 10 15 20 HTR-SELEX Farazi et al. 2014 PAR-CLIP B Occurrence of CAC Occurrence of CAC Occurrence of CCA in RBPMS HTR-SELEX reads in RBPMS PAR-CLIP clusters in RBPMS HTR-SELEX reads >5 >5 0 >2 1% 1% 3% 5% 4 0 4 8% 11% 7% 1 1 1 3 24% 14% 27% 3 25% 29%

0 2 2 68% 37% 40% bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S6 A U A G A C C G U A G G G U G U G G U U G A C A G A G C A C A U G U G A C 20 A U A U 20 G U C A C G 20 G 20 U U A 20 U A U A G C G A 20 G C A C G 30 C G C G 10 10 G C U C G 10 G U A U U A U U A C A 30 U U G C G C 30 U A U G C U U U U U A A U A U A G A U A U A A U G A C C G G G G C G A U A U U A A G A G U U G C C G C G C G U C U G C G C C G C C A U 30 G C 30 U G U C 10 C G U 10 G C C G 30 U U A U A A U C G A G G 10 G U A A A G G C A G A G C A U 40 G C G U 40 1. 2. 3. A U 40 A U A U A U A U Probability >= 99% C G A U A U 99% > Probability >= 95% C G 95% > Probability >= 90% A U C G 90% > Probability >= 80% C G A U A U 80% > Probability >= 70% C G C G C G 70% > Probability >= 60% 60% > Probability >= 50% 50% > Probability COL1A1 COL1A2 COL3A1 bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S7

RBPs with structured motifs

Structured Linear 30

20 RBP Counts 10

0

RRM KH Znf CSD La PUF CCCH Family HEXIM THUMP bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S8

Per base IC of stem Overall IC of stem 2.0 8

1.5 6 SLM SLM 1.0 4

0.5 2

0.0 0 0.0 0.5 1.0 1.5 2.0 02468 PWM PWM bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was Jolma et al.not certified 201 by9 peer Figure review) is the author/funder. S9 All rights reserved. No reuse allowed without permission.

TARDBP RBM6 RBPMS2 RBMY1E RBM24 DAZ4 RBM4 DAZ1_2 ELAVL4 RBM46 RBMS2 CELF3_2 DAZL RC3H2 RBPMS RBPMS2 TARDBP RBM38 RBMY1E DAZ3 ELAVE L1 RBMY1J CELF1_2 DAZL RBM4B ESRP1_3 RC3H2 HNRNPCL1 RBPMS2 RBMY1F ZC3H10 HNRNPC RBM6 BOLL ELAVL3 DAZ4 DAZ1_2

RBMS3 ZFP36 DAZAP1 ZFP36 SNRPA YBX1 HNRNPA1 HNRNPA3 MEX3B_3 RNPC3 HNRNPA1L2 SNRPA CSDA HNRNPA0 MSI1 RBMS1 SNRPB2 YBX2 MEX3D HNRNPA0

ZC3H12B HNRNPCL1

ZC3H12C HNRNPC RBM4 MEX3C RBFOX1_4 RNPC3 RC3H2 HEXIM2 SNRNP70CSTF2T RBMS2 RBM14

ZC3H12A RALY RBM4B MEX3B_3 RBFOX3 RNPC3 RC3H1_2 HEXIM1 SNRNP70 CSTF2 RBMS1 RBM14

ZC3H12C ZCRB1 ZCRB1 RBMY1F QKI PUM1 HNRNPL_2 ZRANB2 MCM3AP CSDARBFOX1_2 DAZ4 YBX1 ZC3H12B

ZC3H12B ZCRB1 ZCRB1 RBMY1J DAZ1_2 PUM2

ZRANB2 RBMS3 RBM4 MEX3B_3 LARP6 ZFR_4 PABPC5 PUM2 RBMS1 RBM42 KHDRBS2 THUMPD1 LARP7_4 KHDRBS1

YBX2 ZFR_4 RBM28 ZFR_4 ZFR_4 SART3 IGF2BP1 RBMS2 ESRP1_3 RBMY1J SNRPB2 MSI2 NOVA2_3 THUMPD1

CARHSP1 PABPC5 QKI MSI2 RBMY1J CARHSP1 SNRPA ZC3H10 RBM28 ZC3H12A KHDRBS3 MSI1 HNRPLL MKRN1

RBPMS PCBP1 CELF4 bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S10 A Simple non-structured motifs

B Variably gapped non-structured motifs 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 GUGUCNGAAGU GUGUCNCUAGC

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 UUGGCNGAGCC 1.0 Occurence of the most common spacing 0.1 bioRxiv preprint doi: https://doi.org/10.1101/317909; this version posted March 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Jolma et al. 2019 Figure S11 Donor MI Acceptor MI (bp) (bp) 1 71 142 1 71 142 145 145

74 74 (bp) (bp)

mean_diag (<=15) mean_diag (<=15) 4 0.05017 4 0.04575

Donor E-MI Acceptor E-MI (bp) (bp) 1 71 142 1 71 142 145 145

74 74 (bp) (bp)

mean_diag (<=15) mean_diag (<=15) 4 0.003576 4 0.002168