1 Identification of Protein-Protected Mrna Fragments and Structured
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2020.06.25.171439; this version posted June 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Identification of protein-protected mRNA fragments and structured excised intron RNAs in human plasma by TGIRT-seq peak calling Jun Yao1, Douglas C. Wu1, Ryan M. Nottingham, and Alan M. Lambowitz2 Institute for Cellular and Molecular Biology and Departments of Molecular Biosciences and Oncology, University of Texas at Austin, Austin, TX 78712 1 Contributed equally 2To whom correspondence should be addressed. E-mail: [email protected] 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.06.25.171439; this version posted June 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Summary Human plasma contains >40,000 different coding and non-coding RNAs that are potential biomarkers for human diseases. Here, we used thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) combined with peak calling to simultaneously profile all RNA biotypes in apheresis-prepared human plasma pooled from healthy individuals. Extending previous TGIRT-seq analysis, we found that human plasma contains largely fragmented mRNAs from >19,000 protein-coding genes, abundant full-length, mature tRNAs and other structured small non-coding RNAs, and less abundant tRNA fragments and mature and pre-miRNAs. Many of the mRNA fragments identified by peak calling correspond to annotated protein-binding sites and/or have stable predicted secondary structures that could afford protection from plasma nucleases. Peak calling also identified novel repeat RNAs, miRNA-sized RNAs, and putatively structured intron RNAs of potential biological, evolutionary, and biomarker significance, including a family of full-length excised introns RNAs, subsets of which correspond to mirtron pre-miRNAs or agotrons. Key words: biomarker, cell-free RNA, diagnostics, extracellular RNA, RNA-binding protein, tRNA fragment 2 bioRxiv preprint doi: https://doi.org/10.1101/2020.06.25.171439; this version posted June 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Introduction Extracellular RNAs in human plasma have been avidly pursued as potential biomarkers for cancer and other human diseases (1-7). In healthy individuals, plasma RNAs arise largely by apoptosis or secretion from cells in blood, bone marrow, lymph nodes, and liver, while in cancer and other diseases, plasma RNAs may arise by necrosis or secretion from tumors or other damaged tissues, potentially providing diagnostic information (8-14). As plasma contains active RNases (15, 16), the extracellular RNAs that persist there are thought to be protected from degradation by bound proteins, RNA structure, or encapsulation in extracellular vesicles (EVs). Although high-throughput RNA-sequencing (RNA-seq) has identified virtually all known RNA biotypes in human plasma, studies aimed at identifying disease biomarkers have focused mostly on plasma mRNAs or miRNAs. mRNAs in plasma and blood have been profiled by RNA-seq methods that enrich for or selectively reverse transcribed poly(A)-containing RNAs (10, 17) or by sequencing mRNA fragments with (13, 14, 18, 19) or without (4, 20) size selection, and it remains unclear which methods might be optimal for biomarker identification. miRNAs in plasma are analyzed by methods that enrich for small RNAs and neglect pre-miRNAs or longer transcripts (9, 19, 21-23). Additionally, almost all RNA-seq studies of plasma RNAs have used retroviral reverse transcriptases (RTs) to convert RNAs into cDNAs for sequencing on high- throughput DNA sequencing platforms. Retroviral RTs have inherently low fidelity and processivity and even those that are highly engineered have difficulty reverse transcribing through stable RNA secondary structures or post-transcriptional modifications, resulting in under-representation of 5'-RNA sequences and aborted reads that can be mistaken for RNA fragments (24). Thus, we still have an incomplete understanding of the biology of plasma RNAs and how to optimize their identification as biomarkers for human diseases. As a substitute for retroviral RTs, we have been developing RNA-seq methods using thermostable group II intron reverse transcriptases (TGIRTs) (20, 24, 25). In addition to high fidelity, processivity, and strand-displacement activity, group II intron RTs have a proficient template-switching activity that enables efficient, seamless attachment of RNA-seq adapter sequences to target RNAs without RNA tailing or ligation (25, 26). Further, unlike retroviral RTs, which tend to dissociate from RNAs at post-transcriptional modifications that affect base pairing, TGIRT enzymes pause at such modifications but eventually read through by characteristic patterns of mis-incorporation that can be used to identify the modification (20, 24, 3 bioRxiv preprint doi: https://doi.org/10.1101/2020.06.25.171439; this version posted June 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 27-31). This combination of activities enables TGIRT enzymes to give relatively uniform 5'- and 3'-sequence coverage of mRNAs when initiating conventionally from an annealed oligo(dT) primer (25) and to give full-length, end-to-end reads of tRNAs and other structured small ncRNAs, when initiating by template switching to the 3' end of the RNA (20, 24, 25, 27). In size selected RNA preparations, TGIRT-seq also profiles human miRNAs with bias equal to or less than alternative methods (25, 32). The comprehensive TGIRT-seq method used in this work to analyze human plasma RNAs employs TGIRT-template-switching for 3'-RNA-seq adapter addition followed by a single- stranded DNA ligation for 5' RNA-seq adapter addition (Fig. 1A). In a validation study using rRNA-depleted chemically fragmented human reference RNAs with External RNA Control Consortium spike-ins, TGIRT-seq gave better quantitation of mRNAs and spike ins, more uniform 5’ to 3’ coverage of mRNA sequences, detected more splice junctions, particularly near the 5' ends of mRNAs, and had higher strand specificity when compared to benchmark TruSeq- v3 datasets (24). This study also showed that TGIRT-seq enables simultaneous sequencing of chemically fragmented mRNAs together with tRNAs and other structured sncRNAs, which were poorly represented in the TruSeq datasets, even after chemical fragmentation (24). A subsequent study of chemically fragmented human cellular RNAs with customized spike ins confirmed these findings and showed that TGIRT-seq enables simultaneous quantitative profiling of mRNA fragments and sncRNAs of >60 nt, but under-represents smaller RNAs due in part to differential loss during library clean-up to remove adapter dimers (33). Recent TGIRT-seq profiling of structured RNAs in unfragmented cellular RNA preparations from human cells revealed previously unannotated sncRNAs, including novel snoRNAs, tRNA-like RNAs, and tRNA fragments (tRFs)(34). The ability of TGIRT-seq to simultaneously profile mRNAs and non-coding RNAs from small amounts of starting material without size selection is advantageous for the analysis of extracellular RNAs, which are present in low concentrations in human plasma or EVs secreted by cultured human cells. In a previous study, TGIRT-seq showed that human plasma from a healthy male individual contained predominantly full-length tRNAs and other structured small ncRNAs, which could not be seen in RNA-seq studies using retroviral RTs, together with RNA fragments derived from large numbers of protein-coding genes and lncRNAs, with higher proportions of intron and antisense RNAs than in cellular RNA datasets (20). Similarly, TGIRT- 4 bioRxiv preprint doi: https://doi.org/10.1101/2020.06.25.171439; this version posted June 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. seq showed that highly purified EVs and exosomes secreted by cultured human cells also contained predominantly full-length tRNAs, Y RNAs, and Vault RNAs, together with low concentrations of mRNAs, including 5' terminal oligopyrimidine (5' TOP) mRNAs (30). The same TGIRT-seq workflow using template switching for facile 3'-RNA-seq adapter enabled single-stranded (ss) DNA-seq profiling of human plasma DNA, including analyses of nucleosome positioning and DNA methylation sites that inform tissue-of-origin (35), as shown previously for other ssDNA-seq methods (36, 37). Since these earlier studies, TGIRT-seq methods have undergone improvements including: (i) the use of modified RNA-seq adapters that substantially decrease adapter-dimer formation; (ii) computational and biochemical methods for remediating residual end biases,