Using qCLASH to identify miR-484 targets Mythili Merchant1,2, Christopher Fields3,4, Lu Li3,4, Mingyi Xie3,4,5

1Department of Microbiology & Cell Science, 2College of Agricultural and Life Sciences, 3Department of Biochemistry and Molecular Biology, 4UF Health Cancer Center and 5UF Genetics Institute, University of Florida, Gainesville, FL 32610, USA

Abstract: MicroRNAs (miRNAs) are 22-nucleotide-long RNA molecules that play an important role in gene regulation by binding to mRNA targets and silencing them. This paper focuses on a subset of miRNAs known as Transcription Start Site miRNAs (TSS-miRNAs), specifically miR-484, which has been shown to play a role in cancer. An important part of this research is determining the mRNA targets of miR-484 and understanding more about their genomic locations, base pairing patterns and possible functional roles relating to cancer progression, using bioinformatic tools. Many miR-484 target genes have functional roles relating to apoptosis, cell proliferation and cell growth, raising the possibility of miR-484 being used as a marker of disease, or for therapies in the future.

I. BACKGROUND

miRNAs are 22-nucleotide-long RNA molecules that interact with mRNA to repress translation using imperfect base pairing [1,2,3]. They play a major role in gene regulation. Generally, miRNAs are transcribed by RNA polymerase II, and contain a stem-loop hairpin [1,2]. These are known as primary miRNAs (pri-miRNAs). The pri-miRNA is cleaved by DROSHA, a nuclease that begins miRNA processing in the nucleus, and DGCR8 to release the hairpin with a 3’ overhang, forming a precursor miRNA (pre-miRNA), which is then exported to the cytoplasm by Exportin-5 (XPO5), where the loop is processed by DICER [1,3]. Finally, one strand of the mature miRNA duplex is preferentially selected by the Argonaute protein (Ago) to form a complete and fully functional RNA Induced Silencing Complex (RISC) (Fig. 1) [1]. While most miRNAs are generated as described, there are alternative pathways which generate miRNAs, including the mirtron pathway, the Transcription Start Site (TSS)-miRNA pathway, and the miR-451 pathway [1,2].

Fig. 1. Canonical miRNA biogenesis [15]: miRNAs are transcribed by RNA polymerase II and contain a stem-loop hairpin to form pri-miRNA. DROSHA and DGCR8 cleave the hairpin and create a 3’ overhang which forms the pre-miRNA. This is exported to the cytoplasm by Exportin 5 where the loop is processed by DICER. One strand of the mature miRNA is selected by Ago to form the final RISC complex.

TSS-miRNAs share promoters with protein coding genes and are derived downstream of RNA Pol II transcription start sites [3]. TSS-pre-miRNAs are produced directly by initiation and termination of RNA Pol II, and are not processed by DROSHA [2]. Two important TSS-miRNAs, miR-320a and miR-484, have been shown to play a role in oncogenesis [3,4]. However, the vast majority of these TSS-miRNAs are largely unannotated, so their targets and functions are unknown. Understanding the biogenesis and function of these TSS-miRNAs may result in better understanding of cancer pathogenesis and the creation of novel therapies [2,3].

Quick Crosslinking Ligation and Sequencing of Hybrids (qCLASH) is a method used to identify mRNA targets of miRNAs. RNA and proteins are covalently bonded together using UV [5]. Then, the miRNA is ligated directly to its target mRNA, which results in the formation of a single hybrid RNA molecule [5]. In older methods like High-Throughput Sequencing Cross-Linking ImmunoPrecipitation (HITS-CLIP), sequencing reads of miRNA and their targets were paired using bioinformatic predictions that were based on base pairing. This created a lot of scope for error and also caused problems in establishing connections where non-canonical base pairing occurred [5]. qCLASH has several advantages over these older methods: it eliminates problems of inaccuracy because of the ligation of the miRNA to its target, it prevents loss of large amounts of RNA and it takes significantly less time [5].

qCLASH was used to identify hybrids in this experiment on the HCT116 human colorectal cancer cell lines containing DROSHA knockouts, which would produce no canonical miRNAs and would be enriched with DROSHA independent miRNAs like TSS-miRNAs. DICER knockout cells served as a negative control, because they should produce no miRNAs other than miR-451, which has been identified as the only miRNA which is DICER independent [14]. qCLASH was also performed in wildtype HCT116 cells.

In order to learn more about the mRNA targets of these TSS-miRNAs, a major part of the study required bioinformatic analysis of these hybrids. Before analyzing this data, however, pre-processing of the RNA-seq data needed to be done to ensure quality of the sequences. The pre-processing was done on wildtype and DROSHA knockout data files. These processes included merging forward and reverse sequencing reads into a single forward read using PEAR [7], adapter removal, collapsing of PCR duplicates and trimming of low quality bases using either Trimmomatic [6], Cutadapt [9] or Fastx Toolkit [11]. Hyb [8] was used to identify the miRNA-mRNA hybrids. The miR-484 target sequences were extracted and Integrative Genomics Viewer [10] was used to learn about their locations. Several of these mRNAs were found to potentially play an important role in oncogenesis.

II. METHODS

Performing qCLASH to ligate miRNA to mRNA targets

The experimental portion of this project involved using qCLASH to covalently bond Ago and bound RNA while preserving RNA-protein complexes using 254 nm UV. Cells were lysed under denaturing conditions and the Ago-RNA complex was isolated using immunoprecipitation and trimmed using RNase. It is important to do extensive washes in order to remove potential contaminants prior to ligating ends. The 5’ end of the target mRNA was then ligated to the 3’ end of the miRNA, which resulted in the formation of a “hybrid”, which then had a 3’ adapter added. Finally, the hybrids had 5’ adapters ligated and were reverse transcribed into cDNA. They were amplified using PCR to create libraries for Illumina sequencing. The next step after sequencing was to pre-process data and prepare it for downstream analyses.

Determining the most efficient workflow to pre-process data

Determining the most efficient and constructive way to pre-process data, so that Hyb would identify the maximum number of hybrids, was a very important aspect of this project. A file of 5% of the sequencing reads of wildtype data was used to determine the most efficient method of pre-processing data, using two different workflows (Fig. 2).

There were two major workflows, each of which had several different variations in order to determine the best tools and combinations to accomplish the same job, which was to generate the maximum number of miRNA-mRNA hybrids (Fig. 2).

Fig. 2. Organization of workflows for determining most efficient method to pre-process RNA-seq data

The first workflow relied heavily on using Trimmomatic and Cutadapt as the tools for adapter removal and Fastx toolkit for collapsing PCR duplicates and trimming low quality bases. This will be referred to as workflow 1 and was divided into two different pipelines, each of which had two different versions.

A. With PEAR using Workflow 1 (pipeline 1.1)

The first pipeline started off by using PEAR to merge reads. PEAR produces multiple files, only two of which were of interest for downstream analyses: the assembled and forward unassembled files. These two files were concatenated to generate the maximum number of sequences. The first version of this pipeline used trimmomatic for adapter removal on the PEAR file, and fastx toolkit for collapsing PCR duplicates and low quality base trimming, followed by running hyb for hybrid identification. This version of the pipeline used scripts 1,2,4,5 and 6. The second version of this pipeline used cutadapt instead of trimmomatic for adapter removal (Fig. 3). For this version, scripts 1,3,4,5 and 6 were used.

B. No PEAR using Workflow 1 (pipeline 1.2)

The second pipeline in this workflow started off directly with adapter removal, and did not use PEAR for merging reads. The first version of this pipeline used trimmomatic for adapter removal. Once trimmomatic ran, the forward assembled and unassembled read files were concatenated to produce maximum forward reads and then fastx toolkit was used for collapsing and trimming of low quality bases, followed by hyb. The first version of this pipeline used scripts 2,4,5 and 6. The second version of this pipeline used cutadapt instead of trimmomatic for adapter removal (Fig. 3) and used scripts 3,4,5 and 6.

Script 1: PEAR for merging reads

    module load pear
    pear -f infile_forward.fastq.gz -r infile_reverse.fastq.gz -o outfile_name -m 200 -n 18 -t 26
    #m is for maximum possible length of assembled sequence
    #n is for minimum possible length of assembled sequence
    #t is for minimum length of reads after trimming low quality bases
    #Remember to concatenate the forward unassembled and assembled files

Script 2: Trimmomatic for adapter removal

    module load trimmomatic
    trimmomatic SE infile -baseout outfile ILLUMINACLIP:RPI_trimmomatic.fasta:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:26
    #ILLUMINACLIP provides name of adapter sequence file
    #LEADING is minimum quality to keep a base from the beginning
    #TRAILING is minimum quality to keep a base from the end
    #SLIDINGWINDOW performs sliding window trimming and cuts off sequence when average quality of seq in window falls below threshold
    #MINLEN removes reads that fall below a specified min length

Script 3: Cutadapt for adapter removal

    module load cutadapt
    cutadapt -g AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA -a GTGACTGGAGTTCCTTGGCACCCGAGAATTCCA -m 26 -o outfile infile
    #g is 5’ adapter sequence and a is 3’ adapter sequence
    #m is minimum length of read

Script 4: Fastx toolkit for collapsing

    module load fastx_toolkit
    fastx_collapser -v -i infile -o outfile
    #v prints short summary of input/output counts
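To make the collapsing step of Script 4 concrete, the following is a minimal Python sketch of what collapsing identical reads amounts to. It is an illustration under stated assumptions, not a replacement for fastx_collapser: the function name `collapse_reads` and the toy reads are hypothetical, and the `rank-count` header is only an assumed naming style for collapsed sequences.

```python
from collections import Counter

def collapse_reads(reads):
    """Collapse identical sequences into (header, sequence) pairs.

    The header is 'rank-count' (an assumed naming style), with the most
    abundant sequence ranked first.
    """
    counts = Counter(reads)
    # Sort by descending count, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [("%d-%d" % (rank, n), seq)
            for rank, (seq, n) in enumerate(ordered, start=1)]

# Toy reads (hypothetical); three of them are identical PCR duplicates.
toy_reads = ["ACGTACGT", "TTGACCA", "ACGTACGT", "ACGTACGT", "GGCATTA"]
collapsed = collapse_reads(toy_reads)

for header, seq in collapsed:
    print(">" + header)
    print(seq)
```

Collapsing before hybrid calling means each unique sequence is analyzed once, while the count in the header preserves how often it was observed.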

Script 5: Fastx toolkit for trimming

    module load fastx_toolkit
    fastx_trimmer -f 4 -l 4 -i infile -o outfile
    #f is first base to keep
    #l is last base to keep

Script 6: Hyb for hybrid identification

    module load hyb
    module load unafold
    hyb analyse in=infile db=hOH7 format=fasta
    #db is database of choice

The next workflow relied heavily on cutadapt and fastx toolkit, and tested a new order of pre-processing. This workflow will be referred to as workflow 2 and was also divided into two different pipelines, each of which had two different versions.

C. Cutadapt and fastx toolkit for pre-processing using Workflow 2 (pipeline 2.1)

In order to test whether the order of pre-processing impacted the number of hybrids produced, the PCR collapsing step was done much earlier than in workflow 1. The first version of this pipeline started off by using PEAR to merge reads, and fastx toolkit to collapse PCR duplicates. Then, cutadapt was used for adapter removal and trimming of low quality bases, followed by hyb. This version used scripts 1,4,7 and 6, respectively. The second version of this pipeline followed the same process, but did not use PEAR. So, it started off with collapsing using fastx toolkit, rather than merging reads using PEAR (Fig. 4). The scripts used for this version were 4,7 and 6, respectively.

D. Fastx toolkit for all pre-processing using Workflow 2 (pipeline 2.2)

The second pipeline in this workflow used fastx toolkit for most of the pre-processing. The first version used PEAR to merge reads, and then used fastx toolkit for adapter removal, collapsing and trimming, followed by hyb. Scripts used for this version were 1,8,4,5 and 6, respectively. The second version of this pipeline did not use PEAR to merge reads at the start, and used only fastx for adapter removal, collapsing and trimming, ending with hyb (Fig. 4). This version used scripts 8,4,5 and 6.

Script 7: Cutadapt for adapter removal and trimming

    module load cutadapt
    cutadapt -g AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGANNNN -a GTGACTGGAGTTCCTTGGCACCCGAGAATTCCANNNN -m 26 -o outfile infile
    #first sequence is 5’ adapter sequence and second sequence is 3’ adapter sequence
    #N’s are for trimming

Script 8: Fastx toolkit for adapter removal

    module load fastx_toolkit
    fastx_clipper -a AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA -l 26 -v -i infile -o outfile
    fastx_clipper -a GTGACTGGAGTTCCTTGGCACCCGAGAATTCCA -l 26 -v -i infile -o outfile

Using hyb to identify hybrids and extract target mRNA sequences

Running hyb generated several files, one of which was a “stats” file that provided detailed information on all of the different kinds of RNA hybrids that were detected. The ones of interest for this project were the miRNA-mRNA hybrids. Using a python script, mRNA target sequences of the most abundant TSS-miRNAs were extracted and used to generate a list of high confidence targets for miR-484, along with transcriptomic coordinates, number of replicates and peaks (see supplementary material for script). This analysis was done on the entire pre-processed wildtype and DROSHA knockout data files, not just 5% of one of the files.

Determining precise genomic location of targets

Background information on what is known about the miRNA and target mRNA interaction was gathered using Mirtarbase [12] to look at whether the high confidence mRNA targets generated by Hyb were previously characterized as miR-484 targets. Targetscan was used to determine whether these interactions could be bioinformatically predicted [13]. Mapping these targets to the genome required obtaining genomic coordinates and chromosome numbers. To do this, all the mRNA target sequences were run through BLASTn [21]. This was followed by using Integrative Genomics Viewer for mapping. The hg38 version of the human genome was used, specifically with the gencode track. This track was used instead of the refseq track because it provided a better visual of the 3’ UTR and 5’ UTR regions. Base pairing patterns of miR-484 and target mRNA were generated using the viennad file produced by running Hyb, and by using RNAcofold [20] (Table III).

III. RESULTS

Learning about the mRNA targets of miR-484 and understanding more about their functions was the most important aspect of this study, because it may open new avenues for cancer related therapies. To accomplish this, Quick Crosslinking, Ligation and Sequencing of Hybrids (qCLASH) was used to covalently bond proteins to the bound RNA using 254 nm UV radiation. Then, miRNAs were ligated to their mRNA targets. This was done on HCT116 human colorectal cancer cell lines containing DROSHA knockouts, which would result in an enrichment of TSS-miRNAs, the subset of miRNAs of interest. It was also done in wildtype cell lines. This resulted in the production of “hybrids” of miRNAs and their mRNA targets, which made identification of targets easier. 5’ and 3’ adapters were ligated and the hybrids were reverse transcribed into cDNA to be amplified via PCR in order to produce libraries for high throughput sequencing. A lot of pre-processing needed to be done on the RNA-seq data using bioinformatics tools. It was important to use the right combination of tools with the right parameters, neither too flexible nor too stringent, so that a maximum number of hybrids would be generated when the data was run through hyb. This would allow for downstream analyses where the target sequences of the most abundant TSS-miRNAs would be extracted using a python script and then mapped back to the genome, so that more about their target sites and functions could be understood.
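The eight pre-processing variants described in the Methods can be summarized in a small lookup table. The sketch below is only an illustrative Python encoding: the script numbers and adapter-removal tools follow the Methods text, the hybrid counts are the values reported in Table I, and the names `VARIANTS` and `hybrid_range_by_tool` are hypothetical.

```python
# (pipeline, version) -> (script order, adapter-removal tool, hybrids in Table I)
VARIANTS = {
    ("1.1", 1): ([1, 2, 4, 5, 6], "trimmomatic", 4697),
    ("1.1", 2): ([1, 3, 4, 5, 6], "cutadapt", 4632),
    ("1.2", 1): ([2, 4, 5, 6], "trimmomatic", 4729),
    ("1.2", 2): ([3, 4, 5, 6], "cutadapt", 4632),
    ("2.1", 1): ([1, 4, 7, 6], "cutadapt", 4536),
    ("2.1", 2): ([4, 7, 6], "cutadapt", 4587),
    ("2.2", 1): ([1, 8, 4, 5, 6], "fastx", 3018),
    ("2.2", 2): ([8, 4, 5, 6], "fastx", 2792),
}

def hybrid_range_by_tool(variants):
    """Return {tool: (min_hybrids, max_hybrids)} across all variants."""
    ranges = {}
    for _scripts, tool, hybrids in variants.values():
        lo, hi = ranges.get(tool, (hybrids, hybrids))
        ranges[tool] = (min(lo, hybrids), max(hi, hybrids))
    return ranges

ranges = hybrid_range_by_tool(VARIANTS)
for tool in sorted(ranges):
    print(tool, ranges[tool])
```

Grouping the counts by adapter-removal tool makes the comparison between Trimmomatic, Cutadapt and Fastx toolkit explicit, independent of whether PEAR was used.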

Cutadapt and Trimmomatic allow for a maximum number of mRNA-miRNA hybrids

In order to get the maximum number of hybrids possible, it was important to determine which set of bioinformatic analysis tools worked best and which order of steps would optimize results. To test this, two different workflows were developed and each workflow was tested using 5% of the entire raw sequence data file. Both workflows had two pipelines, each of which had two different versions (Fig. 2). In the first workflow, the first pipeline used trimmomatic or cutadapt for adapter removal, fastx toolkit for collapsing PCR duplicates and trimming low quality bases, along with PEAR for merging reads. Ultimately it was run through hyb to identify hybrids. The other pipeline did the same thing, but without merging reads using PEAR (Fig. 3). The next workflow that was developed switched the order and combination of processing. The first pipeline in this workflow used fastx toolkit for collapsing PCR duplicates, but used cutadapt for trimming and adapter removal, followed by hyb. This was done with and without PEAR. The second pipeline in this workflow used fastx toolkit for adapter removal, collapsing and trimming, followed by hyb. This too was done with and without PEAR (Fig. 4).

Fig. 4. Visual representation of workflow 2, using pipelines 2.1 and 2.2.

Using fastx toolkit for adapter removal produced 2792 and 3018 hybrids, whereas using trimmomatic or cutadapt produced between 4536 and 4729 hybrids. These results were consistent with or without the use of PEAR (Table I). From this, it was established that using fastx toolkit as a means of adapter removal, with or without PEAR, produced the fewest hybrids and was least effective. This data was collected from the “stats” file which was generated as a result of running hyb on the data.

TABLE I
NUMBER OF HYBRIDS PRODUCED FROM EACH WORKFLOW

Workflow     Pipeline       Ver. 1   Ver. 2
Workflow 1   Pipeline 1.1   4697     4632
Workflow 1   Pipeline 1.2   4729     4632
Workflow 2   Pipeline 2.1   4536     4587
Workflow 2   Pipeline 2.2   3018     2792

As a result, using cutadapt or trimmomatic for adapter removal and quality processing, followed by fastx toolkit for collapsing PCR duplicates and trimming, with or without PEAR, was determined to be the most efficient workflow (workflow 1, using either pipeline 1.1 or 1.2) to obtain the maximum number of hybrids.

Hyb identifies high confidence miR-484 targets

Continuing the analysis depended on figuring out which TSS-miRNAs were most abundantly expressed. In addition to this, it was important to determine the location of their mRNA targets in the genome. Running hyb generated several different files. Some of the important ones were the “stats”, “hyb” and “viennad” files. A python script was used to identify the most abundantly expressed TSS-miRNAs. This used the “.hyb” file produced by running hyb. The most expressed TSS-miRNAs were found to be miR-320a and miR-484. After this, the script was used to extract the mRNA sequences of miR-484 targets from both the wildtype and DROSHA knockout pre-processed data files. This was done on the “viennad” file which was generated as a result of running hyb on the data (see supplementary material for python script). A ranked list was generated where targets were sorted by number of replicates and peaks, along with the gene name, transcriptome coordinates and sequence. The TSS-miRNA, miR-484, was found to target several mRNAs in the genome. In some cases, it targeted multiple locations on the same gene, for example in AKAP12.
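The ranking and replicate filter described for the target list can be sketched in a few lines of Python. This is a minimal illustration, not the supplementary script: the first five records use replicate and peak counts reported in Table II, `GENE_X` is a hypothetical single-replicate entry added only to show the filter, and the function name `rank_targets` is an assumption.

```python
# (gene, replicates, peaks); GENE_X is hypothetical, the rest follow Table II.
records = [
    ("AKAP12", 2, 2),
    ("CRTC2", 2, 4),
    ("GENE_X", 1, 5),
    ("CMPK1", 3, 10),
    ("TM4SF18", 3, 3),
    ("RPL32", 3, 8),
]

def rank_targets(rows, min_replicates=2):
    """Keep targets seen in at least min_replicates replicates and sort
    them by replicates, then peaks (both descending), then gene name."""
    kept = [r for r in rows if r[1] >= min_replicates]
    return sorted(kept, key=lambda r: (-r[1], -r[2], r[0]))

ranked = rank_targets(records)
for gene, reps, peaks in ranked:
    print(gene, reps, peaks)
```

Sorting on replicates first reflects the emphasis on reproducibility across data files; peaks then break ties among equally reproducible targets.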

Fig. 3. Visual representation of workflow 1, using pipelines 1.1 and 1.2.

3’ UTR and Exons of mRNA are the most targeted locations by miR-484

To determine the precise locations where miR-484 targets the mRNA, information on genomic coordinates needed to be gathered and then the mRNA targets had to be mapped back to the genome. A lot of target sequences were extracted by the python script, but only the targets with two or more replicates from both the wildtype and DROSHA knockout data files were used for further analyses (Table II). Data on gene name, transcriptome coordinates, number of replicates, peaks and sequences was already generated, so picking out the targets of interest was not a challenge. Targetscan was used to see whether any of these high confidence targets could be bioinformatically predicted as targets of miR-484, but there were no significant results. Furthermore, mirtarbase was used to determine whether any of the miRNA-mRNA interactions that were found have been previously characterized, but most had only been previously characterized by NGS, and not by experimental methods. BLASTn was used to determine chromosome number and genomic coordinates of these targets. This was necessary information for the next step of the analyses, where Integrative Genomics Viewer (IGV), using the Gencode track, was used to determine the location of the target in the mRNA. A majority (about 93%) of the locations on the target genes were found to be in the 3’ UTR or exon region, which is consistent with previous reports [18, 19]. Very few were found in the intron regions and none in the 5’ UTR (Fig. 5).

Using RNAcofold, the base pairing patterns of miR-484 and the mRNA target were generated. Table III shows some of these patterns, with miR-484 at the top and the small portion of the mRNA target sequence that it binds to at the bottom. Lines between bases represent Watson-Crick base pairing and a dot represents G-U wobble base pairs. The blue highlighted bases are nucleotides 2-8 on miR-484 and represent seed base pairing, which is nearly perfect for all targets, indicating that these are strong targets. Bases that are not in line with the rest represent the mRNA bases that loop out during the base pairing with miR-484.

Fig. 5. Genomic locations of mRNA targets visualized with a pie chart (3’ UTR 57%, Exon 36%, Intron 7%): Most of the target locations for the miRNA on the mRNA are found in the 3’ UTR and exon region

TABLE II
NUMBER OF REPLICATES AND HYBRIDS FOR MRNA TARGETS

Gene Name   Number of Replicates   Number of hybrids (Peaks)   Targeting location in the gene
CMPK1       3                      10                          3’ UTR
RPL32       3                      8                           Intron
TM4SF18     3                      3                           3’ UTR
CRTC2       2                      4                           3’ UTR
PNRC2       2                      4                           3’ UTR
MLF2        2                      3                           Exon
TSPAN15     2                      2                           3’ UTR
ZFP36       2                      2                           Exon
AKAP12      2                      2                           Exon
SERPINH1    2                      2                           3’ UTR
C1orf9      2                      4                           3’ UTR
C6orf203    2                      2                           Exon
ZNF322A     2                      2                           3’ UTR
CCDC151     2                      2                           Exon

TABLE III
BASE PAIRING PATTERN FOR MIR-484 AND MRNA TARGETS

Columns: Gene Name (Target); Target mRNA/miR-484 base pairing pattern
Rows (base pairing diagrams not recoverable from the figure): CMPK1, RPL32, TM4SF18, CRTC2, PNRC2, MLF2, TSPAN15, ZFP36, AKAP12, SERPINH1, C1orf9, C6orf203
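The proportions shown in Fig. 5 follow directly from the targeting locations listed in Table II. The following Python sketch tallies them; the gene-to-location mapping is copied from Table II, while the variable and function names are illustrative assumptions.

```python
from collections import Counter

# Targeting location of each high confidence target, as listed in Table II.
LOCATIONS = {
    "CMPK1": "3' UTR", "RPL32": "Intron", "TM4SF18": "3' UTR",
    "CRTC2": "3' UTR", "PNRC2": "3' UTR", "MLF2": "Exon",
    "TSPAN15": "3' UTR", "ZFP36": "Exon", "AKAP12": "Exon",
    "SERPINH1": "3' UTR", "C1orf9": "3' UTR", "C6orf203": "Exon",
    "ZNF322A": "3' UTR", "CCDC151": "Exon",
}

def location_percentages(locations):
    """Return {region: percentage of targets}, rounded to whole percent."""
    counts = Counter(locations.values())
    total = sum(counts.values())
    return {region: round(100.0 * n / total) for region, n in counts.items()}

percentages = location_percentages(LOCATIONS)
for region in sorted(percentages):
    print(region, percentages[region])
```

With 8 of 14 targets in the 3’ UTR, 5 in exons and 1 in an intron, the tally gives 57%, 36% and 7%, so 13 of 14 targets (about 93%) fall in the 3’ UTR or exon regions.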

IV. DISCUSSION ZNF322A Fastx toolkit is not the recommended tool for adapter removal

In an attempt to determine the most efficient workflow for pre- CCDC151 processing of sequencing data, cutadapt and trimmomatic proved to be the most effective tools for adapter removal, along with fastx toolkit for collapsing of PCR duplicates and trimming of low Several mRNA targets are involved in cell growth, quality bases. There was a significant difference in the number of proliferation, metastasis, and apoptosis hybrids produced as a result of using cutadapt or trimmomatic To understand the possible role of miR-484 in cancer versus fastx toolkit (Table 1). It is interesting, however, to note that progression, an understanding of the functional roles of its using cutadapt for trimming instead of fastx toolkit in workflow 2 mRNA targets was necessary. 10 out of 14 mRNA targets that pipeline 2.1, produced approximately the same number of hybrids were characterized using hyb were found to have functions as when fastx toolkit was used for trimming. These results indicate relating to cell growth, metastasis, proliferation, apoptosis, etc. that cutadapt probably uses more flexible parameters in all of its A. Overexpression of CMPK1, MLF2, TSPAN15, TM4SF18 processing, be it adapter removal or trimming. However, fastx toolkit may use stringent parameters for adapter removal processes, SERPINH1, ZFP36 and ZNF322A promotes oncogenesis CMPK1 is upregulated in a variety of malignant tumors and but more flexible parameters for its trimming and collapsing. is closely associated with tumor growth [22]. It is overexpressed in human epithelial type ovarian tumors [23]. The 3’ UTR and exon region of a gene is most important for Upregulation enhances cell proliferation and cell motility [22]. targeting mechanisms It also has an effect on apoptosis and is regulated by multiple Understanding the target locations of miR-484 in the genes miRNAs, apart from miR-484 [23]. 
required retrieving chromosome numbers and genomic Similarly, overexpression of ZNF322A also promotes cell coordinates. Using BLASTn and Integrative Genomics Viewer proliferation and prolongs the cell cycle [30]. It also enhances (IGV), this information was gathered and used to determine cell migration and invasion. ZNF322A is overexpressed in lung precise locations of targets in the gene, i.e., whether they cancer patients and is associated with poor survival chances targeted the 3’ UTR, 5’ UTR, Exons or Introns. The results [30]. showed that most of the target locations were either in the 3’ TSPAN15, ZFP36 and SERPINH1 promote the metastatic UTR or Exon regions. There were no target regions in the 5’ ability of cells [29, 31, 32]. TSPAN15 overexpression leads to UTR, and very few in the intronic regions. Previous studies uncontrollable activation of several metastasis related genes in have shown that miRNAs usually regulate genes by binding to esophageal squamous cell carcinoma [31]. High levels of the 3’ UTR regions or coding sequence regions (exons) and act SERPINH1 is closely associated with poor clinical outcome at by destabilizing the mRNA to silence the gene [18]. It was all four stages of clear cell renal cell carcinoma [32]. TM4SF18 shown that miRNA target sites in exons was not as well- is aberrantly expressed in pancreatic cancer and regulates cell regulated as target sites in 3’ UTR’s, but, nonetheless, they exist growth as well [33]. [19]. Furthermore, overexpression of MLF2 also plays an important role in tumor initiation and metastasis. It has been found that reducing its expression in breast cancer patients The mRNA targets of miR-484 have oncogenesis relevant results in reduced primary tumor growth and metastasis [28]. functions

Once the most efficient workflow for pre-processing was B. CRTC2, PNRC2 and AKAP12 downregulation promotes determined, the entire wildtype and DROSHA knockout files were cancer progression used for downstream analysis. One of the most important aspects of this project was to determine the mRNA targets of miR-484 and AKAP12 is a tumor suppressor and lower expression results determine whether any of these have been characterized before, or in higher amounts of cell proliferation and apoptosis [24, 25]. whether there are any possible cancer related functional roles of It is downregulated in patients with skin, colon, breast and these genes. As mentioned before, a lot of these targets were not prostate cancer [24, 25, 26]. Individuals with higher AKAP12 bioinformatically predicted, but after mining databases, many levels are healthier [24]. mRNA targets appeared to be potentially linked to oncogenesis The reduced expression of PNRC2 results in increased cell related functions. 10 out of the 14 characterized targets had proliferation and cell motility, and can be indicative of reduced relevant functional roles, mostly associated with cell proliferation, survival in individuals with renal cell carcinoma [27]. metastasis or apoptosis in different types of cancers. These were all Patients with very aggressive forms of prostate cancer had characterized to be strong targets due to the number of replicates differing levels of CRTC2 compared to individuals with less and hybrids found on hyb, as well as the strong seed base pairing aggressive forms of prostate cancer, and no prostate cancer. patterns. In the future, these can be used as biomarkers for cancer High CRTC2 expression was indicative of low invasion, and could be used as targets for drug development in the future. whereas low CRTC2 expression was indicative of high levels Several previous studies have also published work on roles of miR- of invasion [34]. 
484 7 in oncogenesis, and how understanding more about their mirna_end1 = int(list1[2].split('\t')[-1]) mirna_start1 = int(list1[2].split('\t')[-2]) targeting mechanisms may help in developing new therapies for motif1 = patients. A previous report found that miR-484 is list1[4].split('\t')[0][mirna_end1:].strip('.') downregulated in cervical cancer tissues, compared to normal motif2 = ''.join(map(lambda x: '\\'+x,motif1)) tissues, so overexpression could suppress cell proliferation and rex1 = re.compile(motif2) modol_seq = increase apoptosis, and thereby inhibit the development of re.search(rex1,list1[4].split('\t')[0]) cervical cancer [16]. Merhautova et. al. found miR-484 to have motif_seq = decreased expression in tumor tissue of ovarian cancer patients list1[1].split('\t')[0][modol_seq.span()[0]:modol_se but increased expression in tumor tissues of adrenocortical q.span()[1]] ## extract motif sequence list_basepattern.append('>'+list1[0]+'_dG:'+str(dG cancer patients [17]. [1:-1])+'\t'+list1[4].split('\t')[0][mirna_start1- 1:mirna_end1]+'\t'+list1[4].split('\t')[0][mirna_end In summary, several strong targets for the TSS-miRNA, miR- 1:].strip('.')+'\t'+motif_seq) dis1 = (int(t_line2[3])+1)-int(t_line2[2]) 484 were found, with known or putative roles in oncogenesis. dfm = pd.DataFrame({'gene': Some of these were downregulated, and others were [target_name]*dis1,'position': upregulated in a variety of cancers. This raises the possibility of range(int(t_line2[2]),int(t_line2[3])+1),'Number': miR-484 being a potential target for drug development and [1]*dis1}) dfm = dfm.set_index(['gene', 'position',]) other therapies in the future. return dfm

def integrate_genomecov(files): global list1, list_basepattern SUPPLEMENTARY MATERIAL dG,F = 0,0 ## Calculate the minimum free energy Python Script used to extract mRNA target sequences list1, list_basepattern = [],[] ##Temporarily get from viennad file: information about each target ## This script uses the Viennad file for call dis1=0 ##The distance between the beginning and peak analysis. the end of the reads f=pd.DataFrame() ## Create empty DataFrame dict1={} ## Hyb analyse type=mim pref=mim only output miRNA- mRNA chimera, so this script only for for mRNA as a ##Integrate genomecov_peak files into the target. dictionary if len((glob.glob('*.viennad')))<= 1: import os, re, pandas as pd, numpy as np, glob, sys, getopt, random print 'the total Viennad file is ' +str(len(glob.glob(files))) from datetime import else: datetime random.seed('lu') Current_time = str(datetime.now()).split()[0] print 'the total Viennad file are ' +str(len(glob.glob(files))) #################################################### ################### how to use this script for i in glob.glob(files): print i def main(format1): ### Input format, the script with open(i,'r+') as f1: needs to input Viennad file , replicates for line1 in f1: try: if ('microRNA' in line1) and ('mRNA' in opts, args = getopt.getopt(format1, 'h:i:g:r:m:o:') line1) and (miR_name1.upper().replace('ALL','RNA') in line1.upper()): for i in opts: if 'h' in i[0]: if (list1 != []) and ('microRNA' in list1[0]) and ('mRNA' in list1[0]) and print 'python2 test.py -i - (miR_name1.upper().replace('ALL','RNA') in r -m -o ' list1[0].upper()): sys.exit() elif dfm = genomecov_eachgene() i[0] == '-i': f= Viennad = i[1] elif i[0] == '-r': pd.concat([f,dfm],axis=1).sum(axis=1) ## f indicate the total replicates replicates1 = i[1] list1=[] elif i[0] == '-m': list1.append(line1.strip()) miR_name1 = i[1] if (list1 != []) and ('microRNA' in list1[0]) elif i[0] == '-o': output1_name = i[1] and ('mRNA' in list1[0]) and 
(miR_name1.upper().replace('ALL','RNA') in except getopt.GetoptError as err: list1[0].upper()): dfm = genomecov_eachgene() print 'python2 test.py -i -r f = -m -o ' sys.exit() pd.concat([f,dfm],axis=1).sum(axis=1).rename(i) dict1[i]=f return Viennad, replicates1, list1 = [] ## Empty the list after analyzing miR_name1, output1_name the contents of a Viennad file f=pd.DataFrame() ## Create empty #################################################### DataFrame f = pd.concat(map(lambda x: dict1[x],dict1.keys()),axis=1) ################### integrate viennad Rep1 = f.count(axis=1).rename('Replicate') files def genomecov_eachgene(): Peak1 = f.sum(axis=1).rename('Peak') global list1, list_basepattern f = pd.concat([f,Rep1,Peak1],axis=1) t_line2= list1[3].split('\t') f = f.sort_index() target_name = '_'.join(map(lambda x: list1[0].split('_')[x], range(-6,-2))) f = f.dropna(axis=0,how='all') dG = list1[4].split('\t')[-1] f = f[f['Replicate']>= int(replicates1)] ## Enter many replicates you need? 8

    return f, list_basepattern

####################################################################### human hOH7.fasta file
def transcript_database():
    with open('/apps/hyb/20141126/data/db/hOH7_TSSmiRNA.fasta') as f1:
        dict1 = {}
        name1,seq1 = '',''
        for line1 in f1:
            if line1[0] == '>':
                if name1 != '':
                    dict1[name1] = seq1
                    name1,seq1 = '',''
                name1 = line1.strip()[1:]
            else:
                seq1 += line1.strip().upper()
        dict1[name1] = seq1
    return dict1

####################################################################### Call_Peak
def call_peak():
    f_out = open(str(Current_time)+'_'+str(output1_name)+'/Call_peak.txt','w+')
    f_out.write('gene name'+'\t'+'start'+'\t'+'end'+'\t'+'Replicate'+'\t'+'Peak'+'\t'+'sequence'+'\n')
    gene = ''  ## Temporary gene when analyzing replicate
    list_Replicate, list_Peak = [], []
    start1,end1 = 0,0  ## The beginning and end of Replicate
    seq1 = ''  ## Extract a partial sequence of a specific mRNA
    for x,y in f1[0].index:
        if (end1+1) == int(y) and gene == x:
            end1 = int(y)
        else:
            if start1 != 0:
                f_out.write(gene+'\t'+str(start1)+'\t'+str(end1)+'\t'+str(int(max(list_Replicate)))+'\t'+str(int(max(list_Peak)))+'\t'+str(dict_trans[gene][start1-1:end1])+'\n')
                list_Replicate, list_Peak = [], []
            gene = x
            start1, end1 = int(y), int(y)
        list_Replicate.append(f1[0].loc[x,y]['Replicate'])
        list_Peak.append(f1[0].loc[x,y]['Peak'])
    f_out.write(gene+'\t'+str(start1)+'\t'+str(end1)+'\t'+str(int(max(list_Replicate)))+'\t'+str(int(max(list_Peak)))+'\t'+str(dict_trans[gene][start1-1:end1])+'\n')
    f_out.close()

####################################################################### basepattern and motif result
def output_symbol_motif():  ## output basepattern and motif result
    list1_p1, list1_m1 = [],[]  ## create base pattern and motif list
    with open(str(Current_time)+'_'+str(output1_name)+'/Call_peak.txt','r+') as f_peak:
        for line1 in f_peak:
            line2 = line1.strip().split('\t')
            for i in f1[1]:  ## loop of base pattern and motif information
                if line2[0] in i:  ## Determine if the gene name is in the base pattern
                    beginmotif1,endmotif1 = i.split('\t')[0].split('_')[-3],i.split('\t')[0].split('_')[-2]  ## The number of position in listbasepattern
                    if set(range(int(beginmotif1),int(endmotif1)+1)) & set(range(int(line2[1]),int(line2[2])+1)) != set():  ## Determine if there are intersect between Peak and base pattern
                        list1_p1.append(i.split('\t')[0] + '\t' + i.split('\t')[1])
                        list1_m1.append(i.split('\t')[0] + '\n' + i.split('\t')[3])
    f1_symbol = open(str(Current_time)+'_'+str(output1_name)+'/Symbol.txt','w+')
    list_symbol = []
    for i in random.sample(list1_p1,len(list1_p1)):
        dG1 = float(i.split('\t')[0].split(':')[-1])
        motif1 = i.split('\t')[1]
        for x in motif1.replace('.','0').replace('(','1'):
            if x == '0':
                list_symbol.append(x)
            else:
                F = -dG1-11
                if F >= 5:
                    list_symbol.append('5')
                elif F <= 0:
                    list_symbol.append('0')
                else:
                    list_symbol.append(str(F))
        f1_symbol.write(i.split('\t')[0]+'\t'+'\t'.join(list_symbol)+'\n')
        list_symbol = []
    f1_motif = open(str(Current_time)+'_'+str(output1_name)+'/motif.txt','w+')
    for m in list1_m1:
        f1_motif.write(m+'\n')
    f1_motif.close()
    f1_symbol.close()

#######################################################################
if __name__ == '__main__':
    Viennad, replicates1, miR_name1, output1_name = main(sys.argv[1:])  ### remember write the number of replicate
    os.mkdir(str(Current_time)+'_'+str(output1_name))
    dict_trans = transcript_database()  ## good
    f1 = integrate_genomecov(Viennad)
    call_peak()
    output_symbol_motif()

REFERENCES

[1] Xie, M., & Steitz, J. A. (2014). Versatile microRNA biogenesis in animals and their viruses. RNA Biology, 10(4161), 673-681. doi:10.4161/rna.28985

[2] Xie, M., Li, M., Vilborg, A., Lee, N., Shu, M. D., Yartseva, V., Šestan, N., … Steitz, J. A. (2013). Mammalian 5'-capped microRNA precursors that generate a single microRNA. Cell, 155(7), 1568-80. doi:10.1016/j.cell.2013.11.027

[3] Sheng, P., Fields, C., Aadland, K., Wei, T., Kolaczkowski, O., Gu, T., Kolaczkowski, B., … Xie, M. (2018). Dicer cleaves 5'-extended microRNA precursors originating from RNA polymerase II transcription start sites. Nucleic acids research, 46(11), 5737-5752. doi:10.1093/nar/gky306

[4] Zamudio, J. R., Kelly, T. J., & Sharp, P. A. (2014). Argonaute-bound small RNAs from promoter-proximal RNA polymerase II. Cell, 156(5), 920-34. doi:10.1016/j.cell.2014.01.041

[5] Gay, L. A., Sethuraman, S., Thomas, M., Turner, P. C., & Renne, R. (2018). Modified Cross-Linking, Ligation, and Sequencing of Hybrids (qCLASH) Identifies Kaposi's Sarcoma-Associated Herpesvirus MicroRNA Targets in Endothelial Cells. Journal of virology, 92(8). doi:10.1128/JVI.02138-17

[6] Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (Oxford, England), 30(15), 2114-20. doi:10.1093/bioinformatics/btu170

[7] Zhang, J., Kobert, K., Flouri, T., & Stamatakis, A. (2013). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics (Oxford, England), 30(5), 614-20. doi:10.1093/bioinformatics/btt593

[8] Travis, A. J., Moody, J., Helwak, A., Tollervey, D., & Kudla, G. (2014). Hyb: a bioinformatics pipeline for the analysis of CLASH (crosslinking, ligation and sequencing of hybrids) data. Methods (San Diego, Calif.), 65(3), 263-73. doi:10.1016/j.ymeth.2013.10.015

[9] M. M. (n.d.). Cutadapt Removes Adapter Sequences From High-Throughput Sequencing Reads. EMBLnet.journal, 10-12. doi:10.14806/ej.17.1.200

[10] Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature biotechnology, 29(1), 24–26. doi:10.1038/nbt.1754

[11] FASTX-Toolkit. (n.d.). Retrieved from http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

[12] Chou, C. H., Shrestha, S., Yang, C. D., Chang, N. W., Lin, Y. L., Liao, K. W., … Huang, H. D. (2017). miRTarBase update 2018: a resource for experimentally validated microRNA-target interactions. Nucleic acids research, 46(D1), D296–D302. doi:10.1093/nar/gkx1067

[13] Agarwal, V., Bell, G. W., Nam, J. W., & Bartel, D. P. (2015). Predicting effective microRNA target sites in mammalian mRNAs. eLife, 4. doi:10.7554/eLife.05005

[14] Cheloufi, S., Santos, C. O., Chong, M. M., & Hannon, G. J. (2010). A Dicer-independent miRNA biogenesis pathway that requires Ago catalysis. Nature, 465(7298), 584-589. doi:10.1038/nature09092

[15] Tomari, Y., & Zamore, P. D. (2005). MicroRNA Biogenesis: Drosha Can't Cut It without a Partner. Cell Press, 15(2), R61-R64. doi:10.1016/j.cub.2004.12.057

[16] Hu, Y., Xie, H., Liu, Y., Liu, W., Liu, M., & Tang, H. (2017). MiR-484 suppresses proliferation and epithelial–mesenchymal transition by targeting ZEB1 and SMAD2 in cervical cancer cells. Cancer Cell International, 17(36). doi:10.1186/s12935-017-0407-9

[17] Merhautova, J., Hezova, R., Poprach, A., Kovarikova, A., Radova, L., Svoboda, M., … Slaby, O. (2015). miR-155 and miR-484 Are Associated with Time to Progression in Metastatic Renal Cell Carcinoma Treated with Sunitinib. BioMed research international, 2015, 941980. doi:10.1155/2015/941980

[18] Cannell, I. G., Kong, Y. W., & Bushell, M. (2008). How do microRNAs regulate gene expression? Biochemical Society Transactions, 36(6), 1224-1231. doi:10.1042/BST0361224

[19] Fang, Z., & Rajewsky, N. (2011). The Impact of miRNA Target Sites in Coding Sequences and in 3′UTRs. Plos One, 6(3). doi:10.1371/journal.pone.0018067

[20] Proctor, J. R., & Meyer, I. M. (2013). COFOLD: an RNA secondary structure prediction method that takes co-transcriptional folding into account. Nucleic acids research, 41(9), e102. doi:10.1093/nar/gkt174

[21] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Elsevier, 215(3), 403-410. doi:10.1016/S0022-2836(05)80360-2

[22] Li, W., Wang, Q., Feng, Q., Wang, F., Yan, Q., Gao, S. J., & Lu, C. (2019). Oncogenic KSHV-encoded interferon regulatory factor upregulates HMGB2 and CMPK1 expression to promote cell invasion by disrupting a complex lncRNA-OIP5-AS1/miR-218-5p network. PLoS pathogens, 15(1). doi:10.1371/journal.ppat.1007578

[23] Zhou, D., Zhang, L., Lin, Q., Ren, W., & Xuab, G. (2017). Data on the association of CMPK1 with clinicopathological features and biological effect in human epithelial ovarian cancer. Elsevier, 13, 77-84. doi:10.1016/j.dib.2017.05.022

[24] WU Xuan, WU Tong, LI Ke, LI Yuan, HU Ting Ting, WANG Wei Feng, QIANG Su Jing, XUE Shao Bo, LIU Wei Wei. The Mechanism and Influence of AKAP12 in Different Cancers (2018). Biomedical and Environmental Sciences, 31(12), 927-932. doi:10.3967/bes2018.127

[25] Muramatsu, M., Akakura, S., Gao, L., Peresie, J., Balderman, B., & Gelman, I. H. (2018). SSeCKS/Akap12 suppresses metastatic melanoma lung colonization by attenuating Src-mediated pre-metastatic niche crosstalk. Oncotarget, 9(71), 33515–33527. doi:10.18632/oncotarget.26067

[26] Soh, R. Z., Lim, J. P., Samy, R. P., Chua, P. J., & Bay, B. H. (2018). A-kinase anchor protein 12 (AKAP12) inhibits cell migration in breast cancer. Elsevier, 105(3), 364-370. doi:10.1016/j.yexmp.2018.10.010

[27] Quan J, Pan X, Li Y, Hu Y, Tao L, Li Z, Zhao L, Wang J, Li H, Lai Y, Zhou L, Lin C, Gui Y, Ye J, Zhang F, Lai Y (2019). MiR-23a-3p acts as an oncogene and potential prognostic biomarker by targeting PNRC2 in RCC, 112(108755). Biomed Pharmacother. doi:10.1016/j.biopha.2018.11.065

[28] Dave, B., Granados-Principal, S., Zhu, R., Benz, S., Rabizadeh, S., Soon-Shiong, P., … Chang, J. C. (2014). Targeting RPL39 and MLF2 reduces tumor initiation and metastasis in breast cancer by inhibiting nitric oxide synthase signaling. Proceedings of the National Academy of Sciences of the United States of America, 111(24), 8838–8843. doi:10.1073/pnas.1320769111

[29] Gupta, G., Bebawy, M., Pinto, T., Chellappan, D., A, M., & K, D. (2018). Role of the Tristetraprolin (Zinc Finger Protein 36 Homolog) Gene in Cancer. Begell, 28(3), 217-221. doi:10.1615/CritRevEukaryotGeneExpr.2018021188

[30] Jen, J., Lin, L. L., Chen, H. T., Liao, S. Y., Lo, F. Y., Tang, Y. A., … Wang, Y. C. (2015). Oncoprotein ZNF322A transcriptionally deregulates alpha-adducin, cyclin D1 and p53 to promote tumor growth and metastasis in lung cancer. Oncogene, 35(18), 2357–2369. doi:10.1038/onc.2015.296

[31] Zhang, B., Zhang, Z., Li, L., Qin, Y. R., Liu, H., Jiang, C., … Zhu, Y. H. (2018). TSPAN15 interacts with BTRC to promote oesophageal squamous cell carcinoma metastasis via activating NF-κB signaling. Nature communications, 9(1), 1423. doi:10.1038/s41467-018-03716-9

[32] Qi, Y., Zhang, Y., Peng, Z., Wang, L., Wang, K., Feng, D., … Zheng, J. (2017). SERPINH1 overexpression in clear cell renal cell carcinoma: association with poor clinical outcome and its potential as a novel prognostic marker. Journal of cellular and molecular medicine, 22(2), 1224–1235. doi:10.1111/jcmm.13495

[33] Singhal, M., Khatibeghdami, M., Principe, D. R., Mancinelli, G. E., Schachtschneider, K. M., Schook, L. B., … Grimaldo, S. R. (2019). TM4SF18 is aberrantly expressed in pancreatic cancer and regulates cell growth. PloS one, 14(3), e0211711. doi:10.1371/journal.pone.0211711

[34] Lee, H., Lee, M., & Hong, S. K. (2019). CRTC2 as a novel prognostic biomarker for worse pathologic outcomes and biochemical recurrence after radical prostatectomy in patients with prostate cancer. Investigative and clinical urology, 60(2), 84–90. doi:10.4111/icu.2019.60.2.84
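
Note on the supplementary script: its two core steps are merging consecutive covered positions into peaks (call_peak) and scoring each target position from the hybrid structure and folding energy (output_symbol_motif). These can be illustrated with a minimal Python 3 sketch. The function names merge_positions and pairing_scores and the toy inputs are assumptions made for illustration; they are not part of the original script, which runs under Python 2 on Viennad files.

```python
def merge_positions(positions):
    """Merge sorted (gene, position) pairs into (gene, start, end) intervals.

    Mirrors the loop in call_peak: a position extends the current peak only
    if it belongs to the same gene and directly follows the previous position.
    """
    intervals = []
    gene, start, end = None, 0, 0
    for g, p in positions:
        if g == gene and p == end + 1:
            end = p  # contiguous position: extend the current peak
        else:
            if gene is not None:
                intervals.append((gene, start, end))
            gene, start, end = g, p, p  # start a new peak
    if gene is not None:
        intervals.append((gene, start, end))
    return intervals


def pairing_scores(structure, dG):
    """Score each position of a dot-bracket string.

    Mirrors output_symbol_motif: unpaired positions ('.') score 0; any other
    position scores F = -dG - 11, clamped to the range [0, 5].
    """
    strength = min(5.0, max(0.0, -dG - 11))
    return [0.0 if ch == '.' else strength for ch in structure]


if __name__ == '__main__':
    toy = [('A', 1), ('A', 2), ('A', 3), ('A', 7), ('B', 8), ('B', 9)]
    print(merge_positions(toy))
    print(pairing_scores('..((.', -14.0))
```

Clamping to [0, 5] means that hybrids with a folding energy weaker than -11 contribute nothing, while very stable hybrids saturate at the maximum score.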