Identification and Annotation of Small RNA Genes Using Shortstack

Methods xxx (2013) xxx–xxx Contents lists available at ScienceDirect Methods journal homepage: www.elsevier.com/locate/ymeth Identification and annotation of small RNA genes using ShortStack ⇑ Saima Shahid, Michael J. Axtell Plant Biology Ph.D. Program, Huck Institutes of the Life Sciences, Penn State University, University Park, PA 16802, USA Department of Biology, Penn State University, University Park, PA 16802, USA article info abstract Article history: Highly parallel sequencing of cDNA derived from endogenous small RNAs (small RNA-seq) is a key Available online xxxx method that has accelerated understanding of regulatory small RNAs in eukaryotes. Eukaryotic regulatory small RNAs, which include microRNAs (miRNAs), short interfering RNAs (siRNAs), and Piwi-associ- Keywords: ated RNAs (piRNAs), typically derive from the processing of longer precursor RNAs. Alignment of small Small RNA RNA-seq data to a reference genome allows the inference of the longer precursor and thus the annotation High-throughput sequencing of small RNA producing genes. ShortStack is a program that was developed to comprehensively analyze Bioinformatics reference-aligned small RNA-seq data, and output detailed and useful annotations of the causal small microRNA RNA-producing genes. Here, we provide a step-by-step tutorial of ShortStack usage with the goal of intro- siRNA Genome annotation ducing new users to the software and pointing out some common pitfalls. Ó 2013 Elsevier Inc. All rights reserved. 1. Introduction also can now accept multiple small RNA-seq libraries in a single run. These improvements are layered on top of the core annotation Rapid progress in high throughout sequencing methodologies and quantification methods that were previously described [5]. has played a key role in unveiling the scope of regulatory small Based on the small RNA alignment patterns, ShortStack identifies RNAs in eukaryotes. Highly parallel sequencing of cDNAs derived and annotates both MIRNA and non-MIRNA loci, and provides de- from small RNAs (small RNA-seq) allows comprehensive assess- tailed descriptions of the small RNA populations emanating from ment of regulatory small RNA accumulation, especially when com- each locus. In this article, we provide a step-by-step tutorial of bined with alignments to a reference genome. MIRNA loci, which ShortStack usage for the current version 1.1.0 (Fig. 1). Factors produce the primary stem-loop transcripts that are processed to affecting analysis and detection sensitivity in different stages of yield mature microRNAs (miRNAs), represent the best character- the computational pipeline are addressed. ized type of small RNA gene; they are readily discerned from genome-aligned small RNA-seq data [1], and are extensively annotated in multiple species by the miRBase database [2]. However, miRNAs 2. Materials frequently comprise only a small percentage of the entire small RNA repertoire, especially in plants [3] and in animal germ-line 2.1. Basic system requirements associated tissues [4]. Unlike MIRNA loci, the loci producing other types of endogenous regulatory small RNAs have had little system- ShortStack is a command-line program for Linux or Mac OSX atic curation. Thus, there is an emerging need for computational operating systems. In this article, we assume that the reader has tools that can provide global annotations of both MIRNA and a basic working knowledge of the Bourne Again Shell (BASH) or non-MIRNA loci from small RNA-seq datasets. similar shell applications on their system. ShortStack is imple- With this need in view, ShortStack was developed for compre- mented in Perl; hence Perl should be installed in the system. By de- hensive analysis of small RNA-seq data [5]. Since the publication fault ShortStack searches for the Perl libraries in /usr/bin/perl, of the first ShortStack paper, based on version 0.4.0 of the software which is the usual location for pre-installed Perl. If the installation [5], the capabilities and performance of ShortStack have been sub- directory is different than the default one, then the first line of stantially improved. The current version (1.1.0) now is able to han- ShortStack.pl (e.g. the ‘hashbang’) should be modified to reflect dle adapter-trimming and alignment of small RNA-seq data prior that. To compile, ShortStack requires the Perl module Get- to annotation and quantification (Fig. 1). Additionally, ShortStack opt::Long. This module is already installed in nearly all Perl instal- lations, but if not, it can be obtained from the Comprehensive Perl Archive Network (CPAN) [6]. ShortStack was developed using Perl ⇑ Corresponding author. version 5.10.0; there are no known compatibility issues with other E-mail address: [email protected] (M.J. Axtell). versions of Perl but Perl 5.x is recommended. 1046-2023/$ - see front matter Ó 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ymeth.2013.10.004 Please cite this article in press as: S. Shahid, M.J. Axtell, Methods (2013), http://dx.doi.org/10.1016/j.ymeth.2013.10.004 2 S. Shahid, M.J. Axtell / Methods xxx (2013) xxx–xxx (A) memory. Precise peak memory requirements are not easily pre- dictable, but are positively correlated with genome size, depth of Possible Inputs Possible Outputs small RNA-seq data, and the number of hairpins/MIRNA loci iden- Nuclear Organelle tified in a given run. Reference genomic 2.2. Software dependencies A sequences (FASTA) CTGTAGGC ShortStack depends on several freely available, commonly used 3’ adapter third-party software packages (Table 1). Prior installation of RNAL- sequence(s) B fold and RNAeval from the Vienna RNA package [7], and samtools 3’ adapter trimmed [8] is mandatory for running ShortStack.pl program. The alignment small RNA reads 3’ adapter trimmed D (FASTA or FASTQ) software bowtie (version 0.12.x or 1.x) and its helper program smalln RNA reads 3’ adapter trimmed D2 (FASTA or FASTQ) bowtie-build [9] must be installed if ShortStack is asked to perform Raw small RNA reads small RNA reads C (FASTA or FASTQ) D Rawn small RNA reads 1 (FASTA or FASTQ) alignment of small RNA-seq data. All of the required programs (FASTA or FASTQ) RawC2 small RNA reads (FASTA or FASTQ) (RNALfold, RNAeval, samtools, bowtie, and bowtie-build) must be C1 system-executable, which typically means installing them into a common directory for system executables, such as /usr/bin/ or / usr/local/bin/. Additionally, installation of the EMBOSS package Reference aligned 3’ adapter trimmed reads (BAM) small RNA reads E [10] is not needed to execute ShortStack, but the EMBOSS einvert- 3’ adapter trimmed D (FASTA or FASTQ) smalln RNA reads ed application can be used to enhance identification of large in- 3’ adapter trimmed D2 (FASTA or FASTQ) small RNA reads Chr1:56660-56760 Cluster_1 verted repeats that spawn small RNAs. To complete this tutorial, Chr1:78999-79001 Cluster_2 D Chr2:89999-76654 Cluster_3 1 (FASTA or FASTQ) Chr2:98765-98865 Cluster_4 Chr3:10011-12002 Cluster_5 Chr4:23332-23400 Cluster_6 each of the above-mentioned programs (RNALfold, RNAeval, sam- Chr4:66575-78883 Cluster_7 Chr5:44333-45666 Cluster_8 Annotated and quantified tools, bowtie, bowtie-build, and einverted) should be installed small RNA loci (TXT) H according to the instructions provided with the source packages, ShortStack made system executable, and tested by calling individual program Reference aligned 78,900bp 79,000bp in the terminal to ensure that the program is working properly. E reads (BAM) einverted Browser tracks (GFF3) invert_it.pl I and other details 2.2.1. Justifications for selections of dependencies RNALfold is optimized for fast RNA secondary structure predictions in local regions of very long queries, making it ideal for use in Genome-wide inverted hairpin identification in the context of entire genomes [7]. RNAeval repeats (.inv) F rapidly estimates free energies of given secondary structures [7], which is required during hairpin and MIRNA annotation by Short- Chr1:56660-56760 Cluster_1 Chr1:78999-79001 Cluster_2 Chr2:89999-76654 Cluster_3 Chr2:98765-98865 Cluster_4 Stack. ShortStack relies upon of alignments in the widely used Chr3:10011-12002 Cluster_5 Chr4:23332-23400 Cluster_6 Chr4:66575-78883 Cluster_7 Chr5:44333-45666 Cluster_8 BAM format [8] due to the indexing and fast-retrieval capabilities, Defined small RNA loci (TXT) G complete information storage, and highly extensible nature of such alignments when manipulated with the various samtools applica- (B) tions. Finally, ShortStack uses bowtie [9] as an aligner due to bow- Required inputs and resulting outputs: tie’s very fast performance and optimization for the short, trim, align, and analyze ungapped alignments required for small RNA-seq. The bowtie- build application [9] is required to build the genomic indices re- C ... C D D A + B +[ 1 n ] [ 1 ... n ]+ E + H + I quired for bowtie to function. align, and analyze D D 2.3. ShortStack installation A +[ 1 ... n ] E + H + I analyze existing alignment Installation of ShortStack simply involves extracting the contents of the source .tgz file (available from [11]). For convenience, A + E H + I a copy of the script ShortStack.pl can be moved to /usr/bin, /usr/lo- (C) cal/bin/, or elsewhere in the user’s PATH to make it system execut- Optional inputs & their effects able. Once installed, calling ShortStack.pl with no arguments will give a brief usage statement: F Increases sensitivity of hairpin detection $ ShortStack.pl Note overlapping annotations (as ‘flag_file’) or determine G ShortStack.pl version 1.1.0 small RNA abundance in pre-defined loci (‘count’ mode). USAGE: ShortStack.pl [options] genome.fasta Fig. 1. Inputs and outputs of ShortStack. (A) Schematic showing possible inputs and MODES: outputs of a ShortStack analysis. Letters in colored circles indicate different types of 1. Trim, align, and analyze: requires –untrimmedFA OR files or data. (B) Required inputs and their outputs for analysis.

Identification and Annotation of Small RNA Genes Using Shortstack

Bioinformatics Study of Lectins: New Classification and Prediction In

Trichoderma Reesei Complete Genome Sequence, Repeat-Induced Point

Introduction to Bioinformatics (Elective) – SBB1609

A Comprehensive Review and Performance Evaluation of Sequence Alignment Algorithms for DNA Sequences

GPS@: Bioinformatics Grid Portal for Protein Sequence Analysis on EGEE Grid

Software List for Biology, Bioinformatics and Biostatistics CCT

A New Graphical User Interface to EMBOSS

EBI, Expasy, EMBOSS, DTU)

Web & Grid Technologies in Bioinformatics, Computational And

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

A Converter Facilitating Genome Annotation Submission to European Nucleotide Archive Martin Norling1,2, Niclas Jareborg1,3 and Jacques Dainat1,2*

Bioinformatics Approaches for Functional Predictions in Diverse Informatics Environments. Paula Maria Moolhuijzen, Bsc This Thes