Computational Methods for Predicting Functions at the Mrna Isoform Level
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Molecular Sciences Review Computational Methods for Predicting Functions at the mRNA Isoform Level Sambit K. Mishra y, Viraj Muthye y and Gaurav Kandoi * Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; [email protected] (S.K.M.); [email protected] (V.M.) * Correspondence: [email protected] These authors contributed equally. y Received: 13 July 2020; Accepted: 6 August 2020; Published: 8 August 2020 Abstract: Multiple mRNA isoforms of the same gene are produced via alternative splicing, a biological mechanism that regulates protein diversity while maintaining genome size. Alternatively spliced mRNA isoforms of the same gene may sometimes have very similar sequence, but they can have significantly diverse effects on cellular function and regulation. The products of alternative splicing have important and diverse functional roles, such as response to environmental stress, regulation of gene expression, human heritable, and plant diseases. The mRNA isoforms of the same gene can have dramatically different functions. Despite the functional importance of mRNA isoforms, very little has been done to annotate their functions. The recent years have however seen the development of several computational methods aimed at predicting mRNA isoform level biological functions. These methods use a wide array of proteo-genomic data to develop machine learning-based mRNA isoform function prediction tools. In this review, we discuss the computational methods developed for predicting the biological function at the individual mRNA isoform level. Keywords: alternative splicing; RNA-seq; machine learning; deep learning; recommender systems; multiple instance learning; mRNA isoforms; gene ontology 1. Introduction Cells can produce multiple mRNA isoforms from a single gene because of a post-transcriptional regulatory mechanism, known as alternative splicing (AS). mRNA isoform sequences from a single gene may differ in a few base pairs up to several exons/introns. Such differences in the sequences can manifest either in the coding region or in the untranslated regions (50 or 30) of mRNA isoforms and can be characterized by several splicing mechanisms. In the absence of alternative splicing, all the introns are excised from the mRNA isoform. The commonly known AS mechanisms [1,2] include (Figure1): Int. J. Mol. Sci. 2020, 21, 5686; doi:10.3390/ijms21165686 www.mdpi.com/journal/ijms Int. J. Mol. Sci. 2020, 21, 5686 2 of 21 Int. J. Mol. Sci. 2020, 21, x FOR PEER REVIEW 2 of 21 Figure 1. CommonCommon Alternative splicing events. The The me mechanismschanisms of of the the most common alternative splicing events: Exon Skipping, Intron Retention, Alternative 5′ splice site selection, and Alternative splicing events: Exon Skipping, Intron Retention, Alternative 50 splice site selection, and Alternative 30 3Splice′ Splice Site Site selection, selection, are are presented. presented. There There are severalare several other other alternative alternative splicing splicing events events that are that not are shown not shownhere. The here. colored The colored bars (blue, bars orange, (blue, andorange, green) and represent green) represen exons, whilet exons, the blackwhile linesthe black connecting lines connectingthem represent them intronic represent segments. intronic Exonsegments. skipping—The Exon skipping—The most prevalent most mechanism prevalent mechanism in vertebrates in vertebratesand invertebrates and invertebrates where specific where exons specific in the pre-mRNA exons in the are skippedpre-mRNA in theare mature skipped mRNA in the transcript. mature mRNAIntron retention—Common transcript. Intron retention—Common in lower metazoans andin lo plants,wer metazoans it is a process and whereplants, an it intronis a process is retained where in anthe intron mature is mRNA,retained and in the Alternative mature mRNA, 30/50 acceptor and Alternative/donor sites—A 3′/5′ acceptor/donor process that involves sites—A exons process that that are involvesflanked by exons competing that are splice flanked sites by on competing one end (3 0splice/50) and sites a fixed on one splice end site (3′ on/5′) the and opposite a fixed end,splice resulting site on thein an opposite alternative end, region resulting that in is an either alternative included region or excluded that is either in the matureincluded mRNA. or excluded in the mature mRNA. AS is a natural phenomenon and its prevalence is high in eukaryotes, such as mammals and plantsAS [3 is]. Thea natural extent phenomenon to which the and eukaryotic its prevalence genome is undergoes high in eukaryotes, AS is highly such variable. as mammals Almost and 61% plantsintron-containing [3]. The extent genes to which in Arabidopsis the eukaryotic thaliana genome, 60% multi-exon undergoes genes AS is in highlyDrosophila variable. melanogaster Almost 61%and intron-containingas much as 90% multi-exongenes in Arabidopsis genes in thaliana human, 60% are alternativelymulti-exon genes spliced in Drosophila [4,5]. Interestingly, melanogaster in and the asfission much yeast as 90%Schizosaccharomyces multi-exon genes pombein human, only are 2–3% altern ofatively the genes spliced undergo [4,5]. Interestingly, AS [6,7]. Despite in the that fission the yeastalternative Schizosaccharomyces splicing in this pombe organism, only also2–3% represents of the genes the undergo important AS contribution [6,7]. Despite for that generating the alternative novel splicinggene structures. in this Alternativelyorganism also spliced represents mRNA the isoforms important from contribution a single gene for can generating have different novel functions gene structures.and are characterized Alternatively by spliced their encoded mRNA isoforms protein isoforms from a single [8].Not gene all can mRNA have different isoforms functions generated and by areAS arecharacterized functional. by Some their isoforms encoded di proteinffer in theirisoforms biological [8]. Not properties, all mRNA such isoforms as their generated catalytic by activities, AS are functional.interactions Some of the mRNAisoforms isoform differ encoded in their proteins biologic withal properties, other proteins such and as the their sub-cellular catalytic localization activities, interactionsof their encoded of the proteins mRNA [9 ].isoform AS is responsible encoded protei for severalns with functions other withinproteins the and organism the sub-cellular including localizationregulation of of gene their expression, encoded proteins response [9]. to AS stress, is responsible mRNA stability for several and proteinfunctions diversity. within the The organism skipping includingof exon 63 regulation of SMG1 geneof gene in peripheralexpression, leukocytesresponse to because stress, mRNA of examination stability stressand protein in male diversity. medical Thestudents skipping is a remarkable of exon 63 of example SMG1 ofgene AS in [10 peripheral]. Other interesting leukocytes examples because of of AS examination include the stress genes inCASP3 male, medicalMCL1, and studentsBCL2 whichis a remarkable produce mRNAexample isoforms of AS [10]. performing Other interesting completely examples opposite of functions AS include [11– the13]. genes CASP3, MCL1, and BCL2 which produce mRNA isoforms performing completely opposite functions [11–13]. While several important cellular mechanisms and functions are attributed to AS, the underlying knowledge of how exactly splicing regulates such events is still unclear [14]. Int. J. Mol. Sci. 2020, 21, 5686 3 of 21 While several important cellular mechanisms and functions are attributed to AS, the underlying knowledge of how exactly splicing regulates such events is still unclear [14]. Due to the recent developments in massively parallel sequencing methods, a rapid collection of mRNA isoform level sequence and expression data has been generated. This wealth of data at the mRNA isoform level provides evidence confirming the differential expression of mRNA isoforms under different conditions [15–17]. Such evidence has led to the refinement and improvement in genome annotations by identifying new functions of genes attributed to an alternatively spliced mRNA isoform product which were previously unknown [14]. Most experiments that aim to characterize gene functions are typically performed at a gene level, i.e., the functional characterization is targeted only towards a given gene of interest and not specifically towards its mRNA isoforms. Such experiments do not account for the fact that a gene is a collection of multiple mRNA isoforms. Databases such as Gene Ontology (GO) [18], Uniprot Gene Ontology Annotations (Uniprot-GOA) [19], and Kyoto Encyclopedia of Genes and Genomes (KEGG) [20] that curate functional data are focused primarily on the canonical mRNA isoforms and do not have specific information on the functions of alternatively spliced mRNA isoforms. This dearth of gene function characterization at the mRNA isoform level has led to only a few hundred functional annotations for mRNA isoform functions in functional databases like GO. While most experiments aimed at annotating gene functions have lagged in describing and differentiating the functions of the mRNA isoforms of a given gene, numerous computational methods using machine learning (ML) and recommendation systems have been developed to annotate gene functions at the level of mRNA isoforms. Previously, machine learning methods have been used to address a multitude of problems, some of which include, drug target discovery,