Tools and Technologies to Characterize Isoforms at Proteome-Scale
Gloria Sheynkman
Marc Vidal Laboratory
Center for Cancer Systems Biology, Dana-Farber Cancer Institute
Department of Genetics, Harvard University
SMRTLeiden, May 2nd, 2017

Gene numbers
[Figure: bar chart of complexity. Number of genes: 20K. Source: Ensembl Dec 2015 release]

Isoform numbers
[Figure: bar chart of complexity. Number of genes: 20K; number of splice forms (isoforms): 100K. Source: Ensembl Dec 2015 release]

Combinations of splice sites produce diverse protein isoforms.
Tropomyosin Alpha-1 Chain, chr15:63,334,785-63,364,028, MCF-7 Cells
[Figure: transcript structures for the locus. Labels: canonical promoter, alternative promoter, cassette exon, constitutive exons, nucleotide polymorphism, retained intron, alternative donor, alternative 3' end, canonical polyadenylation site, lncRNA (5' end)]

The proteoform hypothesis
~100K isoforms
~1 million proteoforms
Smith & Kelleher Nat Methods 2013

Splicing regulation and disease
• splicing is pervasive, inherent to encoded products of the genome
• splicing is highly regulated in space and time
• high tissue- and developmental-specificity
• "splice code"
• splicing is dysregulated in many diseases, including cancer
• estimates that 50% of all disease variants affect splicing
• splice-modulating therapies (e.g. antisense oligos)
Wang et al Nature Reviews Genetics (2007)

Isoform function?
[Figure: bar chart of complexity. Number of genes: 20K; number of splice forms (isoforms): 100K. Source: Ensembl Dec 2015 release]

Isoforms and functional divergence
In vivo functions:
Identical: "Isoforms"
Different: "Alloforms"
Opposite: "Antiforms"

Examples of functionally divergent isoforms
Alloforms (Christofk et al Nature 2008)
Antiforms: Bcl-X anti-apoptotic vs. Bcl-X pro-apoptotic (Schwerk et al Mol Cell 2005)

Divergent functional capabilities described in literature
Isoforms for a few hundred genes
Physical interactions
Cellular localization
Enzymatic activities
Stability
…..
Kelemen et al Gene 2013

Sociological biases in literature
[Table, layout lost in extraction: columns "Submitted" and "Accepted" under "Publication"; rows Isoforms, Alloforms, Antiforms; cell values as extracted: - -, +/- ?, + -]
Issues with making general conclusions from literature:
- confirmation bias (sampling not random)
- experimental approaches for characterization heterogeneous
- isoform identity unknown
See Rolland et al Cell 2014

How widespread is isoform functional divergence in the whole proteome?
Systematic identification of large numbers of isoform pairs
Unbiased functional profiling for large numbers of human genes
Physical interactions
Enzymatic activities
Cellular localization
Stability
…..

Landscape of protein isoform functional divergence
[Figure: divergence, ranging from "mostly isoforms" to "mostly alloforms", plotted across large numbers of pairs of isoforms encoded by common genes]

RNA sequencing data has been the primary means to characterize isoforms
[Figure: sequencing technologies: ESTs, splice-specific microarrays, NGS (454, Solexa, SOLiD, Illumina), PacBio, Oxford Nano.]

NGS (RNA-Seq) data can reveal the presence of exons and junctions, but fails to accurately reconstruct full-length isoforms.
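The point that short reads reveal exons and junctions but not full-length isoforms can be illustrated with a toy example. The sketch below is not from the talk: the five-exon gene, the exon names E1 to E5, and the helper functions are hypothetical, and it assumes each read covers at most one junction and ignores abundance and paired-end information. Under those assumptions, two different isoform mixtures leave exactly the same exon and junction evidence.

```python
# Toy illustration (hypothetical gene, not data from the talk):
# two different isoform mixtures that yield identical exon and junction evidence.

def junctions(isoform):
    """Return the set of exon-exon junctions used by one isoform."""
    return set(zip(isoform, isoform[1:]))


def observed_features(isoforms):
    """Collect the exons and junctions that short reads could support."""
    exons, juncs = set(), set()
    for isoform in isoforms:
        exons.update(isoform)
        juncs.update(junctions(isoform))
    return exons, juncs


def candidate_isoforms(juncs, start, end):
    """Enumerate every exon path from start to end allowed by the junctions."""
    successors = {}
    for donor, acceptor in juncs:
        successors.setdefault(donor, []).append(acceptor)
    paths, stack = [], [(start,)]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            paths.append(path)
            continue
        for nxt in successors.get(path[-1], []):
            stack.append(path + (nxt,))
    return sorted(paths)


# E2 and E4 are cassette exons; E1, E3 and E5 are constitutive.
mixture_x = [("E1", "E2", "E3", "E4", "E5"), ("E1", "E3", "E5")]
mixture_y = [("E1", "E2", "E3", "E5"), ("E1", "E3", "E4", "E5")]

exons_x, juncs_x = observed_features(mixture_x)

# Same exons and junctions, different full-length isoforms.
print(observed_features(mixture_x) == observed_features(mixture_y))  # True
print(set(mixture_x) == set(mixture_y))                              # False

# Four full-length paths are consistent with the junction evidence.
for path in candidate_isoforms(juncs_x, "E1", "E5"):
    print("-".join(path))
```

A read long enough to span E2 through E4 in a single molecule would separate the two mixtures, which is the motivation for the full-length sequencing technologies listed above.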
transcript reconstruction
[Page excerpt shown on slide:]
to be short isoforms comprising 2–3 exons on average and thus represent a more tractable subset of the transcriptome. Provision of transcript start and end sites gave iReckon an advantage for the more complex human transcriptome, as evidenced by increased accuracy in assembling partial transcripts. In contrast, SLIDE consults exon coordinates but ignores their connectivity, performing at a level similar to methods without any prior transcript-level information. Reported transcript structures

transcript assembly
[Figure: genome browser view of a 73.70 Kb region (Chromosome 21, 111.30 Mb to 111.36 Mb, forward and reverse strands) with tracks labeled STAR alignment, GENCODE genes (RPF2, U6) and one track per method, with RPKM (scaled) and RPKM (original) values; further panels order the protocols (Annotation, Augustus, Cufflinks, Exonerate, GSTRUCT, iReckon, mGene, mTim, NextGeneid, NextGeneidAS, Oases, SLIDE, Transomics, Trembly, Tromer, Velvet) on a 0 to 1.0 scale]

Transcriptome reconstruction: akin to reassembling magazine articles after they have been through a paper shredder.
Korf Nature Methods 2013

Assessment of transcript reconstruction methods for RNA-seq
Steijger et al Nature Methods 2013
- evaluated 14 reconstruction/assembly methods
- high method variability
- poor assembly performance on simulated data

[Page excerpt shown on slide:]
Assessment of transcript reconstruction methods for RNA-seq
Tamara Steijger, Josep F Abril, Pär G Engström, Felix Kokocinski, The RGASP Consortium, Tim J Hubbard, Roderic Guigó, Jennifer Harrow & Paul Bertone

We evaluated 25 protocol variants of 14 independent computational methods for exon identification, transcript reconstruction and expression-level quantification from RNA-seq data. Our results show that most algorithms are able to identify discrete transcript components with high success rates but that assembly of complete isoform structures poses a major challenge even when all constituent elements are identified. Expression-level estimates also varied widely across methods, even when based on similar transcript models. Consequently, the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data.

High-throughput sequencing instruments necessitate a shotgun approach for all but the shortest target molecules. Full-length representation of most cellular RNAs from sequencing data requires computational reconstruction of transcript structures. The majority of such programs infer transcript models from the accumulation of read alignments to the genome (refs. 1–4); some take the alternative approach of de novo reconstruction, in which contiguous transcript sequences are assembled without the use of a reference genome (refs. 5–7).

Here we present a detailed evaluation of computational methods for transcript reconstruction and quantification from RNA-seq data, in a framework based on the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project (EGASP) (ref. 8). Developers of leading software programs were invited to

approaches are relatively adept, along with more challenging areas for future improvement.

RESULTS
We evaluated a total of 25 transcript reconstruction protocols, basing our analysis on alternate parameter usage of 14 software packages on RNA-seq data sets for three species (Supplementary Fig. 1, Supplementary Table 1 and Supplementary Note). Programs were run by the original developers, with the exception of Cufflinks, iReckon and SLIDE. So that we could assess the ability of each method to interpret transcript expression from RNA-seq data without prior knowledge of gene content, programs were run without genome annotation, aside from iReckon and SLIDE, which require such information. Performance was benchmarked relative to the subset of annotated exons to which RNA-seq reads mapped (coverage of ≥1 read pair per 100 bp) and their corresponding transcripts (Online Methods).

Identification of annotated features
We first assessed the degree to which gene components reported by each algorithm matched the reference annotation at the nucleotide level. From the Caenorhabditis elegans data, the methods Augustus, mGene and Transomics displayed excellent performance in detecting exonic bases but also reported the expression of substantial proportions of genomic sequence outside of reference exons (Fig. 1 and Supplementary Table 2). Recall (sensitivity) was generally lower for Drosophila melanogaster, although most protocols exceeded 75% for both