Tools and technologies to characterize isoforms at -scale

Gloria Sheynkman, Marc Vidal Laboratory, Center for Cancer Systems Biology, Dana-Farber Cancer Institute; Department of Genetics, Harvard University. SMRTLeiden, May 2nd, 2017

Gene numbers

Number of genes: 20K (Ensembl, Dec 2015 release)

Isoform numbers

Number of isoforms (splice forms): 100K

Number of genes: 20K

(Ensembl, Dec 2015 release)

Combinations of splice sites produce diverse isoforms.

Tropomyosin Alpha-1 Chain chr15:63,334,785-63,364,028 MCF-7 Cells

[Figure: isoform diagrams for this locus, each ending in a poly(A) tail. Legend terms recoverable from the slide: canonical, alternative, cassette, constitutive, retained, lncRNA, promoter (5’ end), nucleotide polymorphism, donor, 3’ end, polyadenylation site.]
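As a rough illustration of how combinations of splice choices multiply the number of isoforms, the sketch below enumerates every exon combination for a toy gene. The exon names and the assumption that each cassette exon is included or skipped independently are hypothetical and are not taken from the slide; real genes also use alternative promoters, donors, and polyadenylation sites, and not every combination is expressed.

```python
from itertools import product


def enumerate_isoforms(exons):
    """Return all exon combinations for a toy gene model.

    exons: list of (name, kind) tuples, kind is 'constitutive' or 'cassette'.
    Constitutive exons are always kept; each cassette exon is either
    included or skipped, independently of the others (an assumption).
    """
    cassette_positions = [i for i, (_, kind) in enumerate(exons) if kind == "cassette"]
    isoforms = []
    for choices in product([True, False], repeat=len(cassette_positions)):
        included = dict(zip(cassette_positions, choices))
        isoform = tuple(
            name
            for i, (name, kind) in enumerate(exons)
            if kind == "constitutive" or included[i]
        )
        isoforms.append(isoform)
    return isoforms


# Hypothetical gene: three constitutive exons and two cassette exons.
toy_gene = [
    ("E1", "constitutive"),
    ("E2", "cassette"),
    ("E3", "constitutive"),
    ("E4", "cassette"),
    ("E5", "constitutive"),
]
toy_isoforms = enumerate_isoforms(toy_gene)
for isoform in toy_isoforms:
    print("-".join(isoform))
```

With n independent cassette exons this model yields 2^n combinations, which is why a modest number of splice choices per gene can generate many distinct transcripts.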

The proteoform hypothesis

~100K isoforms

~1 million proteoforms

Smith & Kelleher Nat Methods 2013

Splicing regulation and disease

•splicing is pervasive, inherent to encoded products of the genome

•splicing is highly regulated in space and time
 •high tissue- and developmental-specificity
 •“splice code”

•splicing is dysregulated in many diseases, including cancer
 •an estimated 50% of all disease variants affect splicing
 •splice-modulating therapies (e.g. antisense oligos)

Wang et al Nature Reviews Genetics (2007)

Isoform function?

Number of isoforms (splice forms): 100K

Number of genes: 20K

(Ensembl, Dec 2015 release)

Isoforms and functional divergence

In vivo functions:

Identical: “Isoforms”

Different: “Alloforms”

Opposite: “Antiforms”

Examples of functionally divergent isoforms

Alloforms

Antiforms

Bcl-X anti-apoptotic Bcl-X pro-apoptotic

Christofk et al Nature 2008

Schwerk et al Mol Cell 2005

Divergent functional capabilities described in literature

Isoforms for a few hundred genes

Physical interactions

Cellular localization

Enzymatic activities

Stability

…..

Kelemen et al Gene 2013

Sociological biases in literature

Publication    Submitted    Accepted
Isoforms       -            -
Alloforms      +/-          ?
Antiforms      +            -

Issues with making general conclusions from literature:
-confirmation bias (sampling not random)
-experimental approaches for characterization are heterogeneous
-isoform identity unknown

See Rolland et al Cell 2014

How widespread is isoform functional divergence in the whole proteome?

Systematic identification of large numbers of isoform pairs

Unbiased functional profiling for large numbers of human genes

Physical interactions

Enzymatic activities

Cellular localization

Stability

…..

Landscape of protein isoform functional divergence

[Figure: divergence plotted over large numbers of pairs of isoforms encoded by common genes; regions labeled “Mostly alloforms” and “Mostly isoforms”.]

How widespread is isoform functional divergence in the whole proteome?

Systematic identification of large numbers of isoform pairs

Unbiased functional profiling for large numbers of human genes

Physical interactions

Enzymatic activities

Cellular localization

Stability

…..

RNA sequencing data has been the primary means to characterize isoforms

PacBio

Oxford Nano.

Illumina NGS SOLiD

454 Solexa

splice-specific microarrays

ESTs

NGS (RNA-Seq) data can reveal the presence of exons and junctions, but fails to accurately reconstruct full-length isoforms.

Transcriptome reconstruction: “akin to reassembling magazine articles after they have been through a paper shredder.” Korf Nature Methods 2013

[Figure: excerpt from the RNA-seq Genome Annotation Assessment Project (RGASP) paper “Assessment of transcript reconstruction methods for RNA-seq” (Nature Methods, published online 3 November 2013; DOI:10.1038/NMETH.2714). Its Figure 6a shows RNA-seq read coverage (from STAR alignments), annotated GENCODE genes, and the exon predictions of the ten methods that quantified transcripts. For the gene RPF2, all methods reported different isoforms and expression levels.]

Iso-Seq enables direct sequencing of full-length isoforms and thus characterization of transcriptome complexity

PacBio detection: fluorescence

SMRT® Cells; Zero-Mode Waveguides; Phospholinked Nucleotides

PacBio® RS II; Trace

Eid et al. Science (2009)

Given a base mRNA sequence from Iso-Seq, one can predict open reading frames or “ORFs”. These become candidate protein isoforms for downstream functional analysis.

Gene: SSRP1, Strand: -, chr11:57335895-57327743

Gencode-annotated isoforms

Isoform in ORFeome collection

PacBio de novo, full-length sequences: brain, heart, liver

Legend: UTRs, CDS
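The ORF prediction step mentioned on this slide can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it assumes the Iso-Seq sequence is already oriented 5’ to 3’ as mRNA, scans only the three forward frames, takes the first ATG in each frame as the start, and picks the longest ATG-to-stop ORF. The function name `longest_orf` and the example sequence are hypothetical.

```python
# Standard genetic code, built in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES), AMINO_ACIDS
    )
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna):
    """Return (start, end, protein) of the longest ATG-to-stop ORF, or None.

    start/end are 0-based, end-exclusive nucleotide coordinates on the
    input sequence; end includes the stop codon. ORFs without an in-frame
    stop codon are ignored in this sketch.
    """
    seq = mrna.upper().replace("U", "T")
    best = None  # (start, end, length in amino acids)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                length_aa = (i - start) // 3
                if best is None or length_aa > best[2]:
                    best = (start, i + 3, length_aa)
                start = None
    if best is None:
        return None
    start, end, _ = best
    # Translate up to, but not including, the stop codon.
    protein = "".join(
        CODON_TABLE.get(seq[j:j + 3], "X") for j in range(start, end - 3, 3)
    )
    return start, end, protein


# Hypothetical transcript: 5' UTR, a short ORF (ATG GCT TGG AAA TAA), 3' UTR.
example = "GGCACCATGGCTTGGAAATAAGCT"
print(longest_orf(example))  # (6, 21, 'MAWK')
```

Everything upstream of `start` and downstream of `end` corresponds to the UTRs in the figure legend, and the span between them to the CDS.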

The challenge of detecting low-abundance isoforms

Wide dynamic range of human transcriptomes

Limited knowledge of isoform complexity for low-abundance transcripts

IL = Illumina data; PB = PacBio data. Figure from Tim Mercer (Capture-Seq)

We developed a new method to generate probes en masse for use in hybridization-based target capture: “ORF frag capture” + Iso-Seq

[Figure: probes derived from ORF sequence (pDONR223) are combined with a cDNA library; probes capture targeted transcripts; other transcripts are washed away; the result is enrichment of full-length isoform sequences.]

Pipeline

18,000 human ORFs as clones*

cherry pick target genes: retrieve targets (baits); cherry-pick entry clones from hORFeome

generate probes: biotin-spiked PCR; fragmentation

hyb. capture of targets from pond

qPCR and NGS QC: quantify fold enrichment; on-target rates

Iso-Seq: full-length sequencing and isoform discovery
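The QC step in this pipeline (fold enrichment and on-target rate) can be sketched as below. The formulas are common conventions assumed here for illustration and are not stated on the slide: the on-target rate is the fraction of reads mapping to targeted genes, NGS fold enrichment is the ratio of on-target rates between captured and uncaptured libraries, and the qPCR estimate assumes equal input mass and a fixed amplification efficiency (2.0 means perfect doubling per cycle). All function names and numbers are hypothetical.

```python
def on_target_rate(on_target_reads, total_reads):
    """Fraction of reads that map to the targeted genes."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return on_target_reads / total_reads


def ngs_fold_enrichment(capture_on, capture_total, control_on, control_total):
    """Ratio of on-target rates: captured library versus uncaptured control."""
    control_rate = on_target_rate(control_on, control_total)
    if control_rate == 0:
        raise ValueError("control has no on-target reads; ratio is undefined")
    return on_target_rate(capture_on, capture_total) / control_rate


def qpcr_fold_enrichment(ct_pre, ct_post, efficiency=2.0):
    """Fold change of a target amplicon from Ct values before and after capture.

    A lower Ct after capture means more target template per unit input.
    """
    return efficiency ** (ct_pre - ct_post)


# Hypothetical numbers, for illustration only.
rate = on_target_rate(800_000, 1_000_000)
fold_ngs = ngs_fold_enrichment(800_000, 1_000_000, 10_000, 1_000_000)
fold_qpcr = qpcr_fold_enrichment(ct_pre=25.0, ct_post=18.0)
print(f"on-target rate: {rate:.2f}")
print(f"NGS fold enrichment: {fold_ngs:.0f}x")
print(f"qPCR fold enrichment: {fold_qpcr:.0f}x")
```

Reporting both numbers is useful because a high on-target rate can still correspond to modest enrichment when the targets were already abundant in the starting library.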

*The ORFeome Collaboration Nature Methods (2016)

w/ Jason Underwood, Tyson Clark

“ORF frag capture” + Iso-Seq enables high-sensitivity discovery of full-length isoforms

Enrichment of 1000 human genes

[Figure: CREB3L4 (ENSG00000143578.15; Strand: +; chr1:153974363 - 153858772) and FOXP1 (ENSG00000114861.18; Strand: -; chr3:71583989 - 70965890). For each gene, tracks show the annotated Ensembl transcripts (ENST IDs), the Iso-Seq isoforms detected* after capture (isoseq IDs), and a control (no probes).]

*No isoforms detected in ultra-deep-coverage PacBio dataset of the same brain cDNA library.

“ORF frag-based” probe capture, sequencing, and discovery of full-length isoforms

target: ZNF302

How widespread is isoform functional divergence in the whole proteome?

Systematic identification of large numbers of isoform pairs

Unbiased functional profiling for large numbers of human genes

Physical interactions

Enzymatic activities

Cellular localization

Stability

…..

ORF-Seq: Systematic full-length isoform sequencing and cloning

[Figure: first page of Salehi-Ashtiani et al., “Isoform discovery by targeted cloning, ‘deep-well’ pooling and parallel sequencing”, Nature Methods, Vol. 5 No. 7, July 2008, p. 597 (Brief Communications); DOI:10.1038/NMETH.1224.]

Abstract (from the paper): Describing the ‘ORFeome’ of an organism, including all major isoforms, is essential for a system-level understanding of any species; however, conventional cloning and sequencing approaches are prohibitively costly and labor-intensive. We describe a potentially genome-wide methodology for efficiently capturing new coding isoforms using reverse transcriptase (RT)-PCR recombinational cloning, ‘deep-well’ pooling and a next-generation sequencing platform. This ORFeome discovery pipeline will be applicable to any eukaryotic species with a sequenced genome.

Figure 1 | The isoform discovery pipeline. First, ORFs are captured in RT-PCR experiments, cloned and transformed into Escherichia coli. Minipools of transformants for each gene may contain different isoforms. Second, deep-well pools are constructed by pooling the PCR-amplified ORF sequence from one transformant for each of many genes. This method of pooling ensures normalization of ORFs and avoids concurrent sequencing of multiple isoforms. Third, parallel sequencing is carried out separately on each deep well. The obtained reads are assembled using an SBA algorithm. Resulting ORF contigs are filtered for the presence of noncanonical splice acceptor/receptor sites and prior presence in sequence databases to identify unique ‘novel’ isoforms.

Pipeline labels: ORFeome primer pairs (5′ primer, 3′ primer); RT-PCR and Gateway cloning; PCR products (one well); minipool arrays; single colony isolates; ‘deep wells’ of pooled ORFs; deep-well 454 sequencing; alignment and assembly; annotation of new ORF isoforms.

Original ORF-Seq: Salehi-Ashtiani et al Nat Meth 2008

ORF-Seq: Systematic full-length isoform sequencing and cloning

hORFeome plates

ORF PCR with a well index and a plate index: well-indexed and plate-indexed “Filo-Seq”

pool wells

well-idx-A HEAD ORF TAIL plate-idx-A
well-idx-B HEAD ORF TAIL plate-idx-A
well-idx-C HEAD ORF TAIL plate-idx-A
well-idx-A HEAD ORF TAIL plate-idx-B
well-idx-B HEAD ORF TAIL plate-idx-B
well-idx-C HEAD ORF TAIL plate-idx-B
etc...

FL deep sequencing (PacBio, Oxford Nano)

FL seq. analysis (full-length ORF to plate / well assignment)
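The plate / well assignment step can be sketched as a simple demultiplexer over reads laid out as well index, HEAD, ORF, TAIL, plate index. This is an illustrative sketch under strong assumptions, not the analysis used in the talk: index and adapter sequences below are made up, matching is exact, and reads are assumed to be in the forward orientation. Real long reads would need error-tolerant matching and handling of reverse-complemented reads.

```python
def assign_read(read, well_indexes, plate_indexes, head, tail):
    """Assign one full-length read to (plate, well) and extract its ORF.

    Expected layout: <well index><HEAD><ORF><TAIL><plate index>.
    Returns (plate, well, orf), or None if no index pair matches exactly.
    """
    for well, well_seq in well_indexes.items():
        prefix = well_seq + head
        if not read.startswith(prefix):
            continue
        for plate, plate_seq in plate_indexes.items():
            suffix = tail + plate_seq
            if read.endswith(suffix) and len(read) >= len(prefix) + len(suffix):
                orf = read[len(prefix):len(read) - len(suffix)]
                return plate, well, orf
    return None


# Hypothetical index and adapter sequences, for illustration only.
WELL_INDEXES = {"A01": "ACGT", "A02": "TGCA"}
PLATE_INDEXES = {"plate-1": "GGAA", "plate-2": "CCTT"}
HEAD = "CACC"
TAIL = "TTGG"

reads = [
    "ACGT" + HEAD + "ATGAAATAA" + TAIL + "CCTT",  # well A01, plate-2
    "TGCA" + HEAD + "ATGCCCTGA" + TAIL + "GGAA",  # well A02, plate-1
    "NNNN" + HEAD + "ATGAAATAA" + TAIL + "CCTT",  # unknown well index
]
for read in reads:
    print(assign_read(read, WELL_INDEXES, PLATE_INDEXES, HEAD, TAIL))
```

Because the two indexes flank the insert, only a read that spans the whole amplicon carries both, which ties each full-length ORF sequence back to a single clone position.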

ORF-Seq: Systematic full-length isoform sequencing and cloning

~1,500 genes, 5 human tissues

1,423 full-length isoforms

506 genes with > 2 isoforms

1,677 pairs of alternative isoforms, each encoded by a common gene

NATURE METHODS | VOL.5 NO.7 | JULY 2008 | 597

Examples of isoforms

BTC

CD99L2

BANP

Novel full-length isoforms

506 genes with > 2 isoforms

Gene: ARHGEF15 Strand: + chr17:8322516 - 8310241

Gencode v25

Human ORFeome clone

ORF-Seq isoforms: known, novel

Intron lengths are not to scale.

Estimated isoform clone expression in the original, endogenous tissues

[Plot: fraction of isoforms (%) versus average TPM across five tissues (log10 scale, 0.01 to 10,000), by isoform category (reference, alternative).]

~500 multi-isoform genes

[Figure: panels for brain, heart, liver, placenta and testis, showing whether the reference or the alternative isoform is the major isoform.]

How widespread is isoform functional divergence in the whole proteome?

Systematic identification of large numbers of isoform pairs

Unbiased functional profiling for large numbers of human genes

Physical interactions

Enzymatic activities

Cellular localization

Stability

…..

Landscape of protein isoform functional divergence

[Figure: divergence across large numbers of pairs of isoforms encoded by common genes, ranging from "mostly isoforms" to "mostly alloforms".]

Comparative protein-protein interaction profiling for large numbers of isoform pairs

[Figure: for one gene, mRNA isoforms and the protein isoforms they encode are compared across interaction partners 1 to 5; interaction profiles range from identical to distinct.]

Primary screen using all isoforms

PPI matrix profiling

Verification and sequence confirmation

All first-pass pairs identified for any isoform were tested against all isoforms from a common gene.

Validation using an orthogonal assay (mammalian protein complementation assay)

High-quality PPI profiles

Validation

Validation using an orthogonal assay (PCA)

[Figure: schematics of the two assays. Two-hybrid, in yeast cells: DB-X and AD-Y, Pol II, reporter gene. PCA, in mammalian cells: X and Y, YFP, excitation (Ex) and emission (Em).]

[Plot: fraction of pairs recovered (%) versus PCA score threshold (arbitrary units) for the positive reference set, the random reference set, Y2H positive interactions and Y2H negative interactions. The threshold of 1% detection of the Random Reference Set is marked.]
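The thresholding shown in the validation plot can be sketched as follows. This is an illustration under stated assumptions, not the scoring procedure used for the slide: the function names, the candidate thresholds and all scores are hypothetical, and a pair is counted as recovered when its PCA score is at or above the threshold.

```python
def fraction_recovered(scores, threshold):
    """Fraction of pairs whose score is at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


def threshold_for_rrs_rate(rrs_scores, candidates, max_rate=0.01):
    """Lowest candidate threshold at which at most `max_rate` of the
    random reference set (RRS) still scores positive."""
    for threshold in sorted(candidates):
        if fraction_recovered(rrs_scores, threshold) <= max_rate:
            return threshold
    return None  # no candidate is stringent enough


# Hypothetical PCA scores (arbitrary units), not measured data.
rrs_scores = [0.5] * 95 + [2.5] * 4 + [5.5]
prs_scores = [0.5, 2.5, 3.5, 4.0, 5.0, 1.0, 3.2, 0.2, 6.1, 2.9]

threshold = threshold_for_rrs_rate(rrs_scores, candidates=[1, 2, 3, 4, 5, 6])
prs_recovery = fraction_recovered(prs_scores, threshold)
print(threshold, prs_recovery)
```

Fixing the threshold on the random reference set and then reading off the recovery of the other sets at that same threshold is what makes the recovered fractions comparable across sets.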

Isoform-specific interaction profiles

BTC

CD99L2

BANP

Isoform-specific interaction network

CD99L2 isoforms: convert into nodes and edges

Nodes: gene; reference isoform; alternative isoform; interaction partners

Edges: gene to isoform relationship; PPI involving the reference isoform; PPI involving only alternative isoforms

Nodes shown with CD99L2: SGTA, SLC30, UBQLN1, CREB3L1
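The conversion of isoform interaction profiles into nodes and edges can be sketched as below. The function name, the placeholder gene, isoform and partner names, and the profiles themselves are hypothetical. The sketch also assumes that an edge is labeled by partner: a partner that binds the reference isoform gets "PPI involving the reference isoform" edges, and a partner seen only with alternative isoforms gets "PPI involving only alternative isoforms" edges.

```python
def build_isoform_network(gene, reference, profiles):
    """Turn {isoform: set of partners} into typed nodes and edges."""
    nodes = {gene: "gene"}
    edges = []

    # Gene-to-isoform relationships.
    for isoform in sorted(profiles):
        if isoform == reference:
            nodes[isoform] = "reference isoform"
        else:
            nodes[isoform] = "alternative isoform"
        edges.append((gene, isoform, "gene to isoform"))

    # Isoform-to-partner interactions, labeled per partner (assumption).
    reference_partners = set(profiles.get(reference, ()))
    for isoform in sorted(profiles):
        for partner in sorted(profiles[isoform]):
            nodes.setdefault(partner, "interaction partner")
            if partner in reference_partners:
                kind = "PPI involving the reference isoform"
            else:
                kind = "PPI involving only alternative isoforms"
            edges.append((isoform, partner, kind))
    return nodes, edges


# Hypothetical profiles with placeholder names, not measured interactions.
profiles = {
    "GENE1_ref": {"P1", "P2"},
    "GENE1_alt1": {"P2", "P3"},
    "GENE1_alt2": set(),
}
nodes, edges = build_isoform_network("GENE1", "GENE1_ref", profiles)
for edge in edges:
    print(edge)
```

An isoform with no detected partners, such as `GENE1_alt2` here, still appears as a node linked to its gene, so the absence of interactions stays visible in the network.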

Yang et al Cell 2016

Interaction profile dissimilarity

Jaccard distance: 0 (identical), 0.5 (intermediate), 1.0 (distinct)
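As a worked illustration of the scale above, the Jaccard distance between two interaction profiles is one minus the number of shared partners divided by the number of partners seen with either isoform. The sketch below uses hypothetical partner names, and it assumes that two empty profiles are treated as identical, a convention the slides do not specify.

```python
def jaccard_distance(profile_a, profile_b):
    """1 - |A intersect B| / |A union B| for two sets of partners."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        # Assumption: two isoforms with no partners count as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical partner sets for pairs of isoforms of the same gene.
identical = jaccard_distance({"P1", "P2"}, {"P1", "P2"})
intermediate = jaccard_distance({"P1", "P2"}, {"P1"})
distinct = jaccard_distance({"P1", "P2"}, {"P3"})
print(identical, intermediate, distinct)
```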

[Plot: interaction profile dissimilarity (Jaccard distance, 0 to 1.0) for each pair of protein isoforms encoded by the same gene, across about 110 isoform pairs spanning identical, intermediate and distinct profiles.]

Sequence features in isoforms underlie interaction perturbations

[Figure: protein isoforms of NDN paired with U2AF1 (RRM LMBD); linear motif RILGLRPW.]

Isoforms with a high density of linear motifs tend to promote interactions (p-value = 5.7 x 10^-4)

[Figure: protein isoforms of BCL2L1 paired with BAD (BCL2-BAD domain); domains BH4 and BCL2.]

Truncated domains tend to correspond to a loss of interaction (p-value = 6.4 x 10^-5)

How functionally divergent are isoform-specific interactors?

Rolland et al Cell 2014

Illumina BodyMap 2.0: gene expression for 16 human tissues

Disease subnetworks derived from GeneCards database

Safran et al Database 2010

Legend: isoforms of the same gene; gene without known disease association; disease-associated gene

Genes shown: SGTA, SLC30A2, CD99L2, COL1A2, UBQLN1, CREB3L1, PLP1

Diseases shown: (1) Ehlers-Danlos syndrome type IV; (2) Osteogenesis imperfecta type II; Pelizaeus-Merzbacher disease

How widespread is isoform functional divergence in the whole proteome?

Systematic identification of large numbers of isoform pairs

Unbiased functional profiling for large numbers of human genes

Physical interactions

Enzymatic activities

Cellular localization

Stability

…..

Integration of ORF-Seq with third generation sequencing technologies

Pipeline in progress for 800 human TFs

Gene: ESRRG, Strand: -, chr1:216505473-216723335

Isoform Name     Appris Status
GC ESRRG-001     -
GC ESRRG-002     principal 1
GC ESRRG-003     principal 1
GC ESRRG-004     principal 1
GC ESRRG-005     principal 1
GC ESRRG-006     -
GC ESRRG-007     principal 1
GC ESRRG-008     -
GC ESRRG-010     principal 1
GC ESRRG-011     -
GC ESRRG-012     principal 1
GC ESRRG-013     principal 1
GC ESRRG-016     -
GC ESRRG-017     -
GC ESRRG-019     -
GC ESRRG-023     -
GC ESRRG-201     principal 1
GC ESRRG-202     principal 1
GC ESRRG-203     principal 1

Gene: RFX3, Strand: -, chr9:3225042-3395588

Isoform Name     Appris Status
GC RFX3-001      -
GC RFX3-002      -
GC RFX3-003      -
GC RFX3-004      -
GC RFX3-005      -
GC RFX3-006      principal 1
GC RFX3-007      -
GC RFX3-008      -
GC RFX3-009      -
GC RFX3-010      -
GC RFX3-201      -
GC RFX3-202      principal 1

ORFID.14549 ORFID_10820 ORFID.6060

X ESRRG.MP1
X ESRRG.MP2
X ESRRG.MP3
X ESRRG.MP5
X ESRRG.MP7
X ESRRG.MP8
X ESRRG.MP9
X ESRRG.MP10
X ESRRG.MP11
X ESRRG.MP12

X RFX3.MP2
X RFX3.MP3
X RFX3.MP5
X RFX3.MP6
X RFX3.MP7
X RFX3.MP8
X RFX3.MP9
X RFX3.MP10

Isoform clones that are not present in Gencode v25, and thus represent novel forms.
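One way to check whether a clone is absent from an annotation is to compare intron chains, sketched below. Matching on intron chains is an assumed simplification, not necessarily the comparison used for these clones, and the function names, transcript labels and coordinates are hypothetical.

```python
def intron_chain(exons):
    """Ordered (intron start, intron end) pairs implied by exon intervals.

    Exons are (start, end) tuples on one strand of one chromosome.
    """
    ordered = sorted(exons)
    return tuple(
        (ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)
    )


def is_novel(clone_exons, annotated_isoforms):
    """True if no annotated isoform shares the clone's intron chain.

    Assumption: differences confined to the outer exon ends are ignored,
    and all single-exon models share the same (empty) chain.
    """
    known_chains = {intron_chain(ex) for ex in annotated_isoforms.values()}
    return intron_chain(clone_exons) not in known_chains


# Hypothetical annotation and clones with made-up coordinates.
annotation = {
    "T1": [(100, 200), (300, 400), (500, 600)],
    "T2": [(100, 200), (500, 600)],
}
known_clone = [(150, 200), (300, 400), (500, 650)]   # same introns as T1
novel_clone = [(100, 200), (300, 450), (500, 600)]   # shifted splice site

print(is_novel(known_clone, annotation), is_novel(novel_clone, annotation))
```

Comparing splice junctions rather than full exon coordinates keeps clones that differ from an annotated isoform only at their outer exon ends from being counted as novel.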

Multi-parameter comparative profiling of isoforms

Variations in TF sequence perturb protein and/or DNA interactions

Similar perturbations likely for isoforms

TP63 isoforms

Domains: Activation, DNA-binding, Oligomerization, Transactivation, Inhibition

[Figure: panels A to D; values 47% and 35% are labeled.]

Figure 7. Integration of Protein-Protein and Protein-DNA Interaction Perturbations
(A) PDI edgotype distribution for disease mutations in 22 TFs that bind to more than one enhancer.
(B) Histogram showing percentage of mutations within and outside DBDs as a function of the percentage of PDI loss. Numbers on x axis indicate bin range. p values by one-sided Wilcoxon rank sum test.
(C) Percentage of TF mutation pairs that cause different diseases out of all pairs with different or the same PDI edgotype classes (n = 17). Error bars, SE of the proportion. p values by one-sided Fisher's exact test.
(D) PPI-PDI integration enables mutation characterization at higher resolution. Percentage of mutations is shown for: PPI and PDI unperturbed; PPI unperturbed and PDI perturbed; PPI perturbed and PDI unperturbed; and PPI and PDI perturbed in the integrated network.
See also Figure S7.

Caption excerpt from the preceding figure:
(C) Most perturbed partners of TPM3 are expressed in the disease-relevant tissue.
(D) Edgetic mutations in EFHC1 perturb epilepsy-related protein partners.
(E) Correlation between the fraction of PPI perturbation and age of onset for mutation pairs causing the same disease. p values by comparing the observed value to 100,000 random controls (n = 13; Extended Experimental Procedures).
See also Figure S6.

DISCUSSION
In this systematic characterization of mutations across various human Mendelian disorders, we have found surprisingly widespread disease-specific perturbations of macromolecular interactions. Approximately 60% of disease-associated missense mutations perturb PPIs, among which half result in complete loss of interactions, generally caused by protein misfolding and…

Sahni et al Cell 2015 (Cell 161, 647–660, April 23, 2015, ©2015 Elsevier Inc.)

Will conduct screens for differential protein-protein and protein-DNA interactions for TF isoforms

Acknowledgements

Xinping Yang, Shuli Kang, Tong Hao, Jasmin Coulombe-Huntington

DFCI, Boston Michael Calderwood Tong Hao Carl Pollis Marc Vidal Soon Gang Choi David Hill Aaron Richardson Dawit Balcha Meaghan Daley Katja Luck Sadie Schlabach Wenting Bian David DeRidder Dylan Markey Kerstin Spirohn Donnelly Center, Toronto Tiziana Cafarelli Alice Desbuleux Julien Olivet Yang Wang Fritz Roth

UCSD, San Diego Lilia Iakoucheva Shuli Kang

McGill, Montreal Yu Xia Jasmin Coulombe-Huntington