Personalized Oncogenomics (POG): A bioinformatics pipeline for analyzing whole genome and transcriptome sequence data from individual patient tumours

Yussanne Ma

Province-wide, population-based control program (population: 4.5 million; 24,000 new cases of cancer per year)
Cancer care from prevention/screening to diagnosis and treatment
Research into causes of and cures for cancer

Canada’s Michael Smith Genome Sciences Centre
Develop and deploy technologies for life sciences research, in particular cancer research
Employs a total of 302 staff and has an annual sequencing capacity of 600Tb
Involved in many collaborative sequencing projects within Canada and internationally (including TCGA, MAGIC, CIRM/HALT)

Personalized Oncogenomics (POG)

Scope: Utilization of genomic information to augment chemotherapy decision-making for people with incurable malignancies (for which standard chemotherapy regimens fail or do not exist)

Approach: Whole genome and transcriptome sequencing, followed by data analysis to identify putative cancer drivers, aberrant pathways and therapeutic options

POG to date: the first 100 adult and 6 pediatric cases

Adult cohort: 17 male, 80 female; average of 3 lines of chemotherapy received before POG

Breast: 38
GI (includes pancreas): 14
Lung: 11
Gynecological: 10
Head and Neck: 4
Sarcoma: 4
Primary unknown: 4
Peritoneal Mesothelioma: 3
Adrenal: 2
Hematological: 2
Skin: 2
Other: 6

Pediatric cohort: 4 male, 2 female; median age 10 at consent

A typical workflow of POG

[Figure: POG workflow with stages labelled Consult & Consent, Sample collection (biopsy, normal, archival), Sequencing (AmpliSeq panel sequencing, whole genome sequencing, RNA sequencing), Data analysis, and Therapeutic recommendations]

The GSC Bioinformatics team

[Figure: team structure, with groups labelled Sequencing, Sequence Development, Bioinformatics tools, Level 0 QC, Omics, RNA analysis, Bioapps, Data analysis, Contaminants and reagents tracking, LIMS (laboratory information management system), Structural Variant and Integrative Analysis, and Results]

The POG Bioinformatics pipeline

Alignment

Raw sequence reads (fastqs) are produced using real-time base-calling (Illumina software) on a compute cluster and storage directly connected to the sequencers

Fastqs are aligned to the reference using BWA 0.5.7 to produce binary alignment (bam) files

LIMS storage and alignment capacity scales with sequencer output

Alignment is run on a dedicated cluster with its own scheduler so that it never competes with other jobs, and turnaround time is consistently around 1-2 days per lane

Microbial detection pipeline

Raw sequence reads are rapidly screened to detect possible microbial content. This is informative for detecting contaminants and also viruses associated with cancer (e.g. EBV, HPV)

Each paired-end read is classified as human or as one of 47 infectious agent species using a hash-based classifier that looks for exact matches of 25-mers. The algorithm is an implementation of Bloom filters (Bloom, 1970)

The process is very fast but bound by the amount of memory, which increases linearly with the total size of the reference filters
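The screening idea can be illustrated with a minimal Bloom filter sketch. This is not the GSC implementation: the class and function names, the filter size, the number of hash functions, the use of SHA-256 and the toy sequences are all assumptions made for illustration. Only the 25-mer length and the exact-match lookup come from the description above, and a single read is classified rather than a read pair.

```python
import hashlib

K = 25  # k-mer length quoted in the text


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits=1 << 20, n_hashes=3):  # sizes are assumptions
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_filter(reference, **kwargs):
    """Load all k-mers of one reference sequence into a new filter."""
    bf = BloomFilter(**kwargs)
    for kmer in kmers(reference):
        bf.add(kmer)
    return bf


def classify_read(read, filters):
    """Return the label whose filter matches the most k-mers, or None."""
    hits = {label: sum(km in bf for km in kmers(read))
            for label, bf in filters.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


# Hypothetical toy references standing in for real genomes.
human_ref = "ACGT" * 15
virus_ref = "GATTACA" * 10
filters = {"human": build_filter(human_ref), "virus": build_filter(virus_ref)}
label = classify_read(human_ref[3:43], filters)
```

Each filter's bit array is fixed in size, so memory grows with the number and size of the reference filters, which matches the memory bound noted above.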

[Figure: forward and reverse reads screened against filters made from 5000 downloaded reference sequences]

Alignment-based identification of molecular aberrations

After alignment to the reference, the tumour and normal alignments are compared to each other to identify molecular aberrations in the tumour and to search for possible treatment targets

Genomic aberrations include single nucleotide variants, copy number change, loss of heterozygosity, and fusions

Genes with dysregulated expression in the tumour are often targeted in therapy, so we also use RNA sequencing to look for genes which are over- and underexpressed

SNP and indel calling

[Figure: a single nucleotide polymorphism]

SNP and indel calling is performed on the tumour/normal pair to identify putative somatic variants, i.e. variants present in the tumour genome but not in the normal sample (run as parallel jobs)
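As a rough illustration of "present in the tumour but not in the normal", the sketch below filters tumour calls against normal calls by exact match. This is a deliberate simplification and not the pipeline's caller: real somatic callers weigh read evidence from both samples rather than comparing finished call lists. The `Variant` type, the function name and the example calls are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record."""
    chrom: str
    pos: int
    ref: str
    alt: str


def putative_somatic(tumour_calls, normal_calls):
    """Keep tumour variants that have no identical call in the normal."""
    germline = set(normal_calls)
    return [v for v in tumour_calls if v not in germline]


# Made-up example calls.
tumour = [Variant("17", 100, "A", "C"), Variant("2", 50, "G", "T")]
normal = [Variant("2", 50, "G", "T")]
somatic = putative_somatic(tumour, normal)
```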

Coding variants which change the protein produced by the tumour cell are prioritized.

The level of expression of genomic somatic variants is estimated by looking in the transcriptome bam files (submitted as several thousand single CPU jobs)

[Figure: example variant L130F]

Copy Number Variation

Read depth in whole genomes can be used to estimate if the genome is diploid (normal) or if there are amplifications or deletions of genomic regions

[Figure: copy number loss, seen as 0.5x coverage compared to the rest of the genome/tumour]

The comparison of tumour and normal samples allows for relative coverage to be used, which greatly reduces the noise of the signal
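A minimal sketch of relative coverage follows, assuming read depth has already been counted in matching genomic bins for both samples. The function name, the total-depth scaling and the pseudocount are illustrative assumptions, not the GSC method.

```python
import math


def log2_ratios(tumour_depth, normal_depth, pseudocount=1.0):
    """Per-bin log2(tumour/normal), each sample scaled by its total depth."""
    if len(tumour_depth) != len(normal_depth):
        raise ValueError("tumour and normal must use the same bins")
    t_total = sum(tumour_depth)
    n_total = sum(normal_depth)
    ratios = []
    for t, n in zip(tumour_depth, normal_depth):
        # The pseudocount avoids log(0) in bins with no reads.
        t_norm = (t + pseudocount) / t_total
        n_norm = (n + pseudocount) / n_total
        ratios.append(math.log2(t_norm / n_norm))
    return ratios


# Made-up depths: the third bin has lost roughly half its coverage in tumour.
ratios = log2_ratios([100, 100, 50, 100], [100, 100, 100, 100])
```

Bins near zero suggest no change relative to the normal, negative values suggest loss and positive values suggest gain. Coverage biases shared by both samples largely cancel in the ratio, which is one reason the paired comparison is less noisy.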

Our CNV detection method uses Hidden Markov Models to infer the copy number state based on normalized coverage over a region

Somatic copy number variants

[Figure: tracks for gain, loss and LOH; chr17 detail with BIRC5, CDK12 and ERBB2 labelled]

Genome-wide view of copy number losses and gains, and loss of heterozygosity in tumour (left)

Focal gains of multiple copies in region containing ERBB2 (above)

This observation, along with a gain-of-function mutation and overexpression, led to a therapeutic recommendation of a targeted inhibitor for this patient

De novo sequence assembly using ABySS

Large scale structural rearrangement is more accurately detected by de novo assembly

[Figure: assembly of reads into contigs, which are aligned back to the reference]

ABySS was developed at the GSC and uses a distributed representation of a de Bruijn graph to allow for parallel computation
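To make the de Bruijn graph idea concrete, here is a toy single-process sketch. It is not ABySS, which distributes the graph across machines and handles sequencing errors, reverse complements and branching. The function names and example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from start while exactly one successor exists."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping made-up reads from the sequence ACGTTGCATC.
reads = ["ACGTTG", "GTTGCA", "TGCATC"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_contig(graph, "ACG")
```

Because each node depends only on its own k-mer, nodes can be assigned to different machines by hashing, which is the property a distributed representation exploits.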

The assembly pipeline is used to call fusions, insertions and deletions, inversions and internal tandem duplication events in the transcriptome and genome

It is also used to detect integration sites where viruses have integrated into the tumour genome

EML4-ALK fusion identified using assembly

Transcriptome and genome sequencing revealed a chr2 inversion fusing the EML4 and ALK genes

[Figure: fusion of EML4 exons 1-13 (coiled-coil domain: dimerization, kinase activation) with ALK exons 20-29 (tyrosine kinase domain)]

Patient tested negative for fusion in previous clinical screen

Sequence analysis at chr2 breakpoints identified a further inversion and insertion into chr12 that appears to prevent the Vysis dual-colour break-apart probe from hybridizing

[Figure: FISH probes shown against the chr2 inversion, and against the chr2 inversion & insertion into chr12]

Response to ALK inhibition (Crizotinib)

EML4-ALK fusion, with high overall expression of ALK together with ROS1 over-expression, was reported to the treating oncologist

TKI Crizotinib was immediately administered

The tumour responded dramatically

[Figure: scans from Sept 4, 2013 (before Crizotinib) and Dec 12, 2013; Crizotinib started Sept 25]

Expression analysis

[Figure: reads aligned across Exon 1, Intron 1 and Exon 2]

Reads aligned to genome are ‘repositioned’ across exon-exon junctions

Exon and gene expression is calculated from normalized read depth
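The slides do not say which normalization is used. As one common possibility, assumed here purely for illustration, RPKM scales a feature's read count by its length and by the total number of mapped reads. The function name and the example numbers are hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2 kb gene in a 25-million-read library.
value = rpkm(500, 2000, 25_000_000)
```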

As normal RNA is not usually available, expression in tumour is compared both to publicly available normal tissue compendiums and to collections of tumour samples from The Cancer Genome Atlas (TCGA) project to identify outliers

Expression is correlated with tumour cohorts from TCGA using Spearman correlation

[Figure: correlation of a sample with TCGA cohorts, including BRCA and PRAD]

This has helped to clarify and confirm diagnosis
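The correlation step can be sketched in plain Python as below. Spearman correlation is the Pearson correlation of ranks, so it depends only on the ordering of expression values. The function names, cohort profiles and sample values are made up for illustration; in practice each profile would cover thousands of genes.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def best_matching_cohort(sample, cohort_profiles):
    """Return the cohort whose profile correlates best with the sample."""
    return max(cohort_profiles,
               key=lambda name: spearman(sample, cohort_profiles[name]))


# Hypothetical four-gene profiles; real cohort profiles would come from TCGA.
cohorts = {"BRCA": [50, 10, 90, 30], "PRAD": [3, 9, 1, 5]}
match = best_matching_cohort([5, 1, 9, 3], cohorts)
```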

Breast cancer subtyping is also often confirmed by correlating with TCGA breast cancer cohorts

In the case above, a vulvar tumour was found to be an extra-mammary breast tumour, in part due to the high correlation with breast tumour expression profiles

Integrating variant data into pathways

[Figure: pathway diagram annotating genes in the Hedgehog, PKC, MAPK, AKT/mTOR, DNA repair, gene regulation, cell proliferation, cell cycle, and chromatin remodeling and gene expression pathways with expression, copy number and mutation data, including markers for doxorubicin/epirubicin/taxane resistance and gemcitabine resistance. Legend: [%] expression percentile vs. NBL; [] fold change vs. adjacent normal; overexpressed and underexpressed genes; activation, inhibition and indirect interactions; tumour suppressors; drug targets; LoF, inactivating mutation; VUS, unknown mutation; copy number gain or loss shown as 1 star per copy. BCCA Confidential - For Research Purposes Only]

Developing the pipeline for variant interpretation

We have successfully built a pipeline which can rapidly produce a molecular profile of a tumour and categorize genomic and transcriptomic aberrations

Interpretation of this complex genomic landscape into clinically relevant information remains challenging and is often limited by availability and curation of publicly available knowledge

As we move from research to clinic, a number of challenges still need to be addressed

POG reports

Two reports are generated for POG

The first is a targeted alignment report which looks for a small list of specific actionable variants

The second report, which is under development, aims to provide a more complete genomic analysis

Both reports are based on a hand-curated knowledge base of literature that we have built over the past several years and will continue to expand

Variant databases

HVDB (SNVs and indels): size 1,464 Gb; 2,420 libraries; records: 4,81,967,204 SNVs, 140,978,724 deletions, 116,506,738 insertions
CNVDB: single-sample and paired CNV calls
SampleDB: biological and technical metadata; ~15,000 samples
ExpressionDB: size 252 Gb; 12,016 libraries; records: 274,158,056 gene expression, 618,411,965 isoform expression
SVDB: fusions, large indels, inversions
Clinical information
Therapeutics and Outcomes

Moving from a relational database solution to a Hadoop solution allows us to scale the databases while still allowing for fast queries

As we continue to sequence more patients and record outcomes, we will be able to use data from previous POG cases to inform future cases

IBM Watson collaboration

We are currently collaborating with IBM to help develop their Watson Genomic Analytics (WGA) platform

WGA performs a molecular profile analysis, followed by a pathway analysis using information from public pathway databases, augmented with interactions discovered by natural language processing from 20 million PubMed abstracts

The output of WGA consists of a list of drug candidates with rationales and scores, along with literature evidence that supports their consideration

We will be using our analysis of the first 100 POG patients to test and train the WGA method and compare their reports with our findings

Summary of POG outcomes for first 100 adult patients

Outcome of POG: number of patients (percentage)

Sufficient tissue for POG analysis: 78 (78%)
Insufficient tissue for POG analysis
  Biopsy content too low for sequencing: 16 (16%)
  Unable to biopsy due to specific patient factors: 6 (6%)
Informative: 65 (65% total, 83% of sequenced)
Actionable: 55 (55% total, 71% of sequenced)
Patients received POG-informed treatment: 34 (34% total, 44% of sequenced, 62% when there was something actionable identified)
Clinician assessed clinical or radiographic improvement in cancer (including stable disease): 14 (41%, 14 of 34)
Actionable target identified but the patient was too unwell or died before POG-informed therapy could be offered: 13 (24%, 13 of 55)
Amended or clarified the diagnosis or primary site: 5 (5% of total, 6% of sequenced)

Compute resources for POG

Efforts to scale the bioinformatics pipeline for POG have focussed on 3 main areas

1) Process automation. We have standardized much of our analysis and most pipelines are run as cron jobs, tracked through a central database

2) Analysis efficiency. We regularly benchmark the tools in our pipeline against new tools that may run more efficiently while maintaining the accuracy of our current pipeline

3) Overcoming compute bottlenecks. Some processes are CPU bound, some are memory bound and almost all are I/O bound

LIMS Hardware and storage

New cluster with 16 nodes, 512 cores to handle increased sequencer capacity

Node specifications:
• 128 GB RAM
• 16 Processors (32 Cores) Hyper threading
• 10 Gb bandwidth

894 TB sequencer scratch space
50 TB Isilon system for scratch analysis
2 PB Archived storage of raw sequence and alignment data

[Figure: Cumulative data usage for downstream analysis, in terabytes (0 to 600), 2007 to 2013. Annotation "New storage volume acquired" with dates: Aug 22, 2007; Apr 23, 2009; Aug 25, 2009; Jan 6, 2010; Mar 24, 2010; Nov 19, 2010; Dec 6, 2011; Apr 23, 2012; May 20, 2012; Feb 20, 2013; July 31, 2013; Oct 16, 2013; Dec 5, 2013]

Total analysis and compute resources at GSC

[Figure: storage in petabytes (0 to 9) and compute in number of cores (0 to 9000), 2008 to 2014]

Data footprint for personalized oncogenomics project

For each patient, we sequence 3 Genomes (normal, archival, tumour) + 1 Transcriptome (tumour)

Process                 Genome/Transcriptome     Backup               Total storage
Raw sequence data       Genome + Transcriptome   Backed up            3 x g + t = 150GB (zipped)
Initial alignment       Genome + Transcriptome   Backed up            3 x g + t = 450GB
Repositioned alignment  Transcriptome only       Backed up            25GB
Split alignment         Transcriptome only       Temporary            25GB
Assembly                Genome + Transcriptome   Partially backed up  3 x g + t = 200GB
Trans-ABySS             Genome + Transcriptome   Partially backed up  3 x g + t = 200GB
Microbial detection     Genome + Transcriptome   Temporary            2 x g + t = 100GB (zipped)
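Summing the per-step figures listed above gives the per-patient scratch requirement. The short calculation below is only a worked check of that arithmetic: the dictionary and function names are hypothetical, and decimal units (1 TB = 1000 GB) are assumed. Cumulative storage is not computed because the backed-up fraction of the partially backed up steps is not stated.

```python
# Per-step storage in GB, copied from the rows above.
STEP_SCRATCH_GB = {
    "Raw sequence data": 150,
    "Initial alignment": 450,
    "Repositioned alignment": 25,
    "Split alignment": 25,
    "Assembly": 200,
    "Trans-ABySS": 200,
    "Microbial detection": 100,
}


def scratch_tb(n_patients=1):
    """Total scratch space in TB, assuming 1 TB = 1000 GB."""
    return n_patients * sum(STEP_SCRATCH_GB.values()) / 1000


per_patient_tb = scratch_tb()        # one patient
cohort_pb = scratch_tb(3000) / 1000  # 3000 patients, in PB
```

Under these assumptions one patient needs 1.15 TB of scratch space and 3000 patients need 3.45 PB, which rounds to 3.5 PB.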

1 patient = 1.15 TB scratch / 0.8 TB cumulative storage

3000 patients = 3.5 PB scratch and 2.4 PB cumulative storage

Acknowledgements

BC Cancer Agency Genome Sciences Centre

Dr. Janessa Laskin Marco Marra Eric Chuah Steven Jones Dr. Francois Benard Dr. Youwen Zhou Dean Cheng

Dr. Vanessa Bernstein Dr. Diego Villa Nina Thiessen Erin Pleasance Dr. Kim Chi Dr. David Huntsman Carolyn Ch’ng Martin Jones Dr. Rebecca Cosse Dr. David Schaeffer An He Dr. Stephan Chia Yaoqing Shen Dr. Tony Ng Johnson Pang Dr. Randy Gascoyne Katayoon Kasaian Dr. Stephen Yip Tina Wong Dr. Karen Gelmon Sreeja Leelakumari Dr. Malcolm Hayes William Long Dr. Sharlene Gill Yvonne Li Dr. Kathy Ceballos Joseph Juhn Dr. Cheryl Ho Pinaki Bose Dr. Hagan Kenneke Dr. Anthony Karnezis Jenny Qian

Dr. Mohamed Khan Dr. Aly Karsan Kelsey Zhu Jacquie Schein Dr. Christian Kollmannsberger Dr. Gillian Mitchell Karen Mungall Peggy Tsang Dr. Meg Knowling Dr. Intan Schrader Dustin Bleile Rebecca Carlsen Dr. Howard Lim Dr. Dan Renouf Alex Hammel Dr. Caroline Lohrisch Dr. Kerry Savage Armelle Troussard Richard Mar Dr. Monty Martin Dr. Tamara Shenkier Yongjun Zhao Melika Bonakdar Dr. Torsten Nielsen Dr. Christine Simmons Pawan Pandoh Caralyn Reisle Dr. Sophie Sun Sam Aparicio Simon Haile Merhu Richard Corbett Dr. Anna Tinker Hector Li Chang Helen McDonald Greg Taylor Dr. Sheridan Wilson Balvir Deol Richard Moore Simon Chan

Peter Eirew Mike Mayo Young Song BC Children’s Hospital Julie Ho Andy Mungall Nisa Dar Farzad Jamshidi Angela Tam Dr. Alice Virani Kane Tse Robyn Roscoe Dr. Rod Rassekh Julie Loree Dr. Rebecca Deyell Amy Lum Alexandra Fok IBM Dr. Anna Lee Cydney Nielsen Payal Sipahimalani Dr. Christopher Dunham Tomo Osako Ajay Royyuru Dr. Caron Strahlendorf Sohrab Shah Lance Bailey Takahiko Koyama Dr. Paul Rogers Roland Santos Colleen Jantzen Brandon Sheffield Zeev Waks Colleen Fitzgerald Kulbir Multani Boaz Carmelli

And to all of our patients