Personalized Oncogenomics (POG): A bioinformatics pipeline for analyzing whole genome and transcriptome sequence data from individual patient tumours
Yussanne Ma

Canada’s Michael Smith Genome Sciences Centre
Develop and deploy genomics technologies for life sciences research, in particular cancer research
Employs a total of 302 staff and has an annual sequencing capacity of 600Tb
Involved in many collaborative sequencing projects within Canada and internationally (including TCGA, MAGIC, CIRM/HALT)

Province-wide, population-based cancer control program (Population: 4.5 million, 24,000 new cases of cancer per year)
Cancer care from prevention/screening to diagnosis and treatment
Research into causes of and cures for cancer
Personalized Oncogenomics (POG)

Scope: Utilization of genomic information to augment chemotherapy decision-making for people with incurable malignancies (for which standard chemotherapy regimens fail or do not exist)

Approach:
Whole genome and transcriptome sequencing
Data analysis to identify putative cancer drivers and aberrant pathways and therapeutic options
POG to date: the first 100 adult and 6 pediatric cases

Adult cohort
17 male, 80 female, average of 3 lines of chemotherapy received before POG
Breast: 38
GI (includes pancreas): 14
Lung: 11
Gynecological: 10
Head and Neck: 4
Sarcoma: 4
Primary unknown: 4
Peritoneal Mesothelioma: 3
Adrenal: 2
Hematological: 2
Skin: 2
Other: 6

Pediatric cohort
4 male, 2 female, median age 10 at consent

A typical workflow of POG
[Figure: POG workflow. Consult & Consent; Sample collection (Biopsy, Normal, Archival); Sequencing (AmpliSeq Panel sequencing, Whole Genome Sequencing, RNA Sequencing); Data analysis; Therapeutic recommendations]
The GSC Bioinformatics team

[Figure: GSC bioinformatics team structure. Labels: Sequencing; Sequence Development (Bioinformatics tools); Level 0 QC (Contaminants and reagents); Omics (RNA analysis); Bioapps (Data analysis and tracking); LIMS (Laboratory information management system); Structural Variant and Integrative Analysis; Results]

The POG Bioinformatics pipeline

Alignment
Raw sequence reads (fastqs) are produced using real-time base-calling (Illumina software) on a compute cluster and storage directly connected to the sequencers
Fastqs are aligned to the reference using BWA 0.5.7 to produce binary alignment (BAM) files
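The slides give only the aligner and its version, not the options used. As a rough sketch of what a paired-end BWA run of that generation looks like (`bwa aln` per mate, then `bwa sampe`, then conversion to BAM with `samtools view`), the helper below builds the command lines. The function names, file names and thread count are hypothetical, and the use of samtools for the BAM conversion is an assumption rather than something stated in the talk.

```python
import subprocess


def alignment_steps(reference, fastq1, fastq2, prefix, threads=4):
    """Return (argv, output_path) pairs for a paired-end BWA alignment.

    Each command writes its result to stdout, so the caller redirects
    stdout to output_path. All paths here are illustrative.
    """
    sai1, sai2 = f"{prefix}_1.sai", f"{prefix}_2.sai"
    sam, bam = f"{prefix}.sam", f"{prefix}.bam"
    return [
        # Find suffix-array coordinates for each mate separately.
        (["bwa", "aln", "-t", str(threads), reference, fastq1], sai1),
        (["bwa", "aln", "-t", str(threads), reference, fastq2], sai2),
        # Pair the two mates and emit SAM records.
        (["bwa", "sampe", reference, sai1, sai2, fastq1, fastq2], sam),
        # Convert SAM to binary BAM (assumed step, not named in the slides).
        (["samtools", "view", "-b", "-S", sam], bam),
    ]


def run_steps(steps):
    """Run each command in order, redirecting stdout to its output file."""
    for argv, output_path in steps:
        with open(output_path, "wb") as handle:
            subprocess.run(argv, stdout=handle, check=True)
```

Calling `run_steps(alignment_steps("ref.fa", "lane1_1.fq", "lane1_2.fq", "lane1"))` would execute the four steps in order, provided `bwa` and `samtools` are installed and the reference has been indexed.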
LIMS storage and alignment capacity scales with sequencer output
Alignment is run on a dedicated cluster with its own scheduler so that it never competes with other jobs, and turnaround time is consistently around 1-2 days per lane

Microbial detection pipeline
Raw sequence reads are rapidly screened to detect possible microbial content. This is informative for detecting contaminants and also viruses associated with cancer (e.g. EBV, HPV)
Each paired-end read is classified as human or as one of 47 infectious agent species using a hash-based classifier that looks for exact matches of 25-mers. The algorithm is an implementation of Bloom filters (Bloom, 1970)
The process is very fast but bound by the amount of memory, which increases linearly with the total size of the reference filters
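The classification idea can be illustrated with a toy Bloom filter. This is a minimal sketch and not the GSC implementation: the hashing scheme, filter size, the "most 25-mer hits wins" rule and all names are assumptions made for illustration, and reverse-complement k-mers are ignored for brevity.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through several hash functions."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k=25):
    """Yield every overlapping k-mer of seq (upper-cased)."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_filter(reference_seqs, k=25, **filter_args):
    """Load all k-mers of the given reference sequences into one filter."""
    bloom = BloomFilter(**filter_args)
    for seq in reference_seqs:
        for kmer in kmers(seq, k):
            bloom.add(kmer)
    return bloom


def classify_pair(forward, reverse, filters, k=25):
    """Return the filter name with the most k-mer hits over both mates,
    or None if no k-mer matches any filter."""
    hits = {name: 0 for name in filters}
    for read in (forward, reverse):
        for kmer in kmers(read, k):
            for name, bloom in filters.items():
                if kmer in bloom:
                    hits[name] += 1
    best = max(hits, key=hits.get, default=None)
    return best if best is not None and hits[best] > 0 else None
```

A Bloom filter can report false positives but never false negatives, and its memory use is fixed by the size of the bit array, which is consistent with the memory-bound behaviour described above.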
[Figure: forward read and reverse read screened against filters made from 5000 downloaded reference sequences]

Alignment based identification of molecular aberrations
After alignment to the reference genome, the tumour and normal alignments are compared to each other to identify molecular aberrations in the tumour to search for possible treatment targets
Genomic aberrations include single nucleotide variants, copy number change, loss of heterozygosity, and gene fusions
Genes with dysregulated expression in the tumour are often targeted in therapy, so we also use RNA sequencing to look for genes which are over- and underexpressed
SNP and indel calling

[Figure: Single nucleotide polymorphism]
SNP and indel calling is performed on the tumour/normal pair to identify putative somatic variants, i.e. variants present in the tumour genome but not in the normal sample (run as parallel per-chromosome jobs)
Coding variants which change the protein produced by the tumour cell are prioritized.
The level of expression of genomic somatic variants is estimated by looking in the transcriptome bam files (submitted as several thousand single-CPU jobs)
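The "present in the tumour but not in the normal" rule can be sketched as a simple filter on allele counts at one position. This is only an illustration of the idea: the slides do not name the variant caller, and the class name, function name and thresholds below are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AlleleCounts:
    """Read counts supporting the reference and alternate allele at a site."""
    ref: int
    alt: int

    @property
    def depth(self):
        return self.ref + self.alt

    @property
    def vaf(self):
        # Variant allele fraction; 0.0 when the site has no coverage.
        return self.alt / self.depth if self.depth else 0.0


def is_putative_somatic(tumour, normal, min_depth=10,
                        min_tumour_vaf=0.10, max_normal_vaf=0.02):
    """Flag a site where the variant is supported in the tumour and
    essentially absent from the matched normal (illustrative thresholds)."""
    if tumour.depth < min_depth or normal.depth < min_depth:
        return False  # too little coverage to call either way
    return tumour.vaf >= min_tumour_vaf and normal.vaf <= max_normal_vaf
```

The same check could be applied to counts taken from the transcriptome alignment to ask whether a genomic variant is also expressed.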
[Figure label: L130F]

Copy Number Variation

Read depth in whole genomes can be used to estimate if the genome is diploid (normal) or if there are amplifications or deletions of genomic regions
[Figure: Copy number loss (0.5x coverage compared to rest of genome/tumour)]
The comparison of tumour and normal samples allows for relative coverage to be used, which greatly reduces the noise of the signal
Our CNV detection method uses Hidden Markov Models to infer the copy number state based on normalized coverage over a region

Somatic copy number variants
[Figure: genome-wide copy number plot with Gain, Loss and LOH tracks; Chr 17 detail with BIRC5, CDK12 and ERBB2 labelled]
Genome-wide view of copy number losses and gains, and loss of heterozygosity in tumour (left)
Focal gains of multiple copies in region containing ERBB2 (above)
This observation, along with a gain-of-function mutation and overexpression, led to a therapeutic recommendation of a targeted inhibitor for this patient
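The HMM-based copy number inference mentioned earlier can be sketched as a three-state model decoded with the Viterbi algorithm over log2 tumour/normal coverage ratios. This is a minimal sketch under stated assumptions, not the GSC method: the number of states, the Gaussian emissions, the state means (with loss set to log2 of 0.5x coverage), the noise level and the transition probabilities are all illustrative choices.

```python
import math

STATES = ("loss", "neutral", "gain")
# Assumed mean log2(tumour/normal) ratio for each state.
MEANS = {"loss": -1.0, "neutral": 0.0, "gain": math.log2(1.5)}
SIGMA = 0.25   # assumed noise of the log2 ratio
STAY = 0.98    # assumed probability of staying in the same state


def log_ratio(tumour_cov, normal_cov, pseudo=1.0):
    """Relative coverage of one bin, with a pseudocount to avoid log(0)."""
    return math.log2((tumour_cov + pseudo) / (normal_cov + pseudo))


def log_emission(x, state):
    """Log density of observing ratio x under a Gaussian for this state."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(observations):
    """Return the most likely state for each bin."""
    if not observations:
        return []
    score = {s: math.log(1 / len(STATES)) + log_emission(observations[0], s)
             for s in STATES}
    backpointers = []
    for x in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_transition(p, s))
            pointers[s] = prev
            new_score[s] = (score[prev] + log_transition(prev, s)
                            + log_emission(x, s))
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Given per-bin coverage lists for tumour and normal, `viterbi([log_ratio(t, n) for t, n in zip(tumour, normal)])` returns one state label per bin. The high self-transition probability is what makes the model prefer contiguous segments over isolated noisy bins.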
De novo sequence assembly using ABySS
Large scale structural rearrangement is more accurately detected by de novo assembly
ABySS was developed at the GSC and uses a distributed representation of a de Bruijn graph to allow for parallel computation
[Figure: Assembly of reads into contig; Align back to reference]
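For readers unfamiliar with de Bruijn graph assembly, the toy example below shows the basic idea on a single machine: reads are broken into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and unbranched paths are merged into contigs. This is a simplified sketch, not ABySS itself; it omits the distributed graph, error correction and reverse complements, and the function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unambiguous_contigs(graph):
    """Merge every maximal unbranched path into one contig string.

    Isolated cycles (no branching or terminal node) are not reported.
    """
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    nodes = set(graph) | set(indegree)

    contigs = []
    for start in sorted(nodes):
        # Skip nodes that sit in the middle of a simple path.
        if indegree[start] == 1 and len(graph.get(start, ())) == 1:
            continue
        for nxt in sorted(graph.get(start, ())):
            contig = start + nxt[-1]
            node = nxt
            # Extend while the path neither branches nor merges.
            while indegree[node] == 1 and len(graph.get(node, ())) == 1:
                node = next(iter(graph[node]))
                contig += node[-1]
            contigs.append(contig)
    return contigs
```

In a real assembler the resulting contigs would then be aligned back to the reference, and contigs that split across distant loci would be examined as candidate rearrangements.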
The assembly pipeline is used to call fusions, insertions and deletions, inversions and internal tandem duplication events in the transcriptome and genome
It is also used to detect integration sites where viruses have integrated into the tumour genome

EML4-ALK fusion identified using assembly
Transcriptome and genome sequencing revealed chr2 inversion fusing EML4 and ALK genes
[Figure: fusion of EML4 exons 1-13 (coiled-coil domain: dimerization) with ALK exons 20-29 (tyrosine kinase domain: kinase activation)]
Patient tested negative for fusion in previous clinical screen
Sequence analysis at chr2 breakpoints identified a further inversion and insertion into chr12 that appears to prevent the Vysis dual-colour break-apart probe from hybridizing
[Figure: FISH probes; chr2 inversion; chr2 inversion & insertion into chr12; chr12]
Response to ALK inhibition (Crizotinib)

EML4-ALK fusion, with high overall expression of ALK together with ROS1 over-expression, was reported to the treating oncologist
TKI Crizotinib was immediately administered
The tumour responded dramatically
[Figure: scans from Sept 4 2013 (before Crizotinib) and Dec 12, 2013; Crizotinib started Sept 25]

Expression analysis
[Figure: reads aligned across Exon 1, Intron 1 and Exon 2]
Reads aligned to genome are ‘repositioned’ across exon-exon junctions
Exon and gene expression is calculated from normalized read depth
As normal RNA is not usually available, expression in tumour is compared both to publicly available normal tissue compendiums and to collections of tumour samples from The Cancer Genome Atlas (TCGA) project to identify outliers

Expression is correlated with tumour cohorts from TCGA using Spearman correlation
[Figure: correlation with TCGA cohorts, including BRCA and PRAD]
This has helped to clarify and confirm diagnosis
Breast cancer subtyping is also often confirmed by correlating with TCGA breast cancer cohorts
In the case above, a vulvar tumour was found to be an extra-mammary breast tumour in part due to the high correlation with breast tumour expression profiles
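The two comparisons described above, ranking a gene's expression within a reference cohort and correlating a whole profile against cohort profiles, can be sketched without external libraries. The function names and the idea of one representative profile per cohort are assumptions for illustration; the slides do not say how cohort profiles are summarized.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def percentile_in_cohort(value, cohort_values):
    """Percentage of cohort samples with lower expression than value."""
    below = sum(v < value for v in cohort_values)
    return 100.0 * below / len(cohort_values)


def best_matching_cohort(tumour_profile, cohort_profiles):
    """Return (cohort name, all scores) for the highest Spearman score.

    Each profile is a list of expression values over the same gene order.
    """
    scores = {name: spearman(tumour_profile, profile)
              for name, profile in cohort_profiles.items()}
    return max(scores, key=scores.get), scores
```

Because Spearman correlation depends only on rank order, it tolerates the differences in scale that arise when one sample is compared with data generated elsewhere.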
Integrating variant data into pathways

[Figure: pathway diagram annotating genes with expression percentiles, fold changes, copy number and mutation status. Pathways shown: Hedgehog, PKC, MAPK, AKT/mTOR, DNA repair, Gene Regulation, Cell Proliferation, Cell Cycle, Chromatin Remodeling and Gene Expression. Legend: Overexpressed; Underexpressed; Activation; Inhibition; Indirect; Tumour Sup.; LoF inactivating mutation; VUS unknown mutation; Drug Target; [%] percentile; [] fold change; FC vs. Adjacent Normal; Percentile vs. NBL; copy number gain and copy number loss (1 star per copy). Drug annotations: Doxorubicin/Epirubicin/Taxane resistance; Gemcitabine resistance; AUY-922]
BCCA Confidential - For Research Purposes Only
Developing the pipeline for variant interpretation
We have successfully built a pipeline which can rapidly produce a molecular profile of a tumour and categorize genomic and transcriptomic aberrations
Interpretation of this complex genomic landscape into clinically relevant information remains challenging and is often limited by availability and curation of publicly available knowledge
As we move from research to clinic, a number of challenges still need to be addressed
POG reports
Two reports are generated for POG
The first is a targeted alignment report which looks for a small list of specific actionable variants
The second report, which is under development, aims to provide a more complete genomic analysis
Both reports are based on a hand-curated knowledge base of literature that we have built over the past several years and will continue to expand
Variant databases
HVDB (SNVs and Indels)
  Size: 1,464 Gb
  Libraries: 2,420
  Records: 4,81,967,204 SNVs; 140,978,724 deletions; 116,506,738 insertions
CNVDB
  Single sample
  Paired CNV calls
SVDB
  Fusions
  Large indels
  Inversions
ExpressionDB
  Size: 252 Gb
  Libraries: 12,016
  Records: 274,158,056 gene expression; 618,411,965 isoform expression
SampleDB
  Biological and technical metadata
  ~ 15,000 samples
Clinical information
Therapeutics and Outcomes
Moving from a relational database solution to a Hadoop solution allows us to scale the databases while still allowing for fast queries
As we continue to sequence more patients and record outcomes, we will be able to use data from previous POG cases to inform future cases
IBM Watson collabora on
We are currently collaborating with IBM to help develop their Watson Genomic Analytics (WGA) platform
WGA performs a molecular profile analysis, followed by a pathway analysis using information from public pathway databases, augmented with interactions discovered by natural language processing from 20 million PubMed abstracts
The output of WGA consists of a list of drug candidates with rationales and scores along with literature evidence that supports their consideration
We will be using our analysis of the first 100 POG patients to test and train the WGA method and compare their reports with our findings
Summary of POG outcomes for first 100 adult patients

Outcome of POG: number of patients, percentage
Sufficient tissue for POG analysis: 78, 78%
Insufficient tissue for POG analysis
  Biopsy content too low for sequencing: 16, 16%
  Unable to biopsy due to specific patient factors: 6, 6%
Informative: 65, 65% total, 83% of sequenced
Actionable: 55, 55% total, 71% of sequenced
Patients received POG-informed treatment: 34, 34% total, 44% of sequenced, 62% when there was something actionable identified
Clinician assessed clinical or radiographic improvement in cancer (including stable disease): 14, 41% (14 of 34)
Actionable target identified but the patient was too unwell or died before POG-informed therapy could be offered: 13, 24% (13 of 55)
Amended or clarified the diagnosis or primary site: 5, 5% of total, 6% of sequenced

Compute resources for POG
Efforts to scale the bioinformatics pipeline for POG have focused on 3 main areas
1) Process automation. We have standardized much of our analysis and most pipelines are run as cron jobs, tracked through a central database
2) Analysis efficiency. We regularly benchmark the tools in our pipeline and look for new tools that run more efficiently while maintaining the accuracy of our current pipeline
3) Overcoming compute bottlenecks. Some processes are CPU bound, some are memory bound and almost all are I/O bound
LIMS Hardware and storage
New cluster with 16 nodes, 512 cores to handle increased sequencer capacity
Node specifications:
• 128 GB RAM
• 16 Processors (32 Cores) Hyper threading
• 10 Gb bandwidth

894 TB sequencer scratch space
50 TB Isilon system for scratch analysis
2 PB Archived storage of raw sequence and alignment data
Cumulative Data Usage for Downstream Analysis
[Figure: cumulative data usage in Terabytes (axis 0 to 600) from 2007 to 2013, annotated "New storage volume acquired", with dated markers: Aug 22, 2007; Apr 23, 2009; Aug 25, 2009; Jan 6, 2010; Mar 24, 2010; Nov 19, 2010; Dec 6, 2011; Apr 23, 2012; May 20, 2012; Feb 20, 2013; July 31, 2013; Oct 16, 2013; Dec 5, 2013]

Total analysis and compute resources at GSC
[Figure: Storage in Petabytes (axis 0 to 9) and Compute in Number of cores (axis 0 to 9000), 2008 to 2014]
Data footprint for personalized oncogenomics project

For each patient, we sequence 3 Genomes (normal, archival, tumour) + 1 Transcriptome (tumour)
Process                  Genome/Transcriptome     Backup               Total storage
Raw sequence data        Genome + Transcriptome   Backed up            3 x g + t = 150GB (zipped)
Initial alignment        Genome + Transcriptome   Backed up            3 x g + t = 450GB
Repositioned alignment   Transcriptome only       Backed up            25GB
Split alignment          Transcriptome only       Temporary            25GB
Assembly                 Genome + Transcriptome   Partially backed up  3 x g + t = 200GB
Trans-ABySS              Genome + Transcriptome   Partially backed up  3 x g + t = 200GB
Microbial detection      Genome + Transcriptome   Temporary            2 x g + t = 100GB (zipped)
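The per-patient scratch figure can be reproduced by summing the rows above, and the project-level figures follow by multiplying by the number of patients. The short check below assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and reads the slide's "Tb" as terabytes; the variable and function names are illustrative.

```python
# Per-patient storage per process, in GB, as listed in the table.
FOOTPRINT_GB = {
    "Raw sequence data": 150,
    "Initial alignment": 450,
    "Repositioned alignment": 25,
    "Split alignment": 25,
    "Assembly": 200,
    "Trans-ABySS": 200,
    "Microbial detection": 100,
}


def scratch_total_tb(footprint_gb):
    """Total scratch space for one patient, in TB (1 TB = 1000 GB)."""
    return sum(footprint_gb.values()) / 1000


def project_total_pb(per_patient_tb, num_patients):
    """Storage for a whole cohort, in PB (1 PB = 1000 TB)."""
    return per_patient_tb * num_patients / 1000


per_patient = scratch_total_tb(FOOTPRINT_GB)          # 1.15 TB
scratch_3000 = project_total_pb(per_patient, 3000)    # 3.45 PB, quoted as 3.5 PB
cumulative_3000 = project_total_pb(0.8, 3000)         # 2.4 PB
```

The 0.8 TB cumulative figure is taken from the slide as given: the fully backed-up rows sum to 625 GB, and the remainder presumably comes from the partially backed-up assembly outputs, whose retained fraction is not stated.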
1 patient = 1.15Tb scratch / 0.8Tb cumulative storage
3000 patients = 3.5 PB scratch and 2.4 PB cumulative storage

Acknowledgements
BC Cancer Agency Genome Sciences Centre
Dr. Janessa Laskin Marco Marra Eric Chuah Steven Jones Dr. Francois Benard Dr. Youwen Zhou Dean Cheng
Dr. Vanessa Bernstein Dr. Diego Villa Nina Thiessen Erin Pleasance Dr. Kim Chi Dr. David Huntsman Carolyn Ch’ng Martin Jones Dr. Rebecca Cosse Dr. David Schaeffer An He Dr. Stephan Chia Yaoqing Shen Dr. Tony Ng Johnson Pang Dr. Randy Gascoyne Katayoon Kasaian Dr. Stephen Yip Tina Wong Dr. Karen Gelmon Sreeja Leelakumari Dr. Malcolm Hayes William Long Dr. Sharlene Gill Yvonne Li Dr. Kathy Ceballos Joseph Juhn Dr. Cheryl Ho Pinaki Bose Dr. Hagan Kenneke Dr. Anthony Karnezis Jenny Qian
Dr. Mohamed Khan Dr. Aly Karsan Kelsey Zhu Jacquie Schein Dr. Christian Kollmannsberger Dr. Gillian Mitchell Karen Mungall Peggy Tsang Dr. Meg Knowling Dr. Intan Schrader Dustin Bleile Rebecca Carlsen Dr. Howard Lim Dr. Dan Renouf Alex Hammel Dr. Caroline Lohrisch Dr. Kerry Savage Armelle Troussard Richard Mar Dr. Monty Martin Dr. Tamara Shenkier Yongjun Zhao Melika Bonakdar Dr. Torsten Nielsen Dr. Christine Simmons Pawan Pandoh Caralyn Reisle Dr. Sophie Sun Sam Aparicio Simon Haile Merhu Richard Corbett Dr. Anna Tinker Hector Li Chang Helen McDonald Greg Taylor Dr. Sheridan Wilson Balvir Deol Richard Moore Simon Chan
Peter Eirew Mike Mayo Young Song BC Children’s Hospital Julie Ho Andy Mungall Nisa Dar Farzad Jamshidi Angela Tam Dr. Alice Virani Kane Tse Robyn Roscoe Dr. Rod Rassekh Julie Lorette Dr. Rebecca Deyell Amy Lum Alexandra Fok IBM Dr. Anna Lee Cydney Nielsen Payal Sipahimalani Dr. Christopher Dunham Tomo Osako Ajay Royyuru Dr. Caron Strahlendorf Sohrab Shah Lance Bailey Takahiko Koyama Dr. Paul Rogers Roland Santos Colleen Jantzen Brandon Sheffield Zeev Waks Colleen Fitzgerald Kulbir Multani Boaz Carmelli
And to all of our patients