DREAM Challenges: Advancing our understanding of human disease through data-centric competitions
Michael Kellen, PhD, Director, Technology Platforms and Services, Sage Bionetworks

Crowd-sourcing in History / Crowd-sourcing Today

DREAM: What is it? A crowdsourcing effort that poses questions (Challenges) about systems biology modeling and data analysis:

– Transcriptional networks
– Signaling networks
– Prediction of response to perturbations
– Translational research

DREAM: Dialogue for Reverse Engineering Assessment and Methods

DREAM: Structure of a Challenge
[Diagram: the challenge improvement loop — crowd-sourcing: Data and Measurements → Predictions → Ground Truth → Unbiased Evaluation → Acceleration of Research → Collaboration]

Incentives for participation

• Partnerships with journal editors – “Challenge Assisted Peer Review”

• Challenge webinars for live interaction between participants and organizers

• Community forums where participants can learn from each other

• Leaderboard to motivate continuous participation

• Annual DREAM Conference to celebrate and discuss Challenge outcomes

DREAM 7: published April 2013; post-analysis and paper-writing phase

DREAM8.5 Challenges Open for Participation

• Predict cancer-associated mutations from whole-genome sequencing data • Opened Nov 8 • 172 registered participants

• Predict which patients will not respond to anti-TNF therapy • Opens Feb 10 • 208 registered participants

• Predict early AD-related cognitive decline, and the mismatch between high amyloid levels and cognitive decline • Dry run phase: opening in March • 161 registered participants

DREAM 9 Challenges opening in May-June 2014

• Broad Gene Essentiality Challenge – Data set: 500 cell lines with molecular characterization data (from CCLE) and gene essentiality data (from Achilles RNAi screens). – Challenge structure: participants train predictive models of gene essentiality on the training data, then use molecular information from the test data to predict gene essentiality scores, which are compared against the held-out dataset.
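The train/predict/score-against-held-out-data structure described above can be sketched in a few lines. Everything here is illustrative: the random matrices stand in for CCLE molecular features and Achilles essentiality scores, and ridge regression plus Pearson correlation are assumptions, not the Challenge's actual model or metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-ins for the Challenge data: rows are cell lines,
# columns are molecular features; targets are essentiality scores.
n_train, n_test, n_features = 400, 100, 50
X_train = rng.normal(size=(n_train, n_features))
w_true = rng.normal(size=n_features)
y_train = X_train @ w_true + rng.normal(scale=0.1, size=n_train)

X_test = rng.normal(size=(n_test, n_features))          # held-out cell lines
y_test = X_test @ w_true + rng.normal(scale=0.1, size=n_test)

# Train a predictive model (ridge regression, closed form):
# w = (X'X + lambda*I)^-1 X'y
lam = 1.0
w_hat = np.linalg.solve(X_train.T @ X_train + lam * np.eye(n_features),
                        X_train.T @ y_train)

# Predict essentiality scores for the test cell lines and compare
# against the held-out truth (here by Pearson correlation).
y_pred = X_test @ w_hat
r = np.corrcoef(y_pred, y_test)[0, 1]
print(f"correlation with held-out essentiality scores: {r:.3f}")
```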

• DREAM AML Treatment Outcomes Challenge – Data set: RPPA data on 231 antibodies with correlated patient demographic and outcome data – Potential Challenge objectives: • Predict AML patient overall survival and remission duration • Predict which patients respond to therapy (CR), which will then relapse, and which are primarily resistant to therapy.

DREAM 9.5 Challenges opening in late 2014

• Three potential imaging Challenges – Colorectal histopathology

– Melanoma

– Brain Imaging

The Sage Bionetworks/DREAM Breast Cancer Prognosis Challenge Goal: use crowdsourcing to forge a computational model that accurately predicts breast cancer survival

Training data set: genomic and clinical data from 2000 women diagnosed with breast cancer (Metabric data set)

Data access and analysis tools: Synapse

Compute resources: each participant provided with a standardized virtual machine donated by Google

Model scoring: models submitted to Synapse for scoring on a real-time leaderboard

Unique Attributes

Open source and code-sharing: – The computational infrastructure enables participants to use code submitted by others in their own model building – Winning code must be reproducible

Brand new dataset for final validation of winning model: – Derived from approx. 200 breast cancer samples – Data generation funded by Avon – Winning model: the one that, having been trained using Metabric data, is most accurate for survival prediction when applied to a brand new dataset
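The deck doesn't state the accuracy metric used to compare survival predictions against the validation data. Survival-prediction challenges of this kind are commonly scored with the concordance index; the minimal O(n²) sketch below is an illustration of that metric, not the Challenge's actual scoring code.

```python
from itertools import combinations

def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs whose predicted risk ordering
    matches the observed survival ordering. A pair is comparable when the
    patient with the shorter follow-up time actually had an event."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i            # orient so i has the shorter time
        if times[i] == times[j] or not events[i]:
            continue               # ties or censored-first pairs are skipped
        comparable += 1
        if risk_scores[i] > risk_scores[j]:
            concordant += 1.0      # higher risk predicted for shorter survivor
        elif risk_scores[i] == risk_scores[j]:
            concordant += 0.5      # tied predictions count half
    return concordant / comparable

# Toy data: times in months, event=1 means death observed,
# higher score means higher predicted risk.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
scores = [0.9, 0.7, 0.4, 0.2]
c_index = concordance_index(times, events, scores)
print(c_index)  # perfectly concordant ordering -> 1.0
```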

Challenge-assisted peer review – Overall winner can submit a pre-accepted article about his/her winning model to Science Translational Medicine

Sage / DREAM Breast Cancer Challenge Timeline

Synapse: A platform for collaborative data science
[Screenshots: the winner's Synapse project, links to prior work, the winner's formal publication, and connections to the research community]

Breast Cancer Challenge: Key Outcomes
• Winning approach leveraged prior data in unexpected ways to gain predictive power
• Improvement in survival predictability over standard clinical diagnostics
• Winning team well outside mainstream of field
• Challenge visibility provides mechanism to open data and algorithms
• Path to greater clinical impact will require prospective data generation

How DREAM Challenge Recognition Can Help Participants

Andre Falcao: Professor Andre Falcao was a participant in the recently completed DREAM8 NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge. He raised valid criticisms of the scoring metrics used for a portion of the Challenge. Andre has now taken a leadership role in planning the current DREAM8.5 Rheumatoid Arthritis Responder Challenge, showing how DREAMers can transition from participants to organizers.

Alex Williams: Alex is a research technician at Brandeis University and a winner of the DREAM8 Whole Cell Parameter Estimation Challenge. Professor Markus Covert of Stanford, who co-sponsored this Challenge, was so impressed with Alex's solutions that he has written Alex a recommendation for graduate school in the fall of 2014.

Wei-yi Cheng: Wei-yi was a graduate research assistant when he helped team Attractor Metagenes win the DREAM7 Breast Cancer Prognosis Challenge (BCC). Since winning the BCC, Wei-yi has been recruited to join Eric Schadt at the Mount Sinai School of Medicine (MSSM) Institute for Genomics and Multiscale Biology as a research scientist.

Survey of the field - 2012

• Only algorithms with freely available software implementations – Install all locally on our cluster – Create a protocol for debugging issues – Set a drop-dead acceptance date

• Default or near-default parameterization

• Comparison to experimental gold-standards

Overall Strategy / Tools Being Evaluated / Why Do We Need This Challenge?

[Figure: tools evaluated, grouped into SNV callers and SV callers]

Singer Ma (UCSC)

What did we learn from our survey?

• Bioinformatics software is poor – only 5/9 and 12/16 top tools were even able to run

• Inter-Tool Variability is immense

• Filtering is critical, but ill-defined

• Surveys are too slow & expensive: – Two analysts – Two postdocs – Two years

Introducing the ICGC-TCGA DREAM Somatic Mutation Calling Challenge!

• The Challenge: – Identify Somatic Single Nucleotide Variants (SNVs) in human tumours – Identify Structural Variants (SVs) in human tumours

SMC Challenge website: https://www.synapse.org/#!Challenges:DREAM

Data for Somatic Mutation Calling Challenge

in silico Data: 5 Synthetic Tumour/Normal Pairs
• One released each month
• Of increasing complexity
• No ICGC data-access needed
• Data available immediately
• Incentives for top-performing teams may include free cloud-computing credits

Real Human Data: 10 Real Tumour/Normal Pairs
• Released November 2013
• 5 Prostate Cancers
• 5 Pancreatic Cancers
• ICGC data-access needed
• Several thousand candidates will be validated using independent techniques

Challenge Structure

Challenge 1: Human Tumour Data
• Challenge 1A (SNVs) and Challenge 1B (SVs)
• Scored by balanced accuracy across all 10 tumour/normal pairs

Challenge 2: Simulated Human Tumour Data
• Challenge 2A (SNVs): sub-challenges 2A-1 to 2A-5, one per tumour (Tumours 1-5)
• Challenge 2B (SVs): sub-challenges 2B-1 to 2B-5, one per tumour (Tumours 1-5)

How will the Challenge be scored?

Challenge 1: tumour data (10 Real Tumour/Normal Pairs)
• Several thousand candidates (up to 10k) will be validated
• Validation will include (at least) re-sequencing to ~300x coverage using AmpliSeq primers on an IonTorrent

Challenge 2: in silico data (5 Synthetic Tumour/Normal Pairs)
• A complete ground-truth is known for each dataset
• We will calculate sensitivity, specificity and balanced accuracy for each genome on a held-out piece of the genome
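On the in silico data these three quantities can be computed directly against the known ground truth. A minimal sketch, treating call sets as sets of variant positions; the toy numbers are invented, and the real Challenge pipeline is of course more involved (e.g. it evaluates held-out genome regions):

```python
def score_calls(called, truth, all_sites):
    """Sensitivity, specificity and balanced accuracy of a variant call
    set against a known ground truth. `called` and `truth` are sets of
    variant positions; `all_sites` is every evaluated position."""
    tp = len(called & truth)            # true variants that were called
    fn = len(truth - called)            # true variants that were missed
    fp = len(called - truth)            # false calls
    tn = len(all_sites - called - truth)  # correctly left uncalled
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

truth = {10, 20, 30, 40}
called = {10, 20, 50}               # misses 30 and 40; false call at 50
all_sites = set(range(0, 100, 10))  # 10 evaluated positions
sens, spec, bacc = score_calls(called, truth, all_sites)
print(sens, spec, bacc)  # 0.5, ~0.833, ~0.667
```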

Updated Challenge Timeline

Timeline: Competition: Nov 2013 to July 2014 · Validation: July to Sept 2014 · Winner announced: Nov 2014

in silico dataset releases: #1: Feb 15 · #2: Mar 15 · #3: Apr 15 · #4: Apr 15 · #5: May 15

Challenge Updates: Synthetic #1

• We are pleased to announce that our partnership with Google has officially launched! • Our leaderboard is live for in silico dataset #1 (challenges 2A-1 and 2B-1)

Cloud Computing in Challenges

Challenge Outcomes

• Identification of best methods for predicting somatic SNVs
• Identification of best methods for predicting somatic SVs
• Creation of a community focused on rapid algorithm development and benchmarking for cancer NGS
• Comparison of benchmarking on simulated vs. real data
• Creation of a gold-standard for NGS method development
• Assessment of techniques for pan-cancer studies
• Challenge-assisted peer review in collaboration with NPG
• Best methods will be applied to thousands of genomes at CGHub!

Next Generation Sage Bionetworks Challenges: what will they look like?

• Disease Communities/Groups that have contacted us to run a Challenge: GBM-NBTS, Colon, CHDI, NCI (pan-cancer), Broad, NIEHS, Alzheimer's-NIA

Next generation Sage Bionetworks Challenges: Opportunities for running an open Breast Cancer Challenge

Focus of initial Challenge: proving a Challenge can be done with clinical data and in an open way

Focus of second Challenge: proving a Challenge can answer an important clinical question rapidly and affordably

Strategy: let the question, not the convenience of data, drive the Challenge

Approach: form an Advisory Group of breast cancer thought leaders

The Second Sage/DREAM Breast Cancer Challenge
Co-leaders: Stephen Friend and Dan Hayes

Scientific Advisory Board:
Fabrice Andre – Inst. Gustave Roussy
Jose Baselga – MSKCC
John Bartlett – OICR
Mitch Dowsett – Royal Marsden
Daniel Hayes – University of Michigan
Larry Norton – MSKCC
Lisa McShane – NCI
Martine Piccart – Universite Libre de Bruxelles

1) Determine the best clinical question regarding the treatment of breast cancer that can be developed using existing datasets
2) Determine the best clinical question regarding the treatment of breast cancer that can be developed unconstrained by existing datasets

The Second Sage/DREAM Breast Cancer Challenge

One or more case-control studies to identify patients with, or without, residual risk, to better guide enrollment into future clinical trials. The case-control studies could be broken into categories based on ER, HER2, or neither:

a. ER pos:

i. Those who got ET plus chemo: this is an important group. If we can identify those who relapse anyway (vs. those who don't), we could focus future trials on the former.

ii. Those who got ET only (as in TAILORx, plus B20, B14, 8814): can we build a better Oncotype DX?

The Second Sage/DREAM Breast Cancer Challenge


b. HER2 Pos (amplified or 3+).

i. Those who got only chemo: is there a group that does not NEED Herceptin?
ii. Those who got Herceptin: this is the key group – who's cured, who's not? Focus future anti-HER2 trials on the latter.

c. ER, PgR, HER2 neg.
i. Those who got "standard" chemo. There is a large group that is cured with standard chemo. Why enroll such patients in future trials? Focus future trials only on those who are likely to recur.

Acknowledgements

. Sage Bionetworks: Stephen Friend, Adam Margolin, Mike Kellen, Thea Norman
. DREAM: Gustavo Stolovitzky (IBM), Erhan Bilal (IBM), Jim Costello (BU), Julio Saez Rodriguez (EBI), Michael Menden (EBI), Thomas Cokelaer (EBI)
. Columbia University: Andrea Califano, Mukesh Bansal, Chuck Karan
. Challenge organizers – dozens of people across the world
. NCI: Dinah Singer, Dan Gallahan
. All DREAMers who participated in the challenges: more than 40 different countries and 100 institutions