CANDLE and for surveillance

John Gounley Computational Sciences and Engineering

ECP Annual Meeting 6 February 2020

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

CANDLE Background

• DOE-NCI Partnership
  – Joint Design of Advanced Computing Solutions for Cancer (JDACS4C)
  – Cancer Moonshot & National Strategic Computing Initiative
  – ANL, LLNL, LANL, ORNL, and Frederick National Laboratory for Cancer Research

• Accelerate precision oncology capabilities
  – 3 application-focused Pilots
  – 2 cross-cutting initiatives
    • Uncertainty quantification
    • Scalable deep learning (CANDLE)

[Figure: NCI (National Cancer Institute) and DOE (Department of Energy); computing driving cancer advances, cancer driving computing advances]

CANDLE team

• Argonne
  – Tom Brettin
  – Nick Collier
  – Rajeev Jain
  – Jonathan Ozik
  – Arvind Ramanathan
  – Rick Stevens
  – Richard Turgeon
  – Justin Wozniak
  – Fangfang Xia
  – Harry Yoo
  – …
• Frederick
  – Andrew Weisman
  – George Zaki
  – …
• Livermore
  – Brian van Essen
  – Sam Ade Jacobs
  – Adam Moody
  – …
• Los Alamos
  – Christina Garcia-Cardona
  – Jamal Mohd Yusof
  – …
• Oak Ridge
  – Shang Gao
  – John Gounley
  – Hong-Jun Yoon
  – Todd Young
  – …

CANDLE project goals

• Provide API for deep learning on DOE and NCI supercomputers
  – CANDLE Library
  – Facilitate dependencies, IO, UQ, visualization, profiling
• Support general set of deep learning workflows
  – CANDLE Supervisor
  – Portable across DOE computing platforms
• Produce example proxy application models for vendors and users
  – CANDLE Benchmarks
  – Support Pilot-based deep learning models

JDACS4C Pilots

Pilot 1, PI: Rick Stevens (ANL)
Cellular scale modeling
• Patient-derived xenograft and cell lines
• Prediction of drug response
• Enable pre-clinical screening

Pilot 2, PI: Fred Streitz (LLNL)
Molecular scale modeling
• Predictive models of RAS pathway
• Identify novel drug targets for most aggressive cancer types

Pilot 3, PI: Gina Tourassi (ORNL)
Population scale modeling
• Predictive models of patient trajectories
• Information extraction and text comprehension

Pilot 3 team

• ORNL – Teresa Hankins – Valentina Petkov • LLNL – Georgia Tourassi – Shamimul Hasan – Gisele Sarosy – Priyadip Ray – Blair Christian – Jacob Hinkle – Steve Friedman – Braden Soper – Joe Lake – Michael Johnson – Marina Matatova – Hiranmayi Ranganthan – Dallas Sacca – Samantha Marchese – Donna Rivera – Andre Gonvalves – Greshma Agasthya – Alina Peluso – Benmei Liu – Quentin Vaughn – Mohammed Alawad – Noah Schaefferkoetter – Li Zhu • LANL – Folami Alamudun – Sean Taylor • Frederick – Tanmoy Battacharya – Devanshu Argawal – Ryan Tipton – Sylkk Ansah – Jamal Mohd-Yusof – James Council – Hong-Jun Yoon – Vanaja Mukherjee – Nick Hengarther – Ioana Danciu – Todd Young – Deb Hope – Sunil Thulasidasan – Dennis Day • NCI – Mita Myneni – Kumkum Ganguly – Abhishek Dubey – Lynne Penberthy – Ben McMahon – Samantha Erwin • ANL – Jessica Boten – Sarah Michalak – Shang Gao – Tom Brettin – Gielle Kuhn – Judith Humphrey – John Gounley – Peter Slawniak – Serban Negoita

Pilot 3: Cancer Surveillance

Overarching Goal: Improve the effectiveness of cancer treatment in the “real world” through computing

NCI Surveillance Research Program

• Surveillance, Epidemiology, and End Results (SEER) program
  – Comprehensive, relevant information to support research activities
  – Population-level data to complement & supplement clinical trials
  – Clinically relevant models to support decisions (‘precision medicine’)

Current partner registries:
• Louisiana
• Kentucky
• New Jersey
• Utah

Pipeline:
• California
• Seattle
• New Mexico
• …

Information extraction from cancer pathology reports

Data flow: Patient → Pathologist → Tumor registrar → Integration with EMR → NCI SEER

Extract cancer phenotypes from the cancer pathology report:
• Primary tumor site?
• Grade of the cancer?
• Tumor cell histology?
• Laterality of tumor within organ?
• …

Pilot 3: development & deployment of NLP models

NLP Model Search
• Supervised
  – Multi-task learning
  – Convolutional neural networks
  – Self-attention neural networks
  – Graph convolution networks
• Semi-supervised
  – Generation: recurrent neural networks and Transformers
• Unsupervised
  – Word embeddings: Word2Vec, CUI2Vec, and FastText

Development & deployment
• Scalability across pathology labs, SEER registries, and cancer phenotypes
• Increasing robustness to label quality
• O(10000) cancer phenotypes
• Temporal multi-document representation

API, data setup, CORAL-2

Cancer phenotypes
• Site
• Histology
• Grade
• Behavior
• Laterality
• …

How is this project not a HIPAA violation?

• Knowledge Discovery Infrastructure (KDI) at ORNL
  – Accredited protected health information (PHI) enclave
  – Partnered with VA and CMS
  – Compute resources for small development work
• Synthetic data
  – Small dataset for testing (10000 reports)
  – Generated from de-identified data
• Getting real data on Summit
  – Tokenize pathology reports within enclave
  – Egress tokenized data (but not key) to Summit’s file system
  – Sufficient for training and testing at scale
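The "tokenize within enclave, egress without the key" step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `build_key` and `tokenize`, the whitespace tokenizer, the shuffled ID assignment, and the two made-up report strings are hypothetical and do not describe the actual KDI pipeline.

```python
import random

def build_key(reports, seed=0):
    """Build a word -> integer ID mapping (the "key").

    Words are shuffled before IDs are assigned so that the IDs do not
    reveal alphabetical order. ID 0 is reserved for unknown words.
    """
    words = sorted({w for text in reports for w in text.lower().split()})
    random.Random(seed).shuffle(words)
    return {word: idx for idx, word in enumerate(words, start=1)}

def tokenize(text, key):
    """Replace every word with its integer ID (0 if the word is not in the key)."""
    return [key.get(word, 0) for word in text.lower().split()]

# Inside the enclave: raw text and the key are both available.
reports = [
    "Invasive ductal carcinoma left breast",
    "Adenocarcinoma right upper lobe lung",
]
key = build_key(reports)                          # never leaves the enclave
token_ids = [tokenize(r, key) for r in reports]   # integer sequences that are egressed
```

Only `token_ids` would be moved to the external file system in this sketch; a model can be trained on the integer sequences, while mapping them back to words requires `key`.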

Multi-task convolutional neural networks (CNNs)

Visualizing the CNN

Hierarchical self-attention networks (HiSANs)

S. Gao, et al.

… a total of 70 possible labels for site, 7 for laterality, 4 for behavior, 516 for histology, and 9 for grade; the label descriptions and the number of occurrences per label are available in our supplementary information (SI) section A, and more detailed information can be found in the SEER coding manual (https://seer.cancer.gov/tools/codingmanuals/).

To simulate a real production environment in which a classifier trained on older existing reports must predict labels for new incoming reports, we split our dataset into train, validation, and test sets based off date. Because the same tumor ID may have multiple pathology reports associated with it over time, we designed our splitting to prevent reports from the same tumor ID being split between the train, validation, and test sets. Therefore, we first grouped all pathology reports by tumor ID; each pathology report belonging to the same tumor ID is assigned the date of the earliest report associated with that tumor ID – a report written in 2017 may be assigned a 2012 date if it belongs with a tumor ID with a report first written in 2012. We isolated all reports from 2016 and later into our test set; from the remaining reports from 2004 to 2015, we randomly selected 80% for our train set and used the other 20% for our validation set (ensuring reports from the same tumor ID were not split between both train and validation sets). This results in a train set of 236,519 reports, a validation set of 59,241 reports, and a test set of 78,856 reports.

Model   Site            Laterality      Behavior
        Acc.    Macro   Acc.    Macro   Acc.    Macro
CNN     89.44   56.44   89.05   47.89   96.44   74.99
HiSAN   90.37   63.36   89.35   49.99   96.71   84.02

Model   Histology       Grade
        Acc.    Macro   Acc.    Macro
CNN     75.39   23.50   69.97   68.72
HiSAN   76.22   30.23   71.59   74.30
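The grouped, date-based split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the record schema (`tumor_id`, `year`), the function name `split_by_tumor_and_date`, and the choice to draw the 80/20 split over tumor IDs rather than over individual reports are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_tumor_and_date(reports, test_year=2016, train_frac=0.8, seed=0):
    """Split reports into train/validation/test without splitting a tumor ID.

    reports: list of dicts with 'tumor_id' and 'year' keys (hypothetical schema).
    Every report inherits the year of the earliest report for its tumor ID.
    """
    groups = defaultdict(list)
    for report in reports:
        groups[report["tumor_id"]].append(report)

    test_groups, older_groups = [], []
    for tumor_id in sorted(groups):
        group = groups[tumor_id]
        earliest_year = min(r["year"] for r in group)
        if earliest_year >= test_year:
            test_groups.append(group)
        else:
            older_groups.append(group)

    # Random 80/20 split over whole tumor-ID groups (an assumption).
    random.Random(seed).shuffle(older_groups)
    n_train = int(round(train_frac * len(older_groups)))

    def flatten(list_of_groups):
        return [r for g in list_of_groups for r in g]

    return (flatten(older_groups[:n_train]),
            flatten(older_groups[n_train:]),
            flatten(test_groups))

reports = [
    {"tumor_id": "A", "year": 2012}, {"tumor_id": "A", "year": 2017},
    {"tumor_id": "B", "year": 2016}, {"tumor_id": "C", "year": 2010},
    {"tumor_id": "D", "year": 2014}, {"tumor_id": "E", "year": 2008},
    {"tumor_id": "F", "year": 2015}, {"tumor_id": "G", "year": 2005},
]
train, valid, test = split_by_tumor_and_date(reports)
```

In this toy input the 2017 report for tumor "A" stays out of the test set because its tumor ID first appears in 2012, which mirrors the example given in the text.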

3.1.1. Data cleaning

Each raw pathology report was provided to us in XML format that included both metadata and text fields. For each report, we discarded the metadata fields, such as patient ID and registry ID, and retained all text fields, such as clinical history and formal diagnosis. We then lowercased all words, converted any unicode characters into their corresponding ascii characters, and removed any consecutive punctuation (e.g., multiple periods following one another were replaced by a single period). Any unique words appearing fewer than five times across the entire corpus were replaced with an “unknown_word” token.

We applied several text modification and replacement steps to standardize the pathology reports. These include standardizing all clock-time references, which are used to identify cancer site in cancers such as breast cancer, into the format of a number 1–12 followed by the string “oclock”. In addition, to reduce the vocabulary space, we converted all decimals into a “decimal” word token and all integers larger than 100 into a “large_integer” word token. Additional minor text modification and replacement steps are listed in SI section B. After cleaning, the average pathology report had a length of 633 tokens.

Like the HAN, the HiSAN first breaks a long document down into smaller linguistic segments, such as individual sentences. Generally speaking, pathology reports do not naturally break down into sentences; instead, most pathology reports list relevant facts and information line by line. We therefore split each document into smaller segments based off the natural linebreaks occurring in each pathology report. Unfortunately, not all pathology reports use line breaks to separate information – in some cases, facts are presented in long paragraphs of natural language or are separated by symbols such as ‘#’ or ‘<’. Consequently, after splitting pathology reports by linebreaks, if any line is longer than 50 words, we further split it based off a curated list of punctuation and symbols based off our observations of the characters used to itemize lists within our corpus; these symbols are provided in SI section B. After splitting into lines, the average pathology report had 70 lines with an average of 8.5 tokens per line.

3.2. Hierarchical self-attention networks

The architecture of the HiSAN is similar to that of the hierarchical convolutional attention network (HCAN) [36], which is an architecture we previously developed for sentiment analysis and general text classification tasks. In our experiments, we found that several components used in the HCAN which improved performance on general text such as Yelp and Amazon reviews instead reduced performance when applied to clinical text classification – a detailed ablation study is available in SI section G. Therefore, we propose the new HiSAN architecture, which is better suited for classification of cancer pathology reports.

The structure of our HiSAN is shown in Fig. 1. Each component of the HiSAN is discussed in greater detail in the following subsections.

[Fig. 1. Architecture for our hierarchical self-attention network (HiSAN).]

3.2.1. Self-attention

A self-attention mechanism compares a sequence of embeddings against itself to find relationships between the entries in the sequence. Given a sequence of embeddings $E \in \mathbb{R}^{l \times d}$, where $l$ is the length of the sequence and $d$ is the embedding dimension, a basic self-attention mechanism generates a new sequence $S \in \mathbb{R}^{l \times d}$ in which each entry $s_i$ is a weighted average of all entries $e_i$ in the original sequence. Intuitively speaking, each new entry $s_i$ should have captured within it the most pertinent information to that entry from all entries $e_i$ in the original sequence:

$$\text{Self-Attention}(E) = \mathrm{softmax}(E E^{\top})\, E \qquad (1)$$

To improve upon this basic self-attention, rather than directly compare $E$ against itself, we use functions to extract three different sets of features from $E$: (1) $Q$ and (2) $K$, which are features that help find important relationships between entries in the sequence, and (3) $V$, which are the features that are used to generate the new output sequence. This allows for more expressive comparison between entries in a sequence. Certain features are highly useful when finding word relationships, such as identifying how biomedical terms correspond to each other in a pathology report, and these are captured in $Q$ and $K$. On the other hand, certain features are more useful for the final classification task being targeted, and these are captured in $V$. We use three position-wise feedforward operations on the same input sequence $E \in \mathbb{R}^{l \times d}$ to generate $Q \in \mathbb{R}^{l \times d}$, $K \in \mathbb{R}^{l \times d}$, and $V \in \mathbb{R}^{l \times d}$. Our position-wise feedforward operation is equivalent to a 1D convolution operation with a window size of one word:
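The cleaning and line-segmentation steps of Section 3.1.1 can be sketched in a few functions. This is a rough illustration only: the regular expressions, the single-token form "2oclock", and the `SPLIT_SYMBOLS` set are assumptions (the curated symbol list and the additional replacement steps are in SI section B, which is not reproduced here), and the function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical stand-in for the curated symbol list in SI section B.
SPLIT_SYMBOLS = r"[#<;]"

def clean_text(text):
    """Lowercase, map unicode to ascii, and apply the replacements from 3.1.1."""
    text = text.lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Collapse runs of the same punctuation mark ("..." -> ".").
    text = re.sub(r"([^\w\s])\1+", r"\1", text)
    # Clock-time references become a number 1-12 followed by "oclock".
    text = re.sub(r"\b(1[0-2]|[1-9])\s*o'?clock\b", r"\1oclock", text)
    # Decimals and integers larger than 100 become placeholder tokens.
    text = re.sub(r"\b\d+\.\d+\b", "decimal", text)
    text = re.sub(r"\b\d+\b",
                  lambda m: "large_integer" if int(m.group()) > 100 else m.group(),
                  text)
    return text

def replace_rare_words(token_lists, min_count=5):
    """Replace words seen fewer than min_count times in the corpus."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [[t if counts[t] >= min_count else "unknown_word" for t in toks]
            for toks in token_lists]

def split_into_lines(text, max_words=50):
    """Split on linebreaks, then split overlong lines on itemizing symbols."""
    lines = []
    for line in text.splitlines():
        if len(line.split()) > max_words:
            pieces = re.split(SPLIT_SYMBOLS, line)
        else:
            pieces = [line]
        lines.extend(p.strip() for p in pieces if p.strip())
    return lines

raw = "Tumor at 2 o'clock... size 2.5 cm\nSpecimen 12345 café"
cleaned = clean_text(raw)
segments = split_into_lines(cleaned)
```

Cleaning runs before segmentation here so that linebreaks survive into `split_into_lines`; the source text does not state the order, so that choice is also an assumption.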

Determine relationships between words
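As a concrete illustration of how self-attention relates the words in a sequence, the sketch below implements Eq. (1), softmax(E Eᵀ) E, in plain Python, together with a Q/K/V variant. The Q/K/V function is an assumption: it uses bare linear maps `Wq`, `Wk`, `Wv` applied position by position and omits any bias, activation, or scaling, since the excerpt does not give the exact form of the position-wise feedforward operation. All function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def basic_self_attention(E):
    """Eq. (1): each output row is a weighted average of the rows of E."""
    weights = [softmax(row) for row in matmul(E, transpose(E))]
    return matmul(weights, E)

def qkv_self_attention(E, Wq, Wk, Wv):
    """Variant with separate Q, K, V features (linear maps only, an assumption)."""
    Q, K, V = matmul(E, Wq), matmul(E, Wk), matmul(E, Wv)
    weights = [softmax(row) for row in matmul(Q, transpose(K))]
    return matmul(weights, V)

# Three "words" with embedding dimension d = 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = basic_self_attention(E)

# With identity projections the Q/K/V variant reduces to Eq. (1).
identity = [[1.0, 0.0], [0.0, 1.0]]
S_qkv = qkv_self_attention(E, identity, identity, identity)
```

The output `S` has the same shape as `E` (l × d), and because each row of attention weights sums to one, every entry of `S` lies between the smallest and largest value in the corresponding column of `E`.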

Ongoing threads

• Semi-supervised learning
• BERT & Transformers
• Uncertainty quantification & abstention (with LANL)

Modeling & simulation for patient trajectories

Data sources: SEER diagnostic data; treatment claims data; treatment pharmacy data; surgery/Rad Rx data; SEER outcome data

• 49 YO, breast, Stage IA ductal, ER+/HER2-, Oncotype Dx Score: 36
  – Docetaxel, Cyclophosphamide (10-11/2015)
  – Anastrozole, 1 prescription (4/18)
  – Lumpectomy (7/15), beam radiation
  – Vital status: alive, 4/18
• 70 YO, breast, Stage IA mammary, HR(ER/PR)+/HER2+
  – Trastuzumab (3/15-3/16)
  – Docetaxel/Carboplatin (3/15-3/16)
  – Letrozole (10/15-present, 4/18)
  – Lumpectomy (1/15), beam radiation
  – Vital status: alive, 5/18
• 83 YO F, lung, Stage IIB adeno, EGFR + Exon19, ALK -
  – Gefitinib (11/2016-1/2017), Erlotinib (2/2017)
  – No surgery, no radiation
  – No systemic chemo
  – Vital status: dead, 6/17
• 23 YO M, melanoma, Stage IIIC melanoma, Stage III, BRAF V600E/V600K mutation +, groin mets 10/16
  – Dabrafenib/Trametinib (11/16-present)
  – Biopsy/wide excision (9/15)
  – No systemic chemo
  – Vital status: alive, 2/18

Selected references

• S. Gao, J.X. Qiu, M. Alawad, J.D. Hinkle, N. Schaefferkoetter, H.-J. Yoon, B. Christian, P.A. Fearn, L. Penberthy, X.-C. Wu, L. Coyle, G. Tourassi, A. Ramanathan. “Classifying cancer pathology reports with hierarchical self-attention networks.” Artificial Intelligence in Medicine, Vol. 101, 2019.

• J.X. Qiu, S. Gao, M. Alawad, N. Schaefferkoetter, F. Alamudun, H.-J. Yoon, X.-C. Wu, G. Tourassi. “Semi-Supervised Information Extraction for Cancer Pathology Reports.” 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), May 2019.

• J.X. Qiu, H.-J. Yoon, K. Srivastava, T.P. Watson, J.B. Christian, A. Ramanathan, X.-C. Wu, P.A. Fearn, G.D. Tourassi. “Scalable Deep Text Comprehension for Cancer Surveillance on High-performance Computing.” BMC Bioinformatics, 2018.

Acknowledgements

• This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

• This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725.

• This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
