Cancer Informatics: New Tools for a Data-Driven Age in Cancer Research
Warren Kibbe1, Juli Klemm1, and John Quackenbush2

Cancer Research, Focus on Computer Resources

Cancer is a remarkably adaptable and formidable foe. Cancer exploits many biological mechanisms to confuse and subvert normal physiologic and cellular processes, to adapt to therapies, and to evade the immune system. Decades of research and significant national and international investments in cancer research have dramatically increased our knowledge of the disease, leading to improvements in cancer diagnosis, treatment, and management, resulting in improved outcomes for many patients.

In melanoma, the V600E mutation in the BRAF gene is now targetable by a specific therapy. BRAF is a serine/threonine protein kinase activating the MAP kinase (MAPK)/ERK signaling pathway, and BRAF inhibitors such as vemurafenib and dabrafenib, as well as MEK inhibitors, have shown dramatic responses in patients carrying the mutation. However, even these successes led to new questions. The same mutation in colorectal cancer is resistant to BRAF inhibitors, suggesting that this mutation interacts in a complex way with other elements in the cell and those interactions may be cell lineage dependent. Exploring mechanisms of resistance to these inhibitors has led to a better understanding of the MAPK/ERK signaling pathway, which in turn has led to identification of new potential therapeutic targets.

The accelerating progress in cancer research has been driven by rapid developments in technology. We have seen profound advances in sequencing technology, in techniques for assaying proteins and metabolites, in imaging capabilities, and in establishing electronic health records. At the same time, advances in mobile computing, pervasive availability of the Internet, and social media have opened new possibilities for understanding the contributing factors leading to cancer as well as the outcomes of treatment at a population scale across the country and even the world. But we often lose sight of the fact that all of these technologies produce only one thing: data. This unprecedented influx of data has only allowed us to advance our understanding because of advances in data management, analysis, and interpretation, all of which are increasingly recognized as essential elements of an integrated cancer research program.

Indeed, these national initiatives have served to increase the awareness of and need for robust investment in both cancer informatics development and in training the next generation of cancer data scientists. The Beau Biden Cancer Moonshot (https://www.cancer.gov/brp) identified enhanced data sharing as one of three key elements necessary to accelerate cancer research. The Precision Medicine Initiative highlighted the importance of data-driven cancer research, translational research, and its application to decision making in cancer treatment (https://www.cancer.gov/research/key-initiatives/precision-medicine). And the National Strategic Computing Initiative highlighted the importance of computing as a national competitive asset and included a focus on applying computing in biomedical research. Articles in the mainstream media, such as that by Siddhartha Mukherjee in the New Yorker in April of 2017 (http://www.newyorker.com/magazine/2017/04/03/ai-versus-md), have emphasized the growing importance of computing, machine learning, and data in biomedicine.

The NCI (Rockville, MD) recognized the need to invest in informatics. In 2011, it established a funding opportunity, the Informatics Technology for Cancer Research (ITCR) (https://itcr.cancer.gov), designed to support new algorithms, new methodologies, and the maturation of tools and techniques necessary to harness the power of data and computation for cancer researchers. Since its inception, the ITCR has funded 49 applications that support cancer informatics in areas that include DNA sequence analysis and interpretation, extraction of information from clinical records, systems biology and network medicine, proteomics, metabolomics, emerging fields such as radiomics, and application of machine learning and artificial intelligence to a host of research problems. A guiding principle of the ITCR program is that projects address relevant needs in cancer research so that resultant algorithms and tools provide value not only to data scientists, but also to the broader community of basic, clinical, and translational scientists.

In addition to the ITCR, the NCI also launched the Cancer Genomics Cloud Pilot (CGCP) program to address problems associated with the scope and scale of modern cancer data. Although a single human genome sequence can be represented in about 300 megabytes of disk space, the terabytes and petabytes of information that modern cancer studies generate make transporting and replicating data across thousands of research laboratories infeasible. The CGCP was designed to take advantage of modern, robust, scalable cloud computing technologies by storing data in a commercial cloud infrastructure and allowing cancer researchers to bring their methods to the data to perform analyses.

Advancing our understanding of cancer also requires that we share data and establish cohorts that are large enough to draw meaningful conclusions. The NCI established the Genomic Data Commons (GDC, https://gdc.cancer.gov), positioning it as a central resource for sharing genomic, imaging, proteomic, and phenotype (clinical data for human specimens) information. The GDC includes defined data standards (https://gdc.cancer.gov/about-data/data-standards) to help assure consistent, harmonized access to data together with well-characterized primary analyses for various types of genomic data. This includes whole-genome sequencing, whole-exome sequencing, deep targeted sequencing, RNA-seq, methyl-seq, and other sequence-centric datasets. The GDC went live in June of 2016 and currently provides centralized access to large genomically focused datasets, including The Cancer Genome Atlas (TCGA) and TARGET. In addition, sequencing data from 18,000 FoundationOne tests, the Multiple Myeloma Research Foundation Compass study, and the AACR Project GENIE dataset will soon be available through the GDC.

Together, the ITCR, GDC, and CGCP programs represent an investment in the future of data-driven cancer research. This special issue of Cancer Research is designed to highlight some of the resources and discoveries that have been made possible by these NCI programs. Most of the articles appearing here are short "application notes" introducing a tool or resource and providing a short vignette demonstrating its application. Each research team was also asked to provide a brief video that could be included online to either provide more background or to serve as a brief tutorial. In addition, research articles demonstrate the utility of these tools in gaining new insight into cancer.

We hope that these collected works provide readers of Cancer Research with new tools that they can incorporate into their work, either to explore existing public datasets such as TCGA or to analyze data that they are generating. We also hope that these articles provide incentive for broader collaboration between cancer data scientists and laboratory, translational, and clinical scientists. Cancer is a complex disease, and conquering it will require bringing all our collective skills to bear. Further, we expect that this issue will be the beginning of many more computational resource papers that will be published and highlighted in future issues of Cancer Research.

Disclosure of Potential Conflicts of Interest

J. Quackenbush is the co-founder and former board chair at Genospace, LLC. No potential conflicts of interest were disclosed by the other authors.

Received July 21, 2017; accepted September 18, 2017; published online November 1, 2017.

1 NCI, Center for Biomedical Informatics & Information Technology, Rockville, Maryland.
2 Dana-Farber Cancer Institute and Harvard TH Chan School of Public Health, Boston, Massachusetts.

Corresponding Author: John Quackenbush, Dana-Farber Cancer Institute, 450 Brookline Ave., Sm822, Boston, MA 02215. Phone: 617-582-8163; Fax: 617-582-7760; E-mail: [email protected]

doi: 10.1158/0008-5472.CAN-17-2212

©2017 American Association for Cancer Research.

Cancer Res 2017;77:e1-e2; Cancer Res 77(21), November 1, 2017. Updated version: http://cancerres.aacrjournals.org/content/77/21/e1

