Lever et al. Genome Medicine (2019) 11:78
https://doi.org/10.1186/s13073-019-0686-y

RESEARCH Open Access

Text-mining clinically relevant cancer biomarkers for curation into the CIViC database

Jake Lever1,2, Martin R. Jones1, Arpad M. Danos3, Kilannin Krysiak3,4, Melika Bonakdar1, Jasleen K. Grewal1,2, Luka Culibrk1,2, Obi L. Griffith3,4,5,6*, Malachi Griffith3,4,5,6* and Steven J. M. Jones1,2,7*

Abstract

Background: Precision oncology involves analysis of individual cancer samples to understand the genes and pathways involved in the development and progression of a cancer. To improve patient care, knowledge of diagnostic, prognostic, predisposing, and drug response markers is essential. Several knowledgebases have been created by different groups to collate evidence for these associations. These include the open-access Clinical Interpretation of Variants in Cancer (CIViC) knowledgebase. These databases rely on time-consuming manual curation from skilled experts who read and interpret the relevant biomedical literature.

Methods: To aid in this curation and provide the greatest coverage for these databases, particularly CIViC, we propose the use of text mining approaches to extract these clinically relevant biomarkers from all available published literature. To this end, a group of cancer genomics experts annotated sentences that discussed biomarkers with their clinical associations and achieved good inter-annotator agreement. We then used a supervised learning approach to construct the CIViCmine knowledgebase.

Results: We extracted 121,589 relevant sentences from PubMed abstracts and PubMed Central Open Access full-text papers. CIViCmine contains over 87,412 biomarkers associated with 8035 genes, 337 drugs, and 572 cancer types, representing 25,818 abstracts and 39,795 full-text publications.
Conclusions: Through integration with CIViC, we provide a prioritized list of curatable clinically relevant cancer biomarkers as well as a resource that is valuable to other knowledgebases and precision cancer analysts in general. All data is publicly available and distributed with a Creative Commons Zero license. The CIViCmine knowledgebase is available at http://bionlp.bcgsc.ca/civicmine/.

Keywords: Precision oncology, Text mining, Information extraction, Machine learning, Cancer biomarkers

* Corresponding authors
3McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
1Canada’s Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

The ability to stratify patients into groups that are clinically related is an important step towards a personalized approach to cancer. Over time, a growing number of biomarkers have been developed to select patients who are more likely to respond to certain treatments. These biomarkers have also been valuable for prognostic purposes and for understanding the underlying biology of the disease by defining different molecular subtypes of cancers that should be treated in different ways (e.g., ERBB2/ESR1/PGR testing in breast cancer [1]). Immunohistochemistry techniques are a primary approach for testing samples for diagnostic markers (e.g., CD15 and CD30 for Hodgkin’s disease [2]). Recently, the lower cost and increased speed of genome sequencing have also allowed the DNA and RNA of individual patient samples to be characterized for clinical applications [3]. Throughout the world, this technology is beginning to inform clinician decisions on which treatments to use [4]. Such efforts are dependent on a comprehensive and current understanding of the clinical relevance of variants. For example, the Personalized Oncogenomics project at BC Cancer identifies somatic events in the genome such as point mutations, copy number variations, and large structural changes and, in conjunction with gene expression data, generates a clinical report to provide an ‘omic picture of a patient’s tumor [5].

The high genomic variability observed in cancers means that each patient sample includes a large number of new mutations, many of which may have never been documented before [6]. The phenotypic impact of most of these mutations is difficult to discern. This problem is exacerbated by the driver/passenger mutation paradigm where only a fraction of mutations are essential to the cancer (drivers) while many others have occurred through mutational processes that are irrelevant to the progression of the disease (passengers). An analyst trying to understand a patient sample typically performs a literature review for each gene and specific variant which is needed to understand its relevance in a cancer type, characterize the driver/passenger role of its observed mutations, and gauge the relevance for clinical decision making.

Several groups have built in-house knowledgebases, which are developed as analysts examine increasing numbers of cancer patient samples. This tedious and largely redundant effort represents a substantial interpretation bottleneck impeding the progress of precision medicine [7]. To encourage a collaborative effort, the CIViC knowledgebase (https://civicdb.org) was launched to provide a wiki-like, editable online resource where community-contributed edits and additions are moderated by experts to maintain high-quality variant curation [8]. The resource provides information about clinically relevant variants in cancer described in the peer-reviewed literature. Variants include protein-coding point mutations, copy number variations, epigenetic marks, gene fusions, aberrant expression levels, and other ‘omic events. It supports four types of evidence associating biomarkers with different classes of clinical relevance (also known as evidence types).

Diagnostic evidence items describe variants that can help a clinician diagnose or exclude a cancer. For instance, the JAK2 V617F mutation is a major diagnostic criterion for myeloproliferative neoplasms to identify polycythemia vera, essential thrombocythemia, and primary myelofibrosis [9]. Predictive evidence items describe variants that help predict drug sensitivity or response and are valuable in deciding further treatments. Predictive evidence items often explain mechanisms of resistance in patients who progressed on a drug treatment [...] developing a particular cancer, such as BRCA1 mutations for breast/ovarian cancer [11] or RB1 mutations for retinoblastoma [12]. Lastly, prognostic evidence items describe variants that predict survival outcome. As an example, colorectal cancers that harbor a KRAS mutation are predicted to have worse survival [13].

CIViC presents this information in a human-readable text format consisting of an “evidence statement” such as the sentence describing the ABL1 T315I mutation above together with data in a structured, programmatically accessible format. A CIViC “evidence item” includes this statement, ontology-associated disease name [14], evidence type as defined above, drug (if applicable), PubMed ID, and other structured fields. Evidence items are manually curated and associated in the database with a specific gene (defined by Entrez Gene) and variant (defined by the curator).

Several groups have created knowledgebases to aid clinical interpretation of cancer genomes, many of whom have joined the Variant Interpretation for Cancer Consortium (VICC, http://cancervariants.org/). VICC is an initiative that aims to coordinate variant interpretation efforts and, to this end, has created a federated search mechanism to allow easier analysis across multiple knowledgebases [15]. The CIViC project is co-leading this effort along with OncoKB [16], the Cancer Genome Interpreter [17], Precision Medicine Knowledge base [18], Molecular Match, JAX-Clinical Knowledge base [19], and others.

Most of these projects focus on clinically relevant genomic events, particularly point mutations, and provide associated clinical information tiered by different levels of evidence. Only CIViC includes RNA expression-based biomarkers. These may be of particular value for childhood cancers which are known to be “genomically quiet,” having accrued very few somatic mutations. Consequently, their clinical interpretation may rely more heavily on transcriptomic data [20]. Epigenomic biomarkers will also become more relevant as several cancer types are increasingly understood to be driven by epigenetic misregulation early in their development [21]. For example, methylation of the MGMT promoter is a well-known biomarker in brain tumors for sensitivity to the standard treatment, temozolomide [22].

The literature on clinically relevant cancer mutations is growing at an extraordinary rate. For instance, only 5 publications in PubMed mentioned BRAF V600E in the title or abstract in 2004 compared to 454 papers
