TEXT MINING OF MUTATIONS AND THEIR IMPACT

FROM BIOMEDICAL LITERATURE

by

A. S. M. Ashique Mahmood

A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer and Information Sciences

Fall 2018

© 2018 A. S. M. Ashique Mahmood
All Rights Reserved

TEXT MINING OF MUTATIONS AND THEIR IMPACT

FROM BIOMEDICAL LITERATURE

by

A. S. M. Ashique Mahmood

Approved: Kathleen F. McCoy, Ph.D. Chair of the Department of Computer and Information Sciences

Approved: Babatunde A. Ogunnaike, Ph.D. Dean of the College of Engineering

Approved: Douglas J. Doren, Ph.D. Interim Vice Provost for Graduate and Professional Education

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Vijay K. Shanker, Ph.D. in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Cathy H. Wu, Ph.D. Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Li Liao, Ph.D. Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Peter McGarvey, Ph.D. Member of dissertation committee

ACKNOWLEDGEMENTS

This research work is the outcome of years-long dedication and patience. But it would not have been possible without the support of many people around me. First of all, I express my gratitude towards my advisor and mentor, Prof. Vijay K. Shanker. Throughout the research journey, his continuous advisement, mentoring and encouragement played an integral role in shaping this dissertation. In particular, Prof. Shanker taught me how to think critically about a research problem, how to write research papers effectively and how to present research in front of peers. These skills have helped me in my research in many ways and I believe I will continue to benefit from them in my future career. I am truly grateful for all he has done for me. I thank my dissertation committee members: Prof. Cathy Wu, Prof. Li Liao and Dr. Peter McGarvey. Despite their busy schedules, they were kind enough to serve on my dissertation committee and helped me with their suggestions and insights regarding the applicability of my research work. I am grateful for their invaluable time and attention towards this dissertation. I have spent many wonderful years in the BioTM lab. I have come across wonderful minds in the BioTM lab, who also played roles in shaping my research. I am thankful to Oana (Catalina Tudor) for mentoring me when I first joined the lab. She helped me get into the NLP research world. I fondly remember former and present members of the BioTM lab: Gang Li, Yifan Peng, Samir Gupta, Ruoyao Ding, Jia Ren and Peng Su. We spent a lot of time together, be it for “research” or leisure activities. We had fun together hacking on some cool new tool as well as watching UCL matches. Thank you guys! Since I moved to the USA, I have been lucky to have wonderful friends who were always there for me. It would take pages if I started listing why they are special to me.

Instead, I just express my heartfelt gratitude to Farzana Khair, Musawir Chowdhury, Shermin Ashraf, Saif Tahsin, Samara Saif, Tareque Aziz, Firdous Saleheen, Purujit Saha, Laura Moum, Sonia Jahan, Fazle Rob, Mahfuzur Khan, Zannatun Noor, Rifat Lutful, Shafique Ahmed, Mithub Deb and Dabojani Das for their kind friendship. They all made me feel at home while away from home. I thank my family and relatives for their unconditional love and support that shaped my entire life. My parents, Shaheen Sultana and Mahbubul Hoq, have always believed in me and encouraged me at every step of my academic journey. I cannot thank my parents enough for this. In particular, without the love, care and sacrifices of my mother, I would not be the person that I am today. Thank you mom! And last but not least, I thank my wife Nancy (Tanjima Ferdous). We got married while I was a PhD student, and since then she has supported my PhD journey through unconditional love, sacrifices, encouragement and patience. She is the best partner and companion I could wish for. I love her and I am grateful for all she has done for me. In addition, I would like to thank my department (CIS, UDEL) for this wonderful opportunity of graduate education as well as for the financial support at the beginning. I also thank the funding agencies who continued to fund the research projects that I was involved with. I am grateful to our research collaborators at Georgetown University, George Washington University, Delaware Biotechnology Institute (DBI) and University of Delaware for the countless fruitful discussions, from which I learned a lot.

In a nutshell, I am grateful to each and everyone who supported my journey, in one way or other. To everyone I mentioned and forgot to mention, thank you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1 INTRODUCTION
  1.1 Motivation
  1.2 Thesis contributions
    1.2.1 Mutation detection
    1.2.2 Mutation-disease association
    1.2.3 Impact of genomic anomalies on drug responses
    1.2.4 Mutation impact on PPI
  1.3 Outline of the dissertation

2 MUTATION DETECTION
  2.1 Introduction
  2.2 Related works
  2.3 Approach
    2.3.1 Mutation detection
    2.3.2 Genotype/Allele detection
    2.3.3 Mutation-gene association
  2.4 Evaluation
    2.4.1 Evaluation setup
    2.4.2 Evaluation metrics
  2.5 Results and discussion
    2.5.1 Results on mutation detection
    2.5.2 Results on mutation-gene association
  2.6 Conclusion

3 MUTATION-DISEASE ASSOCIATION
  3.1 Introduction
  3.2 Related works
  3.3 Approach
    3.3.1 General relation extraction system
    3.3.2 CAIR relations
    3.3.3 MF relations
    3.3.4 Statistical relations
    3.3.5 Co-occurrence in title/conclusion
    3.3.6 Extracting specific information
      3.3.6.1 Extracting mutations
      3.3.6.2 Extracting diseases
      3.3.6.3 Patient Context (PC) sentence
    3.3.7 Extracting additional information
      3.3.7.1 Rhetorical zones
      3.3.7.2 Patient related information
  3.4 System implementation
  3.5 Evaluation
    3.5.1 Evaluation setup
    3.5.2 Evaluation metrics
  3.6 Results and discussion
    3.6.1 Results on annotated datasets
    3.6.2 Full-scale processing
  3.7 Conclusion

4 IMPACT OF GENOMIC ANOMALIES ON DRUG RESPONSES
  4.1 Introduction
  4.2 Related works
  4.3 Approach
    4.3.1 Different information types
      4.3.1.1 Association
      4.3.1.2 Comparison
      4.3.1.3 Biomarker
      4.3.1.4 Sensitization
    4.3.2 Syntactic processing
    4.3.3 Entity recognition
    4.3.4 Typing of phrases
    4.3.5 Pattern matching
    4.3.6 Extracting specific information
      4.3.6.1 Extracting drugs
      4.3.6.2 Extracting diseases
    4.3.7 Extracting additional information
  4.4 System implementation
  4.5 Evaluation
    4.5.1 Evaluation setup
    4.5.2 Evaluation metrics
  4.6 Results and discussion
    4.6.1 Results on annotated datasets
  4.7 Conclusion

5 MUTATION IMPACT ON PROTEIN-PROTEIN INTERACTIONS
  5.1 Introduction
  5.2 Related works
  5.3 Approach
    5.3.1 Extraction of PPI relation
    5.3.2 Mutation impact on PPI
    5.3.3 Extraction of mutated gene
    5.3.4 Anaphora resolution
    5.3.5 Entity recognition
  5.4 System implementation
  5.5 Evaluation
    5.5.1 Evaluation setup
    5.5.2 Evaluation metrics
  5.6 Results and discussion
  5.7 Conclusion

6 CONCLUSION
  6.1 Thesis summary and contributions
  6.2 Future work

REFERENCES

Appendix

A LEXICO-SYNTACTIC PATTERNS
  A.1 Patterns for Association type sentences
  A.2 Patterns for Comparison type sentences
  A.3 Patterns for Sensitization type sentences
  A.4 Patterns for Biomarker types sentences
  A.5 Patterns for drug detection

B GENE-DRUG COMBINATIONS
C PERMISSIONS

LIST OF TABLES

2.1 Summary information of the datasets used for evaluation purposes. refers to mutation-disease associations. refers to mutation-gene associations. M refers to mutation detection.

2.2 Evaluation of mutation detection systems on various datasets.

2.3 MeX’s performance of mutation-gene association () and comparison with EMU’s performance.

2.4 MeX’s performance of mutation-gene association () on CIViC dataset.

3.1 Example sentences which have different syntactic representations.

3.2 An abstract split into rhetoric zones (PMID:10810408).

3.3 DiMeX’s performance of mutation-disease association () on different datasets and comparison with EMU’s performance.

3.4 Characteristics of the extracted results of the large scale run.

4.1 Performance of our system in finding the association of a genomic anomaly with drug responses () in InHouseSet1.

4.2 Performance of our system in separating non-relevant abstracts from relevant abstracts in InHouseSet2.

4.3 Performance of our system in finding the association between genomic anomaly and drug responses in PharmGKB dataset.

5.1 Evaluation results for detection of mutation impact on protein-protein interactions.

LIST OF FIGURES

3.1 SDG representation of the sentence “His239Arg is associated with lung adenocarcinoma”.

3.2 arg0 and arg1 edges from the sentence “His239Arg is associated with lung adenocarcinoma”.

3.3 A regulation relation with trigger “promotes”.

3.4 An Involvement relation with trigger “plays a role”.

3.5 A comparison sentence marked by the word “than”.

3.6 A Mutation Found relation sentence.

5.1 SDG representation of the sentence in Example 1.

5.2 SDG representation of sentence in Example 3.

5.3 SDG representation of sentence in Example 5.

ABSTRACT

The increasing amount of research focusing on genetic mutations has triggered a rapid growth in the number of published articles describing mutations and their effect on diseases, drug responses and protein functionalities. With the advent of precision medicine, which aims at identifying targeted therapies that have maximal efficacy for individual patients, there is a pressing need to gather such mutational information from text into public knowledge bases. But manual curation slows down the growth of such databases. We have applied natural language processing (NLP) techniques to locate and extract mutational information from text to assist curators and researchers. In particular, in this dissertation, we have addressed the following tasks: mutation detection, mutation to disease association, mutation impact on drug responses and impact of mutations on protein-protein interactions from research literature. We have developed a system, MeX, to detect a wide range of mutation mentions from text. Evaluations on several publicly available corpora show that we have achieved state-of-the-art performance in mutation detection. The mutation detector also applies a novel algorithm to associate mutations with genes. We have developed a system, DiMeX, which finds the association between mutations and diseases from abstracts of published articles. Our system outperformed the current state-of-the-art when evaluated on multiple corpora. We have developed a system, eGARD, to identify the impact of genomic anomalies on drug responses. Evaluations showed high performance measures from eGARD that will significantly reduce manual curation time. Finally, we have developed a text mining system to extract mutation impact on protein-protein interaction. This type of information will provide further insight into how mutations affect protein functions, and thereby play a role in the development and progression of diseases. Our system outperformed the current state-of-the-art approaches for the task. To enable easier access to data and make it available to computational tools, we have applied DiMeX and eGARD at Medline scale and stored the results in databases.

Chapter 1

INTRODUCTION

1.1 Motivation

The advent of next-generation sequencing (NGS) techniques has revolutionized the study of genomics and molecular biology and has led to a dramatic rise in the volume of scientific literature [1,2]. Furthermore, molecular profiling plays an integral role in identifying genomic anomalies, which helps in personalizing treatments, improving patient outcomes and minimizing risks associated with different therapies. Thus, it is imperative to understand how genomic anomalies impact certain diseases, drug responses and functions at the molecular level, such as protein-protein interaction. All these have created a pressing need to collect valuable mutational information from published articles to advance the field of precision medicine [3]. But extracting mutational information has been a difficult task, mostly because the information is spread across the articles as natural text. There have been numerous manual curation efforts to meet this information need. However, manual extraction of such information from text is costly in terms of time and money and, most importantly, manual efforts cannot keep up with the new information being published every year. That is why there has been a great deal of interest in recent years in automatically extracting relational information from published literature. All these have contributed to the motivation of this dissertation: developing text mining techniques to extract mutational information from text. We hope that the research work reported in this dissertation will facilitate and assist various biological tasks, such as knowledge discovery, bio-curation and hypothesis generation for personalized care.

1.2 Thesis contributions

In particular, the contribution of this thesis can be divided into four parts, each addressing a different challenge in extracting mutational information from natural text. The contributions are described in the following sections.

1.2.1 Mutation detection

The first aspect of this dissertation is about detecting mutation mentions in text. In order to extract relations between mutations and other entities, it is essential to first recognize mutations. There are existing text mining (TM) tools [4–13] that address this task, but they vary in the types of genetic variants they detect, and many are limited to detecting only certain types of mutations. We have developed a system that detects a wide range of mutation types. Evaluation on several publicly available corpora shows that our system produces state-of-the-art results.

In addition to detecting mutation mentions, associating the mutations with the corresponding genes helps validate the mutations. Only a few of the TM tools [4–6, 13] attempt to find the associated gene. They mostly use co-occurrence-based methods that suffer from low precision. To address this, we have developed a novel algorithm that looks into the syntactic and semantic structure of the text to associate each mutation with the relevant gene.

1.2.2 Mutation-disease association

The second aspect of this dissertation concerns the development of a text mining system, named DiMeX [14], to extract associations between mutations and diseases from literature. Once the mutations are extracted and paired with genes, we find the association of mutations with diseases to complete the extraction of the mutation, gene and disease triplet.

With the rapid growth in published literature regarding genomic findings from experimental studies, a vast amount of information is already available about the association between mutations and diseases. Uniprot [15], COSMIC [16], BioMuta [17], OMIM [18], HGMD [19], UMD [20], HGVbaseG2P [21], MutDB [22], dbSNP [23], PharmGKB [24], ClinVar [25] and InSiGHT [26] are examples of repositories that house mutations and related disease and phenotype information laboriously curated by hand from the literature. But compared to the growth of literature every year, manual curation falls short of covering the spectrum.

There are a few existing systems that target automated extraction of the association between mutations and diseases from text [13,27–32], based on the co-occurrences of the entities. In our approach, we have used natural language processing (NLP) techniques that exploit the lexico-syntactic structure of texts to extract the associations. We employed a general relation extraction system [33] to effectively extract different types of relations that connect mutations and diseases. Evaluation of our system shows significant improvement over the current systems. In addition, we have applied DiMeX on a large set of Medline abstracts and stored the results in a database.

1.2.3 Impact of genomic anomalies on drug responses

The third aspect of this dissertation is the development of a text mining system, named eGARD [34], to automatically extract relations between genomic anomalies and drug response from scientific literature. The emergence of the rapidly growing field of precision medicine is expanding the pharmacogenomics literature quickly, identifying important relationships between drugs and molecular entities [35]. It is becoming prohibitive for biocurators, clinical researchers and oncologists to keep up with the rapidly growing volume and breadth of information, especially information that describes therapeutic implications of biomarkers and is therefore relevant for treatment selection. There are databases such as PharmGKB [24] that manually curate genetic variants and their impact on drug response and diseases. To assist manual curation, several text mining systems [35–40] have also attempted to extract such information, most of which are based on co-occurrence of the entities. In contrast, we have developed an NLP-based system to extract the relations between genomic anomalies and drug responses.

In eGARD, we treated mutations and changes in gene expression levels as genomic anomalies. Drug response is a broader term which incorporates ideas such as sensitivity to drugs, outcome of a drug or chemotherapy, survival after drug treatment etc. Evaluations on several datasets show that our system can assist the manual curation process by automatically extracting relations, thereby significantly reducing manual curation time and effort. In addition, we applied eGARD on a large number of Medline [41] abstracts and stored the extracted relations in a database.

1.2.4 Mutation impact on PPI

Finally, the last part of this dissertation describes our approach to extract mutation impact on protein-protein interactions from text.

To innovate personalized treatments, it is important to uncover how the genomic profile of patients impacts functionalities at the molecular level, thereby affecting disease development. As proteins and their interactions are the building blocks of metabolic and signaling pathways regulating cellular processes [42], understanding how genetic mutations impact the functionality of protein-protein interactions is crucial for providing additional support to precision medicine efforts. There are manually curated resources, such as IntAct [43] or BioGRID [42], that house protein-mutation interactions. There have been separate efforts towards extracting mutational information [4–14, 27–31] and PPIs [44–48] from text, but only a handful of recent works [49–51] attempt to integrate these two tasks. To address this, we have developed a text mining system to automatically extract mutational impact on protein-protein interactions (PPI) from scientific literature. To be precise, we extract the mutation and the two interactants in the PPI relation, in cases where the mutation impacts the interaction. Evaluations on a dataset acquired from the BioCreative VI precision medicine task [52] show high precision for our system.

1.3 Outline of the dissertation

The rest of the dissertation is organized as follows. Chapter 2 discusses the details of mutation extraction from text and the techniques for mutation-gene association. In Chapter 3, we present the DiMeX system that associates mutations with diseases. The text mining system eGARD, which finds the association of genomic anomalies with drug responses, is discussed in Chapter 4. Chapter 5 discusses our approach to extract mutational impact on protein-protein interaction. Each of the chapters (2-5) reports related works and evaluation results of the experiments completed. Finally, we conclude with a summary of the contributions and future work in Chapter 6.

Chapter 2

MUTATION DETECTION

2.1 Introduction

Rapidly evolving sequencing technologies, combined with increasing emphasis on finding the impact of genetic variations on diseases, are causing the volume of scientific literature mentioning genetic variations to grow quickly. In order to extract relations between mutations and other entities, it is essential to recognize mutation mentions in text. Additionally, associating the mutations with the corresponding genes helps validate the mutations. To address this, we have developed a system, MeX, to detect a wide range of mutation mentions from text. The system also detects genotype and allele mentions. In our work, we have developed a novel algorithm that looks into the structure of the text to associate each mutation with the relevant gene. The mutation detection and mutation-gene association components are fully portable, meaning they can be used in any other application. This chapter describes these two components as well as the evaluation of these two tasks.

2.2 Related works

The number of scientific articles mentioning mutations and their impact on phenotypes has been expanding rapidly over the last few decades. Manual curation from text is expensive and, most importantly, cannot keep pace with the fast-growing volume of biomedical literature. To assist manual curation, several search-based and text-mining (TM) efforts have attempted to detect mutations in text. The majority utilize regular expressions to detect mutations, such as MEMA [4], MuteXt [5], MuGeX [6], Mutation GraB [7], MutationFinder [8] and Yip et al. [9]. Others, like tmVar [11] and VTag [12], use conditional random fields (CRFs), and SETH [10] implements an Extended Backus-Naur Form (EBNF) grammar to detect mentions of mutations. EMU [13] uses regular expressions but additionally applies a sequence filter to validate the extracted mutations.

There are several publicly available corpora that support evaluation of mutation extraction tools. The most commonly used is the corpus provided with the MutationFinder [8] tool, which covers only protein mutations. The corpus provided with the tool tmVar [11] covers a wide range of mutations that includes substitutions, deletions, insertions, INDELs, duplications, frameshifts and SNP IDs. Verspoor et al. [53] developed a corpus named Variome of 10 full-text publications, which includes both specific mutation mentions and generic references to mutations such as “mutations” or “somatic mutations”. The mutation types included in the Variome corpus are substitution, insertion and deletion. Thomas et al. [10] also provide mutation annotations for a corpus comprised of articles collected from a targeted sampling of journals.

Some of the mutation detection tools are limited to identifying mutation mentions only, whereas others extend the mutation detection method to associate the mutation with genes/proteins. MEMA, MuteXt and MuGeX use the co-occurrence and proximity of protein and mutation mentions to infer the mutation-protein relation. Mutation GraB uses graph-based traversal to identify the mutation-protein pair that has the shortest path in between. EMU has proposed the use of a sequence filter to validate mutation-gene association. Polysearch [28] uses only co-occurrence information to associate mutations with genes. LitVar [32] considers sentences that mention mutations detected by tmVar and highlights co-occurring genes appearing in the same sentence.

2.3 Approach

In this section, we first describe the methodology for mutation detection. After that, we describe the methods for associating mutations with corresponding genes.

2.3.1 Mutation detection

A mutation can be expressed in text in various forms. The most common forms are protein and DNA level mutations, as shown below with some examples.

• Protein level mutations: Ala282Val, Asp32>Asn, T877A, Phe153—-Ala etc.

• DNA level mutations: A3537G, 4304G>A, 1066-6T>G, -79C/T etc.

In addition, mutations could be further classified into other types such as insertion, deletion, SNP IDs, frameshifts and nonsense mutations. Below are some examples of these types of mutations.

• Insertions: 5382insC, IVS9-5insT etc.

• Deletions: 9631delC, 6886delGAAAA, IVS19+2delT etc.

• SNP IDs: rs1800795, ss984046046 etc.

• Frameshifts: p.Pro246HisfsX13, Leu203fsX15 etc.

• Nonsense: Y497X, p.C52* etc.

We have developed a method to detect a wide range of mutation mentions from text, including but not limited to the above mentioned mutation types. The mutation extraction employs a list of patterns to detect mutations primarily containing three components: the wild-type symbol, the mutant-type symbol and the position. The patterns allow symbols to be single-letter, three-letter or full-name mentions of amino acids, or [A,C,G,T] for DNA bases. In some cases, conjunctions are part of mutation mentions. For example, in PMID:9466928, “Ala16 –>Cys, Thr, Met, Arg, His and Tyr” is mentioned. We detect the conjunctions in this case and generate six mutations: Ala-16-Cys, Ala-16-Thr, Ala-16-Met, Ala-16-Arg, Ala-16-His and Ala-16-Tyr.

Usually the mutation mentions follow the formats recommended by the Human Genome Variation Society (HGVS) [54]. But often they are expressed in natural text instead of specific formats. Our system also attempts to extract such mentions of mutations that are beyond the scope of the patterns. These extractions are triggered by detection of a pair of amino acids or nucleotides [A,C,G,T]. These are considered as wild-type and mutant-type symbols if an associated mutant position is found. If the mutant position is not mentioned in the same phrase as the wild-type and mutant-type symbols, then it is often attached to the phrase with a prepositional phrase (see examples below). We search for specific words, such as codon, position, residue etc., to locate the mutant position. Some examples of a range of mutations extracted using this technique are listed below.

• A–> C transversion in codon 135

• T to C transition at positions 409 and 412

• Ser–> Leu change at amino acid 217

• guanine-adenine point mutation at nucleotide 2185
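The trigger-and-position idea described above can be illustrated with a small sketch. It is not the MeX implementation: the regular expressions, the symbol lists and the function name `extract_described_mutations` are assumptions made for illustration, and the hyphenated output simply follows the Ala-16-Cys style used earlier in this section.

```python
import re

# Illustrative symbol inventory (assumed): nucleotide names, three-letter
# amino acid codes, and single-letter DNA bases.
NUCLEOTIDES = {"adenine": "A", "cytosine": "C", "guanine": "G", "thymine": "T"}
AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
               "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]
SYMBOL = "|".join(list(NUCLEOTIDES) + AMINO_ACIDS + ["[ACGT]"])

# Trigger: a pair of symbols joined by an arrow, the word "to", or a hyphen.
PAIR = re.compile(
    rf"\b(?P<wild>{SYMBOL})\s*(?:[-–]+>|to|-)\s*(?P<mutant>{SYMBOL})\b"
)
# Position cue word followed by one or more numbers ("409 and 412").
POSITION = re.compile(
    r"\b(?:codons?|positions?|residues?|amino acids?|nucleotides?)\s+"
    r"(?P<nums>\d+(?:\s*(?:,|and)\s*\d+)*)"
)

def to_code(symbol):
    """Map a nucleotide name to its one-letter code; leave other symbols as-is."""
    return NUCLEOTIDES.get(symbol.lower(), symbol)

def extract_described_mutations(phrase):
    """Return one wild-pos-mutant string per position found in the phrase."""
    pair = PAIR.search(phrase)
    pos = POSITION.search(phrase)
    if not pair or not pos:
        return []  # a symbol pair without a position is not accepted
    wild, mutant = to_code(pair["wild"]), to_code(pair["mutant"])
    return [f"{wild}-{n}-{mutant}" for n in re.findall(r"\d+", pos["nums"])]

for example in ["A–> C transversion in codon 135",
                "T to C transition at positions 409 and 412"]:
    print(example, "=>", extract_described_mutations(example))
```

A pair of symbols is accepted only when a position cue is also present, which mirrors the requirement that an associated mutant position be found.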

We employ a normalization technique to canonicalize the SNP mentions into one standard format by matching the wild-type, mutant-type and position. We use WildType-pos-MutantType as the standard format for normalization. For example, G5557A and 5557G>A (PMID:22200742) normalize to the same mutation G5557A. In addition to detecting specific mutation mentions, we aim to detect mentions of genetic aberrations that do not specify the mutation. These mentions carry valuable information regarding the genetic alterations for a specific gene. For example, our system detects the phrases such as MET gene copy number, TOP2A gene alterations, and Amplification of HER2 as well, where all represent aberrations in the genes MET, TOP2A and HER2, respectively.
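As a rough illustration of the normalization step, the sketch below maps two surface forms of a DNA substitution onto the wild-type, position, mutant-type order, so that G5557A and 5557G>A compare equal. The two patterns and the function name `normalize` are assumptions for this example only; the actual system covers many more forms.

```python
import re

# Two illustrative surface forms (assumed, not exhaustive):
#   G5557A   : wild-type, position, mutant-type
#   5557G>A  : position, wild-type, ">" or "/", mutant-type
FORMS = [
    re.compile(r"^(?P<wild>[ACGT])(?P<pos>[+-]?\d+)(?P<mutant>[ACGT])$"),
    re.compile(r"^(?:c\.)?(?P<pos>[+-]?\d+)(?P<wild>[ACGT])\s*(?:>|/)\s*(?P<mutant>[ACGT])$"),
]

def normalize(mention):
    """Return the mention in wild-type/position/mutant-type order, or None."""
    text = mention.strip()
    for form in FORMS:
        match = form.match(text)
        if match:
            return f"{match['wild']}{match['pos']}{match['mutant']}"
    return None  # forms outside this sketch (e.g. intronic offsets) are skipped

# Both spellings from PMID:22200742 map to the same key.
print(normalize("G5557A"), normalize("5557G>A"))
```

Using the normalized string as a dictionary key is one simple way to merge repeated mentions of the same mutation within an abstract.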

2.3.2 Genotype/Allele detection

In addition to detecting mutations, we are also interested in detecting allele/genotype mentions and identifying the connected forms of the gene. As our goal is to detect mutations and associate them with other biological entities, it is important to also capture mentions of genotypes/alleles from the literature, because we noticed that authors sometimes refer to mutations using genotype or allele notation instead of mentioning the complete mutation. Our system detects these mentions and matches them to the corresponding mutation, which might be mentioned elsewhere in the abstract. In Example 1 below, the authors indicate an association between the mutation (Arg194Trp) and gastric cancer. However, the mutation is referred to by the allele 194Trp, whose corresponding mutation is Arg194Trp.

Example 1: “XRCC1 194Trp allele significantly increased the risk of gastric cancer and also associated with risk of gastric cardia carcinoma and promoted distant metastasis of gastric cancer.” (PMID:20863780)
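A minimal sketch of how an allele mention such as 194Trp might be tied back to a full mutation found elsewhere in the same abstract is shown below. The two regular expressions and the function `resolve_alleles` are assumptions for illustration, and the first sentence of the sample text is invented to stand in for an earlier sentence of the abstract.

```python
import re

# Full protein substitution with three-letter codes, e.g. Arg194Trp (assumed pattern).
MUTATION = re.compile(r"\b([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\b")
# Allele mention: position plus one residue, followed by the word "allele".
ALLELE = re.compile(r"\b(\d+)([A-Z][a-z]{2})\b(?=\s+allele)")

def resolve_alleles(abstract):
    """Map each allele mention to a full mutation sharing position and residue."""
    mutations = MUTATION.findall(abstract)
    resolved = {}
    for pos, residue in ALLELE.findall(abstract):
        for wild, mut_pos, mutant in mutations:
            if mut_pos == pos and residue in (wild, mutant):
                resolved[f"{pos}{residue}"] = f"{wild}{mut_pos}{mutant}"
                break
    return resolved

# The first sentence is a made-up placeholder; the second is from Example 1.
text = ("The XRCC1 Arg194Trp polymorphism was genotyped. "
        "XRCC1 194Trp allele significantly increased the risk of gastric cancer.")
print(resolve_alleles(text))
```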

Allele mentions are usually similar to mutation mentions except sometimes they may not specify both the wild-type and mutant-type. However, it is also common to find genotype mentions in articles that discuss mutations. Example 2 demonstrates one such case where AA, AG and GG genotypes are mentioned. To identify the corresponding mutation in the abstract, we look for the mutations whose wild-type and mutant-type match with the genotype nucleotides. If multiple such mutations are found, we associate the closest one with the genotype. For example, the AG genotype is matched with the mutation +49G/A from an earlier sentence in the same abstract.

Example 2: “In HCC and CHB groups, the genotype frequency was 40.3% and 50.0% for GG, and 59.7% and 50.0% for AG+AA, respectively” (PMID:20813679)
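The genotype matching rule can be sketched as follows: keep the mutations whose wild-type and mutant-type cover the genotype's nucleotides, then pick the nearest one. The `Mention` class, the use of character offsets as the distance measure, and the offsets themselves are assumptions made for this illustration.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    offset: int        # character offset in the abstract (hypothetical values below)
    wild: str = ""     # wild-type nucleotide, for mutations only
    mutant: str = ""   # mutant-type nucleotide, for mutations only

def match_genotype(genotype, mutations):
    """Return the closest mutation whose alleles cover the genotype, or None."""
    alleles = set(genotype.text)  # "AG" -> {"A", "G"}, "GG" -> {"G"}
    candidates = [m for m in mutations if alleles <= {m.wild, m.mutant}]
    if not candidates:
        return None
    return min(candidates, key=lambda m: abs(m.offset - genotype.offset))

mutations = [Mention("+49G/A", 120, "G", "A"), Mention("-318C/T", 610, "C", "T")]
print(match_genotype(Mention("AG", 650), mutations).text)
```

In this toy input the C/T mutation is closer to the genotype mention, but it is rejected because its nucleotides do not match, so the AG genotype is linked to +49G/A.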

2.3.3 Mutation-gene association

Our preliminary studies have revealed that mutations are usually mentioned in text along with gene names, and often there are multiple genes in the same text. That is why it is important to know the corresponding gene for each mutation. There are previous systems that associate mutations with genes, mostly using co-occurrence information, which suffers from low precision. In our approach, we look into the syntactic and semantic structure of the text to associate each mutation with the relevant gene.

First of all, we have used BioNex [55] to perform the chunking of text, as we are interested in the noun phrases (NPs). In the simplest cases, associating a mutation with a gene is straightforward because the mutation and the corresponding gene are mentioned close to each other. Based on this observation, we developed the first rule, which makes an association between the mutation and the gene with high confidence when both of them appear in the same NP or in NPs connected with prepositional phrases. Example 3 presents one such case where the mutations “C-2123G”, “G-1969A”, and “T715P” are associated with the gene SELP from the NP “C-2123G, G-1969A and T715P in SELP”, and “Met62Ile” is associated with PSGL-1 from the NP “Met62Ile and the VNTR variants in PSGL-1 gene”.

Example 3: “Our aim was to evaluate the contribution to CHD of the following SNPs: C-2123G, G-1969A and T715P in SELP, Met62Ile and the VNTR variants in PSGL-1 gene in a North African population from .” (PMID:20376705)
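The first rule can be illustrated with a short sketch that takes already-chunked noun phrases as input. Chunking itself (done with BioNex in this work) is not reproduced here; the hand-written NP strings, the gene list, the mutation pattern and the decision to skip NPs with more than one gene are all assumptions for illustration.

```python
import re

# Illustrative mutation pattern: Met62Ile-style or C-2123G / T715P-style mentions.
MUTATION = re.compile(r"\b(?:[A-Z][a-z]{2}\d+[A-Z][a-z]{2}|[A-Z]-?\d+[A-Z])\b")

def associate_within_np(noun_phrases, gene_names):
    """Pair each mutation with the single gene named in the same noun phrase."""
    pairs = {}
    for np_text in noun_phrases:
        genes = [g for g in gene_names
                 if re.search(rf"(?<![\w-]){re.escape(g)}(?![\w-])", np_text)]
        if len(genes) != 1:
            continue  # assumption: skip NPs with no gene or an ambiguous choice
        for mutation in MUTATION.findall(np_text):
            pairs[mutation] = genes[0]
    return pairs

# NPs (with attached prepositional phrases) hand-chunked from Example 3.
nps = ["C-2123G, G-1969A and T715P in SELP",
       "Met62Ile and the VNTR variants in PSGL-1 gene"]
print(associate_within_np(nps, ["SELP", "PSGL-1"]))
```

Because the result is keyed by the mutation string, looking up a later occurrence of the same mutation returns the same gene, which is one simple way to propagate an association through the rest of the abstract.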

Even when a particular mutation occurrence does not have an accompanying gene in the same NP, we have noticed that the gene is often mentioned in the same NP as the mutation (with or without prepositional phrases) at least once in some other sentence of the abstract. We propagate the gene detected in these latter cases to all occurrences of the mutation in the rest of the abstract.

The second rule of the algorithm considers cases where the mutations do not appear together with their corresponding genes in any sentence of the abstract. If multiple genes are mentioned in the abstract, we look for textual clues indicating that mutations of particular genes are being discussed. To do that, we look for gene mentions that occur together with a mutation-specific term, such as “variant”, “mutant”, “variation”, “mutation”, “polymorphism”, “alteration”, “SNP” etc., in the same NP. We call this occurrence of the gene and the mutation-specific term a gene-mutation pairing. For any detected mutation, we associate it with the gene mentioned in the closest gene-mutation pairing that occurred previously in the text, either in the same sentence or in any earlier sentence. Once a mutation has been associated with a gene, the association is propagated to every occurrence of the mutation in that abstract. In Example 4a, the gene ELAC2/HPC2 is detected as having a gene-mutation pairing because of the phrase “mutations of the ELAC2/HPC2 gene”. The immediately following sentence, shown in Example 4b, has a mutation “Glu622Val” that does not co-occur with a gene. Applying our rule, “Glu622Val” is associated with ELAC2/HPC2.

Example 4a: “Here, we screened for mutations of the ELAC2/HPC2 gene in 66 Finnish HPC families.”

Example 4b: “Several sequence variants, including a new exonic variant (Glu622Val) were found, but none of the mutations were truncating.” (PMID:11507049)

The next rule applies when the above rules fail to find the gene of the mutation. We have noticed that, in the context of protein-protein binding interactions, the gene of a mutation is often mentioned together with terms that indicate a binding site or domain. So we look for gene mentions that occur together with terms such as “site”, “binding”, “domain”, “sequence” etc. in the same NP. We call this occurrence a gene-sequence pairing. For any detected mutation, we associate it with the gene that appears in the closest gene-sequence pairing. Example 5 shows the gene gp91(phox) as having a gene-sequence pairing from the phrase “sequence analysis of his gp91(phox) gene”.

Example 5: “Sequence analysis of his gp91(phox) gene revealed a single-base mutation (C –> T) at position -53.” (PMID:9600921)

The final rule covers cases where a single gene is mentioned in the whole abstract. In that case, a detected mutation is associated with that single gene.

Once a mutation is detected in the abstract, we apply the above four rules in sequence to associate it with the corresponding gene, stopping at the first rule that yields a gene. If all rules fail to associate a gene with the mutation, we report the mutation without any gene.
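The rule cascade can be illustrated with a minimal Python sketch. It assumes the abstract has already been chunked and tagged, so every noun phrase (or group of NPs joined by prepositional phrases) comes with its gene and mutation mentions. The `NounPhrase` class, the `associate_gene` function, the term sets, and the use of NP position as the distance measure are hypothetical simplifications for illustration, not the actual MeX code. The term “sequence” in `SEQUENCE_TERMS` is an assumption based on Example 5.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

MUTATION_TERMS = {"variant", "mutant", "variation", "mutation",
                  "polymorphism", "alteration", "snp"}
# "sequence" is an assumption taken from Example 5.
SEQUENCE_TERMS = {"site", "binding", "domain", "sequence"}


@dataclass
class NounPhrase:
    """One NP (or NPs joined by prepositional phrases), already tagged."""
    text: str
    genes: List[str] = field(default_factory=list)
    mutations: List[str] = field(default_factory=list)


def _has_term(np: NounPhrase, terms: set) -> bool:
    """True if the NP contains one of the terms (singular or plural)."""
    for word in re.findall(r"[a-z]+", np.text.lower()):
        if word in terms or (word.endswith("s") and word[:-1] in terms):
            return True
    return False


def associate_gene(nps: List[NounPhrase],
                   mutation: str) -> Tuple[Optional[str], Optional[str]]:
    """Apply the four rules in order; nps holds the abstract's NPs in reading order."""
    # Rule 1: mutation and gene share an NP anywhere in the abstract.
    for np in nps:
        if mutation in np.mutations and np.genes:
            return np.genes[0], "same NP"
    first = next((i for i, np in enumerate(nps) if mutation in np.mutations), None)
    if first is None:
        return None, None
    # Rule 2: closest gene-mutation pairing that precedes the mutation.
    for np in reversed(nps[:first + 1]):
        if np.genes and _has_term(np, MUTATION_TERMS):
            return np.genes[0], "gene-mutation pairing"
    # Rule 3: closest gene-sequence pairing (distance counted in NPs).
    pairings = [(abs(i - first), np.genes[0]) for i, np in enumerate(nps)
                if np.genes and _has_term(np, SEQUENCE_TERMS)]
    if pairings:
        return min(pairings)[1], "gene-sequence pairing"
    # Rule 4: only one gene is mentioned in the whole abstract.
    genes = {g for np in nps for g in np.genes}
    if len(genes) == 1:
        return genes.pop(), "single gene"
    return None, None


# Examples 4a/4b, with a hypothetical second gene so that Rule 4 cannot apply.
abstract = [
    NounPhrase("mutations of the ELAC2/HPC2 gene", genes=["ELAC2/HPC2"]),
    NounPhrase("a new exonic variant (Glu622Val)", mutations=["Glu622Val"]),
    NounPhrase("the AR gene", genes=["AR"]),
]
print(associate_gene(abstract, "Glu622Val"))
```

Returning the name of the rule that fired alongside the gene makes it easy to inspect which heuristic was responsible for each association during error analysis.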

2.4 Evaluation

2.4.1 Evaluation setup

We evaluated MeX’s mutation detection using three different corpora. The first one is the MutationFinder [8] corpus, which we will refer to as MF. We chose MF for this evaluation as it is the most popular benchmark and has been used by Yepes et al. [56] to compare different mutation detection systems. The MF set consists of 910 point mutation mentions from 508 abstracts. Note that MF annotates only point mutations, whereas our system extracts various other types as well. For comparison purposes, we only considered the point mutations. To test the wide coverage of mutation extraction, we evaluated our system on two other corpora: the tmVar [11] corpus and the Variome [53] corpus. The tmVar corpus has 464 mutation annotations from 166 abstracts and contains various types of mutations, including substitutions, deletions, insertions, frameshifts, duplications, INDELs and SNP IDs. The Variome corpus consists of 10 full-text publications and includes both specific mutation mentions and generic references to mutations such as “mutations” or “somatic mutations”. The mutation types included in the Variome corpus are substitution, insertion and deletion. In order to compare with other systems, we identified 118 instances of specific mutation mentions in the annotated corpus and used them for evaluation.

None of the above-mentioned corpora annotate mutation mentions with genes. So, to evaluate MeX’s performance in extracting mutations and associating them with genes, we needed other corpora that annotate mutations with genes. We used annotated gold-standard datasets from three different sources. The first dataset is a manually annotated corpus from the BioMuta [17] project, which we will call BiomutaC. BioMuta is an integrated database aiming to provide a framework for automated and manual curation and integration of cancer-related variations. Although BioMuta considers full text, the annotation included in this dataset is based on abstract text alone. BiomutaC contains 62 abstracts with 119 mutation-gene-disease association triplets.

A second collection of two publicly available datasets described in Doughty et al. [13] was used for evaluating MeX. These allowed us to compare our work with previously published results. We will call the two sets PCa-filtered-UD and BCa-filtered-UD, corresponding to abstracts from prostate cancer (PC) and breast cancer (BC), respectively. These two datasets are discussed in further detail in Chapter 3. There are 97 and 132 abstracts in PCa-filtered-UD and BCa-filtered-UD, respectively. We applied both MeX and EMU to these two datasets so that the evaluation results are directly comparable.

A third dataset was collected from the CIViC [57] project. CIViC is an open-access, community-driven resource for clinical interpretation of variants in cancer. It annotates variants and their clinical significance along with evidence from the literature. The corresponding genes for the variants are also annotated, which gives us a chance to evaluate our mutation-gene association algorithm. We downloaded the publicly available CIViC database and collected the variant-gene annotations with evidence PubMed IDs. As the annotations are made from full-length articles, we limited the annotations to the variant-gene pairs where the variant is mentioned in the article abstract. That yielded a set of 275 mutation-gene pairs from 231 Medline abstracts. We applied our mutation-gene association module to this set to evaluate its performance. Table 2.1 summarizes all the datasets used for evaluation in this study.

2.4.2 Evaluation metrics

We counted true positives (TP), false positives (FP), and false negatives (FN), and used the standard information retrieval metrics of Precision (P), Recall (R), and F-measure (F) for performance evaluation, where P = TP/(TP+FP), R = TP/(TP+FN) and F = 2PR/(P+R).
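As a small worked illustration of these definitions, the following Python sketch computes the three metrics from raw counts. The function name `precision_recall_f` and the convention of returning 0.0 when a denominator is zero are choices made for this example only, and the counts in the usage line are invented.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute P = TP/(TP+FP), R = TP/(TP+FN) and F = 2PR/(P+R)."""
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Invented counts, for illustration only.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```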

Table 2.1: Summary information of the datasets used for evaluation purposes. The tasks are mutation-disease association, mutation-gene association, and mutation detection (M).

Name of dataset    Used for tasks of   Used for evaluation of   Size
BioMutaC                               MeX                      62 abstracts (119 triplets)
PCa-filtered-UD                        MeX & EMU                97 abstracts (170 triplets)
BCa-filtered-UD                        MeX & EMU                132 abstracts (216 triplets)
CIViC                                  MeX                      231 abstracts (275 pairs)
MF                 M                   MeX                      508 abstracts (910 point mutations)
Variome            M                   MeX                      10 full text articles (118 mutations)
tmVar              M                   MeX                      166 abstracts (464 mutations)

2.5 Results and discussion

2.5.1 Results on mutation detection

We evaluated the performance of MeX’s mutation detection component on three corpora: the MF corpus, the Variome corpus and the tmVar corpus. Table 2.2 provides MeX’s results on these corpora along with the results of the tools MutationFinder [8], EMU [13], tmVar [11], and SETH [10]. The results in Table 2.2 for the other tools are published results from the original papers, the SETH tool website (https://rockt.github.io/SETH/) and [56]. To include EMU’s performance on the MF corpus, we had to consider performance on normalized mutations, since EMU only reports results for normalized mutations, where multiple occurrences of the same mutation are normalized to one entry and evaluation is done on the normalized mutation instead of on all occurrences. MeX and tmVar exhibited the best results on the MF corpus with an F-measure of 0.94 each, and MeX achieved the best F-measure (0.93) on normalized mutations. We ran tmVar, SETH and EMU on the Variome corpus and found that MeX’s performance, as measured by F-measure, was not statistically significantly different from theirs (p-values using paired t-test: 0.89, 0.05 and 0.07, respectively). On the Variome corpus, SETH achieved the best precision (0.99) while MeX and tmVar achieved the best F-measure (0.94). Similarly, on the tmVar corpus, MeX and tmVar both achieved the top F-measure (0.91) while SETH exhibited the top precision (0.94). In summary, both MeX and tmVar consistently perform better than the rest. Both are also the most comprehensive in terms of the mutation types that are detected.

Table 2.2: Evaluation of mutation detection systems on various datasets. The values of precision (P), recall (R) and F-measure (F) for tools other than MeX are obtained from comparisons performed in [56] and published results in [8], [11] and the SETH tool website. A dash (‘-’) indicates unavailability of data. The tools are MutationFinder (MF), Extractor of Mutations (EMU), tmVar, SNP Extraction Tool for Human Variations (SETH) and MeX. For the MF corpus, the results in parentheses represent evaluation on normalized mutations, where multiple occurrences of the same mutation are normalized to one entry.

Tool    Measure   MF (MF mutations normalized)   Variome   tmVar
MF      P         0.98 (0.98)                    0.94      -
        R         0.82 (0.81)                    0.16      -
        F         0.89 (0.89)                    0.24      -
EMU     P         - (0.99)                       0.97      -
        R         - (0.81)                       0.76      -
        F         - (0.89)                       0.85      -
tmVar   P         0.99 (0.98)                    0.97      0.91
        R         0.90 (0.84)                    0.91      0.91
        F         0.94 (0.90)                    0.94      0.91
SETH    P         0.98 (0.97)                    0.99      0.94
        R         0.82 (0.81)                    0.76      0.81
        F         0.89 (0.88)                    0.86      0.87
MeX     P         0.99 (0.98)                    0.96      0.94
        R         0.89 (0.89)                    0.92      0.89
        F         0.94 (0.93)                    0.94      0.91

2.5.2 Results on mutation-gene association

Table 2.3 summarizes the evaluation results for mutation-gene associations. We achieved high precision and recall on the BiomutaC set, with an F-measure of 0.93. We also achieved an F-measure of 0.94 on both the PCa-filtered-UD and BCa-filtered-UD sets. EMU scored F-measures of 0.76 and 0.68 on the PCa-filtered-UD and BCa-filtered-UD sets, respectively. Paired t-tests showed that the differences on these two sets are statistically significant (p=0.003 and p=0.0007, respectively). In-depth analysis of the false positive (FP) and false negative (FN) cases revealed that most of the errors are due to gene mention detection problems. For example, for the BiomutaC set, the gene mention detector failed to detect the target gene name in five cases. Because the correct gene was missed, our algorithm linked the wrong gene, contributing to both FP and FN. A similar trend was seen for the PCa-filtered-UD and BCa-filtered-UD sets.

Evaluation results for the CIViC dataset are shown in Table 2.4. We achieved a similar recall (0.93) on the CIViC dataset. Please note that we are only able to assess recall for this set, as it only contains positive annotations. We inspected the errors and found that in some cases the mutation was not detected, which contributed to the FN cases. The gene detector failed to detect the target genes in a few cases, and our heuristics were unable to find the associated gene for the mutation in the rest of the cases.

Table 2.3: MeX’s performance of mutation-gene association and comparison with EMU’s performance.

Datasets          DiMeX performance     EMU performance
                  P     R     F         P     R     F
BiomutaC          0.90  0.95  0.93      -     -     -
PCa filtered UD   0.96  0.92  0.94      0.77  0.75  0.76
BCa filtered UD   0.94  0.94  0.94      0.65  0.72  0.68

Table 2.4: MeX’s performance of mutation-gene association on CIViC dataset.

Dataset   TP    FN   Recall
CIViC     255   20   0.93

2.6 Conclusion

We have developed a system (MeX) to detect a wide range of mutation mentions from text. Additionally, the system detects genotype and allele mentions. Evaluations on several corpora show that we have achieved state-of-the-art performance in mutation detection and mutation-gene association. The mutation detection and mutation-gene association systems are built as separate modules that can be used by other text mining systems. The entire system implementation is dockerized [58] for better portability, which also helps run the system in a parallel fashion.

Chapter 3

MUTATION-DISEASE ASSOCIATION

3.1 Introduction

In this chapter, we discuss our method for finding textual mentions of associations between mutations and diseases. We have developed a text-mining system (DiMeX) to extract mutation-disease associations from Medline abstracts.

In this work, we are interested in finding the connections between mutations and diseases. Statements in the literature may connect mutations with the disease itself (as in Example 1) or with some aspect of a disease. An aspect of a disease could be an outcome or a response to a treatment (marked by phrases such as survival, progression, remission, response rate, resistance etc.). Example 1 shows a sentence where the mutation His239Arg is directly connected to the disease lung adenocarcinoma. Example 2 shows a sentence where the mutation V600E is connected with two aspects of the disease papillary thyroid cancer: prognostic factor and clinical outcome.

Example 1: “His239Arg SNP of HRAD9 is associated with lung adenocarcinoma” (PMID:16444745)

Example 2: “The association of the BRAF(V600E) mutation with prognostic factors and poor clinical outcome in papillary thyroid cancer” (PMID:21882184)

There are various ways in which a connection between mutations and diseases is made in the scientific literature. Next, we discuss the kinds of sentence structures and connections between a mutation and a disease or a disease aspect that we attempt to capture in DiMeX.

CAIR relations

The first type of connection corresponds to four types of relations: Association, Regulation, Involvement and Comparison, referred to as CAIR for short. In DiMeX, we treat all of them under the umbrella term “association”. Association relations are explicit statements of an association between two entities, as in Examples 1-2. They are often expressed in the literature using words/phrases such as “association”, “correlation”, “link” etc. Regulation relations are those where one entity regulates or controls another. Example 3 presents one such sentence, where several mutations (Thr241Met, 135G>C and E233G) regulate the disease breast cancer (BC). The trigger words/phrases for Regulation relations include “regulate”, “confer”, “promote” etc. Involvement relations indicate one entity’s role in another concept, usually signalled by phrases such as “plays a role in”, “is required for”, “is involved in” etc. Example 4 is one such sentence; it indicates a connection between Leu1074Phe and the risk of prostate cancer. We also consider comparisons, where a connection between two entities is made in the form of a comparison. Comparative sentence structures are often used to assert an association. In biomedical literature, it is quite common for authors to conduct experiments and compare an observation between two groups (e.g. controls vs disease groups). For instance, in Example 5, the two compared entities are wild-type and mutated NDPK-A, and the observed aspect is the effectiveness in promoting neuroblastoma metastasis. Such sentence structures can be used to state a connection between mutations and diseases, S120G and neuroblastoma metastasis in this case.

Example 3: “XRCC3-Thr241Met, RAD51-135G>C, and RAD51D-E233G have been found to confer increased BC susceptibility” (PMID:20054644)

Example 4: “the genetic variant Leu1074Phe in the DNA repair gene WRN might play a role in the risk of prostate cancer.” (PMID:22037268)

Example 5: “Compared with its wild-type, NDPK-A (S120G) appears more effective in promoting neuroblastoma metastasis.” (PMID:15280446)

Mutation Found relations

We detect another type of connection between mutations and diseases, which we named the “Mutation Found” (MF) relation. These relations are described in sentences that mention mutations found in a small set of patients (see Example 6). While we cannot draw conclusions of statistical significance from such findings, they are still noteworthy. We believe that, by making such statements, the authors indicate that the connection between the mutation and the disease may be consequential.

Example 6: “Low frequency KRAS active (G12R) and EGFR kinase domain mutations (G719A) were identified in one NSCLC patient” (PMID:24200637)

Statistical relations

Apart from the CAIR and MF relations described above, we also infer connections between mutations and diseases from sentences that report statistically significant results. The literature often contains sentences that report findings with statistical significance by mentioning a P value or an OR (odds ratio) value. While in many cases sentences mentioning a P or OR value also fit the patterns of the previously mentioned CAIR or Mutation Found (MF) relations, some sentences go undetected as having either a CAIR or an MF relation. So, for a sentence that mentions a mutation and a disease or disease aspect, if we do not detect a CAIR or MF relation but the sentence contains a P or OR value, we mark it as a “statistical sentence”. Our assumption is that when a sentence mentions a P or OR value together with a mutation and a disease, the experimental results indicate a connection between the two. We refer to these relations as “Statistical” relations. Example 7 is a sentence of this kind.

Example 7: “we found a significantly increased overall lung cancer risk for Lys939Gln polymorphism (recessive model: OR = 1.14, 95 % CI = 1.01-1.29, P = 0.218 for heterogeneity).” (PMID:24375193)

Neither our method for extracting associations nor the comparison component was able to detect the connection in this case, so the statistical relation module serves as a backup method.

Co-occurrence in title/conclusion

Finally, the title or a concluding sentence of an abstract usually summarizes the results or key points of the article. Therefore, if the title or conclusion contains mutation and disease mentions, we infer that the authors of the article are making an association between the mutations and the diseases mentioned there, even if the text does not follow the patterns of the above-mentioned relation types. This is especially useful for titles, since titles are often not complete sentences but merely a single noun phrase. Hence a CAIR relation is unlikely to be expressed using the usual patterns. The title in Example 8 is just a noun phrase, and since it is the title of the article, we assume the authors are likely to be discussing their findings regarding the association of IVS1-27G>A with prostate cancer.

Example 8: “KLF6 IVS1-27G>A variant and the risk of prostate cancer in Finland.” (PMID:17125911)

We implemented, evaluated and published an earlier version of DiMeX [14] by extracting the above-mentioned relations. However, in a newer implementation of DiMeX, we took advantage of a general relation extraction system [33] which is capable of identifying relations from text with higher precision, as well as identifying relations with more syntactic variations. Additionally, in the earlier version of DiMeX, we did not differentiate between cases where a mutation is associated with a disease as opposed to associated with an aspect of the disease. For instance, old DiMeX yields the association from Example 2, whereas new implementation of DiMeX can identify that V600E is associated with aspects (prognostic factor and poor clinical outcome) of papillary thyroid cancer.

3.2 Related works

Rapidly evolving sequencing technologies [1,2] have led to a dramatic rise in the number of published articles reporting associations between genomic variations and diseases. It is estimated that over 10,000 articles mentioning such associations are published each year [59]. Manually collecting this information is both expensive and time consuming. Uniprot [15], COSMIC [16], BioMuta [17], OMIM [18], HGMD [19], UMD [20], HGVbaseG2P [21], MutDB [22], dbSNP [23], PharmGKB [24], ClinVar [25], CIViC [57] and InSiGHT [26] are examples of repositories that house mutations and related disease and phenotype information laboriously curated by hand from the literature. However, manual curation cannot keep up with the new information being published every year. To assist this manual curation, several text-mining (TM) efforts [4–13,49,60–63] have been attempted.

While most of the tools perform mutation mention detection [4–13], and some of them extend to associating mutations with genes/proteins [4–6, 13], only a limited number of efforts have been made to find the association between mutations and diseases from the literature. Schenck et al. [27] combine existing TM methods into a workflow to associate mutations with cancers from text with high precision but low recall. PolySearch [28] is a search-based TM tool that infers relationships between mutations and diseases based on their frequency of co-occurrence in Medline abstracts and displays the possible associations with ranked evidence sentences. MutD [29] applies graph traversal to the dependency parse graph representation and makes an association among the mutation, protein and disease if all of them share a common node in the graph. SNPShot [36] and EMU [13] are TM methods that extract mutations from abstracts and couple them with associated diseases based on co-occurrence statistics. Singhal et al. [30] applied a based method to identify the relations between mutations and diseases. Verspoor et al. [31] applied an existing rule-based relation extraction system, PKDE4J [64], to mine relations between variants and diseases using syntactic and semantic features from text. LitVar [32] looks into individual sentences to find co-occurring mutations and diseases and infers relations between them.

There are several publicly available corpora that support the evaluation of mutation-disease extraction tools. EMU annotates two corpora, for breast and prostate cancers, that associate mutations with the target disease. BRONCO [65] contains variants and their associations with cancers, extracted from 108 full-text articles. Variome [53] is a corpus of 10 full-text publications that also associates mutations with diseases, along with other entities. A recent work, SNPPhenA [66], published a corpus for extracting SNP-phenotype associations from texts, annotated with negation, modality, and the confidence degree of such associations.

3.3 Approach

3.3.1 General relation extraction system

We employed a general relation extraction system to extract the relations discussed in the previous section. The relations are stated in text through lexical triggers. The relation extraction system uses the trigger words and syntactic dependencies to extract the predicate-argument relations. Converting the syntactic parse tree into dependencies provides an output closer to the predicate-argument relations. Figure 3.1 shows the Stanford Dependency Graph (SDG) [67] for the sentence “His239Arg is associated with lung adenocarcinoma”. A dependency graph provides a representation of the grammatical relations between words in a sentence. In Figure 3.1, one such dependency triplet is nsubjpass(associated, His239Arg), where the governor and the dependent of the relation are “associated” and “His239Arg”, respectively. The Association relation between “His239Arg” and the disease “lung adenocarcinoma” corresponds to the two syntactic dependents of the lexical trigger “associated”.

Figure 3.1: SDG representation of the sentence “His239Arg is associated with lung adenocarcinoma”.

Parsing allows us to examine sentences at a level of abstraction that ignores many textual variations that are not important for extracting predicate-argument relations. For example, the same dependencies for Association are obtained from “His239Arg is associated with lung adenocarcinoma” as well as from “His239Arg is found to be associated with lung adenocarcinoma”. In both cases, the use of dependency structure provides a uniform representation as far as the connection between “His239Arg” and “lung adenocarcinoma” is concerned.

Since nearly all our relations are binary (between two arguments of a trigger), we assume that calling them arg0 and arg1 will suffice. In our approach, the predicate-argument relations arg0(associate, His239Arg) and arg1(associate, lung adenocarcinoma) are produced for the text in Figure 3.1, as seen in Figure 3.2.

Figure 3.2: arg0 and arg1 edges from the sentence “His239Arg is associated with lung adenocarcinoma”.
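The mapping from syntactic dependencies to numbered arguments can be illustrated with a short Python sketch. It is a simplified stand-in, not the EDG framework itself: the dependency triplets are written by hand rather than produced by a parser, and the `TRIGGER_RULES` table and `numbered_arguments` function are hypothetical names introduced for this example. The edge labels for “associated” and “promotes” follow those named in this chapter.

```python
from typing import Dict, List, Tuple

# (relation, governor, dependent), following the Stanford convention.
Dependency = Tuple[str, str, str]

# Hypothetical rule table: which edge labels lead from a trigger to arg0/arg1.
TRIGGER_RULES: Dict[str, Dict[str, Tuple[str, ...]]] = {
    "associated": {"arg0": ("nsubjpass",), "arg1": ("nmod:with",)},
    "promotes": {"arg0": ("nsubj",), "arg1": ("dobj",)},
}


def numbered_arguments(deps: List[Dependency], trigger: str) -> Dict[str, str]:
    """Return the head words of arg0 and arg1 for a trigger, if present."""
    rules = TRIGGER_RULES.get(trigger, {})
    args: Dict[str, str] = {}
    for relation, governor, dependent in deps:
        if governor != trigger:
            continue
        for name, labels in rules.items():
            if relation in labels and name not in args:
                args[name] = dependent
    return args


# Hand-written triplets for "His239Arg is associated with lung adenocarcinoma".
deps = [
    ("nsubjpass", "associated", "His239Arg"),
    ("auxpass", "associated", "is"),
    ("nmod:with", "associated", "adenocarcinoma"),
    ("case", "adenocarcinoma", "with"),
    ("compound", "adenocarcinoma", "lung"),
]
print(numbered_arguments(deps, "associated"))
```

The arguments returned are head words (here “adenocarcinoma”), so a later step would still have to expand them to the full noun phrase. The sketch also does not handle variants such as “is found to be associated with”, which require the additional normalization that the extended dependency representation provides.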

To provide such a generalized representation and to account for other generalizations, we adopted the Extended Dependency Graph (EDG) framework proposed by Peng et al. [33]. EDG not only considers syntactic dependencies between words in a sentence, but also utilizes information beyond syntax to capture different dependencies. From the syntactic dependencies provided by the Stanford Dependency Graph, arg0 and arg1 (henceforth called numbered argument) dependencies are produced. Table 3.1 illustrates the benefit of EDG by showing 10 example sentences that have different syntactic representations. These sentences all have the same EDG representation between the binding partners, which are the ARG0 and ARG1 arguments of the trigger “bind”. In each case, ARG0 is roughly the agent (causer/doer) and ARG1 roughly corresponds to the theme (the object that is affected).

Table 3.1: Example sentences which have different syntactic representations.

HFE binds to the transferrin receptor
Plasminogen activator inhibitor 1 (PAI) is bound to vitronectin in plasma.
binding of G beta gamma to Raf/330
Raf-1-binding proteins, Ras
p53 binds and activates the xeroderma pigmentosum DDB2 gene in humans
Histone deacetylase 1 can repress transcription by binding to Sp1.
CD5 is a T-cell-specific antigen which binds to the B-cell antigen CD72
TPO binds and activates its receptor, myeloproliferative leukemia virus receptor
The basic cleft of RPA70N binds multiple checkpoint proteins, including RAD9
ARTS binds to a distinct domain in XIAP-BIR3

3.3.2 CAIR relations

Association, Regulation and Involvement relations are similar and are detected from predicate-argument relations, where the mutation and the disease are the arguments of the binary relation. As an example, Association triggers include “association”, “correlation”, “link” etc. The triggers could be verb-based triggers with prepositional phrases (PP) attached to them (e.g. “associated with”, “correlated with/to”, “linked to” etc.), as seen in Example 1. For example, from the syntactic dependencies of the sentence “His239Arg is associated with lung adenocarcinoma” in Figure 3.1, we follow the nsubjpass edge to get arg0 and the nmod:with edge to get arg1. The triggers could also have nominalized forms with arguments attached to them via PPs, as seen in the sentence in Example 2.

In Regulation relations, triggers are verb-based and have noun phrases (NPs) as complements, such as “regulates”, “mediates”, “promotes” etc. Here verb-based rules (active, passive and nominalized) are used to determine the arguments. An example of a Regulation relation from a sentence in the active form is presented in Figure 3.3. In this sentence, we follow the nsubj and dobj edges from the trigger “promotes” to get the arg0 and arg1 edges, respectively.

Figure 3.3: A regulation relation with trigger “promotes”.

Involvement relations have verb-based triggers similar to those of Association or Regulation relations. However, they can have multi-word triggers such as “plays a role in” or “has an effect on”. In these cases, the presence of the noun (role or effect) rather than the verb (play or has) indicates the relation. Consider the example sentence in Figure 3.4. Here we use the nsubj edge and the nmod:in edge from “plays” to determine the arg0 and arg1 edges, respectively. Additionally, we need to check the dobj edge from “plays” to “role” before adding the arg edges.

Figure 3.4: An Involvement relation with trigger “plays a role”.

Comparison relations yield three arguments: two compared entities and the aspect on which they are being compared. The comparative sentence structure varies widely depending on the comparative words used. For instance, Figure 3.5 shows a comparison sentence in which the comparison is marked by the word “than”. In addition, we find the adjective with the comparative adverb (RBR). Here, the arg0 edge points to the compared aspect and the arg1/arg2 edges point to the compared entities, which are the mutation and the wild-type in this case. We adopted the approach of Gupta et al. [68] for relation extraction from comparison sentences with different comparison structures.

Figure 3.5: A comparison sentence marked by the word “than”.

3.3.3 MF relations

Similar to Regulation relations, Mutation Found relations have verb-based triggers with noun phrases (NPs) as complements. The trigger words indicate a “find” relation, with arg0 pointing to the entity that is found/observed and arg1 pointing to the NP indicating where the arg0 entity was found. The triggers for the MF relation include “detect”, “identify”, “find” etc. Figure 3.6 shows the EDG representation of the sentence in Example 6. Here, we get arg0 by following the nsubjpass edge from “identified” and arg1 by following the nmod:in edge.

Figure 3.6: A Mutation Found relation sentence.

3.3.4 Statistical relations

If a sentence mentions mutations and diseases along with statistical significance values (such as a P-value or an OR-value), we infer an association between the mutations and diseases mentioned. If the above CAIR and MF relations do not capture any association from such a sentence, we use regular expression patterns to look for mentions of P-values or OR-values. We then extract the mutation and disease mentions from the sentence to mark the associations.
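A minimal sketch of such a pattern-based check is shown below. The two regular expressions and the `find_statistics` function are assumptions written for illustration; the patterns actually used in DiMeX are not given in the text and are likely more extensive.

```python
import re
from typing import List, Tuple

# Assumed pattern: "P = 0.218", "p<0.05", "P-value = .03"
P_VALUE = re.compile(r"\bP\s*(?:-\s*value)?\s*[=<>≤]\s*(\d*\.?\d+)", re.IGNORECASE)

# Assumed pattern: upper-case "OR = 1.14" (so the word "or" is not matched),
# or the spelled-out form "odds ratio of 2.5".
OR_VALUE = re.compile(
    r"\bOR\s*[=:]\s*(\d+(?:\.\d+)?)"
    r"|(?i:\bodds\s+ratio)\s*(?:[=:]|of)?\s*(\d+(?:\.\d+)?)"
)


def find_statistics(sentence: str) -> Tuple[List[str], List[str]]:
    """Return the P-values and OR-values mentioned in a sentence, as strings."""
    p_values = [m.group(1) for m in P_VALUE.finditer(sentence)]
    odds_ratios = [m.group(1) or m.group(2) for m in OR_VALUE.finditer(sentence)]
    return p_values, odds_ratios


example_7 = ("we found a significantly increased overall lung cancer risk for "
             "Lys939Gln polymorphism (recessive model: OR = 1.14, 95 % CI = "
             "1.01-1.29, P = 0.218 for heterogeneity).")
p_values, odds_ratios = find_statistics(example_7)
is_statistical_sentence = bool(p_values or odds_ratios)
print(p_values, odds_ratios, is_statistical_sentence)
```

A sentence would be marked as a statistical sentence only if it also contains a mutation and a disease mention and no CAIR or MF relation was found for it; those checks are outside the scope of this sketch.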

3.3.5 Co-occurrence in title/conclusion

The title or a concluding sentence of an abstract usually summarizes the results or key points of the article. Therefore, if the title or conclusion contains mutation and disease mentions, we infer an association between the mutations and diseases mentioned. If none of the above relation types captures an association in such cases, we extract the mutation and disease mentions from the sentence to mark the associations. The detection of the rhetorical zone “conclusion” in abstracts is described in Section 3.3.7.

3.3.6 Extracting specific information

Once we obtain the arg0 and arg1 edges from the EDG-based framework, we need to identify the mutations and diseases that participate in the relation in order to assert the mutation-disease association. The arg0 and arg1 edges point to the head words of the corresponding noun phrases. From there, we can access the entire NP as well as NPs that are connected to it via prepositions or conjunctions. We mostly look into these NPs for the target entities; however, we use some additional heuristics to extract the mutation and the disease.

3.3.6.1 Extracting mutations

We look for mutation mentions in either of the noun phrases pointed to by arg0 or arg1. Please note that the NP may not contain specific mutation mentions. Instead, the authors may refer to the mutation with a phrase indicating one or more mutations of a certain gene. Usually such phrases are headed by a word/phrase indicating mutations, such as “mutations”, “polymorphism”, “variants” etc. If we detect such a mutational phrase with a gene name, we extract the referent mutation(s) from the closest sentence where a mutation is already associated with that gene. For instance, in Example 9, arg0 gives us the NP “BRCA1 mutations”, from which we extract the referent, R841W, from the previous sentence in the abstract.

Example 9: BRCA1 mutations cause increased risk for breast and ovarian cancer, frequently of early onset. (PMID:8968716)
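The referent lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the per-sentence gene-to-mutation associations are taken as already computed, the function name `resolve_referent_mutations` is hypothetical, and breaking distance ties in favor of the earlier sentence is an assumption that the text does not specify.

```python
from typing import Dict, List


def resolve_referent_mutations(
    gene: str,
    sentence_index: int,
    associations: List[Dict[str, List[str]]],
) -> List[str]:
    """Find the mutations a generic phrase such as "BRCA1 mutations" refers to.

    associations[i] maps each gene to the mutations already associated with it
    in sentence i. The closest other sentence that mentions a mutation of the
    gene wins; ties go to the earlier sentence (an assumption of this sketch).
    """
    candidates = [
        (abs(i - sentence_index), i)
        for i, by_gene in enumerate(associations)
        if i != sentence_index and by_gene.get(gene)
    ]
    if not candidates:
        return []
    _, closest = min(candidates)
    return associations[closest][gene]
```

For the situation of Example 9, where the previous sentence associates R841W with BRCA1, calling `resolve_referent_mutations("BRCA1", 1, [{"BRCA1": ["R841W"]}, {}])` returns the list containing R841W.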

3.3.6.2 Extracting diseases

Similar to extracting mutations, we look for disease mentions in the argument NPs of the relation trigger. However, sometimes the disease is not explicitly mentioned in the arguments but is referred to by an aspect of the disease. In these cases, we assume that the disease is implicit in the context, and we extract the disease from the abstract to complete the association between the mutations and diseases. To do that, we take the disease mentioned in the Patient Context (PC) sentence (described below) to be the central disease of the study, and hence associate that disease with the mutations. If the disease is not found in a PC sentence, we look for the central disease in other rhetorical zones of the abstract in the following order: title, conclusion sentence(s) and introduction sentence(s).
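The fallback order for the central disease can be written down as a short sketch. The zone keys and the choice of the first mention within a zone are assumptions made for illustration; the text only fixes the order in which the zones are consulted.

```python
from typing import Dict, List, Optional

# Order in which zones are consulted, as described in the text.
ZONE_FALLBACK_ORDER = ["patient_context", "title", "conclusion", "introduction"]


def central_disease(diseases_by_zone: Dict[str, List[str]]) -> Optional[str]:
    """Return the central disease of an abstract, or None if no zone has one.

    diseases_by_zone maps a zone key to the disease mentions found in it.
    Taking the first mention of the first non-empty zone is an assumption.
    """
    for zone in ZONE_FALLBACK_ORDER:
        mentions = diseases_by_zone.get(zone, [])
        if mentions:
            return mentions[0]
    return None
```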

3.3.6.3 Patient Context (PC) sentence

Some sentences in biomedical abstracts describe the subjects (patients) who are part of the study reported in the article. Such sentences mention information such as the total number of participants, the size of the control group, demographic information etc. We call sentences of this type “Patient Mention” sentences. We observed that the first Patient Mention sentence in an abstract is generally the richest in this type of information. We define a sentence as a Patient Context (PC) sentence if it is the first Patient Mention sentence in the abstract. Example 10 is such a PC sentence. We have used a few simple patterns to identify such information in sentences. The patterns look for mentions of quantitative values closely followed by mentions of patients or controls of certain diseases.

Example 10: “A total of 453 breast cancer patients and 382 age- and sex-matched controls from Greece and Turkey were analyzed.” (PMID:15330212)

3.3.7 Extracting additional information

3.3.7.1 Rhetorical zones

We hypothesize that abstracts of biomedical articles have different rhetorical zones that convey different types of information. For example, an experimental result presented towards the conclusion of the abstract is more likely to convey a new finding, whereas information in the introduction is more likely to convey background knowledge. In our approach to extracting mutation-disease associations, we have considered five rhetorical zones for each abstract and utilized the information conveyed by the different zones: Title, Introduction/Background, Methods/Aims, Results, and Conclusion.

If the abstract is a structured abstract and is already sectioned into these rhetorical zones, we detect and use this information to assign the sentences to the corresponding zones. Otherwise, we identify the zone boundaries in the abstract. Previous works [69–74] use different approaches to classify sentences into rhetorical zones or sections. These tools classify each sentence into one of the rhetorical zones, mostly using machine learning based methods that require a significant amount of training data. As we could not download any of these tools, we have developed our own module to perform the task. Our method is based on heuristic rules, which look at the abstract as a whole and set the boundaries of the sections instead of treating each sentence separately. In other words, we check whether or not a sentence marks the boundary of a section. The position of the sentences and certain keywords are used for this purpose. The start of the Introduction section and the end of the Conclusion section are obvious, as they are the beginning and the end of the abstract text itself. For any other pair of consecutive sections, it is sufficient to find the start of the section that follows.
To detect the end of the Introduction section and the start of the Method section, we look for phrases that indicate the goals of the study. These can be indicated by sentences that start with phrases such as “we have analyzed”, “we studied”, “our aim is to”, etc. If no such sentence is found, we assume that the Method section starts after three Introduction sentences. Similarly, to detect the shift from the Method section to Results, we search for sentences that present findings from the study. Phrases such as “we found that”, “the results indicate that”, “our findings exhibit that”, “we have shown that”, etc. are strong indicators of the Results section. Finally, to mark the end of the Results section and the start of the Conclusion section, we look for phrases like “In conclusion”, “We conclude by”, “In summary” etc. that convey concluding remarks. If we fail to find such phrases, we assume that the very last sentence of the abstract forms the Conclusion section. Please note that each section may contain more than one sentence. Table 3.2 shows an example abstract that is split into the five rhetorical zones identified by our system.
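The boundary heuristics above can be sketched in a few lines. This is a simplified illustration and not the module used in the system: the cue lists contain only the phrases quoted in the text, cues are matched anywhere in a sentence rather than only at its start, and an abstract without a Results cue is given an empty Results zone; the last two choices are assumptions of this sketch.

```python
from typing import Dict, List, Optional, Sequence

METHOD_CUES = ("we have analyzed", "we studied", "our aim is to")
RESULT_CUES = ("we found that", "the results indicate that",
               "our findings exhibit that", "we have shown that")
CONCLUSION_CUES = ("in conclusion", "we conclude by", "in summary")


def _first_cue(sentences: Sequence[str], cues: Sequence[str],
               start: int, stop: int) -> Optional[int]:
    """Index of the first sentence in [start, stop) containing any cue."""
    for i in range(start, stop):
        lowered = sentences[i].lower()
        if any(cue in lowered for cue in cues):
            return i
    return None


def split_zones(title: str, sentences: List[str]) -> Dict[str, List[str]]:
    """Assign the sentences of an unstructured abstract to rhetorical zones."""
    n = len(sentences)
    # Conclusion: cue phrase, otherwise the very last sentence.
    c = _first_cue(sentences, CONCLUSION_CUES, 0, n)
    if c is None:
        c = max(n - 1, 0)
    # Methods: cue phrase, otherwise after three Introduction sentences.
    m = _first_cue(sentences, METHOD_CUES, 0, c)
    if m is None:
        m = min(3, c)
    # Results: cue phrase, otherwise empty (assumption of this sketch).
    r = _first_cue(sentences, RESULT_CUES, m, c)
    if r is None:
        r = c
    return {
        "Title": [title],
        "Introduction/Background": sentences[:m],
        "Methods/Aims": sentences[m:r],
        "Results": sentences[r:c],
        "Conclusion": sentences[c:],
    }
```

Only the start of each following section is searched for, which mirrors the observation that the start of the Introduction and the end of the Conclusion coincide with the boundaries of the abstract itself.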

3.3.7.2 Patient related information

For mutation-disease associations, it is helpful to know information related to the patients, such as the size of the experimental patient population and the control population, the race or nationality etc. This additional information is extracted from the literature and associated with the abstract. Patient-related information is commonly present in Patient Context sentences. Consider the Patient Context sentence that we have already presented in Example 10. We extract the following information from Example 10:

• Patients: 453, Controls: 382

• Population: Greece and Turkey

We extract the region of the population or nationality using a pre-compiled list of country names, their adjectival forms and demonyms (names given to residents of a place, e.g., Sri Lankan, Chinese, Peruvian etc.). To detect the patient cohort size, we use predefined regular expression patterns.
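A minimal sketch of this extraction step is given below. The regular expressions and the small place list are hypothetical: the system's actual patterns and its full list of countries, adjectival forms and demonyms are not reproduced in the text.

```python
import re

# A letter-initial word, so another number can never sit between a count and its noun.
WORD = r"[A-Za-z][\w-]*"
# A count followed, within a few words, by "patients"/"cases" or "controls".
PATIENTS = re.compile(rf"\b(\d[\d,]*)\s+(?:{WORD}\s+){{0,5}}?(?:patients|cases)\b", re.IGNORECASE)
CONTROLS = re.compile(rf"\b(\d[\d,]*)\s+(?:{WORD}\s+){{0,5}}?controls\b", re.IGNORECASE)

# Tiny illustrative sample mapping country names and demonyms to a country.
PLACES = {
    "Greece": "Greece", "Greek": "Greece",
    "Turkey": "Turkey", "Turkish": "Turkey",
    "China": "China", "Chinese": "China",
    "Sri Lanka": "Sri Lanka", "Sri Lankan": "Sri Lanka",
}
PLACE_RE = re.compile(r"\b(" + "|".join(sorted(PLACES, key=len, reverse=True)) + r")\b")


def _count(pattern, sentence):
    """First count matched by the pattern, as an int, or None."""
    match = pattern.search(sentence)
    return int(match.group(1).replace(",", "")) if match else None


def extract_patient_info(sentence):
    """Extract cohort sizes and population from a Patient Context sentence."""
    population = []
    for match in PLACE_RE.finditer(sentence):
        country = PLACES[match.group(1)]
        if country not in population:
            population.append(country)
    return {
        "patients": _count(PATIENTS, sentence),
        "controls": _count(CONTROLS, sentence),
        "population": population,
    }
```

Applied to the sentence of Example 10, this sketch yields 453 patients, 382 controls and the population Greece and Turkey, matching the information listed above.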

Table 3.2: An abstract split into rhetorical zones (PMID:10810408).

Rhetorical zone/section: Sentences

Title: Missense alterations of BRCA1 gene detected in diverse cancer patients.

Introduction/Background: The mutations in the breast cancer susceptible gene BRCA1 are responsible for about 50% of inherited breast cancers and confer increased risk of breast and ovarian cancer to its carriers. BRCA1 gene mutations may also be related with other types of cancers such as prostate cancer and colorectal cancer.

Methods/Aims: The goal of this study was to investigate if BRCA1 mutation could be detected in diverse types of cancers. We used PCR-NIRCA and PCR-SSCP methods for screening the BRCA1 mutation hot regions, exons 2, 5, 11, 16 and 20. The positive samples were sequenced to confirm the nature of the mutations.

Results: We have identified a rare sequence variant, A3537G (Ser1140Gly) in a B cell lymphoma patient and two polymorphisms, A1186G (Gln356Arg) in a brain cancer patient and A3667G (Lys1183Arg) in a germline tumor patient.

Conclusion: In conclusion, 3 missense alterations of BRCA1 gene have been identified in cancers other than breast cancer.

3.4 System implementation

The input to the DiMeX system is a list of documents that we obtain from the Medline repository. We also collect the named entities, namely genes, mutations and diseases, from PubTator [75]. We run our mutation detector tool (discussed in the previous chapter) on the documents, which identifies additional mutation mentions and associates the mutations with genes. After that, the documents are passed to the relation extraction system. The text is split and tokenized using the Stanford NLP [76] pipeline. Then we use the Bllip parser [77, 78] to obtain the parse trees and apply the Stanford Conversion tool [76] to get the Stanford Dependency Graph for each sentence in the text. As discussed above, we apply rules corresponding to the different relation types and thus obtain the EDG representation for each sentence. The numbered argument edges in the EDG (arg0, arg1 and optionally arg2) point to the heads of the arguments of the relations. We use the parse trees to extract the argument phrases, each of which is the parent noun phrase (with prepositional attachments) of the argument head. We extract the desired entities of interest in the relation from the arguments.

In order to build a database containing the vast amount of mutation-disease association information already available in free text, DiMeX was applied to a large subset of the PubMed collection. To select the subset, we ran a search on PubMed using the query “(cancer[tiab]) AND (mutation[tiab] OR variant[tiab] OR polymorphism[tiab])” and selected abstracts from 2009 to 2011. This yielded a total of 9873 PMIDs, among which 9727 had abstract text. We applied DiMeX to this set of 9727 abstracts, extracting the mutations, associating them with genes and diseases, and storing the triplets along with the additional information in the database.
It is often the case that a system like DiMeX, which encompasses various modules, is developed and run on a specific set of machines because all the correct modules need to be available on the machine. Thus, the portability of the system is limited. To address this, we have dockerized [58] the entire DiMeX system into one container. This enables us to quickly take the dockerized version of DiMeX to a new server and run it with relative ease. Additionally, it allows for the parallel execution of the system in multiple threads with the help of an external framework, such as Apache Spark [79].

3.5 Evaluation

3.5.1 Evaluation setup

DiMeX’s performance on mutation-to-disease association is evaluated using annotated gold datasets from two different sources. The first dataset is BioMutaC, already described in Chapter 2. It contains 62 abstracts with 119 mutation-gene-disease association triplets. There are no previously published results on the BioMutaC set. So, in order to compare DiMeX’s performance with already published results, we have used a second collection of two publicly available datasets, namely PCa-filtered-UD and BCa-filtered-UD, corresponding to abstracts on prostate cancer (PC) and breast cancer (BC), respectively. There are 97 and 132 abstracts in PCa-filtered-UD and BCa-filtered-UD, respectively. Originally, we wanted to evaluate DiMeX on the exact datasets PCa-filtered (113 abstracts) and BCa-filtered (147 abstracts) that were used in Doughty et al. [13], giving us a chance to directly compare performance with already published results. However, instead of the exact datasets, we received from the authors two larger datasets and the filtering criteria to regenerate the datasets they had used. Using the filtering criteria provided by the first author, we generated PCa-filtered-UD and BCa-filtered-UD. The details of the creation of PCa-filtered-UD and BCa-filtered-UD can be found in [14]. There are 170 and 216 mutation-gene-disease association triplets in PCa-filtered-UD and BCa-filtered-UD, respectively. We have applied DiMeX and EMU to these two datasets so that the evaluation results are directly comparable.

3.5.2 Evaluation metrics

We counted true positives (TP), false positives (FP), and false negatives (FN), and used the standard information retrieval metrics of Precision (P), Recall (R), and F-measure (F) for performance evaluation, where P = TP/(TP+FP), R = TP/(TP+FN) and F = 2PR/(P+R).
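The three metrics follow directly from the counts, as the short sketch below shows. The function names are chosen for illustration, and returning 0.0 when a denominator is zero is a convention assumed here rather than one stated in the text.

```python
def f_measure(precision, recall):
    """F = 2PR / (P + R); defined as 0.0 when both P and R are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(tp, fp, fn):
    """Compute (P, R, F) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)
```

As a consistency check on the formula, a precision of 0.95 and a recall of 0.88 give F = 2(0.95)(0.88)/(0.95 + 0.88) ≈ 0.91.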

3.6 Results and discussion

3.6.1 Results on annotated datasets

Table 3.3 lists the evaluation results for mutation-disease associations for the BioMutaC, PCa-filtered-UD and BCa-filtered-UD sets.

Table 3.3: DiMeX’s performance on mutation-disease association on different datasets and comparison with EMU’s performance.

Dataset            DiMeX performance      EMU performance
                   P      R      F        P      R      F
BioMutaC           0.87   0.89   0.88     -      -      -
PCa-filtered-UD    0.95   0.88   0.91     0.76   0.75   0.75
BCa-filtered-UD    0.93   0.85   0.89     0.64   0.71   0.67

The F-measure on BioMutaC is 0.88, with P and R about the same. Analysis of the false positive (FP) and false negative (FN) cases revealed that most of the errors are due to gene mention and disease detection problems. Only in a handful of cases were the mistakes made in mutation detection, mutation-gene association or mutation-disease association. The main focus of our work, mutation-disease association, failed in only two cases.

DiMeX achieved F-measures of 0.91 and 0.89 on the PCa-filtered-UD and BCa-filtered-UD sets, respectively, with the PCa-filtered-UD set yielding higher precision and recall. EMU yielded F-measures of 0.75 and 0.67 on the PCa-filtered-UD and BCa-filtered-UD sets, respectively. We performed a paired t-test to check for statistical significance on these datasets (p = 0.002 and p = 0.00007, respectively). Please note that these two datasets were only annotated for prostate cancers (PC) and breast cancers (BC). Therefore, extracted triplets with any disease other than PC or BC were not considered in our evaluation.

Since BCa-filtered-UD showed slightly lower precision (0.93), recall (0.85) and F-measure (0.89) than PCa-filtered-UD, we analyzed the errors on this set. Similar to the BioMutaC set, analyzing the BCa-filtered-UD results revealed that most FPs and FNs are due to mistakes in gene or disease detection. There are six cases of mutations being erroneously detected, mostly mutations described in regular text rather than in the standard formats used for mutations. In four cases, the mutation detection component missed the mutations entirely, contributing to FNs. For example, in PMID:10207667, the phrase “This germ line mutation leads to the replacement of isoleucine by asparagine” gives the wild type and the mutant type, but the codon position is mentioned in the previous sentence. There were five cases of a wrong gene being associated with a mutation. Similarly, there were several cases of mutation-disease associations being erroneously inferred, mostly because multiple diseases were mentioned in the context of the abstract and our extraction technique attached the wrong disease to the mutations. DiMeX also failed to extract a handful of mutation-disease associations, which contributed to the lower recall. Many of these associations were described using patterns that were not part of our pre-compiled list of sentence patterns. The PCa-filtered-UD dataset showed a similar distribution of errors for the false positives and false negatives.

3.6.2 Full-scale processing

We applied DiMeX to a set of 9727 abstracts (a subset of the Medline collection), extracting the mutations, associating them with genes and diseases, and storing the triplets along with the additional information in the database. Table 3.4 lists some of the key characteristics of the extractions from this subset.

Table 3.4: Characteristics of the extracted results of the large scale run.

Characteristics                        Counts
Abstracts                              9727
Abstracts with at least one triplet    2511
Total triplets                         7175
Unique triplets                        6410
Unique mutations                       3204

3.7 Conclusion

We have developed a system (DiMeX) that finds associations between mutations and diseases in the abstracts of published articles. Evaluations on several corpora show that our system performs well and that it outperforms EMU on the two datasets where both systems were applied. Additionally, we have applied the DiMeX system to a large set of Medline abstracts. The results of the DiMeX system are stored in a database. The entire system implementation is dockerized for better portability, which also helps in running the system in a parallel fashion.

Chapter 4

IMPACT OF GENOMIC ANOMALIES ON DRUG RESPONSES

4.1 Introduction

Tumor molecular profiling plays an integral role in identifying genomic anomalies which may help in personalizing cancer treatments, improving patient outcomes and minimizing risks associated with different therapies. However, critical information regarding the evidence of clinical utility of such anomalies is largely buried in biomedical literature. In an effort to improve and speed up the process of manually reviewing and extracting relevant information from literature, we have developed a text mining system called eGARD (extracting Genomic Anomalies association with Response to Drugs). This system is designed to extract relations between genomic anomalies and drug responses from scientific abstracts. Example 1 provides a sample sentence that captures the information eGARD is designed to extract.

Example 1: “Low expression of Bax was significantly associated with poor survival of patients with metastatic or recurrent gastric cancer treated with FOLFOX chemotherapy.” (PMID: 20503071)

By genomic anomaly, we mean not only mutations but also differential expression of genes and proteins. For instance, in Example 1, the genomic anomaly is given by the phrase “low expression of Bax”. In addition to expression levels, eGARD considers genetic variants, including substitutions, duplications, insertions, deletions, gene copy number variations and structural variants. eGARD captures phrases that express anomalies, such as “Over-expression of ERCC1”, “Patients with high RRM1”, “PAX4 variant rs6467136” and “IL28B polymorphisms”.

In order to capture the effects of genomic anomalies on responses to treatment, we are broadly interested in phrases such as response rate, change in sensitivity to drugs, and outcome of treatment, which includes overall survival and progression-free survival. While there could be more terms that represent the effects, these concepts are of key concern to precision oncology, which explains their prevalence in the associated literature. We will refer to them as RO entities (RO stands for Response or Outcome) in this chapter. How RO entities are detected is described later, in Section 4.3.3 (“Entity recognition”). Thus, Example 1 associates the RO entity “poor survival” with the anomaly entity, Bax’s low expression level. Clearly, the information about the response/outcome is incomplete without the extraction of the drug/treatment and the disease for which this treatment is being used. In this example, eGARD detects that they are given by “FOLFOX chemotherapy” and “metastatic or recurrent gastric cancer”. While all four components detected by eGARD are mentioned in the same sentence in this example, this is not always the case, and sometimes the disease and the drug have to be inferred from other sentences in the abstract.

We interpret the relations between genomic anomaly and drug response as the anomaly having an impact on the drug treatment in the context of a disease. Such impact information can be conveyed in text in various ways. One common way is to express the relation as an “association”, as seen in Example 1. Associations are often implied by comparative statements, as exemplified by Example 2. In Example 2, the two compared entities are ERCC1-negative and ERCC1-positive patients, and these groups are compared for two types of survival (PFS and OS). Thus, from this sentence, eGARD’s extraction suggests that the ERCC1 level of patients on paclitaxel therapy has an impact on survival.

Example 2: “ERCC1-negative patients had better PFS (P = 0.016) and OS (P = 0.030) compared with positive patients.” (PMID:25107571)

Next, we consider cases where there is no explicit comparison. Instead some sentences simply present a quantitative value of an RO entity for multiple and often contrastive Anomaly entities. When an expression of a P-value follows the quantities, we hypothesize that there is an implicit comparison being made, and as before we take comparisons to indicate some sort of association. Example 3 below illustrates such a case where the outcome is mentioned in terms of response rates and is connected to ERCC1 levels.

Example 3: “Among cohort 2, the response rates of patients with low ERCC1 and high ERCC1 expressions were 45.5% and 20.0% respectively (P = 0.361).” (PMID:23358102)

Example 4 illustrates a different type of connection between an anomaly and patient outcome. Although this connection need not be an impact of the given anomaly, the sentence still conveys a connection between low ERCC1 level and survival. In designing eGARD, we are interested in such statements where the anomaly is stated to serve as a biomarker or indicator of outcome.

Example 4: “Multivariate analysis showed that low expression of ERCC1 was an independent predictor for prolonged survival (HR, 0.120; 95% CI, 0.016-0.934, P = 0.043).” (PMID:18594541)

eGARD also extracts information that connects genomic anomalies with the efficacy of a treatment. For example, the sentence in Example 5 shows the relation between a genomic anomaly (ATM deficiency) and the sensitization to a treatment (“poly (ADP-ribose) polymerase-1 inhibitors”).

Example 5: “ATM deficiency sensitizes mantle cell lymphoma cells to poly (ADP-ribose) polymerase-1 inhibitors.” (PMID:20124459)

4.2 Related works

Rapidly evolving molecular profiling technologies have enabled improved detection of alterations in genomic biomarkers that predict response to cancer treatments. This in turn has led to a dramatic rise in the number of studies analyzing the effects of tumor-specific alterations on drug response. Nevertheless, the large volume and complexity of cancer precision medicine literature makes it challenging for busy oncologists and clinical researchers to sort through vast amounts of data and review pertinent information that can inform personalized treatment plans for their patients. Several large scale consortiums such as ClinGen [80], ClinVar [25], My Cancer Genome [81], and CIViC [57] have ongoing efforts to standardize and organize large scale information linking genomic variants to phenotypic data to drive precision medicine research. UniProt [15], BioMuta [17], OMIM [18], UMD [20], HGVbaseG2P [21], MutDB [22] and dbSNP [23] are a few other repositories that house mutations as well as related disease and phenotype information. PharmGKB [24] curates genetic variants and their impact on drug response and diseases. However, all these efforts are based on meticulous manual curation of literature by experts, which is labor intensive and expensive. While expert curated data is highly accurate, the simple task of searching the PubMed database and sorting through thousands of non-relevant papers in order to identify the relevant ones can be extremely time consuming.

With the realization of the importance of biomedical text mining, numerous tools have been developed for various purposes [82, 83]. Several text mining tools extract information from the literature in the pharmacogenomics area, too. Such tools need to first identify mentions of biomedical entities in the literature.
Currently available tools extract different biological entities [75] such as genes [84], diseases [85], chemicals [86], mutations [11] and species [87]. Additional tools have been developed to identify relationships between these entities, such as mutation-disease relations [30, 88], from the scientific literature. Certain tools in the pharmacogenomics domain identify relationships between drugs and other entities. SNPshot [36] finds binary relations between entities, such as mutation-drug, allele-drug and gene-drug, using the co-occurrence information of entities along with parse tree and keyword matching. HiPub [89] finds relations between entities using sentence-level co-occurrence and information from external databases such as PharmGKB and DrugBank [90]. Xu et al. [37] developed a conditional relationship extraction approach to extract drug-gene pairs from MEDLINE abstracts using known drug-gene pairs as prior knowledge. Rance et al. [38] used a co-occurrence based approach to automatically extract mutation-drug relations. Rinaldi et al. [39] adapted an existing protein-protein extraction tool to the extraction of pharmacogenomics relations. Pakhomov et al. [40] proposed to use the positive and negative labels for drug-gene relations from PharmGKB as training data to build a support vector machine classifier to predict drug-gene relationships. Garten et al. [35] developed a text mining tool named Pharmspresso for extracting pharmacogenomics concepts and relationships from full text. Lee et al. [91] extracted mutation-drug relations from literature using machine learning classifiers such as random forest and deep convolutional neural network, with word vectors trained on PubMed abstracts and Google News articles. Additionally, review articles from Garten et al. [92] and Coulet et al. [93] provide a rich description of the state of the art in the extraction of pharmacogenomics information from the biomedical literature.

4.3 Approach

4.3.1 Different information types

The information regarding the connection between genomic anomaly and drug responses is expressed in text in various ways. Our approach is to detect certain lexico-syntactic dependency structures in sentences to extract the relations between an Anomaly entity and an RO entity. Based on our observations, we have identified several ways in which such relations can be conveyed. They are described below along with the methodology used to extract such relations.

4.3.1.1 Association

Some sentences mention the relationship between genomic anomaly and drug responses in the form of an association. Usually there is a “trigger” word that indicates the association relation between an Anomaly entity and an RO entity. In Example 1, the trigger word is “associate” itself, and in Example 6 below, the trigger word is “correlate”. There are several other trigger words (such as “relationship”, “contribute”, “(play a) role”) that we use for this sentence structure, with several textual variations for each trigger.

Example 6: “The MGMT expression was inversely correlated with response to temozolomide.” (PMID: 20130512)

As in Examples 1 and 6, the Anomaly entity and the RO entity serve as syntactic arguments of the trigger word. The syntactic relations are those of subject and object of the trigger, and the trigger word is a verb. Our rules for the association type of sentence structure look for the syntactic pattern: [argument] “trigger word” [argument]. Our system first identifies sentences which contain a trigger word. It then checks whether the requisite syntactic arguments of the trigger are an Anomaly entity and an RO entity. If they are, the pair is extracted.
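The argument check for association sentences can be sketched as follows. The sketch assumes that parsing and entity recognition have already produced a trigger lemma with a typed subject and object; the `Argument` and `Clause` classes, the type labels and the lemma list are hypothetical simplifications introduced for illustration, not the system's data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Trigger lemmas named in the text; the system also uses textual variations.
ASSOCIATION_TRIGGERS = {"associate", "correlate", "relationship", "contribute", "role"}


@dataclass
class Argument:
    text: str
    entity_type: Optional[str]  # "ANOMALY", "RO" or None (hypothetical labels)


@dataclass
class Clause:
    trigger_lemma: str
    subject: Argument
    obj: Argument


def extract_association(clause: Clause) -> Optional[Tuple[str, str]]:
    """Return (anomaly, RO) if the trigger links an Anomaly entity and an RO entity."""
    if clause.trigger_lemma not in ASSOCIATION_TRIGGERS:
        return None
    arguments = (clause.subject, clause.obj)
    if {a.entity_type for a in arguments} != {"ANOMALY", "RO"}:
        return None
    anomaly = next(a for a in arguments if a.entity_type == "ANOMALY")
    outcome = next(a for a in arguments if a.entity_type == "RO")
    return anomaly.text, outcome.text
```

For Example 6, a clause with trigger lemma “correlate”, subject “The MGMT expression” (Anomaly) and object “response to temozolomide” (RO) would yield that pair.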

4.3.1.2 Comparison

As we already mentioned in section 4.1, an association between an Anomaly entity and an RO entity can be expressed in the form of a comparison. These types of sentences are very common in biomedical literature, especially in the pharmacogenomics domain. Unlike the association relation, there are three entities involved in the comparison: an observed entity and two compared entities. We are interested in comparison sentences where the observed entity is an RO entity and the two compared entities are related to the Anomaly entity (as in Example 2, which is duplicated below) or vice-versa.

Example 2: “ERCC1-negative patients had better PFS (P = 0.016) and OS (P = 0.030) compared with positive patients.” (PMID:25107571)

In Example 2, the observed entity is indicated by the phrase “better PFS (P = 0.016) and OS (P = 0.030)”, and the two compared entities are “ERCC1-negative patients” and “positive patients”, respectively. Therefore, this comparison marks an association between ERCC1 expression level and survival of patients.

To recognize and extract information from comparisons, we look for two clues. First, we consider the trigger to be a comparative adjective (with the part of speech JJR in the parse structure) such as higher, lower, greater, better etc. (“better” in Example 2). Then, we look for phrases such as “compared with”, “in comparison to”, “compared to”, “than” and “versus”, which often separate the two compared entities.

The comparative adjective is connected to the observed entity in one of two ways. It can modify the observed entity, which appears as a noun to its right, as in Example 2. The syntactic pattern that corresponds to Example 2 is: [compared entity 1] “trigger” [observed entity] comparison phrase [compared entity 2]. Alternatively, the comparative adjective can appear as the head predicate of a sentence, as in Example 7; in these cases, its subject (TS and TP mRNA levels in Example 7) is extracted as the observed entity. The syntactic pattern that corresponds to Example 7 is: [observed entity] in [compared entity 1] “trigger” comparison phrase [compared entity 2]. Note that in either case, one of the two compared entities appears immediately after the comparison phrase (“compared with”, “compared to”, “than” etc.). The other compared entity is either the subject, when the JJR modifies a noun, or an adjunct of the subject, when the JJR appears as the head predicate (Example 7).

Example 7: “TS and TP mRNA levels in the patients with complete response, partial response or stable disease (n=34) were significantly lower compared to those in the patients with progressive disease (n=11) (p=0.017 and p=0.04, respectively).” (PMID:22783377)

There are several other variations in the comparison structures, including cases where only one of the compared entities is explicitly mentioned in the sentence. In such cases, the implicit compared entity is usually implied by the context. Consider the sentence in Example 8, where one of the compared entities is missing from the sentence but is implicitly referred to as the group with a high level of TS.

Example 8: “However, the group with low level of TS had a longer DFS (144 mo versus 83 mo, P=0.017).” (PMID:17854149)
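The two surface clues for comparisons, a comparative adjective and a comparison phrase, can be approximated with a short sketch. It is a deliberately shallow stand-in: a fixed word list replaces the JJR part-of-speech tag, plain string matching replaces the dependency structure, and the compared entity is taken to be the text that follows the comparison phrase up to the next punctuation mark. All names are hypothetical. Sentences such as Example 8, where “versus” separates values inside parentheses and one compared entity is implicit, are outside what this sketch handles.

```python
import re

# Stand-in for tokens tagged JJR by a part-of-speech tagger.
COMPARATIVES = ("higher", "lower", "greater", "better", "longer", "shorter", "worse")
COMPARISON_PHRASES = ("compared with", "in comparison to", "compared to", "than", "versus")

TRIGGER_RE = re.compile(r"\b(" + "|".join(COMPARATIVES) + r")\b", re.IGNORECASE)
# The comparison phrase, then the text up to the next bracket or punctuation mark.
PHRASE_RE = re.compile(
    r"\b(" + "|".join(COMPARISON_PHRASES) + r")\b\s*([^(),.;]+)", re.IGNORECASE
)


def find_comparison(sentence):
    """Return the trigger, comparison phrase and following compared entity, or None."""
    trigger = TRIGGER_RE.search(sentence)
    if trigger is None:
        return None
    phrase = PHRASE_RE.search(sentence, trigger.end())
    if phrase is None:
        return None
    return {
        "trigger": trigger.group(1).lower(),
        "phrase": phrase.group(1).lower(),
        "compared_entity": phrase.group(2).strip(),
    }
```

The other compared entity and the observed entity depend on whether the comparative modifies a noun or heads the predicate, which requires the parse and is therefore left out here.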

4.3.1.3 Biomarker

A different type of connection is possible, where the genomic anomalies are stated to be markers for drug responses. These connections are triggered by words such as “predictor”, “biomarker”, “marker”, “indicator” etc. For instance, the sentence in Example 4 (duplicated below) is of this type, with the trigger word “predictor”.

Example 4: “Multivariate analysis showed that low expression of ERCC1 was an independent predictor for prolonged survival (HR, 0.120; 95% CI, 0.016-0.934, P=0.043).” (PMID:18594541)

A key requirement for these relations is the presence of an “is-a” verb group. We have adopted the approach used in miRiaD [94], which uses “is a”, “are”, “acts as”, “functions as”, “serves as” and appositives as triggers for the “is-a” relation. The Anomaly entity is found as the subject of “is-a” and the “marker” trigger is its object. The RO entity is found as the noun modifier of the trigger, often linked with the preposition “for”. Using this rule, from Example 4 we can extract “low expression of ERCC1” as the Anomaly entity (from which the gene and anomaly are obtained) and “prolonged survival” as the RO entity.

Given the nature of copular sentences, we can detect variations of this type of sentence where the trigger is an adjective (JJ) instead of a noun, such as “predictive” or “indicative”. Example 9 presents one such sentence, where the trigger is the adjective “predictive”.

Example 9: “Furthermore, concomitant low expression levels of ERCC1, RRM1, and RRM2 and the high expression level of BRCA1 were predictive of a better outcome (P=0.014).” (PMID:25227663)

4.3.1.4 Sensitization

eGARD also extracts information that connects genomic anomaly with the efficacy of a drug treatment. One variation of such information can be expressed similarly to association relations, where words like “resist” or “sensitive” appear in their noun form as RO terms in the association relations, as in Example 10.

Example 10: “These results indicate that enhanced MGMT expression contributes to TMZ resistance in MGMT-positive GSCs.” (PMID:23958055)

However, the verb forms of these words are also often used in sentences that connect them with Anomaly entities. These verbs appear as the head predicate of the clause with the Anomaly entities as their subjects. Our rules for extraction look for the following syntactic pattern: [Anomaly entity] sensitizes [disease cells] to [drug/treatment], as shown in Example 5 (duplicated below). That is, the trigger appears as a verb (VBN) with the Anomaly entity as its subject and the disease cells as its direct object. The drug or treatment phrase appears as a prepositional phrase modifying the trigger with the preposition “to”.

Example 5: “ATM deficiency sensitizes mantle cell lymphoma cells to poly (ADP-ribose) polymerase-1 inhibitors.” (PMID:20124459)

Sometimes, the disease cells that are sensitized are not mentioned in the sentence. In such cases, as discussed later, the disease is inferred from the context.

4.3.2 Syntactic processing

We employed sentence simplification by Yifan et al. [95] in some cases to reduce complex sentence syntax to simpler forms. This facilitates the extraction of relations with simpler and more uniform patterns instead of applying complex patterns. For instance, consider the sentence in Example 11.

Example 11: “High MDR1 and ERCC1 gene expressions are associated with inferior outcome after cisplatin-based adjuvant chemotherapy for locally advanced bladder cancer”. (PMID: 20689757)

The simplification step will render these simplified sentences: “High MDR1 gene expressions are associated with inferior outcome” and “High ERCC1 gene expressions are associated with inferior outcome”. Thus, by applying one simple pattern, we can extract both relations without explicitly handling conjunctions in the pattern. We used BioNex [55] for tokenization and for parsing sentences into chunks of base noun phrases (NP) and base verb groups (VG). When consecutive base NPs are connected via prepositions, conjunctions or punctuation marks, we merged them to form larger NPs. Similarly, consecutive base VGs were merged into larger VGs. For example, the BioNex system will parse the sentence in Example 11 into the base phrases NP(High MDR1), NP(ERCC1 gene expressions), VG(are associated), NP(inferior outcome), NP(cisplatin-based adjuvant chemotherapy) and NP(advanced bladder cancer). The merged phrases will be NP(High MDR1 and ERCC1 gene expressions), VG(are associated) and NP(inferior outcome after cisplatin-based adjuvant chemotherapy for locally advanced bladder cancer).
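The merging of base phrases can be illustrated with a minimal Python sketch. It is not the BioNex implementation: the Chunk class, the CONNECTORS set and the merge_chunks function are hypothetical names introduced only for illustration, and the connector list is an illustrative subset. The input is assumed to be a sequence of base phrases in which tokens outside any base phrase carry the label "O".

```python
from dataclasses import dataclass
from typing import List

# Illustrative subset of prepositions, conjunctions and punctuation
# that may join consecutive base NPs (an assumption, not the real list).
CONNECTORS = {"of", "and", "or", "after", "for", "in", ","}


@dataclass
class Chunk:
    label: str  # "NP", "VG", or "O" for tokens outside any base phrase
    text: str


def merge_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Merge base NPs joined by connectors, and adjacent base VGs."""
    merged = []
    i = 0
    while i < len(chunks):
        current = Chunk(chunks[i].label, chunks[i].text)
        i += 1
        if current.label == "NP":
            while True:
                # Collect connector tokens that directly follow the NP.
                j = i
                connectors = []
                while (j < len(chunks) and chunks[j].label == "O"
                       and chunks[j].text.lower() in CONNECTORS):
                    connectors.append(chunks[j].text)
                    j += 1
                # Merge only if the connectors lead to another base NP.
                if connectors and j < len(chunks) and chunks[j].label == "NP":
                    current.text = " ".join([current.text, *connectors, chunks[j].text])
                    i = j + 1
                else:
                    break
        elif current.label == "VG":
            # Consecutive base VGs are merged into one larger VG.
            while i < len(chunks) and chunks[i].label == "VG":
                current.text += " " + chunks[i].text
                i += 1
        merged.append(current)
    return merged
```

Applied to a shortened version of Example 11 (ending at “chemotherapy”), this sketch joins the two subject NPs across “and” and joins the outcome NP with the chemotherapy NP across “after”, while leaving the preposition “with” outside the merged phrases.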

4.3.3 Entity recognition

To detect genes and diseases, we used annotations from PubTator [75]. The gene mentions were normalized to Entrez IDs. The disease mentions are normalized to MeSH IDs by PubTator, which we in turn mapped to DOIDs in the Disease Ontology (DO) [96] database. We noticed some mistakes in the disease tagging of PubTator, such as AR being commonly tagged as a disease although its full form, also mentioned in the abstract, is the gene Androgen Receptor. The system was able to automatically rectify this type of problem using the acronym detector. The acronym detector detects AR as a short form of Androgen Receptor, which in turn is detected as a gene. By looking at the full form detected by the acronym detector, the system was able to discard AR as a disease and consider it a gene mention. For drug detection, we used a custom list of drugs provided to us by a domain expert.

Genomic anomalies can be either mutations or changes in gene expression levels. Mutations were detected using our previously built tool used in the system DiMeX [14], which also provides mutation to gene associations. We developed a module to detect gene expression level mentions. Usually, expression terms appear along with the corresponding gene name in the same noun phrase, with the head word of the NP indicating an expression (e.g., expression, overexpression, inhibition, deficiency, levels etc.). Sometimes the expression terms are connected to the gene name via the preposition “of” (e.g., “high expression of TS”). We also detect NPs as expression entities if they are headed by gene names and modified by level indicators like “low” or “high” (e.g., “low TS”, where TS is a gene name). Additionally, phrases that indicate expression levels with numeric values along with a gene name, such as “TS<= 7.5x10(-3)”, are also detected as expression entities.

The RO entities are detected as NPs headed by words that indicate response or outcome. Based on our observations, we have identified several such words, such as “survival”, “prognosis”, “outcome”, “response”, “efficacy” etc., because the NPs headed by these words represent an outcome when the drug(s) are administered. Thus, these NPs represent our definition of drug responses. Some examples of such detected NPs are “longer survival”, “higher response”, “poor efficacy”, “overall increased response” etc. We have also used a list of fixed phrases (and their corresponding acronyms, if applicable) that indicate RO entities. Some examples of such phrases are “progression free survival”, “PFS”, “overall survival”, “OR”, “objective response rate” etc.
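The three lexical cues for expression entities described in this section (an expression head word, an expression term linked to a gene via “of”, and a gene head modified by a level indicator) can be sketched in Python as follows. This is a simplified illustration rather than the eGARD module: the function name is_expression_np and the two word lists are assumptions, gene names are assumed to be supplied by an upstream tagger, and numeric forms such as “TS<= 7.5x10(-3)” are not handled.

```python
from typing import List, Set

# Illustrative word lists taken from the examples in the text (assumed subsets).
EXPRESSION_HEADS = {"expression", "expressions", "overexpression",
                    "inhibition", "deficiency", "level", "levels"}
LEVEL_INDICATORS = {"low", "high"}


def is_expression_np(tokens: List[str], gene_names: Set[str]) -> bool:
    """Decide whether a tokenized NP looks like a gene expression entity."""
    if not tokens or not any(t in gene_names for t in tokens):
        return False
    lowered = [t.lower() for t in tokens]
    # Case 1: the head (rightmost word) is an expression term, e.g. "ERCC1 expression".
    if lowered[-1] in EXPRESSION_HEADS:
        return True
    # Case 2: an expression term linked to a gene via "of", e.g. "high expression of TS".
    for i, word in enumerate(lowered):
        if (word == "of" and i > 0 and lowered[i - 1] in EXPRESSION_HEADS
                and any(t in gene_names for t in tokens[i + 1:])):
            return True
    # Case 3: a gene head modified by a level indicator, e.g. "low TS".
    if tokens[-1] in gene_names and any(w in LEVEL_INDICATORS for w in lowered[:-1]):
        return True
    return False
```

A bare gene mention such as “TS” is deliberately rejected by this sketch, since none of the three cues is present.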

4.3.4 Typing of phrases

Once the noun phrases (NPs) are obtained from the parser, we categorize them depending on the entities that appear within them, such as gene, disease, mutation etc. We call this step typing of phrases, as each NP is assigned an entity type. The entity type is assigned based on occurrences of certain entities or keywords at the head of the NP. We took the rightmost word of an NP as the head. If multiple base NPs were merged together due to prepositional phrase attachments to form one NP, we consider the head of the leftmost constituent NP to be the head of the entire merged NP. For instance, consider the sentence in Example 12. The first NP (“High expression of thymidylate synthase”) is typed by the head word (“expression”) of its first constituent NP (“High expression”), which refers to the expression type. Likewise, the second NP (“the drug resistance of gastric carcinoma to high dose 5-fluorouracil-based systemic chemotherapy”) is typed by the head word of its first constituent NP, “resistance”.

Example 12: “High expression of thymidylate synthase is associated with the drug resistance of gastric carcinoma to high dose 5-fluorouracil-based systemic chemotherapy.” (PMID: 9576280)

However, in certain cases, the head of the leftmost constituent NP was not conclusive for any type, such as when the head word is “patient” or “group”. In such cases, we consider the head of the next constituent NP, which modifies it, to determine the type of the NP. For example, in the text excerpt “patients with low ERCC1 expression showed a significantly higher rate of good tumor response” (PMID: 25674147), the type of the NP “patients with low ERCC1 expression” was determined from the next constituent NP, “low ERCC1 expression”, rather than from “patients”. Using the same approach, NPs were assigned the other entity types.
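The head-selection rule for typing can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the HEAD_TYPES lexicon and its type labels ("Expression", "Response") are placeholders rather than the label set of the system, the function names are hypothetical, and the loop generalizes the rule slightly by skipping every inconclusive head instead of only the first.

```python
from typing import List, Optional

# Heads that are not conclusive for any type, as named in the text.
INCONCLUSIVE_HEADS = {"patient", "patients", "group", "groups"}

# Hypothetical head-word lexicon with placeholder type labels.
HEAD_TYPES = {
    "expression": "Expression",
    "overexpression": "Expression",
    "resistance": "Response",
    "survival": "Response",
}


def head_of(base_np: str) -> str:
    """The head of a base NP is taken to be its rightmost word."""
    return base_np.split()[-1].lower()


def type_of(constituents: List[str]) -> Optional[str]:
    """Type a merged NP from its constituent base NPs, given left to right."""
    for base_np in constituents:
        head = head_of(base_np)
        if head in INCONCLUSIVE_HEADS:
            continue  # fall through to the next constituent NP
        return HEAD_TYPES.get(head)
    return None
```

For “patients with low ERCC1 expression”, the constituents are “patients” and “low ERCC1 expression”, so the type is decided by the head “expression” of the second constituent.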

4.3.5 Pattern matching

In the discussion of the different sentence types above, we already outlined some of the syntactic patterns that we match against text. In this section, we formally define the patterns and discuss the matching process with an example. Let’s reconsider the syntactic pattern for sensitization that we introduced earlier: < Anomaly > sensitizes < Disease cell > to < Drug >. This pattern can be formally broken down and written as:

1. sensitize VG has subj Anomaly NP
2. sensitize VG has obj Disease cell NP
3. sensitize VG nmod to Drug NP

The common element that binds the entire pattern together is a verb group (VG) headed by the word “sensitize”. So, to match the pattern in text, we first search for a (possibly merged) VG with head “sensitize”. As we already have the types of NPs (from the typing of phrases step), the next step is to ensure that the has subj constraint is met by finding an NP of type Anomaly that is the subject of the detected sensitize VG. This is done by checking whether an NP of type Anomaly appears immediately to the left of the sensitize VG when this verb group is in active form (as detected by the parser). Similarly, the has obj constraint is ensured by looking for an NP of type Disease cell immediately to the right of the sensitize VG. Finally, the nmod to constraint is ensured by searching for an NP of type Drug that modifies the sensitize VG and is connected to it as a prepositional phrase via “to”. Similarly, the syntactic pattern for one of the association type sentences, < Anomaly > associate < RO >, can be formally written as follows, and it is matched in text using the same technique.

1. associate VG has subj Anomaly NP
2. associate VG has obj RO NP
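The matching procedure for the sensitization pattern can be sketched in Python over a sequence of typed phrases. This is an illustrative sketch, not the eGARD code: the Phrase class, the is_np helper and the match_sensitize function are hypothetical names, the input is assumed to be already chunked, typed and in active voice, and adjacency in the phrase sequence stands in for the subject and object relations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Phrase:
    kind: str        # "NP", "VG", or "PREP"
    text: str
    head: str = ""   # lemmatized head word (used for VGs)
    etype: str = ""  # entity type from the typing step (used for NPs)


def is_np(phrase: Phrase, etype: str) -> bool:
    return phrase.kind == "NP" and phrase.etype == etype


def match_sensitize(phrases: List[Phrase]) -> Optional[Dict[str, str]]:
    """Match Anomaly NP, sensitize VG, Disease cell NP, then "to" + Drug NP."""
    for i, phrase in enumerate(phrases):
        if phrase.kind != "VG" or phrase.head != "sensitize":
            continue
        # has subj: an Anomaly NP immediately to the left of the VG
        if i == 0 or not is_np(phrases[i - 1], "Anomaly"):
            continue
        # has obj: a Disease cell NP immediately to the right of the VG
        if i + 1 >= len(phrases) or not is_np(phrases[i + 1], "Disease cell"):
            continue
        # nmod to: a Drug NP introduced by the preposition "to"
        for j in range(i + 2, len(phrases) - 1):
            prep, np = phrases[j], phrases[j + 1]
            if prep.kind == "PREP" and prep.text.lower() == "to" and is_np(np, "Drug"):
                return {
                    "anomaly": phrases[i - 1].text,
                    "disease_cell": phrases[i + 1].text,
                    "drug": np.text,
                }
    return None
```

The association pattern could be handled by the same kind of function, with an “associate” VG, an Anomaly NP as subject and an RO NP as object.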

All the patterns used in this work are matched using this same approach. The list of all patterns for all sentence types is available in Appendix A.

4.3.6 Extracting specific information

4.3.6.1 Extracting drugs

In addition to detecting the association between a genomic anomaly and drug responses, eGARD also records the drug(s) involved in that relation. The drug may co-occur in the same sentence that mentions the association between the Anomaly entity and the RO entity. In such cases, if the drug is found in the noun phrases that the patterns extract for the Anomaly or RO entities, it is extracted as well. However, quite often, the drug is not mentioned in the same sentence but must be inferred from the context. So, we looked for some simple patterns in certain rhetorical zones of the abstract in the following order: title, method sentences, patient context (PC) sentence, conclusion sentence(s) and introduction sentence(s). The rhetorical zones and the PC sentence are determined using the method described in Chapter 3. If we denote the mention of a drug as drugname, the patterns that we used include, but are not limited to, “treatment with drugname”, “patients treated with drugname”, “patients receiving drugname”, “drugname therapy” and “efficacy of drugname”. The intuition behind these patterns was to identify the drugs that were used to treat patients, rather than just looking for co-occurrence of drugs. The full list of patterns is available in Appendix A.

4.3.6.2 Extracting diseases

Similar to drugs, we need to infer the disease involved in the relation if it is not mentioned in the relation arguments. Following our approach in DiMeX, we consider the disease mentioned in the Patient Context (PC) sentence to be the central disease of the study, provided it occurs before the current sentence. If the disease is not found in a PC sentence, we look for the central disease in other rhetorical zones of the abstract in the following order: title, conclusion sentence(s) and introduction sentence(s).
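The lookup order for the central disease can be sketched in Python as follows. This is a minimal sketch under assumptions: the sentence representation (a dict with the keys "zone", "is_pc" and "diseases"), the zone labels and the function name infer_disease are hypothetical, and taking the first disease mention of a sentence is a simplification.

```python
from typing import Dict, List, Optional

# Fallback order of rhetorical zones, as described in the text.
FALLBACK_ZONES = ["title", "conclusion", "introduction"]


def infer_disease(sentences: List[Dict], current: int) -> Optional[str]:
    """Return the central disease for the sentence at index `current`.

    Each sentence is a dict with hypothetical keys:
      "zone"     - rhetorical zone label, e.g. "title" or "conclusion"
      "is_pc"    - True if this is the Patient Context (PC) sentence
      "diseases" - disease mentions found in the sentence, in textual order
    """
    # 1. Prefer a disease from a PC sentence that precedes the current one.
    for sentence in sentences[:current]:
        if sentence["is_pc"] and sentence["diseases"]:
            return sentence["diseases"][0]
    # 2. Otherwise scan the other zones in the fixed fallback order.
    for zone in FALLBACK_ZONES:
        for sentence in sentences:
            if sentence["zone"] == zone and sentence["diseases"]:
                return sentence["diseases"][0]
    return None
```

The drug lookup of Section 4.3.6.1 follows the same idea with its own zone order and with the drug patterns in place of disease mentions.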

4.3.7 Extracting additional information

In addition to extracting the association between genomic anomalies and drug responses, we also extracted additional information that we believe would help a curator or a researcher to easily distinguish information extracted from a patient vs. cell line study, or from a prospective vs. retrospective study. This may be useful for determining and summarizing the level of evidence associated with a predictive biomarker. Firstly, we extract information related to patients in the study, such as the size of the experimental patient population and control, the race or nationality, etc. We used the method used in DiMeX for this purpose. This information is assumed to indicate whether the investigation involved patients (as opposed to cell lines or models) and hence can be used to prioritize curation or rank the importance of the extracted conclusion. We also look for NPs whose type (given by the typing of phrases step) indicates a cell-based study, and tag the abstract as a cell type study instead of a patient study. Additionally, we look for the presence of certain information (in the form of words or phrases) in the abstract that provides valuable insight for a curator or a researcher to filter or rank information. This information includes, but is not limited to: retrospective or prospective study, clinical trial phases (I, II, III or IV), in-vivo, in-vitro or ex-vivo, clinical trial IDs and meta-analysis. We match these phrases or their minor variations against the abstract text. Finally, we check whether the publication is a review article by examining the MeSH terms.

4.4 System implementation

The input to the eGARD system is a list of documents that we obtain from the MEDLINE repository. We also collect the named entities, namely genes, mutations and diseases, from PubTator. We run our mutation detector tool (discussed in Chapter 2) on the documents, which identifies additional mutation mentions and associates the mutations with genes. An in-house sentence splitter was used to split the abstracts into individual sentences. We used an acronym detector [97] to detect possible abbreviations. The abbreviated forms of different terms and entities assisted the entity detection step. For instance, in the text excerpt “The median disease-free survival (DFS) time was 10.2 mo in the patients.” (PMID: 17854149), the term “disease-free survival” is an RO entity and the acronym detector detects “DFS” as its abbreviated form. Thus, we treated DFS as an RO entity as well throughout the abstract.

In order to build a database containing the vast amount of information regarding the association of genomic anomalies and drug responses, eGARD was applied to a large subset of the PubMed collection. First, we collected abstracts retrieved for 50 genes and 42 cancer drugs including cell cycle inhibitors, kinase inhibitors and antibody treatments. Then we collected abstracts that relate to several cancers and all FDA-approved drugs from PubMed. In total, the collection contained 4,275,130 abstracts. We further applied a filtering process that discards abstracts that do not mention any RO entity term. This yielded a list of 1,289,667 abstracts, on which we applied eGARD. Currently, the results are incorporated in the iTextMine [98] system.

To enhance the portability of the system, we have dockerized [58] the entire eGARD system into one container. Dockerization allows for the parallel execution of the system in multiple threads with the help of an external framework, such as Apache Spark [79].

4.5 Evaluation

4.5.1 Evaluation setup

We evaluated eGARD on two different datasets to assess how well it performs over a range of text. The first set includes abstracts that were annotated in-house. The second set of abstracts is based on the PharmGKB [24] data.

The first in-house annotated dataset contains a set of abstracts from which the information regarding the association of genomic anomalies with drug response had already been annotated during a previous curation work [99]. This curation was done by a domain expert who did not participate in the design and implementation of the eGARD system. The same domain expert was able to quickly convert the information into an annotated dataset that could be used for evaluation. We call this set InHouseSet1. It includes 100 abstracts, where each abstract is annotated with information containing a gene, the type of anomaly (specific mutations, high or low expression levels etc.), the drug and the disease. The abstracts in InHouseSet1 pertain to seven gene-drug combinations, which are listed in Appendix B. We consider an extraction to be correct only if all four components extracted by the system match those in the annotation.

We also considered a second in-house annotated dataset. A PubMed search for a biomarker gene and drug combination often returns hundreds of abstracts, many of which might not contain information on the association of genomic anomalies with drug response. For example, [34] reported that only 85 of the 575 abstracts returned from such a search were relevant. In that sense, InHouseSet1 is different from a typical search result because nearly all of its abstracts were relevant for curation. Thus, this set is not appropriate for evaluating the system’s ability to reject an abstract as irrelevant and thereby save precious curation time and effort. We therefore developed another dataset, called InHouseSet2, which was annotated by the same annotator. The only difference is that the 100 abstracts in InHouseSet2 were chosen randomly from the results of a PubMed search on the same seven gene-drug combinations that appear in InHouseSet1. In contrast with InHouseSet1, only 38 of the 100 abstracts in InHouseSet2 contained relevant information. Since the primary goal of developing and using InHouseSet2 for evaluation was to assess the ability to reject irrelevant abstracts, the determination of true negatives is important. Thus, when using InHouseSet2, we focus on the True Negative Rate (TNR), also known as specificity and defined as TNR = TN/(TN+FP). We also provide the precision and recall results, although it must be noted that the number of positive instances is much smaller and hence the precision and recall results are less reliable than those obtained with InHouseSet1.

We also used the data from the PharmGKB project. PharmGKB has a variant annotation dataset containing manually curated associations in which the variant affects a drug dose, drug response or drug metabolism. We took the PharmGKB variant annotation dataset and tailored it to evaluate our system’s ability to extract the impact of variants on drug responses. This evaluation provides a chance to examine how much of the manually curated information in an existing database our system can reproduce. As our work is concerned with drug responses, we filtered out effects of variants on drug dose and metabolism, keeping only the drug responses. In this filtered version, we searched for annotations that are concerned with a list of FDA-approved drugs for 13 genes in cancer studies. This list of drugs (and their name variations), which is available in Appendix B, was provided to us by the annotator. This yielded a set of 46 articles. From the rest of the filtered annotation set, we randomly picked 54 articles, so that the entire PharmGKB evaluation set contains 100 articles. As our system is concerned with abstracts only, we did not include the annotations where the information is found in the full-length text but not in the abstract. The above filtering process yielded a set of 76 annotated associations from 46 abstracts (FDA) and 65 annotated associations from 54 abstracts (others), for a total of 100 abstracts with 141 annotated associations. We used recall as the evaluation metric for this set: because all of the annotations are positive, we cannot count false positives and therefore cannot assess precision. Also, PharmGKB annotations do not contain disease information, and hence the disease was not included in the evaluation. The PharmGKB dataset was obtained by acquiring a license through https://www.pharmgkb.org/.

4.5.2 Evaluation metrics

For evaluation, we counted true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), and used the standard information retrieval metrics of Precision (P), Recall (R), F-measure (F) and True Negative Rate (TNR) for performance evaluation, where P = TP/(TP+FP), R = TP/(TP+FN), F = 2PR/(P+R) and TNR = TN/(TN+FP).
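The four metrics can be computed directly from the counts, as in the following short Python sketch. The function names are illustrative, and returning 0.0 when a denominator is zero is a convention assumed here, not something specified in this work. The usage lines recompute the scores from the counts reported in Tables 4.1 and 4.2.

```python
from typing import Dict, Optional


def safe_div(numerator: float, denominator: float) -> float:
    # Assumed convention: a metric with an empty denominator is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(tp: int, fp: int, fn: int,
                       tn: Optional[int] = None) -> Dict[str, float]:
    """Precision, recall, F-measure and (if TN is given) true negative rate."""
    p = safe_div(tp, tp + fp)
    r = safe_div(tp, tp + fn)
    metrics = {"P": p, "R": r, "F": safe_div(2 * p * r, p + r)}
    if tn is not None:
        metrics["TNR"] = safe_div(tn, tn + fp)
    return metrics


# Counts reported in Table 4.1 (InHouseSet1) and Table 4.2 (InHouseSet2).
set1 = evaluation_metrics(tp=128, fp=7, fn=21)
set2 = evaluation_metrics(tp=28, fp=2, fn=10, tn=60)
print({name: round(value, 2) for name, value in set1.items()})
print({name: round(value, 2) for name, value in set2.items()})
```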

4.6 Results and discussion

4.6.1 Results on annotated datasets

We first report the results of eGARD when it was evaluated on the InHouseSet1 dataset. Table 4.1 records the precision and recall for the extraction of the four components: the anomaly, the response, the drug and the disease. We consider an extraction to be correct only if all four components match the corresponding components in the annotation. We achieved an F-measure of 0.90 with a precision of 0.95 and a recall of 0.86.

An in-depth analysis of the mistakes revealed three main types of errors. The first was due to the complexity of sentences, as in the sentence “Gene expression levels of APTX, BRCA1 and ERCC1 were significantly lower in irinotecan-sensitive gastric cancer samples than those irinotecan-resistant samples (P<0.001 for all genes), while ISG15 (P=0.047) and Topo1 (P=0.002) were significantly higher.” (PMID:23517622). While eGARD was able to extract the relationship between APTX, BRCA1 and ERCC1 expressions and irinotecan sensitivity, it failed to capture the relation of ISG15 and Topo1 with the drug response. Second, as mentioned previously, the drug and disease might not be mentioned in the same sentence, and eGARD attempts to identify them from context in these cases. A few errors were due to this extraction from context. Finally, while the patterns in eGARD gave us a recall of 0.86, there were a few FNs due to missing patterns. For example, “Class III beta-tubulin overexpression is a prominent mechanism of paclitaxel resistance in ovarian cancer patients” (PMID:15671559) was missed because of the lack of a trigger.

Table 4.1: Performance of our system in finding the association of a genomic anomaly with drug responses in InHouseSet1.

              TP    FP   FN   Precision   Recall   F-measure
InHouseSet1   128   7    21   0.95        0.86     0.90

As mentioned earlier, we developed another dataset, InHouseSet2, to evaluate eGARD’s ability to differentiate between relevant and non-relevant abstracts. Therefore, the evaluation on this dataset was performed at the abstract level, meaning each abstract was considered either relevant or non-relevant, as opposed to InHouseSet1, where individual combinations of the four components were annotated. The result of this evaluation is presented in Table 4.2. We used the true negative rate (TNR) to represent how well our system rejects non-relevant abstracts. There were only two cases among these 100 abstracts where eGARD predicted an abstract to be relevant when, according to the annotation, it was not (false positives), thus yielding a TNR of 0.97. This is an encouraging result because it suggests that eGARD can save considerable curation time and effort. Although we also present the precision and recall in Table 4.2, we would like to note that the number of positive abstracts in this set is small.

Table 4.2: Performance of our system in separating non-relevant abstracts from relevant abstracts in InHouseSet2.

              TP   FP   FN   TN   Precision   Recall   F-measure   TNR
InHouseSet2   28   2    10   60   0.93        0.74     0.82        0.97

The evaluation results for the PharmGKB set are presented in Table 4.3. Please note that we are only able to assess recall for this set, as it only contains positive annotations. The overall recall on the 100 abstracts is 0.77. Table 4.3 shows that similar recall scores are obtained for the two subsets: abstracts based on 13 FDA-approved cancer drugs, and abstracts for the other drugs (which covered mostly non-cancer diseases).

Table 4.3: Performance of our system in finding the association between genomic anomaly and drug responses in the PharmGKB dataset.

                                       TP    FN   Recall
PharmGKB (FDA drug set, 46 articles)   59    17   0.78
PharmGKB (other set, 54 articles)      50    15   0.77
PharmGKB (combined)                    109   32   0.77

4.7 Conclusion

We have developed a system (eGARD) that finds the relations between genomic anomalies and drug responses from abstracts of published articles. Evaluations on several datasets show that our system can assist the manual curation process by automatically extracting relations, thereby significantly reducing manual curation time. Additionally, we have applied the eGARD system to a large set of abstracts from MEDLINE. The results are stored in a database as JSON objects. An interactive web interface has been built to visualize the results, which is available at: www.mace2k.org. The entire system implementation is dockerized for better portability, which also helps run the system in a parallel fashion.

Chapter 5

MUTATION IMPACT ON PROTEIN-PROTEIN INTERACTIONS

5.1 Introduction

Great importance has been placed on the field of precision medicine in recent years [100,101]. Precision medicine develops new approaches for disease treatment and prevention by taking into account the genetic profile of patients, hence generating customized treatment. In this regard, it is important to know how genetic variations contribute to diseases. In Chapters 2-4, we developed text mining methods to connect genetic variations with genes, diseases, certain aspects of diseases and drug responses. However, it is also important to know how the genetic profiles impact functionalities at the molecular level, thereby affecting disease development in turn. As proteins and their interactions are the building blocks of metabolic and signaling pathways regulating cellular processes [102], understanding how genetic mutations impact protein-protein interactions is crucial for providing additional support to precision medicine efforts. To address this, we have developed a text mining system to automatically extract mutational impact on protein-protein interactions (PPI) from the scientific literature. Example 1 below shows one sentence from which our system extracts various information.

Example 1: “The LRP5 high-bone-mass G171V mutation disrupts LRP5 interaction with Mesd.” (PMID: 15143163)

In Example 1, there is a PPI relation indicated by the phrase “interaction with”, and the interactants of the PPI relation are LRP5 and Mesd. The mutation that impacts this PPI relation is G171V, and the impact is indicated by the word “disrupts”. So, our task is subdivided into two separate tasks: finding the PPI interactants, and finding the mutation that impacts the PPI relation. Additionally, we would like to know which gene the mutation belongs to. In Example 1, the phrase “LRP5 high-bone-mass G171V mutation” indicates that the mutation G171V belongs to the gene LRP5. Thus, altogether from the sentence in Example 1, we will extract the tuple:

< G171V (mutation), LRP5 (mutated-gene), LRP5 (PPI interactant 1), Mesd (PPI interactant 2) >

Along with the impact relation, the impact type is also indicated in text. For instance, Example 1 indicates the impact type with the word “disrupts”. Currently, we do not categorize the impact type based on the word/phrase that indicates it. That is why the impact type is not part of the tuple above. However, we retain the word/phrase so that in the future we can formalize the impact type.

Sometimes the mutation that impacts a PPI is mentioned as part of the interaction itself, as shown in Example 2. Here, the mutated gene p47(F253S) failed to form a complex with the gene p97, indicating that F253S has an impact on the PPI relation between p47 and p97. Usually, a modifier of the key word/phrase (“form a complex” in this case) indicates an anomalous behavior, which expresses the impact of the mutation on the PPI. In this case, the negative modifier “not” denotes p47’s inability to form a complex with p97, thereby revealing F253S’s impact on the PPI.

Example 2: “p47(F253S) could not form a complex with p97” (PMID:20691684)

5.2 Related works

With the advent of next generation sequencing technologies, there has been an exponential growth of research articles connecting genomic variations with molecular functions, protein properties and pathways. The need for information regarding the impact of mutations on biological entities is greater than before because of the fast-growing interest in personalized medicine. There are manually curated resources that house such information. For example, IntAct [43] curates evidence for molecular interactions. BioGRID [42] is a database that contains protein and genetic interactions. However, to automate and assist the curation of mutational impact on protein-protein interactions, there have been only a handful of recent text mining efforts. There have been sizable efforts towards extracting mutational information from text [10, 11, 13, 14, 103] and towards PPI extraction [44–48] separately. But no text mining effort has addressed the challenge of integrating these two tasks to retrieve protein-protein interactions affected by mutations, which is critical in the precision medicine context. This motivated a new Precision Medicine Track (track 4) in the latest BioCreative VI workshop [104]. The Precision Medicine Track addresses this problem in the form of two tasks:

• Document Triage: Identification of relevant PubMed citations describing mutations affecting protein-protein interactions

• Relation Extraction: Extraction of experimentally verified PPI pairs affected by the presence of a genetic mutation

For the relation extraction task in the BioCreative VI Precision Medicine Track, Chen et al. [50] used an SVM classifier to identify protein-protein interactions that are impacted by mutations. They used textual features such as co-occurrence of mutational terms and frequency of terms and gene mentions. Tran et al. [51] used a convolutional neural network to predict mutational impact on PPI relations. They used word embeddings as features, as well as mentions of genes. The top result for the relation extraction task in BioCreative VI was reported as an F1-score of 0.3729. Individually, the top precision was 0.4544 and the top recall was 0.5387.

There has been little effort on the automatic identification of the details of mutation impact on biomedical entities, and not necessarily on PPI. Naderi et al. [49] present a system, namely Open Mutation Miner (OMM), which extracts mutation-protein relationships from biomedical text. OMM employs separate modules for extracting the impact, grounding the impact to mutations, and identifying the impacted protein property and the magnitude and direction of the impact. It is a rule-based system coupled with selected keywords. EnzyMiner [61] uses a machine-learning-based approach that identifies the abstracts that contain an amino-acid-level mutation and then classifies them according to the mutation’s effect on the enzyme. For impact analysis, document classification is performed to identify the abstracts that contain a change in the enzyme’s stability or activity resulting from the mutation. A recent survey by Tang et al. [105] discusses other computational but non-TM approaches to determine the impact of genetic variations.

5.3 Approach

Extracting mutational impact on PPI consists of three separate components: (1) finding the PPI relation, (2) finding the genetic variation that impacts the PPI relation, and (3) finding the mutated gene/protein. Please note that the mutated gene/protein is usually one of the interactants. These three components are described in Sections 5.3.1 to 5.3.3, respectively.

5.3.1 Extraction of PPI relation

PPI relations are extracted from predicate-argument relations in text, where both proteins are the arguments of a binary relation indicated by a lexical trigger word/phrase. The trigger words include “binding”, “interaction”, etc. or their textual variations. For instance, the PPI relation in Example 1 (“The LRP5 high-bone-mass G171V mutation disrupts LRP5 interaction with Mesd.”) can be extracted from the predicate-argument relations (interaction, LRP5) and (interaction, Mesd). Here both are connected via the trigger “interaction” and both proteins appear as arguments of the trigger.

To detect PPI relations, we have employed the Extended Dependency Graph (EDG) framework, which leverages the Stanford Dependency Graph (SDG) generated from the syntactic parse tree. As already mentioned in Chapter 3, EDG not only considers syntactic dependencies between words in a sentence, but also utilizes information beyond syntax to capture different dependencies. From the syntactic dependencies provided by the Stanford Dependency Graph, it produces arg0 and arg1 dependencies, taking into account textual variations of the same dependency representation. Figure 5.1 shows the SDG representation of Example 1. In Figure 5.1, from the lexical trigger “interaction”, we follow the compound edge to get arg0 and the nmod:with edge to get arg1, which represent the predicate-argument relations arg0(interaction, LRP5) and arg1(interaction, Mesd), respectively. Thus, we are able to extract the PPI relation < LRP5, Mesd >.

Figure 5.1: SDG representation of the sentence in Example 1.
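The edge-following step can be sketched in Python over a list of dependency triples. This is a simplified illustration and not the EDG framework: the edge list is assumed to come from a dependency parser, the names Edge, extract_ppi, ARG0_LABELS and ARG1_LABELS are hypothetical, only the edge labels mentioned in this section are covered, and words are identified by their surface strings rather than by token positions.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (head word, dependency label, dependent word)

PPI_TRIGGERS = {"interaction", "binding"}
ARG0_LABELS = {"compound"}              # edge from the trigger that yields arg0
ARG1_LABELS = {"nmod:with", "nmod:to"}  # edges from the trigger that yield arg1


def extract_ppi(edges: Iterable[Edge],
                proteins: Set[str]) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Collect (trigger, arg0, arg1) triples; a missing argument is None."""
    found: Dict[str, Dict[str, Optional[str]]] = {}
    for head, label, dependent in edges:
        # Only edges that start at a PPI trigger and end at a protein matter.
        if head.lower() not in PPI_TRIGGERS or dependent not in proteins:
            continue
        args = found.setdefault(head, {"arg0": None, "arg1": None})
        if label in ARG0_LABELS:
            args["arg0"] = dependent
        elif label in ARG1_LABELS:
            args["arg1"] = dependent
    return [(trigger, a["arg0"], a["arg1"]) for trigger, a in found.items()]


# Hand-written edges for Example 1 (a subset of the graph, written for illustration).
example1_edges = [
    ("disrupts", "nsubj", "mutation"),
    ("disrupts", "dobj", "interaction"),
    ("interaction", "compound", "LRP5"),
    ("interaction", "nmod:with", "Mesd"),
]
print(extract_ppi(example1_edges, {"LRP5", "Mesd"}))
```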

Traditional PPI relation extraction systems identify both interactants from the relation. However, in this work, we also consider PPI relations that mention only one interactant as part of the relation, while the other interactant is implicitly referred to from the context. As we ultimately aim to find mutation impact on the PPI, the other interactant is usually the gene of the mutation that impacts the PPI relation. Example 3 shows one such sentence. Here, the PPI relation is indicated by the trigger “binding”. The nmod:to edge from “binding” gives one interactant, IL8R1 (Figure 5.2), but there is no other edge leading to another interactant. However, as the mutations L49A and L49F inhibit the binding, the other interactant is IL-8, the gene to which the mutations belong.

Example 3: “IL-8 mutations L49A or L49F selectively inhibited binding to IL8R1.” (PMID:8626516)

Figure 5.2: SDG representation of sentence in Example 3.
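The handling of a single explicit interactant, as in Example 3, might look like the following sketch. The function name and its inputs are hypothetical; it only illustrates the fallback of using the mutated gene as the implicit partner.

```python
def complete_ppi(explicit_interactants, mutated_gene):
    """Return a pair of interactants, or None if the pair cannot be completed."""
    found = list(explicit_interactants)
    if len(found) >= 2:
        return (found[0], found[1])
    if len(found) == 1 and mutated_gene is not None:
        # Only one interactant is stated; assume the gene carrying the
        # mutation is the implicit partner.
        return (mutated_gene, found[0])
    return None


# Example 3: "binding" has a single nmod:to argument (IL8R1), and the
# mutations L49A and L49F belong to IL-8.
print(complete_ppi(["IL8R1"], "IL-8"))  # ('IL-8', 'IL8R1')
```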

5.3.2 Mutation impact on PPI

Mutational impact on PPI is often expressed in text using one of the CAIR relations (introduced in Chapter 3). As with DiMeX, we used the EDG framework to capture the impact relation between mutations and PPI. The difference in this case is that one argument (arg0) of the relation is the mutation, and the other argument (arg1) points to the PPI relation. Consider the SDG representation in Figure 5.1, where the PPI relation is indicated by the trigger “interaction”. Additionally, there is an impact relation indicated by the trigger “disrupt”. The nsubj edge from the impact trigger points to the head of the noun phrase (NP) that contains the mutation. The dobj edge points to the noun “interaction”, which happens to be the PPI trigger. Thus, combining the impact and PPI relations, we are able to extract that the mutation G171V has an impact on the PPI relation between LRP5 and Mesd.

As just mentioned, we are interested in the cases where the arg0 argument of the impact relation is a mutation. Hence, we check whether the NP pointed to by arg0 of the impact trigger refers to a mutation. But the NP may not always contain specific mutation mentions. Instead, the authors may refer to the mutation with a phrase indicating one or more mutations of a certain gene. Usually such phrases are headed by a word or phrase indicating mutations, such as “mutations”, “polymorphism”, “variants”, etc. If we detect such a mutational phrase with a gene name, we extract the referent mutation(s) from the closest sentence where the mutation is already associated with that gene. For instance, in Example 4, arg0 gives us the NP “A novel silent beta-thalassemia mutation”, from which we extract the referent (-101C–>T) from a later sentence in the abstract.

Example 4: “A novel silent beta-thalassemia mutation in the distal CACCC box affects the binding and responsiveness to EKLF.” (PMID:15352994)
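The two steps of this section, linking an impact trigger to a PPI trigger and resolving the arg0 noun phrase to mutations, can be sketched as follows. Everything here is illustrative: the edges are hand-written from the description of Figure 5.1, the trigger lexicons and the toy mutation pattern are assumptions, and the `associations` structure (sentence index mapped to mutation and gene pairs, with placeholder names such as GENE1) is hypothetical.

```python
import re

# Hand-written edges for Example 1, following the description of Figure 5.1.
EDGES = [
    ("disrupts", "nsubj", "mutation"),
    ("disrupts", "dobj", "interaction"),
    ("interaction", "compound", "LRP5"),
    ("interaction", "nmod:with", "Mesd"),
]
IMPACT_TRIGGERS = {"disrupts", "inhibits", "affects"}  # illustrative lexicon
PPI_TRIGGERS = {"interaction", "binding"}              # illustrative lexicon
MUTATION_HEADS = {"mutation", "mutations", "polymorphism", "variants"}
# Toy pattern for substitutions such as G171V; real detectors cover far more.
POINT_MUTATION = re.compile(r"[A-Z]\d+[A-Z]")


def impact_links(edges):
    """Return (arg0 head, impact trigger, PPI trigger) triples."""
    links = []
    for head, label, dep in edges:
        if head in IMPACT_TRIGGERS and label == "dobj" and dep in PPI_TRIGGERS:
            subjects = [d for h, l, d in edges if h == head and l == "nsubj"]
            links.extend((subj, head, dep) for subj in subjects)
    return links


def mutations_for_arg0(np_tokens, head, gene, sentence_idx, associations):
    """Resolve the arg0 noun phrase to a list of mutation mentions."""
    specific = [tok for tok in np_tokens if POINT_MUTATION.fullmatch(tok)]
    if specific:
        return specific
    if head.lower() not in MUTATION_HEADS:
        return []
    # Generic mutational phrase: keep only mutations tied to the same gene,
    # then borrow those from the closest sentence.
    candidates = {}
    for idx, pairs in associations.items():
        matches = [m for m, g in pairs if g == gene]
        if matches:
            candidates[idx] = matches
    if not candidates:
        return []
    closest = min(candidates, key=lambda idx: (abs(idx - sentence_idx), idx))
    return candidates[closest]


print(impact_links(EDGES))  # [('mutation', 'disrupts', 'interaction')]
np_example1 = ["The", "LRP5", "high-bone-mass", "G171V", "mutation"]
print(mutations_for_arg0(np_example1, "mutation", "LRP5", 0, {}))  # ['G171V']
```

Ties between an earlier and a later sentence at the same distance are broken in favor of the earlier one here; that choice is arbitrary and not taken from the text.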

As we already pointed out, the mutational impact on PPI can be expressed as an interaction itself, as shown in Example 2. Similar to PPI relation extraction, this kind of impact relation is also extracted from predicate-argument relations. The interaction is triggered by a keyword or phrase, and the mutation and the protein are arguments of the binary relation indicated by the lexical trigger word/phrase. Consider the sentence in Example 5 and its SDG representation in Figure 5.3. Here, the trigger “associate” and its negative modifier (“not”) indicate an interaction. The nsubj edge from the trigger points to the NP that contains the mutation and the mutated-gene. The nmod:with edge from the trigger leads to the NP that contains one PPI interactant. As the mutated-gene is the other PPI interactant, we extract the 4-tuple consisting of the mutation Y152F, the mutated-gene PTP1B, and the two interactants PTP1B and N-cadherin, which captures the mutational impact on a PPI.

Example 5: “the mutant Y152F PTP1B does not associate with N-cadherin in situ” (PMID:11106648)

Figure 5.3: SDG representation of sentence in Example 5.
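For the case where the interaction trigger itself carries the impact, as in Example 5, a rough sketch is given below. The edges are hand-written from the description of Figure 5.3, the "neg" label for the negative modifier is an assumption, and the `SUBJECT_NPS` lookup (subject head mapped to the mutation and gene found in its noun phrase) is a hypothetical stand-in for the entity recognition step.

```python
INTERACTION_TRIGGERS = {"associate", "bind", "interact"}  # illustrative lexicon

# Hand-written edges for Example 5, following the description of Figure 5.3.
# The "neg" label for the negative modifier is an assumption.
EDGES = [
    ("associate", "nsubj", "PTP1B"),
    ("associate", "neg", "not"),
    ("associate", "nmod:with", "N-cadherin"),
]
# Hypothetical lookup: head of a subject NP -> mutation and gene found in it.
SUBJECT_NPS = {"PTP1B": {"mutation": "Y152F", "gene": "PTP1B"}}


def impact_from_interaction(edges, subject_nps):
    """Build one record per (mutated subject, interaction partner) pair."""
    results = []
    for trigger in sorted({h for h, _, _ in edges} & INTERACTION_TRIGGERS):
        deps = {}
        for head, label, dep in edges:
            if head == trigger:
                deps.setdefault(label, []).append(dep)
        for subj in deps.get("nsubj", []):
            info = subject_nps.get(subj)
            if info is None:
                continue  # the subject NP carries no mutation
            for partner in deps.get("nmod:with", []):
                results.append({
                    "mutation": info["mutation"],
                    "mutated_gene": info["gene"],
                    "interactants": (info["gene"], partner),
                    "negated": "neg" in deps,
                })
    return results


for record in impact_from_interaction(EDGES, SUBJECT_NPS):
    print(record)
```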

5.3.3 Extraction of mutated gene

Once we extract the mutation(s) that impact a PPI relation, we extract the gene/protein that the mutation belongs to. We call this the mutated-gene. It is important to find the mutated-gene because it is not always clear which interactant in the PPI relation is the mutated-gene. Also, as we pointed out above, sometimes the mutated-gene may not be explicitly mentioned in the PPI relation (Example 3). We already developed techniques to associate mutations with genes (described in Chapter 2), and we use the same techniques to find the mutated-gene. However, we add further steps to identify the mutated-gene when those techniques fail.

We have noticed that authors often describe the mutation in the context of some biological function of the mutated-gene. Consider the sentence in Example 5a. Here, the mutation R67C impacts the binding relation between the PAK3 and Cdc42 genes. Although the PPI relation gives both interactants, it is not evident which one the mutation R67C belongs to. The mutation-gene association techniques described in Chapter 2 could not find the mutated-gene for R67C. However, a previous sentence, shown in Example 5b, describes the effect of the mutation on functions of the gene PAK3. This is an indication that R67C belongs to PAK3. Therefore, we look for Regulation or Involvement relations (CAIR) where the target mutation is connected to a functionality of a gene. We take that gene to be the mutated-gene. Thus, combining information from relations extracted from the sentences in Examples 5a and 5b, we are able to extract the tuple consisting of the mutation R67C, the mutated-gene PAK3, and the interactants PAK3 and Cdc42.

Example 5a: “The R67C mutation drastically decreases the binding of PAK3 to the small GTPase Cdc42” (PMID:17537723)

Example 5b: “the three mutations R419X, A365E, and R67C, responsible for mental retardation have different effects on the biological functions of PAK3” (PMID:17537723)
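The fallback for finding the mutated-gene in Examples 5a and 5b can be sketched as below. The data structures are hypothetical: `direct_associations` stands in for the output of the Chapter 2 techniques, and each entry in `cair_relations` is an assumed record of an extracted relation. The relation type assigned to Example 5b is also an assumption made for the illustration.

```python
FUNCTIONAL_RELATION_TYPES = {"Regulation", "Involvement"}


def find_mutated_gene(mutation, direct_associations, cair_relations):
    """Return the gene a mutation belongs to, or None if it cannot be found."""
    gene = direct_associations.get(mutation)
    if gene is not None:
        return gene
    # Fallback: a relation that ties the mutation to a function of a gene.
    for relation in cair_relations:
        if (relation["type"] in FUNCTIONAL_RELATION_TYPES
                and mutation in relation["mutations"]):
            return relation["gene"]
    return None


# Example 5a/5b: the direct association step finds nothing for R67C, but an
# earlier sentence links R419X, A365E and R67C to functions of PAK3.
direct = {}
relations = [{
    "type": "Regulation",  # assumed type for the relation in Example 5b
    "mutations": ["R419X", "A365E", "R67C"],
    "gene": "PAK3",
}]
print(find_mutated_gene("R67C", direct, relations))  # PAK3
```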

5.3.4 Anaphora resolution

An anaphor is an expression that refers to an earlier item in the text. In the biomedical context, anaphors are usually pronouns that refer to previously mentioned entities. In conjunction with our approaches for identifying PPIs and mutation impact on PPIs (sections 5.3.1 and 5.3.2), we employed anaphora resolution to increase the recall of the system. For the most common anaphors, “its” and “this”, we resolve to a single target entity in the same sentence or the previous sentence. In the case of “these”, “those” and “their”, we resolve to multiple target entities in phrases either in the same sentence or in previous sentences. Anaphors can refer to mutations, protein interactants or the interaction relation.

Consider Example 6, which mentions the impact of a mutation (S1444D) on a PPI that is referred to with “this interaction” instead of mentioning the PPI interactants explicitly. However, from the earlier phrase in the sentence, “Myo2 interacts with Mid1”, we detect the PPI, and “this” is resolved to the interaction between Myo2 and Mid1. This completes our extraction of the 4-tuple consisting of the mutation S1444D, the mutated-gene Myo2, and the interactants Myo2 and Mid1.

Example 6: “Myo2 interacts with Mid1 in cell lysates, and this interaction is inhibited by an S1444D mutation in Myo2.” (PMID:15184401)

Similarly, in Example 7, the anaphor “these” resolves to the mentions of the mutations R1170W and W1186S in the previous sentence.

Example 7: “these mutations impair LRP4 interaction with sclerostin” (PMID:21471202)
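The resolution rules for singular and plural anaphors can be sketched as follows. This is a simplified reading of the description above: `candidates[i]` is assumed to hold the entities of the wanted type in sentence i that precede the anaphor, in textual order, and picking the nearest preceding candidate for a singular anaphor is an assumption rather than a documented rule.

```python
SINGULAR = {"its", "this"}
PLURAL = {"these", "those", "their"}


def resolve(anaphor, sentence_idx, candidates):
    """Return the list of entities that the anaphor is resolved to."""
    word = anaphor.lower()
    if word in SINGULAR:
        # Look in the same sentence first, then in the previous one only.
        for i in range(sentence_idx, max(sentence_idx - 2, -1), -1):
            if candidates[i]:
                return [candidates[i][-1]]
        return []
    if word in PLURAL:
        # Walk back through earlier sentences and take all candidates from
        # the first sentence that has any.
        for i in range(sentence_idx, -1, -1):
            if candidates[i]:
                return list(candidates[i])
        return []
    return []


# Example 6: "this interaction" refers to the PPI found earlier in the
# same sentence.
print(resolve("this", 0, [[("Myo2", "Mid1")]]))
# Example 7: "these mutations" refers to the mutations of the previous sentence.
print(resolve("these", 1, [["R1170W", "W1186S"], []]))
```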

5.3.5 Entity recognition

Genes were detected using annotations from PubTator [75]. The gene mentions were normalized to Entrez IDs. Additionally, as mentioned previously in Chapter 4 (section 4.3.3), we used an acronym detector to rectify problems in gene mention detection where the full form of the gene was detected by PubTator but the short form was not. Using the acronym detections, we were able to automatically add gene mentions for short forms, too. Mutations were detected using our previously built tool used in the system DiMeX [14], which also provides mutation-to-gene associations. To increase the coverage of mutation detection, we also included annotations from tmVar [11], as it has previously been established that combining the results of multiple mutation detection tools provides higher yields [59].
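Combining the mutation mentions of two detectors amounts to a union over mention spans. The sketch below assumes each tool reports mentions as (start, end, text) tuples; the two tool outputs are invented for illustration and are not actual output of the DiMeX detector or tmVar.

```python
def find_span(sentence, text):
    """Return a (start, end, text) mention for the first occurrence of text."""
    start = sentence.index(text)
    return (start, start + len(text), text)


def merge_mentions(*tool_outputs):
    """Union of mention tuples from several tools, sorted by offset."""
    merged = set()
    for mentions in tool_outputs:
        merged.update(mentions)
    return sorted(merged)


sentence = "IL-8 mutations L49A or L49F selectively inhibited binding to IL8R1."
tool_a = [find_span(sentence, "L49A")]                               # hypothetical output
tool_b = [find_span(sentence, "L49A"), find_span(sentence, "L49F")]  # hypothetical output
merged = merge_mentions(tool_a, tool_b)
print([text for _, _, text in merged])  # ['L49A', 'L49F']
```

Exact-span matching is the simplest merge policy; mentions that overlap without sharing identical offsets would need an additional reconciliation rule.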

5.4 System implementation

The input to the system is a list of documents that we obtain from the Medline repository. We also collect the named entities, namely genes and mutations, from PubTator [75]. We run our mutation detector tool from DiMeX on the documents; it also associates the mutations with genes. After that, the documents are passed through the relation extraction system. As mentioned before, this involves sentence splitting and tokenization using the Stanford NLP [76] pipeline, then the use of the Bllip parser [77,78] to obtain the parse trees, and the application of the Stanford Conversion tool [76] to get the Stanford Dependency Graph for each sentence in the text. We then apply rules corresponding to the different relation types of interest and thus obtain the EDG representation for each sentence. The numbered argument edges in EDG (arg0 and arg1) point to the heads of the arguments of the relations. We use the parse trees to extract argument phrases, each of which is the parent noun phrase (with prepositional attachments) of the argument head. We extract the desired entities of interest in the relation from the arguments.

5.5 Evaluation

5.5.1 Evaluation setup

To evaluate our system on the task of extracting mutational impact on PPI relations, we needed a dataset annotated with 4-tuples that contain: (1) the mutation, (2) the mutated-gene, (3) PPI interactant 1, and (4) PPI interactant 2. As this is a new direction of research, we could not find any existing resource that provides this type of annotation. The closest resource we found was from the Relation Extraction task of the BioCreative VI track 4 [52], which we will refer to as the BioCreative dataset henceforth. The Relation Extraction task of BioCreative VI aims to find PPIs in the presence of some mutation, but does not aim to find the actual mutation. Thus, the BioCreative dataset only contains annotations of the two PPI interactants and does not specify the mutations that impact the interaction. As we needed the mutations to be annotated as well, we collected the BioCreative dataset and asked a biology expert with significant experience in biocuration to annotate abstracts from the dataset with the 4-tuples. We will refer to this annotated dataset as the M2PPI dataset.

First, we took the training set provided by the BioCreative dataset, because the training set already includes the annotations for protein-protein interactions. The training set included 597 PubMed abstracts. Then we filtered out abstracts that do not contain at least one mention of a mutation. That left us with a set of 193 abstracts. We randomly selected 50 abstracts for evaluation. The domain expert, who did not participate in the design and development of the text mining system, manually curated the 4-tuples from the 50 abstracts. The manual curation yielded the M2PPI dataset with 92 relations coming from 52 sentences in 44 abstracts, each indicating an impact of a mutation on a PPI relation.

We evaluated our system on the M2PPI dataset to see how well it performs in finding the mutation impact on PPIs, i.e., finding all four elements of the 4-tuple impact relation: the mutation, the mutated-gene, PPI interactant 1 and PPI interactant 2. Additionally, we evaluated the individual tasks, i.e., the mutation detection, the mutated-gene detection and the PPI detection, all of which contribute to detecting the 4-tuples.

5.5.2 Evaluation metrics

We counted true positives (TP), false positives (FP), and false negatives (FN), and used the standard information retrieval metrics of Precision (P), Recall (R), and F-measure (F) for performance evaluation, where P = TP/(TP+FP), R = TP/(TP+FN) and F = 2PR/(P+R).
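The metrics can be computed directly from the counts, as in the short sketch below. The counts in the first call are made-up toy values; the second call only checks that the precision and recall reported for the 4-tuple task in Table 5.1 are consistent with the reported F-measure.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); defined as 0.0 when nothing was predicted."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); defined as 0.0 when there is nothing to find."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """F = 2PR / (P + R); defined as 0.0 when both P and R are zero."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy counts (hypothetical, not taken from the evaluation).
p, r = precision(tp=8, fp=2), recall(tp=8, fn=2)
print(round(p, 2), round(r, 2), round(f_measure(p, r), 2))  # 0.8 0.8 0.8

# Consistency check against the 4-tuple row of Table 5.1.
print(round(f_measure(0.95, 0.76), 2))  # 0.84
```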

5.6 Results and discussion

Evaluation results of our system are summarized in Table 5.1. As we mentioned earlier, there were 92 instances of a mutation impacting a PPI in the M2PPI dataset. The first row of Table 5.1 shows that we achieved an F-measure of 0.84 with a very high precision (0.95) in identifying the 4-tuples, i.e., all four elements of the impact relation instances. The subsequent rows present the performance for the individual tasks. The individual tasks showed similar performance, with F-measures ranging from 0.90 to 0.91.

Analysis of the false negatives and false positives revealed that some of the mistakes can be attributed to erroneous parsing of the sentences, which caused some of the impact relations to be missed. Also, in some cases, impact relations were not captured because the relation extraction system lacked the corresponding lexico-syntactic patterns. We have noticed that in certain cases, multiple mutation impacts on PPI were annotated from a single sentence. If we failed to identify one such relation, it contributed multiple false negatives. For example, consider the sentence in Example 8. The relation extractor failed to connect the three mutations’ impact on 14-3-3eta binding, thus contributing three FNs. But in reality we have missed one impact relation from a single sentence. Therefore, to get a different view of the system’s performance in identifying impact relations from each individual sentence, we adjusted the results (e.g., Example 8 accounts for one false negative instead of three). This yielded an F-measure of 0.85, with a precision of 0.91 and a recall of 0.79.

Example 8: “14-3-3eta could bind to the linker region of parkin but not parkin with ARJP-causing R42P, K161N, and T240R mutations.” (PMID:16096643)
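One possible reading of the sentence-level adjustment is sketched below: outcomes of the same kind that come from the same sentence are counted once. The exact adjustment procedure is not spelled out in the text, so the grouping rule, the sentence identifiers and the outcome list here are assumptions made for illustration.

```python
from collections import defaultdict


def sentence_level_counts(outcomes):
    """Count each outcome label at most once per sentence.

    `outcomes` is an iterable of (sentence_id, label) pairs, where label is
    one of "TP", "FP" or "FN".
    """
    labels_by_sentence = defaultdict(set)
    for sentence_id, label in outcomes:
        labels_by_sentence[sentence_id].add(label)
    counts = {"TP": 0, "FP": 0, "FN": 0}
    for labels in labels_by_sentence.values():
        for label in labels:
            counts[label] += 1
    return counts


# Example 8 yields three relation-level FNs (R42P, K161N, T240R) but only one
# sentence-level FN. The other outcomes are hypothetical.
outcomes = [("ex8", "FN"), ("ex8", "FN"), ("ex8", "FN"),
            ("s1", "TP"), ("s1", "TP"), ("s2", "FP")]
print(sentence_level_counts(outcomes))  # {'TP': 1, 'FP': 1, 'FN': 1}
```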

Table 5.1: Evaluation results for detection of mutation impact on protein-protein interactions.

Task                      P      R      F
4-tuple                   0.95   0.76   0.84
PPI detection             0.99   0.85   0.91
Mutation detection        0.97   0.84   0.90
Mutated-gene detection    0.99   0.84   0.91

5.7 Conclusion

We have developed a system to find mutation impact on protein-protein interactions from abstracts of scientific literature. With the growing importance of personalized treatments, it is crucial to understand how genetic profiles influence functionalities at the molecular level. We believe our system will greatly help in this regard by quickly identifying relations from text and helping to speed up manual curation.

Chapter 6

CONCLUSION

Extraction of biomedical relations from scientific literature is an active area of research, which in turn assists the development of biological research. In particular, with the advent of next generation sequencing techniques, there has been a tremendous growth of literature connecting genomic anomalies to various bio-entities, such as diseases, drug responses and genomic functions. Manual curation of such information cannot keep pace with the rapid growth of literature. The ability to automatically extract this mutational information from large amounts of text serves the information needs of personalized care for patients. To address this need for automatic information extraction for precision medicine, we adopted various natural language processing (NLP) techniques to mine mutational relations from text. The text mining (TM) work in this thesis will facilitate and assist various biological tasks, such as knowledge discovery, bio-curation and hypothesis generation for personalized care. This thesis contributes not only to assisting biomedical researchers, but also to advancing the state of the art in text mining. The work presented here will provide insight into how NLP techniques can be used for biomedical relation extraction, and potentially lead to the development of further text mining systems in the future. In the following sections, we summarize the contributions of this dissertation and outline possible future directions.

6.1 Thesis summary and contributions

The contributions of this dissertation can be divided into four parts, each addressing a different challenge in extracting mutational information from natural text. The contributions are described below.

The first aspect of this dissertation is the development of a mutation detection tool, MeX, that extracts mutation mentions from biomedical text and associates them with the corresponding genes. MeX detects mentions that follow the HGVS [54] format as well as mentions expressed in natural text. Compared to the existing tools, our system has a wider coverage of mutation types. Evaluations on several corpora demonstrated that we achieved state-of-the-art results (F-measures of 0.91 to 0.94). Additionally, we developed a novel algorithm to associate mutation mentions with genes by taking into account the syntactic and semantic nature of text. Evaluation of mutation-gene association showed high precision (up to 0.95) and recall (up to 0.94) on several extrinsic corpora. We developed MeX as a standalone tool and encapsulated it as a Docker container to facilitate higher portability and parallel execution. We believe that our mutation detection system MeX, with its wide coverage of mutation types and mutation-gene association, will be a valuable resource for researchers and will further assist other text mining systems.

The second aspect of this dissertation concerns the development of a text mining system, named DiMeX [14], that finds associations between mutations and diseases from biomedical text. As such associations vary both in syntax and semantics, we employed a general relation extraction framework, the Extended Dependency Graph (EDG) [33], that exploits the syntactic dependencies in text. EDG is able to produce a similar dependency structure for the same semantic relation even when the syntactic relations vary. The uniqueness of our approach lies in leveraging the lexico-syntactic structure of sentences, whereas previous approaches mainly depended on co-occurrence of entities. DiMeX achieved high precision and recall when evaluated on three different datasets for mutation-disease associations (F-measures of 0.88 to 0.91), outperforming the existing systems. We encapsulated DiMeX in a Docker container to facilitate higher portability and parallel execution. The scalability and robustness of the system were validated by applying it to a large set of Medline abstracts. The extracted mutation-disease associations, coupled with additional relevant information that was also obtained from text, are available through a database. We believe that automatic extraction of mutation-disease associations from text will greatly assist life science researchers in understanding the effect of mutations in diseases.

The third aspect of this dissertation is the development of a text mining system, eGARD [34], that extracts the impact of genomic anomalies on drug responses from scientific literature. Genomic anomalies include not only mutations but also differential expression of genes and proteins. Drug responses represent a change in response rate, sensitivity to drugs, or outcome of treatment, which includes overall survival, progression-free survival, etc. eGARD is a natural language processing-based text mining system that exploits the syntactic nature of sentences coupled with various textual features to extract relations between genomic anomalies and drug response. Additionally, the system extracts information that helps determine the confidence level of extraction to support prioritization of curation. eGARD achieved precision, recall and F-measure of up to 0.95, 0.86 and 0.90, respectively, on annotated evaluation datasets created in-house and obtained externally from PharmGKB [24]. eGARD also achieved a true negative rate (TNR) of 0.97, indicating its ability to reject non-relevant articles, thereby saving valuable curation time. Similar to MeX and DiMeX, we encapsulated the system in a Docker container. eGARD was applied to a large set of Medline abstracts and the extraction results are incorporated in the iTextMine [98] system. We believe that eGARD will greatly help with personalized treatment decisions based on the enormous volumes of public data already available.

Finally, the last part of this dissertation concerns the development of a text mining system that extracts mutational impact on protein-protein interactions (PPI). In personalized medicine, it is crucial to understand the relations between genomic profiles and diseases or treatments. Such understanding is informed by knowledge of how mutations impact molecular functions. As a starting point, our system attempts to find the impact of mutations on PPI. We used the EDG relation extraction framework to extract the mutations, the PPI interactants and the connections between them. We adopted the approach from Chapter 2 to identify the mutated gene, thus completing the extraction as a 4-tuple. Evaluation results on an in-house annotated dataset (created from the BioCreative VI precision medicine task corpora) showed that the system achieved precision and recall of 0.95 and 0.76, respectively. We believe our system will greatly aid the rapidly growing field of precision medicine by quickly identifying mutation impact on PPIs from text and helping to speed up manual curation.

6.2 Future work

The research in this dissertation can be expanded in a few possible directions.

This dissertation aims to mine information from text that is related to genomic variations. Thus, it is important to increase the coverage of mutation mention detection in text. As the Sequence Variant Nomenclature from HGVS is periodically updated, mutation detection could be updated accordingly.

Currently, all the tools described in this thesis are applied to Medline abstracts only. The systems could be extended to PMC full-length articles, as full-length articles naturally contain more information than abstracts. Information extracted from full-length articles can be used to validate the results found in abstracts, thus boosting the confidence score of extracted relations. Full-length articles tend to contain valuable information regarding the experimental setup, such as the prospective or retrospective nature of the study, the prognostic or diagnostic nature of biomarkers, etc. This kind of information from full-length articles can be valuable for biocurators. The methods section of full-length articles usually describes the context of the experiment in detail, so context information, such as information regarding subjects, controls, diseases and drugs, could be extracted more accurately. We have often used various heuristic methods in conjunction with NLP concepts to obtain information. For example, in eGARD, we extract the disease and the drug from the context if the arguments of relations do not contain them. As full-length articles tend to repeat information that is stated in abstracts, extracting the same information from the full text can be used to validate the heuristic results, which are otherwise low-confidence.

In this dissertation, we identified mutational impact on protein-protein interactions (PPIs). As a future direction of research, mutation impact beyond that on PPIs could also be extracted. The existing system can be extended to post-translational modifications in a straightforward manner. The next step could be to extract mutation impact on protein functions and cellular processes. Such extractions will provide further insight into how mutations influence diseases, potentially leading to a better understanding of biological pathways.

Extracted results from the systems developed in this dissertation are accompanied by additional information that helps life science researchers or curators to prioritize the results. This gives one way of ranking the results for the users. However, with proper guidance from domain experts, the tools can integrate ranking algorithms that sort information based on users’ needs. This ranking could be interpreted as a confidence score on extraction results. As an alternative, machine learning based techniques could be used to devise a ranking or confidence score for extracted results.

REFERENCES

[1] Jun Zhang, Rod Chiodini, Ahmed Badr, and Genfa Zhang. The impact of next-generation sequencing on genomics. J. Genet. Genomics, 38(3):95–109, March 2011.

[2] Emidio Capriotti, Nathan L Nehrt, Maricel G Kann, and Yana Bromberg. Bioinformatics for personal genome interpretation. Brief. Bioinform., 13(4):495–512, July 2012.

[3] Akram Alyass, Michelle Turcotte, and David Meyre. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med. Genomics, 8:33, June 2015.

[4] Dietrich Rebholz-Schuhmann, Stephane Marcel, Sylvie Albert, Ralf Tolle, Georg Casari, and Harald Kirsch. Automatic extraction of mutations from medline and cross-validation with OMIM. Nucleic Acids Res., 32(1):135–142, January 2004.

[5] Florence Horn, Anthony L Lau, and Fred E Cohen. Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics, 20(4):557–568, March 2004.

[6] Muge Erdogmus and Osman Ugur Sezerman. Application of automatic mutation–gene pair extraction to diseases. J. Bioinform. Comput. Biol., 5(6):1261–1275, 2007.

[7] Lawrence C Lee, Florence Horn, and Fred E Cohen. Automatic extraction of protein point mutations using a graph bigram association. PLoS Comput. Biol., 3(2):e16, February 2007.

[8] J Gregory Caporaso, William A Baumgartner, Jr, David A Randolph, K Bretonnel Cohen, and Lawrence Hunter. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics, 23(14):1862–1865, July 2007.

[9] Yum Lina Yip, Nathalie Lachenal, Violaine Pillet, and Anne-Lise Veuthey. Retrieving mutation-specific information for human proteins in UniProt/Swiss-Prot knowledgebase. J. Bioinform. Comput. Biol., 5(6):1215–1231, December 2007.

[10] Philippe Thomas, Tim Rocktäschel, Jörg Hakenberg, Yvonne Lichtblau, and Ulf Leser. SETH detects and normalizes genetic variants in text. Bioinformatics, June 2016.

[11] Chih-Hsuan Wei, Bethany R Harris, Hung-Yu Kao, and Zhiyong Lu. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics, 29(11):1433–1439, June 2013.

[12] Ryan T McDonald, R Scott Winters, Mark Mandel, Yang Jin, Peter S White, and Fernando Pereira. An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics, 20(17):3249–3251, November 2004.

[13] Emily Doughty, Attila Kertesz-Farkas, Olivier Bodenreider, Gary Thompson, Asa Adadey, Thomas Peterson, and Maricel G Kann. Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature. Bioinformatics, 27(3):408–415, February 2011.

[14] A S M Ashique Mahmood, Tsung-Jung Wu, Raja Mazumder, and K Vijay-Shanker. DiMeX: A text mining system for Mutation-Disease association extraction. PLoS One, 11(4):e0152725, April 2016.

[15] UniProt Consortium. Activities at the universal protein resource (UniProt). Nucleic Acids Res., 42(Database issue):D191–8, January 2014.

[16] Simon A Forbes, Nidhi Bindal, Sally Bamford, Charlotte Cole, Chai Yin Kok, David Beare, Mingming Jia, Rebecca Shepherd, Kenric Leung, Andrew Menzies, Jon W Teague, Peter J Campbell, Michael R Stratton, and P Andrew Futreal. COSMIC: mining complete cancer genomes in the catalogue of somatic mutations in cancer. Nucleic Acids Res., 39(Database issue):D945–50, January 2011.

[17] Tsung-Jung Wu, Amirhossein Shamsaddini, Yang Pan, Krista Smith, Daniel J Crichton, Vahan Simonyan, and Raja Mazumder. A framework for organizing cancer-related variations from existing databases, publications and NGS data using a high-performance integrated virtual environment (HIVE). Database, 2014:bau022, March 2014.

[18] Joanna Amberger, Carol A Bocchini, Alan F Scott, and Ada Hamosh. McKusick’s online mendelian inheritance in man (OMIM). Nucleic Acids Res., 37(Database issue):D793–6, January 2009.

[19] Peter D Stenson, Matthew Mort, Edward V Ball, Katy Howells, Andrew D Phillips, Nick St Thomas, and David N Cooper. The human gene mutation database: 2008 update. Genome Med., 1(1):13, January 2009.

[20] Christophe Béroud, Dalil Hamroun, Gwenaëlle Collod-Béroud, Catherine Boileau, Thierry Soussi, and Mireille Claustres. UMD (universal mutation database): 2005 update. Hum. Mutat., 26(3):184–191, September 2005.

[21] Gudmundur A Thorisson, Owen Lancaster, Robert C Free, Robert K Hastings, Pallavi Sarmah, Debasis Dash, Samir K Brahmachari, and Anthony J Brookes. HGVbaseG2P: a central genetic association database. Nucleic Acids Res., 37(Database issue):D797–802, January 2009.

[22] Arti Singh, Adebayo Olowoyeye, Peter H Baenziger, Jessica Dantzer, Maricel G Kann, Predrag Radivojac, Randy Heiland, and Sean D Mooney. MutDB: update on development of tools for the biochemical analysis of genetic variation. Nucleic Acids Res., 36(Database issue):D815–9, January 2008.

[23] S T Sherry, M H Ward, M Kholodov, J Baker, L Phan, E M Smigielski, and K Sirotkin. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29(1):308–311, January 2001.

[24] Caroline F Thorn, Teri E Klein, and Russ B Altman. PharmGKB: the pharmacogenomics knowledge base. Methods Mol. Biol., 1015:311–320, 2013.

[25] Melissa J Landrum, Jennifer M Lee, George R Riley, Wonhee Jang, Wendy S Rubinstein, Deanna M Church, and Donna R Maglott. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res., 42(Database issue):D980–5, January 2014.

[26] J P Plazzer, R H Sijmons, M O Woods, P Peltomäki, B Thompson, J T Den Dunnen, and F Macrae. The InSiGHT database: utilizing 100 years of insights into lynch syndrome. Fam. Cancer, 12(2):175–180, June 2013.

[27] M Schenck, O Politz, and P Groth. Extraction of genetic mutations associated with cancer from public literature. Journal of Health & Medical, 2013.

[28] Dean Cheng, Craig Knox, Nelson Young, Paul Stothard, Sambasivarao Damaraju, and David S Wishart. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res., 36(Web Server issue):W399–405, July 2008.

[29] Komandur Elayavilli Ravikumar, Kavishwar B Wagholikar, Dingcheng Li, Jean-Pierre Kocher, and Hongfang Liu. Text mining facilitates database curation - extraction of mutation-disease associations from bio-medical literature. BMC Bioinformatics, 16:185, June 2015.

[30] Ayush Singhal, Michael Simmons, and Zhiyong Lu. Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature. J. Am. Med. Inform. Assoc., 23(4):766–772, July 2016.

[31] Karin M Verspoor, Go Eun Heo, Keun Young Kang, and Min Song. Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts. BMC Med. Inform. Decis. Mak., 16 Suppl 1:68, July 2016.

[32] Alexis Allot, Yifan Peng, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, and Zhiyong Lu. LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res., 46(W1):W530–W536, July 2018.

[33] Yifan Peng, Samir Gupta, Cathy Wu, and Vijay Shanker. An extended dependency graph for relation extraction in biomedical texts. Proceedings of BioNLP 15, pages 21–30, 2015.

[34] A S M Ashique Mahmood, Shruti Rao, Peter McGarvey, Cathy Wu, Subha Madhavan, and K Vijay-Shanker. eGARD: Extracting associations between genomic anomalies and drug responses from text. PLoS One, 12(12):e0189663, December 2017.

[35] Yael Garten and Russ B Altman. Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text. BMC Bioinformatics, 10 Suppl 2:S6, February 2009.

[36] Jörg Hakenberg, Dmitry Voronov, Võ Hà Nguyên, Shanshan Liang, Saadat Anwar, Barry Lumpkin, Robert Leaman, Luis Tari, and Chitta Baral. A SNPshot of PubMed to associate genetic variants with drugs, diseases, and adverse reactions. J. Biomed. Inform., 45(5):842–850, October 2012.

[37] Rong Xu and Quanqiu Wang. A knowledge-driven conditional approach to extract pharmacogenomics specific drug-gene relationships from free text. J. Biomed. Inform., 45(5):827–834, October 2012.

[38] Bastien Rance, Emily Doughty, Dina Demner-Fushman, Maricel G Kann, and Olivier Bodenreider. A mutation-centric approach to identifying pharmacoge- nomic relations in text. J. Biomed. Inform., 45(5):835–841, October 2012.

[39] Fabio Rinaldi, Gerold Schneider, and Simon Clematide. Relation mining experiments in the pharmacogenomics domain. J. Biomed. Inform., 45(5):851–861, October 2012.

[40] S Pakhomov, B T McInnes, J Lamba, Y Liu, G B Melton, Y Ghodke, N Bhise, V Lamba, and A K Birnbaum. Using PharmGKB to train text mining approaches for identifying potential gene targets for pharmacogenomic studies. J. Biomed. Inform., 45(5):862–869, October 2012.

[41] MEDLINE®/PubMed® resources guide. November 2006.

[42] Andrew Chatr-Aryamontri, Rose Oughtred, Lorrie Boucher, Jennifer Rust, Christie Chang, Nadine K Kolas, Lara O’Donnell, Sara Oster, Chandra Theesfeld, Adnane Sellam, Chris Stark, Bobby-Joe Breitkreutz, Kara Dolinski, and Mike Tyers. The BioGRID interaction database: 2017 update. Nucleic Acids Res., 45(D1):D369–D379, January 2017.

[43] Sandra Orchard, Mais Ammari, Bruno Aranda, Lionel Breuza, Leonardo Briganti, Fiona Broackes-Carter, Nancy H Campbell, Gayatri Chavali, Carol Chen, Noemi del Toro, Margaret Duesbury, Marine Dumousseau, Eugenia Galeota, Ursula Hinz, Marta Iannuccelli, Sruthi Jagannathan, Rafael Jimenez, Jyoti Khadake, Astrid Lagreid, Luana Licata, Ruth C Lovering, Birgit Meldal, Anna N Melidoni, Mila Milagros, Daniele Peluso, Livia Perfetto, Pablo Porras, Arathi Raghunath, Sylvie Ricard-Blum, Bernd Roechert, Andre Stutz, Michael Tognolli, Kim van Roey, Gianni Cesareni, and Henning Hermjakob. The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res., 42(Database issue):D358–63, January 2014.

[44] Martin Krallinger, Florian Leitner, Carlos Rodriguez-Penagos, and Alfonso Valencia. Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol., 9 Suppl 2:S4, September 2008.

[45] Domonkos Tikk, Philippe Thomas, Peter Palaga, Jörg Hakenberg, and Ulf Leser. A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Comput. Biol., 6:e1000837, July 2010.

[46] Sofie Van Landeghem, Yvan Saeys, Bernard De Baets, and Yves Van de Peer. Extracting protein-protein interactions from text using rich feature vectors and feature selection. In 3rd International symposium on Semantic Mining in Biomedicine (SMBM 2008), pages 77–84. Turku Centre for Computer Sciences (TUCS), 2008.

[47] Zhehuan Zhao, Zhihao Yang, Hongfei Lin, Jian Wang, and Song Gao. A protein-protein interaction extraction approach based on deep neural network. Int. J. Data Min. Bioinform., 15(2):145–164, January 2016.

[48] Yifan Peng and Zhiyong Lu. Deep learning for extracting protein-protein interactions from biomedical literature. In BioNLP 2017, pages 29–38, Stroudsburg, PA, USA, 2017. Association for Computational Linguistics.

[49] Nona Naderi and René Witte. Automated extraction and semantic analysis of mutation impacts from the biomedical literature. BMC Genomics, 13 Suppl 4:S10, June 2012.

[50] Qingyu Chen, Nagesh C Panyam, Aparna Elangovan, Melissa Davis, and Karin Verspoor. Document triage and relation extraction for Protein-Protein interactions affected by mutations. Emu, 6(900):52–51, 2017.

[51] Tung Tran and Ramakanth Kavuluru. Exploring a deep learning pipeline for the BioCreative VI precision medicine task. In Proceedings of the BioCreative VI Workshop. page 106, volume 109, 2017.

[52] Rezarta Islamaj Dogan, Andrew Chatr-aryamontri, Sun Kim, Chih-Hsuan Wei, Yifan Peng, Donald Comeau, and Zhiyong Lu. BioCreative VI precision medicine track: creating a training corpus for mining protein-protein interactions affected by mutations. BioNLP 2017, pages 171–175, 2017.

[53] Karin Verspoor, Antonio Jimeno Yepes, Lawrence Cavedon, Tara McIntosh, Asha Herten-Crabb, Zoë Thomas, and John-Paul Plazzer. Annotating the biomedical literature for the human variome. Database, 2013:bat019, April 2013.

[54] Johan T den Dunnen, Raymond Dalgleish, Donna R Maglott, Reece K Hart, Marc S Greenblatt, Jean McGowan-Jordan, Anne-Francoise Roux, Timothy Smith, Stylianos E Antonarakis, and Peter E M Taschner. HGVS recommendations for the description of sequence variants: 2016 update. Hum. Mutat., 37(6):564–569, June 2016.

[55] Meenakshi Narayanaswamy, K E Ravikumar, and K Vijay-Shanker. A biological named entity recognizer. Pac. Symp. Biocomput., pages 427–438, 2003.

[56] Antonio Jimeno Yepes and Karin Verspoor. Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Res., 3:18, January 2014.

[57] Malachi Griffith, Nicholas C Spies, Kilannin Krysiak, Joshua F McMichael, Adam C Coffman, Arpad M Danos, Benjamin J Ainscough, Cody A Ramirez, Damian T Rieke, Lynzey Kujan, Erica K Barnell, Alex H Wagner, Zachary L Skidmore, Amber Wollam, Connor J Liu, Martin R Jones, Rachel L Bilski, Robert Lesurf, Yan-Yang Feng, Nakul M Shah, Melika Bonakdar, Lee Trani, Matthew Matlock, Avinash Ramu, Katie M Campbell, Gregory C Spies, Aaron P Graubert, Karthik Gangavarapu, James M Eldred, David E Larson, Jason R Walker, Benjamin M Good, Chunlei Wu, Andrew I Su, Rodrigo Dienstmann, Adam A Margolin, David Tamborero, Nuria Lopez-Bigas, Steven J M Jones, Ron Bose, David H Spencer, Lukas D Wartman, Richard K Wilson, Elaine R Mardis, and Obi L Griffith. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat. Genet., 49(2):170–174, January 2017.

[58] Docker. https://www.docker.com/. Accessed: 2018-8-4.

[59] John D Burger, Emily Doughty, Ritu Khare, Chih-Hsuan Wei, Rajashree Mishra, John Aberdeen, David Tresner-Kirsch, Ben Wellner, Maricel G Kann, Zhiyong Lu, and Lynette Hirschman. Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing. Database, 2014, January 2014.

[60] Rainer Winnenburg, Conrad Plake, and Michael Schroeder. Improved mutation tagging with gene identifiers applied to membrane protein stability prediction. BMC Bioinformatics, 10 Suppl 8:S3, August 2009.

[61] Süveyda Yeniterzi and Ugur Sezerman. EnzyMiner: automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts. BMC Bioinformatics, 10 Suppl 8:S2, August 2009.

[62] Laura I Furlong, Holger Dach, Martin Hofmann-Apitius, and Ferran Sanz. OSIRISv1.2: a named entity recognition system for sequence variants of genes in biomedical literature. BMC Bioinformatics, 9:84, February 2008.

[63] Rebecca E Saunders and Stephen J Perkins. CoagMDB: a database analysis of missense mutations within four conserved domains in five vitamin k-dependent coagulation serine proteases using a text-mining tool. Hum. Mutat., 29(3):333–344, March 2008.

[64] Min Song, Won Chul Kim, Dahee Lee, Go Eun Heo, and Keun Young Kang. PKDE4J: Entity and relation extraction for public knowledge discovery. J. Biomed. Inform., 57:320–332, October 2015.

[65] Kyubum Lee, Sunwon Lee, Sungjoon Park, Sunkyu Kim, Suhkyung Kim, Kwanghun Choi, Aik Choon Tan, and Jaewoo Kang. BRONCO: Biomedical entity relation ONcology COrpus for extracting gene-variant-disease-drug relations. Database, 2016, April 2016.

[66] Behrouz Bokharaeian, Alberto Diaz, Nasrin Taghizadeh, Hamidreza Chitsaz, and Ramyar Chavoshinejad. SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature. J. Biomed. Semantics, 8(1):14, April 2017.

[67] Marie-Catherine de Marneffe and Christopher D Manning. The stanford typed dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, CrossParser ’08, pages 1–8, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[68] Samir Gupta, Asm Ashique Mahmood, Karen Ross, Cathy Wu, and K Vijay-Shanker. Identifying comparative structures in biomedical text. BioNLP 2017, pages 206–215, 2017.

[69] Larry McKnight and Padmini Srinivasan. Categorization of sentence types in medical abstracts. AMIA Annu. Symp. Proc., pages 440–444, 2003.

[70] K Hirohata, N Okazaki, S Ananiadou, M Ishizuka, and others. Identifying sections in scientific abstracts using conditional random fields. 2008.

[71] Su Nam Kim, David Martinez, Lawrence Cavedon, and Lars Yencken. Automatic classification of sentences to support evidence based medicine. BMC Bioinformatics, 12 Suppl 2:S5, March 2011.

[72] Maria Liakata, Shyamasree Saha, Simon Dobnik, Colin Batchelor, and Dietrich Rebholz-Schuhmann. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7):991–1000, April 2012.

[73] Makoto Miwa, Paul Thompson, John McNaught, Douglas B Kell, and Sophia Ananiadou. Extracting semantically enriched events from biomedical literature. BMC Bioinformatics, 13:108, May 2012.

[74] Paul Thompson, Raheel Nawaz, John McNaught, and Sophia Ananiadou. Enriching a biomedical event corpus with meta-knowledge annotation. BMC Bioinformatics, 12:393, October 2011.

[75] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res., 41(Web Server issue):W518–22, July 2013.

[76] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Stroudsburg, PA, USA, 2014. Association for Computational Linguistics.

[77] Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 173–180, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

[78] David McClosky and Eugene Charniak. Self-training for biomedical parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short ’08, pages 101–104, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[79] Apache spark - unified analytics engine for big data. https://spark.apache.org/. Accessed: 2018-8-4.

[80] Heidi L Rehm, Jonathan S Berg, Lisa D Brooks, Carlos D Bustamante, James P Evans, Melissa J Landrum, David H Ledbetter, Donna R Maglott, Christa Lese Martin, Robert L Nussbaum, Sharon E Plon, Erin M Ramos, Stephen T Sherry, Michael S Watson, and ClinGen. ClinGen–the clinical genome resource. N. Engl. J. Med., 372(23):2235–2242, June 2015.

[81] M A Levy, C M Lovly, L Horn, R Naser, and W Pao. My cancer genome: Web-based clinical decision support for genome-directed lung cancer treatment. J. Clin. Oncol., 29(15 suppl):7576–7576, May 2011.

[82] Wilco W M Fleuren and Wynand Alkema. Application of text mining in the biomedical domain. Methods, 74:97–106, March 2015.

[83] Fei Zhu, Preecha Patumcharoenpol, Cheng Zhang, Yang Yang, Jonathan Chan, Asawin Meechai, Wanwipa Vongsangnak, and Bairong Shen. Biomedical text mining and its applications in cancer research. J. Biomed. Inform., 46(2):200–211, April 2013.

[84] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. GNormPlus: An integrative approach for tagging genes, gene families, and protein domains. Biomed Res. Int., 2015:918710, August 2015.

[85] Robert Leaman, Rezarta Islamaj Dogan, and Zhiyong Lu. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917, November 2013.

[86] Robert Leaman, Chih-Hsuan Wei, and Zhiyong Lu. tmchem: a high performance approach for chemical named entity recognition and normalization. J. Cheminform., 7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S3, January 2015.

[87] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. SR4GN: a species recognition software tool for gene normalization. PLoS One, 7(6):e38460, June 2012.

[88] Ayush Singhal, Michael Simmons, and Zhiyong Lu. Text mining Genotype-Phenotype relationships from biomedical literature for database curation and precision medicine. PLoS Comput. Biol., 12(11):e1005017, November 2016.

[89] Kyubum Lee, Wonho Shin, Byounggun Kim, Sunwon Lee, Yonghwa Choi, Sunkyu Kim, Minji Jeon, Aik Choon Tan, and Jaewoo Kang. HiPub: translating PubMed and PMC texts to networks for knowledge discovery. Bioinformatics, 32(18):2886–2888, September 2016.

[90] Vivian Law, Craig Knox, Yannick Djoumbou, Tim Jewison, An Chi Guo, Yifeng Liu, Adam Maciejewski, David Arndt, Michael Wilson, Vanessa Neveu, Alexandra Tang, Geraldine Gabriel, Carol Ly, Sakina Adamjee, Zerihun T Dame, Beomsoo Han, You Zhou, and David S Wishart. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res., 42(Database issue):D1091–7, January 2014.

[91] Kyubum Lee, Byounggun Kim, Yonghwa Choi, Sunkyu Kim, Wonho Shin, Sunwon Lee, Sungjoon Park, Seongsoon Kim, Aik Choon Tan, and Jaewoo Kang. Deep learning of mutation-gene-drug relations from the literature. BMC Bioinformatics, 19(1):21, January 2018.

[92] Yael Garten, Adrien Coulet, and Russ B Altman. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics, 11(10):1467–1489, October 2010.

[93] A Coulet, K B Cohen, and R B Altman. The state of the art in text mining and natural language processing for pharmacogenomics. J. Biomed. Inform., 2012.

[94] Samir Gupta, Karen E Ross, Catalina O Tudor, Cathy H Wu, Carl J Schmidt, and K Vijay-Shanker. miRiaD: A text mining tool for detecting associations of microRNAs with diseases. J. Biomed. Semantics, 7(1):9, April 2016.

[95] Yifan Peng, Manabu Torii, Cathy H Wu, and K Vijay-Shanker. A generalizable NLP framework for fast development of pattern-based biomedical relation extraction systems. BMC Bioinformatics, 15:285, August 2014.

[96] Warren A Kibbe, Cesar Arze, Victor Felix, Elvira Mitraka, Evan Bolton, Gang Fu, Christopher J Mungall, Janos X Binder, James Malone, Drashtti Vasant, Helen Parkinson, and Lynn M Schriml. Disease ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res., 43(Database issue):D1071–8, January 2015.

[97] Ariel S Schwartz and Marti A Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. Pac. Symp. Biocomput., pages 451–462, 2003.

[98] Jia Ren, Gang Li, and Cathy H Wu. iTextMine: Integrated text mining system for Large-Scale knowledge extraction from literature.

[99] Shruti Rao, Robert A Beckman, Shahla Riazi, Cinthya S Yabar, Simina M Boca, John L Marshall, Michael J Pishvaian, Jonathan R Brody, and Subha Madhavan. Quantification and expert evaluation of evidence for chemopredictive biomarkers to personalize cancer treatment. Oncotarget, 8(23):37923–37934, June 2017.

[100] Thomas A Peterson, Emily Doughty, and Maricel G Kann. Towards precision medicine: advances in computational approaches for the analysis of human variants. J. Mol. Biol., 425(21):4047–4063, November 2013.

[101] National institutes of health (NIH) — all of us. https://allofus.nih.gov/. Accessed: 2018-8-5.

[102] Andrew Chatr-Aryamontri, Bobby-Joe Breitkreutz, Rose Oughtred, Lorrie Boucher, Sven Heinicke, Daici Chen, Chris Stark, Ashton Breitkreutz, Nadine Kolas, Lara O’Donnell, Teresa Reguly, Julie Nixon, Lindsay Ramage, Andrew Winter, Adnane Sellam, Christie Chang, Jodi Hirschman, Chandra Theesfeld, Jennifer Rust, Michael S Livstone, Kara Dolinski, and Mike Tyers. The BioGRID interaction database: 2015 update. Nucleic Acids Res., 43(Database issue):D470–8, January 2015.

[103] Juan Miguel Cejuela, Aleksandar Bojchevski, Carsten Uhlig, Rustem Bekmukhametov, Sanjeev Kumar Karn, Shpend Mahmuti, Ashish Baghudana, Ankit Dubey, Venkata P Satagopam, and Burkhard Rost. nala: text mining natural language mutation mentions. Bioinformatics, 33(12):1852–1858, June 2017.

[104] Rezarta Islamaj Dogan, Andrew Chatr-aryamontri, Sun Kim, Chih-Hsuan Wei, Yifan Peng, Donald Comeau, and Zhiyong Lu. BioCreative VI precision medicine track: creating a training corpus for mining protein-protein interactions affected by mutations. In BioNLP 2017, pages 171–175, Stroudsburg, PA, USA, 2017. Association for Computational Linguistics.

[105] Haiming Tang and Paul D Thomas. Tools for predicting the functional impact of nonsynonymous genetic variation. Genetics, 203(2):635–647, June 2016.

Appendix A

LEXICO-SYNTACTIC PATTERNS

A.1 Patterns for Association type sentences
• NP_1 VG_passive{associate} with NP_2
• NP_1 VG_active{associate} with NP_2
• NP{associate} of NP_1 with/and NP_2
• NP{associate} between NP_1 and NP_2
• NP{associate} in NP_1 at NP_2
• NP_1 [VG_active] NP{associate} with NP_2
• NP{role} of/for NP_1 in NP_2
• NP_1 [VG_active] NP{role} in NP_2
• NP_1 VG_active{contribute} to NP_2
• NP{contribute} of NP_1 in/to NP_2
• NP{effect} of NP_1 in/on NP_2
• NP_1 [VG_active] NP{effect} in/on NP_2
• NP_1 VG_active{affect} NP_2
• NP_1 [VG_active] NP{susceptibility} to NP_2
• NP_1 VG_passive{susceptible} for NP_2
• NP_1 [VG_active] NP{risk} for NP_2
• NP_1 VG_passive{involve} in NP_2
• NP_1 VG_active{confer} NP_2
• NP{relationship} between NP_1 and NP_2
• NP{link} between NP_1 and NP_2
• NP_1 VG_passive{correlate} with NP_2
• NP_1 VG_active{correlate} with NP_2
• NP{correlation} of NP_1 and NP_2

A.2 Patterns for Comparison type sentences
• VG_active{compare} to/with NP_1, NP_2 VG_active
• NP{increased/decreased/more/less/higher/lower/longer/shorter} in NP_3
• NP_1 VG_active NP{difference} between NP_2 and NP_3
• NP{difference} of/in NP_1 between NP_2 and NP_3
• NP_1 VG_active NP{increased/decreased/more/less/higher/lower/longer/shorter} NP_2 NP{compare} to/with NP_3
• NP_1 VG_active NP{increased/decreased/more/less/higher/lower/longer/shorter} NP_2 than NP_3
• NP_1 VG_passive{increased/decreased/more/less/higher/lower/longer/shorter} in NP_2 NP{compare} to/with NP_3
• NP_1 VG_passive{increased/decreased/more/less/higher/lower/longer/shorter} in NP_2 than NP_3

A.3 Patterns for Sensitization type sentences
• NP_1 VG_active{sensitize} NP{disease (cells)} to NP_2
• NP{disease (cells)} VG_passive{sensitize} to NP_1 by NP_2

A.4 Patterns for Biomarker type sentences
• NP_1 VG_is_a{is a/acts as/are/functions as/serves as} P{predictor/biomarker/marker} of/for NP_2
• NP_1 VG_active{predict/indicate} NP_2

A.5 Patterns for drug detection
• NP{treatment} with NP{drugname}
• NP{patients} VG{treat} with {drugname}
• NP{patients} VG{receive} NP{drugname}
• NP{drugname therapy}
• NP{efficacy/sensitivity} of NP{drugname}
• NP{response/sensitivity/resistance} to NP{drugname}
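
As an illustration only, the sketch below shows one way a pattern from this appendix could be encoded and matched against a chunked sentence. It encodes the first pattern of A.1 (a noun phrase, a passive verb group headed by "associate", the word "with", and a second noun phrase). The Chunk and Slot classes, the match function, and the example sentence are all hypothetical and are not taken from the system described in this dissertation. The sketch also assumes that chunking, lemmatization, and voice detection were done upstream, and it only matches contiguous chunks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    kind: str        # "NP", "VG", or "W" for a literal word (hypothetical labels)
    text: str        # surface text of the chunk
    head: str = ""   # lemma of the head word, assumed to be computed upstream
    voice: str = ""  # "active" or "passive" for verb groups

@dataclass(frozen=True)
class Slot:
    kind: str                        # chunk kind this slot accepts
    heads: frozenset = frozenset()   # allowed head lemmas; empty means any
    voice: str = ""                  # required voice; empty means any
    words: frozenset = frozenset()   # allowed literal words when kind is "W"
    name: str = ""                   # capture name; empty means do not capture

def slot_matches(slot, chunk):
    """Return True if a single chunk satisfies a single pattern slot."""
    if slot.kind != chunk.kind:
        return False
    if slot.kind == "W":
        return chunk.text.lower() in slot.words
    if slot.heads and chunk.head not in slot.heads:
        return False
    if slot.voice and chunk.voice != slot.voice:
        return False
    return True

def match(pattern, chunks):
    """Return captured arguments for the first contiguous match, else None."""
    n = len(pattern)
    for start in range(len(chunks) - n + 1):
        window = chunks[start:start + n]
        if all(slot_matches(s, c) for s, c in zip(pattern, window)):
            return {s.name: c.text for s, c in zip(pattern, window) if s.name}
    return None

# First pattern of A.1: NP, passive VG headed by "associate", "with", NP.
ASSOCIATED_WITH = [
    Slot("NP", name="NP1"),
    Slot("VG", heads=frozenset({"associate"}), voice="passive"),
    Slot("W", words=frozenset({"with"})),
    Slot("NP", name="NP2"),
]

# Invented example sentence, already chunked by hand.
sentence = [
    Chunk("NP", "The mutation", head="mutation"),
    Chunk("VG", "was associated", head="associate", voice="passive"),
    Chunk("W", "with"),
    Chunk("NP", "drug resistance", head="resistance"),
]

result = match(ASSOCIATED_WITH, sentence)
print(result)
```

Keeping each pattern as plain data in this way means new patterns can be added without changing the matcher. Optional elements such as [VG_active] would need an extension that this sketch does not cover.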

Appendix B

GENE-DRUG COMBINATIONS

The seven gene-drug combinations used to create the datasets InHouseSet1 and InHouseSet2 in Chapter 4 are listed below.

ERCC1, platinum
MGMT, temozolomide
RRM1, gemcitabine
TOP2A, anthracyclines
TOPO1, irinotecan & topotecan
TUBB3, taxanes
TYMS, 5-FU & capecitabine

The FDA-approved drugs for 13 genes in cancer studies that were used to filter the PharmGKB dataset in Chapter 4 are listed below.

BRAF, vemurafenib
BRAF, dabrafenib
BRAF, trametinib
EML4-ALK, crizotinib
EML4-ALK, ceritinib
EML4-ALK, alectinib
EGFR, erlotinib
EGFR, gefitinib
EGFR, afatinib
EGFR, osimertinib
EGFR, cetuximab
EGFR, panitumumab
ERBB2, trastuzumab
ERBB2, lapatinib
ERBB2, pertuzumab
ERBB2, ado-trastuzumab and emtansine
BRCA1/2, olaparib
SMO and PTCH1, vismodegib
SMO and PTCH1, sonidegib
CDK4/6, palbociclib
cKIT, imatinib
cKIT, sunitinib
cKIT, regorafenib
mTOR, everolimus
mTOR, temsirolimus
COL1A1-PDGFB, imatinib
VEGF/VEGFR, bevacizumab
VEGF/VEGFR, ramucirumab
VEGF/VEGFR, regorafenib
VEGF/VEGFR, ziv-aflibercept
VEGF/VEGFR, axitinib
VEGF/VEGFR, pazopanib
VEGF/VEGFR, sunitinib
VEGF/VEGFR, sorafenib
RET, vandetanib
RET, cabozantinib
RET, lenvatinib
PDGFRA, imatinib
PDGFRA, sunitinib
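
A minimal sketch of how a gene-drug list like the one above might be used to filter a set of annotation records is given below. It is illustrative only. The record layout (dictionaries with "gene" and "drug" keys), the function names, and the sample records are assumptions made for the example. They do not reflect the actual PharmGKB export format or the filtering code used in Chapter 4, and only three pairs from the list are included.

```python
# A small subset of the gene-drug pairs listed in this appendix.
ALLOWED_PAIRS = {
    ("BRAF", "vemurafenib"),
    ("EGFR", "erlotinib"),
    ("RET", "vandetanib"),
}

def normalize(gene, drug):
    """Normalize case and whitespace so comparisons are not spelling-sensitive."""
    return gene.strip().upper(), drug.strip().lower()

def filter_records(records, allowed_pairs):
    """Keep only the records whose (gene, drug) pair is in the allowed set."""
    allowed = {normalize(g, d) for g, d in allowed_pairs}
    return [r for r in records if normalize(r["gene"], r["drug"]) in allowed]

# Invented records in a hypothetical layout.
records = [
    {"gene": "BRAF", "drug": "Vemurafenib", "pmid": "0000001"},
    {"gene": "egfr", "drug": "erlotinib ", "pmid": "0000002"},
    {"gene": "TP53", "drug": "cisplatin", "pmid": "0000003"},
]

kept = filter_records(records, ALLOWED_PAIRS)
print([r["pmid"] for r in kept])
```

Entries that name more than one gene or drug on a line, such as "SMO and PTCH1" or "BRCA1/2", would have to be expanded into individual pairs before a lookup of this kind could be applied.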

Appendix C

PERMISSIONS

Parts of Chapters 2, 3, and 4 were published as articles in PLOS ONE, an Open Access journal. Its copyright policy is as follows:

“PLOS applies the Creative Commons Attribution (CC BY) license to works we publish. This license was developed to facilitate Open Access-namely, free immediate access to, and unrestricted reuse of, original works of all types. Under this license, authors agree to make articles legally available for reuse, without permission or fees, for virtually any purpose. Anyone may copy, distribute or reuse these articles, as long as the author and original source are properly cited. Additionally, the journal platform that PLOS uses to publish research articles is Open Source.” (Source: https://www.plos.org/open-access)
