Measuring Relatedness Between Scientific Entities in Annotation
Total Page:16
File Type:pdf, Size:1020Kb
Measuring Relatedness Between Scientific Entities in Annotation Datasets Guillermo Palma Maria-Esther Vidal Eric Haag Universidad Simón Bolívar Universidad Simón Bolívar University of Maryland Caracas, Venezuela Caracas, Venezuela College Park, USA [email protected] [email protected] [email protected] Louiqa Raschid Andreas Thor University of Maryland University of Leipzig College Park, USA Germany [email protected] [email protected] leipzig.de ABSTRACT 1. INTRODUCTION Linked Open Data has made available a diversity of scien- Linked Open Data has made available a diversity of scien- tific collections where scientists have annotated entities in tific collections. Scientists have annotated entities in the col- the datasets with controlled vocabulary terms (CV terms) lections with controlled vocabulary terms (CV terms) from from ontologies. These semantic annotations encode scien- ontologies or taxonomies. Annotations describe properties tific knowledge which is captured in annotation datasets. of these concepts. For example, the functions of genes are One can mine these datasets to discover relationships and described using Gene Ontology (GO) CV terms. patterns between entities. Determining the relatedness (or similarity) between entities becomes a building block for Annotations induce an annotation graph where nodes corre- graph pattern mining, e.g., identifying drug-drug relation- spond to scientific concepts or ontology terms, and edges ships could depend on the similarity of the diseases (condi- represent relationships between concepts. Figure 1 illus- tions) that are associated with each drug. Diverse similarity trates a portion of the Web of Data that can induce an metrics have been proposed in the literature, e.g., i) string- annotation graph. Consider clinical trials linked to a set of similarity metrics; ii) path-similarity metrics; iii) topological- diseases or conditions in the NCI Thesaurus (NCIt). Clin- similarity metrics; all measure relatedness in a given taxon- ical trials from LinkedCT1 are represented by blue ovals; omy or ontology. In this paper, we consider a novel annota- they are associated with interventions or drugs (green rect- tion similarity metric AnnSim that measures the relatedness angles) and diseases or conditions (pink rectangles). Both between two entities in terms of the similarity of their an- interventions and conditions are then annotated with terms notations. We model AnnSim as a 1-to-1 maximal weighted from the NCI Thesaurus (red circles). Some annotations of bipartite match, and we exploit properties of existing solvers a drug may correspond to terms in the NCIt that identify to provide an efficient solution. We empirically study the ef- the drug, while others may correspond to the diseases or fectiveness of AnnSim on real-world datasets of genes and conditions that have been treated with this drug. their GO annotations, clinical trials, and a human disease benchmark. Our results suggest that AnnSim can provide a Knowledge captured within scientific collections, the annota- deeper understanding of the relatedness of concepts and can tions and the ontologies are rich and complex. For example, provide an explanation of potential novel patterns. the NCI Thesaurus version 12.05d has 93,788 terms. The LinkedCT dataset circa September 2011 includes 142,207 in- terventions, 167,012 conditions or diseases, and 166,890 links Categories and Subject Descriptors to DBPedia, DrugBank, and Diseasome. Thus, the challenge H.4 [Information Systems Applications]: Miscellaneous is to explore these rich and complex datasets and to dis- cover patterns, e.g., patterns of annotations across multiple Keywords disease conditions or multiple drug interventions. For gene Annotation datasets; topological distance; Annotation sim- functional annotations, patterns may involve cross-genome ilarity; weighted bipartite match. functional annotation, e.g., combining the GO functional annotations of two model organisms such as Arabidopsis thaliana (a plant) and Caenorhabditis elegans (a nematode or worm), to predict new functions. As a first step to discovering complex patterns, we consider an important building block that determines the relatedness (or similarity) of a pair of scientific concepts, based on their annotations with respect to one or more ontologies. An ex- 1http://linkedct.org/ ample is identifying the relatedness or similarity of (drug, drug) pairs, based on the annotation evidence of diseases (conditions) from the NCIt. This can lead to discoveries of new targets for existing drugs, or it can predict potential side-effects of drugs. NCIt:C3211 Non-Hodgkin Lymphoma Alemtuzumab NCIt:C9300 NCT00690066 Lymphoma NCIt:C9357 Rituximab Diabetes NCT01559025 NCIt:C2985 Mellitus NCT00669318 Leukemia Pentostatin NCT00563004 Sargramostin NCT00837759 Figure 2: Annotation subgraph representing the annotations of Brentuximab vedotin and Catumaxomab. Interventions are green rectangles; conditions are pink rectangles; ontology terms in the NCIt are red ovals. Figure 1: Annotation graph of Clinical Trials from A naive implementation of AnnSim would require us to com- LinkedCT (blue ovals). Interventions are green rect- pute the topological similarity of all pairs of ontology terms, angles; conditions are pink rectangles and CV terms or the cartesian product, over the two sets of NCIt ontology from the NCI Thesaurus are red ovals. terms that annotate each of the pair of concepts. However, many of these pairs of terms may be unrelated. We model A broad variety of similarity metrics have been proposed AnnSim as a 1-to-1 maximal weighted bipartite matching, in the literature [9, 12, 14, 15, 16, 17, 19, 20, 23, 27, and we exploit properties of existing solvers to provide an 29]. Existing similarity metrics can be of diverse types: efficient solution. i) string-similarity metrics that measure similarity using (ap- proximate) string matching functions (e.g., [11]); ii) path- We empirically study the effectiveness of AnnSim on real- similarity metrics such as PathSim and HeteSim that com- world datasets of genes and their GO annotations, evidence pute relatedness based on the paths that connect concepts from clinical trials, and a well known human disease bench- in a graph (e.g., [23, 27]); and iii) topological-similarity met- mark. We compare the quality of AnnSim with respect to rics that measure relatedness in terms of the closeness of CV existing similarity metrics including d [6], d [18], Het- terms in a given taxonomy or ontology (e.g., [6, 15, 18]). tax ps eSim [23]. Example 1.1. Antineoplastic agents and monoclonal an- We use the sequence-based similarity for genes based on the tibodies are two popular and independent intervention regimes normalized Smith and Waterman scores [25] computed by that have been successfully applied to treat a large range of BLAST2, further normalized as suggested in [8], as ground cancers. There are 12 drugs that fall within their intersec- truth for genes. We also use the evolutionary phylogenetic tion, and scientists are interested in studying the relation- tree for a family of related genes as ground truth. ships between these drugs and the corresponding diseases. Consider the two drugs Brentuximab vedotin and Catumax- The contributions of this paper can be summarized as fol- omab. Figure 2 represents a subgraph of the annotation graph lows: of Figure 1. Each path between a pair of conditions, e.g., Carcinoma and Anaplastic Large Cell Lymphoma through the NCI Thesaurus is identified using red circles which repre- • The formalization of an annotation-based similarity sent ontology terms from the NCIt. The count of red circles metric AnnSim that defines the relatedness of two con- represents the length of a path. To simplify the figure, we cepts in terms of the sets of their annotations. An only illustrate the paths from the term Carcinoma. implementation relies on an existing 1-to-1 maximal weighted bipartite graph matching solver. In this paper, we propose a novel annotation similarity met- • An empirical study that validates AnnSim using a vari- ric AnnSim, that measures the relatedness between two en- ety of ground truth datasets including human curation tities in terms of the similarity or relatedness of the sets of as well as sequence based and phylogenetic evidence. their annotations. AnnSim combines properties of path- and topological-based similarity metrics to decide the relatedness • Our results suggest that AnnSim can provide a deeper of two scientific concepts. To the best of our knowledge, our understanding of the relatedness of concepts, and in research is the first to consider both the shared annotations some cases it can also provide an explanation of pat- between a pairs of concepts, as well as the relatedness of the terns. annotations (CV terms) within some ontology, to determine the resulting relatedness of the two concepts. 2http://blast.ncbi.nlm.nih.gov/ This paper is organized as follows: Section 2 summarizes re- the relatedness of the sets of annotations. Further, we use lated work. Section 3 gives the preliminary knowledge of this an ontology structure to determine ontological relatedness. work and illustrates the performance of existing approaches We extend the Dice coefficient to measure set agreement be- in a real-world example. Section 4 presents our approach. tween the sets of annotations in the 1-1 weighted maximal Experimental results are reported in Section 5. Finally, we bipartite match; the AnnSim score will be penalized if one conclude in Section 6 with an outlook to future