A Framework for Exploring Associations Between Biomedical Terms in PubMed
www.impactjournals.com/oncotarget/
Oncotarget, 2017, Vol. 8, (No. 61), pp: 103100-103107

Research Paper

Haixiu Yang1,*, Lingling Zhao2,*, Ying Zhang3, Hong Ju4, Dong Wang5, Yang Hu6, Jun Zhang3 and Liang Cheng1

1 College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, PR China
2 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, PR China
3 Department of Rehabilitation and Pharmacy, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin 150088, PR China
4 Department of Information Engineering, Heilongjiang Biological Science and Technology Career Academy, Harbin 150001, PR China
5 Department of Academic Research, Heilongjiang University of Science & Technology, Harbin 150022, PR China
6 School of Life Science and Technology, Harbin Institute of Technology, Harbin 150001, PR China
* These authors contributed equally to this work

Correspondence to: Liang Cheng, email: [email protected]; Jun Zhang, email: [email protected]; Yang Hu, email: [email protected]

Keywords: co-occurrence relationship, text mining, framework, term association

Received: June 12, 2017. Accepted: September 08, 2017. Published: October 05, 2017

Copyright: Yang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 (CC BY 3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

ABSTRACT

Co-occurrence relationships between terms in PubMed accelerate the recognition of term associations. The lack of manually curated relationships in vocabularies and the rapid growth of the biomedical literature highlight the importance of co-occurrence relationships.
Here we propose a framework to explore term associations based on a standard procedure that combines multiple text-mining tools and relationship degree calculation methods. The text of PubMed was first segmented into sentences by Apache OpenNLP, and the terms in each sentence were then recognized by MGREP. Two terms occurring in a common sentence were then identified as a co-occurrence relationship. The relationship degree was calculated using the Normalized MEDLINE Distance (NMD) or relationship-scaled score (RSS) method. The framework was used to explore associations between terms of Gene Ontology (GO) and Disease Ontology (DO) based on co-occurrence relationships. Results show that pairs of terms with more co-occurrence relationships share more semantic relationships of ontology and more genes. The associated terms identified from co-occurrence relationships were applied in constructing a disease association network (DAN). The small giant component is consistent with the observation that diseases in the same class have more linkage than diseases in different classes.

INTRODUCTION

A large number of biomedical terms have emerged with the rapid development of research and the growth of the literature. Associations between these terms play important roles in exploring literature and linking term-related molecules [1, 2]. For example, Medical Subject Headings (MeSH) [3] classifies biomedical terms into 16 categories based on set inclusion relationships between terms, to ease the storage and retrieval of literature [3]. As terms have been applied to annotating the function of molecules, the advantages of associations between terms have gradually become apparent. Recent studies have utilized these term associations in constructing functional similarity networks of non-coding RNAs (ncRNAs) [4, 5], predicting novel disease-related ncRNAs [5], and so on.

Term associations can often be reflected by the manually curated relationships between terms in ontologies or by the genes they annotate. Terms of a domain are collected as a biomedical ontology or a category of an ontology. For example, terms of biological process (BP), molecular function (MF) and cellular component (CC) are organized in three categories of Gene Ontology (GO) [6]. 'IS_A' relationships between terms within each sub-category of GO form a directed acyclic graph (DAG). The DAG has frequently been used in calculating the similarity of pair-wise terms, and high similarity scores represent an association between terms [7–9]. Currently, Wang et al.'s method [4, 10] has been validated as suitable for biomedical ontologies such as GO, Disease Ontology (DO) [11], and so on. Because no relationships exist between terms of different ontologies, it is not easy to explore associations between terms across ontologies based entirely on the DAG of an ontology. Fortunately, the functional annotations of genes help with this. To explore associations between terms across ontologies, three methods were designed: Vector Space Model (VSM) [12], Association Rule Mining (ASR) [13], and Cross-Category Gene Ontology Measurement (CroGO) [14].

Two terms that occur in a common sentence, abstract, or document of the literature are deemed to have a co-occurrence relationship. Co-occurrence relationships occur widely in the literature of PubMed. In comparison with DAG and term-gene relationships, these relationships have more easily been overlooked in exploring term associations. Currently, four methods based on co-occurrence relationships have been presented: Normalized MEDLINE Distance (NMD) [15], relationship-scaled score (RSS) [16], extensional Mutual Information (EMI) [17], and adjusted RSS based on information content (ARSSIC) [18]. The NMD method is based on the Normalized Google Distance (NGD) method [19], which relies on the Google search engine to calculate the relationship degree of two search terms based on their co-occurrence relationships. The RSS and EMI methods are designed based on Mutual Information (MI) [20] to rank relativities between terms in the literature. Compared with the other three methods, the ARSSIC method incorporates both the DAG of the ontology and the co-occurrence relationships.

Although multiple methods have been presented for exploring associations between terms based on their co-occurrence relationships in PubMed, two issues limit the application of these methods. One issue is that co-occurrence relationships were extracted from two terms occurring in a common abstract. Two terms occurring in a common sentence of an abstract should be more likely to have a potential linkage. The other issue is that no procedure was presented to extract co-occurrence relationships between terms from the literature. To solve these two issues, here we designed a framework to annotate literature with biomedical ontologies and explore term associations. First, the abstracts of the literature were segmented into sentences. Then the terms in the sentences were mapped to biomedical ontology terms. Finally, the similarity based on co-occurrence relationships was calculated using existing methods to reflect the relationship degree of pair-wise terms.

RESULTS

Co-occurrence relationships between GO and DO terms in PubMed

Terms with co-occurrence relationships were extracted using our framework in February 2017. After removing sentences annotated with fewer than two terms, 2,057 CC terms (52.65%, 2,057/3,907), 3,708 MF terms (37.12%, 3,708/9,988), 9,588 BP terms (33.95%, 9,588/28,245), and 5,291 DO terms (76.46%, 5,291/6,920) in 15,922,610 sentences of abstracts were left. Overall, BP has the largest number of terms related to other terms, while DO has the highest proportion of terms analyzed together with other terms. This is reasonable, given that the disease domain attracts more attention.

More detailed annotation results are shown in Figure 1. In all, 1,453,119 pairs of terms with co-occurrence relationships were obtained. Among them, 43,840 CC term pairs, 57,876 MF term pairs, 230,367 BP term pairs, and 298,523 DO term pairs have co-occurrence relationships, and a larger number of relationships across ontologies were extracted. In comparison with the number of 'IS_A' relationships in the ontologies, more relationships exist in PubMed. For example, almost 8 times as many relationships between CC terms occur in PubMed (7.80, 43,840/5,618). Therefore, access to co-occurrence relationships is an important aid in exploring the association between terms of an ontology. Since few manually curated relationships are provided between terms across ontologies, co-occurrence relationships play important roles in exploring term associations across ontologies.

Related terms based on co-occurrence relationships indicate shared semantic relationships of ontology

In order to assess the performance of term relativity based on co-occurrence relationships in PubMed, we examined the correlation between term relativity and structure-based term similarity. To the best of our knowledge, term similarity based on ontology performs well and is widely used in the biomedical domain. Thus, the correlation can show the performance of term relativity. Because the RSS and NMD methods [15, 16] are two typical methods based on MI and NGD respectively, these two methods are used here for calculating the relative degree of pair-wise terms. Wang's method depends completely on the structure of the ontology. Therefore, Wang's method was utilized for calculating the term similarity of BP, MF, CC and DO.

based on RSS score. And the correlation based on NMD score is slightly less than the correlation based on RSS score. Since both RSS and NMD methods are based on Figure