Disease Candidate Gene Identification and Gene Regulatory Network Building Through Medical Literature Mining
Yong Wang, Chenyang Jiang, Jinbiao Cheng and Xiaoqun Wang

Abstract Finding the key genes associated with a disease is an essential problem in disease diagnosis, treatment, and drug design. Bioinformatics uses computer technology to analyze biomedical data and help find information about these genes. The biomedical literature, which contains original experimental data and results, is attracting growing attention from bioinformatics researchers because literature mining can extract knowledge from it efficiently. This paper designs an algorithm that estimates the degree of association between genes from their co-citation in biomedical literature in the PubMed database, and then predicts the causative genes associated with a disease. The paper also uses a hierarchical clustering algorithm to build a specific gene regulatory network. Experiments on uterine cancer show that the proposed algorithm can identify pathogenic genes of uterine cancer accurately and rapidly.

Keywords Literature mining · Co-citation · Pathogenic gene · Gene regulatory network · Hierarchical clustering

Y. Wang (B) · C. Jiang · J. Cheng: School of Software, Beijing Institute of Technology, Beijing 100081, China
X. Wang: Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China

© Springer International Publishing Switzerland 2017. V.E. Balas et al. (eds.), Information Technology and Intelligent Transportation Systems, Advances in Intelligent Systems and Computing 455, DOI 10.1007/978-3-319-38771-0_44
1 Introduction

Finding the genes that are critical to the formation and development of a disease among its candidate genes is significant for diagnosis and treatment, and is an important goal of bioinformatics [1]. Bioinformatics technologies are used to manage and analyze the mass of data generated by biomedical experiments, and to offer predictions or guiding conclusions for researchers' further work. In recent years a new branch of bioinformatics, literature mining [2], has increasingly been used to partially replace biomedical experiments and so accelerate the whole research process. The literature is the most important channel for publishing experimental data, results, and conclusions, so it contains a great deal of original knowledge that has been confirmed by experiment. Literature mining can automatically analyze the huge volume of biomedical literature. Combined with manual verification, it saves researchers considerable labor by greatly narrowing the scope of manual reading, extracts and discovers knowledge more efficiently, and gives reliable guidance for biomedical experiments.

Al-Mubaid et al. [3] explored the relationship between genes and diseases by studying the co-occurrence of genes and found 6 candidate genes, all of which could be verified in the PubMed literature. Chun et al. [4] developed a named entity recognition and mining system based on maximum entropy for automatic identification of prostate cancer related genes, then applied it to the Medline database with an accuracy of 92.1%. Some researchers [5] take advantage of Protein–Protein Interaction (PPI) networks to predict disease related genes, and many PPI databases exist, such as HPRD (Human Protein Reference Database) [6], DIP (the Database of Interacting Proteins) [7] and OPHID (Online Predicted Human Interaction Database) [8]. Other researchers do not use these established databases but build their own gene networks. Ozgur et al.
built a prostate cancer related gene network from a set of seed genes through text mining, and only one of the candidate genes could not be supported by the literature [9]. Many different algorithms have been proposed to evaluate gene networks, such as shortest path [10, 11], similarity assisted [12, 13] and centrality based [9, 14].

In this paper, we design an algorithm to predict disease candidate genes by text mining the PubMed biomedical literature database in combination with the human gene ontology database. We first compute the degree of association between genes, and then predict the causative genes associated with a disease. Further, we use a hierarchical clustering algorithm to build a specific gene regulatory network and analyze the network by taking the results of text mining into consideration. The experimental results show that the proposed algorithm can identify pathogenic genes of uterine cancer accurately and rapidly. The less frequently referenced genes (low-frequency genes) should get more attention in future research. This work can guide further biomedical studies and shows the broad application prospects of text mining in the biomedical field.

2 System Design and Implementation

2.1 Getting Literature Data Source

First, we download the data for more than 20,000 protein-coding genes in the human genome (exons) from the NCBI Gene database (ftp://ftp.ncbi.nih.gov/gene/). The data contain the ontology of each gene and the corresponding gene ID, saved as CSV files. A Java program then reads the gene ontology symbol of each gene from the CSV file, sets the search parameters through the URL, searches the database for literature related to that gene ontology through the E-utilities API, and retrieves the search results as XML files. For each gene, we parse the XML files and retrieve the PMIDs of the 100 most relevant papers.
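The search step described so far can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the class and method names are hypothetical, the `sort=relevance` parameter is one assumed reading of "most related", and the XML in `main` is a hand-written stand-in for an ESearch response, so nothing is fetched over the network.

```java
import java.io.StringReader;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; the paper does not publish its code.
final class PubMedSearchSketch {
    // ESearch endpoint of the NCBI E-utilities.
    static final String ESEARCH =
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    /** Builds an ESearch URL requesting up to retmax PubMed hits for a gene symbol. */
    static String searchUrl(String geneSymbol, int retmax) {
        String term = URLEncoder.encode(geneSymbol, StandardCharsets.UTF_8);
        // Assumption: relevance ordering approximates "most related papers".
        return ESEARCH + "?db=pubmed&sort=relevance&retmax=" + retmax + "&term=" + term;
    }

    /** Collects the text of every <Id> element, dropping duplicates but keeping order. */
    static Set<String> parsePmids(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList ids = doc.getElementsByTagName("Id");
            Set<String> pmids = new LinkedHashSet<>();
            for (int i = 0; i < ids.getLength(); i++) {
                pmids.add(ids.item(i).getTextContent().trim());
            }
            return pmids;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse ESearch response", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample with made-up PMIDs, for illustration only.
        String sample = "<eSearchResult><Count>3</Count><IdList>"
                + "<Id>111</Id><Id>222</Id><Id>111</Id></IdList></eSearchResult>";
        System.out.println(searchUrl("BRCA1", 100));
        System.out.println(parsePmids(sample)); // prints [111, 222]
    }
}
```

Using a `LinkedHashSet` keeps the ranking order returned by the search while discarding repeated PMIDs, which matters when the same paper is returned for several genes.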
Duplicate PMIDs are removed and the remaining PMIDs are added to a set. For each PMID, we search PubMed and retrieve the XML files of the corresponding paper, then parse these files to get the title, abstract and keywords. Finally we store them in a local MySQL database, taking each paper's PMID as the primary key. The titles, abstracts and keywords are the data source for the text mining that follows.

2.2 Processing Literature Data

Identifying the gene ontologies. We use a word segmentation method to retrieve gene ontologies from the data source. Gene ontologies are represented in many ways in the literature, and some representations in early papers are nonstandard, so we need a collection of standard gene ontologies for reference before segmentation. To identify gene ontologies more easily, we apply the following preprocessing to the data source before segmenting it. First, we read the CSV files of the IDs of all gene ontologies that describe the human genome, then read the CSV files of the corresponding gene ontologies and gene aliases from the NCBI Gene database. These files are loaded into memory and their entries are added to a hash table, HashMap<String,String> sym2id, in which the key is a gene ontology or a gene alias and the value is the standard gene ontology. This hash table is referenced in the following steps.

When segmenting the data source, we first replace special symbols, such as ‘(’, ‘)’, ‘+’, ‘,’, ‘.’, ‘[’, ‘]’, ‘<’, ‘>’, ‘?’, ‘/’, ‘\’, ‘&’, with spaces. We then process ambiguous words: for example, the spaces before and after "was", "an", and "met" are removed so that these words are not segmented as standalone tokens. We also remove "T cells" and "T cell" to prevent confusion in segmentation. Each segmented word is looked up in the hash table sym2id to determine whether it denotes a gene, namely whether it is a gene ontology symbol (a standard gene ontology symbol or an alias). If it is, the word is mapped to the unique standard symbol of that gene, which is stored in a hash table for further processing.
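The lookup described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class name, the toy `sym2id` entries, and the sample sentence are assumptions made for the example, matching is case-sensitive by assumption, and the handling of ambiguous words such as "was", "an", and "met" is left out.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper illustrating symbol lookup against sym2id.
final class GeneTaggerSketch {
    // Character class covering the special symbols listed in the text.
    private static final String SPECIAL = "[()+,.\\[\\]<>?/\\\\&]";

    /** Returns the standard symbols of all genes mentioned in the text. */
    static Set<String> identify(String text, Map<String, String> sym2id) {
        String cleaned = text
                .replaceAll("\\bT cells?\\b", " ") // drop "T cell" / "T cells"
                .replaceAll(SPECIAL, " ");         // special symbols become spaces
        Set<String> found = new LinkedHashSet<>();
        for (String token : cleaned.trim().split("\\s+")) {
            String standard = sym2id.get(token);   // alias or symbol -> standard symbol
            if (standard != null) {
                found.add(standard);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Toy table; the real one is built from the NCBI Gene CSV files.
        Map<String, String> sym2id = new HashMap<>();
        sym2id.put("BRCA1", "BRCA1");
        sym2id.put("RNF53", "BRCA1"); // alias mapped to its standard symbol
        sym2id.put("TP53", "TP53");
        sym2id.put("p53", "TP53");
        String text = "Mutations in BRCA1 (also called RNF53) and TP53/p53 were found in T cells.";
        System.out.println(identify(text, sym2id)); // prints [BRCA1, TP53]
    }
}
```

Because every alias maps to one standard symbol, the two mentions "BRCA1" and "RNF53" collapse into a single gene, which is what later co-occurrence counting needs.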
To evaluate the identification algorithm, we randomly select 100 papers from the local database, identify the gene ontologies in them, and evaluate the precision by manual verification. The accuracy (F1-Measure) of the algorithm is above 97%, as shown in Table 1. In terms of performance, the whole process of identifying gene ontologies in 20,000 papers takes only about 5 s.

Table 1 Gene ontology identification evaluation results

No.  Correct  Identified  In papers  PR     RR     F1-measure
1    698      707         731        0.987  0.955  0.971
2    1002     1007        1025       0.995  0.977  0.986
3    423      423         442        1      0.957  0.978
4    486      486         499        1      0.974  0.987
5    769      774         786        0.994  0.978  0.986
6    553      559         571        0.989  0.968  0.979
7    673      673         695        1      0.968  0.984
8    733      733         741        1      0.989  0.995
9    529      533         552        0.992  0.958  0.975
10   638      638         653        1      0.977  0.989

Correct: the number of correct gene ontology identifications. Identified: the total number of identified gene ontologies. In papers: the total number of all gene ontologies in the selected papers. PR: precision rate. RR: recall rate.

PR, RR and F1-Measure are computed as in (1), (2) and (3) respectively.

\[ \mathrm{PR} = \frac{\text{the number of correct gene ontology identifications}}{\text{the total number of identified gene ontologies}} \tag{1} \]

\[ \mathrm{RR} = \frac{\text{the number of correct gene ontology identifications}}{\text{the total number of gene ontologies in sample papers}} \tag{2} \]

\[ 0 < \text{F1-Measure} = \frac{\mathrm{PR} + \mathrm{RR}}{2} < 1 \tag{3} \]

The F1-Measure is conventionally the harmonic mean of PR and RR, 2·PR·RR/(PR + RR). Eq. (3) as printed is their arithmetic mean, which for the values in Table 1 differs from the harmonic mean by less than 0.001.

Calculating the correlation degree between genes. Both identifying candidate genes of diseases and building gene regulatory networks require analyzing the correlation between specific genes, so we design an algorithm that calculates a quantitative correlation between two genes from their co-occurrence in the same papers. First, we calculate the eigenvalue (feature weight) of the nth gene ontology in the mth paper, $\mathrm{TFIDF}_{mn}$, which represents the importance of each gene in each paper and also the focus of the paper. $\mathrm{TFIDF}_{mn}$ is calculated as in (4) and (5) and normalized as in (6).
Here, IDF (Inverse Document Frequency) represents the distribution of a gene ontology across the documents, and TF (Term Frequency) represents the frequency of a gene ontology within a single paper.
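To make the roles of TF and IDF concrete, here is a minimal sketch that uses the textbook definitions: TF is the count of a gene in a paper divided by the number of gene mentions in that paper, and IDF is the natural logarithm of the number of papers divided by the number of papers mentioning the gene. These definitions are an assumption and may differ from the paper's Eqs. (4) and (5). The normalization of Eq. (6) is not attempted, and the class name and gene labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper using textbook TF-IDF, not necessarily the paper's formulas.
final class TfIdfSketch {
    /** papers.get(m) lists the gene symbols found in paper m, with repeats. */
    static List<Map<String, Double>> tfidf(List<List<String>> papers) {
        // Document frequency: number of papers mentioning each gene at least once.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> paper : papers) {
            for (String gene : new HashSet<>(paper)) {
                df.merge(gene, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> paper : papers) {
            // Raw count of each gene within this paper.
            Map<String, Integer> counts = new HashMap<>();
            for (String gene : paper) {
                counts.merge(gene, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / paper.size();
                double idf = Math.log((double) papers.size() / df.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three toy papers; A, B and C are placeholder gene symbols.
        List<List<String>> papers = List.of(
                List.of("A", "A", "B"), List.of("B"), List.of("B", "C"));
        System.out.println(tfidf(papers).get(0));
    }
}
```

In the toy input, gene B appears in every paper, so its IDF is ln(1) = 0 and it carries no weight, while gene A, concentrated in one paper, receives the largest weight there. This is the sense in which the weight reflects the focus of a paper.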