Disease Candidate Gene Identification and Gene Regulatory Network Building Through Medical Literature Mining

Yong Wang, Chenyang Jiang, Jinbiao Cheng and Xiaoqun Wang

Abstract Finding key genes associated with diseases is an essential problem in disease diagnosis, treatment, and drug design. Bioinformatics uses computer technology to analyze biomedical data and help find information about these genes. Biomedical literature, which contains original experimental data and results, is attracting more attention from bioinformatics researchers because literature mining can extract knowledge from it efficiently. This paper designs an algorithm that estimates the degree of association between genes from their co-citations in the biomedical literature of the PubMed database and then predicts the causative genes associated with a disease. The paper also uses a hierarchical clustering algorithm to build a gene regulation network for a specific gene. Experiments on uterine cancer show that the proposed algorithm can identify pathogenic genes of uterine cancer accurately and rapidly.

Keywords Literature mining · Co-citation · Pathogenic gene · Gene regulatory network · Hierarchical clustering

Y. Wang (corresponding author) · C. Jiang · J. Cheng
School of Software, Beijing Institute of Technology, Beijing 100081, China

X. Wang
Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China

© Springer International Publishing Switzerland 2017. V.E. Balas et al. (eds.), Information Technology and Intelligent Transportation Systems, Advances in Intelligent Systems and Computing 455, DOI 10.1007/978-3-319-38771-0_44

1 Introduction

Finding the genes that are critical to the formation and development of a disease among its candidate genes is significant for diagnosis and treatment, and it is an important goal of bioinformatics [1]. Bioinformatics technologies are used to manage and analyze the mass data generated by biomedical experiments and to give predictions or guiding conclusions for researchers' further work. In recent years, literature mining [2], a new branch of bioinformatics, has been increasingly used to partially replace biomedical experiments and thereby accelerate the whole research process. Literature is the most important channel for publishing experimental data, results and conclusions, so it contains a great deal of original knowledge that has been verified by experiments. Literature mining can automatically analyze the huge amount of biomedical literature. Combined with manual verification, it saves researchers considerable effort by greatly narrowing the scope of manual reading, extracts and discovers knowledge more efficiently, and gives reliable guidance for biomedical experiments.

Al-Mubaid et al. [3] explored the relationship between genes and diseases by studying the co-occurrence of genes and found 6 candidate genes, all of which could be verified in the PubMed literature. Chun et al. [4] developed a named entity recognition and mining system based on maximum entropy for the automatic identification of prostate cancer related genes and applied it to the Medline database with an accuracy of 92.1%. Some researchers [5] take advantage of Protein–Protein Interaction (PPI) networks to predict disease related genes, and many PPI databases are available for this purpose, such as HPRD (Human Protein Reference Database) [6], DIP (the Database of Interacting Proteins) [7] and OPHID (Online Predicted Human Interaction Database) [8]. Other researchers do not use these well-built databases but build their own gene networks. Ozgur et al. built a prostate cancer related gene network from some seed genes through text mining, and only one of the candidate genes could not be supported by the literature [9]. Many different algorithms have been proposed to evaluate gene networks, such as shortest-path [10, 11], similarity-assisted [12, 13] and centrality-based [9, 14] methods.

In this paper, we design an algorithm to predict disease candidate genes by text mining the PubMed biomedical literature database in combination with a human gene database. We first compute the degree of association between genes and then predict the causative genes associated with diseases. Further, we use a hierarchical clustering algorithm to build a gene regulation network for a specific gene and analyze the network by taking the results of text mining into consideration. The experimental results show that the proposed algorithm can identify pathogenic genes of uterine cancer accurately and rapidly. The less frequently referenced genes (low-frequency genes) deserve more attention in future research. This work can provide a useful guide for biomedical studies and shows the broad application prospects of text mining in the biomedical field.

2 System Design and Implementation

2.1 Getting Literature Data Source

First, we download the data of more than 20,000 protein-coding genes of the human genome from the NCBI Gene database (ftp://ftp.ncbi.nih.gov/gene/). The data contain the ontology symbol of each gene and the corresponding gene ID and are saved as CSV files. A Java program then reads each gene ontology symbol from the CSV files, sets the search parameters through a URL, searches PubMed for the literature related to that gene ontology through the E-utilities API, and retrieves the search results as XML files. For each gene, we parse the XML files and retrieve the PMIDs of the top 100 most relevant papers, remove duplicate PMIDs, and add the remaining PMIDs to a set. For each PMID, we retrieve the XML file of the corresponding paper from PubMed and parse it to obtain the title, abstract and keywords. Finally, we store them in a local MySQL database with each paper's PMID as the primary key. The titles, abstracts and keywords are the data source for the text mining that follows.
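The retrieval step can be illustrated with a small sketch. The Python code below is not the authors' Java program: the function names, the `sort=relevance` parameter and the sample XML strings are assumptions made for illustration. It builds an ESearch request URL for one gene symbol, extracts the PMIDs from an ESearch response, and merges several responses without duplicates. It does not contact the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(symbol, retmax=100):
    """Build an ESearch URL asking PubMed for the papers most relevant to a gene symbol."""
    # Assumption: the paper does not state its exact search parameters.
    params = {"db": "pubmed", "term": symbol, "retmax": retmax, "sort": "relevance"}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_pmids(esearch_xml):
    """Extract the PMIDs from an ESearch XML response, in response order."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


def collect_pmids(responses):
    """Merge the PMIDs of several responses, dropping duplicates (first-seen order)."""
    seen = {}
    for xml_text in responses:
        for pmid in parse_pmids(xml_text):
            seen.setdefault(pmid, None)
    return list(seen)


# Hypothetical responses for two genes that share one paper.
SAMPLE_A = "<eSearchResult><IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
SAMPLE_B = "<eSearchResult><IdList><Id>222</Id><Id>333</Id></IdList></eSearchResult>"

print(build_esearch_url("COMT"))
print(collect_pmids([SAMPLE_A, SAMPLE_B]))
```

Keeping the PMIDs in a dictionary rather than a plain set removes duplicates while preserving the order in which papers were first seen, which makes the later fetch step reproducible.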

2.2 Processing Literature Data

Identifying the gene ontologies. We use a word segmentation method to retrieve gene ontologies from the data source. Gene ontologies are represented in many ways in the literature, and some representations in early papers are nonstandard, so we need a collection of standard gene ontologies for reference before segmentation. To identify gene ontologies more easily, we preprocess the data source as follows. First, we read the CSV files containing the IDs of all gene ontologies that describe the human genome, and then the CSV files of the corresponding gene ontologies and gene aliases from the NCBI Gene database. We load these files into memory and add their contents to a hash table (a HashMap named sym2id) whose key is a gene ontology or a gene alias and whose value is the standard gene ontology. This hash table is referenced in the following steps.

When segmenting the data source, we first replace special symbols such as ‘(’, ‘)’, ‘+’, ‘,’, ‘.’, ‘[’, ‘]’, ‘<’, ‘>’, ‘?’, ‘/’, ‘\’, ‘&’ with spaces. We then handle ambiguous words: for example, was, an and met are removed together with the spaces around them, since they can be confused with gene symbols. We also remove “T cells” and “T cell” to prevent confusion in segmentation. Each segmented word is looked up in sym2id to determine whether it is a gene ontology symbol (a standard gene ontology symbol or an alias). If it is, the word is mapped to the unique standard symbol of that gene ontology, which is stored in a hash table for further processing.

To evaluate this identification algorithm, we randomly select 100 papers from the local database, identify the gene ontologies in them, and verify the results manually. As shown in Table 1, the F1-measure of the algorithm is above 97%. In terms of performance, identifying the gene ontologies in 20,000 papers takes only about 5 s. PR, RR and the F1-measure (the harmonic mean of PR and RR) are computed as in (1), (2) and (3), respectively.

Table 1 Gene ontology identification evaluation results

| No. | Number of correct gene ontology identifications | Total number of identified gene ontologies | Total number of gene ontologies in selected papers | Precision rate (PR) | Recall rate (RR) | F1-measure |
|---|---|---|---|---|---|---|
| 1 | 698 | 707 | 731 | 0.987 | 0.955 | 0.971 |
| 2 | 1002 | 1007 | 1025 | 0.995 | 0.977 | 0.986 |
| 3 | 423 | 423 | 442 | 1.000 | 0.957 | 0.978 |
| 4 | 486 | 486 | 499 | 1.000 | 0.974 | 0.987 |
| 5 | 769 | 774 | 786 | 0.994 | 0.978 | 0.986 |
| 6 | 553 | 559 | 571 | 0.989 | 0.968 | 0.979 |
| 7 | 673 | 673 | 695 | 1.000 | 0.968 | 0.984 |
| 8 | 733 | 733 | 741 | 1.000 | 0.989 | 0.995 |
| 9 | 529 | 533 | 552 | 0.992 | 0.958 | 0.975 |
| 10 | 638 | 638 | 653 | 1.000 | 0.977 | 0.989 |
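The dictionary-based identification step and the evaluation metrics can be sketched as follows. This is a minimal Python illustration rather than the authors' implementation: the tiny `SYM2ID` table, the `AMBIGUOUS` word set, the case-insensitive lookup and the function names are all assumptions. The metric function takes the F1-measure to be the harmonic mean of PR and RR, as stated in the text.

```python
import re

# Hypothetical miniature of the sym2id table: symbol or alias -> standard symbol.
SYM2ID = {"TP53": "TP53", "P53": "TP53", "EGFR": "EGFR", "ERBB1": "EGFR",
          "MET": "MET", "WAS": "WAS"}

# Special symbols listed in the text; each is replaced by a space.
SPECIAL = re.compile(r"[()+,.\[\]<>?/\\&]")

# Assumption: lowercase English words that collide with gene symbols are dropped.
AMBIGUOUS = {"was", "an", "met"}


def identify_genes(text, sym2id=SYM2ID):
    """Return the standard symbol of every gene mention found in the text."""
    cleaned = re.sub(r"\bT cells?\b", " ", text)  # remove "T cell" and "T cells"
    cleaned = SPECIAL.sub(" ", cleaned)
    hits = []
    for token in cleaned.split():
        if token in AMBIGUOUS:
            continue
        standard = sym2id.get(token.upper())
        if standard is not None:
            hits.append(standard)
    return hits


def precision_recall_f1(correct, identified, total):
    """Compute PR, RR and the F1-measure (harmonic mean of PR and RR)."""
    pr = correct / identified
    rr = correct / total
    f1 = 2 * pr * rr / (pr + rr)
    return pr, rr, f1


print(identify_genes("Mutant p53 (TP53) and EGFR/ERBB1 were met in T cells, but MET was not."))
# Counts taken from the first row of Table 1.
print([round(v, 3) for v in precision_recall_f1(698, 707, 731)])
```

Mapping every alias to one standard symbol before counting is what lets later steps treat "p53" and "TP53" as the same gene.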

$$\mathrm{PR} = \frac{\text{the number of correct gene ontology identifications}}{\text{the total number of identified gene ontologies}} \tag{1}$$

$$\mathrm{RR} = \frac{\text{the number of correct gene ontology identifications}}{\text{the total number of gene ontologies in the sample papers}} \tag{2}$$

$$0 < \text{F1-Measure} = \frac{2 \times \mathrm{PR} \times \mathrm{RR}}{\mathrm{PR} + \mathrm{RR}} < 1 \tag{3}$$

Calculating the correlation degree between genes. Both identifying the candidate genes of a disease and building gene regulation networks require analyzing the correlation between specific genes, so we design an algorithm that quantifies the correlation between two genes from their co-occurrence in the same papers. First, we calculate the eigenvalue of the nth gene ontology in the mth paper, $\mathrm{TFIDF}_{mn}$, which represents the importance of each gene in each paper and also reflects the focus of the paper. $\mathrm{TFIDF}_{mn}$ is calculated as in (4) and (5) and normalized to $W_{mn}$ as in (6). Here $\mathrm{TF}_{mn}$ (Term Frequency) is the frequency of the nth gene ontology in the mth paper, $\mathrm{IDF}_n$ (Inverse Document Frequency) describes how the nth gene ontology is distributed over the papers, and $S_m$ is the total number of gene ontologies in the mth paper. The closer $W_{mn}$ is to 1, the more influence the gene has on the paper.

$$\mathrm{IDF}_n = \lg\left(\frac{\text{the total number of papers}}{\text{the number of papers containing the } n\text{th gene ontology}}\right) \tag{4}$$

$$\mathrm{TFIDF}_{mn} = \mathrm{TF}_{mn} \times \mathrm{IDF}_n \tag{5}$$

$$W_{mn} = \frac{\mathrm{TF}_{mn} \times \mathrm{IDF}_n}{\sqrt{\sum_{n=1}^{S_m} \left(\mathrm{TF}_{mn} \times \mathrm{IDF}_n\right)^2}} \tag{6}$$
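The weighting scheme can be sketched in a few lines of Python. The sketch rests on assumptions that the text does not spell out: TF is taken as a gene's share of all gene mentions in a paper, IDF uses the standard base-10 form (total papers divided by papers containing the gene), and the function name and toy corpus are hypothetical. Each paper is represented by the list of standard gene symbols identified in it.

```python
import math
from collections import Counter


def normalized_weights(papers):
    """Return, for each paper, a dict mapping gene -> normalized TF-IDF weight."""
    total = len(papers)
    # Number of papers in which each gene appears at least once.
    doc_freq = Counter(gene for genes in papers for gene in set(genes))
    idf = {gene: math.log10(total / df) for gene, df in doc_freq.items()}

    weights = []
    for genes in papers:
        counts = Counter(genes)
        # Assumption: TF is the gene's share of all gene mentions in the paper.
        tfidf = {g: (c / len(genes)) * idf[g] for g, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in tfidf.values()))
        # A paper whose genes all have zero IDF gets zero weights.
        weights.append({g: (v / norm if norm > 0 else 0.0) for g, v in tfidf.items()})
    return weights


# Toy corpus: three papers with their identified gene symbols.
PAPERS = [["COMT", "COMT", "ESR1"], ["COMT", "DRD2"], ["TP53"]]
for row in normalized_weights(PAPERS):
    print({gene: round(w, 3) for gene, w in row.items()})
```

After normalization the squared weights of each paper sum to 1, so a gene that dominates a paper receives a weight close to 1, in line with the interpretation given above.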

If two genes both appear in a paper and their normalized eigenvalues are similar, the two genes are likely to have similar influence in that paper. We can therefore find the genes closely related to a given gene by evaluating the similarity between the genes' normalized eigenvalues over all the documents.

After calculating all the normalized eigenvalues, we obtain an n × m matrix P, where n is the total number of gene ontologies in the human genome and m is the total number of papers associated with these gene ontologies. If a gene ontology does not appear in a paper, the corresponding $W_{mn}$ is set to 0. The ith row of P is then the eigenvalue vector of the ith gene, denoted $A_i = [W_{1i}, W_{2i}, \ldots, W_{mi}]$. For the ith and jth genes, we use the cosine similarity of the vectors $A_i$ and $A_j$ to evaluate their similarity in the literature.

Predicting disease candidate genes. First, we predict the candidate genes of a disease from its already known causal genes. For each known causal gene, we select the most relevant genes (those whose eigenvalue vector similarity exceeds a threshold) from the biomedical literature corpus using the algorithm described above. This yields one collection of associated genes per causal gene, and the genes in the intersection of these collections are the candidate genes of the disease.

When nothing is known about the causal genes of a disease, gene associations cannot be used for prediction directly, but a similar mining method still applies. Taking the disease terms (such as “lung cancer”) in Medical Subject Headings (MeSH) and the gene ontologies in the human genome as a special dictionary, we mine the biomedical literature to calculate the TF, TF-IDF and normalized eigenvalues of the terms, and then look for the gene ontologies with a sufficiently high co-occurrence rate with the disease terms. These are the candidate genes of the disease.
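The prediction from known causal genes can be sketched as below. This is an illustrative Python sketch under stated assumptions: the function names, the toy vectors and the threshold of 0.6 are hypothetical, and the text does not say whether the known genes themselves are excluded from the result (here they are). Each gene is represented by its row of matrix P, that is, its normalized weight in every paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def related_genes(target, vectors, threshold):
    """Genes whose similarity to the target gene exceeds the threshold."""
    return {gene for gene, vec in vectors.items()
            if gene != target and cosine_similarity(vectors[target], vec) > threshold}


def candidate_genes(known, vectors, threshold):
    """Intersect the related-gene sets of all known causal genes."""
    related_sets = [related_genes(gene, vectors, threshold) for gene in known]
    candidates = set.intersection(*related_sets) if related_sets else set()
    # Assumption: the known causal genes themselves are not reported as candidates.
    return candidates - set(known)


# Hypothetical rows of P for five genes over four papers.
VECTORS = {
    "G1": [1, 0, 1, 0],
    "G2": [1, 0, 0, 1],
    "C": [1, 0, 1, 1],
    "D": [0, 1, 0, 0],
    "E": [0, 0, 1, 0],
}
print(candidate_genes(["G1", "G2"], VECTORS, threshold=0.6))
```

In this toy example gene E is close to G1 only, so it is filtered out by the intersection, while gene C is close to both known genes and is kept.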
Using a hierarchical clustering algorithm to build a gene regulation network. We combine the association degrees between genes with a hierarchical clustering algorithm to construct the gene regulation network. In general, we find the 50 genes most relevant to the target gene by mining the medical literature. From the normalized eigenvalue matrix P, we build a matrix P1 of the cosine distances between all 51 genes, and then use a hierarchical clustering algorithm to build a binary tree based on P1. The process consists of the following steps:

1. Initialization: Each gene is a cluster, so there are 51 clusters at the beginning.
2. Clustering: Select the two clusters A and B with the smallest cosine distance in matrix P1 and merge them into a new cluster C. Calculate the relevance (cosine distance) between C and every other cluster except A and B.
3. Updating the cosine distance matrix P1: Delete the rows and columns of clusters A and B from P1, insert a new row and column for cluster C, and update the affected elements of P1.
4. Repeat steps 2 and 3 until no new cluster can be formed, that is, until only one cluster remains.

The clustering process above can be recorded to construct gene regulation networks. We therefore save the information of clusters A, B and C in each execution of step 2, and all this information can be used to build a regulation network for a specific gene.
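The four steps and the recording of merges can be sketched as follows. This is a minimal Python sketch under stated assumptions: the text does not say how the distance between the new cluster C and the other clusters is computed, so size-weighted average linkage is assumed here, and the function name and toy distances are hypothetical. Cosine distance is taken as one minus cosine similarity.

```python
def hierarchical_merges(names, dist):
    """Agglomerative clustering that records every merge.

    names: list of gene symbols (the initial clusters).
    dist:  dict mapping a pair (a, b) of gene symbols to their cosine distance.
    Returns a list of (A, B, C, distance) tuples, where C = (A, B) is the new cluster.
    """
    clusters = list(names)                      # step 1: one cluster per gene
    size = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(clusters) > 1:                    # step 4: repeat until one cluster is left
        # Step 2: pick the two clusters with the smallest distance.
        pairs = [(x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]]
        a, b = min(pairs, key=lambda p: d[frozenset(p)])
        c = (a, b)
        merges.append((a, b, c, d[frozenset((a, b))]))
        # Step 3: distance from C to every other cluster (assumed average linkage).
        for other in clusters:
            if other != a and other != b:
                d[frozenset((c, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[c] = size[a] + size[b]
        clusters = [x for x in clusters if x != a and x != b] + [c]
    return merges


# Hypothetical cosine distances between three genes.
DIST = {("COMT", "DRD2"): 0.1, ("COMT", "ESR1"): 0.6, ("DRD2", "ESR1"): 0.4}
for a, b, c, distance in hierarchical_merges(["COMT", "DRD2", "ESR1"], DIST):
    print(a, "+", b, "->", c, "at distance", distance)
```

The recorded list of merges is exactly the binary tree: each entry names the two child clusters and their parent, so genes that are merged early sit on neighbouring branches.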

3 Evaluation and Results Analysis

We design some experiments to evaluate the disease candidate gene identification algorithms and the gene regulation network building method described in Sect. 2.

3.1 Evaluation of the Candidate Gene Prediction Results

Evaluating candidate gene prediction based on some known causal genes. Table 2 shows the candidate genes predicted for 7 common diseases from some known causal genes, grouped by their frequencies. According to how frequently genes appear in the literature, they can be divided into four categories: ultrahigh frequency genes, high frequency genes, medium frequency genes and low frequency genes. Most candidate genes are ultrahigh frequency or high frequency genes, which are the current research focus. The medium frequency and low frequency genes are less studied, so they may become new research targets in the future.

Table 2 Disease candidate genes

| Disease | Known causal genes and candidate genes |
|---|---|
| Lung cancer | EGFR, CDKN2A, NRAS, PIK3CA, HRAS, BRAF, KRAS, TP53, ALK |
| Uterine cancer | PGR, ESR1, CYP1A1, CYP1A2, GSTM2, ADH1B, GSTM1, NUDT1, NQO1, GSTT2, CYP19A1, CYP17A1 |
| Neuroma | NF2, BRAF, PTEN, SHH, ATRX, NF1, IL17RA, SEMA3F, AKT1, SHANK3, PRKAR1A |
| Lung carcinoma | EGFR, CDKN2A, NRAS, TP53, BRAF, HRAS, PIK3CA, KRAS, ALK |
| Goiter | TG, TPO, MBP, BDNF, TRH |
| Pre-eclampsia | FLT1, PGF, EN2, MBP, PAX3, TGFA, ADM, BVES, KIT, PTEN, CXCR4, STAT3, ANG, NOTCH4, EGFR |
| Microcephaly | MCPH1, CDK6, STIL, WDR62, ZNF335, ASPM, CENPJ, CDK5RAP2, CASC5, CEP135 |

Evaluating candidate gene prediction without known causal genes. Table 3 shows the 20 genes found to be most relevant to lung cancer.

Table 3 Lung cancer candidate genes

| Gene | Value |
|---|---|
| TRIM21 | 0.0920 |
| CRP | 0.0725 |
| CHI3L1 | 0.1542 |
| ADAMTS8 | 0.0598 |
| IL6 | 0.1342 |
| IRAK3 | 0.0659 |
| LYZ | 0.0819 |
| VWF | 0.0903 |
| MIB1 | 0.1960 |
| LPIN2 | 0.085 |
| TANK | 0.1253 |
| IL1RN | 0.0613 |
| CCL21 | 0.1052 |
| TLR2 | 0.0712 |
| MMP13 | 0.0818 |
| PDGFRB | 0.0873 |
| CILP | 0.1465 |
| ALG1 | 0.0725 |
| SETD1A | 0.0794 |
| TIMP2 | 0.0810 |

3.2 Evaluation of the Gene Regulation Network Construction Algorithm

We construct the gene regulation network (GRN) for COMT as shown in Fig. 1. COMT's GRN resembles a biological evolutionary tree, which reflects the correlations between genes in such a way that the branches of more closely related genes meet earlier. The network contains the 50 genes most relevant to COMT. In general, high frequency genes such as SULT1E1 and SULT2B1 converge with other genes earliest, while low frequency genes such as TW, GCH1 and PRODH meet the others last.

Fig. 1 Simple gene regulation network for COMT

The goal of our regulation networks for specific genes is to help researchers reduce the workload of building a GRN. Since we do not involve any biological semantics in the literature mining process, we cannot determine whether the expression of one gene influences the expression of another, so we cannot build a complete Boolean network. However, our network does provide a reasonable GRN structure, so that researchers can determine the gene ontologies and the relevance between genes in the GRN more easily.

4 Conclusion

In this paper, we studied approaches to evaluating the correlation between genes by mining the huge amount of biomedical literature. Based on this relationship information, we further designed algorithms to predict candidate pathogenic genes of diseases and to construct a simple gene regulation network for a specific gene. The experiments on the PubMed biomedical literature database show that our algorithms can help researchers efficiently determine their target genes for future biomedical research. Our future work includes improving the efficiency of retrieving papers from the PubMed database and integrating semantics into the gene regulation network.

References

1. Lander ES, Weinberg RA (2000) Genomics: journey to the center of biology. Science 287:1777–1782
2. Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7:119–129
3. Al-Mubaid H, Singh RK (2005) A new text mining approach for finding protein-to-disease associations. Am J Biochem Biotechnol 1(3):145–152
4. Chun HW, Tsuruoka Y, Kim JD et al (2006) Automatic recognition of topic-classified relations between prostate cancer and genes using MEDLINE abstracts. BMC Bioinform 7(1):1–8
5. Chen JY, Shen C, Sivachenko AY (2006) Mining Alzheimer disease relevant proteins from integrated protein interactome data. Pac Symp Biocomput 11:367–378
6. Human Protein Reference Database (2009). http://www.hprd.org/
7. Database of Interacting Proteins (2014). http://dip.doe-mbi.ucla.edu/dip/Main.cgi
8. Interologous Interaction Database (2015). http://ophid.utoronto.ca/ophidv2.204/
9. Ozgur A, Vu TG, Radev DR (2008) Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics 24(13):i277–i285
10. Liu B, Jiang T, Ma S et al (2006) Exploring candidate genes for human brain diseases from a brain-specific gene network. Biochem Biophys Res Commun 349(4):1308–1314
11. Radivojac P, Peng K, Clark WT, Peters BJ et al (2008) An integrated approach to inferring gene-disease associations in humans. Proteins 72(3):1030–1037
12. Wu X, Liu Q, Jiang R (2009) Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics 25(1):98–104
13. Miozzi L, Piro RM, Rosa F, Ala U, Silengo L et al (2008) Functional annotation and identification of candidate disease genes by computational analysis of normal tissue data. PLoS One 3(6):24–39
14. Ortutay Y, Vihinen M (2009) Identification of candidate disease genes by integrating Gene Ontologies and protein-interaction networks: case study of primary immunodeficiencies. Nucleic Acids Res 37(2):622–628