BEE : a Web Service for Biomedical Entity Exploration
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2020.01.03.893594; this version posted January 3, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. BEE : a web service for biomedical entity exploration Jin-uk Jung1,, Jin-Muk Lim1,, Hyunwhan Joe1,, and Hong-Gee Kim1, 1410, Dentisty Graduate School of Seoul National University, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea Recently there has been a trend in bioinformatics to produce on intuitiveness and simplicity. BEE’s straightforward inter- and manage large quantities of data to better explain complex face minimizes the level of questions and input parameters, life phenomena through relationship and interactions among reducing the difficulty for new users. In addition, the sys- biomedical entities. This increase in data leads to a need for tem can be easily expanded by minimizing the difficulty of more efficient management and searching capabilities. As a re- adding and modifying a new biomedical entity with a simple sult, Semantic Web technologies have been applied to biomedi- data structure. cal data. To use these technologies, users have to learn a query language such as SPARQL in order to ask complex queries such as ‘What are the drugs associated with the disease breast carci- MATERIALS AND METHODS : Inside of noma and Osteoporosis but not the gene ESR1’. BEE was devel- BEE oped to overcome the limitations and difficulties of learning such query languages. Our proposed system provides an intuitive A. Design principle. The development focus was to mini- and effective query interface based on natural language. Our mize the difficulty of query generation for users, which was system is a heterogeneous biomedical entity query system based the limitation of existing search systems. Therefore, devel- on pathway, drug, microRNA, disease and gene datasets from opment proceeded by setting up the interface to be as simple DGIdb, Tarbase, Human Phenotype Ontology and Reactome , and intuitive as possible. Gene Ontology, KEGG gene set of MSigDB. User queries can be joined with union, intersection and negation operators. The Subsequent considerations were focused on minimizing the system also allows for selected results to be saved and later com- query step and reducing the required parameters. Also, for in- bined with newly created queries. To the best of our knowledge, tuitive use, the parameter input follows the order of words in BEE is the first system that supports condition search based on natural language. This minimizes the learning curve of how the relationship of heterogeneous biomedical entities and is ex- to use the system and allows the user to use the application pected to be used in various fields of bioinformatics such as in immediately without any difficulties. The system provides drug repositioning candidate selection as well as simple knowl- features that save/load the search history, and supports ex- edge search. porting search results into various formats such as clipboard, Correspondence: [email protected] CSV, and Excel. In addition, BEE has a flexible schema that allows developers to add existing and new biomedical entity types, while ensuring that there is no impact of change to the Introduction existing system. In the last decades, bioinformaticians have been trying to dis- cover the relationships between genes and other biomedical B. Design principle. There are five entity types used in entities. As a result, the amount of data on gene associationsDRAFTBEE: gene, disease, drug, pathway and microRNA. Drug- with biomedical entities such as diseases, drugs and pathways gene interaction data were extracted from DGIdb(1). DGIdb have exploded. Consequently, this growth in data has lead to is a dataset that integrates human genes and drug-related an increasing need for an efficient search tool for the relation- data related to diseases based on 13 major sources. It pro- ship between more complex biomedical entities. vides more than 14,000 drug-gene interaction data between BEE is a web service that was developed for the needs of more than 2,600 genes and 6,300 drugs. It was last up- these users. The system is an entity search system that can dated on 2018-01-25. The gene symbol, Entrez ID, drug answer simple questions such as ’what are the pathways con- name and ChEmBL ID included in the schema were extracted taining gene A and gene B’, to more complex questions such and loaded into the BEE database. Second, disease-gene as- as ’what are the drugs associated with disease s1 and s2 with- sociation is based on data provided by the Human Pheno- out gene g1 targeted drugs’, which requires knowledge be- type Ontology(HPO)(2). HPO provides a standard vocabu- tween heterologous biomedical entities. Our primary goal lary of phenotypic features of human genetic or other dis- is to develop an easy-to-create query system that can get re- eases. HPO data is updated once a month, and the prepro- sults for complex queries. To search with multi-queries in cessing module in our system extracts the entrez id, gene existing systems, the user needs to learn a query language symbol, HPO term, and HPO term ID data from the files such as SPARQL or adapt to the use of the complex inter- provided by HPO and stores them in our database. Third, face provided by the system. It results in a deterioration of microRNA data is based on TarBase(3). TarBase provides user accessibility. The design principle of BEE is focused sequence data, target gene information and annotations on Jin-uk Jung et al. | bioRχiv | January 3, 2020 | 1–4 bioRxiv preprint doi: https://doi.org/10.1101/2020.01.03.893594; this version posted January 3, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. a web service, and integrates information on cell-type spe- bles (https://datatables.net/). cific miRNA-gene regulation information. Our preprocessing module extracted the gene symbol and miRNA information from the data provided by the source, and filtered it by the species Homo sapiens. TarBase was last updated on 2018- 3-12. Fourth, the pathway-gene association data was col- lected from the Gene Ontology(Biological process)(2)(4), Reactome(5), and KEGG(6) Gene set data is provided by the Molecular Signatures Database(MSigDB)(7). MSigDB pro- vides only gene information for each pathway and removes the context. Our system imported data which was released in MSigDB 7.0. Finally, gene data was composed by integrating Fig. 1. Entity association in BEE all gene symbols linked to the above disease, drug, pathway, and microRNA. The BEE database consists of 25,033 genes, Results 20,370 diseases, 5256 pathways, 12785 drugs, and 1034 mi- croRNAs. E. BEE web server. The system is based on the gene, drug, disease, pathway, and microRNA database loaded by the pre- C. System model. A Model of BEE B = (G,E,RE−G). processing module, and the loaded data continues to grow as G is a set of gene {g1,...,gv}. E is a set of entity types the data sources in each silo update. {D,S,P,M}. It includes a set of drugs D{d1,...,dw}, a On the first screen of BEE, the user can enter a query. As set of diseases S{s1,...,sx}, a set of pathways P {p1,...,py}, a result, the system provides network visualization informa- and a set of microRNAs M{m1,...,mz}. And RE−G is a tion and query results between entities according to the input set of relations between ei ∈ E and gi ∈ G that is RE−G ⊆ query, and the query result can be saved and used as part of E × G. For example, as shown in Figure 1, given dw ∈ D the query chain. and gv ∈ G, (dw,gv) ∈ Rdw−gv means "drug dw target to gene gv". i.e. RD−G = {(g1,d1),(g1,d2)}, RS−G = F. Query interface. The system provides an interface for {(g1,s1),(g2,s2),(g3,s2),(g4,s2)}. getting answers by combining two or more query sets and And also, the system model has three functions, their operators. There are three elements that make up a query gen,rgen,ext. First, gen, given an entity ei ∈ E, set: Answer entity type, Question entity type and Question gen(ei) = {g ∈ G|(e,g) ∈ Rei−G}. The function takes an entity name. As shown in Table 1, in query set 1, for the element of entity as parameter, returns the gene set associated question "What are the drugs associated with Breast carci- with the input entity. E.g. gen({s1}) = {g1,g2}. Second, noma", the Answer entity type is ’Drug’. ’Disease’ is the given a gene gi ∈ G, rgen(gi) = {{e} ⊆ |(e,g) ∈ RE−G}. question entity type and ’Breast carcinoma’ is the Question The function rgen takes an element of gene entity as a pa- entity name. There are also three query operators which are rameter and returns all entities associated with the input gene. union, intersection and negation. For example, the question E.g. rgen({g1}) = {d1,d2,s1}, rgen({g2}) = {s1,s2}. ’What are the drugs associated with disease Breast carcinoma Third, the ext function extracts and returns a specific type and Osteoporosis but not gene ESR1’ can be obtained by set- of entity from the set of input entities.