Integrating Text Mining, Data Mining, and Network Analysis for Identifying
Total Page:16
File Type:pdf, Size:1020Kb
Jurca et al. BMC Res Notes (2016) 9:236 DOI 10.1186/s13104-016-2023-5 BMC Research Notes RESEARCH ARTICLE Open Access Integrating text mining, data mining, and network analysis for identifying genetic breast cancer trends Gabriela Jurca1, Omar Addam1, Alper Aksac1, Shang Gao2, Tansel Özyer3, Douglas Demetrick4 and Reda Alhajj1,5* Abstract Background: Breast cancer is a serious disease which affects many women and may lead to death. It has received considerable attention from the research community. Thus, biomedical researchers aim to find genetic biomarkers indicative of the disease. Novel biomarkers can be elucidated from the existing literature. However, the vast amount of scientific publications on breast cancer make this a daunting task. This paper presents a framework which investigates existing literature data for informative discoveries. It integrates text mining and social network analysis in order to identify new potential biomarkers for breast cancer. Results: We utilized PubMed for the testing. We investigated gene–gene interactions, as well as novel interactions such as gene-year, gene-country, and abstract-country to find out how the discoveries varied over time and how overlapping/diverse are the discoveries and the interest of various research groups in different countries. Conclusions: Interesting trends have been identified and discussed, e.g., different genes are highlighted in relation- ship to different countries though the various genes were found to share functionality. Some text analysis based results have been validated against results from other tools that predict gene–gene relations and gene functions. Keywords: Breast cancer, Data mining, Text mining, Network analysis Background reducing the number of molecules to consider as poten- Introduction tial biomarkers. CANCER is one of the most serious and harmful diseases Cancer is a result of damage (mutation) to a cell’s DNA threatening humanity and may lead to death. Unfortu- (deoxyribonucleic acid), so that the cell loses normal nately there is no discovered robust treatment which functionality and instead gains the ability to indefinitely leads to guaranteed cure from cancer. Thus, researchers multiply until normal tissue functions are impaired [1]. from various domains are still working hard to identify Cancerous DNA mutations may occur from a complex molecules (mainly genes or proteins) which could be han- mixture of inherited and external (environmental) fac- dled and targeted as cancer biomarkers. Various methods tors, where these mutations are usually located in cell have been developed. The research spans a wide range division genes [1]. There are over 100 known different of techniques from wet-lab testing by biologists to com- types of cancer, depending on the cell type which was putational methods by computer scientists. The latter originally affected [1]. Additionally, each patient may research is promising because it helps in tremendously have a different set of cancerous mutations in various genes, which may lead to different subtypes of the cancer. In order to personalize therapeutic strategies for cancer *Correspondence: [email protected] patients, medical researchers aim to identify and char- 1 Department of Computer Science, University of Calgary, Calgary, AB, acterize the biomarkers of each type of cancer, so that Canada they can provide the most accurate diagnosis to patients Full list of author information is available at the end of the article © 2016 Jurca et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Jurca et al. BMC Res Notes (2016) 9:236 Page 2 of 35 [2]. A cancer biomarker refers to a substance or process that serves as indication of cancer in the body, where one common example of a cancer biomarker is genetics [3]. The basic unit of genetic biomarkers are genes. A gene is one unit of the DNA which often contains the informa- tion needed to produce proteins. The central dogma is that genes are transcribed into an intermediate molecules called RNA, and the RNA is then translated into proteins, where proteins carry out the basic functions of life [4]. If a gene codes for a protein whose function is to suppress cancer, then if that gene is damaged or is downregulated (not transcribed enough), then the cell may become can- Fig. 1 Growth in number of abstracts about breast cancer in Pub- cerous. Similarly, if a gene codes for a protein whose func- Med tion is to promote cancer, then if that gene is upregulated (transcribed more than usual), then that cell may also become cancerous. Therefore, finding the different genes techniques are described in more detail in “Results and and conditions which are likely to lead to cancer, should discussion” section. Each data mining technique uti- the genes be upregulated or downregulated, is an impor- lizes different interestingness metrics, so it is useful to tant task for characterizing types of cancer. The problem apply many techniques to a data set. Another technique is not trivial because there are various internal and exter- we used on the genes extracted from the breast cancer nal factors that might affect the cells leading to cancer. abstracts was network analysis, or “Social Network Anal- People do not have the same habits and behavior. Thus ysis” as it is sometimes referred to [9]. Network analysis they may develop the same cancer differently based on the has its roots in sociology, as it was first used to study environment they live in, their diet, drinking, etc. Also, the relationships and community structures in social some types of cancer, such as breast and prostate cancer data. However, network analysis has since been applied can be strongly influenced by inherited gene mutations, in other fields such as bioinformatics in order to find key and often run in families [5]. Therefore, these heritable molecular markers and communities within an interac- types of cancer may be predicted by examining a person’s tion network. DNA before they develop cancer. Identifying the heritable To validate genes linked to cancer, one of the most genetic mutations that increase the likelihood for cancer effective ways is to analyze disease specific gene expres- are critical to developing predictive genetic tests. sion data [10]. Our framework described in this paper is built on the Gene expression data is experimental data which hypothesis which could be articulated as follows. To can be used to check whether a gene has indeed been investigate cancer biomarkers, one may investigate the upregulated or downregulated with respect to a dis- literature which contains a huge amount of informa- ease. This methodology compares to what level genes tion hidden in the form of scientific articles. However, a were expressed in cancerous cells versus healthy cells. query for “breast cancer” to PubMed can retrieve over It is unaffordable and infeasible to try wet-lab analysis 250,000 articles, which makes it impossible to get a full- of such a huge set of genes. Therefore, machine learn- picture of the field by reading them. The trend is that the ing and data mining techniques (including frequent pat- number of PubMed articles are steadily increasing, and tern mining, clustering and classification) can be used so are articles on the topic of breast cancer that mention to lower this number of genes down to a manageable set gene names argued as potential biomarkers (see Fig. 1). of genes which are anticipated to be statistically linked Therefore, using text mining techniques to gather new with the disease. This way, biologists will concentrate knowledge from many existing scientific sources can be only on the identified small set as potential cancer bio- an effective way to investigate the literature for new bio- markers instead of unrealistic case of testing every gene markers. One type of relationship which can be discov- in the wet-lab as potential cancer biomarker. In other ered is gene-disease, that shows which gene is involved in words, data mining techniques can save the time and cost which disease [6]. Another type of relationship which can of cancer researchers, turning their research goals into be found are gene–gene interactions [7]. something potentially achievable. This is illustrated by Some data mining techniques that can be used to the test results reported in this paper. extract hidden information from a database are hard The paper is organized as the following sections. The clustering, soft clustering, hierarchical clustering, and problem explanation is made in “Problem explana- frequent pattern mining [8]. All of the aforementioned tion” section. “Related work” section describes the work Jurca et al. BMC Res Notes (2016) 9:236 Page 3 of 35 related to our solution. In “The developed solution” sec- engines in terms of the scope of content delivered. For tion, the developed methodology is given in detail. The example, PubMed produces a comprehensive search of experimental results are depicted in “Evaluation of the articles, but only retrieves the abstracts of the articles. In developed solution” section. Lastly, contributions and contrast, UKPMC returns the full text of articles [13]. future work are mentioned in “Results and discussion” section. Entity recognition (NER) Once we have a subset of the available scientific literature which pertains to our topic, Problem explanation we must identify terms which are relevant to our study. Identifying cancer biomarkers is not a trivial task. NER has the aim of identifying terms within the gathered Despite all the effort, time, and money invested so far, text, such as the names of different proteins or genes.