Research Scenario of Bio Informatics in Big Data Approach
Journal of Electronics and Communication Systems, Volume 4 Issue 1
Page 18-27 © MAT Journals 2019. All Rights Reserved

S. Jafar Ali Ibrahim¹, Dr. M. Thangamani², D. Sarathkumar³
¹Doctoral Research Fellow, School of Information and Communication Engineering, Anna University, Chennai, Tamil Nadu, India
²Assistant Professor, Department of Computer Technology, Kongu Engineering College, Perundurai, Tamil Nadu, India
³Assistant Professor, Department of Electrical & Electronics Engineering, Kongu Engineering College, Perundurai, Tamil Nadu, India
Email: [email protected]
DOI: http://doi.org/10.5281/zenodo.2596987

Abstract
Big Data can unify all patient-related data to give a 360-degree view of the patient, which can then be used to analyze and predict outcomes. This study examines the concepts and characteristics of Big Data, the concepts of Translational Bio Informatics, some publicly available big data repositories, and the major issues of big data. It covers medical and healthcare applications and their opportunities.

Keywords: Big Data, Bio Informatics, Drug Discovery, Computational Intelligence Methods, Health Informatics, Health care data mining.

Big Data Concepts
Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and draw insights from large datasets. The characteristics of big data can be described as 6 V's: Volume, Velocity, Variety, Value, Variability and Veracity [1, 2, 3].

Volume
Volume refers to the scale of the data: terabytes, petabytes, and zettabytes.

Velocity
This focus on near-instant feedback has driven many big data practitioners away from a batch-oriented approach and closer to a real-time streaming system. Data is constantly being added, massaged, processed, and analyzed in order to keep up with the influx of new information and to surface valuable information early, when it is most relevant.

Variety
While more traditional data processing systems might expect data to enter the pipeline already labeled, formatted, and organized, big data systems usually accept and store data closer to its raw state.

Big data life cycle
So how is data actually handled in a big data framework? Although implementations differ, some strategies and software are common enough to discuss in general terms. The steps presented below may not hold in every case, but they are widely used. The general categories of activity involved in big data processing are:
• Ingesting data into the system
• Persisting the data in storage
• Computing and analyzing data
• Visualizing the results

Within big data technology, clustered computing is an important strategy employed by most big data solutions.

CLUSTERED COMPUTING
Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling is also extremely important.

High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.

Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group.

There is often noisy data or false information in big data. The focus of Big Data is on correlations, not causality [4].

CATEGORIES OF MEDICAL BIG DATA
Data in healthcare can be categorized as follows.

Genomic Data
Such data are gathered by a bioinformatics system or genomic data processing software. Sequencing analysis and variation analysis are common processes performed on genomic data. The aim of genomic data analysis is to determine the functions of specific genes. Genomic data covers genotyping, gene expression and DNA sequence data [6, 7].

Clinical Data
A term defined in the context of a clinical trial for data pertaining to the health status of a patient or subject [8]. About 80% of this type of data is unstructured: documents, images and clinical or transcribed notes [9]. Structured data (e.g., laboratory data, structured EMR/EHR) also belongs to this category.

Behaviour Data and Patient Sentiment Data
Behavioural data refers to information produced as a result of actions, typically commercial behaviour, using a range of devices connected to the Internet, such as a PC, tablet, or smartphone. Behavioural data tracks the sites visited, the apps downloaded, or the games played.
• Web and social media data: search engines, Internet consumer use and networking sites (Facebook, Twitter, LinkedIn, blogs, health plan websites, smartphones, etc.) [10]

Clinical reference and health publication data
This category refers to reference data for clinical, claim, and business data that enables interoperability, drives compliance, and improves operational efficiency. It includes text-based publications (journal articles, clinical research and medical reference material), clinical text-based reference practice guidelines, and health product data (e.g., drug information) [7, 12].

Administrative, Business and External Data
• Insurance claims and related financial data, billing and scheduling [10]
• Biometric data: fingerprints, handwriting, iris scans, etc.

Other Important Data
• Device data, adverse events, patient feedback, etc. [9]
• The content of portal or Personal Health Record (PHR) messaging (such as e-mails) between the patient and the provider team, and the data generated in the PHR

Big data in Health Informatics
The scope of this study is research that uses data mining to answer questions across the various levels of health [13]. The subfield of Translational Bio Informatics (TBI), on the other hand, exploits data from each of these levels, from the molecular level to entire populations [14].

BIG DATA AND DRUG DISCOVERY
In today's drug discovery environment, Big Data plays a vital role due to its 5 V concepts. Drug-related databases provide information about drugs, their adverse reactions, chemical formulas, metabolic pathways, drug targets, the diseases for which a particular drug is used, and so on. None of the existing pharmacogenomic databases carries the complete integrated information, and hence there is a need to develop a database that integrates data from all the widely used databases [38].

Integrating big data analytics and validating drugs in silico has the potential to improve the cost-effectiveness of the drug development pipeline. Big data-driven strategies are increasingly being used to address these challenges. Computational prediction of drug toxicity and of pharmacodynamic/pharmacokinetic properties, based on the integration of multiple data types, helps prioritize compounds for in vivo and human testing, potentially reducing costs [39].

DRUG DISCOVERY RELATED BIG DATA SOURCES
Data sets and resources related to drug discovery are scattered across various databases and online resources, and most of these databases are interlinked based on the information they carry. They include PharmGKB [40], DrugBank [41], CTD [42], Reactome [43], KEGG [46], STITCH [47], PACdb [48], dbGaP [49], IGVdb and PGP [50]. A brief explanation of each database is given in the following sections and tabulated in Table 2.

PharmGKB
PharmGKB is a pharmacogenomics database that carries clinical information along with dosage guidelines, gene-drug associations and genotype-phenotype relationships. It also has information about Variant Annotations, Clinical Annotations, Very Important Pharmacogene (VIP) summaries and drug-centered pathways.

DrugBank
DrugBank is an open resource for drugs, drug targets, and cheminformatics. It contains 11,067 drug entries including […] (protein/peptide) drugs, 112 nutraceuticals and over 5,125 experimental drugs. Additionally, 4,924 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each Drug Card entry contains more than 200 data fields, with half of the information devoted to drug/chemical data and the other half devoted to drug target or protein data.

CTD
The Comparative Toxicogenomics Database (CTD) is organized into 11 data types: chemicals, genes, chemical-gene/protein interactions, diseases, gene-disease associations, chemical-disease associations, references, organisms, gene ontology, pathways and exposures.

Reactome
Reactome is cross-referenced to several other databases, such as Ensembl [44] and UniProt. Its pathways, especially the human ones, may be used for research and analysis, pathway modelling, systems biology, and pharmacogenomics applications that analyze the effects of drug pathway alterations on drug response and phenotypes [45].

KEGG
KEGG is an integrated resource of systems information (KEGG Pathways, KEGG Brite, KEGG Module, KEGG Disease, KEGG Drug and KEGG Environ), genomics information (KEGG Orthology, KEGG Genes, KEGG Genome, KEGG DGenes and KEGG SSDB) and chemical information (KEGG Compounds, KEGG Glycans, KEGG Reaction, KEGG RPair, KEGG RClass and KEGG Enzyme).

STITCH
STITCH (Search Tool for Interacting Chemicals) is a database of known and predicted interactions between chemicals and proteins. The interactions include direct (physical) and indirect (functional) associations; they stem from