Data Mining in Bioinformatics Rajesh Kumar
Total Page:16
File Type:pdf, Size:1020Kb
31138 Rajesh Kumar M and R.Rajasekaran/ Elixir Bio Tech. 80 (2015) 31138-31142 Available online at www.elixirpublishers.com (Elixir International Journal) Bio Technology Elixir Bio Tech. 80 (2015) 31138-31142 Data Mining in Bioinformatics Rajesh Kumar. M1 and R.Rajasekaran 2,* 1Periyar Maniammai University, Chennai, Tamil nadu, India. 2Department of Biology, College of Science, Eritrea Institute of Technology, Mai Nefhi, P.O.Box no:-12676, Asmara, Eritrea, North East Africa. ARTICLE INFO Article history: ABSTRACT Received: 27 January 2015; Bioinformatics is a science typically associated with databases in genomics and proteomics Received in revised form: and structure and Function information for genes and proteins, of all forms of life on earth. 28 February 2015; In the past decade there has been a ‘cyber-war’, with the introduction of a number of Accepted: 16 March 2015; biological databases on genomics and proteomics. The major aim here is to introduce data mining techniques as an automated means of reducing the complexity of biological data in Keywords large bioinformatics databases and of discovering meaningful, useful patterns and relationships in data. The main purpose of data mining in the field of bioinformatics is the Databases, mining of complex data which is fast growing and can be said to be outgrowing our Bioinformatics, processing power. Data mining, Genomics, © 201 5 Elixir All rights reserved. Proteomics. Introduction be discerned. The biotech revolution was ignited by decoding Bioinformatics is the science of data management system in the human genome. It also created one of the biggest hurdles the genomics and proteomics of life forms. It is a comparatively Information Age has ever faced: which was to make sense of young discipline in information technology and has progressed three billion base pairs of human DNA. It promised to cure very fast in the last few years. Bioinformatics is practiced every disease with a genetic component, including cancer, worldwide by biotechnologists to access various databases for asthma, diabetes and mental illness. This grand, genomic research and to exchange information for comparison, dilemma gave birth to the newborn discipline of bioinformatics, confirmation, storage and analysis. The term Bioinformatics was the use of computers and information technology to tackle coined by Paulien Hogeweg in 1979 for the study of informatics biology problems 2. processes in biotic systems (www.wikipedia.org). As on date, There are three important sub-disciplines within bioinformatics there are a number of databases on specific genes and proteins • The development of new algorithms and statistics with which pertaining to human, animals, plants, bacteria, and other life to assess relationships among members of large data sets. forms. These are being enriched and updated through research in • The analysis and interpretation of various types of data modern biology with the practice of bioinformatics. It can be including nucleotide and amino acid sequences, protein said that this scientific field deals with the computational domains, and protein structures. management of all kinds of biological information, whether it • The development and implementation of tools that enable may be about genes and their products, whole organisms or even efficient access and management of different types of ecological systems. Most of the bioinformatics work that is information. being done deals mainly with analyzing biological data, One of the tasks used in bioinformatics concerns the although a growing number of projects deal with the creation and maintenance of database of biological information. organization of biological information. There are various ways While the storage and organization of millions of nucleotides is in which Bioinformatics may be defined. Most commonly it is far from trivial, designing a database and developing an defined as the interface between biological and computational interface whereby researches can both access existing sciences. information and submit new entries is only the beginning. It may also be defined as the application of computer There are various types of data involved in the fields of technology to biology; a combination of techniques and models Bioinformatics. Some are in statistical, computational, and life sciences to understand the • Amino acid sequences 3. significance of biological data. In more formal terms • Protein domain cartoons. Bioinformatics is the recording, annotation, storage, analysis, • Different renderings of three-dimensional structures. and searching/retrieval of nucleic acid sequence (genes and • Protein hydrophobicity data (www.fasta.bioch.virginia.edu/o RNAs), protein sequence and structural information. This - fasta/grease.htm) includes databases of the sequences and structural information Databases consisting of data derived experimentally, such as well methods to access. Search, visualize and retrieve the as nucleotide sequences and three-dimensional structures are information 1. It basically, is the field of science in which known as primary databases. Those data that are derived from biology, computer science, and information technology merge the analysis or treatment of primary data such as secondary into a single discipline. The ultimate goal of the field is to enable structures, hydrophobicity plots URL: the discovery of new biological insights as well as to create a http://fasta.bioch.Virginia.edu/o-fasta/grease.htm and domains global perspective from which unifying principle in biology can are stored in secondary databases. A protein database Tele: E-mail addresses: [email protected] © 2015 Elixir All rights reserved 31139 Rajesh Kumar M and R.Rajasekaran/ Elixir Bio Tech. 80 (2015) 31138-31142 consisting of the conceptual translation of nucleotide sequences Data mining is used in the: would also be considered a secondary database. There has been • Analysis of Micro array Data considerable mention of biological database in the above • Mining free text paragraphs. Let us look into what exactly does this biological • Structural Genomics-Protein Crystallization database mean. A database is basically a collection of • Predicting structures from sequence information that is structured, searchable, updated periodically Data mining is a collection of techniques for extracting new, and cross-referenced. The databases can be of many types: valid, interesting and surprising information from large • Flat file database databases. Traditional data analysis is assumption-driven, which • Relational database basically means that a hypothesis is developed in advance and • Object oriented database tested against data. The new techniques discussed here are Flat file database are simple text only files discovery-driven. The extracted knowledge itself acts as a Relational databases have data stored in tables and starting point for a new hypothesis. Researchers in areas such as relationships can be defined on a many-to-one basis or a many- machine learning, pattern recognition, and artificial intelligence to-many basis. The latter type requires a separate relationships first developed these techniques 5. Most of these systems are able table linking the two tables. All records in a given table have to to import and export models in PMML (Predictive Model necessary had identical features and any unique feature have to Markup Language) which provides a standard way to represent be stored in a separate table. For example, a sequence table data mining models so that these can be shared between might contain records with the attributes accession number and different statistical applications. PMML is an XML-based protein sequence while a function table might contain records language developed by the Data Mining Group (DMG) an with the attributes accession number and protein function. In independent group composed of many data mining companies. Object-oriented databases, Data are defined as objects which PMML version 4.0 was released in June 2009 have a class hierarchy, that is, they can be grouped into classes (www.wikipedia.org). and subclasses etc. In a hierarchical manner. With the amount of The data mining process typically involves the following data available, one of the most pressing tasks is the analysis and • Data cleansing – handling of noisy, missing or irrelevant data. interpretation of data. This can be done in various ways. • Data integration – integration of multiple data sources. • Individual entries in sequence and structure databases can be • Data selection – retrieving relevant data. complied to reveal patterns and trends in biology. • Data transformation – reducing the size of the data-set to be • Common sequence features in sequence families can be examined to a minimum. 4 identified in multiple alignments . Clustering of sequences into • Data analysis – applying intelligent pattern extraction trees reflects the degree of similarity between each sequence and algorithms. all of the others in the family reveals evolutionary relationships. • Pattern evaluation – assign measures of “interestingness” to • Finally, identification of homologs to each gene in well- the patterns. characterized metabolic pathways provides information about • Presenting the extracted knowledge to the user. the prevalence of that pathway in other organisms. The process of data mining is concerned with extracting Getting the sequence the structure of data to molecular patterns from data by using techniques such as classification, biology databases and