Journal of Advanced Applied Scientific Research - ISSN: 2454-3225 K.Gobalan et.al JOAASR-RTB-2016- February-2016:29- 42

APPLICATIONS OF IN AND

K.Gobalan 1, Dr.S.Ahamed John2

1. PG and Research Department of , Jamal Mohamed College, Trichy-20.

2. PG and Research Department of Jamal Mohamed College, Trichy –20.

ABSTRACT

Bioinformatics is the application of statistics and computer science to the field of molecular biology. The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic processes in biotic systems. Its primary use since at least the late 1980s has been in genomics and proteomics, particularly in those areas of genomics involving large-scale DNA sequencing and those areas of proteomics involving structure prediction. Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. Over the past few decades, rapid developments in genomic and proteomic research technologies and developments in information technologies have combined to produce a tremendous amount of information related to molecular biology. Bioinformatics is the name given to the mathematical and computing approaches used to gain a clear understanding of biological processes. Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures. The primary goal of bioinformatics is to increase the understanding of biological processes. To achieve this goal, bioinformatics focuses on developing and applying computationally intensive techniques (e.g., pattern recognition, data mining, machine learning algorithms, and visualization). Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modeling of evolution.


1. Introduction

In recent years, rapid developments in genomics and proteomics have generated a large amount of biological data. Drawing conclusions from these data requires sophisticated computational analyses. Bioinformatics, or computational biology, is the interdisciplinary science of interpreting biological data using information technology and computer science. The importance of this new field of inquiry will grow as we continue to generate and integrate large quantities of genomic, proteomic, and other data. A particularly active area of research in bioinformatics is the application and development of data mining techniques to solve biological problems. Analyzing large biological data sets requires making sense of the data by inferring structure or generalizations from the data. Examples of this type of analysis include protein structure prediction, gene classification, cancer classification based on microarray data, clustering of gene expression data, and statistical modeling of protein-protein interaction. Therefore, we see great potential to increase the interaction between data mining and bioinformatics.

2. Bioinformatics

The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic processes in biotic systems. Since the late 1980s its primary use has been in genomics and genetics, particularly in those areas of genomics involving large-scale DNA sequencing. Bioinformatics can be defined as the application of computer technology to the management of biological information. It is the science of storing, extracting, organizing, analyzing, interpreting and utilizing information from biological sequences and molecules, and it has been mainly fueled by advances in DNA sequencing and mapping techniques. Over the past few decades, rapid developments in genomic and other molecular research technologies and developments in information technologies have combined to produce a tremendous amount of information related to molecular biology. The primary goal of bioinformatics is to increase the understanding of biological processes. The major research areas in bioinformatics include the following.

2.1. Sequence analysis

Sequence analysis is the most primitive operation in computational biology. It consists of finding which parts of biological sequences are alike and which parts differ during medical analysis and genome mapping processes. Sequence analysis implies subjecting a DNA or peptide sequence to sequence alignment, sequence databases, repeated sequence searches, or other bioinformatics methods on a computer.

2.2. Genome annotation

In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Dr. Owen White.

2.3. Analysis of gene expression

The expression of many genes can be determined by measuring mRNA levels with various techniques such as microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and subject to bias in the biological measurement. Here the major research area involves developing statistical tools to separate signal from noise in high-throughput gene expression studies.

2.4. Analysis of protein expression

Gene expression is measured in many ways, including mRNA and protein expression; however, protein expression is one of the best clues of actual gene activity, since proteins are usually the final catalysts of cell activity. Protein microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data.

2.5. Analysis of mutations in cancer

In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of sequences and polymorphisms. New physical detection technologies are employed, such as microarrays to identify chromosomal gains and losses and single-nucleotide polymorphism arrays to detect known point mutations. Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors.

2.6. Protein structure prediction

The amino acid sequence of a protein (the so-called primary structure) can be easily determined from the sequence of the gene that codes for it. In most cases, this primary structure uniquely determines a structure in its native environment. Knowledge of this structure is vital in understanding the function of the protein. For lack of better terms, structural information is usually classified as secondary, tertiary and quaternary structure. Protein structure prediction is one of the most important tasks for drug design and the design of novel enzymes. A general solution to such predictions remains an open problem for researchers.
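As an illustration of the statement in Section 2.6 that the primary structure follows from the coding sequence of the gene, the sketch below translates a DNA string into one-letter amino acid codes using the standard genetic code. It is a minimal example under stated assumptions: the input is taken to be an intron-free coding sequence read in frame from its first base, and the function name `translate` and the example sequence are illustrative only, not part of the paper.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding DNA sequence up to the first stop codon.

    Assumes the sequence contains no introns; unknown codons map to "X".
    """
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[i:i + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))  # MAIVMGR
```

Going from this primary structure to a three-dimensional structure is the hard part that the section refers to; the translation step itself is a simple table lookup.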


2.7. Comparative genomics

Comparative genomics is the study of the relationship of genome structure and function across different biological species. Gene finding is an important application of comparative genomics, as is the discovery of new, non-coding functional elements of the genome. Comparative genomics exploits both similarities and differences in the proteins, RNA, and regulatory regions of different organisms. Computational approaches to genome comparison have recently become a common research topic in computer science.

2.8. Modeling biological systems

Modeling biological systems is a significant task of systems biology and mathematical biology. Computational systems biology aims to develop and use efficient algorithms, data structures, and visualization and communication tools for the integration of large quantities of biological data with the goal of computer modeling. It involves the use of computer simulations of biological systems, such as cellular subsystems (the networks of metabolites and enzymes, signal transduction pathways and gene regulatory networks), to both analyze and visualize the complex connections of these cellular processes. Artificial life is an attempt to understand evolutionary processes via the computer simulation of simple life forms.

2.9. High-throughput image analysis

Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical images. Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images. A fully developed analysis system may completely replace the observer. Biomedical imaging is becoming more important for both diagnostics and research. Some examples of research in this area are clinical image analysis and visualization, inferring clone overlaps in DNA mapping, and bioimage informatics.

2.10. Protein-protein docking

In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). One central question for the biological scientist is whether it is practical to predict possible protein-protein interactions based only on these 3D shapes, without doing protein-protein interaction experiments. A variety of methods have been developed to tackle the protein-protein docking problem, though it seems that there is still much work to be done in this field.

3. Bioinformatics Tools

Some of the important tools for bioinformatics are shown in Figure 1.

4. Data Mining

Data mining refers to extracting or "mining" knowledge from large amounts of data. Data Mining (DM) is the science of finding new, interesting patterns and relationships in huge amounts of data. It is defined as "the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in warehouses". Data mining is also sometimes called Knowledge Discovery in Databases (KDD). Data mining is not specific to any industry. It requires intelligent technologies and the willingness to explore the possibility of hidden knowledge that resides in the data. Data mining approaches seem ideally suited for bioinformatics, since bioinformatics is data-rich but lacks a comprehensive theory of life's organization at the molecular level. The extensive databases of biological information create both challenges and opportunities for the development of novel KDD methods. Mining biological data helps to extract useful knowledge from massive datasets gathered in biology and in other related life sciences areas such as medicine and neuroscience.


Figure 1: Some of the important tools for bioinformatics

4.1. Data mining tasks

The two "high-level" primary goals of data mining, in practice, are prediction and description. The main tasks well suited for data mining, all of which involve mining meaningful new patterns from the data, are:

Classification: Learning a function that maps (classifies) a data item into one of several predefined classes.

Estimation: Given some input data, coming up with a value for some unknown continuous variable.

Prediction: Same as classification and estimation, except that the records are classified according to some future behavior or estimated future value.

Association rules: Determining which things go together, also called dependency modeling.

Clustering: Segmenting a population into a number of subgroups or clusters.

Description & visualization: Representing the data using visualization techniques.

Learning from data falls into two categories: directed ("supervised") and undirected ("unsupervised") learning. The first three tasks (classification, estimation and prediction) are examples of supervised learning. The next three tasks (association rules, clustering, and description & visualization) are examples of unsupervised learning. In unsupervised learning, no variable is singled out as the target; the goal is to establish some relationship among all the variables. Unsupervised learning attempts to find patterns without the use of a particular target field. The development of new data mining and knowledge discovery tools is a subject of active research. One motivation behind the development of these tools is their potential application in modern biology.

5. Application of Data Mining in Bioinformatics

Applications of data mining to bioinformatics include gene finding, protein function domain detection, function motif detection, protein function inference, disease diagnosis, disease prognosis, disease treatment optimization, protein and gene interaction network reconstruction, data cleansing, and protein sub-cellular location prediction. For example, microarray technologies are used to predict a patient's outcome. On the basis of patients' genotypic microarray data, their survival time and risk of tumor metastasis or recurrence can be estimated. Machine learning can be used for peptide identification through mass spectroscopy. Correlation among fragment ions in a tandem mass spectrum is crucial in reducing stochastic mismatches for peptide identification by database searching. An efficient scoring algorithm that considers the correlative information in a tunable and comprehensive manner is highly desirable.
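The outcome-prediction example above is an instance of the classification task from Section 4.1. As a rough sketch of what such a supervised learner looks like, the code below implements a nearest-centroid classifier over expression profiles. Everything specific in it is an assumption made for illustration: the function names, the three-gene profiles, and the "good"/"poor" outcome labels are invented toy data, and nearest-centroid is just one simple method, not the one used in the studies the paper cites.

```python
from math import dist
from statistics import fmean


def fit_centroids(profiles, labels):
    """Average the expression profiles of each class into one centroid."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    # zip(*rows) iterates over genes (columns), so each centroid entry is
    # the mean expression of one gene within the class.
    return {
        label: [fmean(column) for column in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, profile):
    """Assign a profile to the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: dist(centroids[label], profile))


if __name__ == "__main__":
    # Hypothetical training data: rows are patients, columns are genes.
    train_profiles = [
        [2.1, 0.3, 1.8],
        [1.9, 0.4, 2.2],
        [0.2, 1.7, 0.1],
        [0.4, 2.0, 0.3],
    ]
    train_labels = ["good", "good", "poor", "poor"]

    centroids = fit_centroids(train_profiles, train_labels)
    print(predict(centroids, [2.0, 0.5, 1.9]))  # good
    print(predict(centroids, [0.3, 1.8, 0.2]))  # poor
```

Dropping the labels and grouping the same profiles by distance alone would turn this into the clustering (unsupervised) task described in Section 4.1.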


5.1 Comparative genomics

The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these inter-genomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov Chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models. Many of these studies are based on homology detection and the computation of protein families.

A general solution to protein structure prediction remains an open problem. As of now, most efforts have been directed towards heuristics that work most of the time.

One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.

One example of this is the protein homology between hemoglobin in humans and the hemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. Though these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near-identical purposes. Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.

5.2 Modeling biological systems

Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to analyze and visualize the complex connections of these cellular processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.

5.3 High-throughput image analysis

Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images by improving accuracy, objectivity, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:


• High-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology, bioimage informatics)

• Morphometrics

• Clinical image analysis and visualization

• Determining the real-time air-flow patterns in breathing lungs of living animals

• Quantifying occlusion size in real-time imagery from the development of and recovery during arterial injury

• Making behavioral observations from extended video recordings of laboratory animals

• Infrared measurements for metabolic activity determination

• Inferring clone overlaps in DNA mapping, e.g. the Sulston score

5.4 Structural Bioinformatics Approaches

Prediction of protein structure

Protein structure prediction is another important application of bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (Of course, there are exceptions, such as the bovine spongiform encephalopathy, a.k.a. Mad Cow Disease, prion.) Knowledge of this structure is vital in understanding the function of the protein. For lack of better terms, structural information is usually classified as one of secondary, tertiary and quaternary structure. A viable general solution to such predictions remains an open problem.

5.5 Molecular Interaction

Efficient software is available today for studying interactions among proteins, ligands and peptides. The types of interactions most often encountered in the field include protein–ligand (including drug), protein–protein and protein–peptide. Molecular dynamic simulation of the movement of atoms about rotatable bonds is the fundamental principle behind computational algorithms, termed docking algorithms, for studying molecular interactions.

See also: protein–protein interaction prediction.

5.6 Docking algorithms

In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). One central question for the biological scientist is whether it is practical to predict possible protein–protein interactions based only on these 3D shapes, without doing protein–protein interaction experiments. A variety of methods have been developed to tackle the protein–protein docking problem, though it seems that there is still much work to be done in this field.

Software and tools

Software tools for bioinformatics range from simple command-line tools to more complex graphical programs and standalone web-services available from various bioinformatics companies or public institutions.

Open source bioinformatics software

Many free and open source software tools have existed and continued to grow since the 1980s.[8] The combination of a continued need for new algorithms for the analysis of emerging types of biological readouts, the potential for innovative in silico experiments, and freely available open code bases has helped to create opportunities for all research groups to contribute to both bioinformatics and the range of open source software available, regardless of their funding arrangements. The open source tools often act as incubators of ideas, or community-supported plug-ins in commercial applications. They may also provide de facto standards and shared object models for assisting with the challenge of bioinformation integration.

The range of open source software packages includes titles such as Bioconductor, BioPerl, Biopython, BioJava, BioRuby, Bioclipse, EMBOSS, Taverna workbench, and UGENE. In order to maintain this tradition and create further opportunities, the non-profit Open Bioinformatics Foundation [8] has supported the annual Bioinformatics Open Source Conference (BOSC) since 2000.[8]

6. Conclusion and challenges

Bioinformatics and data mining are developing as interdisciplinary sciences. Data mining approaches seem ideally suited for bioinformatics, since bioinformatics is data-rich but lacks a comprehensive theory of life's organization at the molecular level. However, data mining in bioinformatics is hampered by many facets of biological databases, including their size, number, diversity and the lack of a standard ontology to aid the querying of them, as well as the heterogeneity of the quality and provenance information they contain. Another problem is the range of levels of the domains of expertise present amongst potential users, so it can be difficult for the database curators to provide access mechanisms appropriate to all. The integration of biological databases is also a problem. Data mining and bioinformatics are fast-growing research areas today. It is important to examine what the important research issues in bioinformatics are and to develop new data mining methods for scalable and effective analysis.

References

[1] Zaki, J.; Wang, T.L. and Toivonen, T.T. (2001). BIOKDD01: Workshop on Data Mining in Bioinformatics.

[2] Li, J.; Wong, L. and Yang, Q. (2005). Data Mining in Bioinformatics, IEEE Intelligent System, IEEE Computer Society.

[3] Liu, H.; Li, J. and Wong, L. (2005). Use of Extreme Patient Samples for Outcome Prediction from Gene Expression Data, Bioinformatics, vol. 21, no. 16, pp. 3377–3384.

[4] Yang, Qiang. Data Mining and Bioinformatics: Some Challenges, http://www.cse.ust.hk/~qyang

[5] Berson, Alex; Smith, Stephen and Thearling, Kurt. Building Data Mining Application for CRM, Tata McGraw Hill.

[6] Zhang, Yanqing and Rajapakse, Jagath C. Machine Learning in Bioinformatics, Wiley, ISBN: 978-0-470-11662-3.


[7] Kuonen, Diego. Challenges in Bioinformatics for Statistical Data Miner, Bulletin of the Swiss Statistical Society, 46; 10-17.

[8] Richard, R.J. A. and Sriraam, N. (2005). A Feasibility Study of Challenges and Opportunities in Computational Biology: A Malaysian Perspective, American Journal of Applied Sciences 2 (9): 1296-1300.

[9] Tang, Haixu and Kim, Sun. Bioinformatics: mining the massive data from high throughput genomics experiments, Analysis of Biological Data: A Soft Computing Approach, edited by Sanghamitra Bandyopadhyay, Indian Statistical Institute, India.

[10] Nayeem, Akbar; Sitkoff, Doree and Krystek, Jr., Stanley (2006). A comparative study of available software for high-accuracy homology modeling: From sequence alignments to structural models, Protein Sci. April; 15(4): 808–824.

[11] Cristianini, N. and Hahn, M. (2006). Introduction to Computational Genomics, Cambridge University Press. ISBN 0-5216-7191-4.

[12] Wodak, S.J. and Janin, J. (1978). Computer Analysis of Protein-Protein Interactions. Journal of Molecular Biology 124 (2): 323–42.

[13] Mewes, H.W.; Frishman, D.; Mayer, K.F.X.; Munsterkotter, M.; Noubibou, O.; Pagel, P. and Rattei, T. (2006). Nucleic Acids Research, 34, D169.

[14] Lee, Kyoungrim (2008). Computational Study for Protein-Protein Docking Using Global Optimization and Empirical Potentials, Int. J. Mol. Sci. 9, 65-77.

[15] Hirschman, Lynette; Park, Jong C.; Tsujii, Junichi; Wong, L. and Wu, Cathy H. (2002). Accomplishments and challenges in literature data mining for biology, Bioinformatics Review, Vol. 18 no. 12, 1553–1561.

[16] Luis, T.; Chitta; B. and Kim, S. (2008). Fuzzy c-means clustering with prior biological knowledge, Journal of Biomedical Informatics.

[17] Han and Kamber (2006). Data Mining Concepts and Techniques, Morgan Kaufmann Publishers.

[18] Hand, D. J.; Mannila, H. and Smyth, P. Principles of Data Mining, MIT Press.

[19] Aluru, S., ed. (2006). Handbook of Computational Molecular Biology. Chapman & Hall/CRC.

[20] Baxevanis, A.D.; Petsko, G.A.; Stein, L.D. and Stormo, G.D., eds. (2007). Current Protocols in Bioinformatics. Wiley.

[21] Mount, D. W. (2002). Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Press.

[22] Gilbert, D. (2004). Bioinformatics software resources. Briefings in Bioinformatics.

[23] Jiong, Lei Liu; Yang, A. and Tung, K. H. Data Mining Techniques for Microarray Datasets, Proceedings of the 21st International Conference on Data Engineering (ICDE 2005).

[24] Pevzner, P. A. (200). Computational Molecular Biology: An Algorithmic Approach. The MIT Press.

[25] Soinov, L. (2006). Bioinformatics
