A Review of Data Mining in Bioinformatics

A Review of Data Mining in Bioinformatics

Vincent Limo A REVIEW OF DATA MINING IN BIOINFORMATICS Thesis CENTRIA UNIVERSITY OF APPLIED SCIENCES Information Technology December 2019 ABSTRACT Centria University Date Author of Applied Sciences December 2019 Vincent Limo Degree programme Information Technology Name of thesis A REVIEW OF DATA MINING IN BIOINFORMATICS Instructor Pages Kauko Kolehmainen 31 Supervisor Kauko Kolehmainen In the beginning of the 20th century, commonly known as the information age, there has been a phe- nomenal growth of potentially deadly group of abnormal diseases such as cancer. Because of this, there is a need for cancer and other biomedical research in and around transcriptomics, genomic and genetics which have a direct application of computer science methods such as data analysis and mathematics. The aim of this bachelor’s thesis is to highlight and discuss in detail the application of data mining techniques in bioinformatics. It begins by discussing the interdisciplinary relationship between data mining, knowledge discovery and bioinformatics before a comprehensive descriptive research in data mining techniques and their application in bioinformatics. The results stablished that gene expression analysis and gene sequencing rely on the application of clustering techniques such hierarchical, fuzzy, graph, and distance clustering while classification techniques, such as machine vector learning, super- vised learning, support vector machine and random forest are fundamental in genomic and proteomic synthesizing. It recommends data transformation, cleaning, and scalable statistical models as solutions to the prominent data quality and computational challenges in data mining. This thesis is divided into four main parts, Introduction, Data mining, Application of data mining in bioinformatics and a conclu- sion. Pages : 33 Key words Artificial Neural Networks, Bayesian System, Bioinformatics, Clustering, Data Mining, Decision Tree, Fuzzy C, K-Means, K-Nearest Neighbor, Knowledge Discovery, M-Means, P-DBSCAN, Self-Organ- izing Maps, Support Vector Machine. CONCEPT DEFINITIONS DE - Data Extraction DNA - Deoxyribonucleic Acid DP - Data Preprocessing HGP – Human genome Project KD – Knowledge Discovery KDD – Knowledge Discovery from Data mRNA – Massager Ribonucleic Acid OSTs – Open Source Tools RNA - Ribonucleic Acid ROI –Return on investment SNP(s) - Single-Nucleotide Polymorphism(s) CONTENTS 1 INTRODUCTION ................................................................................................................................ 1 2 DATA MINING ................................................................................................................................... 3 2.1 Data analysis (DA) .......................................................................................................................... 3 2.1.1 The exploratory analysis ..................................................................................................... 4 2.1.2 Descriptive statistics ............................................................................................................. 4 2.1.3 The confirmatory data analysis .......................................................................................... 4 2.2 Knowledge Discovery ..................................................................................................................... 5 2.2.1 Data Mining and Knowledge Discovery ............................................................................. 6 2.2.2 Uses of Knowledge Discovery Tools ................................................................................. 10 2.3 Application of Data Science ......................................................................................................... 11 2.4 Data Mining Examples ................................................................................................................ 12 2.4.1 Marketing ........................................................................................................................... 12 2.4.2 Banking ............................................................................................................................... 12 2.4.3 Medicine .............................................................................................................................. 13 2.4.4 Telecommunication ............................................................................................................ 13 3 DATA MINING AND BIOINFORMATICS ................................................................................... 14 3.1 Bioinformatics .............................................................................................................................. 14 3.1.1 Human Genome Project (HGP) ........................................................................................ 15 3.1.2 Transcription and Translation ......................................................................................... 16 3.2 Clustering Techniques ................................................................................................................. 17 3.2.1 Fuzzy C ................................................................................................................................ 17 3.2.2 K-Means .............................................................................................................................. 18 3.2.3 M-Means ............................................................................................................................. 19 3.2.4 P-DBSCAN ......................................................................................................................... 19 3.2.5 Self- Organizing Maps (SOM) .......................................................................................... 20 3.3 Machine Learning Algorithms for Bioinformatics ................................................................... 20 3.3.1 Support Vector Machine (SVM) ....................................................................................... 21 3.3.2 K – Nearest Neighbor (KNN) ............................................................................................ 22 3.3.3 Decision Tree ...................................................................................................................... 23 3.3.4 Bayesian System ................................................................................................................. 23 3.3.5 Artificial Neural Networks ................................................................................................ 24 3.4 Perl and Python Libraries ........................................................................................................... 25 3.5 Google Cloud Platform - Cloud Life Sciences ........................................................................... 26 3.6 Biological Database and Integration .......................................................................................... 27 3.7 Common Example ........................................................................................................................ 28 3.7.1 Genomics ............................................................................................................................. 28 3.7.2 Proteomics ........................................................................................................................... 28 3.7.3 Micro-Arrays ...................................................................................................................... 29 3.7.4 Phylogenetic ........................................................................................................................ 29 3.7.5 Population Genetics ........................................................................................................... 30 3.8 Challenges Facing Data Mining in Bioinformatics ................................................................... 30 3.9 The Future of Bioinformatics and Data Analysis ...................................................................... 32 4 CONCLUSION .................................................................................................................................. 33 REFERENCES ...................................................................................................................................... 34 GRAPHS GRAPH 1. Fuzzy clustering.................................................................................................................... 18 FIGURES FIGURE 1. Data mining process ............................................................................................................. 8 FIGURE 2. Data mining adopts techniques from many domains .......................................................... 10 FIGURE 3. Complex healthcare data .................................................................................................... 15 FIGURE 4. Bioinformatics Clustering Techniques ............................................................................... 21 FIGURE 5. Application of SVMs .......................................................................................................... 22 FIGURE 6. KNN Algorithm .................................................................................................................. 23 FIGURE 7. Artificial neural network .................................................................................................... 25 TABLES TABLE 1. Cloud life access

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    40 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us