Need and Role of Scala Implementations in Bioinformatics

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 2, 2017

Abbas Rehman, Muhammad Atif Sarwar, Ali Abbas, Javed Ferzund
Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, Pakistan

Abstract: Next Generation Sequencing has resulted in the generation of large volumes of omics data at a speed that was not possible before. This data is only useful if it can be stored and analyzed at the same speed. Big Data platforms and tools like Apache Hadoop and Spark have solved this problem. However, most of the algorithms used in bioinformatics for pairwise alignment, multiple alignment and motif finding are not implemented for Hadoop or Spark. Scala is a powerful language supported by Spark. It provides constructs like traits, closures, functions, pattern matching and extractors that make it suitable for bioinformatics applications. This article explores the bioinformatics areas where Scala can be used efficiently for data analysis. It also highlights the need for Scala implementations of algorithms used in bioinformatics.

Keywords: Scala; Big Data; Hadoop; Spark; Next Generation Sequencing; Genomics; RNA; DNA; Bioinformatics

I. INTRODUCTION

Today, we are living in the world of Big Data. A huge amount of data is produced on a daily basis. Major sources of data include social media, enterprise systems, sensor-based applications, bioinformatics sequencing machines, smartphones, digital videos or pictures and the World Wide Web. The characteristics of Big Data are Veracity, Velocity, Variety, Volume and potential Value (known as the 5 V's). To make this data useful, it needs to be stored and analyzed with accuracy and speed. Traditional techniques are unable to store and analyze such large amounts of data. These techniques are better suited to limited amounts of data, because the cost of analysis increases with data volume.

To deal with this hurdle, Big Data platforms and tools have been introduced that can analyze a large amount of data with accuracy, speed and scalability. With Big Data platforms like Hadoop, the cost of analysis is also reduced, as they run on commodity hardware. The major challenges for Big Data are speed, performance, efficiency, scalability and accuracy. Big Data platforms and tools like Hadoop (a distributed management system) and Apache Spark (for big data analysis) address these issues.

NGS (Next Generation Sequencing) machines have brought an evolutionary change in the generation of sequence data. NGS machines generate a huge amount of sequence data per day that needs to be stored, analyzed and managed well to gain the maximum advantage from it. Existing bioinformatics techniques, tools and software are not keeping pace with the speed of data generation. Older bioinformatics tools offer poor performance, accuracy and scalability when analyzing large amounts of data. When storing, managing and analyzing the volumes of data generated nowadays, these tools require more time and cost while delivering less accuracy.

Apache Hadoop is a leading platform for Big Data processing. Hadoop is an open-source Java platform that can scale to clusters of thousands of nodes and is used for parallel processing and execution of Big Data workloads. The main components of its ecosystem are Pig, HBase, Hive, HDFS (Hadoop Distributed File System), MapReduce and the Apache Spark framework. Pig is a high-level language used for scripting. It includes load and store operators and is extensible, allowing users to create their own functions. HBase provides automatic sharding and sparse data processing as a replacement for an RDBMS (Relational Database Management System). Hive is not used for real-time processing; it is used for large-scale analytics and efficient query processing with the help of its metastore unit. HDFS is a file system developed for storing and processing the large files created by Hadoop components. Its two units are the data node and the name node. MapReduce is designed for parallel execution and processing of large datasets on the Hadoop platform. Apache Spark is a framework designed especially for analytics, with support for the languages Java, Python, R and Scala. Its main operations are caching, actions and transformations.

Many bioinformatics algorithms are implemented in Scala for the Apache Spark framework. Scala is a functional, statically typed and object-oriented language. It is well suited to concurrent processing. Its main features are traits, closures and functions, which are used in the processing of multiple genome sequencing algorithms. Its syntax will be broadly familiar to programmers coming from languages such as C++.

Scala provides arrays, loops, strings, classes, objects, collections, pattern matching and extractors. All of these structures and statements are used for comparing bioinformatics algorithms in Scala on the Spark framework. Scala also contains many built-in methods, libraries and functions that are very useful for designing bioinformatics algorithms. Scala plays an important role in bioinformatics applications.

Genome sequencing, motif finding, pairwise alignment and multiple alignment are core tasks in bioinformatics. Scala is very important for these algorithms. In genome and multiple sequencing, many algorithms are used for handling biological sequences. These algorithms are implemented in Scala. In Apache Spark, motif finding algorithms are implemented using Scala. In pairwise alignment, Scala is very significant for pattern matching.

Spark provides a Scala shell for the implementation of these bioinformatics algorithms. Primitive types and anonymous functions in Scala perform well for managing arrangements of multiple sequences. Anonymous functions are used in transformations, actions and file loading for analytics of bioinformatics datasets in the Apache Spark framework. Shared variables and key-value pairs are used in Hadoop with Scala for bioinformatics algorithms.

To implement bioinformatics algorithms in Scala on the Hadoop platform, datasets are stored in specific formats. Different storage formats are used for different algorithms on the Hadoop and Spark platforms, for example FASTA, FASTQ, CSV, ADAM and BAM (Binary Alignment Map)/SAM (Sequence Alignment Map).

The objectives of this study are:

• To explore the supported languages and supported platforms for genome sequencing, motif finding, pairwise alignment and multiple alignment algorithms
• To analyze the need for Scala for the implementation of bioinformatics algorithms on the Hadoop platform
• To explore the use of Scala in existing bioinformatics tools

[...] illustrates further research ideas in his paper, in which machine learning techniques and algorithms are implemented in the Hadoop and Spark frameworks. Sarwar et al. [2] have proposed a review study of bioinformatics tools. They demonstrate implementations of tools for alignment viewers, database search and genomic analysis on the Hadoop and Apache Spark frameworks using Scala. The study also describes further research domains for the implementation of bioinformatics tools on Hadoop and Apache Spark using various languages such as Java, Scala and Python.

SeqPig is a library and tool for scalable analysis and querying of sequencing data [3]. It uses the Hadoop engine Apache Pig, which automatically parallelizes and distributes tasks by translating them into a sequence of MapReduce jobs. It provides an extension mechanism for library functions written in supported languages (Python, Java and JavaScript) and also provides import and export functions for file formats such as FASTQ, QSEQ, FASTA, SAM and BAM, allowing the user to load and export sequencing data. SeqPig provides five read statistics, among them: (a) average base quality of a read; (b) length of reads; (c) base by position inside the read; (d) GC content of a read. Combined in a single script, it can also be used for ad hoc analysis, although SparkSeq is the best option for ad hoc analysis.

Wiewiorka et al. [4] have launched a bioinformatics tool used to build genome pipelines in Scala and to analyze RNA and DNA sequences. The purpose of this work was to determine scalability and achieve very fast performance in the analysis of large datasets such as protein, genome and DNA data. A new MapReduce model has been developed for parallel and distributed execution in Spark. Storing the data in HDFS requires a BAM library, which provides direct data access and format support. After the data is stored in Hadoop, Spark queries are applied to the sequencing datasets and the data is analyzed.

Nordberg et al. [5] proposed BioPig, used for the analysis of large sequencing datasets from the perspectives of scalability (scaling with data size), programmability (reduced development time) and portability (running on Hadoop without modification). To evaluate these three perspectives, a k-mer application was implemented to check its performance and compare it with other methods. BioPig uses the methods pigKmer, pigDuster and pigDereplicator. Dataset sizes for BioPig range from 100 MB to 500 GB. BioPig resembles SeqPig in that both use the Hadoop and Pig environment and the same functions
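The read statistics listed for SeqPig and the k-mer workload used to evaluate BioPig can be sketched with the Scala features discussed in the introduction (case classes, pattern matching and anonymous functions). The block below is a minimal illustration under stated assumptions, not code from SeqPig, BioPig or SparkSeq: the names `FastqRead`, `ReadStats`, `gcContent`, `meanQuality` and `kmerCounts` are hypothetical, qualities are assumed to use the Phred+33 encoding, and plain Scala collections stand in for a distributed dataset.

```scala
// Hypothetical record type for one FASTQ entry (not a SeqPig or Spark type).
final case class FastqRead(id: String, bases: String, quals: String)

object ReadStats {
  // Parse 4-line FASTQ records; pattern matching silently skips malformed ones.
  def parse(lines: Seq[String]): Seq[FastqRead] =
    lines.grouped(4).collect {
      case Seq(header, bases, plus, quals)
          if header.startsWith("@") && plus.startsWith("+") &&
            bases.length == quals.length =>
        FastqRead(header.drop(1), bases, quals)
    }.toSeq

  // Fraction of G or C bases in a read (0.0 for an empty read).
  def gcContent(read: FastqRead): Double =
    if (read.bases.isEmpty) 0.0
    else read.bases.count(b => b == 'G' || b == 'C').toDouble / read.bases.length

  // Mean base quality of a read, assuming Phred+33 encoded quality characters.
  def meanQuality(read: FastqRead): Double =
    if (read.quals.isEmpty) 0.0
    else read.quals.map(c => c.toInt - 33).sum.toDouble / read.quals.length

  // Count every k-mer of length k across all reads.
  def kmerCounts(reads: Seq[FastqRead], k: Int): Map[String, Int] =
    reads
      .flatMap(read => read.bases.sliding(k).filter(_.length == k))
      .groupBy(identity)
      .map { case (kmer, hits) => kmer -> hits.size }
}
```

A possible use is `ReadStats.parse(lines).map(ReadStats.gcContent)` for per-read GC content, or `ReadStats.kmerCounts(reads, 3)` for a 3-mer table. On Spark, similar anonymous functions could in principle be passed to RDD transformations such as `map` and `flatMap`, but that wiring is not shown here.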
