Data-Intensive Computing with Hadoop


15-712 Project Proposal
Anthony Gitter, Alex Grubb, and Jeffrey Barnes
October 8, 2007

1 Background

In virtually all scientific disciplines, technological advances over the past years and decades have resulted in exponential growth in the quantity of raw scientific data available for analysis. Massive datasets, ranging from hundreds of MBs to many TBs, are increasingly common and available to the public. Spanning the breadth of scientific fields, this publicly accessible information includes NASA astrophysics and astronomy collections [1], partial three-dimensional mappings of the universe [2], earthquake and seismic data [3], complete genomes for a multitude of organisms [4], and so on.

However, collecting and publishing raw data is only the beginning of scientific understanding. Unfortunately, the enormous benefits to be gained by thoroughly analyzing these datasets are typically matched by the complexity of such analysis. For instance, sequence alignment is a powerful technique for predicting the function of newly discovered genes [5], but computational limitations prevent the best available algorithms from being used on full datasets. Supercomputers and customized high-performance solutions such as the CLC Bioinformatics Cube [6] can sometimes be employed, but the cost of such solutions is prohibitive. Furthermore, for some classes of scientific problems involving huge datasets (including those mentioned above), processing power can be largely wasted even on a supercomputer if the computation is disk-bound.

With the advent of multi-core personal computers, clusters of commodity machines have the potential to become a reasonable alternative platform for intense scientific computing. Using clusters of readily available machines in place of supercomputers or problem-specific high-performance machines could not only reduce costs, power consumption, and possibly execution time, but also open the multitude of rich, massive datasets to a wider community of researchers, conceivably accelerating the rate of scientific discovery.

2 Project Overview

We aim to demonstrate that a cluster of standard computers can be used instead of supercomputers for scientific computing over very large datasets. Because our objective is not to derive new results from these datasets, we will select a well-known problem that 1) requires a supercomputer to solve or 2) has only been partially solved because the best algorithms running on typical computers cannot handle the complete dataset. By adapting the target algorithms to take advantage of a cluster and exploiting parallelism in this environment, we expect to reduce cost or power consumption in case 1) or increase the scale of problems that can be solved in case 2). While we have yet to decide on one particular problem and dataset to target, we present two viable alternatives below.

3 Sequence Alignment

Sequence alignment allows biological researchers to identify similar genes and develop hypotheses about common functionality between sequentially similar or homologous genes [7]. The problem is non-trivial because genetic mutations may introduce gaps in sequences that are otherwise very similar, and the size of the peptide and nucleotide sequence databases to be searched continues to grow dramatically [8].

The Smith-Waterman algorithm [9] is widely regarded as the best available sequence alignment algorithm, but when it is run against entire sequence databases, which is the typical use case, it is unacceptably slow. BLAST [5], an approximation of the Smith-Waterman algorithm, has therefore emerged as the preferred way to perform sequence alignment because it accommodates large sequence databases. However, the bioinformatics corporation CLC bio claims that BLAST “can potentially miss up to 50% of what you are searching for” [6]. Their solution is a specialized high-performance computer, the CLC Bioinformatics Cube, which is reported to run Smith-Waterman over 100 times faster than a 3.0 GHz Pentium desktop computer. While pricing is not provided on their corporate web site, this custom machine is most likely expensive and has limited use (if any) outside of bioinformatics applications. There is thus a considerable gap in the options available to biologists who wish to perform sequence alignment: they can run an approximate algorithm, sacrificing the quality of their results, or they can invest in a (presumably expensive) single-purpose machine to run the best available algorithm on all obtainable data. If we focus on this problem, we will seek to eliminate the gap by implementing the Smith-Waterman algorithm on a cluster of standard machines with a runtime that is reasonably close to that of BLAST.
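To make the cost concrete, the following is a minimal, illustrative Java sketch of the Smith-Waterman score computation for a single pair of sequences; the match, mismatch, and gap values are arbitrary example parameters rather than values taken from [9] or from any real scoring matrix. Filling the dynamic-programming matrix takes time proportional to the product of the two sequence lengths, and a database search repeats this work for every entry in the database, which is why the exact algorithm becomes so slow at scale.

    /** Minimal Smith-Waterman local alignment score with a linear gap penalty.
     *  Scoring parameters are illustrative examples only. */
    public class SmithWaterman {
        static final int MATCH = 2, MISMATCH = -1, GAP = -2; // example scoring scheme

        /** Returns the best local alignment score between sequences a and b. */
        static int score(String a, String b) {
            int[][] h = new int[a.length() + 1][b.length() + 1]; // row 0 and column 0 stay 0
            int best = 0;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int sub = (a.charAt(i - 1) == b.charAt(j - 1)) ? MATCH : MISMATCH;
                    int cell = Math.max(0, h[i - 1][j - 1] + sub); // diagonal: match or mismatch
                    cell = Math.max(cell, h[i - 1][j] + GAP);      // gap in b
                    cell = Math.max(cell, h[i][j - 1] + GAP);      // gap in a
                    h[i][j] = cell;
                    best = Math.max(best, cell);
                }
            }
            return best;
        }

        public static void main(String[] args) {
            System.out.println(score("ACACACTA", "AGCACACA")); // small example pair
        }
    }

Distributing a full database search then amounts to partitioning the database across machines and running many independent score computations in parallel.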
4 Machine Learning Over Massive Datasets

With the increased popularity of machine learning techniques, the number of problems being solved with statistical data mining has grown rapidly. Many groups, including Carnegie Mellon’s Auton Lab [10], are working specifically to make these problems computationally feasible and efficiently solvable, due to the broad applicability of these techniques and the increasing need to apply them to massive sets of data. In many of these cases the datasets cannot be usefully evaluated by humans because of their enormous scale, and machine learning techniques will be necessary to discover the properties and phenomena embodied in them. This is especially true of datasets generated by observing natural processes and systems, such as the computational biology data mentioned above and the astronomical data mentioned in [2].

Using the Sloan Sky Survey astronomical data as a motivating example, the need for efficient systems for solving these problems is clear. The raw data in this dataset is over 10 TB of image data covering approximately 25% of the night sky. Further processed forms of this data exist and are used for analysis, but even these are often multiple gigabytes in size. A number of initial research efforts into mining this data have been made [11], and smaller subsets of the dataset are often used to test the efficiency of data mining techniques [12]. Currently, much of the processing done on this dataset in particular takes the form of nearest-neighbor matching on the processed data to discover clusters of similar galaxies, because many other data mining methods are computationally infeasible with current techniques. Further investigation is needed here, and a meeting with researchers from the Auton Lab is planned to get a better idea of which problems would benefit most from efficient parallelization and a proper distributed architecture.
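As a rough sketch of the kind of computation described above, the following Java fragment performs a brute-force nearest-neighbor query over a set of feature vectors; the dimensionality and the randomly generated data are placeholders for illustration, not properties of the actual survey catalogs. Every query scans the entire dataset, which is what makes the approach costly at survey scale and a natural candidate for data-parallel execution.

    import java.util.Random;

    /** Toy brute-force nearest-neighbor search; data and dimensionality are illustrative only. */
    public class NearestNeighbor {
        /** Returns the index of the point in data closest to query (Euclidean distance). */
        static int nearest(double[][] data, double[] query) {
            int bestIdx = -1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int i = 0; i < data.length; i++) {
                double d = 0.0;
                for (int k = 0; k < query.length; k++) {
                    double diff = data[i][k] - query[k];
                    d += diff * diff; // squared Euclidean distance
                }
                if (d < bestDist) {
                    bestDist = d;
                    bestIdx = i;
                }
            }
            return bestIdx;
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            double[][] catalog = new double[100000][5]; // stand-in for a processed catalog
            for (double[] row : catalog) {
                for (int k = 0; k < row.length; k++) {
                    row[k] = rng.nextDouble();
                }
            }
            double[] query = {0.1, 0.2, 0.3, 0.4, 0.5};
            System.out.println("nearest index: " + nearest(catalog, query));
        }
    }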
5 Parallelization

The essence of our project is to adapt a resource-intensive, data-driven problem to take advantage of a highly parallel architecture. This section describes how we will do so and, in particular, which existing software frameworks we might use to implement a solution.

One of the most prominent frameworks relevant to our task is MapReduce [13], a programming model from Google. The basic idea is that system developers (i.e., us) frame their problem in terms of a map function and a reduce function. The map function takes a key/value pair as input, processes it, and generates zero or more intermediate key/value pairs. The reduce function, which the framework invokes once for each intermediate key, aggregates the intermediate values associated with that key. In principle, expressing the solution in terms of these primitives is all the developer needs to do; lower-level tasks, such as partitioning the input data, scheduling execution, and handling inter-process communication, are handled automatically. MapReduce is especially well suited to parallel computations over large (perhaps many terabytes) datasets, which is exactly the domain we are dealing with.

The Google implementation of MapReduce is written in C++ and is proprietary. However, open-source implementations are available, of which the most mature by far is Hadoop [14, 15], a Java software framework that implements MapReduce. Hadoop is part of the Lucene project [16], a popular Java information retrieval library supported by the Apache Software Foundation. Hadoop boasts the ability to scale up to thousands of nodes and petabytes of data. The Hadoop implementation of MapReduce runs on top of the Hadoop Distributed File System (HDFS) [17], which differs from other distributed file systems in that it is intended to be deployed on low-cost commodity hardware that is subject to failure. In addition, HDFS is tuned to support typical file sizes of gigabytes or terabytes.

In short, Hadoop is ideally suited to our aims. It presents a high-level framework for developing highly parallel software, abstracting away the messiest implementation details while offering impressive performance and scalability. These are important concerns, since we may indeed be running applications on terabytes of data and dozens of cores. The advantage we gain by using an existing framework that offers a simple interface for highly parallel programming can hardly be overstated, given the limited time available for implementation and our team members’ lack of experience programming applications distributed across more than one or two cores.
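To make the programming model concrete, the following is a minimal sketch of the canonical word-count job expressed against Hadoop’s MapReduce API; the class and method names follow current Hadoop documentation rather than the 2007-era interfaces, and the input and output paths are placeholders. The mapper emits an intermediate (word, 1) pair for every token, the reducer sums the counts associated with each word, and the framework handles input splitting, the shuffle of intermediate pairs by key, and task scheduling.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        /** Map: for each line of input, emit (word, 1) for every token. */
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        /** Reduce: sum the counts emitted for each distinct word. */
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of map output
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (placeholder)
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (placeholder)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same structure carries over to our candidate problems, with sequence records or catalog entries taking the place of lines of text.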
6 Hardware

For implementation and testing, we intend to take advantage of the Intel cluster that Prof. David Andersen mentioned. We do not yet know the technical details of these machines, but we understand that the cluster has dozens of cores, which should provide a satisfactory testbed for our implementation.

7 Timeline

This section presents a list of tasks that we expect to constitute our project. The list serves two purposes. First, it provides a more concrete description of what our project will entail. Second, it provides us with a set of milestones by which to judge our progress during the course of the semester (and targets for which to aim). In total, there are slightly more than eight weeks between the submission of this proposal and the scheduled presentation of our project.