A Ll a Bout BLASTBLAST in P Arallelmpi BLASTP Erformancempi BLAST

Total Page:16

File Type:pdf, Size:1020Kb

A Ll a Bout BLASTBLAST in P Arallelmpi BLASTP Erformancempi BLAST Parallelization of BLAST for RADIANT LOGO mpiBLAST: Computational Clusters HERE Aaron Darling Wu-chun Feng Dept. of Computer Science Advanced Computing Laboratory University of Wisconsin-Madison Los Alamos National Laboratory [email protected] [email protected] All About BLAST mpiBLAST Performance What is BLAST? Predicted ORF sizes for Erwinia Chrysanthemi 3937 Length of sequence entries in the nt database Modelling BLAST Usage 1500 - BLAST is a family of sequence database search algorithms [1,2] Search Name Query Type Database Type Translation 60000 blastn Nucleotide Nucleotide None - The length, number and content of the query and database 1000 - Foundation for much research in molecular genetics tblastn Peptide Nucleotide Database 40000 Frequency sequences significantly affect BLAST runtime [11] Frequency blastx Nucleotide Peptide Query 500 - Must choose queries and the database to accurately reflect 20000 - Searches for similarities between a short query sequence and a blastp Peptide Peptide None a typical BLAST usage pattern 0 large set of database sequences. 0 tblastx Nucleotide Nucleotide Query and Database 0 5000 10000 15000 - We model a common problem in genomics: using BLAST 0 5000 10000 15000 20000 ORF length in base pairs Sequence length in base pairs Standard BLAST database search types. DNA and ami- on all predicted ORFs in a newly sequenced genome [16] - Query and database sequences are either DNA or amino acid (pro- Table 1. no acid sequences are translated by BLAST, making searches with Sequence database: nt Figure 5. Histograms of the query sequence lengths and database se- tein) sequence, BLAST will translate between DNA and amino acid several combinations of database and query type possible. Queries: Predicted ORFs from the E. Chrysanthemi genome quence lengths that were used for benchmarking NCBI BLAST and mpi- sequenceson-the-fly. BLAST. Query: Performance of blastn In Low Memory >Perilla frutescens mRNA for leucoanthocyanidin dioxygenase CATCTACTCAAATTAAGAAATAGATAGAAATGGTTACGAGTGCAATGGGTCCAAGCCCGCGGGTGGAGGAACTGGCCCGA AGCGGACTCGACACGATCCCAAAAGTATACGTGCGGCCCGAAGAGCACCTGAAAAGTATCATAGGCAACATTTTGGCGGA Search time ATAAAATTAAATAATTTTTCAATTGTACTAAAAAAAAAAAAAAAAAAAAAAAAA The BLAST Algorithm Blocks read/s Performance in Low Memory Database: >gi|3123744|dbj|AB013447.1|AB013447 Brassica napus mRNA for aluminum-induced GCCTCAAACAGTTTAATTTTCTTCAAAACTAGTTTTTTTTTGGTTTTAGTTGGTATCCACGGAAGAGAGAGAAAATGTTG Goal: GGAATTTTCAGCGGACGTATAGTATCATTGCCGGAAGAGCTGGTGGCTGCCGGGAACCGAACTCCGTCGCCGAAGACAAC 1. The query sequence is indexed with a string matching data structure CGGATCGGTCCTAGTCAACAAGTTCGTAGAGAAAAATCCCTCTGCCGTGTCCGTACAGGTCGGCGACTACGTGCAGCTTG - Determine BLAST behavior when the database is too large to CTTATAGCCACCACAAAGAGAGTCCTCTCCGGCCAAGGTCATTTGGGGCTAAGGATGAGATATTCTGCTTGTTCCAAGGC >gi|221778|dbj|D00026.1|HS2HSV2P4 Herpes simplex virus type 2 gene for glycoprotein D 2. Database sequences are ‘streamed’ past the query index TTTTACTAGAGGAGTATCCCCGCTCCCGTGTACCTCTGGGCCCGTGTGGGAGGGTGGCTGGGGTATTTGGGTGGGTATTG fit in the system buffer-cache GCGGCCCGAAGGGCCCGCCGCGCATTTAAGGAGTCGCCGCCCCGACTCTGTGTCTTCGGGTGACTTGGTGCGCCGCCGTC 3. Exact subsequence matches (hits) of some minimum size are identified AGCTAGTCTCCGATCTGCCCCGACCGACGGCTCCTGCCACCCGAACATG Method: >gi|7328961|dbj|AB032155.1|AB032154S2 Homo sapiens PGFS gene for prostaglandin F synthase [AKR 1C3], exon 6, 7 4. Hits are extended to the surrounding matching region by scoring TTTTTTTCTTGATGCTGAAATCTATCCAAACATCACCAGTGACATTTCCTTGAAAGTAGTGCTTTTGTCTTTCAGACTTG - NCBI BLAST and mpiBLAST were benchmarked using CCCTCACGAGTCCTTGACCAAATTCTTGCTTTCTGGCACAATCTGAAGCCCAAAGGCTCTAAGAATCTTACTGTATCCCA each successive extension with a statistically motivated function. GATATGGAACTTGTTACATCTCCTTCTAGTTGTCAAAGGTCTTTGGAAGGCAGGATCTGTTTCATAGATGAGCTTCTGTT increasingly large database sizes TAGAAATGGCATTTCCACTTATACTTTTAGAAGATATATAAAATTTATTTCTATGAAAAAGGTTATTACTTGACAATAAT >gi|7328962|dbj|AB032156.1|AB032154S3 Homo sapiens PGFS gene for prostaglandin F synthase [AKR 1C3], exon 8 5. The extended hits, called High Scoring Pairs (HSPs), are sorted based on CAATATCTATCGAACTTAATATATAATCTTCTGTTATTTAAAAATTTGCCTCTAGATTCTTGTGGGCCTTCTAGGTACAT - Attempt to cache the DB by doing an untimed search before ATATCACATTGTCCCAAACCTGCTCAGCTCCTTATCAAATCAAAAACATTTCCATCAACTTTGTGGTCCAGGTGCCAATT score and are reported to the user in one of several formats CCACCTCCTTCATATGGAATTGCTTGCTAGATCCTG each timed search Hit Extension: Results: GTGGGTATTGGCGGCCCGAAGGGCCCGCCGCG 0 100 200 300 400 500 600 ++---+++++++++++-++-+ Database Size (Millions of Nucleotides) - When the DB is too large for buffer-cache, disk I/O increases AAGACTACGTGCGGCCCGAAGAGCACCTGAAA sharply as does BLAST search time Figure 1. An example of the BLAST algorithm Figure 6. blastn searches as the database grows larger than - mpiBLAST on a single node behaves similarly to NCBI BLAST the system buffer-cache. Disk I/O sharply increases when the da- - When run with 2 or 4 workers, the DB can be respectively tabase becomes too large, degrading blastn performance. Growth of Memory vs. Genbank 2 or 4 times larger before heavy disk I/O occurs 6000 Why Parallelize BLAST? Compressed Genbank size 5000 Memory size mpiBLAST blastn Performance Sequence DB is larger than the main memory of a standard workstation. 600 mpiBLAST tblastx Performance 4000 blastn, 1 node 2000 - DB searches must utilize slow disk I/O 500 blastn, 2 nodes 1800 tblastx, 1 node 3000 blastn, 4 nodes tblastx, 2 nodes - Sequence DBs are growing faster than they can be processed [14] 400 1600 2000 1400 tblastx, 4 nodes - Solution 1: Buy a computer with more memory Size (MB) 300 1200 - Solution 2: Use inexpensive commodity components in parallel 1000 1000 200 800 0 600 1986 1988 1990 1992 1994 1996 1998 2000 2002 100 400 BLAST searches are very CPU intensive Execution Time (s) Year 0 200 - But sequence DB entries are essentially independent and can be 0 100 200 300 400 500 600 700 800 900 1000 Execution Time (s) 0 0 100 200 300 400 500 600 700 800 900 1000 searched in parallel. Growth of computer memory size vs. Genbank size. Sequence DB size (MB) Figure 2. Sequence DB size (MB) - Communication costs during a parallel search are low Genbank is a public repository for DNA sequences that is maintained by NCBI. The compressed size of Genbank is an approximation - Parallel BLAST can utilize efficient, low cost clusters like Green Destiny Figure 7. Performance of blastn searches using mpiBLAST. Performance of tblastx searches using mpiBLAST. based on 2-bit per base encoding of the data at [3] Figure 8. blastn performance suffers severely when the database becomes tblastx is not affected by heavy disk I/O because it is CPU-bound. too large for the system buffer-cache. In this case, mpiBLAST yields mpiBLAST achieves linear speedup for tblastx searches. a super-linear speedup versus a single node. BLAST in Parallel mpiBLAST, etc. Parallel BLAST Approaches Figure 3. Our ‘hokey’ depiction of database seg- mpiBLAST blastn speedup mentation. Each worker node receives a distinct mpiBLAST Speedup 180 portion of the entire sequence database. The pro- 160 Linear Speedup Multithreading 140 mpiBLAST, Green Destiny cess does not actually involve a spaghetti press. Goal: 120 - Each thread searches a distinct portion of the database 100 - Determine the speedup and scalability of mpiBLAST on a large 80 - Implemented in NCBI BLAST 60 cluster 40 Method: 20 Query Segmentation 0 - Benchmark mpiBLAST on Green Destiny[9,10] Relative Execution Time 0 20 40 60 80 100 120 140 - Each query is farmed out to a distinct node in a cluster Worker Nodes Number of processors - Each node searches the entire database - Each node has a 667 MHz Transmeta TM5600, 640MB RAM, - Implemented in HT-BLAST[4], cBLAST[15], U. Iowa BLAST[5], and a 20GB hard disk. TurboBLAST[6,7] - Nodes are interconnected with switched 100Base-TX ethernet Figure 9 (above). The speedup of mpiBLAST on - Operating system is Linux 2.4 Green Destiny. Speedup is based on the search time Results: for 300 kb of query sequences against 5.1 GB of the nt Database Segmentation database. - Each node in a cluster searches a distinct portion of the database - mpiBLAST exhibits super-linear speedup when run on multiple (right) Green Destiny, a 240 node bladed - The query is sent to all nodes in the cluster cluster nodes. Figure 10 . beowulf based on the Transmeta Crusoe processor. - Implemented in TurboBLAST[6,7], Blackstone PowerBLAST, and - Efficiency decreases as the number of workers increases Platform LSF Future Directions mpiBLAST can be mpiBLAST is a freely available open-source implementation - Perform a detailed analysis of ‘Where the time goes’ when downloaded from doing parallel BLAST searches of database segmentation for BLAST searches - Implement query segmentation for large clusters with high http://mpiblast.lanl.gov startup costs - Use the NCBI toolkit to directly output the merged BLAST results in several formats including XML, ASN.1 and tab delimited text mpiBLAST Approach - Implement seamless fault tolerance so that a search need not be restarted when a node goes down. - Database segmentation References - Load balancing 1. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic mpiBLAST Algorithm - Basic fault tolerance Acids Research, 25, pp. 3389-3402, 1997. - Integration with scheduler systems (PBS) 2. Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David
Recommended publications
  • Gene Prediction: the End of the Beginning Comment Colin Semple
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by PubMed Central http://genomebiology.com/2000/1/2/reports/4012.1 Meeting report Gene prediction: the end of the beginning comment Colin Semple Address: Department of Medical Sciences, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK. E-mail: [email protected] Published: 28 July 2000 reviews Genome Biology 2000, 1(2):reports4012.1–4012.3 The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2000/1/2/reports/4012 © GenomeBiology.com (Print ISSN 1465-6906; Online ISSN 1465-6914) Reducing genomes to genes reports A report from the conference entitled Genome Based Gene All ab initio gene prediction programs have to balance sensi- Structure Determination, Hinxton, UK, 1-2 June, 2000, tivity against accuracy. It is often only possible to detect all organised by the European Bioinformatics Institute (EBI). the real exons present in a sequence at the expense of detect- ing many false ones. Alternatively, one may accept only pre- dictions scoring above a more stringent threshold but lose The draft sequence of the human genome will become avail- those real exons that have lower scores. The trick is to try and able later this year. For some time now it has been accepted increase accuracy without any large loss of sensitivity; this deposited research that this will mark a beginning rather than an end. A vast can be done by comparing the prediction with additional, amount of work will remain to be done, from detailing independent evidence.
    [Show full text]
  • (12) Patent Application Publication (10) Pub. No.: US 2003/0211987 A1 Labat Et Al
    US 2003O21, 1987A1 (19) United States (12) Patent Application Publication (10) Pub. No.: US 2003/0211987 A1 Labat et al. (43) Pub. Date: Nov. 13, 2003 (54) METHODS AND MATERIALS RELATING TO Apr. 7, 2000 (US)........................................... O9545,714 STEM CELL GROWTH FACTOR-LIKE Apr. 11, 2000 (US)........................................... O9547358 POLYPEPTIDES AND POLYNUCLEOTDES Publication Classification (76) Inventors: Ivan Labat, Mountain View, CA (US); Y Tom Tang, San Jose, CA (US); Radoje T. Drmanac, Palo Alto, CA (51) Int. Cl." ....................... A61K 38/18; CO7K 14/475; (US); Chenghua Liu, San Jose, CA C12O 1/68; CO7H 21/04; (US); Juhi Lee, Fremont, CA (US); C12M 1/34; C12P 21/02; Nancy K Mize, Mountain View, CA C12N 5/08 (US); John Childs, Sunnyvale, CA (52) U.S. Cl. ......... 514/12; 435/69.1; 435/6; 435/320.1; (US); Cheng-Chi Chao, Cupertino, CA 435/366; 530/399; 536/23.5; (US) 435/287.2 Correspondence Address: MARSHALL, GERSTEIN & BORUN LLP (57) ABSTRACT 6300 SEARS TOWER 233 S. WACKER DRIVE The invention provides novel polynucleotides and polypep CHICAGO, IL 60606 (US) tides encoded by Such polynucleotides and mutants or variants thereof that correspond to a novel human Secreted (21) Appl. No.: 10/168,365 Stem cell growth factor-like polypeptide. These polynucle otides comprise nucleic acid Sequences isolated from cDNA (22) PCT Filed: Dec. 23, 2000 libraries prepared from human fetal liver Spleen, ovary, adult (86) PCT No.: PCT/US00/35260 brain, lung tumor, Spinal cord, cervix, ovary, endothelial cells, umbilical cord, lymphocyte, lung fibroblast, fetal (30) Foreign Application Priority Data brain, and testis.
    [Show full text]
  • UNIVERSITY of CALIFORNIA, SAN DIEGO Use Solid K-Mers In
    UNIVERSITY OF CALIFORNIA, SAN DIEGO Use Solid K-mers In MinHash-Based Genome Distance Estimation A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Computer Science by An Zheng Committee in charge: Professor Pavel Pevzner, Chair Professor Vikas Bansal Professor Melissa Gymrek 2017 Copyright An Zheng, 2017 All rights reserved. The thesis of An Zheng is approved, and it is acceptable in quality and form for publication on microfilm and electroni- cally: Chair University of California, San Diego 2017 iii TABLE OF CONTENTS Signature Page . iii Table of Contents . iv List of Figures . v List of Tables . vi Acknowledgements . vii Abstract of the Thesis . viii Chapter 1 Introduction and background . 1 1.1 Genome distance estimation . 1 1.2 Current methods . 2 1.3 MinHash . 3 1.4 Solid k-mer powered MinHash . 5 Chapter 2 Method . 7 2.1 General scheme . 7 2.2 Identification of overlapping read pairs . 8 2.2.1 Workflow . 8 2.2.2 Data . 9 2.2.3 Implementation . 9 2.3 Genome identification . 10 2.3.1 Workflow . 10 2.3.2 Data . 10 2.3.3 Implementation . 10 Chapter 3 Result . 15 3.1 Identification of overlapping read pairs . 15 3.1.1 Performance comparison between solid k-mer pow- ered MinHash and regular MinHash . 15 3.1.2 Selecting the solid k-mer threshold . 17 3.2 Genome identification . 19 Chapter 4 Discussion and future work . 21 Bibliography . 23 iv LIST OF FIGURES Figure 1.1: An example of how to use MinHash to compute the resemblance of two genome sequences.
    [Show full text]
  • Reconstructing Contiguous Regions of an Ancestral Genome
    Downloaded from www.genome.org on December 5, 2006 Reconstructing contiguous regions of an ancestral genome Jian Ma, Louxin Zhang, Bernard B. Suh, Brian J. Raney, Richard C. Burhans, W. James Kent, Mathieu Blanchette, David Haussler and Webb Miller Genome Res. 2006 16: 1557-1565; originally published online Sep 18, 2006; Access the most recent version at doi:10.1101/gr.5383506 Supplementary "Supplemental Research Data" data http://www.genome.org/cgi/content/full/gr.5383506/DC1 References This article cites 20 articles, 11 of which can be accessed free at: http://www.genome.org/cgi/content/full/16/12/1557#References Open Access Freely available online through the Genome Research Open Access option. Email alerting Receive free email alerts when new articles cite this article - sign up in the box at the service top right corner of the article or click here Notes To subscribe to Genome Research go to: http://www.genome.org/subscriptions/ © 2006 Cold Spring Harbor Laboratory Press Downloaded from www.genome.org on December 5, 2006 Methods Reconstructing contiguous regions of an ancestral genome Jian Ma,1,5,6 Louxin Zhang,2 Bernard B. Suh,3 Brian J. Raney,3 Richard C. Burhans,1 W. James Kent,3 Mathieu Blanchette,4 David Haussler,3 and Webb Miller1 1Center for Comparative Genomics and Bioinformatics, Penn State University, University Park, Pennsylvania 16802, USA; 2Department of Mathematics, National University of Singapore, Singapore 117543; 3Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California 95064, USA; 4School of Computer Science, McGill University, Montreal, Quebec H3A 2B4, Canada This article analyzes mammalian genome rearrangements at higher resolution than has been published to date.
    [Show full text]
  • Outlier Detection in BLAST Hits∗
    Outlier Detection in BLAST Hits∗ Nidhi Shah1, Stephen F. Altschul2, and Mihai Pop3 1 Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA [email protected] 2 Computational Biology Branch, NCBI, NLM, NIH, Bethesda, MD, USA [email protected] 3 Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA [email protected] Abstract An important task in a metagenomic analysis is the assignment of taxonomic labels to sequences in a sample. Most widely used methods for taxonomy assignment compare a sequence in the sample to a database of known sequences. Many approaches use the best BLAST hit(s) to assign the taxonomic label. However, it is known that the best BLAST hit may not always correspond to the best taxonomic match. An alternative approach involves phylogenetic methods which take into account alignments and a model of evolution in order to more accurately define the taxonomic origin of sequences. The similarity-search based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. On the other hand, phylogenetic methods have the capability to identify new organisms in a sample but are computationally quite expensive. We propose a two-step approach for metagenomic taxon identification; i.e., use a rapid method that accurately classifies sequences using a reference database (this is a filtering step) and then use a more complex phylogenetic method for the sequences that were unclassified in the previous step.
    [Show full text]
  • Duplication, Deletion, and Rearrangement in the Mouse and Human Genomes
    Evolution’s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes W. James Kent*†, Robert Baertsch*, Angie Hinrichs*, Webb Miller‡, and David Haussler§ *Center for Biomolecular Science and Engineering and §Howard Hughes Medical Institute, Department of Computer Science, University of California, Santa Cruz, CA 95064; and ‡Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802 Edited by Michael S. Waterman, University of Southern California, Los Angeles, CA, and approved July 11, 2003 (received for review April 9, 2003) This study examines genomic duplications, deletions, and rear- depending on details of definition and method. The length rangements that have happened at scales ranging from a single distribution of synteny blocks was found to be consistent with the base to complete chromosomes by comparing the mouse and theory of random breakage introduced by Nadeau and Taylor (8, human genomes. From whole-genome sequence alignments, 344 9) before significant gene order data became available. In recent large (>100-kb) blocks of conserved synteny are evident, but these comparisons of the human and mouse genomes, rearrangements are further fragmented by smaller-scale evolutionary events. Ex- of Ն100,000 bases were studied by comparing 558,000 highly cluding transposon insertions, on average in each megabase of conserved short sequence alignments (average length 340 bp) genomic alignment we observe two inversions, 17 duplications within 300-kb windows. An estimated 217 blocks of conserved (five tandem or nearly tandem), seven transpositions, and 200 synteny were found, formed from 342 conserved segments, with deletions of 100 bases or more. This includes 160 inversions and 75 length distribution roughly consistent with the random breakage duplications or transpositions of length >100 kb.
    [Show full text]
  • BIOINFORMATICS ISCB NEWS Doi:10.1093/Bioinformatics/Btp280
    Vol. 25 no. 12 2009, pages 1570–1573 BIOINFORMATICS ISCB NEWS doi:10.1093/bioinformatics/btp280 ISMB/ECCB 2009 Stockholm Marie-France Sagot1, B.J. Morrison McKay2,∗ and Gene Myers3 1INRIA Grenoble Rhône-Alpes and University of Lyon 1, Lyon, France, 2International Society for Computational Biology, University of California San Diego, La Jolla, CA and 3Howard Hughes Medical Institute Janelia Farm Research Campus, Ashburn, Virginia, USA ABSTRACT Computational Biology (http://www.iscb.org) was formed to take The International Society for Computational Biology (ISCB; over the organization, maintain the institutional memory of ISMB http://www.iscb.org) presents the Seventeenth Annual International and expand the informational resources available to members of the Conference on Intelligent Systems for Molecular Biology bioinformatics community. The launch of ECCB (http://bioinf.mpi- (ISMB), organized jointly with the Eighth Annual European inf.mpg.de/conferences/eccb/eccb.htm) 8 years ago provided for a Conference on Computational Biology (ECCB; http://bioinf.mpi- focus on European research activities in years when ISMB is held inf.mpg.de/conferences/eccb/eccb.htm), in Stockholm, Sweden, outside of Europe, and a partnership of conference organizing efforts 27 June to 2 July 2009. The organizers are putting the finishing for the presentation of a single international event when the ISMB touches on the year’s premier computational biology conference, meeting takes place in Europe every other year. with an expected attendance of 1400 computer scientists, The multidisciplinary field of bioinformatics/computational mathematicians, statisticians, biologists and scientists from biology has matured since gaining widespread recognition in the other disciplines related to and reliant on this multi-disciplinary early days of genomics research.
    [Show full text]
  • Developing Bioinformatics Computer Skills.Pdf
    Safari | Developing Bioinformatics Computer Skills Show TOC | Frames My Desktop | Account | Log Out | Subscription | Help Programming > Developing Bioinformatics Computer Skills See All Titles Developing Bioinformatics Computer Skills Cynthia Gibas Per Jambeck Publisher: O'Reilly First Edition April 2001 ISBN: 1-56592-664-1, 446 pages Buy Print Version Developing Bioinformatics Computer Skills will help biologists, researchers, and students develop a structured approach to biological data and the computer skills they'll need to analyze it. The book covers Copyright the Unix file system, building tools and databases for bioinformatics, Table of Contents computational approaches to biological problems, an introduction to Index Perl for bioinformatics, data mining, data visualization, and tips for Full Description tailoring data analysis software to individual research needs. About the Author Reviews Reader reviews Errata Delivered for Maurice ling Last updated on 10/30/2001 Swap Option Available: 7/15/2002 Developing Bioinformatics Computer Skills, © 2002 O'Reilly © 2002, O'Reilly & Associates, Inc. http://safari.oreilly.com/main.asp?bookname=bioskills [6/2/2002 8:49:35 AM] Safari | Developing Bioinformatics Computer Skills Show TOC | Frames My Desktop | Account | Log Out | Subscription | Help Programming > Developing Bioinformatics Computer Skills See All Titles Developing Bioinformatics Computer Skills Copyright © 2001 O'Reilly & Associates, Inc. All rights reserved. Printed in the United States of America. Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information contact our corporate/institutional sales department: 800-998-9938 or [email protected].
    [Show full text]
  • The Scientist :: Blast, Aug. 29, 2005 09/18/2005 04:42 PM
    The Scientist :: Blast, Aug. 29, 2005 09/18/2005 04:42 PM Volume 19 | Issue 16 | Page 21 | Aug. 29, 2005 Previous | Issue Contents | Next FEATURE | SEVEN TECHNOLOGIES How 90,000 lines of code helped spark the bioinformatics explosion By Anne Harding You've just cloned and sequenced a gene, but you don't know what it does. Now Sponsored by: what do you do? In the absence of functional clues, it's hard to know where to start. One approach is to ask what other known sequences are similar to yours, thereby inferring function from homology. Each weekday, some 200,000 or so researchers do just that, asking a server at the National Center for Biotechnology Information (NCBI) in Bethesda, Md., to compare their particular sequence against GenBank, a DNA database that, at the end of 2004, held more than 40 million sequences totaling 44.5 billion nucleotides. The NCBI devotes 158 two-processor computers to those queries, 75% of which return within 22 seconds. The software these servers use, a sturdy 15-year-old program known as the Basic Local Alignment Search Tool, or BLAST, remains, for many, bioinformatics' "killer app." It wasn't the first DNA database search tool, but it was fast, and it provided metrics to assess the significance of the matches it found--all in 90,000 lines of C code. "The fact that every biologist has been using BLAST tells everything," says Jin Billy Li of the Washington University Genome Sequencing Center in St. Louis, who has used BLASTP (a protein homology tool) to identify flagellar genes in several species, including the human gene that causes Bardet-Biedl syndrome, a ciliation disorder.
    [Show full text]
  • Open Thesisformatted Final.Pdf
    The Pennsylvania State University The Graduate School The Huck Institutes of the Life Sciences COMPUTATIONAL APPROACHES TO PREDICT PHENOTYPE DIFFERENCES IN POPULATIONS FROM HIGH-THROUGHPUT SEQUENCING DATA A Dissertation in Integrative Biosciences in Bioinformatics and Genomics by Oscar Camilo Bedoya Reina 2014 Oscar Camilo Bedoya Reina Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy May 2014 i The dissertation of Oscar Camilo Bedoya Reina was reviewed and approved* by the following: Webb Miller Professor of Biology and Computer Science and Engineering Dissertation Advisor Chair of Committee Ross Hardison T. Ming Chu Professor of Biochemistry and Molecular Biology George Perry Assistant Professor of Anthropology and Biology Kamesh Madduri Assistant Professor of Computer Science and Engineering Peter Hudson Willaman Professor of Biology Head of the Huck Institutes of the Life Sciences *Signatures are on file in the Graduate School iii ABSTRACT High-throughput sequencing technologies are changing the world. They are revolutionizing the life sciences and will be the foundation of a promising century of innovations. In recent years, the development of new sequencing technologies has dramatically decreased the cost of genome sequencing. Less than twenty years ago, sequencing the human genome cost 3 billion dollars, and took about a decade to complete. Today, high-quality 30X full-genome coverage can be obtained in just one day for US$ 5,000, while sequencing just the ~21,000 human genes to the same depth costs only about US$ 500. The latter is sufficient for detecting most of the rare variants, along with other sources of genetic variability such as indels, copy- number variations, and inversions that are characteristic of complex diseases.
    [Show full text]
  • Msc THESIS Genetic Sequence Alignment on a Supercomputing Platform
    Computer Engineering 2011 Mekelweg 4, 2628 CD Delft The Netherlands http://ce.et.tudelft.nl/ MSc THESIS Genetic sequence alignment on a supercomputing platform Erik Vermij Abstract Genetic sequence alignment is an important tool for researchers. It lets them see the differences and similarities between two genetic sequences. This is used in several fields, like homology research, auto immune disease research and protein shape estimation. There are various algorithms that can perform this task and several hard- ware platforms suitable to deliver the necessary computation power. CE-MS-2011-02 Given the large volume of the datasets used, throughput is nowadays the major bottleneck in sequence alignment. In this thesis we discuss some of the existing solutions for high throughput genetic sequence alignment and present a new one. Our solution implements the well known Smith-Waterman optimal local alignment algorithm on the HC-1 hybrid supercomputer from Convey Computer. This platform features four FPGAs which can be used to accelerate the problem in question. The FPGAs, and the CPU that controls them, live in the same virtual memory space and share one large memory. We developed a hardware description for the FPGAs and a software program for the CPU. Some focus points were: a sustainable peak performance, being able to align sequences of any length, FPGA area efficient computations and the cancellation of unnecessary workload. The result is a Smith-Waterman FPGA core that can run at 100% utilization for many alignments long. They are packed per six on a FPGA running on 150 MHz, which results in a full system performance of 460 GCUPS (billion elementary operations per second).
    [Show full text]
  • ENCODE Analysis Working Group and Data Analysis Centre Rick Myers
    ENCODE Analysis Working Group and Data Analysis Centre Rick Myers Ewan Birney Motivation for mandated DAC y Genesis from the experience of the pilot project y Everyone looking at the ceiling when a key piece of annoying analysis needs to happen y A set of people who are funded to ensure that critical integrative analysis occurs (consistently and timely) y In no way exclusive y Everyone is invited in analysis y DAC should fit around things which are happening at the consortium level y Porous (no distinction expected between DAC members and other consortium members) except… y …the cleaning of the Aegean stables moment (eg, creating repeat libraries, consistently remapping everyone’s chip-seq data) y Interplay with DCC deliberate (trade off where things occur) y When there are too many things on the DAC to-do list - ask AWG to prioritise. AWG Participates in Rick Myers discussion Chair of AWG Birney BickelBickel Project Manager Haussler EBI (Ian Dunham) Bickel Directed Analysis Methods development EBI UCSC Yale BU EBI UCSC Yale BU U. Wash Penn Berkeley U. Wash Penn Berkeley DAC - federated, embedded y Ewan Birney/Paul Flicek/Ian Dunham (EBI)- comparative genomics, short read technology methods y Mark Gerstein (Yale) - chip-seq, link to genes/transcripts, link to modENCODE, P y Zhiping Weng (BU) - chip-chip, chip-seq, motif finding, bayesian analysis y Ross Hardison/Webb Miller (PSU) - comparative genomics, regulatory regions y Jim Kent/David Haussler (UCSC) - comparative genomics, DCC y Peter Bickel (UC Berkeley) - statistician y Bill Nobel (UW) - machine learning - HMMs, change point analysis, wavelets, SVMs New analysis tasks from AWG or community Results Provided Triage and Back to AWG Initial prioritisation Converting Priortisation Active ad hoc of all projects tasks analysis to by AWG handled pipelines by EDAC AWG prioritisation EDAC suggest pipelining tasks Experimental Data exploration, DCC group, in house Normalisation, coordination methods Sanity checking Feedback to AWG and expt.
    [Show full text]