A Tool to Sanity Check and If Needed Reformat FASTA Files


bioRxiv preprint doi: https://doi.org/10.1101/024448; this version posted August 21, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

RESEARCH

Fasta-O-Matic: a tool to sanity check and if needed reformat FASTA files

Jennifer M Shelton1 and Susan J Brown1*

*Correspondence: [email protected]
1KSU/K-INBRE Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS, USA
Full list of author information is available at the end of the article

Abstract

Background: As the sheer volume of bioinformatic sequence data increases, the only way to take advantage of this content is to more completely automate robust analysis workflows. Analysis bottlenecks are often mundane and overlooked processing steps. Idiosyncrasies in reading and/or writing bioinformatics file formats can halt or impair analysis workflows by interfering with the transfer of data from one informatics tool to another.

Results: Fasta-O-Matic automates handling of common but minor format issues that otherwise may halt pipelines. The need for automation must be balanced by the need for manual confirmation that any formatting error is actually minor rather than indicative of a corrupt data file. To that end Fasta-O-Matic reports any issues detected to the user with optionally color coded and quiet or verbose logs.

Fasta-O-Matic can be used as a general pre-processing tool in bioinformatics workflows (e.g. to automatically wrap FASTA files so that they can be read by BioPerl).
It was also developed as a sanity check for bioinformatic core facilities that tend to repeat common analysis steps on FASTA files received from disparate sources. Fasta-O-Matic can be set with format requirements specific to downstream tools as a first step in a larger analysis workflow.

Availability: Fasta-O-Matic is available free of charge to academic and non-profit institutions on GitHub.

Keywords: FASTA; sequence data; bioinformatics file format

Background

Sequence data can be stored as text with each letter representing a nucleic acid (DNA and RNA) or amino acid (protein). The linear nature of these molecules makes it natural to represent them as strings, finite sequences of characters. Although it has been argued that a graph, a network of vertices connected by edges, is a more accurate way to store genomic sequences because graphs allow the inclusion of alternate alleles and alternate possible assemblies [1], all of the most common methods for storing sequences (FASTA, FASTQ, SAM/BAM) use linear strings.

Other decisions about how to represent sequence data can be more arbitrary. For example, any character that is not used as a base or an amino acid could be used to indicate the beginning of a new sequence. Additionally, text could be wrapped to limit the information content in any one line of a file. The advantage of wrapping text is that some programs can then be designed to work one line at a time, limiting the burden of each step (e.g. the program would never have to process an entire chromosome of sequence data in a single step). The disadvantage is that code must be slightly more complex to load an entire sequence record into the working memory.

0.1 FASTA file format specifications versus recommendations

The popular FASTA file format stores sequence records and has very minimal format requirements [2]. Each sequence is preceded by a header/description line that begins with a > symbol. Sequence lines can include any standard International Union of Pure and Applied Chemistry (IUPAC) single character symbols for nucleic acids or amino acids or the ambiguous codes that indicate possible residues or bases [3]. They can also include - to indicate alignment gaps and * to indicate stop codons.

NCBI recommends wrapping FASTA file sequence lines [2]. It is also common practice to use the first 'word' in a header (i.e. any character string to the left of the first space in the header) as the unique sequence id. Although these features are common they are not required, leading to format compatibility issues with tools that treat these conventions as required.

0.2 Customizing FASTA files to ensure that information is properly interpreted by downstream tools

Regardless of whether a FASTA file is technically improperly formatted or its format merely violates a popular convention, it is critical to quality analysis workflows that data is converted into a format that will be correctly interpreted by downstream tools. Formatting issues can fall into multiple categories including actual format errors and formats that are not technically wrong but are non-standard, causing some tools to throw an error.

Some format errors indicate a major problem like an attempt to use the wrong data format (e.g. the first line is not a FASTA header because it does not begin with a > character).
These types of errors will be subsequently referred to as fatal. Alternatively, some formatting issues occur commonly without indicating the FASTA file is corrupt (e.g. improperly wrapped/unwrapped sequence lines, missing final new line characters, unusual new line characters like \r). These issues will be referred to as non-fatal. Fatal formatting issues should cause processing to stop. Non-fatal formatting issues should be automatically corrected according to the most common resolution for this type of error. While downstream processing continues, the analyst can double check the automated decision to reformat non-fatal issues. This way the workflow would not be slowed for trivial reformatting steps and the rarer problems (e.g. when a missing last new line was caused by incomplete file transfer) could still be caught.

0.3 Existing tools

Existing bioinformatics tools address FASTA format inconsistencies. However, many tools either halt and exit with an error (e.g. BioPerl [4], [5], [6]) or can produce reformatted output FASTA but cannot determine if there is a formatting issue to begin with (e.g. EMBOSS Seqret [7]).

The BioPerl module Bio::DB::Fasta will halt if a FASTA is inconsistently wrapped or if a line of sequence is too long (as in an unwrapped genome FASTA). This has the disadvantage of requiring human intervention to wrap and restart analysis.

Code:
#!/usr/bin/perl
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::Fasta; #makes a searchable db from FASTA file

my $out_file_temp = '/home/bionano/test_db/all.fa';
#Create new FASTA outfile object
my $seq_out = Bio::SeqIO->new('-file' => ">$out_file_temp",'-format' => 'fasta');
#Load FASTA file as DB
my $db = Bio::DB::Fasta->new("/home/bionano/test_db/miswrapped.fa");
my $seq_obj = $db->get_Seq_by_id('seq'); # get FASTA records using headers
#(where header = first 'word' so really header whitespace should also be
#removed for this file)
$seq_out->write_seq($seq_obj);

Input:
>seq 1
ACTGTGTGCAATCGCTGNNNNCTCTCATCGGATCTTGCAATCGCTNNNCTCTCATCGGATTGCAATCGCTNNNCTtcatcCGGAT
CGCTGNNNNCTGTGTGCAATCGCTGNNNNCTCCTGATCGCTGNNNNCTGTGTGCAATCGCTGNNNNCTCCTGCAATCGCTGNNNN
CTCCTGTTCGNATCGatcctctgtttatgcttatagctagctgatcgtagnnntcaacgt
CTAGAGCGCAGCTCTGGGGGATTACTACTCACTACATCATTAGATCAGATacgactcann
>seq 2
cttatagctagctgatAATCGCTGNNTCATCGGATCTTGCCTTGCAATCGtcatcCGtcC
CGCTGNNNNCTGTGTGCAnnnnnnnnnnncgtaaaacgcctcctccgactcgTCTCTAGG
CTAGAGCGCAGCTCTGGGGGATTACTACTCACTACATCATTAGATCAGATacgactcann
nnnctacgCTATCAGGTCTCGAG
>seq 3
ATCAGCGCTCTATATGGCTCTGATTATAGTTTGCATTCATATGCTGATCTTctcagnntc
cttgacgctcgctATCTGTAGATCTGTACTtcagacagctcTCAGCAGNNNCTCAGCAGC
CTACGACAGTcatgcagactagcagt

Output:
------------- EXCEPTION -------------
MSG: Each line of the fasta entry must be the same length except the last.
Line above #5 'CTAGAGCGCAGCTCTGGGGG..' is 61 != 86 chars...

EMBOSS seqret was designed as a very flexible tool to convert from one properly formatted file to another properly but distinctly formatted file. It also was designed to accept poorly formatted data (e.g. a FASTA missing the final new line that is improperly wrapped) and export a reformatted file (e.g. wrapped after 60 bases with a final new line).
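
Neither behaviour on its own covers both detection and repair. As a rough illustration of how the fatal versus non-fatal triage from section 0.2 could be combined with re-wrapping, the following Python sketch checks a FASTA string and then rewraps it. This is not the Fasta-O-Matic implementation; the function names (split_records, is_wrapped, triage, rewrap) are hypothetical, and the default width of 60 is an assumption taken from the seqret example of wrapping after 60 bases.

```python
def split_records(text):
    """Yield (header, sequence_lines) pairs from FASTA text.

    str.splitlines() also splits on \r and \r\n, so unusual
    newline characters do not leak into the sequence lines.
    """
    header, lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif line.strip():            # skip blank lines
            lines.append(line.strip())
    if header is not None:
        yield header, lines


def is_wrapped(lines, width):
    """True if every line but the last is exactly `width` long and the last fits."""
    if not lines:
        return True
    return all(len(l) == width for l in lines[:-1]) and len(lines[-1]) <= width


def triage(text, width=60):
    """Classify problems as fatal (stop processing) or non-fatal (fixable)."""
    if not text.startswith(">"):
        # Assumption: a missing '>' on the first line is treated as fatal.
        return {"fatal": "first line is not a FASTA header", "non_fatal": []}
    non_fatal = []
    if "\r" in text:
        non_fatal.append("carriage-return line endings")
    if not text.endswith(("\n", "\r")):
        non_fatal.append("missing final newline")
    for header, lines in split_records(text):
        if not is_wrapped(lines, width):
            non_fatal.append("inconsistent wrapping in record " + header)
    return {"fatal": None, "non_fatal": non_fatal}


def rewrap(text, width=60):
    """Return the FASTA text rewrapped at `width` with \n endings and a final newline."""
    out = []
    for header, lines in split_records(text):
        seq = "".join(lines)
        out.append(header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = ">my header\nAAAATTTT\nCC\nGGGGCCCCAAAA"
    report = triage(sample, width=8)
    if report["fatal"]:
        raise SystemExit(report["fatal"])
    for issue in report["non_fatal"]:
        print("non-fatal:", issue)     # logged so the analyst can double check
    print(rewrap(sample, width=8), end="")
```

Reporting the non-fatal issues before rewriting keeps the analyst able to confirm that a missing final newline or odd wrapping was trivial rather than a sign of a truncated file.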

Code:
seqret -stdout -sequence test.fa -outseq test_reformat.fa

Input:
>my header
AAAAAAAAAAAATTTTTTCCCCGGCGCGCGCGCTATAGCGCTATANNNNNNNNNNNNNNN
ATATATATATAT
ATTATTATATATATATTCTCTCTGGGCTCGCGTCTCGCTATTTATATATATATATATATTGCGCTCTCGTCTCCT

Output: