Blast2go User Manual
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
File Formats Exercises
Introduction to high-throughput sequencing file formats – advanced exercises Daniel Vodák (Bioinformatics Core Facility, The Norwegian Radium Hospital) ( [email protected] ) All example files are located in directory “/data/file_formats/data”. You can set your current directory to that path (“cd /data/file_formats/data”) Exercises: a) FASTA format File “Uniprot_Mammalia.fasta” contains a collection of mammalian protein sequences obtained from the UniProt database ( http://www.uniprot.org/uniprot ). 1) Count the number of lines in the file. 2) Count the number of header lines (i.e. the number of sequences) in the file. 3) Count the number of header lines with “Homo” in the organism/species name. 4) Count the number of header lines with “Homo sapiens” in the organism/species name. 5) Display the header lines which have “Homo” in the organism/species names, but only such that do not have “Homo sapiens” there. b) FASTQ format File NKTCL_31T_1M_R1.fq contains a reduced collection of tumor read sequences obtained from The Sequence Read Archive ( http://www.ncbi.nlm.nih.gov/sra ). 1) Count the number of lines in the file. 2) Count the number of lines beginning with the “at” symbol (“@”). 3) How many reads are there in the file? c) SAM format File NKTCL_31T_1M.sam contains alignments of pair-end reads stored in files NKTCL_31T_1M_R1.fq and NKTCL_31T_1M_R2.fq. 1) Which program was used for the alignment? 2) How many header lines are there in the file? 3) How many non-header lines are there in the file? 4) What do the non-header lines represent? 5) Reads from how many template sequences have been used in the alignment process? c) BED format File NKTCL_31T_1M.bed holds information about alignment locations stored in file NKTCL_31T_1M.sam. -
To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari
UNIT 1.11 Using The Arabidopsis Information Resource (TAIR) to Find Information About Arabidopsis Genes Leonore Reiser1, Shabari Subramaniam1, Donghui Li1, and Eva Huala1 1Phoenix Bioinformatics, Redwood City, CA USA ABSTRACT The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource of Arabidopsis biology for plant scientists. TAIR curates and integrates information about genes, proteins, gene function, orthologs gene expression, mutant phenotypes, biological materials such as clones and seed stocks, genetic markers, genetic and physical maps, genome organization, images of mutant plants, protein sub-cellular localizations, publications, and the research community. The various data types are extensively interconnected and can be accessed through a variety of Web-based search and display tools. This unit primarily focuses on some basic methods for searching, browsing, visualizing, and analyzing information about Arabidopsis genes and genome, Additionally we describe how members of the community can share data using TAIR’s Online Annotation Submission Tool (TOAST), in order to make their published research more accessible and visible. Keywords: Arabidopsis ● databases ● bioinformatics ● data mining ● genomics INTRODUCTION The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource for the biology of Arabidopsis thaliana (Huala et al., 2001; Garcia-Hernandez et al., 2002; Rhee et al., 2003; Weems et al., 2004; Swarbreck et al., 2008, Lamesch, et al., 2010, Berardini et al., 2016). The TAIR database contains information about genes, proteins, gene expression, mutant phenotypes, germplasms, clones, genetic markers, genetic and physical maps, genome organization, publications, and the research community. In addition, seed and DNA stocks from the Arabidopsis Biological Resource Center (ABRC; Scholl et al., 2003) are integrated with genomic data, and can be ordered through TAIR. -
Downloaded in August 2017; 209326 Sequences)
microorganisms Article Label-Free Quantitative Proteomics Analysis in Susceptible and Resistant Brassica napus Cultivars Infected with Xanthomonas campestris pv. campestris Md Tabibul Islam 1,2, Bok-Rye Lee 1,3 , Van Hien La 1, Dong-Won Bae 4, Woo-Jin Jung 5 and Tae-Hwan Kim 1,* 1 Department of Animal Science, Institute of Agricultural Science and Technology, College of Agriculture & Life Science, Chonnam National University, Gwangju 61186, Korea; [email protected] (M.T.I.); [email protected] (B.-R.L.); [email protected] (V.H.L.) 2 Alson H. Smith Jr. Agricultural Research and Extension Center, School of Plant and Environmental Sciences, Virginia Tech, Winchester, VA 22602, USA 3 Asian Pear Research Institute, Chonnam National University, Gwangju 61186, Korea 4 Central Instrument Facility, Gyeongsang National University, Jinju 52828, Korea; [email protected] 5 Division of Applied Bioscience and Biotechnology, Institute of Environmentally Friendly Agriculture (IEFA), College of Agriculture and Life Science, Chonnam National University, Gwangju 61186, Korea; [email protected] * Correspondence: [email protected]; Tel.: +82-62-530-2126 Abstract: Black rot, caused by Xanthomonas campestris pv. campestris (Xcc), is the main disease of cruciferous vegetables. To characterize the resistance mechanism in the Brassica napus–Xcc pathosys- tem, Xcc-responsive proteins in susceptible (cv. Mosa) and resistant (cv. Capitol) cultivars were investigated using gel-free quantitative proteomics and analysis of gene expression. This allowed us to identify 158 and 163 differentially expressed proteins following Xcc infection in cv. Mosa and cv. Citation: Islam, M.T.; Lee, B.-R.; Capitol, respectively, and to classify them into five major categories including antioxidative systems, La, V.H.; Bae, D.-W.; Jung, W.-J.; proteolysis, photosynthesis, redox, and innate immunity. -
PANNZER 2: Annotate a Complete Proteome in Minutes!
PANNZER 2: Annotate a complete proteome in minutes! Alan J Medlar#, Petri Törönen#, Elaine Zosa, Liisa Holm* Structural Genomics Group, Institute of Biotechnology PO Box 56, 00014 University of Helsinki, Finland #These authors contributed equally *To whom correspondence should be addressed: [email protected] 1. INTRODUCTION As high-throughput sequencing has become increasingly efficient, downstream analysis has become the main bottleneck in genome sequencing projects. For example, a critical and time-consuming step is the annotation of the organism’s proteome: assigning molecular functions, involvement in biological processes and the cellular localizations of all identified protein sequences. The process of protein annotation involves classifying each protein using the Gene Ontology (GO) and selecting suitable free text descriptions, which are required for submission into sequence databases. We refer to these as GO and DE predictions, respectively. Although servers are available that perform functional annotation, none perform both GO and DE predictions concurrently, they tend to be slow and are restricted in the number of queries that can be submitted. Finally, many of the existing methods do not return an estimate of the prediction accuracy. 2. METHODS AND RESULTS The above shortcomings are remedied by PANNZER 2, an interactive web server and standalone program for protein function prediction. It is, to our knowledge, the only tool that performs both DE and GO prediction. It uses SANS-parallel [1], a high-throughput sequence search program that is thousands of times faster than BLAST, to identify homologous sequences. It is, therefore, capable of analyzing tens of thousands of proteins interactively. Results are displayed as an HTML table summarizing both DE and GO predictions, including a statistical estimate for the reliability of each prediction. -
EMBL-EBI Powerpoint Presentation
Processing data from high-throughput sequencing experiments Simon Anders Use-cases for HTS · de-novo sequencing and assembly of small genomes · transcriptome analysis (RNA-Seq, sRNA-Seq, ...) • identifying transcripted regions • expression profiling · Resequencing to find genetic polymorphisms: • SNPs, micro-indels • CNVs · ChIP-Seq, nucleosome positions, etc. · DNA methylation studies (after bisulfite treatment) · environmental sampling (metagenomics) · reading bar codes Use cases for HTS: Bioinformatics challenges Established procedures may not be suitable. New algorithms are required for · assembly · alignment · statistical tests (counting statistics) · visualization · segmentation · ... Where does Bioconductor come in? Several steps: · Processing of the images and determining of the read sequencest • typically done by core facility with software from the manufacturer of the sequencing machine · Aligning the reads to a reference genome (or assembling the reads into a new genome) • Done with community-developed stand-alone tools. · Downstream statistical analyis. • Write your own scripts with the help of Bioconductor infrastructure. Solexa standard workflow SolexaPipeline · "Firecrest": Identifying clusters ⇨ typically 15..20 mio good clusters per lane · "Bustard": Base calling ⇨ sequence for each cluster, with Phred-like scores · "Eland": Aligning to reference Firecrest output Large tab-separated text files with one row per identified cluster, specifying · lane index and tile index · x and y coordinates of cluster on tile · for each -
Sequence Alignment/Map Format Specification
Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 3 Jun 2021 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version 53752fa from that repository, last modified on the date shown above. 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with `@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information. This specification is for version 1.6 of the SAM and BAM formats. Each SAM and BAMfilemay optionally specify the version being used via the @HD VN tag. For full version history see Appendix B. Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII 1 in using the POSIX / C locale. Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax. 1.1 An example Suppose we have the following alignment with bases in lowercase clipped from the alignment. Read r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment. Coor 12345678901234 5678901234567890123456789012345 ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGGCAT The corresponding SAM format is:2 1Charset ANSI X3.4-1968 as defined in RFC1345. -
Homology & Alignment
Protein Bioinformatics Johns Hopkins Bloomberg School of Public Health 260.655 Thursday, April 1, 2010 Jonathan Pevsner Outline for today 1. Homology and pairwise alignment 2. BLAST 3. Multiple sequence alignment 4. Phylogeny and evolution Learning objectives: homology & alignment 1. You should know the definitions of homologs, orthologs, and paralogs 2. You should know how to determine whether two genes (or proteins) are homologous 3. You should know what a scoring matrix is 4. You should know how alignments are performed 5. You should know how to align two sequences using the BLAST tool at NCBI 1 Pairwise sequence alignment is the most fundamental operation of bioinformatics • It is used to decide if two proteins (or genes) are related structurally or functionally • It is used to identify domains or motifs that are shared between proteins • It is the basis of BLAST searching (next topic) • It is used in the analysis of genomes myoglobin Beta globin (NP_005359) (NP_000509) 2MM1 2HHB Page 49 Pairwise alignment: protein sequences can be more informative than DNA • protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties • codons are degenerate: changes in the third position often do not alter the amino acid that is specified • protein sequences offer a longer “look-back” time • DNA sequences can be translated into protein, and then used in pairwise alignments 2 Find BLAST from the home page of NCBI and select protein BLAST… Page 52 Choose align two or more sequences… Page 52 Enter the two sequences (as accession numbers or in the fasta format) and click BLAST. -
Assembly Exercise
Assembly Exercise Turning reads into genomes Where we are • 13:30-14:00 – Primer Design to Amplify Microbial Genomes for Sequencing • 14:00-14:15 – Primer Design Exercise • 14:15-14:45 – Molecular Barcoding to Allow Multiplexed NGS • 14:45-15:15 – Processing NGS Data – de novo and mapping assembly • 15:15-15:30 – Break • 15:30-15:45 – Assembly Exercise • 15:45-16:15 – Annotation • 16:15-16:30 – Annotation Exercise • 16:30-17:00 – Submitting Data to GenBank Log onto ILRI cluster • Log in to HPC using ILRI instructions • NOTE: All the commands here are also in the file - assembly_hands_on_steps.txt • If you are like me, it may be easier to cut and paste Linux commands from this file instead of typing them in from the slides Start an interactive session on larger servers • The interactive command will start a session on a server better equipped to do genome assembly $ interactive • Switch to csh (I use some csh features) $ csh • Set up Newbler software that will be used $ module load 454 A norovirus sample sequenced on both 454 and Illumina • The vendors use different file formats unknown_norovirus_454.GACT.sff unknown_norovirus_illumina.fastq • I have converted these files to additional formats for use with the assembly tools unknown_norovirus_454_convert.fasta unknown_norovirus_454_convert.fastq unknown_norovirus_illumina_convert.fasta Set up and run the Newbler de novo assembler • Create a new de novo assembly project $ newAssembly de_novo_assembly • Add read data to the project $ addRun de_novo_assembly unknown_norovirus_454.GACT.sff -
Pan-Genome-Wide Analysis of Pantoea Ananatis Identified Genes Linked to Pathogenicity In
bioRxiv preprint doi: https://doi.org/10.1101/2020.07.24.219337; this version posted July 24, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Pan-genome-wide analysis of Pantoea ananatis identified genes linked to pathogenicity in onion Gaurav Agarwal1*, Divya Choudhary1, Shaun P. Stice2, Brendon K. Myers1, Ronald D. Gitaitis1, Stephanus N. Venter3, Brian H. Kvitko2, Bhabesh Dutta1* 1Department of Plant Pathology, Coastal Plain Experimental Station, University of Georgia, Tifton, Georgia, USA 2Department of Plant Pathology, University of Georgia, Athens, Georgia, USA 3Department of Biochemistry, Genetics and Microbiology, Forestry and Agricultural Biotechnology Institute, University of Pretoria, Pretoria, South Africa *Corresponding authors: [email protected] (BD); [email protected] (GA) 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.07.24.219337; this version posted July 24, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Abstract Pantoea ananatis is a member of a Pantoea spp. complex that causes center rot of onion, which significantly affects onion yield and quality. This pathogen does not have typical virulence factors like type II or type III secretion systems but appears to require a biosynthetic gene-cluster, HiVir/PASVIL (located chromosomally), for a phosphonate secondary metabolite, and the onion- virulence regions, OVR (localized on a megaplasmid), for onion pathogenicity and virulence, respectively. -
Workflows for Rapid Functional Annotation of Diverse
insects Article Workflows for Rapid Functional Annotation of Diverse Arthropod Genomes Surya Saha 1,2 , Amanda M. Cooksey 2,3, Anna K. Childers 4 , Monica F. Poelchau 5 and Fiona M. McCarthy 2,* 1 Boyce Thompson Institute, 533 Tower Rd., Ithaca, NY 14853, USA; [email protected] 2 School of Animal and Comparative Biomedical Sciences, University of Arizona, 1117 E. Lowell St., Tucson, AZ 85721, USA; [email protected] 3 CyVerse, BioScience Research Laboratories, University of Arizona, 1230 N. Cherry Ave., Tucson, AZ 85721, USA 4 Bee Research Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, USDA, 10300 Baltimore Ave., Beltsville, MD 20705, USA; [email protected] 5 National Agricultural Library, Agricultural Research Service, USDA, 10301 Baltimore Ave., Beltsville, MD 20705, USA; [email protected] * Correspondence: fi[email protected] Simple Summary: Genomic technologies are accumulating information about genes faster than ever before, and sequencing initiatives, such as the Earth BioGenome Project, i5k, and Ag100Pest Initiative, are expected to increase this rate of acquisition. However, if genomic sequencing is to be used for the improvement of human health, agriculture, and our understanding of biological systems, it is necessary to identify genes and understand how they contribute to biological outcomes. While there are several well-established workflows for assembling genomic sequences and identifying genes, understanding gene function is essential to create actionable knowledge. Moreover, this functional annotation process must be easily accessible and provide information at a genomic scale to keep up Citation: Saha, S.; Cooksey, A.M.; with new sequence data. We report a well-defined workflow for rapid functional annotation of whole Childers, A.K.; Poelchau, M.F.; proteomes to produce Gene Ontology and pathways information. -
How the Energy Sensor, AMPK, Impact the Testicular Function Pascal Froment, Mélanie Faure, Edith Guibert, Jean-Pierre Brillard
How the energy sensor, AMPK, impact the testicular function Pascal Froment, Mélanie Faure, Edith Guibert, Jean-Pierre Brillard To cite this version: Pascal Froment, Mélanie Faure, Edith Guibert, Jean-Pierre Brillard. How the energy sensor, AMPK, impact the testicular function. 27. Conference of European Comparative Endocrinologists, Aug 2014, Rennes, France. 2014. hal-02742016 HAL Id: hal-02742016 https://hal.inrae.fr/hal-02742016 Submitted on 3 Jun 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Copyright 27th Conference of European Comparative Endocrinologists CECE 2014 25-29 August 2014 Rennes, France 3 27th Conference of European Comparative Endocrinologists Organized with the generous support and help of our sponsors Université de Rennes 1 European Society for Comparative Endocrinology (Grants) European Union INTEREG TC2N Rennes Métropole European Society of Endocrinology (Grants) Institut National de la Recherche Agronomique Société de Neuroendocrinologie (Grants) Institut National de l'Environnement Industriel et des Risques !"#$%$&$'()'*)+,)*+,)'#&*'-.'#."$/ -
Introduction to Bioinformatics (Elective) – SBB1609
SCHOOL OF BIO AND CHEMICAL ENGINEERING DEPARTMENT OF BIOTECHNOLOGY Unit 1 – Introduction to Bioinformatics (Elective) – SBB1609 1 I HISTORY OF BIOINFORMATICS Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biologicaldata. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. Bioinformatics derives knowledge from computer analysis of biological data. These can consist of the information stored in the genetic code, but also experimental results from various sources, patient statistics, and scientific literature. Research in bioinformatics includes method development for storage, retrieval, and analysis of the data. Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has many practical applications in different areas of biology and medicine. Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems. "Classical" bioinformatics: "The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.” The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline.