Identifying Disease Genes

Genomics for today

• Cancer genomics
• Reproductive health
• Forensic genomics
• Agrigenomics
• Complex disease genomics
• Microbial genomics
• Genomics in drug development
• and more …omics

Data/File formats

• A file format is a standardized way of encoding data for storage in a computer file.
• Formats matter for storage, access, sharing, interpretation, security, etc.

http://en.wikipedia.org/wiki/Data_format
Bioinformatics for dummies: http://www.dummies.com/how-to/content/bioinformatics-data-formats.html

Scientific data formats

• 23andMe
• AB1 (Chromatogram files used by DNA sequencing instruments from Applied Biosystems)
• ABCD (Access to Biological Collection Data)
• ABCDDNA (Access to Biological Collection Data DNA extension)
• ABCDEFG (Access to Biological Collection Data Extension For Geosciences)
• ACE (Sequence assembly format)
• Affymetrix Raw Intensity Format
• ARLEQUIN Project Format
• Axt Alignment Format
• BAM (Binary compressed SAM format)
• BED (Browser Extensible Data format describing genes and other features of DNA sequences)
• BEDgraph
• Big Browser Extensible Data Format
• Big Wiggle Format
• Binary Alignment Map Format
• Binary Probe Map Format
• Binary sequence information Format
• Biological Pathway eXchange
• BLAT alignment Format
• BRIX generated O Format
• CAF (Common Assembly Format for sequence assembly)
• CellML
• CHADO XML interchange Format
• Chain Format for pairwise alignment
• CHARMM Card File Format
• CLUSTAL-W Alignment Format
• CLUSTAL-W Dendrogram Guide File Format
• Clustered Data Table Format
• Complete Genomics
• DELTA (DEscription Language for TAxonomy)
• DAS (Distributed Sequence Annotation System)
• DBN (Dot Bracket Notation, Vienna Format)
• EMBL (Flatfile format used by the EMBL for nucleotide and peptide sequences)
• EML (Environmental Markup Language), not to be confused with EML (Ecological Metadata Language)
• ENCODE (Peak information Format)
• FASTA and FASTQ (File formats for sequence data; FASTQ adds quality scores)
• FuGEFlow
• FuGE-ML (Functional Genomics Experiment Markup Language)
• Gating-ML
• GCDML (Genomic Contextual Data Markup Language)
• GelML (Gel electrophoresis Markup Language)
• GenBank (Flatfile format used by NCBI for nucleotide and peptide sequences)
• Gene Feature File (Versions 1 and 3)
• GFF (General feature format for describing genes and other features of DNA, RNA and protein sequences)
• Gene Prediction File Format
• GenePattern GeneSet Table Format
• Genome Annotation File (versions 1 and 2)
• GTF (Gene transfer format; holds information about gene structure)
• HMMER
• ICB (ICM binary file Format)
• Image Cytometry Standard (ICS)
• imzML (imaging mz Markup Language)
• ISA-Tab (Investigation Study Assay Tabular)
• INSD sequence record XML
• KGML (KEGG Mark-up Language)
• MAGE-Tab (MicroArray Gene Expression Tabular)
• MCL (Microbiological Common Language)
• MIARE-TAB (Minimum Information About a RNAi Experiment Tabular)
• microarray track data Browser Extensible Data Format
• MINiML (MIAME Notation in Markup Language)
• mini Protein Data Bank Format
• MIQAS-TAB (Minimal Information for QTLs and Association Studies Tabular)
• MITAB
• mmCIF (macromolecular Crystallographic Information File)
• Multiple Alignment Format
• mzData (deprecated)
• mzIdentML
• mzML
• mzQuantML
• mzXML (deprecated)
• NCD (Natural Collections Descriptions)
• NDTF (Neurophysiology Data Translation Format)
• net alignment annotation Format
• NeuroML (Neuroscience eXtensible Markup Language)
• New Hampshire eXtended Format
• Newick tree Format
• NEXUS (Encodes mixed information about genetic sequence data in a block structured format)
• Nimblegen Design File Format
• Nimblegen Gene Data Format
• NMR-STAR (NMR Self-defining Text Archive and Retrieval format)
• nucleotide information binary Format
• ODM (Operational Data Model)
• Open Biomedical Ontology Flat File Format
• Personal Genome SNP Format
• PHD (Output from the basecalling software Phred)
• phyloXML (XML for evolutionary biology and comparative genomics)
• Pre-Clustering File Format
• Protein Data Bank (PDB; Structures of biomolecules deposited in Protein Data Bank)
• Protein Information Resource Format
• PRM (Protocol Representation Model, Medical Research)
• PSI-MI XML
• PSI-PAR
• RDML (Real-time PCR Data Markup Language)
• SAM (Sequence Alignment/Map format)
• SCF (Staden chromatogram files used to store data from DNA sequencing)
• SBML (Systems Biology Markup Language used to store biochemical network computational models)
• SDD (Structured Descriptive Data)
• SED-ML (Simulation Experiment Description Markup Language)
• Sequence Alignment Map Format
• SOFT (Simple Omnibus Format in Text)
• spML (Separation Markup Language)
• SRA-XML (Short Read Archive eXtensible Markup Language)
• Standard Flowgram Format
• Stockholm Multiple Alignment Format (Representing multiple sequence alignments)
• SBGN (Systems Biology Graphical Notation)
• SBRML (Systems Biology Results Markup Language)
• Swiss-Prot (Flatfile format used for protein sequences from the Swiss-Prot database)
• TAIR annotation data Format
• TAPIR (TDWG Access Protocol for Information Retrieval)
• TCS (Taxonomic Concept transfer Schema)
• TraML (Transition Markup Language)
• UniProtKB XML Format
• VCF (Variant Call Format)
• Wiggle Format

http://en.wikipedia.org/wiki/List_of_file_formats#Biology
http://fileformats.archiveteam.org/wiki/Scientific_Data_formats

FASTA format

A FASTA record is a description line starting with ">" followed by the sequence:

    >Description line
    Sequence

Example:

    >gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
    MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF

Sequence identifiers

    Database                          Identifier syntax
    GenBank                           gb|accession|locus
    EMBL Data Library                 emb|accession|locus
    DDBJ, DNA Database of Japan       dbj|accession|locus
    NBRF PIR                          pir||entry
    Protein Research Foundation       prf||name
    SWISS-PROT                        sp|accession|entry name
    Brookhaven Protein Data Bank      pdb|entry|chain
    Patents                           pat|country|number
    GenInfo Backbone Id               bbs|number
    General database identifier       gnl|database|identifier
    NCBI Reference Sequence           ref|accession|locus
    Local Sequence identifier         lcl|identifier

File extensions (extension: meaning, then notes)

• faa: FASTA amino acid. Contains amino acids. A multiple protein FASTA file can have the more specific extension mpfa.
• fasta (.fas): generic FASTA. Any generic FASTA file. Other extensions can be fa, seq, fsa.
• ffn: FASTA nucleotide coding regions. Contains coding regions for a genome.
• fna: FASTA nucleic acid. Used to generically specify nucleic acids.
• frn: FASTA non-coding RNA. Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA.

BioXSD: the common data-exchange format for everyday bioinformatics web services

• for basic bioinformatics data
• XML schema
• syntax for biological sequences, annotations, alignments, and references to resources

http://bioinformatics.oxfordjournals.org/content/26/18/i540.full

8. How to identify disease biomarkers
Balaji Rajashekar
20.11.14

Human

By mass, human cells consist of 65–90% water (H2O). Oxygen therefore contributes a majority of a human body's mass. Almost 99% of the mass of the human body is made up of the six elements oxygen, carbon, hydrogen, nitrogen, calcium, and phosphorus. About 0.75% is made up of another five elements: sodium, magnesium, potassium, sulfur, and chlorine. The remaining elements are trace elements. Note that not all elements found in the human body play a role in life.

Elemental composition

The average 70 kg adult human body contains approximately 6.7 × 10^27 atoms and is composed of 60 chemical elements. The elements needed for life are relatively common in the Earth's crust, and conversely most of the common elements are necessary for life. An exception is aluminium, which is the third most common element in the Earth's crust (after oxygen and silicon) but appears to serve no function in living cells.

Contents: Flesh, Blood,

http://en.wikipedia.org/wiki/Composition_of_the_human_body

Composition

The composition can also be expressed in terms of chemicals, such as:

• Water
• Proteins, including those of hair, connective tissue, etc.
• Fats (or lipids)
• Apatite in bones
• Carbohydrates such as glycogen and glucose
• DNA
• Dissolved inorganic ions such as sodium, potassium, chloride, bicarbonate, phosphate
• Gases such as oxygen, carbon dioxide, nitrogen oxide, hydrogen, carbon monoxide, methanethiol. These may be dissolved or present in the gases in the lungs or intestines.
• Many other small molecules, such as amino acids, fatty acids, nucleobases, nucleosides, nucleotides, vitamins, cofactors
• Free radicals such as superoxide, hydroxyl, and hydroperoxyl

Materials

Body composition can also be expressed in terms of various types of material, such as:

• Muscle
• Fat
• Bone and teeth
• Brain and nerves
• Connective tissue
• Blood (7% of body weight)
• Lymph
• Contents of digestive tract, including intestinal gas
• Urine
• Air in lungs

Composition by cell type

Many species of bacteria and other microorganisms live on or inside the healthy human body. By an often-cited estimate, 90% of the cells in (or on) a human body are microbes by number (much less by mass or volume), although more recent estimates put the ratio closer to 1:1. Some of these symbionts are necessary for our health. Those that neither help nor harm us are called commensal organisms.

Data and analysis

Formats: NetCDF

Population growth in cities in corresponding year:

    City          1900    1940    1970    2000
    Los Angeles   0.102   1.504   2.812   3.695
    Washington    0.279   0.663   0.757   0.572
    New York      3.437   7.455   7.896   8.008
    Seattle       0.081   0.368   0.531   0.563
    London        6.528   8.197   7.452   7.322

The same data as a NetCDF file in text (CDL) form:

    netcdf out {
    dimensions:
        __string = 11 ;
        n = 4 ;
        m = 5 ;
    variables:
        char empty(__string) ;
        int year(n) ;
        char city(m, __string) ;
        float population(m, n) ;

    // global attributes:
        :__str_len = 11 ;
    data:

     empty = "" ;

     year = 1900, 1940, 1970, 2000 ;

     city =
      "Los Angeles",
      "Washington",
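The relationship between the population table and the CDL structure in this section (a year dimension n, a city dimension m, and a two-dimensional population variable indexed by city and year) can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library: the helper name table_to_cdl is hypothetical, the string-length dimension is called strlen here rather than __string, the empty variable and the global attribute are left out, and the generated text is a reconstruction from the table values rather than the original file. Text of this general shape is what the netCDF ncgen utility takes as input.

```python
def table_to_cdl(name, years, cities, population):
    """Render a city-by-year table as CDL text (hypothetical helper)."""
    # Fixed string length: long enough for the longest city name.
    strlen = max(len(c) for c in cities)
    lines = [
        f"netcdf {name} {{",
        "dimensions:",
        f"\tstrlen = {strlen} ;",
        f"\tn = {len(years)} ;",    # number of years (columns)
        f"\tm = {len(cities)} ;",   # number of cities (rows)
        "variables:",
        "\tint year(n) ;",
        "\tchar city(m, strlen) ;",
        "\tfloat population(m, n) ;",
        "data:",
        " year = " + ", ".join(str(y) for y in years) + " ;",
        " city =\n  " + ",\n  ".join(f'"{c}"' for c in cities) + " ;",
    ]
    # One row of values per city, in the same order as the city variable.
    rows = [", ".join(f"{v:.3f}" for v in row) for row in population]
    lines.append(" population =\n  " + ",\n  ".join(rows) + " ;")
    lines.append("}")
    return "\n".join(lines)


# Values copied from the population table in this section.
years = [1900, 1940, 1970, 2000]
cities = ["Los Angeles", "Washington", "New York", "Seattle", "London"]
population = [
    [0.102, 1.504, 2.812, 3.695],
    [0.279, 0.663, 0.757, 0.572],
    [3.437, 7.455, 7.896, 8.008],
    [0.081, 0.368, 0.531, 0.563],
    [6.528, 8.197, 7.452, 7.322],
]

cdl = table_to_cdl("out", years, cities, population)
print(cdl)
```

Keeping the city names and the population rows in the same order is what lets population(i, j) be read as "city i in year j", which is the point of declaring the variable over the (m, n) dimensions.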