Three Data Delivery Cases for EMBL-EBI's Embassy

Guy Cochrane
www.ebi.ac.uk

EMBL European Bioinformatics Institute
• Genes, genomes & variation: European Nucleotide Archive, 1000 Genomes, Ensembl, Ensembl Genomes, Ensembl Plants, European Genome-phenome Archive, Metagenomics portal, GWAS Catalog browser
• Protein sequences: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
• Chemical biology: ChEMBL, ChEBI
• Literature & ontology: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples

Sequence data at EMBL-EBI
[Diagram labels: Sample/method, Read, Alignment, Assembly, Annotation]
• European Genome-phenome Archive
  - Controlled access data
  - Human data around molecular medicine
  - http://www.ebi.ac.uk/ega/
• European Nucleotide Archive
  - Unrestricted data
  - Pan-species and application
  - http://www.ebi.ac.uk/ena/
• Infrastructure provision
  - BBSRC: RNAcentral, MG Portal
  - MRC: 100k Genomes data implementation
  - EC: COMPARE, MicroB3, ESGI, BASIS
  - etc.

Challenges
• Data have high volume and grow rapidly
• Data are dynamic (continuous feed) and their application has urgency
• Users require arbitrary and ad hoc access

Tara Oceans
[Challenge: capacity]

Infectious disease
• Opportunity: A methodological revolution in clinical and public health towards shotgun sequencing-based methods
• Scientific power: Sequence harbours rich information
  - Diagnostic: identification, typing, resistance profiling, etc.
  - Public health: outbreak detection, response strategy, vaccine development
  - Mechanistic: host interactions, pathogenicity, virulence, transmission, antimicrobial resistance
• Informatics roles for EMBL:
  - COMPARE: Rapid global sharing of surveillance and outbreak data, systematic integrated analysis, compute provision (Embassy)
  - Standards for reporting, analysis and the communication of results
  - New algorithms and analysis methods
  - User interfaces for surveillance data reporting, across the domains
• COMPARE: recently launched Horizon 2020 project in which EMBL-EBI is informatics provider
• Global Microbial Identifier: Initiative with EMBL-EBI involvement supporting technologies, standards and data sharing for pathogen surveillance

COMPARE platform
[Challenge: urgency]
[Diagram. Column headings: Sources; Processes; Portals and environments. Labels: COMPARE Data Resource; COMPARE workflow engine; COMPARE Portal; COMPARE Registry; Public data; INSDC data exchange; Managed access data; Private data; Assembly & alignment; Annotation; Typing; Workflow integration; ‘Default’ tools; ‘Hosted tools’; Food workflow development; Clinical workflow development; Outbreak workflow development; API. Infrastructure layers: EBI infrastructure; Embassy infrastructure; DTU infrastructure; Embassy virtual domain.]

Personalised medicine
[Challenge: arbitrary access]
• Motivation: Personalised studies of variation, cancer mutation, epigenetics, regulation, expression require references for comparison and interpretation
• As part of GA4GH, EMBL-EBI is working on
  - Resources serving reference human genomic and transcriptomic data, including Google read API, variant ‘Beacons’, etc.
  - CRAM compression supporting greater data fluidity and APIs to allow direct computational access
  - Delivery and synchronisation of high volume datasets to local Embassy and remote cloud infrastructures
• Past and current FP7 projects include SLING, BASIS, ESGI

ENA conventional read data delivery
[Diagram labels: Conventional infrastructure (FTP, Aspera, GridFTP); ENA metadata; FIRE1; ENA data (NFS)]

ENA Embassy read data delivery
[Diagram labels: Conventional infrastructure (FTP, Aspera, GridFTP); ENA metadata; FIRE2; FUSE; ENA data (Cleversafe); HTTP]
[Second build adds: Embassy cloud infrastructure (VMWare -> OpenStack); Marine cache, Tara Oceans Embassy; Pathogen cache, COMPARE Embassy; CRAM cache, GA4GH Embassy]

ENA external read data delivery
…phase II

EMBL-EBI Embassy Cloud
Steven Newhouse
Head of Technical Services

The Challenge Facing EMBL-EBI
• Volume and variety of genomic data expanding
  - EMBL-EBI data doubling every year; replication is challenging
  - Infrastructure currently 50,000 CPUs & 60+PB
• Need to support complex analysis scenarios
  - Web and programmatic access to services (3M unique users)
  - Access to both public and managed access data sets
  - Bespoke workflows and tools across a variety of domains
• Hard for users to replicate data sets for local analysis
  - Use the ‘cloud’ to bring local analysis to EMBL-EBI data

EMBL-EBI Embassy Cloud
• Service hosted at EMBL-EBI data centres
  - Direct network access to public and managed data sets
  - Direct network access to public services
  - Expect both academic and commercial users
• Technical Implementation
  - Logically isolated outside EMBL-EBI’s LANs
  - Secure flexible infrastructure for both tenant and host
  - Resources exposed using VMware’s vCloud Director & OpenStack
  - Provide isolated IaaS clouds to multiple users

Why ‘Embassy’ Cloud?
• An embassy is sovereign territory in a host country
  - Host Country: EMBL-EBI Data Centre
  - Sovereign Territory: Host Country not allowed to enter
• Virtualisation provides the protection for ‘tenant’ and ‘host’
  - Host puts boundaries in place to protect it from the tenant
  - Tenant has freedom and control within those boundaries

Embassy Cloud Concept
[Diagram labels: EMBL-EBI Hardware; Virtualised Hardware; Public Data; Public Services; Managed Data; Embassy Cloud 1; Embassy Cloud 2; PanCancer; Embassy Cloud 3; Private Data]

User Benefits for the IaaS Model
• Tenant organisations get an empty virtual infrastructure
  - They establish their own virtual machines and networks
  - System administration performed by the tenant
  - EMBL-EBI staff have no access to the VMs
• Added value from EMBL-EBI over other clouds
  - Machines and data hosted in known jurisdiction
  - Direct network access to data sets (public & managed access)
  - Direct network access to public EMBL-EBI services

Benefits to EMBL-EBI of the IaaS Model
• A secure collaborative workspace
  - Work does not contend with main EMBL-EBI resources
  - Clearly define the committed IT resources and data
• Explore how to build more data focused analysis services
  - Move the analysis to where the big data is located
  - Learn from and inform other big data scientific communities

Embassy Cloud: Typical Uses
• Collaborative Environment
  - Neutral ground outside internal network
  - CTTV: Resources and VMs to host intranet, databases, …
• Data Staging
  - Undertake submission from local machine (following data staging) rather than from remote location
  - BRAEMBL: Remote submission unreliable due to file upload
• Data Analysis
  - Large scale management and analysis of data
  - PanCancer: 1,000 cores, 2.5 TB RAM, 1.0 PB HDD

Issues
• Object Store Storage Infrastructure
  - Essential for scalable high-performance storage
  - Applications need to adapt to flat model
  - Current caching strategy will have a limit
• Sharing resources between sites/communities/clouds
  - Adopt a standards based model for federating resources
  - Solutions for uploading and distributing VMs (+containers?)
  - Replicating large data sets to ‘attract’ workloads to a cloud

Gaps -> Activities -> Solutions?
• Data Set Replication
  - Strategic pre-positioning of data into clouds
  - Leverage JANET/GEANT, GridFTP + Globus Transfers, …
• Cloud federation for mobile computing
  - EGI has a federated cloud and VM distribution model
  - ELIXIR plans to build on existing infrastructure where possible
• Wide-area file access needed for collaborative data analysis
  - High performance wide-area object-store
  - Need access control for human related data
• Coordinated investment in infrastructure
  - Where is the UK coordination? What coordination is needed?
  - Integrating commercial resources where they add value
  - Integration with EU Infrastructure (ELIXIR)
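
The slides state that EMBL-EBI data double every year, that the infrastructure holds 60+PB, and that replication is challenging. A rough back-of-envelope sketch of why follows. The link speeds, the decimal petabyte, the assumption of a fully utilised link, and the function names are illustrative assumptions, not figures or code from the talk.

```python
# Back-of-envelope sketch; link speeds and names are hypothetical.
PB = 10**15  # bytes per petabyte (decimal convention, an assumption)


def transfer_days(size_bytes: float, link_gbit_per_s: float) -> float:
    """Days needed to copy size_bytes over a fully utilised link."""
    seconds = size_bytes * 8 / (link_gbit_per_s * 10**9)
    return seconds / 86_400


def projected_size_pb(start_pb: float, years: int) -> float:
    """Archive size after `years` if the volume doubles every year."""
    return start_pb * 2**years


if __name__ == "__main__":
    # 60 PB is the lower bound quoted on the slide ("60+PB").
    for gbit in (1, 10, 100):  # hypothetical link speeds
        days = transfer_days(60 * PB, gbit)
        print(f"{gbit:>3} Gbit/s: {days:,.0f} days to copy 60 PB")
    for years in range(4):
        print(f"year +{years}: {projected_size_pb(60, years):.0f} PB")
```

Under these assumptions a full copy of 60 PB over a dedicated 10 Gbit/s link takes about 556 days, during which an archive that doubles yearly would have grown well past its starting size. This is consistent with the slides' argument for moving the analysis to where the data is located.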