Accessing Molecular Biology Information * Genome Browsers

Accessing molecular biology information
* Genome browsers - UCSC
* NCBI
* Galaxy

Flow of genetic information and databases
[Figure: DNA (exons 1-4 separated by introns) is transcribed into RNA, spliced into mature mRNA (5' UTR, coding sequence, 3' UTR, polyA), translated into protein, and folded. Nucleotide sequences are held in GenBank/EMBL, protein sequences in GenPept / SwissProt / UniProt, and structures in the Protein Data Bank (PDB).]

Databases, cont.
Redundancy at GenBank => RefSeq
Many sequences are represented more than once in GenBank.
2003 RefSeq collection: a curated, non-redundant secondary database for selected organisms, covering
• Genome DNA (assemblies)
• Transcripts (RNA)
• Protein
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook

RefSeq vs GenBank

GenBank                                  RefSeq
Not curated                              Curated
Author submits                           NCBI creates from existing data
Only author can revise                   NCBI revises as new data emerge
Multiple records from same loci common   Single records for each molecule of major organisms
Records can contradict each other
No limit to species included             Limited to model organisms
Akin to primary literature               Akin to review articles

Archives with nucleotide sequences
• Genbank/EMBL
• Sequence read archive (NCBI)
• GEO (Gene expression omnibus, NCBI)
• 1000 Genomes Project
• The Cancer Genome Atlas
• The Cancer Genome Project
• Species-specific databases like
  – FlyBase
  – WormBase
  – Saccharomyces Genome Database

[Figure: Genome sequencing using a shotgun approach]
[Figure: Sequenced eukaryotic genomes]

Sequencing going wild ...
BGI: "capacity to sequence the equivalent of 1,600 complete human genomes each day"
"BGI and BGI Americas aim to build a library of digital life, which includes 1,000 plant and animal reference genomes, 10,000 microorganism genomes".
“Million genome project”: sequencing of one million Chinese individuals

NCBI Sequence Read Archive, 2013: a total of ~10^15 nt
[Figure: growth of the Sequence Read Archive in nucleotides, 2008-2013, y-axis from 0 to 1e+15]

Sequencing costs
http://www.genome.gov/sequencingcosts/

Date             Cost per Mb of DNA sequence   Cost per genome
September 2001   $5,292.39                     $95,263,072
March 2002       $3,898.64                     $70,175,437
September 2002   $3,413.80                     $61,448,422
March 2003       $2,986.20                     $53,751,684
October 2003     $2,230.98                     $40,157,554
January 2004     $1,598.91                     $28,780,376
April 2004       $1,135.70                     $20,442,576
July 2004        $1,107.46                     $19,934,346
October 2004     $1,028.85                     $18,519,312
January 2005     $974.16                       $17,534,970
April 2005       $897.76                       $16,159,699
July 2005        $898.90                       $16,180,224
October 2005     $766.73                       $13,801,124
January 2006     $699.20                       $12,585,659
April 2006       $651.81                       $11,732,535
July 2006        $636.41                       $11,455,315
October 2006     $581.92                       $10,474,556
January 2007     $522.71                       $9,408,739
April 2007       $502.61                       $9,047,003
July 2007        $495.96                       $8,927,342
October 2007     $397.09                       $7,147,571
January 2008     $102.13                       $3,063,820
April 2008       $15.03                        $1,352,982
July 2008        $8.36                         $752,080
October 2008     $3.81                         $342,502
January 2009     $2.59                         $232,735
April 2009       $1.72                         $154,714
July 2009        $1.20                         $108,065
October 2009     $0.78                         $70,333
January 2010     $0.52                         $46,774
April 2010       $0.35                         $31,512
July 2010        $0.35                         $31,125
October 2010     $0.32                         $29,092
January 2011     $0.23                         $20,963
April 2011       $0.19                         $16,712
July 2011        $0.12                         $10,497
October 2011     $0.09                         $7,743

Many other sites have sequences available for downloading

www.1000genomes.org
The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases.
The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied.

http://cancergenome.nih.gov/
The Cancer Genome Atlas (TCGA) is a landmark research program supported by the National Cancer Institute and National Human Genome Research Institute at the National Institutes of Health. TCGA researchers will identify the genomic changes in more than 20 different types of human cancer.

By comparing the DNA in samples of normal tissue and cancer tissue taken from the same patient, researchers can identify changes specific to that particular cancer. TCGA is analyzing hundreds of samples for each type of cancer. By looking at many samples from many different patients, researchers will gain a better understanding of what makes one cancer different from another cancer. This is important because even two patients with the same type of cancer may experience very different outcomes or respond very differently to treatments. By connecting specific genomic changes with specific outcomes, researchers will be able to develop more effective, individualized ways of helping each cancer patient.

http://www.sanger.ac.uk/genetics/CGP/
The identification of genes that are mutated and hence drive oncogenesis has been a central aim of cancer research since the advent of recombinant DNA technology. The Cancer Genome Project is using the human genome sequence and high throughput mutation detection techniques to identify somatically acquired sequence variants/mutations and hence identify genes critical in the development of human cancers. This initiative will ultimately provide the paradigm for the detection of germline mutations in non-neoplastic human genetic diseases through genome-wide mutation detection approaches.
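The tumor-versus-normal comparison described above for TCGA and the Cancer Genome Project can be pictured as a set difference. The sketch below is a toy illustration only: the variant tuples are invented, the function name is hypothetical, and real pipelines work from aligned reads with statistical variant callers rather than exact set operations.

```python
# Toy variant representation: (chromosome, position, reference base, alternate base).
# All values below are invented for illustration.
normal_variants = {
    ("chr1", 1_000_100, "A", "G"),
    ("chr2", 2_500_300, "C", "T"),
}
tumor_variants = {
    ("chr1", 1_000_100, "A", "G"),   # also in normal tissue: inherited (germline)
    ("chr2", 2_500_300, "C", "T"),   # also in normal tissue: inherited (germline)
    ("chr17", 7_600_000, "G", "A"),  # tumor only: candidate somatic change
}


def candidate_somatic(tumor, normal):
    """Return variants seen in the tumor sample but not in the matched normal."""
    return tumor - normal


somatic = candidate_somatic(tumor_variants, normal_variants)
print(sorted(somatic))
```

Using the same patient's normal tissue as the reference is what removes inherited variants from the result; comparing against an unrelated genome would leave them in.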
NCBI
* Nucleotide
* Protein
* Structure
* PubMed
* OMIM (genetic diseases)
* dbSNP
* Taxonomy browser

NCBI databases
[Figure: DNA is transcribed into RNA, RNA is translated into protein, and the protein folds into a 3D structure with a specific biological function. DNA and RNA sequences are found in "Nucleotide", protein sequences in "Protein", and 3D structures in "Structure".]

Database formats - EMBL and Genbank

EMBL format

ID   LISOD standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300

Examples of feature table elements
* to represent a coding sequence that is constructed from a range of exons:
  CDS join(1886..1922,2272..2319,3563..3675,4750..4878)
* to represent a coding sequence on the complementary strand of DNA:
  CDS complement(1159..2577)
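The location strings in the feature table examples above can be read mechanically. The following is a minimal sketch, assuming only plain ranges, join(...) and complement(...) need to be handled; the function names are hypothetical, and the full DDBJ/EMBL/GenBank feature table syntax has more operators than this covers.

```python
import re


def parse_location(location: str):
    """Parse a simple feature location into (strand, [(start, end), ...]).

    Handles plain ranges, join(...) and complement(...) around either of these.
    Anything else raises ValueError.
    """
    location = location.strip()
    strand = "+"
    if location.startswith("complement(") and location.endswith(")"):
        strand = "-"
        location = location[len("complement("):-1]
    if location.startswith("join(") and location.endswith(")"):
        location = location[len("join("):-1]
    segments = []
    for part in location.split(","):
        match = re.fullmatch(r"(\d+)\.\.(\d+)", part.strip())
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append((int(match.group(1)), int(match.group(2))))
    return strand, segments


def total_length(segments):
    """Length in bases; feature table coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in segments)


strand, exons = parse_location("join(1886..1922,2272..2319,3563..3675,4750..4878)")
print(strand, exons, total_length(exons))

strand, cds = parse_location("complement(1159..2577)")
print(strand, cds, total_length(cds))
```

The three coding sequences shown in this section have lengths of 327, 1419 and (for the sod CDS at 109..717) 609 bases, each a multiple of three, as expected for a sequence read in codons.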
Common sequence formats
1. Genbank
2. EMBL
3. FASTA
   >X12345 Y098TR gene
   CGTATCTTACGAGCTACTACGA
   GGTCTTATCGGACGAGCGACT
   ...
4. FASTQ
   @SEQ_ID
   GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT
   +
   !''*((((***+))%%%++)(%%%%).1***-+*

A selection of search fields using NCBI Entrez.

Search Field [Qualifier]: Definition

Accession [ACCN]: Contains the unique accession number of the sequence or record, assigned to the nucleotide, protein, structure, genome record, or PopSet by a sequence database builder. The Structure database accession index contains the PDB IDs but not the MMDB IDs.

All Fields [ALL]: Contains all terms from all searchable database fields in the database.

Author Name [AUTH]: Contains all authors from all references in the database records. The format is last name space first initial(s), without punctuation (e.g., marley jf).

Feature Key [FKEY]: Contains the biological features assigned or annotated to the nucleotide sequences and defined in the DDBJ/EMBL/GenBank Feature Table (http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html). Not available for the Protein or Structure databases.

Contains the name of the journal in which the data were published.
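The field qualifiers listed above are appended to a search term in square brackets. As a hedged sketch of how such a fielded query might be assembled for NCBI's E-utilities ESearch service, the code below only builds the URL and sends nothing. The helper names are hypothetical, "nuccore" is the E-utilities name for the Nucleotide database, and the combination of terms is an illustrative assumption rather than a query taken from the slides.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_term(value: str, qualifier: str) -> str:
    """Attach an Entrez field qualifier, e.g. ('X64011', 'ACCN') -> 'X64011[ACCN]'."""
    return f"{value}[{qualifier}]"


def esearch_url(database: str, *terms: str) -> str:
    """Build (but do not send) an ESearch URL that ANDs the given terms together."""
    params = {"db": database, "term": " AND ".join(terms)}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Accession from the EMBL record shown earlier, restricted to records with a CDS feature.
url = esearch_url(
    "nuccore",
    fielded_term("X64011", "ACCN"),
    fielded_term("CDS", "FKEY"),
)
print(url)
```

Because urlencode percent-encodes the square brackets and turns spaces into plus signs, the same term can be typed unencoded, as X64011[ACCN] AND CDS[FKEY], into the Entrez search box.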