Databases/Resources on the Web
Jon K. Lærdahl, Structural Bioinformatics

Databases/Resources on the web
Jon K. Lærdahl
[email protected]

A lot of biological databases available on the web...
MetaBase, the database of biological databases (1801 entries)
- http://metadatabase.org
bioinformatics.ca – links directory (620 databases)
- http://bioinformatics.ca/links_directory

btw, the bioinformatics.ca links directory is an excellent resource
bioinformatics.ca – links directory
http://bioinformatics.ca/links_directory
Currently 1459 tools, 620 databases, 164 “resources”
The problem is not to find a tool or database, but to know what is “gold” and what is “junk”

Some important centres for bioinformatics
National Center for Biotechnology Information (NCBI)
– part of the US National Library of Medicine (NLM), a branch of the National Institutes of Health
– located in Bethesda, Maryland
European Bioinformatics Institute (EMBL-EBI)
– part of the European Molecular Biology Laboratory (EMBL)
– located in Hinxton, Cambridgeshire, UK

NCBI databases
Provided the GenBank DNA sequence database since 1992
Online Mendelian Inheritance in Man (OMIM) - known diseases with a genetic component and links to genes
– started early 1960s as a book
– online version, OMIM, since 1987
– on the WWW by NCBI in 1995
– currently >22,000 entries (14,400 genes)
EST - nucleotide database subset that contains only Expressed Sequence Tag records
Gene - genes and associated information for a number of organisms, including human
Protein sequence database - collection of protein sequence entries compiled from a variety of sources including Swiss-Prot, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
PubMed - access to over 15 million citations from MEDLINE and additional life sciences journals
SNP - repository for both single nucleotide substitutions and short deletion and insertion polymorphisms
All data is publicly available

NCBI databases
37 databases that together contain over 690 million records
Nucleic Acids Res. 41, D8 (2013)

EMBL-EBI databases
European Nucleotide Archive (ENA) - nucleotide sequence database
Ensembl - automatic and manually curated annotation on selected eukaryotic (vertebrate) genomes
Ensembl Genomes – Ensembl for “all other organisms”
UniProt – protein sequence and functional information
ChEMBL – database of bioactive compounds
IntAct - repository of molecular interactions, including protein-protein, protein-small molecule and protein-nucleic acid interactions
CiteXplore – 25 million literature abstracts including PubMed, Agricola & patents
Gene Ontology (GO) - controlled vocabulary to describe gene and gene product attributes in any organism
Gene Ontology Annotation (GOA) – GO annotations for proteins in UniProt
All data is publicly available
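
The NCBI databases listed above can also be searched programmatically through the NCBI E-utilities web service. As a minimal sketch, the snippet below only builds an ESearch URL and does not contact the server. The helper name `build_esearch_url` and the search terms are illustrative assumptions, not part of the slides.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL that searches one Entrez database for `term`.

    `build_esearch_url` is a hypothetical helper written for this example.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes spaces and brackets so the query is URL-safe.
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Illustrative queries (the search terms are assumptions for this sketch).
print(build_esearch_url("gene", "BRCA2[sym] AND human[orgn]"))
print(build_esearch_url("pubmed", "structural bioinformatics", retmax=5))
```

Opening one of the printed URLs in a browser, or fetching it with any HTTP client, should return a list of matching record identifiers for the chosen database.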

NCBI «Trace Archives»
Trace Archive
– Repository of raw data sequencing traces from gel and capillary electrophoresis sequencers
– >2 billion traces
Sequence Read Archive (SRA)
– Data from high-throughput sequencing (454, Illumina, IonTorrent, SOLiD, etc.)
– 915 Tbases (9.15 × 10^14) open access sequences
– At present 1 Tbase added daily
http://nar.oxfordjournals.org/content/40/D1/D54.abstract
1 Pbp ≈ 100,000 human genomes

UniProt
Database of protein sequences and functional annotations
– “a single worldwide database of protein sequence and function” (2002)
UniProt consortium
– EMBL-EBI
– Swiss Institute of Bioinformatics (SIB)
  Swiss-Prot (Amos Bairoch, 1986)
  TrEMBL (Translated EMBL Nucleotide Sequence Data Library, 1996)
– Protein Information Resource (PIR)
  roots in Margaret Dayhoff's Atlas of Protein Sequence and Structure (1965)
http://www.uniprot.org

An even better place to look for good biological databases - Nucleic Acids Res. Database issues
released once every year, in January
20th issue (2013)
88 new databases
77 updates on databases previously described in NAR
11 updates on databases previously described elsewhere
http://nar.oxfordjournals.org/content/41/D1.toc

While we are visiting NAR: a good place to look for bioinformatics tools
Nucleic Acids Res. Web Server issues
released once every year, in July
11th issue (2013)
95 web servers
http://nar.oxfordjournals.org/content/41/W1.toc
If you need an article or a citation for a bioinformatics tool or database, the NAR web server or database issues are often good places to look

Huge number of databases!
In bioinformatics, the number of databases, tools, algorithms, and papers is enormous
impossible to have an overview, especially if bioinformatics is not your main research area
instead of trying to do everything yourself: Get yourself a bioinformatics expert colleague or collaborator!
http://www.oxfordjournals.org/nar/database/a
NAR online Molecular Biology Database Collection, currently contains 1512 databases

Good and bad databases
Some are exceptionally good, well maintained and often updated
– EMBL-EBI, NCBI, Ensembl,...
– http://string.embl.de
– http://www.pdb.org
– Maintained by 10s and 100s of experts...
Species specific
– http://www.pombase.org (Schizosaccharomyces pombe)
– http://flybase.org (Drosophila)
– http://ecocyc.org (Escherichia coli K-12 MG1655)
Unique content
– http://www.proteinatlas.org
Also many have poor quality, are never updated, are unreliable
Trick is to know what is good and what is bad...
Let your favourite bioinformatician follow the field!

Ensembl genome browser and database

Genome browsers
Graphical interface for genomic data
Shows information from biological databases mapped onto genomic sequence
Genomic coordinates
Various annotations = “tracks”
NCBI Gene database

Ensembl Genome Browser
Joint project between EMBL-EBI and the Wellcome Trust Sanger Institute
Central resource for studying genomes of vertebrates
– Mainly chordates, but also a few others (e.g. C. elegans and S. cerevisiae)
– Updated several times a year with new genome assemblies and new species
– Annotations of genomes (e.g. genes and their splice variants, SNPs) added by the Ensembl pipeline
– Automatic gene prediction (with or without experimental evidence) & some curator input
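
The genome browser idea described above (annotations from several databases mapped onto genomic coordinates and displayed as “tracks”) can be sketched with a small data structure. This is only an illustration, not how Ensembl or any other browser is implemented. The `Feature` class, the `features_in_view` function, and all example features are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotation mapped onto the genome (hypothetical model)."""
    track: str  # annotation track the feature belongs to, e.g. "genes"
    chrom: str  # chromosome name
    start: int  # 1-based, inclusive start coordinate
    end: int    # 1-based, inclusive end coordinate
    name: str


def features_in_view(features, chrom, view_start, view_end):
    """Group the names of features overlapping the viewed region by track."""
    tracks = {}
    for f in features:
        # Two closed intervals overlap if each starts before the other ends.
        if f.chrom == chrom and f.start <= view_end and f.end >= view_start:
            tracks.setdefault(f.track, []).append(f.name)
    return tracks


# Made-up features for illustration only.
example = [
    Feature("genes", "22", 1_000, 5_000, "geneA"),
    Feature("genes", "22", 9_000, 12_000, "geneB"),
    Feature("SNPs", "22", 4_200, 4_200, "snp1"),
    Feature("SNPs", "21", 4_200, 4_200, "snp2"),
]
print(features_in_view(example, "22", 4_000, 6_000))
```

Each key of the returned dictionary corresponds to one track, and the viewed region plays the role of the window shown in the browser.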

Ensembl Genome Browser
http://www.ensembl.org
Excellent resource for exploring vertebrate species where the genome has been sequenced

Ensembl Genome Browser
Currently >70 species

EnsemblGenomes
Bacteria, protists, fungi, plants and other metazoa

Ensembl 2013
Read the article yourself!

UCSC Genome Browser

UCSC Genome Browser
Developed and maintained at the University of California, Santa Cruz (UCSC)
Interactive website
Access to genome sequence data from
– Human genome
  Latest assembly (GRCh37), but also earlier versions
– Mouse, rat, and approx. 40 other mammals
– Chicken, turkey, budgerigar, reptiles, frogs, and fishes
– Insects, nematodes, S. cerevisiae and more

UCSC Genome Browser
http://genome.ucsc.edu
Kuhn et al. Brief. Bioinform. 14, 144 (2012)

UCSC Genome Browser
Reference genome, chromosome coordinates
Annotation tracks: known genes, predicted genes, transcripts, promoter binding sites, SNPs, repeats, epigenetic marks, and many, many more...
Links to detailed data

UCSC Genome Browser
Cow, chromosome 22, around position 17,380,000
Annotation tracks

Access to the UCSC Genome Browser databases and tools
http://genome.ucsc.edu
General information
News, updates, announcements

UCSC Genome Browser
Examples of searching options – correct query format
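
One common query format is a genomic position written as chromosome, start and end, for example `chr22:17,380,000-17,390,000`. As a rough sketch of what “correct query format” means for such a position, the hypothetical function `parse_position` below checks the format and returns the coordinates. The end coordinate in the example is an assumption, since the slides only say “around position 17,380,000”.

```python
import re

# chromosome name, colon, start, hyphen, end; commas in the numbers are allowed
_POSITION = re.compile(r"^(?P<chrom>[\w.]+):(?P<start>\d[\d,]*)-(?P<end>\d[\d,]*)$")


def parse_position(query: str):
    """Parse 'chrom:start-end' into (chrom, start, end) with integer coordinates.

    Hypothetical helper for illustration; raises ValueError on malformed input.
    """
    match = _POSITION.match(query.strip())
    if match is None:
        raise ValueError(f"not a valid position query: {query!r}")
    # Remove thousands separators before converting to integers.
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start must not be larger than end")
    return match["chrom"], start, end


print(parse_position("chr22:17,380,000-17,390,000"))
```

A query such as `chr22 17380000` would be rejected by this sketch because the colon and the end coordinate are missing.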

Different kinds of data
Direction of transcription shown in the introns
5’ UTR, exon, 3’ UTR
Single location data, e.g. SNPs
Wiggle (WIG) track format for dense, continuous data, e.g. conservation, epigenetic marks, and transcriptome data
Add your own data, Switch direction, Reset view, Hide

ENCODE data in UCSC
http://genome.ucsc.edu/ENCODE/aboutScaleup.html

Much more in MBV-INFx410

CLS Wednesday seminars – Bioinformatics/CLS seminars every 14 days

cbo-[email protected] – the mailing list for bioinformatics and computational biology in the Oslo region
News about
– Seminars
– Courses
– Jobs
– Conferences
Relevant mainly for people in the Oslo region
Anyone can send an e-mail to the list
Curators check that the message is relevant (to avoid spam) and release the message
Currently >400 subscribers
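
As a supplement to the “Different kinds of data” slide: a minimal sketch of what a Wiggle (WIG) track for dense, continuous data can look like. It writes the fixedStep variant of the format, where one value is given per fixed-size window. The function name `to_fixed_step_wig` and the example values are assumptions made for illustration.

```python
def to_fixed_step_wig(chrom, start, step, values, span=None):
    """Return fixedStep WIG text for evenly spaced values.

    `start` is the 1-based position of the first value; `span` (the number of
    bases each value covers) defaults to `step` so the windows tile the region.
    Hypothetical helper written for this example.
    """
    span = step if span is None else span
    header = f"fixedStep chrom={chrom} start={start} step={step} span={span}"
    # One data value per line follows the declaration line.
    lines = [header] + [f"{v:g}" for v in values]
    return "\n".join(lines) + "\n"


# Made-up conservation-like scores in 100-base windows.
print(to_fixed_step_wig("chr22", 17380001, 100, [0.5, 1.25, 0.0]), end="")
```

Text of this kind could be supplied as a custom track through the “Add your own data” option mentioned on the slide.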