
Jon K. Lærdahl, Structural Bioinformatics

Databases/Resources on the web

Jon K. Lærdahl [email protected]

A lot of biological databases available on the web...

MetaBase, the database of biological databases (1801 entries)
– http://metadatabase.org

bioinformatics.ca – links directory (620 databases)
– http://bioinformatics.ca/links_directory

btw, the bioinformatics.ca links directory is an excellent resource

bioinformatics.ca – links directory
http://bioinformatics.ca/links_directory

Currently
• 1459 tools
• 620 databases
• 164 “resources”

The problem is not to find a tool or database, but to know what is “gold” and what is “junk”

Some important centres for bioinformatics

• National Center for Biotechnology Information (NCBI)
  – part of the US National Library of Medicine (NLM), a branch of the National Institutes of Health
  – located in Bethesda, Maryland
• European Bioinformatics Institute (EMBL-EBI)
  – part of the European Molecular Biology Laboratory (EMBL)
  – located in Hinxton, Cambridgeshire, UK

NCBI databases

• Provided the GenBank DNA sequence database since 1992
• Online Mendelian Inheritance in Man (OMIM) - known diseases with a genetic component and links to genes
  – started early 1960s as a book
  – online version, OMIM, since 1987
  – on the WWW by NCBI in 1995
  – currently >22,000 entries (14,400 genes)
• EST - nucleotide database subset that contains only Expressed Sequence Tag records
• Gene - genes and associated information for a number of organisms in addition to and including
• Protein sequence database - collection of protein sequence entries compiled from a variety of sources including Swiss-Prot, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
• PubMed - access to over 15 million citations from MEDLINE and additional life sciences journals
• SNP - repository for both single nucleotide substitutions and short deletion and insertion polymorphisms

All data is publicly available

NCBI databases

37 databases that together contain over 690 million records
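Many of the NCBI databases listed here can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds ESearch URLs and sends nothing over the network. The helper name `esearch_url`, the search term, and the choice of database identifiers (`pubmed`, `gene`, `protein`, `snp`, `omim`) are assumptions made for illustration and do not come from the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Build (but do not send) an ESearch query URL for one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


# Assumed Entrez identifiers for some of the databases named on the slide.
for db in ("pubmed", "gene", "protein", "snp", "omim"):
    print(esearch_url(db, "structural bioinformatics"))
```

Opening one of the printed URLs in a web browser returns the matching record identifiers as XML.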

Nucleic Acids Res. 41, D8 (2013)

EMBL-EBI databases

• European Nucleotide Archive (ENA) - nucleotide sequence database
• Ensembl - automatic and manually curated annotation on selected eukaryotic (vertebrate) genomes
• Ensembl Genomes – Ensembl for “all other organisms”
• UniProt – protein sequence and functional information
• ChEMBL – database of bioactive compounds
• IntAct - repository of molecular interactions, including protein-protein, protein-small molecule and protein-nucleic acid interactions
• CiteXplore – 25 million literature abstracts including PubMed, Agricola & patents
• Gene Ontology (GO) - controlled vocabulary to describe gene and gene product attributes in any organism
• Gene Ontology Annotation (GOA) – GO annotations for proteins in UniProt

All data is publicly available

NCBI «Trace Archives»

• Trace Archive
  – Repository of raw data sequencing traces from gel and capillary electrophoresis sequencers
  – >2 billion traces
• Sequence Read Archive (SRA)
  – Data from high-throughput sequencing (454, Illumina, IonTorrent, SOLiD, etc.)
  – 915 Tbases (9.15 × 10^14) open access sequences
  – At present 1 Tbase added daily

http://nar.oxfordjournals.org/content/40/D1/D54.abstract
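The SRA figures quoted above are easy to sanity-check with a few lines of arithmetic. The sketch below converts 915 Tbases to scientific notation and estimates how long it would take to reach 1 Pbase. It assumes the growth rate stays fixed at 1 Tbase per day, which is a simplification; the function name `days_until` is illustrative.

```python
TERA = 10**12
PETA = 10**15

sra_bases = 915 * TERA    # open-access bases quoted on the slide
daily_growth = 1 * TERA   # bases added per day, as quoted on the slide

# 915 Tbases written in scientific notation.
print(f"{sra_bases:.2e} bases")


def days_until(target_bases: int, current: int = sra_bases,
               per_day: int = daily_growth) -> int:
    """Whole days to reach target_bases at a constant daily rate (assumed)."""
    remaining = max(0, target_bases - current)
    return -(-remaining // per_day)  # ceiling division on integers


print(days_until(PETA), "days until 1 Pbase at the assumed constant rate")
```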

1 Pbp ≈ 100,000 human genomes

UniProt

• Database of protein sequences and functional annotations
  – “a single worldwide database of protein sequence and function” (2002)
• UniProt consortium
  – EMBL-EBI
  – Swiss Institute of Bioinformatics (SIB)
    • Swiss-Prot (Amos Bairoch, 1986)
    • TrEMBL (Translated EMBL Nucleotide Sequence Data Library, 1996)
  – Protein Information Resource (PIR)
    • roots in Margaret Dayhoff's Atlas of Protein Sequence and Structure (1965)
• http://www..org

An even better place to look for good biological databases - Nucleic Acids Res. Database issues

• released once every year, in January
• 20th issue (2013)
  – 88 new databases
  – 77 updates on databases previously described in NAR
  – 11 updates on databases previously described elsewhere

http://nar.oxfordjournals.org/content/41/D1.toc

While we are visiting NAR: a good place to look for bioinformatics tools - Nucleic Acids Res. Web server issues

• released once every year, in July
• 11th issue (2013)
• 95 web servers

http://nar.oxfordjournals.org/content/41/W1.toc

If you need an article or a citation for a bioinformatics tool or database, the NAR web server or database issues are often good places to look

Huge number of databases!

• In bioinformatics, the number of databases, tools, algorithms, and papers is enormous
• impossible to have an overview, especially if bioinformatics is not your main research area
• instead of trying to do everything yourself:

Get yourself a bioinformatics expert colleague or collaborator!

http://www.oxfordjournals.org/nar/database/a

NAR online Molecular Biology Database Collection, currently contains 1512 databases

Good and bad databases

• Some are exceptionally good, well maintained and often updated
  – EMBL-EBI, NCBI, Ensembl,...
  – http://string.embl.de
  – http://www.pdb.org
  – Maintained by 10s and 100s of experts...
• Species specific
  – http://www.pombase.org (Schizosaccharomyces pombe)
  – http://flybase.org (Drosophila)
  – http://ecocyc.org (Escherichia coli K-12 MG1655)
• Unique content
  – http://www.proteinatlas.org

• Also, many are of poor quality, are never updated, or are unreliable

Trick is to know what is good and what is bad...

Let your favourite bioinformatician follow the field!

Ensembl browser and database

Genome browsers

• Graphical interface for genomic data
• Shows information from biological databases mapped onto genomic sequence

Genomic coordinates

Various annotations = “tracks”
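The idea of tracks mapped onto genomic coordinates can be sketched as a small data structure: each track is a list of features with a chromosome, a start and an end, and the browser view shows the features that overlap the displayed region. The example below is a minimal illustration only. The names `Feature`, `parse_region` and `features_in_view` and all toy annotations are hypothetical, and real browsers use indexed storage rather than a linear scan.

```python
import re
from typing import Dict, List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


def parse_region(text: str) -> Feature:
    """Parse a region string such as 'chr22:17,370,000-17,390,000'."""
    match = re.fullmatch(r"(\w+):(\d[\d,]*)-(\d[\d,]*)", text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return Feature(chrom, start, end, "query")


def features_in_view(tracks: Dict[str, List[Feature]],
                     region: Feature) -> Dict[str, List[Feature]]:
    """Return, per track, the features that overlap the region."""
    return {
        track: [f for f in features
                if f.chrom == region.chrom
                and f.start <= region.end and f.end >= region.start]
        for track, features in tracks.items()
    }


# Hypothetical toy annotations, invented purely for illustration.
TRACKS = {
    "genes": [Feature("chr22", 17_375_000, 17_385_000, "geneA"),
              Feature("chr22", 18_000_000, 18_010_000, "geneB")],
    "snps": [Feature("chr22", 17_380_000, 17_380_000, "snp1"),
             Feature("chr1", 17_380_000, 17_380_000, "snp2")],
}

view = parse_region("chr22:17,370,000-17,390,000")
for track, hits in features_in_view(TRACKS, view).items():
    print(track, [f.name for f in hits])
```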

NCBI Gene database

Ensembl Genome Browser

• Joint project between EMBL-EBI and the Wellcome Trust Sanger Institute
• Central resource for studying genomes of vertebrates
  – Mainly chordates, but a few extra (e.g. C. elegans and S. cerevisiae)
  – Updated several times a year with new genome assemblies and new species
  – Annotations of genomes (e.g. genes and their splice variants, SNPs) added by the Ensembl pipeline
  – Automatic gene prediction (with or without experimental evidence) & some curator input

http://www.ensembl.org

Excellent resource for exploring vertebrate species where the genome has been sequenced


Currently >70 species

EnsemblGenomes

• Bacteria, protists, fungi, plants and other metazoa

Ensembl 2013

Read the article yourself!

UCSC Genome Browser

• Developed and maintained at the University of California, Santa Cruz (UCSC)
• Interactive website
• Access to genome sequence data from
  – Human genome
    • Latest assembly (GRCh37), but also earlier versions
  – Mouse, rat, and approx. 40 other mammals
  – Chicken, turkey, budgerigar, reptiles, frogs, and fishes
  – Insects, nematodes, S. cerevisiae and more

http://genome.ucsc.edu

Kuhn et al. Brief. Bioinform. 14, 144 (2012)

Reference genome, chromosome coordinates

Known genes

Predicted genes

Transcripts

Promoter binding sites

SNPs

Repeats

Epigenetic marks

Annotation tracks

Many, many more...

Links to detailed data

Cow, chromosome 22, around position 17,380,000

Annotation tracks

Access to the UCSC Genome Browser databases and tools
http://genome.ucsc.edu

General informa�on

News, updates, announcements

Examples of searching options – correct query format

Different kinds of data

Direction of transcription shown in the introns

5’ UTR, exon, 3’ UTR

Single location data, e.g. SNPs

Wiggle (WIG) track format for dense, continuous data, e.g. conservation, epigenetic marks, and transcriptome data
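To make the WIG format concrete, the sketch below writes a tiny fixedStep track, the WIG variant for evenly spaced values, that could be uploaded as a custom track. The function name `fixed_step_wig`, the track name and the score values are hypothetical. Only the `track type=wiggle_0` line and the `fixedStep chrom=... start=... step=...` declaration follow the published format.

```python
from typing import Iterable


def fixed_step_wig(name: str, chrom: str, start: int,
                   values: Iterable[float], step: int = 1) -> str:
    """Render values as a fixedStep WIG track (WIG positions are 1-based)."""
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"fixedStep chrom={chrom} start={start} step={step}",
    ]
    # One data value per line, each `step` bases after the previous one.
    lines.extend(f"{value:g}" for value in values)
    return "\n".join(lines) + "\n"


# Hypothetical scores placed at the cow chr22 position shown earlier.
print(fixed_step_wig("toy_scores", "chr22", 17380000, [0.1, 0.5, 0.9, 0.4]),
      end="")
```

For sparse or irregularly spaced values the format also offers a variableStep declaration, in which each data line carries its own position.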

Add your own data, switch direction, reset view, hide

ENCODE data in UCSC

http://genome.ucsc.edu/ENCODE/aboutScaleup.html

Much more in MBV-INFx410

CLS Wednesday seminars – Bioinformatics/CLS seminars every 14 days

cbo-[email protected] – the mailing list for bioinformatics and computational biology in the Oslo region

• News about
  – Seminars
  – Courses
  – Jobs
  – Conferences

Relevant mainly for people in the Oslo region

Anyone can send an e-­‐mail to the list

Curators check that the message is relevant (to avoid spam) and release the message

Currently >400 subscribers