BIOLOGICAL DATABASES

WHAT IS A DATABASE?

• A structured collection of data held in computer storage, especially one that incorporates software to make it accessible in a variety of ways; any large collection of information.
• A collection of data that is:
  - structured
  - searchable (index) -> table of contents
  - updated periodically (release) -> new edition
  - cross-referenced (hyperlinks) -> links with other databases

• Also includes the associated tools (software) necessary for access, updating, information insertion, and information deletion.

• A database consists of basic units called records or entries.
• Each record consists of fields, which hold pre-defined data related to the record.
• For example, a protein database would have protein sequences as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence, ...).
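The record/field idea can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the accession values, and the sequences are made up for the example and do not come from any real database.

```python
from dataclasses import dataclass, fields


@dataclass
class ProteinRecord:
    """One database record (entry); each attribute is a field."""
    accession: str   # hypothetical identifier field
    name: str        # name of protein
    sequence: str    # amino-acid sequence

    @property
    def length(self) -> int:
        # Length is derived from the sequence rather than stored twice.
        return len(self.sequence)


# A toy "database": records indexed by accession so they can be searched.
records = [
    ProteinRecord("X0001", "example protein A", "MKTAYIAK"),
    ProteinRecord("X0002", "example protein B", "MGVHECPAWL"),
]
index = {rec.accession: rec for rec in records}

hit = index["X0002"]
print([f.name for f in fields(hit)], hit.length)
```

The dictionary plays the role of the index: it lets a record be found by its identifier without scanning every entry.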

DATABASES ON THE INTERNET

• Biological databases often have web interfaces, which allow users to send queries to the databases.
• Some databases can be accessed through different web servers, each offering a different interface.

[Figure: User -> Web server -> Database server (request, then query); Database server -> Web server -> User (result, then web page)]

WHY BIOLOGICAL DATABASES?

• Exponential growth in biological data.

• Data (genomic sequences, 3D structures, 2D gel analysis, MS analysis, microarrays, ...) are no longer published in a conventional manner, but are submitted directly to databases.

• Essential tools for biological research. The only way to publish massive amounts of data without using all the paper in the world.

COMPLETE GENOMES

Until 2018:
• Eukaryotes: 5262
• Prokaryotes: 131446
• Viruses: 14027

SOME STATISTICS

• More than 1000 different 'biological' databases

• Variable size: <100 Kb to >20 Gb
  - DNA: >20 Gb
  - Protein: 1 Gb
  - 3D structure: 5 Gb
  - Other: smaller

• Update frequency: daily, annually, seldom, or "forget about it".

• Usually accessible through the web (some free, some not)

• Some databases in the field of molecular biology…

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc. ... !!!!

CATEGORIES OF DATABASES FOR BIOLOGY SCIENCES

• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family
• Proteomics (2D gel, mass spectrometry)
• 3D structure
• Metabolic networks
• Regulatory networks
• Bibliography
• Expression (microarrays, ...)
• Specialized

NCBI: http://www.ncbi.nlm.nih.gov

EBI: http://www.ebi.ac.uk/

DDBJ: http://www.ddbj.nig.ac.jp/

EBI/NCBI/DDBJ

• These 3 databases contain essentially the same information, synchronized within 2-3 days (with a few differences in format and syntax)
• Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
  - Genome projects
  - Sequencing centers
  - Individual scientists
  - Literature
  - Patent offices
• Non-confidential data are exchanged daily
• The database triples approximately every 12 months.

DATABASES: PROTEIN SEQUENCES

• SWISS-PROT (UniProt): created in 1986 (Amos Bairoch) http://www.expasy.org/sprot/

• TrEMBL: created in 1996; complement to SWISS-PROT; derived from EMBL CDS translations ("proteomic" version of EMBL)

• PIR-PSD: Protein Information Resource http://pir.georgetown.edu/

• GenPept: "proteomic" version of GenBank

• Many specialized protein databases for specific families or groups of proteins.

• Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), YPD (yeast), etc.

IDEAL MINIMAL CONTENT OF AN ENTRY IN A SEQUENCE DATABASE

• Sequence
• Accession number (AC)
• Taxonomic data
• References
• Annotation
• Keywords
• Cross-references
• Documentation

The RefSeq Accession number format and molecule types

Accession    Molecule type
NC_xxxxxx    Complete genomic molecule
NG_xxxxxx    Genomic region
NM_xxxxxx    mRNA
NP_xxxxxx    Protein
NR_xxxxxx    RNA
NT_xxxxxx    Computed genomic contig
XM_xxxxxx    Computed mRNA
XP_xxxxxx    Computed protein

http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

Select GenBank

ENTREZ (Benson et al. (2011) Nucleic Acids Res D32:7): the search and retrieval system that integrates information from the National Center for Biotechnology Information (NCBI) databases.

These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, and MEDLINE, through PubMed.
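Besides the web pages, Entrez can be queried from a program through NCBI's E-utilities. The slides do not cover this, so the following is only a hedged sketch: it builds an EFetch request URL without sending it, and the helper name `efetch_url` and the choice of database and return type are assumptions made for the example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (the programmatic side of Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, record_id: str, rettype: str = "gb") -> str:
    """Build an EFetch URL that asks Entrez for one record as plain text."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Z92910 is the accession of the HFE record used in these slides.
url = efetch_url("nuccore", "Z92910", rettype="fasta")
print(url)
```

Opening the printed URL in a browser, or passing it to an HTTP client, would play the role of the "request" in the user / web server / database server picture.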

Sequence formats: FASTA format

>sequence name ↵
[sequence]… ↵

Sequence formats: GenBank format

GENBANK RECORD

Header: information that applies to the whole record

Features: annotations on the record

Sequence

1 The LOCUS field consists of five different subfields:

1a Locus Name (HSHFE) - The locus name is a tag for grouping similar sequences. The first two or three letters usually designate the organism. In this case, HS stands for Homo sapiens. The last several characters are associated with another group designation, such as the gene product. In this example, the last three letters represent the gene symbol, HFE. Currently, the only requirement for assigning a locus name to a record is that it is unique.

1b Sequence Length (12146 bp) - The total number of nucleotide base pairs (or amino acid residues) in the sequence record.

2 DEFINITION - Brief description of the sequence. The description may include source organism name, gene or protein name, or designation as untranscribed or untranslated sequences (e.g., a promoter region). For sequences containing a coding region (CDS), the definition field may also contain a "completeness" qualifier such as "complete CDS" or "exon 1."

3 ACCESSION (Z92910) - Unique identifier assigned to a complete sequence record. This number never changes, even if the record is modified. An accession number is a combination of letters and numbers that are usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456).
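The two accession layouts described above can be checked with a small regular expression. This is a sketch that covers only those two patterns; real databases accept other formats too, and the function name is made up for the example.

```python
import re

# 1 letter + 5 digits (e.g. M12345) or 2 letters + 6 digits (e.g. AC123456).
# Only the two patterns named in the text; other valid formats exist.
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_accession(text: str) -> bool:
    """Return True if text matches one of the two classic accession layouts."""
    return ACCESSION_RE.fullmatch(text) is not None


for candidate in ("Z92910", "M12345", "AC123456", "Z92910.1", "HSHFE"):
    print(candidate, looks_like_accession(candidate))
```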

4 VERSION (Z92910.1) - Identification number assigned to a single, specific sequence in the database. This number is in the format "accession.version." If any changes are made to the sequence data, the version part of the number will increase by one. For example, U12345.1 becomes U12345.2. A version number of Z92910.1 for this HFE sequence indicates that the sequence data has not been altered since its original submission.
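The "accession.version" rule is easy to express in code. The sketch below splits an identifier into its two parts and shows what a sequence update does to it; the function names are illustrative only.

```python
def split_version(identifier: str) -> tuple[str, int]:
    """Split 'accession.version' into (accession, version number)."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def bump_version(identifier: str) -> str:
    """Return the identifier the record would carry after a sequence change."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"


print(split_version("Z92910.1"))   # accession stays fixed, version is 1
print(bump_version("U12345.1"))    # the example from the text
```

The accession part never changes, which is why it is the stable handle for citing a record, while the version pins down one exact sequence.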

5 GI (1890179) - Also a sequence identification number. Whenever a sequence is changed, the version number is increased and a new GI is assigned. If a nucleotide sequence record contains a protein translation of the sequence, the translation will have its own GI number.

6 KEYWORDS (haemochromatosis; HFE gene) - A keyword can be any word or phrase used to describe the sequence. Keywords are not taken from a controlled vocabulary. Notice that in this record the keyword, "haemochromatosis," employs British spelling, rather than the American "hemochromatosis." Many records have no keywords. A period is placed in this field for records without keywords.

7 SOURCE (human) - Usually contains an abbreviated or common name of the source organism.

8 ORGANISM (Homo sapiens) - The scientific name (usually genus and species) and phylogenetic lineage. See the NCBI Taxonomy Homepage for more information about the classification scheme used to construct taxonomic lineages.

9 REFERENCE - Citations of publications by sequence authors that support information presented in the sequence record. Several references may be included in one record. References are automatically sorted from the oldest to the newest. Cited publications are searchable by author, article or publication title, journal title, or MEDLINE unique identifier (UID). The UID links the sequence record to the MEDLINE record.

1c Molecule Type (DNA) - Type of molecule that was sequenced. All sequence data in an entry must be of the same type.

1d GenBank Division (PRI) - There are different GenBank divisions. In this example, PRI stands for primate sequences. Some other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant, fungal, and algal sequences), and BCT (bacterial sequences).

1e Modification Date (23-July-1999) - Date of most recent modification made to the record. The date of first public release is not available in the sequence record. This information can be obtained only by contacting NCBI.
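The five LOCUS subfields (1a to 1e) can be pulled apart with a whitespace split. The sketch below is an illustration only: the LOCUS line is assembled by hand from the values quoted in the text, and it assumes the older layout with exactly these subfields. Current GenBank records may carry extra tokens (for example topology), which this parser rejects.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str               # 1a locus name
    length: int             # 1b sequence length
    molecule_type: str      # 1c molecule type
    division: str           # 1d GenBank division
    modification_date: str  # 1e modification date


def parse_locus(line: str) -> Locus:
    """Parse a LOCUS line in the simple five-subfield layout assumed here."""
    tokens = line.split()
    if len(tokens) != 7 or tokens[0] != "LOCUS":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, unit, molecule, division, date = tokens
    if unit not in ("bp", "aa"):
        raise ValueError(f"unexpected length unit: {unit!r}")
    return Locus(name, int(length), molecule, division, date)


# Hand-assembled line using the values from the HFE example in the text.
line = "LOCUS       HSHFE      12146 bp    DNA     PRI       23-JUL-1999"
locus = parse_locus(line)
print(locus)
```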

9 REFERENCE - If the REFERENCE TITLE contains the words "Direct Submission," contact information for the submitter(s) is provided.

The FEATURES table

A feature is simply an annotation that describes a portion of the sequence.

• Each feature includes a location (sequence location or interval) and one or several qualifiers.

• Clicking on the feature name will open a record for the sequence interval identified in the feature location.

A list of features can be found in http://www.ncbi.nlm.nih.gov/collab/FT/

source - An obligatory feature. The source gives the length of the entire sequence, the scientific name of the source organism, and the Taxon ID number.

Other types of information that the submitter may include in this field are chromosome number, map location, clone, and strain identification.

gene - Sequence portion that delineates the beginning and end of a gene.

exon - Sequence segment that contains an exon. Exons may contain portions of 5' and 3' UTRs (untranslated regions). The name of the gene to which the exon belongs and the exon number are provided.

CDS - Sequence of nucleotides that code for amino acids of the protein product (coding sequence).

The CDS begins with the first nucleotide of the start codon and ends with the third nucleotide of the stop codon. This feature includes the translation into amino acids and may also contain gene name, gene product function, link to protein sequence record, and cross-references to other database entries.
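Because the CDS runs from the first base of the start codon to the last base of the stop codon, its spliced length must be a multiple of three, and the protein has one residue fewer than the number of codons. The sketch below checks this for the CDS location of the erythropoietin EMBL entry shown later in these slides. It handles only plain `join(a..b,...)` locations; `complement(...)`, single-base positions, and partial markers such as `<` and `>` are deliberately not covered.

```python
import re


def parse_join(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 1-based and inclusive, from a location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location: str) -> int:
    """Total number of bases covered by all segments of the location."""
    return sum(end - start + 1 for start, end in parse_join(location))


# CDS location copied from the EMBL example entry (HSERPG / X02158).
cds = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
n_bases = spliced_length(cds)
n_codons = n_bases // 3
print(n_bases, n_bases % 3, n_codons - 1)  # bases, remainder, residues without stop
```

For this entry the result is 582 bases, i.e. 194 codons, which leaves 193 residues once the stop codon is set aside. That agrees with the length of the /translation qualifier in the same entry.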

intron - Transcribed but spliced-out parts. Intron number is shown.

polyA_signal - Identifies the sequence portion required for endonuclease cleavage of an mRNA transcript. Consensus sequence for the polyA signal is AATAAA.
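Locating candidate polyA signals amounts to searching for the consensus motif. The sketch below does an exact, case-insensitive search on one strand and reports 1-based positions, as feature tables do. The test sequence is invented, and real signals can deviate from the consensus, so this is only a first-pass scan.

```python
def find_polya_signals(sequence: str, motif: str = "AATAAA") -> list[int]:
    """Return 1-based start positions of every exact occurrence of motif."""
    seq = sequence.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)          # convert 0-based index to 1-based
        start = seq.find(motif, start + 1)   # step by 1 so overlaps are found
    return positions


toy = "ggctaataaagtcctgaataaac"  # made-up sequence with two copies of the motif
print(find_polya_signals(toy))
```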

BASE COUNT & ORIGIN

BASE COUNT - Base Count gives the total number of adenine (A), cytosine (C), guanine (G), and thymine (T) bases in the sequence.

ORIGIN - Origin contains the sequence data, which begins on the line immediately below the field title.
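A BASE COUNT line can be recomputed from the ORIGIN block by counting letters. The sketch below ignores the spaces and position numbers that appear in the sequence lines and reports anything that is not a, c, g, or t as "other". The fragment is the first 20 bases of the EMBL example sequence in these slides; the function name is illustrative.

```python
from collections import Counter


def base_count(sequence_text: str) -> dict[str, int]:
    """Count a, c, g, t and 'other' letters; digits and whitespace are skipped."""
    counts = Counter(ch for ch in sequence_text.lower() if ch.isalpha())
    result = {base: counts.get(base, 0) for base in "acgt"}
    result["other"] = sum(n for ch, n in counts.items() if ch not in "acgt")
    return result


fragment = "agcttctggg cttccagacc 20"
print(base_count(fragment))
```

The four counts plus "other" should add up to the sequence length stated in the record, which makes this a cheap consistency check.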

EMBL ENTRY: EXAMPLE

ID   HSERPG     standard; DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X02158.1
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-3398
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313:806-810(1985).
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   SWISS-PROT; P01588; EPO_HUMAN.
XX
…

EMBL ENTRY (CONT.)

CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA26095.1
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
FT                   /product=erythropoietin
FT   sig_peptide     join(615..627,1194..1261)
FT   exon            397..627
FT                   /number=1
FT   intron          628..1193
FT                   /number=1
FT   exon            1194..1339
FT                   /number=2
FT   intron          1340..1595
FT                   /number=2
FT   exon            1596..1682
FT                   /number=3
FT   intron          1683..2293
FT                   /number=3
FT   exon            2294..2473
FT                   /number=4
FT   intron          2474..2607
FT                   /number=4
FT   exon            2608..3327
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120

Callouts on the slide label the KW line (keyword), the OC lines (taxonomy), the RN to RL lines (references), the DR lines (cross-references), the FT table (annotation), and the SQ block (sequence).

http://genome.ucsc.edu/

REF SEQ

• Database of reference sequences

• Non-redundant; one record for each gene, or each splice variant, from each organism represented

• Each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article
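The RefSeq accession prefixes listed earlier in these slides make it possible to tell the molecule type from the identifier alone. The sketch below encodes only the eight prefixes from that table (RefSeq uses more); the accession strings in the example are placeholders, not real records.

```python
# Prefix table taken from the RefSeq slide; not the full set RefSeq defines.
REFSEQ_PREFIXES = {
    "NC": "Complete genomic molecule",
    "NG": "Genomic region",
    "NM": "mRNA",
    "NP": "Protein",
    "NR": "RNA",
    "NT": "Computed genomic contig",
    "XM": "Computed mRNA",
    "XP": "Computed protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Look up the molecule type from the two-letter prefix before '_'."""
    prefix, sep, _ = accession.partition("_")
    if not sep or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"not a RefSeq accession known to this table: {accession!r}")
    return REFSEQ_PREFIXES[prefix]


for acc in ("NM_123456", "XP_123456"):  # placeholder accessions
    print(acc, "->", refseq_molecule_type(acc))
```

The underscore is what distinguishes RefSeq identifiers from archive accessions such as Z92910, so an identifier without one is rejected here.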