Flow of Genetic Information DNA --> RNA --> PROTEIN

Flow of genetic information
DNA --> RNA --> PROTEIN --> CONFORMATION --> BIOLOGICAL FUNCTION

Overview of molecular biology databases
- Sequence
    DNA
      Genbank (www.ncbi.nlm.nih.gov)
        - BLAST
        - Entrez
      EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk)
        - SRS: srs.ebi.ac.uk, www.sanger.ac.uk/srs6/
      DDBJ (DNA Data Bank of Japan)
    Protein
      Swissprot (www.ebi.ac.uk)
      NCBI
    Protein classification databases
      Prosite (expasy.hcuge.ch)
      Pfam (www.sanger.ac.uk/Pfam)
      InterPro (www.ebi.ac.uk/interpro)
    Gene ontology
      www.geneontology.org
- Structure
    PDB (Protein Data Bank, www.rcsb.org/pdb/cgi/queryForm.cgi)
      (RCSB, Research Collaboratory for Structural Bioinformatics, rcsb.rutgers.edu)
      X-ray crystallography
      NMR
      modeling
    KLOTHO (small molecules, www.ibc.wustl.edu/moirai/klotho/compound_list.html)
- Genome
    GDB (Human Genome Data Base, www.gdb.org)
    Mouse genome database (www.informatics.jax.org)
    Yeast genome (genome-ftp.stanford.edu/Saccharomyces)
    Bacterial genomes (www.tigr.org)
- Human genome browsers
    NCBI    www.ncbi.nlm.nih.gov
    UCSC    genome.ucsc.edu
    EBI     www.ensembl.org
    Celera  www.celera.com
- Genetic disorders
    OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov)
- Taxonomy (www.ncbi.nlm.nih.gov)
- Literature
    PubMed (www.ncbi.nlm.nih.gov/Entrez)

Molecular biology databases
    DNA sequence
    Genome data
    Protein sequence
    Protein classification
    Protein structure

Major bioinformatics sites / public sequence database administrators
    Genbank (NCBI, NIH, US)
    DDBJ (Japan)
    EMBL (EBI, UK)

DNA sequence data: EMBL - Genbank - DDBJ

EMBL and Genbank formats

EMBL format

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300

3.2.4 Feature key examples

Key             Description
conflict        Separate determinations of the "same" sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein-coding sequence
misc_RNA        Generic label for an undefined RNA
insertion_seq   Insertion element
D-loop          Mitochondrial or other D-loop structure

3.3.4 Qualifier examples

Key             Location/Qualifiers
CDS             86..742
                /product="hypoxanthine phosphoribosyltransferase"
                /label=hprt
                /note="hprt catalyzes vital steps in the reutilization
                pathway for purine biosynthesis and its deficiency
                leads to forms of ""gouty"" arthritis"
rep_origin      234..243
                /direction=left
CDS             109..564
                /usedin=X10009:catalase

3.5.3 Location examples

The following is a list of common location descriptors with their meanings:

Location                    Description
467                         Points to a single base in the presented sequence
340..565                    Points to a continuous range of bases bounded by and
                            including the starting and ending bases
<345..500                   Indicates that the exact lower boundary point of a
                            feature is unknown. The location begins at some base
                            previous to the first base specified (which need not
                            be contained in the presented sequence) and continues
                            to and includes the ending base
<1..888                     The feature starts before the first sequenced base
                            and continues to and includes base 888
(102.110)                   Indicates that the exact location is unknown but that
                            it is one of the bases between bases 102 and 110,
                            inclusive
(23.45)..600                Specifies that the starting point is one of the bases
                            between bases 23 and 45, inclusive, and the end point
                            is base 600
(122.133)..(204.221)        The feature starts at a base between 122 and 133,
                            inclusive, and ends at a base between 204 and 221,
                            inclusive
123^124                     Points to a site between bases 123 and 124
145^177                     Points to a site between two adjacent bases anywhere
                            between bases 145 and 177
complement(34..(122.126))   Start at one of the bases complementary to those
                            between 122 and 126 on the presented strand and
                            finish at the base complementary to base 34 (the
                            feature is on the strand complementary to the
                            presented strand)
join("acct",449..670)       Concatenate the four bases 'acct' to the 5' end of
                            the sequence from bases 449 to 670, inclusive
J00193:hladr                Points to a feature whose location is described in
                            another entry: the feature labelled 'hladr' in the
                            entry (in this database) with primary accession
                            number 'J00193'
J00194:(100..202)           Points to bases 100 to 202, inclusive, in the entry
                            (in this database) with primary accession number
                            'J00194'
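
To make the location descriptors above more concrete, here is a minimal Python sketch that parses a small subset of them and uses the result to cut a feature out of a sequence. Everything in it (the Location class, parse_location, extract, FIRST_120) is a hypothetical name introduced for illustration, not part of any EMBL or GenBank tool. It only assumes the simplest forms from the table: a single base, a range with optional < or > markers, a site between bases (^), and complement(...) wrapped around one of those. The uncertain (a.b) forms, join(...), and references to other entries are deliberately rejected.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Location:
    """A parsed feature location (1-based, inclusive), hypothetical helper type."""
    start: int
    end: int
    kind: str = "range"        # "base", "range", or "site" (between two bases)
    start_open: bool = False   # True for "<345..500": true start lies upstream
    end_open: bool = False     # True for "10..>50": true end lies downstream
    complement: bool = False   # True when the feature is on the other strand


# Matches "467", "340..565", "<345..500", "10..>50" and "123^124".
_SIMPLE = re.compile(r"(<?)(\d+)(?:(\.\.|\^)(>?)(\d+))?")


def parse_location(text: str) -> Location:
    """Parse the simple location forms; raise ValueError for anything else."""
    text = text.strip()
    if text.startswith("complement(") and text.endswith(")"):
        inner = parse_location(text[len("complement("):-1])
        return replace(inner, complement=not inner.complement)
    match = _SIMPLE.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported location descriptor: {text!r}")
    lt, start, sep, gt, end = match.groups()
    if sep is None:                       # a single base such as "467"
        return Location(int(start), int(start), "base", start_open=bool(lt))
    kind = "site" if sep == "^" else "range"
    return Location(int(start), int(end), kind, bool(lt), bool(gt))


_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def extract(sequence: str, loc: Location) -> str:
    """Return the bases covered by loc, reverse-complemented if required."""
    if loc.kind == "site":
        raise ValueError("a between-bases site covers no bases")
    bases = sequence[loc.start - 1:loc.end]   # 1-based inclusive -> Python slice
    return bases.translate(_COMPLEMENT)[::-1] if loc.complement else bases


# First 120 bases of the LISOD entry shown above (spaces removed).
FIRST_120 = (
    "cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat"
    "gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa"
).replace(" ", "")

if __name__ == "__main__":
    print(extract(FIRST_120, parse_location("95..100")))    # RBS feature
    print(extract(FIRST_120, parse_location("109..111")))   # first codon of the CDS
```

Run on the LISOD entry, the RBS location 95..100 yields aggagg and the first three bases of the CDS (109..111) yield atg, which matches the M that opens the /translation qualifier.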

Common sequence formats
1. EMBL release format
2. Genbank (ASN.1)
3. FASTA format:
     >X12345 Y098TR gene
     CGTATCTTACGAGCTACTACGA
     GGTCTTATCGGACGAGCGACT
     ...

EMBL divisions
    Human
    Mus musculus
    Rodents
    Other Mammals
    Other Vertebrates
    Invertebrates
    Plants
    Fungi
    Prokaryotes (+ Archaea)
    Organelles
    Viruses
    Bacteriophages
    Patented
    Synthetic
    EST
    HTG
    STS
    GSS

EST (Expressed Sequence Tag)
Expressed Sequence Tags (ESTs) are partial mRNA sequences: they are sequences of cDNA that has been reverse-transcribed from mRNA.
Short sequences (~500-1000 bases); each is the result of a single sequencing experiment -> high frequency of errors.
Applications:
- Discovery of new genes
- Mapping of various genomes
- Identification of coding regions in genomic sequences
EST libraries are used to answer questions like: Which genes are expressed in a specific cell or tissue?

UniGene clusters
UniGene partitions GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene. A majority of sequences are ESTs.
The mouse dataset contains 84,247 clusters with a total of 2,332,864 sequences.

[Figure: an mRNA (5’ UTR, CDS, 3’ UTR) and public ESTs]

High-Throughput Genomic Sequences
The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to make 'unfinished' genomic sequence data rapidly available to the scientific community. It was done in a coordinated effort between the three International Nucleotide Sequence databases: DDBJ, EMBL, and GenBank. The HTG division contains 'unfinished' DNA sequences generated by the high-throughput sequencing centers.

Sequence data in this division are available for BLAST homology searches against either the "htgs" database or the "month" database, which includes all new submissions for the prior month. The HTG division of GenBank was described in a Genome Research (1997) 7(10) article by Ouellette and Boguski.

Location of HTG records:
Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone which together comprise more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data is "unfinished" and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank.

'Finished' HTG sequences (phase 3) retain the same accession number, but are moved into the relevant primary GenBank division. An example of a submission (one accession number) that has progressed through phase 1, phase 2, and phase 3 is available.

Genome Survey Sequence (GSS)
This division is similar in nature to the EST division, except that its sequences will be genomic rather than cDNA (mRNA).
The GSS division will contain (but not be limited to) the following types of data:
- random "single pass read" genome survey sequences
- single pass reads from cosmid/BAC/YAC ends
- exon trapped genomic sequences
- Alu PCR sequences

STS (Sequence Tagged Sites)
Sequence Tagged Sites (STS) are short DNA segments with a single location in the genome.
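
As a closing illustration of the sequence formats listed earlier, the sketch below reads the two-letter line codes of an EMBL-style entry and rewrites it in FASTA form. It is a simplified example under stated assumptions, not a full flat-file parser: parse_embl, to_fasta and SAMPLE are hypothetical names, SAMPLE is only an excerpt (first 120 bases) of the LISOD entry shown in these notes, and the code assumes that every non-sequence line starts with its two-letter code, that XX lines are spacers, and that sequence lines follow the SQ line until an optional // terminator.

```python
SAMPLE = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
"""


def parse_embl(text):
    """Return (fields, sequence) for one EMBL-style entry.

    fields maps each two-letter line code (ID, AC, DE, ...) to the list of
    line bodies carrying that code; sequence is the bases after the SQ line.
    """
    fields = {}
    seq_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker, if present
            break
        if in_sequence:
            # Keep letters only: drops the spacing and the running base count.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, body = line[:2], line[2:].strip()
        if code == "SQ":
            in_sequence = True
        elif code.strip() and code != "XX":
            fields.setdefault(code, []).append(body)
    return fields, "".join(seq_parts)


def to_fasta(fields, sequence, width=60):
    """Format the entry as FASTA: '>accession description', then wrapped bases."""
    accession = fields["AC"][0].split(";")[0].strip()   # primary accession
    description = " ".join(fields.get("DE", []))
    lines = [f">{accession} {description}".rstrip()]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    entry_fields, entry_sequence = parse_embl(SAMPLE)
    print(to_fasta(entry_fields, entry_sequence), end="")
```

The FASTA header here uses the primary accession number (the first AC value) followed by the DE line, mirroring the >X12345 style shown under "Common sequence formats"; other header conventions exist.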
