Flow of genetic information

DNA --> RNA --> PROTEIN --> CONFORMATION --> BIOLOGICAL FUNCTION

Overview of molecular biology databases

- Sequence
    DNA
      Genbank (www.ncbi.nlm.nih.gov) - BLAST - Entrez
      EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk) - SRS: srs.ebi.ac.uk, www.sanger.ac.uk/srs6/
      DDBJ (DNA Data Bank of Japan)
    Protein
      Swissprot (www.ebi.ac.uk)
      NCBI
    Protein classification databases
      Prosite (.hcuge.ch)
      Pfam (www.sanger.ac.uk/Pfam)
      InterPro (www.ebi.ac.uk/)
      www.geneontology.org
- Structure
    PDB, www.rcsb.org/pdb/cgi/queryForm.cgi (RCSB, Research Collaboratory for Structural Bioinformatics, rcsb.rutgers.edu)
      Xray crystallography
      NMR
      modeling
    KLOTHO (small molecules, www.ibc.wustl.edu/moirai/klotho/compound_list.html)
- Genome data
    GDB (Human Genome Data Base, www.gdb.org)
    Mouse genome database (www.informatics.jax.org)
    Yeast genome (genome-ftp.stanford.edu/Saccharomyces)
    Bacterial (www.tigr.org)
- Human genome browsers
    NCBI www.ncbi.nlm.nih.gov
    UCSC genome.ucsc.edu
    EBI www.ensembl.org
    Celera www.celera.com
- Genetic disorders
    OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov)
- Taxonomy (www.ncbi.nlm.nih.gov)
- Literature
    PubMed (www.ncbi.nlm.nih.gov/Entrez)

Molecular biology databases

DNA sequence
Genome data
Protein sequence
Protein classification
Protein structure

Major bioinformatics sites / public administrators

Genbank (NCBI, NIH, US)
DDBJ (Japan)
EMBL (EBI, UK)

DNA sequence data: EMBL - Genbank - DDBJ

EMBL and Genbank formats

EMBL format

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300

3.2.4 Feature key examples

Key             Description

conflict        Separate determinations of the "same" sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein-coding sequence
misc_RNA        Generic label for an undefined RNA
insertion_seq   Insertion element
D-loop          Mitochondrial or other D-loop structure
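The feature table of the EMBL entry above pairs each key (source, RBS, CDS, ...) with a location and a set of qualifiers. The following is a minimal sketch of how such FT lines could be grouped into features. The sample lines are copied from the LISOD entry; `parse_features` is a hypothetical helper written for this illustration, not part of any EMBL toolkit, and it ignores continuation lines of multi-line qualifier values such as /translation.

```python
# Sketch: group EMBL feature-table (FT) lines into features.
# Assumption: a feature key starts at the 6th character of the line,
# while qualifier lines are indented further and start with "/".

SAMPLE = """\
FT   source          1..756
FT                   /organism="Listeria ivanovii"
FT   RBS             95..100
FT                   /gene="sod"
FT   CDS             109..717
FT                   /gene="sod"
FT                   /product="superoxide dismutase"
"""


def parse_features(text):
    """Return a list of {"key", "location", "qualifiers"} dictionaries."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FT") or len(line) < 6:
            continue  # skip FH, XX and other line types
        if line[5] != " ":
            # New feature: key followed by its location
            key, location = line[5:].split(None, 1)
            features.append(
                {"key": key, "location": location.strip(), "qualifiers": {}}
            )
        else:
            body = line[5:].strip()
            if body.startswith("/") and features:
                name, _, value = body[1:].partition("=")
                features[-1]["qualifiers"][name] = value.strip('"')
    return features


features = parse_features(SAMPLE)
for feature in features:
    print(feature["key"], feature["location"], feature["qualifiers"])
```

Keeping qualifiers in a dictionary per feature makes it easy to look up, for example, the /product of every CDS.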

3.3.4 Qualifier examples

Key             Location/Qualifiers

CDS             86..742
                /product="hypoxanthine phosphoribosyltransferase"
                /label=hprt
                /note="hprt catalyzes vital steps in the reutilization
                pathway for purine biosynthesis and its deficiency
                leads to forms of ""gouty"" arthritis"
rep_origin      234..243
                /direction=left
CDS             109..564
                /usedin=X10009:catalase

3.5.3 Location examples

The following is a list of common location descriptors with their meanings:

Location                    Description

467                         Points to a single base in the presented sequence

340..565                    Points to a continuous range of bases bounded by and including
                            the starting and ending bases

<345..500                   Indicates that the exact lower boundary point of a feature is
                            unknown. The location begins at some base previous to the first
                            base specified (which need not be contained in the presented
                            sequence) and continues to and includes the ending base

<1..888                     The feature starts before the first sequenced base and continues
                            to and includes base 888

(102.110)                   Indicates that the exact location is unknown but that it is one
                            of the bases between bases 102 and 110, inclusive

(23.45)..600                Specifies that the starting point is one of the bases between
                            bases 23 and 45, inclusive, and the end point is base 600

(122.133)..(204.221)        The feature starts at a base between 122 and 133, inclusive, and
                            ends at a base between 204 and 221, inclusive

123^124                     Points to a site between bases 123 and 124

145^177                     Points to a site between two adjacent bases anywhere between
                            bases 145 and 177

complement(34..(122.126))   Start at one of the bases complementary to those between 122 and
                            126 on the presented strand and finish at the base complementary
                            to base 34 (the feature is on the strand complementary to the
                            presented strand)

join("acct",449..670)       Concatenate the four bases 'acct' to the 5' end of the sequence
                            from bases 449 to 670, inclusive

J00193:hladr                Points to a feature whose location is described in another entry:
                            the feature labelled 'hladr' in the entry (in this database) with
                            primary accession number 'J00193'

J00194:(100..202)           Points to bases 100 to 202, inclusive, in the entry (in this
                            database) with primary accession number 'J00194'
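The simplest descriptors in the list above (a single base, a range, and a range with an unknown lower boundary) can be read with one regular expression. The sketch below is an illustration under that restriction only: `parse_location` is a hypothetical helper, and the fuzzy `(102.110)`, `^`, `complement(...)`, `join(...)` and cross-entry forms are deliberately rejected rather than handled.

```python
import re

# Optional "<" prefix, a start base, and an optional "..end" part.
_SIMPLE = re.compile(r"^(<?)(\d+)(?:\.\.(\d+))?$")


def parse_location(text):
    """Parse '467', '340..565' or '<345..500' into a small dictionary."""
    match = _SIMPLE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported location descriptor: {text!r}")
    start = int(match.group(2))
    end = int(match.group(3)) if match.group(3) else start
    return {
        "start": start,
        "end": end,
        # True when the real start lies before the given base
        "start_unknown": match.group(1) == "<",
        # Number of presented bases covered (a lower bound if start_unknown)
        "length": end - start + 1,
    }


for descriptor in ("467", "340..565", "<345..500", "<1..888"):
    print(descriptor, parse_location(descriptor))
```

Raising an error for unsupported forms keeps the sketch from silently misreading a descriptor it does not understand.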

Common sequence formats

1. EMBL release format
2. Genbank (ASN.1)
3. FASTA format:

>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT
...

EMBL divisions

Human
Mus musculus
Rodents
Other Mammals
Other Vertebrates
Invertebrates
Plants
Fungi
Prokaryotes (+ Archaea)
Organelles
Viruses
Bacteriophages
Patented
Synthetic
EST
HTG
STS
GSS

EST (Expressed Sequence Tag)

Expressed Sequence Tags (ESTs) are partial mRNA sequences: sequences of cDNA that has been reverse-transcribed from mRNA.

Short sequences (~500-1000 bases); each is the result of a single experiment -> high frequency of errors.

Applications:
- Discovery of new genes
- Mapping of various genomes
- Identification of coding regions in genomic sequences

EST libraries are used to answer questions like: Which genes are expressed in a specific cell or tissue?

UniGene clusters

UniGene partitions GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene. A majority of sequences are ESTs.

The mouse dataset contains 84,247 clusters with a total of 2,332,864 sequences.

[Figure: public ESTs aligned along an mRNA (5' UTR, CDS, 3' UTR)]

High-Throughput Genomic Sequences

The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to make 'unfinished' genomic sequence data rapidly available to the scientific community. It was set up in a coordinated effort between the three International Nucleotide Sequence databases: DDBJ, EMBL, and GenBank. The HTG division contains 'unfinished' DNA sequences generated by the high-throughput sequencing centers. Sequence data in this division are available for BLAST homology searches against either the "htgs" database or the "month" database, which includes all new submissions for the prior month. The HTG division of GenBank was described in a Genome Research article (1997, 7(10)) by Ouellette and Boguski.

Location of HTG records: Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first-pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone, which together comprise more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences, and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data are "unfinished" and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of an HTG record remains in GenBank. 'Finished' HTG sequences (phase 3) retain the same accession number, but are moved into the relevant primary GenBank division.

Genome Survey Sequence (GSS)

This division is similar in nature to the EST division, except that its sequences are genomic rather than cDNA (mRNA). The GSS division contains (but is not limited to) the following types of data:
- random "single pass read" genome survey sequences
- single pass reads from cosmid/BAC/YAC ends
- exon trapped genomic sequences
- Alu PCR sequences

STS (Sequence Tagged Sites)

Sequence Tagged Sites (STS) are short DNA segments with a single location in the genome. This feature of STS makes them useful tags for mapping.

Genome sequencing: published complete microbial genomes

Domain: A = Archaea, B = Bacteria, E = Eukaryota

Genome                                Strain                                      Domain  Size (Mb)  Institution                                       Year

Haemophilus influenzae                Rd KW20                                     B       1.83       TIGR                                              1995
Mycoplasma genitalium                 G-37                                        B       0.58       TIGR                                              1995
Methanococcus jannaschii              DSM 2661                                    A       1.66       TIGR                                              1996
Mycoplasma pneumoniae                 M129                                        B       0.81       Univ. of Heidelberg                               1996
Synechocystis sp.                     PCC 6803                                    B       3.57       Kazusa DNA Research Inst.                         1996
Archaeoglobus fulgidus                DSM4304                                     A       2.18       TIGR                                              1997
Bacillus subtilis                     168                                         B       4.2        International Consortium                          1997
Deinococcus radiodurans               R1                                          B       3.28       TIGR                                              1997
Escherichia coli                      K-12 Strain MG1655                          B       4.6        University of Wisconsin                           1997
Helicobacter pylori                   26695                                       B       1.66       TIGR                                              1997
Methanobacterium thermoautotrophicum  delta H                                     A       1.75                                                         1997
Saccharomyces cerevisiae              S288C                                       E       13         International Consortium                          1996/1997
Aquifex aeolicus                      VF5                                         B       1.5        Diversa                                           1998
Chlamydia trachomatis                 serovar D (D/UW-3/Cx)                       B       1.05       UC Berkeley Stanford                              1998
Mycobacterium tuberculosis            H37Rv (lab strain)                          B       4.4        Sanger Centre                                     1998
Pyrococcus horikoshii                 OT3                                         A       1.8        Biotechnology Center                              1998
Rickettsia prowazekii                 Madrid E                                    B       1.1        University of Uppsala                             1998
Treponema pallidum                    Nichols                                     B       1.14       TIGR                                              1998
Aeropyrum pernix                      K1                                          A       1.67       Biotechnology Center                              1999
Chlamydia pneumoniae                  CWL029                                      B       1.23       UC Berkeley Stanford                              1999
Helicobacter pylori                   J99                                         B       1.64       Astra Research Center Boston Genome Therapeutics  1999
Thermotoga maritima                   MSB8                                        B       1.8        TIGR                                              1999
Bacillus halodurans                   C-125                                       B       4.2        Japan Marine Science and Technology Center        2000
Buchnera sp.                          APS                                         B       0.64       Univ. Tokyo / RIKEN                               2000
Campylobacter jejuni                  NCTC 11168                                  B       1.64       Sanger Centre                                     2000
Chlamydia pneumoniae                  AR39                                        B       1.23       TIGR                                              2000
Chlamydia trachomatis                 MoPn                                        B       1.07       TIGR                                              2000
Halobacterium sp.                     NRC-1                                       A       2.57       Halobacterium genome consortium                   2000
Neisseria meningitidis                MC58                                        B       2.27       TIGR                                              2000
Neisseria meningitidis                serogroup A strain Z2491                    B       2.18       Sanger Centre                                     2000
Pseudomonas aeruginosa                PAO1                                        B       6.3        University of Washington                          2000
Thermoplasma acidophilum                                                          A       1.56       Max-Planck-Institute for Biochemistry             2000
Thermoplasma volcanum                 GSS1                                        A       1.58       AIST                                              2000
Ureaplasma urealyticum                serovar 3                                   B       0.75       Applied Biosystems /                              2000
Vibrio cholerae                       serotype O1, Biotype El Tor, strain N16961  B       4          TIGR                                              2000
Xylella fastidiosa                    9a5c                                        B       2.68       ONSA Consortium                                   2000
Escherichia coli                      O157:H7 strain EDL933                       B       4.1        University of Wisconsin                           2001
Borrelia burgdorferi                  B31                                         B       1.44       TIGR                                              1997 /

Nucleotide sequence database statistics - distribution among organisms

Comparison of fully sequenced genomes

                            MB           Genes

Bacteria                    0.6 - 7.5    500-7,000
S. cerevisiae               12           6,000
S. pombe                    13           6,000
Caenorhabditis elegans      97           20,000
Drosophila melanogaster     120          14,000
                            110          26,000
Fugu rubripes               365          ~38,000?
Mus musculus                ~3000        >40,000?
H. sapiens                  3200         >40,000?

Sites for exploring fully sequenced genomes of man, mouse and other higher eukaryotes.

NCBI www.ncbi.nlm.nih.gov
UCSC genome.ucsc.edu
EBI www.ensembl.org
Celera www.celera.com

Genome MOT, Genome monitoring table
http://www.ebi.ac.uk/genomes/mot/index.html

March 2003:      % Finished    % Finished+Draft

Drosophila       100
C. elegans       100
A. thaliana      100
H. sapiens       118           183
Danio rerio      5             23
Mouse            110           181
Rat              0.5           169

Taxonomy database www3.ncbi.nlm.nih.gov/Taxonomy/tax.html

This is the top level of the taxonomy database maintained by NCBI/GenBank:

Archaea
Eubacteria
Eukaryotae
Viroids
Viruses
Other
Unclassified

Most entries in protein sequence databases are computational translations from gene sequences

DNA -> RNA -> protein -> conformation

FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"

The flow of genetic information

DNA -> RNA -> protein -> conformation

Translation products of DNA - Amino acids in three letter code

   ValArgIleArgIleSerAsp
    TyrGlyPheGlyPheArgMet
     ThrAspSerAspPheGlyCys
5' GUACGGAUUCGGAUUUCGGAUGC 3'
3' CAUGCCUAAGCCUAAAGCCUACG 5'
   TyrProAsnProAsnArgIle
    ValSerGluSerLysProHis
     ArgIleArgIleGluSerAla
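The six reading frames shown above (three on the given strand, three on its reverse complement) can be reproduced with a short sketch. The compact 64-letter string used to build the codon table is one common way of writing the standard genetic code in U, C, A, G order; the function names are illustrative choices, and results are returned in one-letter code with each reverse frame read 5' to 3', so it appears reversed relative to the display above.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the complementary strand, written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def translate(rna, frame=0):
    """Translate complete codons starting at offset `frame` (0, 1 or 2)."""
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(rna):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(rna)
    forward = [translate(rna, frame) for frame in range(3)]
    reverse = [translate(rc, frame) for frame in range(3)]
    return forward + reverse


for protein in six_frames("GUACGGAUUCGGAUUUCGGAUGC"):
    print(protein)
```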

Amino acids in one letter code

   V  R  I  R  I  S  D
    Y  G  F  G  F  R  M
     T  D  S  D  F  G  C
5' GUACGGAUUCGGAUUUCGGAUGC 3'
3' CAUGCCUAAGCCUAAAGCCUACG 5'
   Y  P  N  P  N  R  I
    V  S  E  S  K  P  H
     R  I  R  I  E  S  A

Three- and one-letter codes of the amino acids.

Alanine        Ala  A
Arginine       Arg  R
Asparagine     Asn  N
Aspartate      Asp  D
Cysteine       Cys  C
Glutamate      Glu  E
Glutamine      Gln  Q
Glycine        Gly  G
Histidine      His  H
Isoleucine     Ile  I
Leucine        Leu  L
Lysine         Lys  K
Methionine     Met  M
Phenylalanine  Phe  F
Proline        Pro  P
Serine         Ser  S
Threonine      Thr  T
Tryptophan     Trp  W
Tyrosine       Tyr  Y
Valine         Val  V

4. THE GENETIC CODE

UUU Phe   UCU Ser   UAU Tyr    UGU Cys
UUC Phe   UCC Ser   UAC Tyr    UGC Cys
UUA Leu   UCA Ser   UAA Stop   UGA Stop
UUG Leu   UCG Ser   UAG Stop   UGG Trp

CUU Leu   CCU Pro   CAU His    CGU Arg
CUC Leu   CCC Pro   CAC His    CGC Arg
CUA Leu   CCA Pro   CAA Gln    CGA Arg
CUG Leu   CCG Pro   CAG Gln    CGG Arg

AUU Ile   ACU Thr   AAU Asn    AGU Ser
AUC Ile   ACC Thr   AAC Asn    AGC Ser
AUA Ile   ACA Thr   AAA Lys    AGA Arg
AUG Met   ACG Thr   AAG Lys    AGG Arg

GUU Val   GCU Ala   GAU Asp    GGU Gly
GUC Val   GCC Ala   GAC Asp    GGC Gly
GUA Val   GCA Ala   GAA Glu    GGA Gly
GUG Val   GCG Ala   GAG Glu    GGG Gly

Table I. The genetic code

Deviations from the standard genetic code

# Ciliate protozoa

UAA = Gln:Q
UAG = Gln:Q

# Yeast mitochondria

UGA = Trp:W
CUU = Thr:T
CUC = Thr:T
CUA = Thr:T
CUG = Thr:T
AUA = Met:M

# Mammalian mitochondria

UGA = Trp:W
AUU = Ile:I
AUC = Ile:I
AUA = Met:M
AGA = *:*
AGG = *:*

# Drosophila mitochondria

UGA = Trp:W
AUU = Ile:I
AUA = Met:M
AGA = Ser:S
AGG = Ser:S

# Mycoplasma

UGA = Trp:W

Sequence symbols: Nucleotides

Symbol Meaning Complement

A A T

C C G

G G C

T/U T A

M A or C K

R A or G Y

W A or T W

S C or G S

Y C or T R

K G or T M

V A or C or G B

H A or C or T D

D A or G or T H

B C or G or T V

X/N G or A or T or C X

. not G or A or T or C .

'Reverse' translation

A   - G   - K   - M      Protein

GCN   GGN   AAR   ATG    DNA - most ambiguous
GCT   GGT   AAA   ATG    DNA - most likely

Codon usage for enteric bacterial (highly expressed) genes 7/19/83

AmAcid  Codon   Number    /1000    Fraction

Gly     GGG      13.00     1.89    0.02
Gly     GGA       3.00     0.44    0.00
Gly     GGU     365.00    52.99    0.59
Gly     GGC     238.00    34.55    0.38

Glu     GAG     108.00    15.68    0.22
Glu     GAA     394.00    57.20    0.78
Asp     GAU     149.00    21.63    0.33
Asp     GAC     298.00    43.26    0.67

Val     GUG      93.00    13.50    0.16
Val     GUA     146.00    21.20    0.26
Val     GUU     289.00    41.96    0.51
Val     GUC      38.00     5.52    0.07

Ala     GCG     161.00    23.37    0.26
Ala     GCA     173.00    25.12    0.28
Ala     GCU     212.00    30.78    0.35
Ala     GCC      62.00     9.00    0.10

Arg     AGG       1.00     0.15    0.00
Arg     AGA       0.00     0.00    0.00
Ser     AGU       9.00     1.31    0.03
Ser     AGC      71.00    10.31    0.20

Lys     AAG     111.00    16.11    0.26
Lys     AAA     320.00    46.46    0.74
Asn     AAU      19.00     2.76    0.06
Asn     AAC     274.00    39.78    0.94

Met     AUG     170.00    24.68    1.00
Ile     AUA       1.00     0.15    0.00
Ile     AUU      70.00    10.16    0.17
Ile     AUC     345.00    50.09    0.83

Thr     ACG      25.00     3.63    0.07
Thr     ACA      14.00     2.03    0.04
Thr     ACU     130.00    18.87    0.35
Thr     ACC     206.00    29.91    0.55

Protein sequence databases - Content

The SWISS-PROT Protein Sequence Data Bank is a database of protein sequences produced collaboratively by Amos Bairoch (University of Geneva) and the EBI. It contains high-quality annotation, is non-redundant, and is cross-referenced to many other databases. Release 40.44 of SWISS-PROT contains 122'214 sequence entries comprising 44'864'044 amino acids.

SWISS-PROT is accompanied by TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database not yet integrated into SWISS-PROT.

TrEMBL (March 2003) contains 725'373 sequence entries

NCBI protein database: 1'335'897 sequences

Growth of Swissprot protein sequence database

RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR DATABASES

Swissprot and relation to other databases

[Diagram: the SWISS-PROT Protein Sequence Data Bank at the centre, linked by cross-reference arrows to the databases listed here]

EMBL Nucleotide Sequence Database [EBI]
FlyBase
MGD [Mouse]
SubtiList [B.subtilis]
GCRDb [7TM recep.]
EcoGene [E.coli]
Mendel [Plant]
SGD [Yeast]
MaizeDb [Zea mays]
DictyDB [D.disco.]
WormPep [C.elegans]
ENZYME [Nomencl.]
REBASE [Restriction enzymes]
OMIM [Human]
ECO2DBASE [2D]
StyGene [S.Typhimurium]
Maize-2DPAGE [2D]
TRANSFAC
SWISS-2DPAGE [2D]
Harefield [2D]
Aarhus/Ghent [2D]
PROSITE [Patterns and profiles]
YEPD [Yeast] [2D]
PDB [3D structures]
HSSP [3D similar.]

Example of Swissprot entry

ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
GN   PRNP.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86300093.
RA   KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H.,
RA   PRUSINER S.B., DEARMOND S.J.;
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE; 86261778.
RA   LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.;
RL   SCIENCE 233:364-367(1986).
RN   [3]
RP   VARIANT AMYLOID GSS, SEQUENCE OF 58-85 AND 111-150.
RX   MEDLINE; 91160504.
RA   TAGLIAVINI F., PRELLI F., GHISO J., BUGIANI O., SERBAN D.,
RA   PRUSINER S.B., FARLOW M.R., GHETTI B., FRANGIONE B.;
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   REVIEW ON VARIANTS.
RX   MEDLINE; 93372867.
RA   PALMER M.S., COLLINGE J.;
RL   HUM. MUTAT. 2:168-173(1993).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:
CC       CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE
CC       IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN
CC       CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY
CC       (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE
CC       THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,
CC       EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
CC   -!- DISEASE: CJD OCCURS PRIMARILY AS A SPORADIC DISORDER (1 PER
CC       MILLION), WHILE 10-15% ARE FAMILIAL. ACCIDENTAL TRANSMISSION OF
CC       CJD TO HUMANS APPEARS TO BE IATROGENIC (CONTAMINATED HUMAN GROWTH
CC       HORMONE (HGH), CORNEAL TRANSPLANTATION, ELECTROENCEPHALOGRAPHIC
CC       ELECTRODE IMPLANTATION. . .). EPIDEMIOLOGIC STUDIES HAVE FAILED TO
CC       IMPLICATE THE INGESTION OF INFECTED ANNIMAL MEAT IN THE
CC       PATHOGENESIS OF CJD IN HUMAN. THE TRIAD OF MICROSCOPIC FEATURES
CC       THAT CHARACTERIZE THE PRION DISEASES CONSISTS OF (1) SPONGIFORM
CC       DEGENERATION OF NEURONS, (2) SEVERE ASTROCYTIC GLIOSIS THAT OFTEN
CC       APPEARS TO BE OUT OF PROPORTION TO THE DEGREE OF NERF CELL LOSS,
CC       AND (3) AMYLOID PLAQUE FORMATION. CJD IS CHARACTERIZED BY
CC       PROGRESSIVE DEMENTIA AND MYOCLONIC SEIZURES, AFFECTING ADULTS IN
CC       MID-LIFE. SOME PATIENTS PRESENT SLEEP DISORDERS, ABNORMALITIES OF
CC       HIGH CORTICAL FUNCTION, CEREBELLAR AND CORTICOSPINAL DISTURBANCES.
CC       THE DISEASE ENDS IN DEATH AFTER A 3-12 MONTHS ILLNESS.
CC   -!- DISEASE: GSS IS A HETEROGENEOUS DISORDER AND WAS DEFINED AS A
CC       "SPINOCEREBELLAR ATAXIA WITH DEMENTIA AND PLAQUELIKE DEPOSITS".
CC       GSS INCIDENCE IS LESS THAN 2 PER 100 MILLION.
CC   -!- DISEASE: KURU IS TRANSMITTED DURING RITUALISTIC CANNIBALISM, AMONG
CC       NATIVES OF THE NEW GUINEA HIGHLANDS. PATIENTS EXHIBIT VARIOUS
CC       MOVEMENT DISORDERS LIKE CEREBELLAR ABNORMALITIES, RIGIDITY OF THE
CC       LIMBS, AND CLONUS. EMOTIONNAL LABILITY IS PRESENT, AND DEMENTIA IS
CC       CONSPICUOUSLY ABSENT. DEATH USUALLY OCCURS FROM 3 TO 12 MONTH
CC       AFTER ONSET.
CC   -!- SIMILARITY: TO OTHER PRP.
CC   -!- DATABASE: NAME=HotMolecBase; NOTE=PrP entry;
CC       WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/prp.htm".
FT   SIGNAL        1     22
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD    181    181       PROBABLE.
FT   CARBOHYD    197    197       PROBABLE.
FT   DISULFID    179    214       BY SIMILARITY.
FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
FT                                Q.
FT   REPEAT       51     59       1.
FT   REPEAT       60     67       2.
FT   REPEAT       68     75       3.
FT   REPEAT       76     83       4.
FT   REPEAT       84     91       5.
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     105    105       P -> L (IN GSS).
FT   VARIANT     117    117       A -> V (LINKED TO DEVELOPMENT OF
FT                                DEMENTING GSS).
FT   VARIANT     129    129       M -> V (DETERMINES THE DISEASE PHENOTYPE
FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
FT                                CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT                                THOSE WITH VAL DEVELOP CJD).
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT   VARIANT     180    180       V -> I (IN CJD).
FT   VARIANT     198    198       F -> S (IN A ATYPICAL FORM OF GSS WITH
FT                                NEUROFIBRILLARY TANGLES).
FT   VARIANT     200    200       E -> K (IN CJD).
FT   VARIANT     210    210       V -> I (IN CJD).
FT   VARIANT     217    217       Q -> R (IN GSS WITH NEUROFIBRILLARY
FT                                TANGLES).
FT   VARIANT     232    232       M -> R (IN CJD).
FT   CONFLICT    118    118       MISSING (IN REF. 2).
SQ   SEQUENCE   253 AA;  27661 MW;  FD5373AD CRC32;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG
//
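The VARIANT features in the entry above give a position, the wild-type residue and the replacement. One simple consistency check is to confirm that the wild-type residue really occurs at that position in the SQ block. The sketch below does this for a subset of the FT lines; the sequence and feature lines are copied from the PRIO_HUMAN entry, while `check_variants` and its regular expression are illustrative assumptions rather than an official SWISS-PROT parser.

```python
import re

# Sequence copied from the SQ block; whitespace between blocks is removed.
SEQUENCE = "".join("""
MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
ILLISFLIFL IVG
""".split())

FT_LINES = """\
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     129    129       M -> V (DETERMINES THE DISEASE PHENOTYPE
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT   VARIANT     200    200       E -> K (IN CJD).
"""

VARIANT = re.compile(
    r"^FT\s+VARIANT\s+(\d+)\s+(\d+)\s+([A-Z]) -> ([A-Z])", re.MULTILINE
)


def check_variants(ft_text, sequence):
    """Return (position, wild_type, variant, matches_sequence) tuples."""
    results = []
    for start, _end, wild_type, variant in VARIANT.findall(ft_text):
        position = int(start)  # 1-based position in the protein
        matches = sequence[position - 1] == wild_type
        results.append((position, wild_type, variant, matches))
    return results


for row in check_variants(FT_LINES, SEQUENCE):
    print(row)
```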

Protein classification databases

PROSITE
Pfam
InterPro

Prosite: Patterns are identified from multiple alignments of protein sequences

PROSITE

Release 16.45 : 1483 patterns
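PROSITE patterns, such as the two PA lines in the examples that follow, use a compact syntax: elements are separated by "-", "x" stands for any residue, "(n)" or "(n,m)" gives a repeat count, and "[...]" lists the allowed residues. A minimal sketch of a translation into a Python regular expression is given here. `prosite_to_regex` is a hypothetical helper, it ignores the "<" and ">" anchors, and the test sequence is an illustrative Ras-like N-terminal fragment chosen for the demonstration, not taken from the slides.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as '[AG]-x(4)-G-K-[ST].' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B, C"
        element = element.replace("{", "[^").replace("}", "]")
        # (n) and (n,m) are repeat counts
        element = element.replace("(", "{").replace(")", "}")
        # x matches any residue
        element = element.replace("x", ".")
        parts.append(element)
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(p_loop)
print(zinc_finger)

# Illustrative input only: a Ras-like N-terminal fragment.
match = re.search(p_loop, "MTEYKLVVVGAGGVGKSALTIQ")
print(match.start(), match.group())
```

Replacing the exclusion braces before the repeat parentheses matters, because the repeat counts themselves become braces in the regular expression.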

Example 1

ID   ATP_GTP_A; PATTERN.
AC   PS00017;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); NOV-1990 (INFO UPDATE).
DE   ATP/GTP-binding site motif A (P-loop).
PA   [AG]-x(4)-G-K-[ST].
CC   /TAXO-RANGE=ABEPV;
3D   1EFM; 1ETU; 1Q21; 2Q21; 4Q21; 5Q21; 6Q21;
DO   PDOC00017;

Example 2

ID   ZINC_FINGER_C2H2; PATTERN.
AC   PS00028;
DT   APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1997 (INFO UPDATE).
DE   Zinc finger, C2H2 type, domain.
PA   C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
NR   /RELEASE=35,69113;

Pfam www.sanger.ac.uk/Pfam/

Pfam is a database of multiple alignments of protein domains or conserved protein regions. Ideally, they represent evolutionarily conserved structure that has implications for the protein's function.

Version 6.6, August 2001, 3071 families

Over 65% of the proteins in SWISSPROT 38 and TrEMBL-11 have at least one match to a Pfam family. 72% of protein sequences have at least one match to Pfam.

Applications of gene ontology databases:

1. Cases where hits in protein sequence databases are non-informative

Query= FWR602467643.F1 664 23 640 ABI cut from 23 to 663. Remaining: 640 bases.
         (640 letters)

Database: /pubdata/ncbi/nr
           771,594 sequences; 245,249,561 total letters

Searching...... done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

dbj|BAB23278.1| (AK004371) putative [Mus musculus]                   369   e-101
gb|AAH08101.1|AAH08101 (BC008101) Similar to hypothetical protei...  174   7e-43

2. You want to answer questions like: Which proteins are linked to a specific biological process, like glycolysis?

Gene ontology consortium

Major principles
• Molecular function
• Biological process
• Cellular component

Gene ontology www.geneontology.org

Extract of gene association table:

SP  O00115  DRN2_HUMAN  GO:0003677  F  Deoxyribonuclease II precursor
SP  O00115  DRN2_HUMAN  GO:0004519  F  Deoxyribonuclease II precursor
SP  O00115  DRN2_HUMAN  GO:0004531  F  Deoxyribonuclease II precursor
SP  O00115  DRN2_HUMAN  GO:0005764  C  Deoxyribonuclease II precursor
SP  O00116  ADAS_HUMAN  GO:0005777  C  Alkyldihydroxyacetonephosphate..
SP  O00116  ADAS_HUMAN  GO:0005777  C  Alkyldihydroxyacetonephosphate..
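A table like the extract above answers the question "which proteins are linked to a given GO term?" once it is indexed by GO identifier. The sketch below assumes the six whitespace-separated columns seen in the extract (database, accession, entry name, GO id, aspect letter, description); the aspect letters are mapped to the three major principles listed earlier, and `index_by_term` is a hypothetical helper.

```python
from collections import defaultdict

ASSOCIATIONS = """\
SP O00115 DRN2_HUMAN GO:0003677 F Deoxyribonuclease II precursor
SP O00115 DRN2_HUMAN GO:0004519 F Deoxyribonuclease II precursor
SP O00115 DRN2_HUMAN GO:0004531 F Deoxyribonuclease II precursor
SP O00115 DRN2_HUMAN GO:0005764 C Deoxyribonuclease II precursor
SP O00116 ADAS_HUMAN GO:0005777 C Alkyldihydroxyacetonephosphate..
SP O00116 ADAS_HUMAN GO:0005777 C Alkyldihydroxyacetonephosphate..
"""

ASPECTS = {
    "F": "molecular function",
    "P": "biological process",
    "C": "cellular component",
}


def index_by_term(text):
    """Map each GO id to the set of (entry name, aspect) pairs annotated with it."""
    index = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        # The description may contain spaces, so split at most 5 times.
        _db, _accession, entry, go_id, aspect, _description = line.split(None, 5)
        index[go_id].add((entry, ASPECTS[aspect]))
    return index


index = index_by_term(ASSOCIATIONS)
for go_id in sorted(index):
    print(go_id, sorted(index[go_id]))
```

Using a set per term means that repeated rows, such as the last two in the extract, are counted once.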

Databases of protein/nucleic structure

HIV protease

Secondary structure elements of proteins: α-helix

Secondary structure elements of proteins: β-sheet

Schematic pictures of proteins highlight secondary structure

Determination of protein structure

* X ray crystallography
* NMR

Example of PDB entry

HEADER    HORMONE                                 30-OCT-92   1BPH      1BPH   2
COMPND    INSULIN (CUBIC) IN 0.1M SODIUM SALT SOLUTION AT PH9           1BPH   3
SOURCE    BOVINE (BOS $TAURUS) PANCREAS                                 1BPH   4
AUTHOR    O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                           1BPH   5
REVDAT   2   31-OCT-93 1BPHA   1       REMARK HET    FORMUL             1BPHA  1
REVDAT   1   15-JAN-93 1BPH    0                                        1BPH   6
JRNL        AUTH   O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                  1BPH   7
JRNL        TITL   CONFORMATIONAL CHANGES IN CUBIC INSULIN CRYSTALS     1BPH   8
JRNL        TITL 2 IN THE PH RANGE 7-11                                 1BPH   9
JRNL        REF    BIOPHYS.J.                    V.  63  1210 1992      1BPH  10
JRNL        REFN   ASTM BIOJAU  US ISSN 0006-3495                 030   1BPH  11
REMARK   1                                                              1BPH  12
REMARK   1 REFERENCE 1
ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87      1BPH 129
ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67      1BPH 130
ATOM      3  C   GLY A   1      15.574  45.507  31.085  1.00 31.18      1BPH 131
ATOM      4  O   GLY A   1      16.078  45.660  32.217  1.00 22.60      1BPH 132
ATOM      5  N   ILE A   2      16.088  44.766  30.126  1.00 28.39      1BPH 133
ATOM      6  CA  ILE A   2      17.342  44.034  30.404  1.00 23.76      1BPH 134
ATOM      7  C   ILE A   2      18.526  44.939  30.686  1.00 25.29      1BPH 135
ATOM      8  O   ILE A   2      19.425  44.457  31.392  1.00 18.74      1BPH 136
ATOM      9  CB  ILE A   2      17.571  43.072  29.158  1.00 27.36      1BPH 137
ATOM     10  CG1 ILE A   2      18.638  42.049  29.605  1.00 18.03      1BPH 138
ATOM     11  CG2 ILE A   2      17.859  43.936  27.903  1.00 25.54      1BPH 139
ATOM     12  CD1 ILE A   2      18.914  40.930  28.590  1.00 17.07      1BPH 140
ATOM     13  N   VAL A   3      18.619  46.195  30.192  1.00 24.42      1BPH 141
ATOM     14  CA  VAL A   3      19.774  47.080  30.436  1.00 30.26      1BPH 142
ATOM     15  C   VAL A   3      19.952  47.453  31.895  1.00 19.08      1BPH 143
ATOM     16  O   VAL A   3      21.018  47.421  32.561  1.00 28.15      1BPH 144
ATOM     17  CB  VAL A   3      19.719  48.274  29.462  1.00 33.87      1BPH 145
ATOM     18  CG1 VAL A   3      20.847  49.225  29.754  1.00 30.40      1BPH 146
ATOM     19  CG2 VAL A   3      19.868  47.724  28.044  1.00 24.51

3D viewers

Several programs are available for viewing protein and nucleic 3D structures:

Rasmol www.umass.edu/microbio/rasmol/

Weblab www.msi.com

Kinemage www.cryst.bbk.ac.uk/PPS/vsns-pps/technology/kinemage.html

Chime www.umass.edu/microbio/rasmol/

Protein explorer www.umass.edu/microbio/chime/explorer/

Cn3D www.ncbi.nlm.nih.gov/Entrez

SwissPDB viewer expasy.proteome.org.au/spdbv/

(Molscript www.avatar.se/molscript/)
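The viewers listed above all start from the ATOM records of a PDB entry like the one shown earlier. PDB files are fixed-column text, so each field can be read by slicing. The sketch below parses three ATOM lines from the 1BPH entry (with the trailing identifier columns left off) and computes the distance between the first two atoms; `parse_atom` and the dictionary keys are illustrative names, and the column positions follow the published PDB ATOM record layout.

```python
import math

ATOM_LINES = """\
ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87
ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67
ATOM      5  N   ILE A   2      16.088  44.766  30.126  1.00 28.39
"""


def parse_atom(line):
    """Slice one fixed-column ATOM record into named fields."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20],
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


atoms = [parse_atom(line) for line in ATOM_LINES.splitlines()]


def distance(a, b):
    """Euclidean distance between two atoms, in the units of the file (angstroms)."""
    return math.dist((a["x"], a["y"], a["z"]), (b["x"], b["y"], b["z"]))


print(atoms[0]["name"], atoms[1]["name"], round(distance(atoms[0], atoms[1]), 3))
```

Slicing by column rather than splitting on whitespace is the safer choice for real files, where adjacent numeric fields can run together without a separating space.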