Databases Genomes

25-Mar-15

Databases
Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, March 26th 2015

Biology is Big Data science
[Figure: # genomes sequenced]
• Moore's Law: computer power doubles every ~2 years.

History
[Figure: IBM 7090 computer]
• First protein sequence: bovine insulin (51 amino acids, 1956)
• Atlas of Protein Sequence and Structure (1965)
  – Margaret Oakley Dayhoff
• Protein DataBank (10 proteins, 1972)
  – X-ray crystallographic protein structures
• SWISSPROT (1987)
  – Protein sequence database
• Genbank (1982)
  – Nucleotide and protein sequences

How would you figure out the function of a protein?
• Activity assay
• X-ray structure
• Knock-out mouse
• BLAST search

Fasta files
• Biological sequences are stored in Fasta files
• The file extension of a Fasta file is .fa or .fasta
• Fasta files are plain text files (open e.g. in a text editor)
• Every new sequence entry starts with a “>” sign at the start of a line
• Each sequence has an identifier that has to be unique in the file
• The sequence can be on one or more lines, until the next “>” at the start of a new line
• Spaces and newlines just make sequences easier to read/count, they do not have any meaning

>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_B
MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV

Fasta file extensions
• The preferred extension for protein Fasta files is .faa
  – Fasta Amino Acid

>protein_sequence_A
MTQQQQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_B
MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV

• The preferred extension for DNA Fasta files is .fna
  – Fasta Nucleic Acid

>DNA_sequence_X
GAGGAATTCA TAGCTGACGA GTCGAGTGAA AACCGTGTCG TAAAAGA
>DNA_sequence_Y
CTGACGAGTC GCCCCCCCCC ATAGAGTGGT TTCCGTTTCC GGAAGGGTCG
>DNA_sequence_Z
GAAGCTGACC CGTTTCCGGA AGAGGGAGG

DNA sequencing
[Figures: bad quality sequencing read; good quality sequencing read]
• DNA sequencing depends on A/C/G/T signal being “read”
  – Differently colored fluorophore signals
  – Signal is not always unambiguous
• DNA sequencing machines estimate the quality of a sequenced nucleotide

DNA sequencing quality scores
• DNA sequencing quality is measured in Phred scores
  – Phred 10: 10^-1 chance that the base is wrong (90% accuracy; 10% error rate)
  – Phred 20: 10^-2 chance that the base is wrong (99% accuracy; 1% error rate)
  – Phred 30: 10^-3 chance that the base is wrong (99.9% accuracy; 0.1% error rate)
  – Etcetera
• Phred scores in Fastq files are stored as ASCII characters
  – Phred score + 33, converted to ASCII text

Fastq
• Sequencing output and quality are stored in Fastq format
  – Based on Fasta format
  – Contains information about quality of each nucleotide
  – Quality score is estimated by sequencing machine

Fasta:
>sequence_identifier_1
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
>sequence_identifier_2
AAGCATCCGAATGACGAGCTAGGAGAGATCTGAGCCTTTCAAA

Fastq:
@sequence_identifier_1
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+sequence_identifier_1
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%

• Four lines per sequence:
  – Identifier line starting with @
  – DNA sequence on one line
  – Second identifier line starting with +
  – String of quality scores on one line, encoded in ASCII characters

Genbank format
• Used by the Genbank database
  – Used in sequence similarity searches (more about this later)
• Contains the sequence and all related information
• Click on “FASTA” to get Fasta format

Central paradigm of Bioinformatics
• Central dogma of (molecular) biology
• Biological sequences encode a lot of information
• One of the most important applications of Bioinformatic Data Analysis is to extract this information

International Nucleotide Sequence Database Collaboration
• INSDC is a collaboration between:
  – DNA Data Bank of Japan (DDBJ)
  – National Center for Biotechnology Information (NCBI)
  – European Molecular Biology Laboratory / European Bioinformatics Institute (EMBL-EBI)

Protein families
• Pfam database
• SEED database
• eggNOG database

Ribosomal RNA genes
• Small subunit ribosomal RNA (SSU rRNA) is a universal marker gene that indicates the taxonomic group of an organism, and was used to discover the three domains in the Tree of Life (ToL)
  – 16S rRNA (Bacteria and Archaea)
  – 18S rRNA (Eukaryotes)

Protein structures

Transcription factor binding sites (TFBS)

Metabolic pathways (etc.)

Scientific literature

Protein interactions

Using databases reproducibly in science
• Databases are not static, but are constantly updated
• Thus, every entry (sequence, protein, structure, function, etc.) in a database has a unique identifier
  – Sometimes identifiers are changed in new versions of the database
• Entries can be retrieved from the database
  – Search possibilities depend on the database
  – Often the complete database can be downloaded for large-scale analyses
  – When you access a database, note the version of the database or the date of accessing the database for reproducible science!
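
The Fasta rules from the "Fasta files" slide (a “>” line starts each entry, identifiers must be unique within the file, the sequence may span several lines, and spaces and newlines carry no meaning) can be illustrated with a small parser. This is only a minimal sketch in Python: the function name `parse_fasta` is hypothetical and not part of the lecture, and taking the identifier as the first word after “>” is an assumption.

```python
def parse_fasta(text):
    """Parse Fasta-formatted text into a dict mapping identifier -> sequence.

    Hypothetical helper for illustration; assumes the identifier is the
    first whitespace-separated word after the ">" sign.
    """
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines have no meaning
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("'>' line without an identifier")
            current_id = header[0]
            if current_id in records:
                raise ValueError(f"duplicate identifier: {current_id}")
            records[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            # Spaces only make the sequence easier to read, so drop them.
            records[current_id].append("".join(line.split()))
    # Join the pieces of sequences that were spread over several lines.
    return {name: "".join(parts) for name, parts in records.items()}


example = """>DNA_sequence_X
GAGGAATTCA TAGCTGACGA GTCGAGTGAA
AACCGTGTCG TAAAAGA
>DNA_sequence_Z
GAAGCTGACC CGTTTCCGGA AGAGGGAGG
"""

sequences = parse_fasta(example)
for name, sequence in sequences.items():
    print(name, len(sequence))
```

Raising an error on a repeated identifier is one possible design choice; it mirrors the requirement that identifiers are unique in the file.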
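
The Phred and Fastq slides can be tied together in a short sketch: a Phred score Q corresponds to an error probability of 10^(-Q/10), and a Fastq quality string stores each score as the ASCII character with code Q + 33. The Python below is a hedged illustration, not a full Fastq reader: the function names and the example read `read_1` are hypothetical, and the code assumes exactly four lines per sequence as described on the slide.

```python
def phred_to_error_probability(q):
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert an ASCII quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def encode_quality(scores, offset=33):
    """Convert a list of Phred scores into an ASCII quality string."""
    return "".join(chr(q + offset) for q in scores)


def parse_fastq(text):
    """Parse Fastq text into (identifier, sequence, scores) tuples.

    Assumes four lines per sequence: '@' identifier, sequence,
    '+' identifier, quality string.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per sequence")
    records = []
    # Step through the file in blocks of four lines; a quality string
    # may itself start with '@' or '+', so position decides the meaning.
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed Fastq record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], sequence, decode_quality(quality)))
    return records


# Hypothetical read, made up for this example.
example = "@read_1\nGATTACA\n+read_1\n5?II?5+\n"

records = parse_fastq(example)
for identifier, sequence, scores in records:
    for base, q in zip(sequence, scores):
        print(identifier, base, q, phred_to_error_probability(q))
```

The offset is a parameter because the slide specifies + 33; other offsets are left to the caller rather than guessed from the data.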
