25‐Mar‐15

Databases

Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, March 26th 2015

Biology is Big Data science

Figure: number of genomes sequenced
Moore's Law: computer power doubles every ~2 years.

History

• First protein sequence: bovine insulin (51 amino acids, 1956)
• Atlas of Protein Sequence and Structure (1965)
  – Margaret Oakley Dayhoff
• Protein DataBank (10 proteins, 1972)
  – X‐ray crystallographic protein structures
• SWISSPROT (1987)
  – Protein sequence database
• Genbank (1982)
  – Nucleotide and protein sequences

Figure: IBM 7090 computer

How would you figure out the function of a protein?

• Activity assay
• X‐ray structure
• Knock‐out mouse
• BLAST search

Fasta files

• Biological sequences are stored in Fasta files
• Fasta files are plain text files (open e.g. in a text editor)
• Every new sequence entry starts with a “>” sign at the start of a line
• Each sequence has an identifier that has to be unique in the file
• The sequence can be on one or more lines, until the next “>” at the start of a new line
• Spaces and newlines just make sequences easier to read/count; they do not have any meaning

  >protein_sequence_A
  MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
  >protein_sequence_B
  MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
  >protein_sequence_C
  MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
  RPLLRPIKSP FGAWVVIV

Fasta file extensions

• The file extension of a Fasta file is .fa or .fasta
• The preferred extension for protein Fasta files is .faa
  – Fasta Amino Acid

  >protein_sequence_A
  MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
  >protein_sequence_B
  MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
  >protein_sequence_C
  MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
  RPLLRPIKSP FGAWVVIV

• The preferred extension for DNA Fasta files is .fna
  – Fasta Nucleic Acid

  >DNA_sequence_X
  GAGGAATTCA TAGCTGACGA GTCGAGTGAA AACCGTGTCG TAAAAGA
  >DNA_sequence_Y
  CTGACGAGTC GCCCCCCCCC ATAGAGTGGT TTCCGTTTCC GGAAGGGTCG
  >DNA_sequence_Z
  GAAGCTGACC CGTTTCCGGA AGAGGGAGG
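The Fasta rules above (entries start with “>”, identifiers are unique, spaces and newlines carry no meaning) can be sketched as a small reader. This is a minimal illustration and not part of the original slides: the function name parse_fasta is hypothetical, and it assumes the identifier is the first word after the “>”.

```python
def parse_fasta(text):
    """Parse Fasta-formatted text into a dict mapping identifier -> sequence.

    Hypothetical helper for illustration; assumes the identifier is the
    first whitespace-delimited word after the ">" sign.
    """
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no meaning
        if line.startswith(">"):
            # A ">" at the start of a line opens a new sequence entry.
            header = line[1:].split()
            if not header:
                raise ValueError("entry without an identifier")
            identifier = header[0]
            if identifier in records:
                raise ValueError(f"duplicate identifier: {identifier}")
            records[identifier] = ""
        elif identifier is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Spaces only aid readability, so drop them before appending.
            records[identifier] += "".join(line.split())
    return records


example = """>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
"""

sequences = parse_fasta(example)
for name, sequence in sequences.items():
    print(name, len(sequence))
```

A sequence split over several lines (protein_sequence_C) is joined into one string, and a repeated identifier raises an error instead of silently overwriting an earlier entry.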


DNA sequencing

• DNA sequencing depends on A/C/G/T signal being “read”
  – Differently colored fluorophore signals
  – Signal is not always unambiguous
• DNA sequencing machines estimate the quality of a sequenced nucleotide

Figures: Good quality sequencing read; Bad quality sequencing read

DNA sequencing quality scores

• DNA sequencing quality is measured in Phred scores
  – Phred 10: 10^-1 chance that the base is wrong
    • 90% accuracy; 10% error rate
  – Phred 20: 10^-2 chance that the base is wrong
    • 99% accuracy; 1% error rate
  – Phred 30: 10^-3 chance that the base is wrong
    • 99.9% accuracy; 0.1% error rate
  – Etcetera
• Phred scores in Fastq files are stored as ASCII characters
  – Phred score + 33, converted to ASCII text
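The Phred examples above follow the relation error probability = 10^(-Q/10), and the ASCII encoding adds 33 to the score. A minimal sketch of both conversions follows; the function names are hypothetical and chosen for illustration only.

```python
import math

ASCII_OFFSET = 33  # offset stated above: Phred score + 33


def phred_to_error_probability(q):
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Inverse relation: Phred score for an error probability p."""
    return -10 * math.log10(p)


def encode_phred(q):
    """ASCII character that stores Phred score q."""
    return chr(q + ASCII_OFFSET)


def decode_quality_string(quality):
    """List of Phred scores stored in a string of quality characters."""
    return [ord(character) - ASCII_OFFSET for character in quality]


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(q, repr(encode_phred(q)), f"error rate {p:.3f}", f"accuracy {1 - p:.1%}")
```

With an offset of 33, Phred scores 10, 20 and 30 are stored as the characters "+", "5" and "?".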

Fastq

• Sequencing output and quality are stored in Fastq format
  – Based on Fasta format
  – Contains information about quality of each nucleotide
  – Quality score is estimated by sequencing machine

  Fasta:
  >sequence_identifier_1
  GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
  >sequence_identifier_2
  AAGCATCCGAATGACGAGCTAGGAGAGATCTGAGCCTTTCAAA

  Fastq:
  @sequence_identifier_1
  GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
  +sequence_identifier_1
  hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%

• Four lines per sequence:
  – Identifier line starting with @
  – DNA sequence on one line
  – Second identifier line starting with +
  – String of quality scores on one line, encoded in ASCII characters

Genbank format

• Used by the Genbank database
  – Used in sequence similarity searches (more about this later)
• Contains the sequence and all related information
• Click on “FASTA” to get Fasta format
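The four-line Fastq layout described above can be read with a few lines of code. The sketch below is an illustration only: parse_fastq is a hypothetical name, the record in the example is made up, and the default offset of 33 is the encoding given in these slides (other encodings exist, so the offset is a parameter).

```python
def parse_fastq(text, offset=33):
    """Parse four-line Fastq records into (identifier, sequence, scores) tuples.

    Hypothetical helper for illustration; assumes exactly four lines per
    record and the given ASCII offset for the quality characters.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per sequence")
    records = []
    for start in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[start:start + 4]
        if not header.startswith("@"):
            raise ValueError(f"identifier line must start with '@': {header}")
        if not plus.startswith("+"):
            raise ValueError(f"third line must start with '+': {plus}")
        if len(sequence) != len(quality):
            raise ValueError("need one quality character per nucleotide")
        # Each quality character stores one Phred score plus the offset.
        scores = [ord(character) - offset for character in quality]
        records.append((header[1:], sequence, scores))
    return records


# Made-up record, not taken from real sequencing output.
example = """@read_1
GATTACA
+read_1
IIII5+?
"""

for identifier, sequence, scores in parse_fastq(example):
    print(identifier, sequence, scores)
```

Reading in fixed groups of four lines avoids mistaking a quality line that happens to begin with "@" for the start of a new record.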


Central paradigm of

• Central dogma of (molecular) biology
• Biological sequences encode a lot of information
• One of the most important applications of Bioinformatic Data Analysis is to extract this information

International Nucleotide Sequence Database Collaboration

• INSDC is a collaboration between:
  – DNA Data Bank of Japan (DDBJ)
  – National Center for Biotechnology Information (NCBI)
  – European Molecular Biology Laboratory / European Bioinformatics Institute (EMBL‐EBI)

Protein families

• Pfam database
• SEED database
• eggNOG database

Ribosomal RNA genes

• Small subunit ribosomal RNA (SSU rRNA) is a universal marker gene that indicates the taxonomic group of an organism, and was used to discover the three domains in the Tree of Life (ToL)
  – 16S rRNA (Bacteria and Archaea)
  – 18S rRNA (Eukaryotes)

Protein structures


Transcription factor binding sites (TFBS)

Metabolic pathways (etc.)

Scientific literature

Protein interactions

Using databases reproducibly in science

• Databases are not static, but are constantly updated
• Thus, every entry (sequence, protein, structure, function, etc.) in a database has a unique identifier
  – Sometimes identifiers are changed in new versions of the database
• Entries can be retrieved from the database
  – Search possibilities depend on the database
  – Often the complete database can be downloaded for large‐scale analyses
  – When you access a database, note the version of the database or the date of accessing the database for reproducible science!
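One way to follow the advice above is to store a small provenance record next to every downloaded data set. The sketch below is only a suggestion: the function name, the field names and the version string "example-release" are hypothetical placeholders, not values taken from any real database.

```python
import json
from datetime import date


def provenance_record(database, version=None, accessed=None):
    """Describe which database was used, in which version, and when.

    Hypothetical helper for illustration; the field names are not a standard.
    """
    if accessed is None:
        accessed = date.today()  # default to the day of access
    return {
        "database": database,
        "version": version,  # may be None if the database has no versions
        "accessed": accessed.isoformat(),
    }


# "example-release" is a placeholder, not a real release name.
record = provenance_record("Pfam", version="example-release",
                           accessed=date(2015, 3, 26))
print(json.dumps(record, indent=2))
```

Saving such a record as a small text file alongside the downloaded data makes it possible to report later exactly which state of the database an analysis was based on.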
