21‐Mar‐15

Introduction to Bioinformatics
Systems Biology: Bioinformatic Data Analysis
Bas E. Dutilh
Utrecht University, March 19th 2015

Info and documentation
• http://theory.bio.uu.nl/BDA/2015
• http://www.google.com
  – … but only for guidance and hints: never take the internet for granted
• Campbell Biology, 9th or 10th edition, Pearson
• Reader
  – Printed in black and white
  – Download full color PDF at: http://theory.bio.uu.nl/BDA/2015/BioInf2015.pdf
  – Errata: http://theory.bio.uu.nl/BDA/2015/errata.html

Evaluation
• Final mark course
  – 2/3 mark of Mathematics/Theoretical Biology
  – 1/3 mark of Bioinformatic Data Analysis
• Bioinformatics: mark of written exam only
  – NOTE: this is different from the info in the study guide (studiegids)!
  – Date: April 9th 2015 at 17:00‐20:00 in Educatorium Gamma
• Bonus point
  – NOTE: this is different from the info in the study guide (studiegids)!
  – Make all practicals and have them signed by the assistant
    • In case of emergencies you can be late by one class maximum
  – Hand in your mini‐article on time (deadline: April 7th 2015) through http://theory.bio.uu.nl/sb/rooster.html
  – The bonus point will only be added to the mark of the written exam if this mark is >4 before addition
  – The maximum mark is a 10

How would you figure out the function of a protein?
• Activity assay
• X‐ray structure
• Knock‐out mouse
• BLAST search

How about for all proteins in a genome?

Genome sizes
Chaos chaos (1.4 Tb, Friz 1968)

Tb: Tera base pairs (10^12)
Gb: Giga base pairs (10^9)
Mb: Mega base pairs (10^6)
kb: Kilo base pairs (10^3)
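The unit definitions above can be turned into a small conversion helper. This is a minimal sketch; the function name `to_base_pairs` and the "number, space, unit" input format are assumptions made for illustration.

```python
# Multipliers for the base-pair units defined above.
UNIT_TO_BP = {"bp": 1, "kb": 10**3, "Mb": 10**6, "Gb": 10**9, "Tb": 10**12}


def to_base_pairs(size: str) -> int:
    """Convert a string such as '1.4 Tb' into a number of base pairs."""
    value, unit = size.split()
    # Accept both 'Kb' and 'kb' for kilo base pairs.
    if unit == "Kb":
        unit = "kb"
    # round() absorbs tiny floating-point errors from the multiplication.
    return round(float(value) * UNIT_TO_BP[unit])


print(to_base_pairs("1.4 Tb"))  # genome size quoted for Chaos chaos
print(to_base_pairs("3 Gb"))
```

Keeping the multipliers in one dictionary makes it easy to compare genome sizes that are quoted in different units.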

Gene density and non‐coding DNA
• Mammals (including humans) have the lowest gene density
  – Number of genes in a given length of DNA
• Introns within genes
• Non‐coding DNA between genes

Components of the genome
• 20,000 – 25,000 protein‐coding genes (1.5%)
• Introns (25.9%)
• Transposable elements (44.7%)
  – DNA transposons
  – Long terminal repeat (LTR) retrotransposons
  – Short interspersed nuclear elements (SINEs)
  – Long interspersed nuclear elements (LINEs)
  – Endogenous retroviruses
  – Miniature inverted repeat transposable elements (MITEs)
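To get a feel for these percentages, they can be converted into approximate sizes in base pairs. The sketch below assumes a total genome length of 3 Gb (the human genome size quoted in these slides) and that the percentages refer to that total; the function names are hypothetical.

```python
GENOME_BP = 3_000_000_000  # assumed total: human genome length of 3 Gb

# Fractions of the genome as listed above, in percent.
COMPONENTS = {
    "protein-coding genes": 1.5,
    "introns": 25.9,
    "transposable elements": 44.7,
}


def component_sizes(genome_bp: int, percentages: dict) -> dict:
    """Return the approximate size in base pairs of each component."""
    return {name: round(genome_bp * pct / 100) for name, pct in percentages.items()}


def gene_density(n_genes: int, genome_bp: int) -> float:
    """Genes per megabase: the number of genes in a given length of DNA."""
    return n_genes / (genome_bp / 10**6)


for name, size in component_sizes(GENOME_BP, COMPONENTS).items():
    print(f"{name}: ~{size / 10**6:.0f} Mb")

low = gene_density(20_000, GENOME_BP)
high = gene_density(25_000, GENOME_BP)
print(f"gene density: {low:.1f} to {high:.1f} genes per Mb")
```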

Largest genomes
• Largest sequenced genome: Loblolly pine (Pinus taeda), 20,000,000,000 bp (20 Gb)
• Kinugasasō (Paris japonica), 149,000,000,000 bp (149 Gb)

Smallest genomes
• Eukaryota
  – Free‐living: Ostreococcus tauri (12.6 Mb)
  – Endosymbiont: Encephalitozoon intestinalis (2.3 Mb)
• Bacteria and Archaea
  – Free‐living: Mycoplasma genitalium (580 kb)
  – Endosymbiont: Candidatus Carsonella ruddii (160 kb)
• Viruses
  – Circoviridae (1.8 kb, only two proteins!)

Genetic diversity
• Phylogenetic Tree of Life
[Figure labels: Archaea; Bacteria; Prokaryotes]

Human genome
• 3,000,000,000 bp (3 Gb)
• Human Genome Project (HGP)
  – 1990‐2003
  – Draft genome sequence complete in 2000
• Reference genome
  – Source: blood (female) and sperm (male)
  – Samples taken from many donors, but only a few were used to protect donor identities
  – Sequence is not from one individual
    • >70% from one male donor
• Cost HGP: $3,000,000,000
  – Target: $1,000 genome
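The cost figures on the human genome slide can be compared per base pair. This is a rough sketch that assumes the full HGP budget is attributed to a single 3 Gb genome; the function name `cost_per_base` is hypothetical.

```python
GENOME_BP = 3_000_000_000      # human genome length from the slide (3 Gb)
HGP_COST_USD = 3_000_000_000   # cost of the Human Genome Project from the slide
TARGET_COST_USD = 1_000        # the "$1,000 genome" target


def cost_per_base(total_cost_usd: float, genome_bp: int) -> float:
    """Cost in US dollars per sequenced base pair."""
    return total_cost_usd / genome_bp


hgp = cost_per_base(HGP_COST_USD, GENOME_BP)
target = cost_per_base(TARGET_COST_USD, GENOME_BP)
fold_cheaper = HGP_COST_USD // TARGET_COST_USD

print(f"HGP: ${hgp:.2f} per bp")
print(f"Target: ${target:.9f} per bp")
print(f"The target genome is {fold_cheaper:,} times cheaper")
```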

Genome sequencing

Whole Genome Shotgun (WGS) approach
[Figure labels: Cloned genomes; Segments in known order; Fragment and sequence; Assemble sequences; Consensus genome]
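The "assemble sequences" step can be illustrated with a toy example. The sketch below uses greedy overlap merging, which is a simplification chosen for illustration and not necessarily the method shown in the lecture. It assumes error‐free reads from one strand and no repeats, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Four overlapping fragments of the made-up sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGT", "GCGTACG", "TACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
```

Real assemblers also have to deal with sequencing errors, repeats, and reads from both strands, which this sketch ignores.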

Personal genome sequences
[Figure labels: Craig Venter; James Watson; Reference Genome; ~5,000,000 differences (shown twice); ~2,000,000 differences]

Your personal genome sequence
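The difference counts on the personal genome slide can be made concrete with a small sketch. It assumes two already aligned sequences of equal length and counts substitutions only; the sequences and function names are made up for illustration.

```python
def count_differences(personal: str, reference: str) -> int:
    """Count positions at which two equal-length, aligned sequences differ."""
    if len(personal) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for p, r in zip(personal, reference) if p != r)


def fraction_different(n_differences: int, genome_bp: int) -> float:
    """Fraction of genome positions that differ from the reference."""
    return n_differences / genome_bp


reference = "ACGTACGTAC"  # made-up reference fragment
personal = "ACGTTCGTAA"   # made-up personal fragment
print(count_differences(personal, reference))  # 2

# ~5,000,000 differences in a 3,000,000,000 bp genome:
print(f"{fraction_different(5_000_000, 3_000_000_000):.2%}")  # 0.17%
```

Real genome comparisons also involve insertions, deletions, and larger rearrangements, which a position‐by‐position count does not capture.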

So we have a $200 personal genome…
• …now the million dollar question is: What can I learn from my 3,000,000,000 A’s, C’s, G’s, and T’s?

Personalized medicine
• Sergey Brin: co‐founder, co‐investor
  – LRRK2 polymorphism on chromosome 12
  – 28% risk of Parkinson’s at age 59
  – 51% at age 69
  – 74% at age 79
• From reactive to proactive medicine
  – Identify high risk alleles
  – Adapt lifestyle (e.g. risk of high blood pressure)
  – Preventive screening or treatment (e.g. risk of cancer)
• Pharmacogenomics
  – Impact of genetic variation on response to medication

Biology is Big Data science
[Figure: # genomes sequenced]
Moore's Law: computer power doubles every ~2 years.

Omics sciences
• The suffix ‐ome refers to a totality of some sort
• Gene (genetics) / Genome / Genomics
• Transcript (RNA) / Transcriptome / Transcriptomics
• Protein / Proteome / Proteomics
• Metabolite / Metabolome / Metabolomics
• Lipid / Lipidome / Lipidomics
• Microbe / Microbiome / Microbiomics (?!)
DNA → RNA → Protein
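The Moore's Law statement on the Big Data slide (doubling every ~2 years) implies exponential growth. A minimal sketch, assuming an exact doubling time of 2 years; the function name `growth_factor` is hypothetical.

```python
def growth_factor(years: float, doubling_time_years: float = 2.0) -> float:
    """Factor by which a quantity grows if it doubles every doubling_time_years."""
    return 2 ** (years / doubling_time_years)


print(growth_factor(10))  # 32.0
print(growth_factor(20))  # 1024.0
```

The same function can be used to compare against data that grow faster or slower by changing `doubling_time_years`.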

Genomics
• Identify differences in gene content between genomes
• Analyze genome evolution
• Predict gene functions
[Figure label: Chordata ↔ Echinodermata]

Metagenomics
• Discover new species: “Biological Dark Matter”
[Figure labels: Sample; Filter; Microbes or viruses]

Human microbiome and virome
• In your body: ~10^13 human cells, ~10^14 bacteria, ~10^15 viruses
(Image: Lisa Brown)

Bioinformatics
• Bioinformatics: study of informatic processes in biotic systems
  – Paulien Hogeweg and Ben Hesper (Utrecht University, 1970)
• Bioinformatic Data Analysis: using computational methods to analyze biological data

Bioinformatics in Utrecht today
