
1. INTRODUCTION TO AND

BIOINFORMATICS COURSE MTAT.03.239

11.09.2013

LIFE

• Is a characteristic that distinguishes objects that have signaling and self-sustaining processes (i.e. living organisms) from those that do not

• Is a state of living characterized by capacity for metabolism, growth, reaction to stimuli, and reproduction

• A diversity of life forms is found on Earth, e.g. plants, animals, fungi, protists, archaea and bacteria

"Introduction to Bioinformatics" 2 Bioinformatics Course

WHAT IS BIOLOGY?

http://www.tagxedo.com/app.html

BIOLOGY


• Is the study of life and living organisms

• It brings together the structure, function, growth, origin, distribution, adaptation, interactions, taxonomy and evolution of living organisms

BIOLOGY COMPRISES AREAS OF STUDY THAT FOCUS ON LIFE AT A VARIETY OF LEVELS AND FROM A DIVERSITY OF PERSPECTIVES

AEROBIOLOGY, AGRICULTURE, ANATOMY, ASTROBIOLOGY, BIOCHEMISTRY, BIOENGINEERING, BIOINFORMATICS, BIOMATHEMATICS OR MATHEMATICAL BIOLOGY, BIOMECHANICS, BIOMEDICAL RESEARCH, BIOPHYSICS, BIOTECHNOLOGY, BUILDING BIOLOGY, BOTANY, CELL BIOLOGY, CONSERVATION BIOLOGY, CRYOBIOLOGY, DEVELOPMENTAL BIOLOGY, ECOLOGY, EMBRYOLOGY, ENTOMOLOGY, ENVIRONMENTAL BIOLOGY, EPIDEMIOLOGY, ETHOLOGY, EVOLUTIONARY BIOLOGY, GENETICS, HERPETOLOGY, HISTOLOGY, ICHTHYOLOGY, INTEGRATIVE BIOLOGY, LIMNOLOGY, MAMMALOGY, MARINE BIOLOGY, MICROBIOLOGY, MOLECULAR BIOLOGY, MYCOLOGY, NEUROBIOLOGY, OCEANOGRAPHY, ONCOLOGY, ORNITHOLOGY, POPULATION BIOLOGY, POPULATION ECOLOGY, POPULATION GENETICS, PALEONTOLOGY, PATHOBIOLOGY OR PATHOLOGY, PARASITOLOGY, PHARMACOLOGY, PHYSIOLOGY, PHYTOPATHOLOGY, PSYCHOBIOLOGY, SOCIOBIOLOGY, STRUCTURAL BIOLOGY, VIROLOGY

LIVING SYSTEMS

Domain - Eukaryota
Kingdom - Animalia
Phylum - Chordata
Vertebrata (Subphylum)
Class - Mammalia
Order - Primates
Anthropoidea (Suborder)
Hominoidea (Superfamily)
Family - Hominidae
Genus - Homo
Species - sapiens

HUMANS

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606

Lineage (full): root; cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo; Homo sapiens

SPECIES

• Defined as a group of living organisms consisting of similar individuals capable of exchanging genes or interbreeding

http://www.nature.com/news/2011/110823/full/news.2011.498.html

NO. OF SPECIES

http://www.iucnredlist.org/documents/summarystatistics/2010_1RL_Stats_Table_1.pdf

LEVELS OF ORGANISATION

http://www.nature.com/scitable/topicpage/biological-complexity-and-integrative-levels-of-organization-468

BIOLOGICAL QUESTIONS

• How are all life-forms related?
• What was the first cell like?
• How do species adapt to their environment?
• Which part of our genome is evolving the fastest?
• Are we descendants of Neanderthals?
• What genes are responsible for major human diseases?
• Why do we need new flu vaccines every year?

Introduction to Computational Genomics, Nello Cristianini and Matthew W. Hahn

BIOINFORMATICS?

COMPUTER SCIENCE [CS] STUDIES COMPUTABLE PROCESSES AND STRUCTURES (WITH THE AID OF COMPUTERS)

BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

The boundaries between the two disciplines are not well defined, but they can be distinguished by the problems they solve

• BIOINFORMATICS – is the application of and to the field of molecular biology

• COMPUTATIONAL BIOLOGY – actual of analyzing and interpreting

DEFINITION OF BIOINFORMATICS

• The term bioinformatics was coined in 1978
• Bioinformatics is the application of technology and computer science to the field of molecular biology
• The science of using / developing computer software and to record, analyze and merge biologically related data
• Using computer technology to manage large amounts of
• Bioinformatics involves the use of techniques including applied , , statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems usually on the molecular level

http://www.google.com/search?q=define%3ABioinformatics

DEFINITION OF BIOINFORMATICS

• The collection, organization, storage, analysis, and integration of large amounts of biological data using networks of computers and
• Bioinformatics involves the integration of computers, software tools, and databases in an effort to address biological questions
• In summary, the use of computer science to solve biological problems

http://www.google.com/search?q=define%3ABioinformatics

BIOINFORMATIC FOCUS

MOLECULES

CELL

ORGAN

ORGANISM

http://www.nature.com/scitable/topicpage/biological-complexity-and-integrative-levels-of-organization-468

BIOINFORMATIC FOCUS

MOLECULES

CELL

TISSUE

ANALYSIS AND INTERPRETATION OF VARIOUS TYPES OF BIOLOGICAL DATA INCLUDING: AND SEQUENCES DOMAINS AND PROTEIN STRUCTURES , , .

ORGANISM

http://www.nature.com/scitable/topicpage/biological-complexity-and-integrative-levels-of-organization-468

BIOINFORMATIC FOCUS

Development of new algorithms and statistics with which to assess biological information, such as relationships among members of large data sets.

http://www.nature.com/msb/journal/v3/n1/images/msb4100163-f4b.jpg

BIOINFORMATIC FOCUS

Development and implementation of tools that enable efficient access and management of different types of information, such as various databases, integrated mapping information

http://www.jofwidata.com/images/database-design-development.jpg

http://wolfson.huji.ac.il/expression/detective.jpg

UNITS OF INFORMATION IN BIOINFORMATICS

DNA Sequence Pathways

RNA Structure Interactions

Protein Evolution

UNITS OF INFORMATION IN COMPUTER SCIENCE

File storage capacity by bits and bytes (each unit is 1024 times the previous one)

1 byte = 8 bits
1 KB (kilobyte) = 1024 bytes = 8,192 bits
1 MB (megabyte) = 1024 KB = 1,048,576 bytes = 8,388,608 bits
1 GB (gigabyte) = 1024 MB = 1,048,576 KB = 1,073,741,824 bytes = 8,589,934,592 bits
1 TB (terabyte) = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes = 8,796,093,022,208 bits
1 PB (petabyte) = 1024 TB = 1,048,576 GB = 1,073,741,824 MB = 1,099,511,627,776 KB = 1,125,899,906,842,624 bytes = 9,007,199,254,740,992 bits
1 EB (exabyte) = 1024 PB = 1,048,576 TB = 1,073,741,824 GB = 1,099,511,627,776 MB = 1,125,899,906,842,624 KB = 1,152,921,504,606,846,976 bytes = 9,223,372,036,854,775,808 bits
1 ZB (zettabyte) = 1024 EB = 1,048,576 PB = 1,073,741,824 TB = 1,099,511,627,776 GB = 1,125,899,906,842,624 MB = 1,152,921,504,606,846,976 KB = 1,180,591,620,717,411,303,424 bytes = 9,444,732,965,739,290,427,392 bits
1 YB (yottabyte) = 1024 ZB = 1,048,576 EB = 1,073,741,824 PB = 1,099,511,627,776 TB = 1,125,899,906,842,624 GB = 1,152,921,504,606,846,976 MB = 1,180,591,620,717,411,303,424 KB = 1,208,925,819,614,629,174,706,176 bytes = 9,671,406,556,917,033,397,649,408 bits
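The storage table follows one rule: a byte is 8 bits and every larger unit is 1024 times the previous one. The short Python sketch below reproduces that arithmetic with exact integers. The names UNITS, bytes_in and bits_in are illustrative choices and are not part of the course material.

```python
# Binary storage units in increasing order; each one is 1024 times the previous.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit` (1024-based)."""
    return 1024 ** UNITS.index(unit)


def bits_in(unit: str) -> int:
    """Return the number of bits in one `unit` (8 bits per byte)."""
    return 8 * bytes_in(unit)


if __name__ == "__main__":
    for unit in UNITS:
        print(f"1 {unit:<9} = {bytes_in(unit):>33,} bytes = {bits_in(unit):>33,} bits")
```

Python integers have arbitrary precision, so the large values are printed exactly instead of being rounded as they would be in a spreadsheet.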

"Introduction to Bioinformatics" 28 Bioinformatics Course CELL SIZES

http://learn.genetics.utah.edu/content/begin/cells/scale/

HUMAN CELL

http://bhavanajagat.files.wordpress.com/2012/02/cell-structure-and-functions.jpg

EXAMPLES OF BIOLOGICAL DATA

• GENOME – DNA
• TRANSCRIPTOME – RNA
• PROTEOME – Proteins

The biological information contained in a genome is encoded in deoxyribonucleic acid (DNA) or, for many types of virus, in ribonucleic acid (RNA)

NAME THE NUMBERS

[Figure: diagram with five numbered parts (1 to 5) to be matched with NUCLEUS, DNA, GENES, CHROMOSOME and CELL]

CENTRAL DOGMA OF MOLECULAR BIOLOGY

DNA is transcribed into RNA and RNA is translated into proteins
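The two steps named above can be sketched in a few lines of Python. This is a simplified illustration, not a full model of gene expression: it assumes the input is the coding strand, reads a single frame starting at the first base, ignores splicing, and accepts only the bases A, C, G and T. The function names transcribe and translate are illustrative; the codon table is the standard genetic code (NCBI translation table 1) written in its usual compact form.

```python
# Standard genetic code, enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        # The table is keyed by DNA letters, so map U back to T for the lookup.
        codon = mrna[pos:pos + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGTGGCCACTGTGA"  # illustrative 15-base sequence ending in a stop codon
    mrna = transcribe(dna)
    print(mrna)             # AUGUGGCCACUGUGA
    print(translate(mrna))  # MWPL
```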

http://compbio.pbworks.com/f/central_dogma.jpg

http://www.uic.edu/classes/bios/bios100/lectures/centraldogma.jpg

EXAMPLES OF BIOLOGICAL DATA

• GENOME – DNA
• TRANSCRIPTOME – RNA
• PROTEOME – Proteins

The biological information contained in a genome is encoded in deoxyribonucleic acid (DNA) or, for many types of virus, in ribonucleic acid (RNA)

GENOME

• Is the entirety of an organism's hereditary information

• The genome includes both the genes and non-coding sequences of DNA/RNA

• In July 1995, the genome of Haemophilus influenzae became the first genome of a living organism to be sequenced

1,830,140 base pairs of DNA in a single circular chromosome that contains 1740 protein-coding genes, 58 transfer RNA genes and 18 other RNA genes

http://www.sciencemag.org/content/269/5223/local/front-matter.pdf

http://en.wikipedia.org/wiki/File:Haemophilus_influenzae_01.jpg

WHOLE

GENOME SIZES

Introduction to Computational Genomics, Nello Cristianini and Matthew W. Hahn

• Japanese flower Paris japonica
• 130 billion base pairs – 50 times the

COMPLETELY SEQUENCED GENOMES

HUMAN GENOME

Human body
• 10^14 cells (100 trillion)

One cell
• 23 pairs of chromosomes

DNA
• 3 billion pairs of DNA bases

RNA
• ≈21,000 to 23,000 genes

Protein
• ≈100 000 different proteins
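To connect genome sizes with the storage units from earlier in the lecture, the sketch below estimates how much space a genome needs if each base is stored in 2 bits. The 2-bit encoding (four letters A, C, G, T, one strand only, no metadata) is an assumption made for illustration, and the function name genome_storage_bytes is hypothetical. The base-pair counts are the ones quoted in these slides.

```python
def genome_storage_bytes(base_pairs: int, bits_per_base: int = 2) -> float:
    """Bytes needed to store one strand of `base_pairs` bases."""
    return base_pairs * bits_per_base / 8


# Genome sizes in base pairs, as quoted in the slides.
GENOMES = {
    "Haemophilus influenzae": 1_830_140,
    "Human": 3_000_000_000,
    "Paris japonica": 130_000_000_000,
}

if __name__ == "__main__":
    for name, size in GENOMES.items():
        megabytes = genome_storage_bytes(size) / 1024 ** 2
        print(f"{name:<24} {size:>15,} bp  ~ {megabytes:,.1f} MB")
```

Under these assumptions the human genome needs 750,000,000 bytes, which is about 715 MB in 1024-based units.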

"Introduction to Bioinformatics" 42 Bioinformatics Course Relative proportions (%) of bases in DNA

CURRENT SCIENCE, VOL. 85, NO. 11, 10 DECEMBER 2003

DNA

DNA with high GC-content is more stable than DNA with low GC-content: each G-C base pair is held together by 3 hydrogen bonds, whereas an A-T base pair has only 2
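GC-content is simply the percentage of bases in a sequence that are G or C. A minimal Python sketch follows; the function name gc_content is an illustrative choice, and the code assumes the sequence contains only unambiguous base letters.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA or RNA sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("cannot compute GC-content of an empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    print(gc_content("ATGC"))      # 50.0
    print(gc_content("GGGCCCAT"))  # 75.0
```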

"Introduction to Bioinformatics" 44 Bioinformatics Course DNA vs RNA

DNA – deoxyribonucleic acid
• Sugar is deoxyribose
• DNA is a polymer of deoxyribonucleotides
• Bases are adenine (A), guanine (G), cytosine (C) and thymine (T)

RNA – ribonucleic acid
• Sugar is ribose
• RNA is a polymer of ribonucleotides
• Bases are adenine (A), guanine (G), cytosine (C) and uracil (U)

http://www2.chemistry.msu.edu/faculty/reusch/VirtTxtJml/Images3/dna_rna1.gif

DNA SEQUENCE

• Raw DNA sequence
• Coding or non-coding
• Parses into genes
• 4 nucleotide bases ATGC

>ENST00000539570 cdna:known chromosome:GRCh37:15:63889592:63893885:1 gene:ENSG00000259662 gene_biotype:protein_coding transcript_biotype:protein_coding
ATGTGGCCACTGCTCACCATGCACATAACCCAGCTCAACCGGGAGTGCCTGCTGCACCTCTTCTCCTTCCTA
GACAAGGACAGCAGGAAGAGCCTTGCCAGGACCTGCTCCCAGCTCCACGACGTGTTTGAGGACCCCGCA
CTCTGGTCCCTGCTGCACTTCCGTTCCCTCACTGAACTCCAGAAGGACAACTTCCTCCTGGGCCCGGCACTC
CGCAGCCTCTCCATCTGCTGGCACTCCAGCCGCGTGCAGGTGTGCAGCATTGAGGACTGGCTCAAGAGTG
CCTTCCAGAGAAGCATCTGCAGCCGGCACGAGAGCCTGGTCAATGATTTCCTCCTCCGGGTGTGCGACAG
GCTTTCTGCTGTGCGCTCCCCACGGAGGCGGGAGGCGCCTGCACCGTCCTCGGGGACTCCGATCGCCGTT
GGACCGAAATCACCTCGGTGGGGAGGACCTGACCACTCGGAGTTCGCCGACTTGCGCTCGGGGGTGACG
GGGGCCAGGGCTGCCGCGCGCAGGGGTCTGGGGAGCCTCCGGGCGGAGCGACCCAGCGAGACCCCGC
CGGCTCCCGGAGTGTCCTGGGGACCGCCACCTCCAGGAGCCCCGGTGGTGATCTCGGTGAAGCAGGAGG
AGGGGAAGCAGGGGCGCACGGGCAGAAGGAGCCACCGAGCCGCTCCTCCTTGCGGTTTTGCCCGCACG
CGCGTCTGCCCGCCCACCTTTCCTGGGGCGGATGCGTTCCCGCAGTGA
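The record above is in FASTA format: a header line starting with ">" followed by one or more lines of sequence. The sketch below shows one simple way to read such records in plain Python. The function name parse_fasta is illustrative, and real projects usually rely on an established library rather than a hand-written parser.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its joined sequence."""
    chunks: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


if __name__ == "__main__":
    example = [">seq1 illustrative record", "ATGTGGCCACTG", "CTCACCATGCAC"]
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), "bases")
```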

"Introduction to Bioinformatics" 46 Bioinformatics Course A GENE

http://www.down-syndrome.org/updates/2054/updates-2054-figure1-400w.png

REGULATORS

http://www.nature.com/scitable/topicpage/gene-expression-14121669

GENE EXPRESSION REGULATORS -

http://scienceblogs.com/pharyngula/2008/07/22/epigenetics/

EXAMPLES OF BIOLOGICAL DATA

• GENOME – DNA
• TRANSCRIPTOME – RNA
• PROTEOME – Proteins

The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and non-coding RNA, produced in one cell or a population of cells

http://www.bio.miami.edu/~cmallery/150/gene/c7.17.7b.transcription.jpg

TRANSCRIPTION

http://www.youtube.com/watch?v=ztPkv7wc3yU

http://www.bio.miami.edu/~cmallery/150/gene/c7.17.7b.transcription.jpg

ALTERNATIVE SPLICING

http://www.nature.com/scitable/content/a-schematic-representation-of-alternative-splicing-95777

TYPES OF RNA

• mRNA – messenger RNA: encodes the amino acid sequence of a polypeptide
• tRNA – transfer RNA: brings amino acids to ribosomes during translation
• rRNA – ribosomal RNA: together with ribosomal proteins makes up the ribosomes, which translate the mRNA
• snRNA – small nuclear RNA: forms complexes with proteins that are used in RNA processing in eukaryotes

http://csls-text.c.u-tokyo.ac.jp/images/fig/fig03_4.gif

http://finchtalk.geospiza.com/2009_05_01_archive.html

EXAMPLES OF BIOLOGICAL DATA

• GENOME – DNA
• TRANSCRIPTOME – RNA
• PROTEOME – Proteins

The proteome is the entire set of proteins expressed by a genome, cell, tissue or organism.

http://artavanis-tsakonas.med.harvard.edu/research_images/figure_harsha_proteome.jpg

FROM TRANSCRIPTION TO TRANSLATION

http://www1.cs.columbia.edu/~cleslie/cs4761/microarray/central-dogma.png

TRANSLATION

http://0.tqn.com/d/chemistry/1/0/G/m/mrnatranslation.jpg

TRANSLATION INITIATION

http://bioap.wikispaces.com/Ch+17+Collaboration

TRANSLATION TERMINATION

http://kvhs.nbed.nb.ca/gallant/biology/translation_termination.html

UNIVERSAL GENETIC CODE

http://www.biogem.org/codon.jpg

AMINO ACIDS

PROTEIN

• Proteins consist of long chains of amino acids
• 20 letter alphabet (IUPAC nomenclature)

IUPAC code  Three-letter code  Amino acid       IUPAC code  Three-letter code  Amino acid
A           Ala                Alanine          M           Met                Methionine
C           Cys                Cysteine         N           Asn                Asparagine
D           Asp                Aspartic Acid    P           Pro                Proline
E           Glu                Glutamic Acid    Q           Gln                Glutamine
F           Phe                Phenylalanine    R           Arg                Arginine
G           Gly                Glycine          S           Ser                Serine
H           His                Histidine        T           Thr                Threonine
I           Ile                Isoleucine       V           Val                Valine
K           Lys                Lysine           W           Trp                Tryptophan
L           Leu                Leucine          Y           Tyr                Tyrosine

PROTEIN SEQUENCE

>sp|P48431|SOX2_HUMAN Transcription factor SOX-2 OS=Homo sapiens GN=SOX2 PE=1 SV=1
MYNMMETELKPPGPQQTSGGGGGNSTAAAAGGNQKNSPDRVKRPMNAFMVWSRGQRRKMA
QENPKMHNSEISKRLGAEWKLLSETEKRPFIDEAKRLRALHMKEHPDYKYRPRRKTKTLM
KKDKYTLPGGLLAPGGNSMASGVGVGAGLGAGVNQRMDSYAHMNGWSNGSYSMMQDQLGY
PQHPGLNAHGAAQMQPMHRYDVSALQYNSMTSSQTYMNGSPTYSMSYSQQGTPGMALGSM
GSVVKSEASSSPPVVTSSSHSRAPCQAGDLRDMISMYLPGAEVPEPAAPSRLHMSQHYQS
GPVPGTAINGTLPLSHM
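A protein sequence written in the one-letter alphabet can be expanded to three-letter codes, or summarised by counting residues, with a small lookup table. The sketch below uses the 20 standard amino acids; the names THREE_LETTER, to_three_letter and composition are illustrative, and letters outside the 20-letter alphabet are not handled.

```python
from collections import Counter

# One-letter IUPAC code -> three-letter code for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def to_three_letter(protein: str) -> str:
    """Convert a one-letter protein sequence to hyphen-joined three-letter codes."""
    return "-".join(THREE_LETTER[residue] for residue in protein.upper())


def composition(protein: str) -> Counter:
    """Count how often each amino acid occurs in the sequence."""
    return Counter(protein.upper())


if __name__ == "__main__":
    fragment = "MYNMMETELK"  # first ten residues of the SOX2 sequence
    print(to_three_letter(fragment))
    print(composition(fragment).most_common(2))
```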

"Introduction to Bioinformatics" 64 Bioinformatics Course PROTEIN SIZE

http://www.quora.com/Protein-nutrition-1/Whats-the-average-size-of-a-human-protein-in-kDa

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1150220/

http://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Protein_structure.png/1024px-Protein_structure.png

PROTEIN DOMAINS

PROTEIN SEQUENCE

Proteins are divided into domains

>sp|P48431|SOX2_HUMAN Transcription factor SOX-2 OS=Homo sapiens GN=SOX2 PE=1 SV=1
MYNMMETELKPPGPQQTSGGGGGNSTAAAAGGNQKNSPDRVKRPMNAFMVWSRGQRRKMA
QENPKMHNSEISKRLGAEWKLLSETEKRPFIDEAKRLRALHMKEHPDYKYRPRRKTKTLM
KKDKYTLPGGLLAPGGNSMASGVGVGAGLGAGVNQRMDSYAHMNGWSNGSYSMMQDQLGY
PQHPGLNAHGAAQMQPMHRYDVSALQYNSMTSSQTYMNGSPTYSMSYSQQGTPGMALGSM
GSVVKSEASSSPPVVTSSSHSRAPCQAGDLRDMISMYLPGAEVPEPAAPSRLHMSQHYQS
GPVPGTAINGTLPLSHM

DNA BINDING DOMAIN

http://www.uniprot.org/

GENE TRANSCRIPTION, TRANSLATION AND PROTEIN SYNTHESIS

http://compbio.pbworks.com/f/central_dogma.jpg

CENTRAL DOGMA

BIOINFORMATIC APPLICATIONS

• The integrative approaches are useful and applied in
  • Agricultural
    • Higher yield in crops or fruits
    • Disease- or drought-resistant crops
  • Medical
    • To understand processes in healthy and diseased individuals
    • Genetic diseases
  • Pharmaceutical
    • To find or develop new and better drugs
    • Gene-based drugs
    • Structure-based drug design

BIOINFORMATIC QUESTIONS 1

• To identify an unknown gene of interest
• Sequence
• Is there a match to known sequence in the
• Which does it match to
• How to identify more family members
• I have a similar structure, how to identify its potential ligands
• How to identify if my gene/protein is also present in other species
• How can I identify genes that are inherited together in a specific region

BIOINFORMATIC QUESTIONS 2

• I have to construct an artificial gene: how do I design the primers, and how do I check that I have the right sequence?
• To know the structure of a poorly expressed RNA sequence
• To identify the structure and function of a protein sequence
• To cluster protein sequences into families of related sequences and develop models
• To generate phylogenetic trees to identify the evolutionary relationships using similar proteins/DNA
• To identify which other proteins interact with the sequence of interest

BIOINFORMATIC QUESTIONS 3

• Find genes that have similar expression in specific conditions
• Find transcription factors that regulate specific genes
• Visualise different gene and protein networks
• Describe the regulation of genes