An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Outline:
Molecular Biology Primer 1. How molecular biology came about? 2. Similarities: What is life made of? 3. Differences: Variation in genomes
Angela Brooks, Raymond Brown, Calvin Chen, Mike Daly, Hoa Dinh, Erinn Hama, Robert Hinman, Julio Ng, Michael Sneddon, Hoa Troung, Jerry Wang, Che Fung Yung Edited for Introduction to Bioinformatics (Autumn 2006) by Esa Pitkänen http://www.cs.helsinki.fi/mbi/courses/06-07/itb/ 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 2
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info 1. How Molecular Biology came about? Major events in the history of Molecular • Microscopic biology began in • Robert Biology 1800 - 1870 1665 Hooke • 1865 Gregor Mendel discover the basic rules of • Robert Hooke (1635-1703) heredity of garden pea. discovered organisms are • An individual organism has made up of cells two alternative heredity units for a given trait ( dominant Mendel: The Father of Genetics trait v.s. recessive trait ) • Matthias Schleiden (1804- 1881) and Theodor Schwann (1810-1882) further • 1869 Johann Friedrich expanded the study of cells Miescher discovered DNA in 1830s • Theodor and named it nuclein. • Matthias Johann Miescher Schleiden Schwann
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 3 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 4
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Major events in the history of Molecular Biology 1880 - 1900 Biology 1900-1911 • 1881 Edward Zacharias showed chromosomes are • 1902 - Emil Hermann Fischer wins Nobel prize: showed amino acids are linked and form composed of nuclein. proteins • Postulated: protein properties are defined by Emil amino acid composition and arrangement, which Fischer • 1899 Richard Altmann renamed nuclein to nucleic acid. we nowadays know as fact
• By 1900 , chemical structures of all 20 amino acids had • 1911 – Thomas Hunt Morgan discovers genes on chromosomes are the discrete units of • been identified heredity Thomas Morgan • 1911 Pheobus Aaron Theodore Lerene discovers RNA
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 5 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 6
1 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Major events in the history of Molecular Biology 1940 - 1950 Biology 1950 - 1952 • 1941 – George Beadle and • 1950s – Mahlon Bush Edward Tatum identify that genes Hoagland first to isolate tRNA make proteins Mahlon Hoagland
George Edward Beadle Tatum
• 1950 – Edwin Chargaff find Cytosine complements Guanine • 1952 – Alfred Hershey and and Adenine complements Martha Chase make genes Thymine Edwin from DNA Chargaff
Hershey Chase Experiment
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 7 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 8
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Major events in the history of Molecular Biology 1952 - 1960 Biology 1970
• 1970 Howard Temin and David • 1952-1953 James D. Baltimore independently isolate Watson and Francis H. C. James Watson the first restriction enzyme Crick deduced the double and Francis Crick helical structure of DNA • DNA can be cut into reproducible pieces with site-specific endonuclease • 1956 George Emil Palade called restriction enzymes; showed the site of enzymes • the pieces can be linked to manufacturing in the bacterial vectors and cytoplasm is made on RNA introduced into bacterial hosts. organelles called ribosomes. (gene cloning or recombinant George Emil Palade DNA technology )
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 9 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 10
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Major events in the history of Molecular Biology Biology 1970- 1977 1986 ---1995 • 1986 Leroy Hood: Developed automated sequencing • 1977 Phillip Sharp and mechanism Richard Roberts demonstrated that pre-mRNA • 1986 Human Genome Initiative Leroy Hood announced is processed by the excision Phillip Sharp Richard Roberts of introns and exons are • 1990 The 15 year Human spliced together. Genome project is launched by congress
• Joan Steitz determined that • 1995 Moderate-resolution maps the 5’ end of snRNA is of chromosomes 3, 11, 12, and partially complementary to 22 maps published (These maps provide the locations of the consensus sequence of “markers” on each chromosome 5’ splice junctions. Joan Steitz to make locating genes easier)
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 11 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 12
2 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Biology Major events in the history of Molecular Biology 19951995----19961996 1997 ---1999 • 1995 John Craig Venter: First • 1997 E. Coli sequenced bactierial genomes sequenced • 1998 PerkinsElmer, Inc.. Developed 96-capillary • 1995 Automated fluorescent sequencer sequencing instruments and John Craig Venter robotic operations • 1998 Complete sequence of the Caenorhabditis elegans genome • 1996 First eukaryotic genome- yeast-sequenced • 1999 First human chromosome (number 22) sequenced
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 13 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 14
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Biology Major events in the history of Molecular Biology 20002000----20012001 20032003----Present • 2000 Complete sequence • April 2003 Human Genome of the euchromatic portion Project Completed. Mouse of the Drosophila genome is sequenced. melanogaster genome • April 2004 Rat genome sequenced. • 2001 International Human Genome Sequencing :first draft of the sequence of the human genome published
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 15 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 16
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info 2. What is Life made of? Cells
• Fundamental working units of every living system. • Every organism is composed of one of two radically different types of cells: • prokaryotic cells or • eukaryotic cells. • Prokaryotes and Eukaryotes are descended from the same primitive cell. • All prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 17 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 18
3 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Cells Life begins with Cell
• Chemical composition -by weight • 70% water • 7% small molecules • salts • Lipids • amino acids • nucleotides • 23% macromolecules • Proteins • Polysaccharides • A cell is a smallest structural unit of an • lipids organism that is capable of independent • biochemical (metabolic) pathways functioning • translation of mRNA into proteins • All cells have some common features
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 19 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 20
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Common features of organisms All Cells have common Cycles
• Chemical energy is stored in ATP • Genetic information is encoded by DNA • Information is transcribed into RNA • There is a common triplet genetic code • Translation into proteins involves ribosomes • Shared metabolic pathways • Similar proteins among diverse groups of organisms • Born, eat, replicate, and die
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 21 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 22
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Two types of cells: Prokaryotes and Prokaryotes and Eukaryotes Eukaryotes
•According to the most recent evidence, there are three main branches to the tree of life. •Prokaryotes include Archaea (“ancient ones”) and bacteria. •Eukaryotes are kingdom Eukarya and includes plants, animals, fungi and certain algae.
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 23 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 24
4 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Prokaryotes and Eukaryotes, Signaling Pathways: Control Gene continued Activity Prokaryotes Eukaryotes • Instead of having brains, cells make decision Single cell Single or multi cell through complex networks (or pathways) of No nucleus Nucleus chemical reactions • Synthesize new materials No organelles Organelles • Break other materials down for spare parts One piece of circular DNA Chromosomes • Signal to eat or die
No mRNA post Exons/Introns splicing transcriptional modification
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 25 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 26
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Cells Information and Machinery Overview of organizations of life • Cells store all information to replicate itself • Human genome is around 3 billions base pair long • Nucleus = library • Almost every cell in human body contains same • Chromosomes = bookshelves set of genes • But not all genes are used or expressed by those • Genes = books cells • Almost every cell in an organism contains the • Machinery: same libraries and the same sets of books. • Collect and manufacture components • Books represent all the information (DNA) • Carry out replication that every cell in the body needs so it can • Kick-start its new offspring grow and carry out its various functions. (A cell is like a car factory)
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 27 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 28
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Terminology All Life depends on 3 critical molecules
• The genome is an organism’s complete set of DNA. • a bacteria contains about 600,000 DNA base pairs • DNAs (Deoxyribonucleic acid) • human and mouse genomes have some 3 billion. • Hold information on how cell works • Human genome has 24 distinct chromosomes. • RNAs (Ribonucleic acid) • Each chromosome contains many genes . • Act to transfer short pieces of information to different parts • Gene of cell • basic physical and functional units of heredity. • specific sequences of DNA bases that encode • Provide templates to synthesize into protein instructions on how to make proteins . • Proteins • Proteins • Form enzymes that send signals to other cells and regulate • Make up the cellular structure and function gene activity • large, complex molecules made up of smaller subunits • Form body’s major components (e.g. hair, skin, etc.) called amino acids.
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 29 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 30
5 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info DNA: The Code of Life DNA, continued
• DNA has a double helix structure which is composed of • sugar molecule • phosphate group • and a base (A,C,G,T)
• By convention, we read DNA strings in direction of • The structure and the four genomic letters code for all living transcription: from 5’ end to organisms 3’ end • Adenine, Guanine, Thymine, and Cytosine which pair A-T and C-G 5’ ATTTAGGCC 3’ on complimentary strands. 3’ TAAATCCGG 5’
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 31 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 32
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
DNA is packed into chromosomes Human chromosomes
• Somatic cells in humans have 2 pairs of 22 chromosomes + XX (female) or XY (male) = total of 46 chromosomes • Germline cells have 22 chromosomes + either X or Y = total of 23 • (1) Double helix DNA strand. chromosomes • (2) Chromatin strand ( DNA with histones )
• (3) Condensed chromatin during interphase with centromere . Karyogram of human male using Giemsa staining • (4) Condensed chromatin during prophase (http://en.wikipedia.org/wiki/Karyotype) • (5) Chromosome during metaphase
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 33 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 34
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Length of DNA and number of chromosomes Human Genome Composition
Organism #base pairs #chromosomes (germline)
Prokayotic Escherichia coli (bacterium) 4x10 6 1
Eukaryotic Saccharomyces cerevisia (yeast) 1.35x10 7 17 Drosophila melanogaster (insect) 1.65x10 8 4 Homo sapiens (human) 2.9x10 9 23 Zea mays (corn) 5.0x10 9 10
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 35 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 36
6 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info RNA DNA, RNA, and the Flow of • RNA is similar to DNA chemically. It is usually only a Information single strand. T(hyamine) is replaced by U(racil) ”The central dogma” • Several types of RNA exist for different functions in the Replication cell.
Transcription Translation
tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 37 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 38
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Overview of DNA to RNA to Protein Proteins
• Proteins are polypeptides (strings of amino acid residues) • Represented using strings of letters from an alphabet of 20: AEGLV…WKKLAG • Typical length 50…1000 residues • A gene is expressed in two steps
1) Transcription: RNA synthesis Urease enzyme from Helicobacter pylori 2) Translation: Protein synthesis
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 39 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 40
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info How DNA/RNA codes for protein? How DNA/RNA codes for protein?
• DNA alphabet contains four • Three of the possible letters but must specify triplets specify ”stop protein, or polypeptide translation” sequence of 20 letters. • Translation usually starts at • Dinucleotides are not triplet AUG (this also codes enough: 4 2 = 16 possible for methionine) dinucleotides • Most amino acids may be • Trinucleotides (triplets) specified by more than allow 4 3 = 64 possible triplet trinucleotides • How to find a gene? Look • Triplets are also called for start and stop codons codons (not that easy though)
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 41 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 42
7 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Proteins: Workhorses of the Cell 3. Where does the variation in genomes come from? • 20 different amino acids • different chemical properties cause the protein chains to fold up into • Prokaryotes are typically specific three-dimensional structures that define their particular haploid: they have a single functions in the cell. (circular) chromosome • Proteins do all essential work for the cell • build cellular structures • DNA is usually inherited • digest nutrients vertically (parent to • execute metabolic functions daughter) • Mediate information flow within a cell and among cellular • Inheritance is clonal communities. • Descendants are faithful • Proteins work together with other proteins or nucleic acids as copies of an ancestral DNA "molecular machines" • Variation is introduced via • structures that fit together and function in highly specific, lock- mutations, transposable and-key ways. elements, and horizontal Chromosome map of S. dysenteriae , the nine rings describe different properties of the genome transfer of DNA http://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 43 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 44
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Mitosis and meiosis Recombination and variation
• Sexual organisms are usually • Allele is a viable DNA coding diploid occupying a given locus • Germline cells (gametes) contain N chromosomes (position in the genome) • Somatic (body) cells have 2N • In recombination, alleles from chromosomes parents become suffled in • Meiosis: reduction of offspring individuals via chromosome number from 2N to chromosomal crossover over N during reproductive cycle • One chromosome doubling is Major events in meiosis • Allele combinations in offspring followed by two cell divisions http://en.wikipedia.org/wiki/Meiosis are usually different from • Mitosis: growth and development http://www.ncbi.nlm.nih.gov/About/Primer combinations found in parents of the organism • One chromosome doubling is • Recombination errors lead into followed by one cell division additional variations Chromosomal crossover as described by T. H. Morgan in 1916
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 45 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 46
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Recombination frequency and linked genes Recombination and longer time scales
Conserved synteny Syntenic blocks and segments • Genetic marker: some DNA sequence of interest Chromosome i, species B Chromosome i, species B (e.g., gene or a part of a gene) g2B g1B g3B g5B g4B g1B g2B g3B
syntenic block syntenic segment • Recombination is more likely to separate two distant Chromosome j, species C Chromosome j, species C markers than two close ones g1C g2C g3C g4C g5C g1C g2C g3C
• Linked markers: ”tend” to be inherited together • Assume that species B and C are descendants of A • Conserved synteny: group of genes linked in both B and C • Conserved segment: conserved synteny with same gene order • Marker distances measured in centimorgans: 1 • Syntenic segment: group of markers (!) linked in both B and C centimorgan corresponds to 1% chance that two • Syntenic block: set of syntenic segments which may contain set markers are separated in recombination inversions and duplications
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 47 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 48
8 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Biological string manipulation References
• Richard C. Deonier, Simon Tavaré and Michael S. Waterman. Computational • Errors get introduced to DNA during replication Genome Analysis, An Introduction. Springer, 2005. • Daniel Sam, “Greedy Algorithm” presentation. • Deletion: removal of one or more contiguous bases • Glenn Tesler, “Genome Rearrangements in Mammalian Evolution: (substring) Lessons from Human and Mouse Genomes” presentation. • Ernst Mayr, “What evolution is”. • Insertion: insertion of a substring • Neil C. Jones, Pavel A. Pevzner, “An Introduction to Bioinformatics Algorithms”. • Alberts, Bruce, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, Peter • Segmental duplication: insertion of a copy of a DNA region Walter. Molecular Biology of the Cell . New York: Garland Science. 2002. into a different location • Mount, Ellis, Barbara A. List. Milestones in Science & Technology . Phoenix: The Oryx Press. 1994. • Inversion: reversal of substring • Voet, Donald, Judith Voet, Charlotte Pratt. Fundamentals of Biochemistry . New Jersey: John Wiley & Sons, Inc. 2002. • Translocation: removal and insertion of a substring • Campbell, Neil. Biology, Third Edition . The Benjamin/Cummings Publishing Company, Inc., 1993. • Point mutation: substitution of a base • Snustad, Peter and Simmons, Michael. Principles of Genetics . John Wiley & Sons, Inc, 2003.
8.9.2006 Introduction to Bioinformatics (Autumn 2006) 49 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 50
9