<<

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Outline:

Molecular Primer 1. How came about? 2. Similarities: What is made of? 3. Differences: Variation in

Angela Brooks, Raymond Brown, Calvin Chen, Mike Daly, Hoa Dinh, Erinn Hama, Robert Hinman, Julio Ng, Michael Sneddon, Hoa Troung, Jerry Wang, Che Fung Yung Edited for Introduction to Bioinformatics (Autumn 2006) by Esa Pitkänen http://www.cs.helsinki.fi/mbi/courses/06-07/itb/ 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 2

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info 1. How Molecular Biology came about? Major events in the history of Molecular • Microscopic biology began in • Robert Biology 1800 - 1870 1665 Hooke • 1865 discover the basic rules of • (1635-1703) heredity of garden pea. discovered organisms are • An individual organism has made up of cells two alternative heredity units for a given trait ( dominant Mendel: The Father of trait v.s. recessive trait ) • Matthias Schleiden (1804- 1881) and Theodor Schwann (1810-1882) further • 1869 Johann Friedrich expanded the study of cells Miescher discovered DNA in 1830s • Theodor and named it nuclein. • Matthias Johann Miescher Schleiden Schwann

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 3 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 4

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Major events in the history of Molecular Biology 1880 - 1900 Biology 1900-1911 • 1881 Edward Zacharias showed are • 1902 - Emil Hermann Fischer wins : showed amino acids are linked and form composed of nuclein. • Postulated: properties are defined by Emil amino acid composition and arrangement, which Fischer • 1899 Richard Altmann renamed nuclein to . we nowadays know as fact

• By 1900 , chemical structures of all 20 amino acids had • 1911 – discovers on chromosomes are the discrete units of • been identified heredity Thomas Morgan • 1911 Pheobus Aaron Theodore Lerene discovers RNA

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 5 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 6

1 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Major events in the history of Molecular Biology 1940 - 1950 Biology 1950 - 1952 • 1941 – and • – Mahlon Bush identify that genes Hoagland first to isolate tRNA make proteins Mahlon Hoagland

George Edward Beadle Tatum

• 1950 – Edwin Chargaff find Cytosine complements Guanine • 1952 – and and Adenine complements Martha Chase make genes Thymine Edwin from DNA Chargaff

Hershey Chase Experiment

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 7 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 8

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Major events in the history of Molecular Biology 1952 - 1960 Biology 1970

• 1970 Howard Temin and David • 1952-1953 James D. Baltimore independently isolate Watson and Francis H. C. the first restriction enzyme Crick deduced the double and helical structure of DNA • DNA can be cut into reproducible pieces with site-specific endonuclease • 1956 called restriction enzymes; showed the site of enzymes • the pieces can be linked to manufacturing in the bacterial vectors and cytoplasm is made on RNA introduced into bacterial hosts. organelles called . ( cloning or recombinant George Emil Palade DNA technology )

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 9 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 10

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Major events in the history of Molecular Biology Biology 1970- 1977 1986 ---1995 • 1986 : Developed automated • 1977 Phillip Sharp and mechanism Richard Roberts demonstrated that pre-mRNA • 1986 Human Initiative Leroy Hood announced is processed by the excision Phillip Sharp Richard Roberts of introns and exons are • 1990 The 15 year Human spliced together. Genome project is launched by congress

• Joan Steitz determined that • 1995 Moderate-resolution maps the 5’ end of snRNA is of chromosomes 3, 11, 12, and partially complementary to 22 maps published (These maps provide the locations of the consensus sequence of “markers” on each 5’ splice junctions. Joan Steitz to make locating genes easier)

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 11 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 12

2 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Biology Major events in the history of Molecular Biology 19951995----19961996 1997 ---1999 • 1995 John : First • 1997 E. Coli sequenced bactierial genomes sequenced • 1998 PerkinsElmer, Inc.. Developed 96-capillary • 1995 Automated fluorescent sequencer sequencing instruments and John Craig Venter robotic operations • 1998 Complete sequence of the Caenorhabditis elegans genome • 1996 First eukaryotic genome- yeast-sequenced • 1999 First human chromosome (number 22) sequenced

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 13 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 14

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Major events in the history of Molecular Biology Major events in the history of Molecular Biology 20002000----20012001 20032003----Present • 2000 Complete sequence • April 2003 of the euchromatic portion Project Completed. Mouse of the genome is sequenced. melanogaster genome • April 2004 Rat genome sequenced. • 2001 International Human Genome Sequencing :first draft of the sequence of the human genome published

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 15 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 16

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info 2. What is Life made of? Cells

• Fundamental working units of every living system. • Every organism is composed of one of two radically different types of cells: • prokaryotic cells or • eukaryotic cells. • Prokaryotes and Eukaryotes are descended from the same primitive . • All prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of .

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 17 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 18

3 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Cells Life begins with Cell

• Chemical composition -by weight • 70% water • 7% small • salts • Lipids • amino acids • nucleotides • 23% macromolecules • Proteins • Polysaccharides • A cell is a smallest structural unit of an • lipids organism that is capable of independent • biochemical (metabolic) pathways functioning • translation of mRNA into proteins • All cells have some common features

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 19 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 20

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Common features of organisms All Cells have common Cycles

• Chemical energy is stored in ATP • Genetic information is encoded by DNA • Information is transcribed into RNA • There is a common triplet genetic code • Translation into proteins involves ribosomes • Shared metabolic pathways • Similar proteins among diverse groups of organisms • Born, eat, replicate, and die

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 21 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 22

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Two types of cells: Prokaryotes and Prokaryotes and Eukaryotes Eukaryotes

•According to the most recent evidence, there are three main branches to the tree of life. •Prokaryotes include Archaea (“ancient ones”) and . •Eukaryotes are kingdom Eukarya and includes plants, animals, fungi and certain algae.

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 23 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 24

4 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Prokaryotes and Eukaryotes, Signaling Pathways: Control Gene continued Activity Prokaryotes Eukaryotes • Instead of having brains, cells make decision Single cell Single or multi cell through complex networks (or pathways) of No nucleus Nucleus chemical reactions • Synthesize new materials No organelles Organelles • Break other materials down for spare parts One piece of circular DNA Chromosomes • Signal to eat or die

No mRNA post Exons/Introns splicing transcriptional modification

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 25 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 26

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Cells Information and Machinery Overview of organizations of life • Cells store all information to replicate itself • Human genome is around 3 billions base pair long • Nucleus = library • Almost every cell in human body contains same • Chromosomes = bookshelves set of genes • But not all genes are used or expressed by those • Genes = books cells • Almost every cell in an organism contains the • Machinery: same libraries and the same sets of books. • Collect and manufacture components • Books represent all the information (DNA) • Carry out replication that every cell in the body needs so it can • Kick-start its new offspring grow and carry out its various functions. (A cell is like a car factory)

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 27 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 28

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Terminology All Life depends on 3 critical molecules

• The genome is an organism’s complete set of DNA. • a bacteria contains about 600,000 DNA base pairs • (Deoxyribonucleic acid) • human and mouse genomes have some 3 billion. • Hold information on how cell works • Human genome has 24 distinct chromosomes. • RNAs (Ribonucleic acid) • Each chromosome contains many genes . • Act to transfer short pieces of information to different parts • Gene of cell • basic physical and functional units of heredity. • specific sequences of DNA bases that encode • Provide templates to synthesize into protein instructions on how to make proteins . • Proteins • Proteins • Form enzymes that send signals to other cells and regulate • Make up the cellular structure and function gene activity • large, complex molecules made up of smaller subunits • Form body’s major components (e.g. hair, skin, etc.) called amino acids.

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 29 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 30

5 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info DNA: The Code of Life DNA, continued

• DNA has a structure which is composed of • sugar • phosphate group • and a base (A,C,G,T)

• By convention, we read DNA strings in direction of • The structure and the four genomic letters code for all living transcription: from 5’ end to organisms 3’ end • Adenine, Guanine, Thymine, and Cytosine which pair A-T and C-G 5’ ATTTAGGCC 3’ on complimentary strands. 3’ TAAATCCGG 5’

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 31 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 32

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

DNA is packed into chromosomes Human chromosomes

• Somatic cells in humans have 2 pairs of 22 chromosomes + XX (female) or XY (male) = total of 46 chromosomes • Germline cells have 22 chromosomes + either X or Y = total of 23 • (1) Double helix DNA strand. chromosomes • (2) strand ( DNA with histones )

• (3) Condensed chromatin during interphase with centromere . Karyogram of human male using Giemsa staining • (4) Condensed chromatin during prophase (http://en.wikipedia.org/wiki/Karyotype) • (5) Chromosome during metaphase

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 33 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 34

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Length of DNA and number of chromosomes Human Genome Composition

Organism #base pairs #chromosomes (germline)

Prokayotic (bacterium) 4x10 6 1

Eukaryotic Saccharomyces cerevisia (yeast) 1.35x10 7 17 Drosophila melanogaster (insect) 1.65x10 8 4 Homo sapiens (human) 2.9x10 9 23 Zea mays (corn) 5.0x10 9 10

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 35 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 36

6 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info RNA DNA, RNA, and the Flow of • RNA is similar to DNA chemically. It is usually only a Information single strand. T(hyamine) is replaced by U(racil) ”The central dogma” • Several types of RNA exist for different functions in the Replication cell.

Transcription Translation

tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 37 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 38

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Overview of DNA to RNA to Protein Proteins

• Proteins are polypeptides (strings of amino acid residues) • Represented using strings of letters from an alphabet of 20: AEGLV…WKKLAG • Typical length 50…1000 residues • A gene is expressed in two steps

1) Transcription: RNA synthesis Urease enzyme from Helicobacter pylori 2) Translation: Protein synthesis

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 39 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 40

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info How DNA/RNA codes for protein? How DNA/RNA codes for protein?

• DNA alphabet contains four • Three of the possible letters but must specify triplets specify ”stop protein, or polypeptide translation” sequence of 20 letters. • Translation usually starts at • Dinucleotides are not triplet AUG (this also codes enough: 4 2 = 16 possible for methionine) dinucleotides • Most amino acids may be • Trinucleotides (triplets) specified by more than allow 4 3 = 64 possible triplet trinucleotides • How to find a gene? Look • Triplets are also called for start and stop codons codons (not that easy though)

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 41 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 42

7 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Proteins: Workhorses of the Cell 3. Where does the variation in genomes come from? • 20 different amino acids • different chemical properties cause the protein chains to fold up into • Prokaryotes are typically specific three-dimensional structures that define their particular haploid: they have a single functions in the cell. (circular) chromosome • Proteins do all essential work for the cell • build cellular structures • DNA is usually inherited • digest nutrients vertically (parent to • execute metabolic functions daughter) • Mediate information flow within a cell and among cellular • Inheritance is clonal communities. • Descendants are faithful • Proteins work together with other proteins or nucleic acids as copies of an ancestral DNA "molecular machines" • Variation is introduced via • structures that fit together and function in highly specific, lock- , transposable and-key ways. elements, and horizontal Chromosome map of S. dysenteriae , the nine rings describe different properties of the genome transfer of DNA http://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 43 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 44

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Mitosis and meiosis Recombination and variation

• Sexual organisms are usually • Allele is a viable DNA coding diploid occupying a given locus • Germline cells (gametes) contain N chromosomes (position in the genome) • Somatic (body) cells have 2N • In recombination, alleles from chromosomes parents become suffled in • Meiosis: reduction of offspring individuals via chromosome number from 2N to chromosomal crossover over N during reproductive cycle • One chromosome doubling is Major events in meiosis • Allele combinations in offspring followed by two cell divisions http://en.wikipedia.org/wiki/Meiosis are usually different from • Mitosis: growth and development http://www.ncbi.nlm.nih.gov/About/Primer combinations found in parents of the organism • One chromosome doubling is • Recombination errors lead into followed by one cell division additional variations Chromosomal crossover as described by T. H. Morgan in 1916

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 45 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 46

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Recombination frequency and linked genes Recombination and longer time scales

Conserved synteny Syntenic blocks and segments • Genetic marker: some DNA sequence of interest Chromosome i, species B Chromosome i, species B (e.g., gene or a part of a gene) g2B g1B g3B g5B g4B g1B g2B g3B

syntenic block syntenic segment • Recombination is more likely to separate two distant Chromosome j, species C Chromosome j, species C markers than two close ones g1C g2C g3C g4C g5C g1C g2C g3C

• Linked markers: ”tend” to be inherited together • Assume that species B and C are descendants of A • Conserved synteny: group of genes linked in both B and C • Conserved segment: conserved synteny with same gene order • Marker distances measured in centimorgans: 1 • Syntenic segment: group of markers (!) linked in both B and C centimorgan corresponds to 1% chance that two • Syntenic block: set of syntenic segments which may contain set markers are separated in recombination inversions and duplications

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 47 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 48

8 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Biological string manipulation References

• Richard C. Deonier, Simon Tavaré and Michael S. Waterman. Computational • Errors get introduced to DNA during replication Genome Analysis, An Introduction. Springer, 2005. • Daniel Sam, “Greedy Algorithm” presentation. • Deletion: removal of one or more contiguous bases • Glenn Tesler, “Genome Rearrangements in Mammalian Evolution: (substring) Lessons from Human and Mouse Genomes” presentation. • , “What evolution is”. • Insertion: insertion of a substring • Neil C. Jones, Pavel A. Pevzner, “An Introduction to Bioinformatics Algorithms”. • Alberts, Bruce, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, Peter • Segmental duplication: insertion of a copy of a DNA region Walter. Molecular Biology of the Cell . New York: Garland . 2002. into a different location • Mount, Ellis, Barbara A. List. Milestones in Science & Technology . Phoenix: The Oryx Press. 1994. • Inversion: reversal of substring • Voet, Donald, Judith Voet, Charlotte Pratt. Fundamentals of . New Jersey: John Wiley & Sons, Inc. 2002. • Translocation: removal and insertion of a substring • Campbell, Neil. Biology, Third Edition . The Benjamin/Cummings Publishing Company, Inc., 1993. • Point : substitution of a base • Snustad, Peter and Simmons, Michael. Principles of Genetics . John Wiley & Sons, Inc, 2003.

8.9.2006 Introduction to Bioinformatics (Autumn 2006) 49 8.9.2006 Introduction to Bioinformatics (Autumn 2006) 50

9