Signal Processing Problems in Genomics

Signal Processing Problems in Genomics P. P. Vaidyanathan California Institute of Technology Pasadena, CA P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Why is genomics interesting for the signal processing person? Because there are sequences there! OK, what sort of sequences? 1. Sequences from an alphabet of size four: …ATTCGAAGATTTCAACGGGAAAA… DNA 2. Sequences from an alphabet of size twenty: AACWYDEFGHIKLMNPQRSTVAPPQR P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Protein Size-4 alphabet: A, C, T, G: bases (also called or nucleotides) DNA sequences (genomes) are made of these. Genes are parts of DNA, and are 4-letter sequences. Adenine Thymine Cytosine Guanine or Uracil (in RNA) DNA: deoxyribonucleic acid RNA:ribonucleic acid P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover DNA molecule in the living cell (usually in nucleus) Complementary Strands in the Double Helix A T C G Hydrogen bond Great place Alberts, et. al.,Essential Cell Biology,Garland publishing, Inc.,1998 to get started, and a great reference Alberts, Bray, Johnson, Lewis, Raff, Roberts, and Walter A good introductory article (signal processing aspects) Dimitris Anastassiou, IEEE Signal Processing Magazine, July 2001 Size-20 alphabet: ACDEFGHIKLMNPQRSTVWY: amino acids (B,J,O,U,X,Z missing) Proteins are sequences made of these letters. 20-letter proteins and 4-letter DNA are common to all life P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover The twenty natural amino acids (B,J,O,U,X,Z missing) 11 essential amino acids. Animals cannot make the eleven indicated amino acids. They need to eat them; Milk provides all of these. Grains and beans together provide all of these. P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Protein Example Fibroblast growth factor proteins Basic bovine length 146 Acidic bovine length 140 Will return to these and talk about their Fourier transforms P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Outline • Molecular biology background • Computational gene-finding • Spectral analysis (Fourier, wavelet, correlations) • ! Hidden Markov Models and sequence analysis • New world of non-coding genes • References Will try to cover the cream of it. P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover DNA schematic P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Bacterial DNA: few million bases; Human DNA: three billion bases If we write the bases as letter-sized objects: • Bacterial DNA takes up the space of about 50 average novels. • Human DNA takes about 2000 novels. Actual physical size: • human DNA in any cell stretches out to 2 yards. • DNA in all 5 trillion cells in humans: ... sun Covers it 50 ACTTAAGGCCAAAGATCAGG times over earth P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover What do genes do? top strand bottom strand Intergenic spaces; contain A,T,C,G too! 1 2 3 Top strand, let’s say protein 1 protein 2 protein 3 Lots of protein in the cell, inside and outside nucleus P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover All cells in a given organism have the same DNA; same set of genes. But different genes are expressed(i.e., functional) in different cells. That’s why brain cells look different from blood cells, and so forth. Brain cell http://www-biology.ucsd.edu/news/article_112901.html Red blood cells http://www.cellsalive.com/gallery.htm When a gene is expressed, it gives instructions to the cell to make a particular protein. Each gene makes a different protein. Example of a Protein: Hemoglobin (oxy, human) http://www.biochem.szote.u-szeged.hu/astrojan/protein2.htm Sequence of amino acids. Folds into beautiful 3D shapes. Necessary for function. Example of a protein (an enzyme) http://www.biochem.szote.u-szeged.hu/astrojan/protein2.htm some other molecule, e.g., ligand F! its like a puzzle piece. That’s how beautifully enzymes work! protein molecule P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Generation of a protein from a gene intergenic spaces Top strand; millions of bases T replaced with U A,C,T,G A,C,U,G 1 2 3 4 5 Gene copied sequence into mRNA (transcription) A,C,U,G 1 2 3 4 5 sequence reduced mRNA (introns removed by splicing) P. P. Vaidyanathan, Converted to protein by tRNA and ribosome ISCAS Plenary, 5/24/2004,Vancover (translation from 4- language to 20-language) Generation of a protein from a gene cell ds-D!NA double strand opened up, one strand copied as an RNA introns removed and mRNA reduced by mRNA nucleus splicing ribosome converts mRNA into protein mRNA In this process the ribosome tRNA works with a molecule called tRNA which transfers groups of ribosome 3 bases (codons) in the mRNA protein into amino acids that make up the protein The protein folds beautifully into its 3D structure which depends only on the amino acid sequence (and pH of medium). Now it is P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover ready to function. Central dogma of molecular biology (Crick) transcript DNA mRNA translate protein Pioneers: Beadle and Tatum, Bread mold experiment (1942) In recent years the central dogma has been challenged! P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Role of codons Gene from DNA scanned from 5’ to 3’ end: 5’ ATGGAAGTGGCAATGATCCTGAATTTAACGTACTAG 3’ The gene is interpreted in groups of three bases called codons. 5’ end 3’ end ATGGAAGTGGCAATGATCCTGAATTTAACGTACTAG gene E V A M I L N L T Y Protein ATG: start codon; also codon for M (met); plays two roles TAA, TAG, TGA : stop codons (do not code for amino acids). Typically genes are long (1000s of bases); proteins have 100s to 1000s of amin acids P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover codon amino acid The genetic code The genetic code is common to ALL life! Mutations in genes can cause disease Gene HBB creates the protein beta globin in hemoglobin of red blood cells. This gene is 1600 bases long, and the spliced mRNA 626 bases long. A single error in this sequence is responsible for sickle cell anemia. http://www.ornl.gov/sci/techresources/Human_Genome/posters/chromosome/hbb.shtml cell DNA replicates itself when ds-D!NA the cell divides. AATATAGACCGACCCTAAGTAAAATAGACCTAGTAGA nucleus 1 error per billion bases. -9 P e = 10 Built-in proof reading system called mismatch-pair system P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Parcel service, first class mail: 13 late deliveries out of 100 parcels Airline luggage: 1 lost bag per 200 Professional typist: 1 mistake in 250 characters Driving in the US: 1 death per 10,000 people per year DNA replication: 1 error per billion bases copied Speaker giving a talk: 1 erorr per slide P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Beginning of the history of molecular biology: J. D. Watson, and F. H. C. Crick, A structure for DNA, Nature, 4/1953 http://www.pbs.org/wgbh/nova/photo51/before.html End of this part Outline • Molecular biology background • Computational gene-finding • Spectral analysis (Fourier, wavelet, correlations) • ! Hidden Markov Models and sequence analysis • New world of non-coding genes • References P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Indicator sequences DNA AACTGGCATCCGGGAATAAGGTC xA (n) 1 1 0 00 0 0 10 0 00 0 0 1 10 1 1 0 0 00 Indicator sequence for base A Similarly define x T (n) x C (n) x G (n) = xA (n) + x T (n) + x C (n) + x G (n) 1 Fourier transforms: jw jw jw jw X A (e ) X T (e ) X C (e ) X G (e ) P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Fourier transforms: jw jw jw jw X A (e ) X T (e ) X C (e ) X G (e ) Define S(e j w ) to be the sum-of-magnitude squares. In protein coding regions this exhibits a peak at 2p /3. Period-3 property. jw S(e ) Even the plot of one base, e.g., X G reveals this! w 0 2p/3 p Coding region of length N=1320 inside a genome of baker’s yeast (S. cerevisiae). Tiwari, et. al., CABIOS, 1997. Dimitris Anastassiou, IEEE Signal Processing Magazine, July 2001 Period-3 property arises from the special bias built into the genetic code. Some bases dominate at certain positions, e.g., base G is dominant at positions 1 and 2. The mapping from amino acids to codons is many-to-one P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover intergenic spaces Top strand; millions of bases A,C,T,G A,C,T,G 1 2 3 4 5 sequence Short-time Fourier transform p w 2p/3 0 Base location So we can locate exons using STFT How to choose window size? Usual time-frequency resolution tradeoff P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Filtering interpretation Take any base, say G: x G (n) 1 1 0 00 0 0 10 0 00 0 0 1 10 1 1 0 0 00 0 1 10 1 1 0 0 1 10 1 1 0 N w(n) Sliding window Frequency response magnitude x G (n) y G (n) filter with impulse response h(n) 2p/N P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Spectrum at 2p/3 as a function of base location Gene F56F11.4 in the C-elegans chromosome III Vaidyanathan and Yoon, J. of the Franklin Inst., Elsevier Ltd., 2004. Return to the filtering interpretation How about designing filters to improve time-frequency resolution? Interesting DSP problem! P. P. Vaidyanathan, ISCAS Plenary, 5/24/2004,Vancover Notch Antinotch Allpass: Define two filters: Vaidyanathan and Yoon, J.

Signal Processing Problems in Genomics

Functional Aspects and Genomic Analysis

Gene Prediction Using Deep Learning

Complete Genome Sequence of the Hyperthermophilic Bacteria- Thermotoga Sp

Cep-2020-00633.Pdf

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D

Genes & Gene Finding

Computational Gene Finding

Scheme and Syllabus of B. Tech. Biotechnology (BT) Batch 2011 Onwards

Basics of Molecular Biology

Computational Gene Prediction

Yersinia Pestis Genes Essential Across a Narrow Or a Broad Range of Environmental Conditions Nicola J

Introduction to Ab Initio and Evidence-Based Gene Finding