Genefinding.Pdf

Gene Finding

GenBank Growth
• In 2003
  • ~31 million sequences
  • ~37 billion base pairs

GenBank: Exponential Growth
[Figure: Growth of GenBank in billions of base pairs from release 3 in April of 1994 to the current release, 142.]
[Figure: Number of base pairs of sequence in GenBank release 142 for selected organisms.]

Human Genome
• Human genome = $3.4 \times 10^9$ bp
• Number of genes = 30,000 - 100,000
• Gene percentage ≈ 1%

Gene Finding

Genome Annotation
[Figure: annotated genomic region (scale in kb, 60 to 140) showing single-exon genes and initial, internal, and terminal exons.]
Annotation amounts to finding:
- content segments (exons, i.e. amino-acid-coding regions)
- signal segments (promoter, UTR, donor/acceptor site, polyadenylation, etc.)

The Central Dogma
• RNA polymerase performs transcription
• Ribosomes perform translation

Transcription
• Transcription starts a few bp before the coding region and ends a few bp after the coding region
• Segments before and after the coding region are called untranslated regions (UTRs)
• RNA polymerase starts transcription by binding to promoter regions before the transcription start sites
• Promoters have signals which help cells control the expression of different genes

Translation
• Ribosomes produce amino acid sequences by translating information in the mRNA
• Translation is based on the genetic code
  • One start codon
  • Three stop codons

The Genetic Code
[Figure: table of the genetic code, with the start codon marked.]

Prokaryotic vs. Eukaryotic Genes

Prokaryotes                  Eukaryotes
small genomes                large genomes
high gene density            low gene density
no introns (or splicing)     introns (splicing)
no RNA processing            RNA processing

Eukaryotic and Prokaryotic Gene Structure
[Figure: comparison of eukaryotic and prokaryotic gene structure.]

Gene Finding in Prokaryotes
• Most of the DNA sequence codes for proteins
• 70% of the H. influenzae genome is coding
• Each gene is one continuous stretch of base pairs (no introns)

Prokaryotic DNA
[Figure: prokaryotic gene layout with labels: promoter, transcription start site, 5' UTR, start codon, coding region, stop codon, 3' UTR.]

Open Reading Frames (ORFs)
• Any sequence of nucleotides can be translated in three possible ways, depending on where the coding starts. For ACCUUAGCGUA:
  • Frame 1: ACC UUA GCG gives Thr-Leu-Ala
  • Frame 2: CCU UAG CGU gives Pro-Stop-Arg
  • Frame 3: CUU AGC GUA gives Leu-Ser-Val
• An open reading frame (ORF) is a stretch of sequence, read in one frame, that contains no stop codons

Finding Long ORFs
• Examine frequencies of stop codons to distinguish between coding and non-coding regions
• Assuming a distribution on the codons, one can calculate how often a stop codon is expected to be observed (e.g., under a uniform distribution over the 64 codons, a stop codon is expected about once every 64/3 ≈ 21 codons, so much longer stop-free stretches are unlikely to arise by chance)
• Scan the DNA sequence, looking for long ORFs in all three reading frames
• Once a stop codon is detected, scan backwards to the start codon

Shortcomings of the Algorithm
• Fails to detect very short genes
• Fails to detect overlapping long ORFs on opposite strands
• #ORFs >> #genes (6500 ORFs in E. coli, compared to only 1100 genes)

Detecting Coding Regions
• Consider the frequencies of each of the codons in coding regions
• Amino acids Leu, Ala, and Trp are coded by 6, 4, and 1 different codons, respectively
• Expected frequency ratio: 6:4:1
• However, in a protein, they appear with ratio 6.9:6.5:1
• Coding DNA is not random

Using Markov Chains
• Two Markov chains
  • G: "inside" a coding ORF
  • R: "inside" a non-coding ORF (NORF)
• Given a sequence X, compute the ratio

$$\log \frac{P(X \mid G)}{P(X \mid R)} = \sum_i \log \frac{A^G_{X_i X_{i+1}}}{A^R_{X_i X_{i+1}}}$$

  where $A^G$ and $A^R$ denote the transition probabilities of the chains G and R

ORFs as Markov Chains
• Given the set of all ORFs in a sequence
  • translate each ORF into a codon sequence
  • get a 64-state Markov chain
• Transition probabilities: probability of codon D being followed by codon E
• Using this Markov chain, one can calculate the probability that a given ORF is a coding region

Using Codon Frequencies
Coding sequence: $a_1 b_1 c_1, a_2 b_2 c_2, \ldots, a_n b_n c_n$
Probabilities of the three possible reading frames, where $f_{abc}$ is the frequency of codon $abc$:

$$p_1 = f_{a_1 b_1 c_1} \times \cdots \times f_{a_n b_n c_n}$$
$$p_2 = f_{b_1 c_1 a_2} \times \cdots \times f_{b_n c_n a_{n+1}}$$
$$p_3 = f_{c_1 a_2 b_2} \times \cdots \times f_{c_n a_{n+1} b_{n+1}}$$

Probability that reading frame #i is the coding reading frame:

$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$

• Algorithm:
  • slide a window and compute the probabilities
• Problems:
  • Dependence on codon frequency of already found genes
  • Detecting horizontal gene transfer (HGT) and other heterogeneous regions

Detecting Promoter Regions
• Special sequences (with high frequency) in the promoter region
• In E. coli
  • TTGACA appears 35 bases before the transcription start point
  • TATAAT appears 12 bases before the transcription start point
  • Both strings appear with high frequency

Another Approach
• Define two models for "inside" and "outside" promoter regions
• Calculate the log-likelihood ratio

$$\log \frac{P(X \mid \text{inside promoter})}{P(X \mid \text{outside promoter})}$$

Prokaryotic Gene Finding Tools
• Glimmer: http://www.tigr.org/~salzberg/glimmer.html
• GeneMark: http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
• Critica: http://www.ttaxus.com/index.php?pagename=Software
• ORNL Annotation Pipeline: http://compbio.ornl.gov/GP3/pro.shtml

Gene Finding in Eukaryotes

Eukaryotic DNA
[Figure: eukaryotic gene layout with labels: promoter, transcription start site, 5' UTR, start codon, exons, introns, donor site, acceptor site, stop codon, 3' UTR, polyA.]

Typical Figures: Vertebrate
• Gene: ~30 kb
• Coding region: ~1-2 kb
• ~6 exons, each ~150 bp
• Promoter is ~6 bp long and lies ~30 bp before the transcription start site
• Deviation from those figures is huge (dystrophin gene is ~2.4 Mb, etc.)
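Returning briefly to the prokaryotic case: the three-frame translation from the Open Reading Frames slide and the long-ORF scan from Finding Long ORFs can be sketched in Python as below. This is a minimal illustration, not the method of any tool listed above. The names `translate_frames`, `find_orfs`, and `min_codons` are hypothetical. The sketch assumes the standard genetic code, scans the forward strand only, and takes the first ATG after the previous stop codon as the start of an ORF.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def translate_frames(seq):
    """Translate a sequence in the three forward reading frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    frames = []
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append("".join(CODON_TABLE[c] for c in codons))
    return frames


def find_orfs(seq, min_codons=2):
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    Assumption: an ORF starts at the first ATG seen after the previous stop
    codon in the same frame; min_codons is a hypothetical length threshold
    counted from the start codon up to (not including) the stop codon.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # a stop codon closes the current ORF
    return sorted(orfs)


print(translate_frames("ACCUUAGCGUA"))  # ['TLA', 'P*R', 'LSV']
print(find_orfs("CCATGAAATTTTAGCC"))    # [(2, 14)]
```

The first call reproduces the slide's example (Thr-Leu-Ala, Pro-Stop-Arg, Leu-Ser-Val) in one-letter amino acid codes. A realistic scan would also process the reverse complement and use a much larger length threshold.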
Distribution of Introns/Exons in Human Genome
[Figure: distribution of introns and exons in the human genome.] Sakharkar et al., In Silico Biology, 2004

Evolution of Gene Finding Tools
[Figure: timeline of gene finding tools from 1982 to 2004, with categories extrinsic, intrinsic, and hybrid. Ab initio: Genie (1996), Genscan (1997). Alignment-based (cDNA, protein, DNA): ExoFish, Procrustes, GenieEST, GenieESTHOM. Comparative genomics: informant HMM-based (Rosetta, Twinscan), pair-HMM (Slam, DoubleScan), phylo-HMM (Siepel-Haussler, Jojic-Haussler).]

Performance Evaluation
Benchmark training set (Burset-Guigo):
• 570 vertebrate sequences
• 6 accuracy measures
• Sensitivity and specificity are the most common
At different levels:
• Nucleotide
• Exon
• Gene (unreliable)

                     actual positive         actual negative
predicted positive   true positives (TP)     false positives (FP)
predicted negative   false negatives (FN)    true negatives (TN)

$$\text{sensitivity} = \frac{TP}{\text{actual positives}} = \frac{TP}{TP + FN}$$
$$\text{specificity} = \frac{TP}{\text{predicted positives}} = \frac{TP}{TP + FP}$$

(Specificity as defined here, the usual convention in gene finding, is the quantity called precision elsewhere.)

A Solution: Markov Models
• Usually a window of width 6, and a table of $4^6$ observed frequencies for coding regions, and another for non-coding regions
• (fifth-order Markov model)
• Homogeneous model: does not take into account any ORF information
• Non-homogeneous model: different tables for the three possible reading frames
• Problem when exons are too short

Splicing and Splice Junctions
• Splicing is the removal of introns by molecular complexes called spliceosomes
• Spliceosomes contain both proteins and snRNA
• snRNA recognizes the splice sites through RNA-RNA base-pairing
• Splice site recognition must be precise (a shift in the reading frame changes the encoded message)
• Many genes have alternative splicing: the exons used differ among the different variants
• Alternative splicing happens in more than 50% of the genes (on average, a gene has more than two variants)

Consensus Eukaryotic Gene Sequence

AGGUAAGU ............. CTGAC ......... CAGG
Donor site             Branch point    Acceptor site

[Figure annotations: "100% frequency", "C-T rich", "15bp".]
• This typical structure leads naturally to an algorithm based on position-specific weight matrices
• However, this does not exploit all the information (reading frames, intron/exon states, etc.)
• Not suitable for short genes

An HMM Solution
In an HMM, the length of an exon follows a geometric distribution:

$$P(\text{exon of length } k) = p^k (1 - p)$$

However, exon length does not seem to have a geometric distribution

Solution
• Generalized Hidden Markov Model (GHMM)

GHMMs
• The output of a state is a string of finite length
• For a given state, the output string and its length are randomly chosen according to some probability distribution
• Different states may have different probability distributions

GHMMs
• A finite set $Q$ of states
• Initial state probability distribution $\pi$
• Transition probabilities $T_{ij}$, for $i, j \in Q$
• Length distribution $f$ of the states ($f_q$ is the length distribution of state $q$)
• Probabilistic models for each of the states, according to which output strings are generated upon visiting a state

GenScan Model
[Figure: state diagram of the GenScan model.] Burge & Karlin, 1997

Gene Prediction Using the GenScan Model
• A parse $\Phi$ of a sequence $S$ of length $L$ is a sequence of states $(q_1, \ldots, q_t)$ with an associated duration $d_i$ for each state, where

$$L = \sum_{i=1}^{t} d_i$$

• Intuitively, a parse is an annotation of a sequence, matching each region with a functional unit of a gene
• Let $\Phi$ be a parse of sequence $S$
• $P(S_i \mid d_i)$: probability of generating $S_i$ by the sequence generation model of state $q_i$ with length $d_i$
• The probability of generating $S$ based on $\Phi$ is

$$P(\Phi, S) = \pi_{q_1} f_{q_1}(d_1) P(S_1 \mid d_1) \prod_{k=2}^{t} T_{q_{k-1} q_k} f_{q_k}(d_k) P(S_k \mid d_k)$$

• We have

$$P(\Phi \mid S) = \frac{P(\Phi, S)}{P(S)} = \frac{P(\Phi, S)}{\sum_{\Phi_i} P(\Phi_i, S)}$$

  where the sum runs over all parses $\Phi_i$ of length $L$

GENSCAN
• A computer program for gene identification
• Uses the GHMM model described earlier
• Uses a set of completely sequenced genes from GenBank for training

C+G Content
• The training set is divided into four categories depending on the C+G content of the sequence
• For each of the categories, separate initial state probabilities, transition probabilities, and state length distributions are computed

Initial State Probabilities
• Initial state probabilities should be close to the frequencies with which various functional units occur in the actual genomic data
• For example, if the estimated frequency of the non-coding intergenic region is 80%, then the initial probability for the state N (in the GenScan model) must be around 0.8

Transition Probabilities
• Vary quite a bit with the C+G content
• Computed separately for each of the categories
• Transitions must be biologically permissible
• E.g., transition from a P+ to an F+ state must have probability 1

State Length Distributions
• Different functional units on a gene have different lengths
• E.g., average exon length = 150 bp; introns of length 1 kb are common
• Separate distributions for intron states in each category
• Different distributions for initial/internal/terminal exons
• Geometric distributions are used for UTR states

Signal Models
• Different signal models for different functional units
• Weighted Matrix Model (WMM), where every position has its own independent distribution: polyA, translation initiation and termination, promoters
• Weighted Array Model (WAM), which allows for dependencies between adjacent positions: splice sites
• Etc.
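To make the WMM idea from the Signal Models list concrete, the sketch below estimates one independent base distribution per position from a set of aligned example sites and scores a candidate window by its log-odds against a background. It is an illustration under stated assumptions, not GENSCAN's actual implementation. The function names, the pseudocount, the uniform background of 0.25 per base, and the four toy training sites are all hypothetical.

```python
import math


def train_wmm(sites, pseudocount=1.0):
    """Estimate one independent base distribution per position.

    sites: equal-length aligned example signals (toy data in this sketch).
    pseudocount: assumed smoothing constant so no probability is zero.
    """
    model = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        model.append({base: counts[base] / total for base in "ACGT"})
    return model


def wmm_log_odds(model, window, background=0.25):
    """Sum of per-position log-odds: positions are treated as independent."""
    return sum(math.log(column[base] / background)
               for column, base in zip(model, window))


def best_site(model, seq):
    """Slide the model along seq and return (best score, start index)."""
    width = len(model)
    return max((wmm_log_odds(model, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Hypothetical aligned training sites, loosely modelled on the TATAAT box.
toy_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
model = train_wmm(toy_sites)
score, start = best_site(model, "GGCGTATAATGCC")
print(start)  # 4
```

Because each position contributes its own term, the score is a plain sum. A WAM would instead condition each position's distribution on the preceding base, which is how it captures the dependencies between adjacent positions mentioned above.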
