Gene Prediction.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
Gene Prediction Srivani Narra Indian Institute of Technology Kanpur Email: [email protected] Supervisor: Prof. Harish Karnick Indian Institute of Technology Kanpur Email: [email protected] Keywords: DNA, Hidden Markov Models, Neural Networks, Prokaryotes, Eukaryotes, E.Coli. Abstract: The gene identification problem is the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes. This problem is of self-evident importance, and is far from being fully solved, particularly for higher Eukaryotes. With the advent of whole-genome sequencing projects, there is considerable use for programs that scan genomic DNA sequences to find genes, particularly those that encode proteins. Even though there is no substitute to experimentation in determining the exact locations of genes in the genome sequence, a prior knowledge of the approximate location of genes will fasten the process to a great extent apart from saving a huge amount of laboratory time and resources. Hidden Markov Models have been developed to find protein coding genes in DNA. Three models are developed with increasing complexity. First model includes states that model the codons and their frequencies in the genes. Second model includes states for amino acids with corresponding codons as the observation events. The third model, in addition, includes states that model Shine-Delgarno motif and Repetitive Extragenic Palindromic sequences (REPs). In order to compare the performance, an ANN is also developed to classify a set of nucleotide sequences into genes and non-coding regions. 1 A primer on molecular biology Cells are fundamental working units of every living system. All instructions needed to direct their activities are contained within the chemical DNA (Deoxyribonucleic Acid). DNA from all organisms is made up of the same chemical and physical components. The DNA sequence is the particular side-by-side arrangement of bases along the DNA strand. (E.g.: ATTCCGGA) This order spells out the extra instructions required to create a particular organism with its own unique traits. The genome is an organism's complete set of DNA. However only fragments of genome are responsible for the functioning of the cell. These fragments, called genes, are the basic physical and functional units of heredity. Genes are made up of a contiguous set of codons, each of which specifies an amino acid. (Three consecutive nucleotide bases in a DNA sequence constitute a ‘codon’; for example, 'AGT' and 'ACG' are two consecutive codons in the DNA fragment AGTACGT. Of the 64 possible different arrangements of the four nucleotides (A, T, G, C) in sets of three, three (UAA, UAG, UGA) functionally act as periods to translating ribosome in that they cause the translation to stop. These three codons are therefore termed as 'stop codons'. Similarly, one codon of the genetic code, namely ATG, is reserved as start codon, though GTG, CTG, TTG are also rarely observed.) Genes translate into proteins and these proteins perform most life functions and even make up the majority of cellular structures. [reference] 2 Introduction With the advent of whole-genome sequencing projects, there is considerable use for programs that scan genomic DNA sequences to find genes. Even though there is no substitute to experimentation in determining the exact locations of genes in the genome sequence, a prior knowledge of the approximate location of genes will speed up the process to a great extent thereby saving huge amount of laboratory time and resources. The simplest method of finding DNA sequences that encode proteins is to search for open reading frames, or ORFs. An ORF is a length of DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid[reference], and end- marked by a start codon and stop codon which mark the beginning and the end of a gene respectively. However, biologically, not every ORF is a coding region. Only few ORFs with a certain minimum length and with a specific composition can translate into proteins. And hence, this method fails when there are a large number of flanking stop codons, resulting in many small ORFs. DNA sequences that encode protein are not random chain of available codons for an amino acid, but rather an ordered list of specific codons that reflect the evolutionary origin of the gene and constraints associated with gene expression. This nonrandom property of coding sequences can be used to advantage for finding regions in DNA sequences that encode proteins (Ficket and tung 1992). Each species also has a characteristic patten of use of synonymous codons (Wada et al 1992). Also, there is a strong preference for certain codon pairs within a coding region. The various methods for finding genes maybe classified into the following categories : Sequence similarity search: One of the oldest methods of gene identification, based on sequence conservation due to functional constraint, is to search for regions of similarity between the sequence under study (or its conceptual translation) and the sequences of known genes (or their protein products). [Robinson et al. (1994)]. The obvious disadvantage of this method is that when no homologues to the new gene are to be found in the databases, similarity search will yield little or no useful information. Based on statistical regularities: The following measures have been used to encapsulate the features of genes - codon usage measure, hexamer-n measure, hexamer measure, open reading frame measure, amino acid usage measure, Diamino acid usage measure and many more. Combining several measures does improve accuracy. Using signals: Any portion of the DNA whose binding by another biochemical plays a key role in transcription is called a signal. The collection of all specific instances of some particular kind of signal will normally tend to be recognizably similar. Our aim in this project is to use machine learning techniques like Hidden Markov Models and Neural Networks to capture the above mentioned properties for gene recognition and prediction. 3 Issues in Gene Prediction Three types of posttranscriptional events influence the translation of mRNA into protein and the accuracy of gene prediction. First, the genetic code of a given genome may vary from the universal code. [36, 37]Second, one tissue may splice a given mRNA differently from another, thus creating two similar but also partially different mRNAs encoding two related but different proteins. Third, mRNAs may be edited, changing the sequence of the mRNA and, as a result, of the encoded protein. Such changes also depend on interaction of RNA with RNA-binding proteins. Then there are issues of frame-shifts, insertions and deletions of bases, overlapping genes, genes on the complementary strand etc. Straight-forward solutions therefore do not work when we need to take all these issues into considerations. 4 Comparison between prokaryotic and eukaryotic genomes In prokaryotic genomes, DNA sequences that encode proteins are transcribed into mRNA, and then RNA is usually translated directly into proteins without significant modification. The longest ORFs running from the first available Met codon on the mRNA to the next stop codon in the same reading frame generally provide a good, but not assured, prediction of the protein-encoding regions. A reading frame of a genomic sequence that does not encode a protein will have short ORFs due to the presence of many inframe stop codons. In eukaryotic organisms, transcription of protein-encoding regions initiated at specific promoter sequences is followed by removal of noncoding sequence (introns) from pre-mRNA by a splicing mechanism, leaving the protein-coding exons. As a result of the presence of intron sequences in the genomic DNA sequences of eukaryotes, the ORF corresponding to an encoding gene will be interrupted by the presence of introns that usually generate stop codons. The transcription (the formation of mRNA from the DNA sequence) and translation (coding-regions of mRNA into corresponding proteins) differ at a fundamental level in prokaryotes and eukaryotes. Hence, the problem of Gene Prediction maybe divided into two, namely, Gene Prediction in Prokaryotes and in Eukaryotes. 5 Gene Prediction in Prokaryotes 5.1 Understanding prokaryotic gene structure The knowledge of gene structure is very important when we set out to solve the problem of gene prediction. The gene structure of Prokaryotes can be captured in terms of the following characteristics Promoter Elements The process of gene expression begins with transcription - the making of an mRNA copy of a gene by an RNA polymerase. Prokaryotic RNA polymerases are actually assemblies of several different proteins (alpha, beta and beta-prime) that each play a distinct and important role in the functioning of the enzyme. The -35 and -10 sequences recognized by any particular sigma factor are usually described as a consensus sequence - essentially the set of most commonly found nucleotides at the equivalent positions of other genes that are transcribed by RNA polymerases containing the same sigma factor. Open Reading Frames Since stop codons are found in uninformative nucleotide sequences, approximately once every 21 codons (3 out of 64), a run of 30 or more triplet codons that does not include a stop codon is in itself a good evidence that the region corresponds to the coding sequence of a prokaryotic gene. One hallmark of prokaryotic genes that is related to their translation is the presence of the set of sequences around which ribosomes assenble at the 5' end of each open reading frame. Often found immediately downstream of transcriptional sites and just upstream of the first start codon, ribosome loading sites (sometimes called Shine-Delgarno) sequences almost invariably include the nucleotide sequence 5'-AGGAGGU-3'. Termination Sequences Just as the RNA polymerases begin transcription at recognizable transcriptional start sites immediately downstream from promoters, the vast majority of prokaryotic operons also contain specific signals for the termination of transcription called intrinsic terminators.