Fast and Accurate Gene Prediction by Decision Tree Classification

Fast and Accurate Gene Prediction by Decision Tree Classification Rong She1, Jeffrey Shih-Chieh Chu2, Ke Wang1, and Nansheng Chen2 1School of Computing Science, Simon Fraser University, Canada. 2Department of Molecular Biology and Biochemistry, Simon Fraser University, Canada. Abstract of base pairs (bp) long, where usually only less than Gene prediction is one of the most challenging tasks in 5% contains instructions for protein-coding and non- genome analysis, for which many tools have been devel- coding genes. The DNA segments that carry this ge- oped and are still evolving. In this paper, we present a netic information are called genes (Figure 1). In order novel gene prediction method that is both fast and ac- to understand the sequenced genome of a species and curate, by making use of protein homology and decision exploit such resources for biological and medical pur- tree classification. Specifically, we apply the principled poses, one of the first and most important steps in entropy and decision tree concepts to assist in such gene genome study is gene prediction, i.e. determining the prediction process. Our goal is to resolve the exact gene positions of genes and their structures on the DNA structures in terms of finding \coding" regions (exons) sequences [9]. In this paper, we will focus on the pre- and \non-coding" regions (introns). Unlike traditional diction of protein-coding genes. classification tasks, however, we do not have explicit class A typical protein-coding gene contains \coding" labels for such structures in the genes. We use protein se- regions (exons) as well as \non-coding" regions (in- quence (the product of gene) as a query to help in finding trons) (Figure 1). When a gene is expressed, it is genes that are homologous to the query protein and de- first transcribed as pre-mRNA, which then undergoes duce class labels based on homology. Our experiments on a process called splicing, in which non-coding regions the genomes of two nematodes C. elegans and C. briggsae are removed. A mature mRNA, which does not con- show that in addition to achieving prediction accuracy tain introns, serves as a template for the synthesis of comparable with that of the state of the art methods, it a protein in translation. In translation, each codon, a is several orders of magnitude faster, especially for genes group of three adjacent base pairs in mRNA, directs that encode longer proteins. the addition of one amino acid (according to the ge- netic code [19]) to a peptide being synthesized. Thus, 1 Introduction. a protein is a sequence of amino acid residues corre- With the fast development of genome sequencing sponding to the mRNA sequence of a gene. technologies, the amount of genome data accumu- The junction between an exon and an intron is lated has been increasing exponentially. To date, called a splice site, which is either a donor (the start biologists have sequenced genomes of more than a site of an intron) or an acceptor (the end site of an thousand species, while thousands of more species are intron), as illustrated in Figure 1. The DNA sequence currently being sequenced (http://genomesonline. that is formed by removing the introns and joining the org/). The genome sequence for any organism is com- exons is known as a spliced sequence. posed of DNA sequences for each of the chromosomes The tasks of gene prediction include finding the in that organism. In the human genome, a DNA positions of gene start, gene end, and splice sites sequence (chromosome) can be hundreds of millions on the DNA sequence. Since introns and exons are complementary to each other (i.e. introns can 790 Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. DNA gene gene gene gene Gene start Gene end donor acceptor donor acceptor gene Exon1 Intron1 Exon2 Intron2 Exon3 transcription mRNA translation protein Figure 1: The structure of a protein-coding gene on a DNA sequence and the process of gene expression. A DNA sequence may contain thousands of genes. A gene consists of exons that are coding regions (shaded regions in gene structure) and introns that are non-coding (non-shaded regions in gene structure). During transcription, introns are removed and exons are glued together to form mRNA. Finally, mRNA is translated into a protein (gene product). be identified as genomic regions flanked by adjacent stop codon and donor and acceptor signals, such as exons) (Figure 1), the gene structure is determined GENSCAN [20], and AUGUSTUS [21]. On the other once all its introns are identified. If a protein sequence hand, homology-based programs make use of extrinsic (gene product) is used as a query to align with the evidence such as protein sequences, mRNAs, ESTs or gene on the DNA sequence, introns should be DNA other genomic sequences in finding the genes on the regions that have no query alignment, because the DNA sequence. The availability of genome sequences query is translated from exons only. The spliced of related species has created growing demand for bet- sequence of the gene should be homologous to the ter and faster homology-based gene prediction pro- query protein. grams, which gives rise to many developments in such area, such as GeneWise [18], Projector [22], Twin- 1.1 Gene prediction: current status. Given a Scan [23], exonerate [24], SGP2 [25], SLAM [26]. In newly-sequenced genome, there are currently many general, it has been shown that homology-based gene computational tools to predict possible gene struc- prediction methods generally outperform the ab initio tures on the DNA sequences [5, 6, 7, 8, 9]. These methods in terms of accuracy when there are extrinsic tools can be divided into two major categories: ab evidences available [9, 11], with GeneWise [18] being initio and homology-based methods. ab initio meth- one of the most widely used homology-based gene pre- ods find genes by systematically examining the DNA diction programs. sequences for certain signals including start codon, 791 Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Most current gene prediction programs such ogous to the query protein. These groups indicate as GeneWise are based on hidden Markov models the candidate and approximate regions where homol- (HMMs) and their computational complexity is in- ogous genes are present (gene regions). Each group trinsically high due to the high running cost of Vert- also identifies the relevant HSPs with alignment in- ibi algorithm [1] that is used in HMM solutions. This formation for each gene region. makes them very slow when annotating large scale Although each HSP group indicates a candidate of genome sequences. For whole-genome scale analy- region for a gene on the target DNA sequence, it sis, genome sequences are usually preprocessed to re- does not provide details of the exact structure of the fine the input sequence for these programs, including potential gene. To resolve the structure of a gene, GeneWise [2], so that they can be executed within a we need to identify the positions of its exons and reasonable time frame. Nevertheless, for longer query introns. This is exactly the purpose of genblastDT. In proteins, GeneWise takes more than one hour to finish this paper, we will focus on the problem of resolving prediction on one gene even with such preprocessing. gene structures using the HSP groups returned by genblastA. 1.2 Our contribution. We have developed a novel homology-based gene prediction program, Challenges in resolving gene structures. In genBlastDT, which is comparable on accuracy with each gene region, the HSPs reported by genBlastA GeneWise but is many orders of magnitude faster suggest the presence of exons of a gene. Introns than GeneWise. Similar to GeneWise, genBlastDT are not represented by HSPs because the query is takes two biological sequences as input: a query pro- a protein sequence that is coded only by the exons tein sequence (i.e., gene product), Q, and a target in a gene. However, mapping the HSPs to exons DNA sequence, T . genBlastDT is able to quickly and is a challenging task due to several reasons. First, accurately find genes on the target DNA sequence T HSPs often contain gaps and mismatches in their that are homologous to the query protein Q. The alignments and the boundaries of exons usually do not unique contribution of genBlastDT is a novel applica- coincide with boundaries of HSPs. Second, because tion of decision tree classification in identifying intron BLAST always tries to extend HSPs as long as its regions. More details are described in the later dis- score is above a threshold, it is possible for one HSP to cussion of our approach. correspond to the region that contains multiple exons. genBlastDT is built on top of a recently developed On the other hand, due to evolutionary divergence, program, genBlastA [17], which can quickly identify exons may not have precise correspondences with homologous gene regions on the DNA sequence for a query protein fragments and it is possible for the given protein and thus can be used as an indepen- region of one exon to be represented by multiple dent preprocessing tool for GeneWise-like programs. HSPs with small gaps between them. Third, splice genBlastA utilizes a fast and sensitive local alignment sites are usually signaled by reserved sequences (i.e. tool such as BLAST [16] that finds all sequence ho- splice site signals) which must be taken into account mologies between the given query protein Q and the when resolving the gene structure. However, random target DNA sequence T . The result is a set of un- sequences in the target DNA sequence may resemble organized local alignments called HSPs (high-scoring splicing signals, which makes it difficult to identify segment pairs), where each HSP is a pair of segments, the real splice sites.

Fast and Accurate Gene Prediction by Decision Tree Classification

Gene Prediction and Genome Annotation

Gene Structure Prediction

A Curated Benchmark of Enhancer-Gene Interactions for Evaluating Enhancer-Target Gene Prediction Methods

There Is a Lot of Research on Gene Prediction Methods

Gene Prediction Using Deep Learning

Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration

"An Overview of Gene Identification: Approaches, Strategies, and Considerations"

A Benchmark Study of Ab Initio Gene Prediction Methods in Diverse Eukaryotic Organisms

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D

Progress in Gene Prediction: Principles and Challenges Srabanti Maji and Deepak Garg*

PATTERNS of DIPEPTIDE USAGE for GENE PREDICTION a Thesis

Bioinformatics Is a New Discipline That Addresses the Need to Manage and Interpret the Data That in the Past Decade Was Massively Generated by Genomic Research