Distinguish Coding and Noncoding Sequences in a Complete Genome Using Fourier Transform

QUT Digital Repository: http://eprints.qut.edu.au/

Zhou, Yu and Zhou, Li-Qian and Yu, Zu-Guo and Anh, Vo V. (2007) Distinguish Coding And Noncoding Sequences In A Complete Genome Using Fourier Transform. In Proceedings Third International Conference on Natural Computation (ICNC 2007), pp. 295-299, Haikou, China.

© Copyright 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Third International Conference on Natural Computation (ICNC 2007), 0-7695-2875-9/07 \$25.00 © 2007

Yu Zhou (1), Li-Qian Zhou (1), Zu-Guo Yu (1,2,*), Vo Anh (2)

(1) School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China.
(2) School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, Q 4001, Australia.
(*) Corresponding author Zu-Guo Yu, e-mail: [email protected] or [email protected]

Abstract

A Fourier transform method is proposed to distinguish coding and non-coding sequences in a complete genome based on a number sequence representation of the DNA sequence proposed in our previous paper (Zhou et al., J. Theor. Biol. 2005) and the imperfect periodicity of 3 in protein coding sequences. The three parameters $P_{x(\bar{S})}(1)$, $P_{x(\bar{S})}(1/3)$ and $P_{x(\bar{S})}(1/36)$ in the Fourier transform of the number sequence representation of DNA sequences are selected to form a three-dimensional parameter space. Each DNA sequence is then represented by a point in this space. The points corresponding to coding and non-coding sequences in the complete genome of prokaryotes are seen to be divided into different regions. If the point $(P_{x(\bar{S})}(1), P_{x(\bar{S})}(1/3), P_{x(\bar{S})}(1/36))$ for a DNA sequence is situated in the region corresponding to coding sequences, the sequence is classified as a coding sequence; otherwise, the sequence is classified as a noncoding one. Fisher's discriminant algorithm is used to study the discriminant accuracy. The average discriminant accuracies $p_c$, $p_{nc}$, $q_c$ and $q_{nc}$ of all 51 prokaryotes obtained by the present method reach 81.02%, 92.27%, 80.77% and 92.24% respectively.

1. Introduction

The DNA sequence is formed from four different nucleotides, namely adenine (a), cytosine (c), guanine (g) and thymine (t). The complete genomes provide essential information for understanding gene functions and evolution. The determination of patterns in DNA and protein sequences is also useful for many important biological problems such as identifying new genes and discussing phylogenetic relationships among organisms. Accurate prediction of genes in genomes has always been a challenging task for bioinformaticians and computational biologists [1].

It is known that coding and non-coding sequences have different statistical and fractal behaviors. Li et al. [2] found that the spectral density of a DNA sequence containing mostly introns shows $1/f^{\beta}$ behavior, which indicates the presence of long-range correlation when $0 < \beta < 1$. The correlation properties of coding and non-coding DNA sequences were first studied by Peng et al. [3] in their fractal landscape or DNA walk model. They discovered that there exists long-range correlation in non-coding DNA sequences while the coding sequences correspond to a regular random walk. By undertaking a more detailed analysis, Chatzidimitriou-Dreismann and Larhammar [4] concluded that both coding and noncoding sequences exhibit long-range correlation. A subsequent work by Prabhu and Claverie [5] also substantially corroborates these results. If one considers more details by distinguishing c from t in pyrimidine, and a from g in purine (such as two or three-dimensional DNA walk models [6] and maps given by Yu and Chen [7]), then the presence of base correlation has been found even in coding sequences. On the other hand, Buldyrev et al. [8] showed that long-range correlation appears mainly in noncoding DNA using all the DNA sequences available. Based on equal-symbol correlation, Voss [9,10] showed a power law behavior for the sequences studied regardless of the proportion of intron contents.

The fractal methods for DNA sequence analysis were reviewed by Yu et al. [11]. Yu et al. [12] performed a multifractal analysis based on the chaos game representation of protein sequences from complete genomes. The measure representation of linked protein sequences from a complete genome was proposed and its multifractal analysis was performed by Yu et al. [13]. Zhang et al. [14] used the parameters from root-mean-square fluctuation analysis to distinguish intron-containing and intronless genes based on the properties of Z curves [15]. Kulkarni et al. [1] proposed to use the local Hölder exponent formalism to identify coding and non-coding sequences.

In their review paper, Fickett and Tung [16] pointed out that future gene-finding algorithms should be built on the Fourier, run, ORF and in-phase hexamer measures [17]. Hence Yan et al. [17] proposed a new Fourier transform approach to distinguish coding sequences from noncoding sequences. The data sets used in the above papers cover a large number of organisms.

In our previous paper [18], a number sequence representation of DNA sequences was proposed. Then a fractal method was used to distinguish coding and non-coding sequences in a complete genome based on their different statistical behaviors. In the present work, we propose to use the Fourier transform approach to distinguish coding and non-coding sequences in a complete genome based on the number sequence representation of DNA sequences. The parameters $P_{x(\bar{S})}(1)$, $P_{x(\bar{S})}(1/3)$ and $P_{x(\bar{S})}(1/36)$ (to be elaborated below) in the Fourier transform of the number sequence representation of DNA sequences are selected to form a three-dimensional parameter space, and each DNA sequence is then represented by a point in this space. If the point $(P_{x(\bar{S})}(1), P_{x(\bar{S})}(1/3), P_{x(\bar{S})}(1/36))$ for a DNA sequence is situated in the region corresponding to coding sequences, the sequence is classified as a coding sequence; otherwise, the sequence is classified as a noncoding one. Fisher's discriminant algorithm is used to study the discriminant accuracy. The average discriminant accuracies of all 51 prokaryotes obtained by the present method will be reported.

2. Method

In this paper, we use a unique number sequence representation of each DNA sequence, which was proposed in our previous paper [18]. Here we briefly describe this representation.

Firstly, considering the properties of purine or pyrimidine, and strong or weak bonds, we define a map from the nucleotides to the numbers as

$$F:\; c \to 1,\quad g \to 3,\quad a \to 5,\quad t \to 7.$$

Secondly, we map each $K$ nucleotides to a number. Any string made of $K$ letters from the set $\{g, c, a, t\}$ is called a $K$-string. Denoting a $K$-string by $S = s_1 \cdots s_K$, $s_i \in \{c, g, a, t\}$, $i = 1, \cdots, K$, we define

$$x(S) = \sum_{i=1}^{K} F(s_i)/l^{i},$$

where the base $l$ can be any inte-

For each DNA sequence $\bar{S}$ and a fixed integer $K$, we construct a partition of $\bar{S}$ by dividing it into non-overlapping $K$-strings. If we denote the partition as $\bar{S} = S_1 S_2 \cdots S_N$, where each $S_i$, $i = 1, 2, \cdots, N-1$, is a $K$-string and $S_N$ is a substring with length less than or equal to $K$, then the sequence $x(\bar{S}) = (x(S_1), x(S_2), \cdots, x(S_N))$ is called the number sequence representation of the DNA sequence $\bar{S}$ corresponding to the given $K$. It can be proved that the number sequence representation is unique for each DNA sequence with any fixed $K$ [18].

The power spectrum for a number sequence is defined as

$$P_{x(\bar{S})}(f) = \left| \frac{1}{N} \sum_{n=1}^{N} x(S_n) \exp(-2\pi i f n) \right|^{2},$$

for a given frequency $f$.

Our idea is to select three parameters from the power spectrum $\{P_{x(\bar{S})}(f) : f \in [0, 1]\}$ to form a three-dimensional parameter space, so that each DNA sequence can be represented by a point in this space.

3. The benchmark to evaluate the method

We use Fisher's linear discriminant algorithm [19,20] to calculate the discriminant accuracies of our method.

For all coding sequences of each genome, we randomly selected 80% of the coding sequences to compose a training set, and the remaining 20% of the coding sequences to form the test set. For all non-coding sequences of each genome, a similar selection is undertaken. We consider the three-dimensional space spanned by $\{P_{x(\bar{S})}(f_1), P_{x(\bar{S})}(f_2), P_{x(\bar{S})}(f_3)\}$, where $f_1$, $f_2$, $f_3$ are three frequencies selected from the interval $[0, 1]$. Each coding or non-coding sequence can be represented as a point in this space. We described Fisher's discriminant algorithm in [18].

As in [18], we define $p_c$ and $p_{nc}$ as the discriminant accuracies of coding and noncoding sequences, respectively, in the training set, and $q_c$ and $q_{nc}$ as the discriminant accuracies of coding and noncoding sequences, respectively, in the test set.

4. Results and Discussion

We selected the same 51 complete genomes of Archaea and Eubacteria as those in [18], available from the public database Genbank at the web site ftp://ncbi.nlm.nih.gov/genbank/genomes/. We use the abbreviations of these 51 prokaryotes in our figures and table in this paper. For their full names and categories, the reader can refer to Table 1 in [18].

We tried the cases $K = 1$ to $6$, and found that $K = 1$ is the best length of non-overlapping substrings in the number sequence representation for the method.
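The number sequence representation and the three spectral parameters described in Section 2 can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`number_sequence`, `power_spectrum`, `features`) are hypothetical, the default `base=8` is an assumed value for the base l (the text above does not fix it), and the 1/N normalisation is assumed to sit inside the modulus.

```python
import cmath

# Nucleotide map F from Section 2: c -> 1, g -> 3, a -> 5, t -> 7.
F = {"c": 1, "g": 3, "a": 5, "t": 7}


def number_sequence(dna, k=1, base=8):
    """Map a DNA string to its number sequence for non-overlapping K-strings.

    Each K-string S = s_1...s_K is mapped to x(S) = sum_i F(s_i) / base**i.
    The last chunk may be shorter than k. base=8 is an ASSUMED value of l.
    """
    dna = dna.lower()
    chunks = [dna[i:i + k] for i in range(0, len(dna), k)]
    return [
        sum(F[s] / base ** (i + 1) for i, s in enumerate(chunk))
        for chunk in chunks
    ]


def power_spectrum(x, f):
    """P(f) = |(1/N) * sum_{n=1..N} x_n * exp(-2*pi*i*f*n)|**2.

    Placing 1/N inside the modulus is an assumption of this sketch.
    """
    total = sum(
        xn * cmath.exp(-2j * cmath.pi * f * n)
        for n, xn in enumerate(x, start=1)
    )
    return abs(total / len(x)) ** 2


def features(dna, k=1, base=8):
    """Return the point (P(1), P(1/3), P(1/36)) for one DNA sequence."""
    x = number_sequence(dna, k, base)
    return tuple(power_spectrum(x, f) for f in (1, 1 / 3, 1 / 36))


if __name__ == "__main__":
    # A perfectly 3-periodic toy sequence has a clear component at f = 1/3.
    print(features("atg" * 12))
```

Each sequence then becomes one point in the three-dimensional parameter space, which is the input a linear discriminant such as Fisher's would be trained on.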
