14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP
IDENTIFYING INVERTED REPEAT STRUCTURE IN DNA SEQUENCES USING CORRELATION FRAMEWORK
Ravi Gupta, Ankush Mittal, and Kuldip Singh Department of Electronics & Computer Engineering, Indian Institute of Technology Roorkee Roorkee, Uttaranchal, 247 667, India phone: + (91) 9897043997, email: {rgcsedec, ankumfec, ksd56fec}@iitr.ernet.in
ABSTRACT occur as tandem or reverse complement repeats. Reverse The detection of inverted repeat structure is important in complement structure is also called inverted repeat. Inverted biology because it has been associated with large biological repeats are widespread in both prokaryotic and eukaryotic function. This paper presents a framework for identifying genomes [3, 4, 5], and have been associated with a large inverted repeat structure present in DNA sequence. Based number of possible functions. Promoters, viruses, and eu- on the correlation framework, the algorithm is divided into karyotes all contain inverted repeats. Origins of DNA repli- two stages. In the first stage the position and length of con- cation from higher eukaryotes, such as monkey and human, tiguous inverted repeats are identified based on the input are also enriched in inverted repeats. Inverted repeats have parameters using correlation function. Later on in the sec- been implicated in the regulation of initiation of DNA repli- ond stage maximal inverted repeats are constructed by cation in plasmids, bacteria, eukaryotic viruses and mam- merging of continuous inverted repeats. mals [6]. More details regarding about the roles of inverted The advantage of the framework is that it can be success- repeat in human disease is presented in [7]. fully used for identifying both exact and inexact inverted Thus, it is important to detect the inverted repeat structure in repeats, returning maximal inverted repeat. Additionally, the a DNA sequence. A major difficulty in identification of re- framework does not need the user to specify parameters peats arises from the fact that the inverted repeats present in which require knowledge of system details. Experiments DNA sequence can be either exact or inexact, and are of were performed on various chromosomes of Saccharomyces unspecified length. The detection of exact inverted repeat is cerevisiae (baker’s yeast) genome data available at NCBI simple and can be achieved in linear time but the detecting website and some of the typical results are presented in this an inexact inverted repeat has proven to be challenging task. paper. Many techniques have been developed to identify the in- verted repast. Suffix tree technique [8] transforms the in- 1. INTRODUCTION verted repeat detection problem to finding longest common extension subsequence. It identifies exact or inexact repeat Signal processing techniques offer a great promise in ana- with fixed number of mismatches in linear time. However, lyzing and deciphering genomic data. The genomic informa- the technique becomes both complex and inefficient for tion is inherently discrete in nature because there are finite finding inexact inverted repeat without any prior knowledge numbers of nucleotides in DNA alphabet. The standard ap- of mismatches due to substitution or insertion/deletion of proach of representing the genomic information by se- nucleotides. Another technique for detecting approximate quences of nucleotide symbols in the sequence of DNA and inverted repeats in nucleotide sequences is inverted repeat RNA molecule, by symbolic codons (triplets of nucleotides), finder (IRF) [9]. IRF is a statistically based heuristic algo- or by symbolic sequences of amino acids in the correspond- rithm and the approach is similar to BLAST algorithm. The ing polypeptide chains limits the methodology of handling program detects candidate inverted repeats by finding short, the genomic information to mere pattern matching or statis- exact, reverse complement matches of 4-7 nucleotides (k- tical procedures. By properly mapping nucleotide symbols tuples) between non-overlapping fragments of a sequence. A to some numeric sequences [1, 2], signal processing tech- “center” position is defined for each k-tuple match. Inverted niques provides a set of novel and useful tools for solving repeat finder detects “clusters” of k-tuple matches having relevant problems of genomic data. In recent years signal the same or nearly the same center and falling within a small processing is making its presence felt in the emerging uni- interval of sequence. Candidate inverted repeats are con- verse of genome exploration. firmed (aligned and extended) or rejected by computing For last few decades, the major thrust of DNA and protein Smith-Waterman style similarity alignment. IRF run against analysis has been on string matching, either with goal of a genome sequence using parameters match, mismatch, in- obtaining a precise solution (e.g., with dynamic program- del, and minimum score. Major drawback of this technique ming) or more commonly a faster solution (e.g., with heuris- is the requirement of input parameters listed above. The tic techniques). However, these heuristic methods do not result is technique is dependent on the input parameters. A work well on repetitive structures. In DNA, most repetitions 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP user has to do many trial sessions specifying different set of values for these parameters in order to get a good match. EINVERTED is another program available at http://bioweb. pasteur.fr/seqanal/interfaces/einverted.html is also used for finding inverted repeats in a DNA sequence. The algorithm is based on dynamic programming methodology. It works by finding alignments between the sequence and its reverse complement that exceeds a threshold score. Gaps and mis- matches are assigned penalty (negative score). Matches are assigned a positive score. The score is calculated by sum- ming the values of each match, the penalties of each mis- match and the large penalty of any gaps. Any region whose Figure 1 -The correlation coefficients obtained for sequences exceeds the threshold value are reported. In this paper, we provide a correlation based signal process- where F1[n] and F2[n] are Fourier transforms (FT) of f1[n] −1 ing technique in order to locate and identify the inverted and f2[n] respectively and FD is the Inverse Discrete Fou- repeat structures in DNA sequences. The inverted repeat rier transform (IDFT). detection algorithm is divided into two stages. It takes input In this paper we provide a novel technique for identifying parameter as the maximum length of the inverted repeat and both exact and inexact inverted repeat structures in a DNA minimum contiguous matches. These parameters, unlike in sequence. The technique is based on correlation of DNA other search systems, do not require user to be expert in un- sequence and its reverse complemented sequence. The value derstanding the inner details of the system. of correlation coefficients that are obtained after performing The paper is organized as follows. Section II describes about correlation is directly related to the presence of inverted the correlation function and how it can be applied to detec- repeat structures in the DNA sequence. tion of inverted repeat problem in DNA sequences. Section A primary step before performing correlation of DNA se- III presents an inverted repeat detection algorithm for DNA quence is to assign some numeric values to DNA nucleo- sequences. In Section IV the algorithm is applied on some tides. An arbitrary assignment would not give a correct cor- actual DNA sequence and experimental result is presented. relation measure between the DNA sequences. For example, Conclusion and future work follow in Section V. let A=1, C=2, G=3, T=4, and three sequences be S1 = AACC, S2 = AACG, S3 = AACT. By observation, the corre- 2. CORRELATION FUNCTION lation between S1 and S2 is equal to the correlation between S1 and S3, since they are off by one element. However, us- Correlation is a mathematical tool to quantify the degree of ing the given numeric mapping of the sequence, the correla- interdependence of one data upon another. In other words, it tion coefficient between S1 and S2 is 0.9, S1 and S3 is 0.82. is used to establish the similarity between one set of data In order to obtain a correct correlation measure we have and another. The process of correlation occupies a signifi- converted the DNA sequence into four nucleotide subse- cant place in signal processing. Applications of correlation quences. Similar nucleotide assignment has been used in are found in image processing for robotic vision or remote [10, 11]. Consider a DNA sequence S of length L, sensing by satellite in which data from different images is compared, in detection and identification of noise, and in Ssssss= many other fields, such as, climatology. The correlation for 123L L− 1L N-point data is given as 1 N consisting of L letters belonging to a given finite alphabet rk[]=+ fnfn [] [ k ] (1) Ω . For DNA sequences, Ω = {Adenine (A), Cytosine (C), 12N ∑ 1 2 n=1 Guanine (G), and Thymine (T)}. The sequence S is divided
into four nucleotide sequences SA, SC, SG, and ST. The se- where f1 and f2 are two functions for which the correlation is quences are calculated as follows: to be calculated. When f1[n] = f2[n] then correlation is said to be auto-correlation and when f [n] ≠ f [n] then it is said to be 1 2 1, if Si [ ]=Ω where Ω∈ {A,C,G,T} Si⎡⎤= (3) cross-correlation. Ω ⎣⎦ {0, otherwise The complexity of the correlation operation is O(N2) which is quite expensive, especially when dealing with very large sequences. The correlation computation may be speeded up In this way the original DNA sequence is decomposed into by exploiting the correlation theorem, usually stated as four binary sequences. The decomposition of the DNA se- quence into nucleotide subsequences helps in obtaining cor- 1 rect correlation measure between the two DNA sequences. rk[]= F−1* ( FnFn [] []) (2) Consider two sequences S and S′ as the given DNA and its 12N D 1 2 reverse complemented sequences respectively. The correla-
tion between the two sequences is calculated as follows:
14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP
Table 1. Pseudo code for stage 1 of the algorithm rkSS′′[ ]=≤≤− r S S [ k ] 0 kL 1 (4) ∑ ΩΩ FIND INVERTED REPEAT (S, Start, End, Lmin ) Ω L−1 if CHECK FLAG(Start, End) = TRUE r[] k=⋅+ SiSi []′′′ [ k ] whereSi [] =+ Si [ L ] (5) then return SSΩΩ′ ∑ ΩΩ Ω Ω i=0 if (End – Start + 1) < 2* Lmin then return The parameter r gives the correlation between the DNA SS′ (NA, NC, NG, NT )← number of nucleotide ‘A’, ‘C’, ‘G’, ‘T’ re- sequences and its reverse complemented sequence (i.e. S and spectively in S[Start…End] S′ respectively) and is used as a measure for inverted repeat sequence identification problem. The correlation coefficient MaxMatch ← min(NA , NT ) + min(NC , NG) SET FLAG (Start, End) obtained for a random DNA sequence AGCGGCATGATCA if MaxMatch < L TGATCATGCCCG of length 25 is provided in figure 1. min From the figure it is clear that there is high correlation for then return i ←Start, j ←End, TotalMatch ← 0 delay 1 and 7. The delay parameter in the correlation opera- while i < j and TotalMatch < MaxMatch and S[i]=S[j] tion of sequences helps in identifying the position of the in- do i ←i+1, j ←j-1, TotalMatch ←TotalMatch+1 verted repeat in the given DNA sequence and the value of if TotalMatch >= L correlation coefficient gives an idea about the number of min then OUTPUT (Start, End, TotalMatch) matches in the inverted repeat. Start ←Start + TotalMatch End ←End + TotalMatch 3. INVERTED REREAT DETECTION if TotalMatch < MaxMatch ALGORITHM then FIND INVERTED REPEAT (S, Start, End, Lmin ) The main objectives of any inverted repeat detection algo- return rithm are to identify its length, its pattern structure and its else position in the DNA sequence. The inverted repeat detection Corr = FIND CORRELATION ( S, I, Start, End) // I is the reverse complemented sequence of S algorithm present in this section operates in two stages. The i ← 0 first stage identifies contiguous inverted repeats in the input while i < ⎣WindowLength⎦ DNA sequence. A contiguous inverted repeat is represented do if Corr[i] >= 2* L by tuple
with highest number of matches reported when applied to X 211≥+Xl & Y 211 ≤− Yl whereXX 12 < (6) chromosome III of Saccharomyces cerevisiae with window size equal to 100 and minimum continuous repeat length as For example, ACGGATATGT have contiguous inverted re- 5 is shown in figure 2. Total number of matches in the re- peat as <1, 10, 2> i.e., AC – – – – – – GT and <5, 8, 2> i.e., ported inverted repeat was 43. When the window size was ATAT, so both can be combined according to the above rule increased to 300 the length of the inverted repeat reported to obtain an inverted repeat as AC– –ATAT– – GT. An was increased and is given by: acyclic graph is constructed in order to obtain inexact in- <82899,83191,11> <82914,83176,24> <82939,83151,12> verted repeats from a list of contiguous inverted repeats gen- <82952,83138,5> <82958,83132,23> <82981,83108,29> erated in the previous stage. The nodes of the acyclic graph <83011,83078,8> <83020,83069,20> are labeled as
Input: Organism: Saccharomyces cerevisiae chromosome III Accession number: NC_001134 Length: 813178 bp Window size: 100 Minimum contiguous repeat: 5
Output: Inverted Repeat: <82995, 83094, 15> <83011, 83078, 8> <83020, 83069, 20> Total Matches: 43
15 8 20 82995 83011 83032
A T G G A A T C C C A A C A A T T A T C T C A A C A T T C A C C C A T T T C T C A A G T A
T A C C T T A G G G T T G T T G A T A G A G T T T T A A G T G G G T A A A G A G T T C A T 8309483078 83069 Figure 2 - An inverted repeat reported by the algorithm in chromosome III of Saccharomyces cerevisiae DNA sequence. 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP
Input: Organism: Saccharomyces cerevisiae chromosome IV Accession number: NC_001136 Length: 1531916 bp Window size: 100 Minimum contiguous repeat: 5
Output: Inverted Repeat: <307956, 308027, 5> <307962, 308021, 5> <307982, 308010, 12> Total Matches: 22
55 12 307956 307962 307982
C C T G G C C G G A A C G A A C T G C T A C C A G T T C T G T T T C C C T C
G G A C C C G C C T T T A A G A G A G A C A A A G G G A G 308027 308021 308010 Figure 3 -An inverted repeat reported by the algorithm in chromosome IV of Saccharomyces cerevisiae DNA sequence. total number of matches in the inverted repeat sequence is [2] W. Wang, Don H. Johnson, "Computing linear trans- 132. This chromosome when tested by EINVERTED pro- forms of symbolic signals," IEEE Trans. Signal Processing, gram gives out the same result when the window size was vol. 50, no. 3, pp. 628-634, March 2002. taken to be equal to that what we have taken. [3] H. Ogata, S. Audic, C. Abergel, P. E. Fournier, and J. M. Claverie, "Protein coding palindrome are a unique but recur- Saccharomyces cerevisiae chromosome IV: rent feature in rickettsia," Letter: Genomic Research, vol. The inverted repeat with maximum number of matches re- 12, no. 2, pp. 808-816, 2002. ported by the algorithm for window size=100 and minimum [4] M. D. LeBlanc, G. Aspeslagh, N. P. Buggia, and B. D. contiguous repeat length = 5 is the following: Dyer, "An annotated catalog of inverted repeats of caenor- <307956, 308027, 5> <307962, 308021, 5> <307982, habditis elegans chromosomes III and X, with observations 308010, 12> concerning odd/even biases and conserved motifs," Ge- nomic Research, vol. 10, no. 9, pp. 1381-1392, 2000. The inverted repeat is shown in figure 3. It is formed by [5] H. Ogata, S. Audic, C. Abergel, P. E. Fournier, and J. M. merging 3 contiguous repeats of length 5, 5 and 12. The start- Claverie, "Selfish DNA in protein-coding genes of ing location of the inverted repeat is 307956. The total num- rickettsia," Science, vol. 290, pp. 347-349, October 2000. ber of matches in the inverted repeat sequence is 22. [6] C. E. Pearson, H. Zorbas, G. B. Price, M. Zanna Had- jopoulos, "Inverted repeats, stem loops, and cruciforms: 5. CONCLUSION significance for initiation of DNA replication," J. Cell. Bio- chem. vol. 63, no. 1, pp 1-22, 1996. A simple and efficient technique for identifying inverted [7] J. J. Bissler, "DNA inverted repeats and human disease," repeat structure in DNA sequences using correlation based Frontiers of Bioscience, pp. 408-418, March 1998. framework is presented in this work. The advantage of using [8] D. Gusfield, Algorithms on Strings, Trees, and Sequences: the algorithm is that it can efficiently detect both exact and Computer Science and Computational Biology. Cambridge, inexact inverted repeat present in DNA sequence. Another U.K.: Cambridge Univ. Press, 1997. advantage of this algorithm is that it generates all possible [9] P. E. Warburton, J. Giordano, F. Cheung, Y. Gelfand, G. inverted repeats present in DNA sequence satisfying mini- Benson, "Inverted repeat structure of the human genome: mum continuous repeat matching criteria fixed by the user. The X-chromosome contains a preponderance of large, The algorithm also reports the location and the length of the highly homologous inverted repeats that contain testes inverted repeat detected in the DNA sequence. The inverted genes," Genome Research, vol. 14, no 10A, pp. 1861-1869, repeat detection algorithm requires the specification of two 2004. well understood parameters: maximum length of inverted [10] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhat- repeat and minimum number of continuous match. tachrya, R. Ramaswamy, "Prediction of probable genes by A further work can be carried out in devising scoring fourier analysis of genomic sequences," CABIOS, vol. 13, method for faster merging of contiguous inverted repeats in no. 3, pp. 263-270, 1997. order to obtain a maximal inverted repeat. Also the work can [11] D. Sharma, B. Issac, G. P. S. Raghava, R. Ramaswamy, be extended for identifying other repetitive structures such "Spectral repeat finder (SRF): Identification of repetitive as tandem and dispersed repeat present in the DNA se- sequences using fourier transformation," Bioinformatics, quence. vol. 20, no. 9, pp. 1405-1412, 2004. [12] T. T. H. Cormen, C. E. Leiserson and R. L. Rivest, In- REFERENCES troduction to Algorithms, MIT Press, Cambridge, MA, USA, [1] P. D. Cristea, "Conversion of nucleotides sequences into pp. 477-484, 6th Indian edition, 2001. genomic signals," J. Cell. Mol. Med., vol. 6, no. 2, pp. 279- 303, 2002.