Reconstruction of DNA Sequences Using Genetic Algorithms and Cellular Automata: Towards Mutation Prediction? Ch

Available online at www.sciencedirect.com BioSystems 92 (2008) 61–68 Reconstruction of DNA sequences using genetic algorithms and cellular automata: Towards mutation prediction? Ch. Mizas a, G.Ch. Sirakoulis a, V. Mardiris a, I. Karafyllidis a, N. Glykos b, R. Sandaltzopoulos b,∗ a Democritus University of Thrace, Department of Electrical and Computer Engineering, 67100 Xanthi, Greece b Democritus University of Thrace, Department of Molecular Biology and Genetics, Laboratory of Gene Expression, Molecular Diagnostics and Modern Therapeutics, 68100 Alexandroupolis, Greece Received 21 December 2006; received in revised form 28 November 2007; accepted 10 December 2007 Abstract Change of DNA sequence that fuels evolution is, to a certain extent, a deterministic process because mutagenesis does not occur in an absolutely random manner. So far, it has not been possible to decipher the rules that govern DNA sequence evolution due to the extreme complexity of the entire process. In our attempt to approach this issue we focus solely on the mechanisms of mutagenesis and deliberately disregard the role of natural selection. Hence, in this analysis, evolution refers to the accumulation of genetic alterations that originate from mutations and are transmitted through generations without being subjected to natural selection. We have developed a software tool that allows modelling of a DNA sequence as a one-dimensional cellular automaton (CA) with four states per cell which correspond to the four DNA bases, i.e. A, C, T and G. The four states are represented by numbers of the quaternary number system. Moreover, we have developed genetic algorithms (GAs) in order to determine the rules of CA evolution that simulate the DNA evolution process. Linear evolution rules were considered and square matrices were used to represent them. If DNA sequences of different evolution steps are available, our approach allows the determination of the underlying evolution rule(s). Conversely, once the evolution rules are deciphered, our tool may reconstruct the DNA sequence in any previous evolution step for which the exact sequence information was unknown. The developed tool may be used to test various parameters that could influence evolution. We describe a paradigm relying on the assumption that mutagenesis is governed by a near-neighbour-dependent mechanism. Based on the satisfactory performance of our system in the deliberately simplified example, we propose that our approach could offer a starting point for future attempts to understand the mechanisms that govern evolution. The developed software is open-source and has a user-friendly graphical input interface. © 2008 Elsevier Ireland Ltd. All rights reserved. Keywords: Genetic algorithms; Cellular automata; DNA sequence modelling; Evolution; Simulation; Bioinformatics; Nanotechnology 1. Introduction 1993, 1994; Sirakoulis et al., 2004). In general, computational intelligence methods may be employed provided that they are In the recent years, advances of biochemical and biophysi- wisely combined with bioinformatics, as illustrated in Baxevanis cal instrumentation methodologies allowed collection of whole and Ouellette (1998). In order to contribute to this effort we mod- genome and proteome sequences (Venter et al., 2001). The elled DNA sequence as a one-dimensional cellular automaton complexity of the available information calls for the devel- (CA) thus allowing the use of powerful and sophisticated com- opment of ever more powerful bioinformatics tools to attack putation techniques for the development of useful bioinformatics complex problems associated with proper interpretation of the tools (Sirakoulis et al., 2003). obtained information. Faster and more accurate computational CAs, originally developed by von Neumann (1966) as mod- intelligence algorithms and data processing technologies are els of self-reproducing systems, have been extensively used required (Baxevanis and Ouellette, 1998; Chou and Zhang, to model and simulate environmental and biological systems 1992; Cios et al., 2005; Durbin et al., 2000; Zhang and Chou, (Bernadres and dos Santos, 1997; Chou et al., 2006; Gao et al., 2006; Gaylord and Nishidate, 1996; Kansal et al., 2000; Karafyllidis, 1997, 1998; Karafyllidis and Thanailakis, 1997; ∗ Corresponding author. Tel.: +30 25510 30622; fax: +30 25510 30624. Patel et al., 2001; Salzberg et al., 2004; Sirakoulis et al., 2000; E-mail address: [email protected] (R. Sandaltzopoulos). Wang et al., 2005; Xiao et al., 2005a,b, 2006a,b). CAs were 0303-2647/$ – see front matter © 2008 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2007.12.002 62 Ch. Mizas et al. / BioSystems 92 (2008) 61–68 showed to be a promising model for DNA sequence (Sirakoulis et al., 2003), because certain aspects of the DNA structure, function and evolution can be simulated using several mathematical tools (such as linear algebra and operators), introduced through the use of CAs. In our model the sugar-phosphate backbone of a DNA molecule corresponds to the CA lattice and the organic bases to the CA cells. At each position of the lattice one of the four bases A (adenine), C (cytosine), T (thymine) and G (gua- nine) of the DNA molecule may be allocated. These four bases correspond to the four possible states of the CA cell. In elementary CAs, the CA evolution rule can be extracted from a given number of CA evolution patterns. This method can also be applied to the CAs that model DNA sequences. Thus, we developed a simulator, named DNA EVO, which allows modelling DNA evolution by extracting CA rule(s) using genetic Fig. 1. (a) An example of a typical one-dimensional CA; (b) a sequence of algorithms (GAs) (Holland, 1975; Goldberg, 1989). The evolu- evolution steps of a one-dimensional CA. tion rule(s) can be determined by providing the global state of the DNA sequence in various evolution steps. Conversely, if the neighbouring cells. All the cells that may possibly affect the evolution rule(s) and DNA sequences are given for several evo- change of the state of the ith cell are the neighbourhood of this lution steps, it may be possible to determine the DNA sequence cell. The neighbourhood is defined as follows: in previous evolution steps. DNA EVO is an interactive simula- N i, r ={C ,...,C ,C ,C ,C,C ,C , tion tool that includes a graphical user interface [GUI] which has ( ) i−r i−3 i−2 i−1 i i+1 i+2 been implemented using Tcl/Tk facilities (Welch et al., 2003). Ci+3, ..., Ci+r},r= 0, 1, 2, 3,...,m (2) where r is the size of the neighbourhood. If r = 1, which is the 2. Modelling DNA in Terms of Cellular Automata most usual case then the neighbourhood of the ith cell consists of the same cell and its left and right immediate neighbours: In this work we match certain properties of DNA with the features of a proposed CA thus modelling DNA in terms of N(i, 1) ={Ci−1,Ci,Ci+1} (3) CAs. Sketching certain directions that DNA modelling might take, we hope to stimulate further research in this direction. The state of the ith cell at time step t + 1 is affected by the Cellular automata (CAs) were introduced by von Neumann states of its neighbours at the previous time step t, i.e. the state (1966) and Ulam (1974) as a possible idealization of biologi- of the ith cell at a time step is a function of the states of its cal systems, in order to model biological self-reproduction. In neighbours at the previous time step: this work, we model DNA as a one-dimensional CA and, there- Ct+1 = F Ct ,...,Ct ,Ct ,Ct ,Ct,Ct ,Ct , i ( i−r i−3 i−2 i−1 i i+1 i+2 fore, only one-dimensional CAs will be presented in this section Ct ,...,Ct (Sirakoulis et al., 2003). A one-dimensional CA consists of a reg- i+3 i+r) (4) ular uniform lattice, which may be infinite in size and expands in a one-dimensional space. Each site of this lattice is called a This function is the CA evolution rule. The upper index in cell. At each cell a variable takes values from a discrete set. The t+1 the state symbol denotes the time step. Ci is the state of the value of this variable is the state of the cell. Fig. 1(a) shows a ith cell at time step t +1.Ifr = 1, Eq. (4) becomes one-dimensional CA. The CA lattice consists of identical cells, Ct+1 = F Ct ,Ct,Ct ..., i − 3, i − 2, i − 1, i, i +1,i +2,i +3,..., and the correspond- i ( i−1 i i−1) (5) ing states of these cells are C − , C − , C − , C , C , C and i 3 i 2 i 1 i i+1 i+2 Fig. 1(b) shows the evolution of a one-dimensional CA. The C . i+3 horizontal axis is space and the vertical axis is time. Each row The state of the ith cell takes values from a predefined discrete represents the CA at each time step and each column represents set: the state of the same cell at various time steps. DNA may be modelled as a one-dimensional CA; the sugar- Ci ∈{c1,c2,c3,...,cn} (1) phosphate backbone corresponds to the CA lattice where the four where c1, c2, c3, ..., cn are the elements of the set. This set bases A, C, T and G may be attached, each one corresponding may be a set of integers, a set of real numbers, a set of atoms, a to the four possible states of a CA cell. The state of the ith cell set of molecules, or even a set of properties. If this set contains of this CA takes values from the discrete set that comprises the only the two binary numbers, i.e.

Reconstruction of DNA Sequences Using Genetic Algorithms and Cellular Automata: Towards Mutation Prediction? Ch

Phylogenetic Analysis of Anostracans (Branchiopoda: Anostraca) Inferred from Nuclear 18S Ribosomal DNA (18S Rdna) Sequences

Investgating Determinants of Phylogeneic Accuracy

Eidesstattliche Erklärung

Genetics 540 Winter, 2001 Models of DNA Evolution – Part 2 Joe Felsenstein Department of Genetics University of Washington

Convergent Evolution of Behavior in an Adaptive Radiation of Hawaiian Web-Building Spiders

Phylogenetics and Bioinformatics for Evolution [30Pt] Maximum Likelihood

PHYLOGENOMIC CONFLICT in HYLARANA 1 Exons, Introns, And

Hierarchical Phylogeny Construction

Molecular Phylogeny and Historical Biogeography of the Land Snail Genus Solatopupa (Pulmonata) in the Peri-Tyrrhenian Area

Deep Diversification and Long-Term Persistence in the South American ‘Dry Diagonal’: Integrating Continent-Wide Phylogeography and Distribution Modeling of Geckos

Mutation Patterns of Mitochondrial H- and L-Strand DNA in Closely Related Cyprinid Fishes

Fast Distance-Based Phylogenetic Placement