
ISSN 0006-3509, Biophysics, 2006, Vol. 51, No. 4, pp. 515–522. © Pleiades Publishing, Inc., 2006.

Original Russian Text © E.O. Ermakova, D.B. Mal’ko, M.S. Gelfand, 2006, published in Biofizika, 2006, Vol. 51, No. 4, pp. 581–588.

MOLECULAR BIOPHYSICS

Evolutionary Differences between Alternative and Constitutive Protein-Coding Regions of Alternatively Spliced Genes of Drosophila

E. O. Ermakova (a, b), D. B. Mal’ko (c), and M. S. Gelfand (a, b, c)

(a) Lomonosov Moscow State University, Moscow, 199992 Russia
(b) Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, 127994 Russia
(c) State Research Institute of Genetics and Selection of Industrial Microorganisms, Moscow, 117545 Russia

Received October 20, 2005; in final form, November 7, 2005

Abstract: A total of 790 Drosophila melanogaster genes that are alternatively spliced in a coding region and have orthologs in Drosophila pseudoobscura were studied. Nucleotide substitutions were found to accumulate more rapidly in alternative coding regions than in constitutive coding regions. Moreover, the evolutionary patterns of alternative regions that arise by different mechanisms (use of alternative promoters, splicing sites, or polyadenylation sites) differ significantly. The synonymous substitution rate in coding regions of genes varies more strongly than the nonsynonymous substitution rate. The patterns of substitutions in different classes of alternative regions of Drosophila melanogaster and mammals differ considerably.

DOI: 10.1134/S0006350906040014

Key words: molecular evolution, alternative splicing, Drosophila, nucleotide substitutions

INTRODUCTION

Alternative splicing is one of the main mechanisms creating proteomic diversity in eukaryotes. Only 80% of alternative exons of the fruit fly Drosophila melanogaster are conserved in Drosophila pseudoobscura, and 50% thereof are conserved in the mosquito Anopheles gambiae, whereas the percentages of conserved constitutive exons are 97 and 77%, respectively [1]. One alternatively spliced gene codes for several proteins with common structural segments. These proteins may perform different functions or the same function under different conditions (e.g., they may be expressed in certain tissues and/or at certain developmental stages of the organism).

Since it is alternative coding regions of pre-mRNA (i.e., regions that can be eliminated or conserved in processing) and coding regions located between alternative promoters or polyadenylation sites that are responsible for the specificity of protein isoforms, it is natural to expect that the corresponding regions of chromosomes evolve differently from chromosomal regions corresponding to constitutive coding regions, i.e., regions that are always present in a processed mRNA. A number of studies corroborate this hypothesis. Alternative splicing sites are weaker than constitutive ones [2]. Noncanonical GC–AG introns are more often alternative than canonical GT–AG introns [3]. Among exons common to human and mouse, the neighboring long sequences of introns are also conserved for 77% of alternative cassette exons and for only 17% of constitutive exons [4].

The evolution of a gene proceeds by fixation of mutations (nucleotide substitutions) and by its participation in chromosomal rearrangements, including the loss and gain of exons and introns. The evolution of the exon–intron structure of alternatively spliced genes of D. melanogaster was studied earlier [1]. In particular, it was shown that new introns arise more frequently in alternative regions of a gene than in constitutive regions.

It is natural to assume that nucleotide substitutions leading to amino acid substitutions in protein are accumulated more rapidly in alternative regions. Iida and Akashi [5] analyzed 26 pairs of alternatively spliced human genes and their orthologs in other mammals and showed that the nonsynonymous substitution rate in alternative regions of genes is higher than in constitutive regions, whereas the synonymous substitution rate in alternative regions is lower than in constitutive regions. However, our analysis of 3029 orthologous pairs of alternatively spliced genes of human and mouse demonstrated a higher rate of substitutions of both types in alternative coding regions as compared with constitutive regions [6]. In this work, we analyze nucleotide substitutions in 790 alternatively spliced genes of D. melanogaster, comparing them with orthologs in D. pseudoobscura.

Three main types of alternative splicing are distinguished [7]. In type I splicing, different promoters of the same gene are used to obtain cognate pre-mRNAs with 5′-proximal regions of different lengths. As a result, proteins with different N-termini can form. Alternative type II splicing involves pre-mRNAs with 3′-proximal sequences of different lengths, which generally results from polyadenylation of transcripts at different sites. Correspondingly, proteins with different C-termini can be produced. Finally, in alternative type III splicing, identical pre-mRNAs are processed into different functional mRNAs, depending on which of the alternative splicing sites are chosen.

The existence of several types of alternative splicing in eukaryotes prompted us to study separately the evolution of DNA regions corresponding to N-terminal alternative (N-alternative), C-terminal alternative (C-alternative), and inner alternative (I-alternative) regions of protein. The evolutionary behaviors of the three classes of alternative regions in Drosophila proved to differ significantly (see Results and Discussion).

PROCEDURE

Genomic DNA and protein isoforms of D. melanogaster were taken from the FlyBase database, version 3 (ftp://flybase.net/genomes/Drosophila_melanogaster/), and genomic DNA of D. pseudoobscura was obtained from the Baylor College of Medicine Human Genome Sequencing Center (ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Dpseudoobscura). The orthology between genes of D. melanogaster and D. pseudoobscura was established as described before [1]. The protein isoforms of D. melanogaster were mapped on the genome using the program Pro-Frame [8]. We identified a total of 790 D. melanogaster genes meeting the following requirements: the existence of different protein isoforms, the existence of a unique ortholog in D. pseudoobscura, and the existence of alternative regions conserved in D. pseudoobscura (a region was considered conserved if its alignment was significant according to published criteria [1] and its splicing sites were conserved).

Analysis Algorithm and Meta-Alignments

Data analysis consisted of the following steps. First, we separated subsamples of genes (in particular, genes with long alternatives were considered separately, see below) and classes of constitutive and alternative regions of alignments of these genes. Second, meta-alignments (concatenated alignments) of regions of the separated classes were composed. We considered meta-alignments of two types: local meta-alignments, combining coding regions of the same class within a single gene, and global meta-alignments, combining regions of the same class of all genes of a certain subsample (Table 1). Incomplete codons and codons with deletions in the alignment were not included in meta-alignments. Third, we evaluated the similarity of the protein products and the synonymous and nonsynonymous nucleotide substitution rates in genome regions of a given class (Table 2; Fig. 1). This evaluation required a sufficiently long alignment [9]; we used a threshold of 80 nt. The use of meta-alignments allowed us to take into account not only long alternative segments of genes, such as cassette exons, but also very short ones, such as extensions, i.e., regions between two alternative donor sites or two acceptor sites.

Considered Evolutionary Parameters

For each meta-alignment, we estimated four evolutionary parameters. The fraction Id of coinciding amino acids in translated meta-alignment enables one


to estimate the current divergence of regions of homologous proteins. The nonsynonymous substitution rate dN also serves as a measure of the divergence of amino acid sequences corresponding to homologous regions of two genes, but, in addition, dN characterizes the “saturation” of a coding region with nonsynonymous substitutions. The synonymous substitution rate dS makes it possible to judge both the intensity of mutations in a given coding region (in comparison with dN) and the evolution of “non-protein” elements of the gene, e.g., regulatory sequences. The normalizations of dN and dS are matched, and whereas dN and dS estimate the number of nucleotide substitutions accumulated since the divergence of two organisms, their ratio ω = dN/dS is not a function of time but a characteristic of the selection pressure on the region. In particular, at ω > 1, it is concluded that the coding region is under positive selection.

Table 1. Structure of global meta-alignments

Name of subsample of genes       Number of genes   Sample of coding regions   Meta-alignment length, nt
Reference sample                 790               N-alternative              266935
                                                   I-alternative              84558
                                                   C-alternative              183581
                                                   All alternative            535074
                                                   Constitutive               1334760
Genes with long alternatives     588               Alternative                477229
                                                   Constitutive               940465
Genes with long N-alternatives   308               N-alternative              247294
                                                   Constitutive               451872
Genes with long I-alternatives   145               I-alternative              75271
                                                   Constitutive               342932
Genes with long C-alternatives   184               C-alternative              150755
                                                   Constitutive               240306

Description of subsamples of genes. Reference sample: all alternatively spliced genes. Genes with long alternatives: genes in which the total alignment length of alternative regions exceeds 80 nt and the total alignment length of constitutive regions exceeds 80 nt. Genes with long N-alternatives: genes in which the total alignment length of N-terminal alternative regions exceeds 80 nt and the total alignment length of constitutive regions exceeds 80 nt. Genes with long I-alternatives: genes in which the total alignment length of inner alternative regions exceeds 80 nt and the total alignment length of constitutive regions exceeds 80 nt. Genes with long C-alternatives: genes in which the total alignment length of C-terminal alternative regions exceeds 80 nt and the total alignment length of constitutive regions exceeds 80 nt.

Estimation of Nucleotide Substitution Rate. The Ina Method

We estimated dN and dS by the Ina method [9]. This method belongs to the “evolutionary pathway” methods, in which, for each pair of alignment codons, all the shortest pathways of transformation of one codon into the other by single-nucleotide substitutions are considered.

The genetic code is degenerate. A substitution for a codon as a result of a mutation of a nucleotide can be synonymous, if the initial triplet and the triplet obtained by the mutation code for the same amino acid. Each position of a nonterminating codon has a synonymous potential s and a nonsynonymous potential n, such that s + n = 1. In the general case, the (non)synonymous potential of a position in a codon is the probability of obtaining a (non)synonymous substitution by a mutation of the nucleotide in this position. If any substitution for the base in one of the positions of a codon (the other positions being fixed) leads to a


nonsynonymous substitution for the codon, this position is called nonsynonymous and, for it, s = 0 and n = 1. If any substitution for the base in this position leads to a synonymous substitution for the codon, this position is called synonymous and, for it, s = 1 and n = 0. In the general case, s and n depend on the probabilities λ_XY of mutation of nucleotide X into nucleotide Y in a unit time (X, Y = A, C, T, G). In the Ina method, the Kimura two-parameter model is accepted [10]: λ_TC = λ_CT = λ_AG = λ_GA = α and λ_AC = λ_CA = λ_CG = λ_GC = λ_AT = λ_TA = λ_GT = λ_TG = β (Fig. 2). In this case, for each position of each codon, s and n are readily expressed through R = α/β (Fig. 3). When nucleotide substitutions in coding regions of genes are counted, the number of nonsynonymous substitutions is usually normalized to the total nonsynonymous potential of the region and the number of synonymous substitutions is normalized to the total synonymous potential of the region.

Table 2. Comparison of the evolutionary parameters of alternative and constitutive regions in terms of global meta-alignments

Sample of genes                  Class of coding regions   Id     dN     dS     ω
Reference sample                 N-alternative             0.66   0.23   0.36   0.62
                                 I-alternative             0.54   0.33   0.23   1.43
                                 C-alternative             0.61   0.25   0.28   0.89
                                 All alternative           0.62   0.25   0.31   0.80
                                 Constitutive              0.65   0.22   0.27   0.81
Genes with long alternatives     Alternative               0.62   0.25   0.31   0.81
                                 Constitutive              0.65   0.22   0.26   0.84
Genes with long N-alternatives   N-alternative             0.66   0.22   0.36   0.62
                                 Constitutive              0.63   0.23   0.26   0.90
Genes with long I-alternatives   I-alternative             0.53   0.33   0.22   1.47
                                 Constitutive              0.63   0.23   0.21   1.06
Genes with long C-alternatives   C-alternative             0.62   0.25   0.27   0.93
                                 Constitutive              0.68   0.20   0.32   0.63

Id is the fraction of coinciding amino acids in translated meta-alignment; dN is the rate of nonsynonymous substitutions of nonsynonymous positions; dS is the rate of synonymous substitutions of synonymous positions; ω = dN/dS.

Over all the pairs of codons, the numbers of nucleotide differences of four types are summed: synonymous and nonsynonymous transitions and transversions (a transition is a substitution of a pyrimidine base for a pyrimidine base, or of a purine base for a purine base, and the difference it creates; a transversion is a substitution of a purine base for a pyrimidine base, or of a pyrimidine base for a purine base, and the difference it creates; Fig. 2). If two codons differ in two or three positions, then, for each of the four types of substitutions, the arithmetic mean of their number over all possible minimal pathways of transformation of one codon into the other is taken. Minimal pathways containing termination codons are ignored. In particular, if none of the minimal pathways contains termination codons, then, for differences in two positions, we have two pathways, and for differences in three positions, we have six pathways (Fig. 4).


Fig. 1. Comparison of the evolutionary parameters of alternative and constitutive regions in terms of local meta-alignments. Each diagram describes a P(A*) vs. P(C) experiment, where P is one of the evolutionary parameters (Id, dN, dS, or ω) and A* is a class of alternative regions (A, AN, AI, or AC). A difference in a parameter P between regions of classes A* and C was considered significant if |P(A*) – P(C)| / |P(A*) + P(C)| > 0.05. For each sector, the number of genes in the corresponding group is indicated. (a) Genes with long alternatives. All alternative (A) vs. constitutive (C). (b) Genes with long N-alternatives. N-alternative (AN) vs. constitutive (C). (c) Genes with long I-alternatives. I-alternative (AI) vs. constitutive (C). (d) Genes with long C-alternatives. C-alternative (AC) vs. constitutive (C).
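The significance criterion from the caption of Fig. 1 can be written as a small Python sketch. The function names and the handling of two zero values are assumptions made for illustration. The sample numbers are the global values for genes with long alternatives from Table 2, used here only as convenient inputs; in Fig. 1 the criterion is applied per gene.

```python
def relative_difference(p_alt, p_const):
    """|P(A*) - P(C)| / |P(A*) + P(C)|, the quantity compared with 0.05."""
    total = abs(p_alt + p_const)
    if total == 0:
        return 0.0  # assumption: two zero values are treated as equal
    return abs(p_alt - p_const) / total


def compare_regions(p_alt, p_const, threshold=0.05):
    """Classify one parameter P for alternative vs. constitutive regions."""
    if relative_difference(p_alt, p_const) > threshold:
        return "alternative higher" if p_alt > p_const else "constitutive higher"
    return "no significant difference"


# Sample inputs: Id and dN for genes with long alternatives (Table 2).
print(compare_regions(0.62, 0.65))  # Id
print(compare_regions(0.25, 0.22))  # dN
```

With these inputs, the relative difference in Id is about 0.024 (below the threshold) and that in dN is about 0.064 (above it).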

Let $S^*$ be the arithmetic mean of the total synonymous potentials of the aligned sequences, $S^*_{Ts}$ be the number of transitions observed in the alignment, and $S^*_{Tv}$ be the number of transversions observed in the alignment. Then, the observed frequencies of synonymous differences (transitions and transversions) in synonymous positions are equal to

\[ P_S^* = \frac{S^*_{Ts}}{S^*}, \qquad Q_S^* = \frac{S^*_{Tv}}{S^*}. \]

An estimate $d_S^*$ for dS is made by applying the Kimura correction for multiple substitutions to $P_S^*$ and $Q_S^*$ [10]:

\[ d_S^* = -\frac{1}{2}\ln\left(1 - 2P_S^* - Q_S^*\right) - \frac{1}{4}\ln\left(1 - 2Q_S^*\right). \]

An estimate $d_N^*$ for dN is made similarly. The parameter ω is estimated as $d_N^*/d_S^*$.
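As a worked illustration of the correction above, the following Python sketch computes a substitution rate from difference counts and a total potential, and then forms ω. The function names and all input counts are hypothetical and are not data from this study.

```python
import math


def kimura_distance(p_ts, q_tv):
    """Kimura two-parameter correction for multiple substitutions.

    p_ts and q_tv are the observed frequencies of transitions and
    transversions. math.log raises ValueError when the differences are
    saturated (1 - 2P - Q or 1 - 2Q not positive).
    """
    return (-0.5 * math.log(1 - 2 * p_ts - q_tv)
            - 0.25 * math.log(1 - 2 * q_tv))


def substitution_rate(n_transitions, n_transversions, potential):
    """Normalize difference counts to the total potential, then correct."""
    p = n_transitions / potential
    q = n_transversions / potential
    return kimura_distance(p, q)


# Hypothetical counts, for illustration only.
d_s = substitution_rate(30.0, 15.0, 300.0)  # synonymous: P = 0.10, Q = 0.05
d_n = substitution_rate(40.0, 30.0, 700.0)  # nonsynonymous
omega = d_n / d_s
print(round(d_s, 4), round(d_n, 4), round(omega, 3))
```

The corrected rate always exceeds the raw frequency P + Q, because the correction accounts for repeated substitutions at the same site.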


Fig. 2. Kimura two-parameter model of DNA evolution [10]: λ_TC = λ_CT = λ_AG = λ_GA = α and λ_AC = λ_CA = λ_CG = λ_GC = λ_AT = λ_TA = λ_GT = λ_TG = β.
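A brief numerical check may help connect this model to the ratio R = α/β. Under the Kimura two-parameter model, the expected proportions of transitional (P) and transversional (Q) differences between two sequences have a standard closed form, and α/β can be recovered from them as 2 ln(1 − 2P − Q) / ln(1 − 2Q) − 1. The sketch below assumes that t denotes the total time separating the two sequences; the function names and parameter values are hypothetical.

```python
import math


def expected_differences(alpha, beta, t):
    """Expected P (transitions) and Q (transversions) under the model.

    alpha is the transition rate, beta the rate of each of the two
    transversion types, and t the total time separating the sequences.
    """
    e_tv = math.exp(-4 * beta * t)
    e_all = math.exp(-2 * (alpha + beta) * t)
    p = 0.25 + 0.25 * e_tv - 0.5 * e_all
    q = 0.5 - 0.5 * e_tv
    return p, q


def rate_ratio(p, q):
    """Recover R = alpha / beta from the difference proportions."""
    return 2 * math.log(1 - 2 * p - q) / math.log(1 - 2 * q) - 1


# Hypothetical rates chosen so that alpha / beta = 2.24.
p, q = expected_differences(alpha=0.224, beta=0.1, t=1.0)
print(round(p, 4), round(q, 4), round(rate_ratio(p, q), 4))
```

The recovered ratio does not depend on t, so the same check works for any divergence time at which the logarithms remain defined.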

Fig. 3. Calculation of the synonymous potential s and the nonsynonymous potential n of a position in a codon in the Ina method [9].

Asymmetry of Mutations of Nucleotides

Mutations of pyrimidine bases into pyrimidine bases or of purine bases into purine bases in genomic DNA are much more frequent than mutations of pyrimidine bases into purine bases and vice versa. The ratio R of the numbers of transitions and transversions, which is necessary to calculate the synonymous and nonsynonymous potentials of a nucleotide position, was estimated from the meta-alignment of all regions of all of the 790 genes. If $P_3^*$ and $Q_3^*$ are the observed frequencies of transitions and transversions, respectively, in the third positions of alignment codons, then we suppose [9]

\[ R = \frac{2\ln\left(1 - 2P_3^* - Q_3^*\right)}{\ln\left(1 - 2Q_3^*\right)} - 1. \]

In our problem, R = 2.24.

Analysis of Local Meta-Alignments

For genes with long alternatives, we compared the evolutionary behaviors of alternative regions of different classes and constitutive regions using local meta-alignments (Fig. 1). A difference in a parameter P between regions of classes X and Y was considered significant if |P(X) – P(Y)| / |P(X) + P(Y)| > 0.05.

RESULTS AND DISCUSSION

We considered 790 alternatively spliced genes of D. melanogaster that have orthologs in D. pseudoobscura. The fraction Id of coinciding amino acids, the nonsynonymous substitution rate dN, the synonymous substitution rate dS, and ω = dN/dS in alternative (A) and constitutive (C) regions of genes were estimated (Table 2; Fig. 1). It was found that, on average, Id(A) < Id(C), dN(A) > dN(C), dS(A) > dS(C), and ω(A) < ω(C). Thus, nucleotide substitutions accumulate more rapidly in alternative regions. This suggests weakening of negative selection and/or strengthening of positive selection in alternative regions of genes. Our observations are consistent with the inverse dependence of the number of nucleotide substitutions in a coding region of a gene on the number of tissues in which the gene is expressed [11] and on its expression level [12], if it is assumed that constitutive regions are expressed in a larger number of tissues than alternative regions and that the expression level of the former is higher.

We also separately considered N-alternative (AN), I-alternative (AI), and C-alternative (AC) regions of genes. For global meta-alignments of the reference sample (Table 2), Id(AN) > Id(AC) > Id(AI) and dN(AN) < dN(AC) < dN(AI) [...] conserved. Moreover, ω(AI) > 1. This suggests that inner alternative regions of genes are under positive selection. In addition, dS(AC) ≈ dS(C), dS(AI) < dS(C), and dS(AN) > dS(C). In coding regions of genes, several layers of information are coded, including information on protein sequences and regulatory sites. Nonsynonymous substitutions can change both of these layers, whereas synonymous substitution can


change only the second of them. Our observations allow us to assume that, in N-alternative and I-alternative regions, two opposite scenarios are realized: in N-terminal alternatives, only changes in regulatory sites are favored (dN(AN) ≈ dN(C), dS(AN) > dS(C)), whereas in inner alternatives, only changes in amino acid sequences are (dN(AI) > dN(C), dS(AI) < dS(C)) [...]

Since we considered only alternatives confirmed by proteins, some alternative regions may be included in a sample of constitutive regions; however, this could lead only to leveling of the observed differences between constitutive and alternative regions.

Among genes with long alternative regions of [...]

Fig. 4. Count of substitutions of different types between the codons TAT and CGA: sTs is the number of synonymous transitions, sTv is the number of synonymous transversions, nTs is the number of nonsynonymous transitions, and nTv is the number of nonsynonymous transversions. Between the codons TAT and CGA, there are 6 minimal evolutionary pathways, but three of them were ignored in calculations since they contain termination codons. Note that pathways 2 and 3 contain different numbers of synonymous and nonsynonymous substitutions.


This work was supported in part by the Russian Foundation for Basic Research (project no. 04-04-49440), the Howard Hughes Medical Institute (grant no. 55000309), the Ludwig Institute for Cancer Research (CDRF RB0-1268), and also the Russian Science Support Foundation (M.S.G.) and the Program of the Russian Academy of Sciences “Molecular and Cell Biology”.

REFERENCES

1. D. B. Malko, in Proceedings of the Second International Moscow Conference on Computational Molecular Biology, 2005, p. 224.
2. F. Clark and T. A. Thanaraj, Hum. Mol. Genet. 11, 451 (2001).
3. M. Burset, I. A. Seledtsov, and V. V. Solovyev, Nucl. Acids Res. 28, 4364 (2000).
4. R. Sorek and G. Ast, Genome Res. 13, 1631 (2003).
5. K. Iida and H. Akashi, Gene 261, 93 (2000).
6. E. O. Ermakova, in Proceedings of the Second International Moscow Conference on Computational Molecular Biology, 2005, p. 95.
7. M. Singer and P. Berg, Genes and Genomes: A Changing Perspective (University Science Books, Mill Valley, Calif., 1991; Mir, Moscow, 1998), Vol. 2.
8. A. A. Mironov, P. S. Novichkov, and M. S. Gelfand, Bioinformatics 17, 13 (2001).
9. Y. Ina, J. Mol. Evol. 40, 190 (1995).
10. M. Kimura, J. Mol. Evol. 16, 111 (1980).
11. L. Duret and D. Mouchiroud, Mol. Biol. Evol. 17, 68 (2000).
12. C. Pal, B. Papp, and L. D. Hurst, Genetics 158, 927 (2001).
13. Y. Xing and C. Lee, Proc. Natl. Acad. Sci. USA 102, 13526 (2005).
