Fungal Intron Evolution: Why Does a Small Genome Have Many Introns?
Kemin Zhou, Alan Kuo, Asaf Salamov, and Igor Grigoriev

Introduction

Here we try to answer why Sporobolomyces roseus, one of the smallest fungal genomes, has one of the highest intron counts of all fungal genomes, in the context of fungal intron evolution. In this study we used a statistical comparative genomics approach to intron number evolution among 16 fungal genomes.

Table 1. Fungal genomes used in this study.

Database  Species                          Phylum
Aspni1    Aspergillus niger                Ascomycota
Mycfi1    Mycosphaerella fijiensis         Ascomycota
Mycgr1    Mycosphaerella graminicola       Ascomycota
Necha2    Nectria haematococca             Ascomycota
Picst3    Pichia stipitis                  Ascomycota
Trire2    Trichoderma reesei               Ascomycota
Trive1    Trichoderma virens               Ascomycota
copci1    Coprinus cinereus                Basidiomycota
cryneo1   Cryptococcus neoformans          Basidiomycota
Lacbi1    Laccaria bicolor                 Basidiomycota
Phchr1    Phanerochaete chrysosporium      Basidiomycota
Pospl1    Postia placenta                  Basidiomycota
Sporo1    Sporobolomyces roseus            Basidiomycota
ustma1    Ustilago maydis                  Basidiomycota
Phybl1    Phycomyces blakesleeanus         Zygomycota
Batde5    Batrachochytrium dendrobatidis   Chytridiomycota

Exon number reduction: the half loss rule. S. roseus is an exception

Table 2. Intron evolution within genomes. Coding exon numbers were compared between genes of different conservation levels. The genes were divided into four conservation groups: all, genes conserved in all genomes (GCAS); between, genes conserved between different phyla (GCBP); phylum, genes conserved within the same phylum (GCWP); and species, species-specific genes (SSG). The p-values for the t-tests are colored red if less than 10e-4, pink if less than 10e-3, yellow if less than 10e-2, and green if less than 0.05.

dbname   all   p-val     between  p-val     phylum  p-val     species
Aspni1   3.76  8.61E-08  3.31     3.24E-09  3.02    0.002406  2.83
Batde5   5.90  0.000305  5.18     0.224571  4.86    8.79E-07  3.57
copci1   7.32  0.000322  6.61     8.94E-18  5.65    4.17E-34  4.38
cryneo1  7.29  0.000356  6.69     0.171129  6.37    1.32E-06  5.18
Lacbi1   7.89  2.61E-07  6.84     2.97E-10  6.11    1.43E-35  4.83
Mycfi1   2.49  0.5253    2.44     0.065689  2.52    5.96E-09  2.23
Mycgr1   2.48  0.087217  2.59     0.000774  2.75    0.00569   2.91
Necha2   3.33  0.001856  3.09     0.002464  2.97    0.00023   3.14
Phchr1   7.08  1.40E-05  6.32     2.41E-22  5.18    5.14E-12  4.34
Phybl1   6.18  0.002574  5.68     0.003718  6.54    2.89E-14  4.21
Picst3   1.44  0.425451  1.41     0.510146  1.44    0.098549  1.54
Pospl1   6.90  0.31744   6.69     0.983844  6.68    1.28E-06  5.92
Sporo1   7.21  0.677986  7.29     0.519538  7.48    0.405048  7.21
Trire2   3.31  3.97E-05  2.99     0.001305  2.85    0.046557  3.06
Trive1   3.35  5.51E-06  2.99     0.000228  2.84    0.045978  2.95
ustma1   1.67  0.846634  1.69     0.778802  1.67    0.003047  1.90

Figure 4. Half loss rule. The average number of exons in species-specific genes (SSG) is linearly related to that of genes conserved in all species (GCAS). Sporo1 is an exception, although with Sporo1 included the correlation is still statistically significant (p-value 9.996e-06). Fitted line without Sporo1: $y = 0.503x + 1.172$, p-value 8.196e-07. Axes: species-specific exon number against exon number conserved in all.

Most frequent and shortest exon length, and evidence of intron loss

Figure 8. The shortest, most frequent exon length. Top half: mean exon length as a function of the number of introns, with the fitted curve $L = 1060.1\,e^{-0.7812x} - 1.8961x + 198.5$. Setting x to 60 to 70 gives an estimated exon length of 66-86 nt. Bottom half: total exon length, $G = L(x + 1)$, plotted against the number of introns.

Figure 9. Exon length distribution. Exons shorter than 500 nt from all 16 genomes are plotted by phase. Exon phase is defined as the remainder of the exon length divided by 3. The total number of exons (all sizes) in each phase is shown in the legend (phase 0: 280067; phase 1: 206910; phase 2: 206494). Phase 0 exons dominate. Axes: count against exon length.
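The Figure 4 line can be checked against the Table 2 values. The sketch below is a minimal ordinary least-squares fit in plain Python that regresses the species-specific (SSG) exon number on the all-genomes (GCAS) exon number with Sporo1 excluded. The poster does not say which software produced the fit; the names `EXONS` and `fit_line` are illustrative and not part of the original analysis.

```python
# Mean coding exon numbers from Table 2: name -> (GCAS "all", SSG "species").
EXONS = {
    "Aspni1": (3.76, 2.83), "Batde5": (5.90, 3.57), "copci1": (7.32, 4.38),
    "cryneo1": (7.29, 5.18), "Lacbi1": (7.89, 4.83), "Mycfi1": (2.49, 2.23),
    "Mycgr1": (2.48, 2.91), "Necha2": (3.33, 3.14), "Phchr1": (7.08, 4.34),
    "Phybl1": (6.18, 4.21), "Picst3": (1.44, 1.54), "Pospl1": (6.90, 5.92),
    "Sporo1": (7.21, 7.21), "Trire2": (3.31, 3.06), "Trive1": (3.35, 2.95),
    "ustma1": (1.67, 1.90),
}


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Sporo1 is the stated exception, so it is left out of the fit.
points = [xy for name, xy in EXONS.items() if name != "Sporo1"]
slope, intercept = fit_line(points)
print(f"y = {slope:.3f} x + {intercept:.3f}")
```

With the rounded values printed in Table 2, the slope and intercept come out close to the reported 0.503 and 1.172; small differences in the last digit are expected from rounding in the table.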

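The fitted curve of Figure 8 can be evaluated directly to see where the 66-86 nt estimate comes from. The sketch below is only an illustration: the function names are hypothetical, and the exponent coefficient 0.7812 is taken from the figure annotation and exposed as a parameter `k`. At 60 to 70 introns the exponential term is negligible, so the result there is set almost entirely by the linear part.

```python
import math


def mean_exon_length(introns, k=0.7812):
    """Fitted mean exon length in nt: L = 1060.1*exp(-k*x) - 1.8961*x + 198.5."""
    return 1060.1 * math.exp(-k * introns) - 1.8961 * introns + 198.5


def total_exon_length(introns, k=0.7812):
    """Total exon length G = L * (x + 1); a gene with x introns has x + 1 exons."""
    return mean_exon_length(introns, k) * (introns + 1)


# Tabulate the curve at a few intron counts, including the 60-70 range.
for x in (1, 5, 10, 60, 70):
    print(x, round(mean_exon_length(x), 1), round(total_exon_length(x)))
```

At 60 and 70 introns the function returns about 84.7 nt and 65.8 nt, in line with the 66-86 nt range quoted in the Figure 8 caption.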
No intron loss for S. roseus (Sporo1)

Figure 1. Whole genome phylogenetic tree and intron gain/loss estimates with the Linear Least Square (LLS) method. The average number of coding exons from GCAS is labeled next to each database name (used as an abbreviation of the species name). Bootstrap values are all 100% except for the two values shown in light blue boxes. Each value on a branch represents the estimated intron gain or loss. The major phyla and subphyla are labeled. The number in the circle, 7.25, is the estimated number of coding exons in the ancestor of fungi.

Figure 2. Intron relative location distribution. A regression line was drawn with data excluding the extreme values from both ends. The dips at both ends are due to edge effects. Data are grouped for every 1%. Axes: count against percent relative location from the 5'-end.

Figure 3. Estimating the number of exons in the ancestor of fungi with relative intron location. Axes: mean number of exons against mean relative intron location.

Intron length variability

Table 3. Conserved genes have shorter introns. The average of the log intron length (ALIL) was compared between conservation levels all and species. Only Aspni1 showed no significant difference. The two genomes with the most intron loss, ustma1 and Picst3, showed the opposite trend. Column diffexp is the difference of the natural exponentials of the ALIL values (species minus all), in nt.

dbname   All    Species  diffexp  P-value
Aspni1   4.277  4.279      0.2    0.88794301
Batde5   4.605  4.653      5.0    1.47E-05
copci1   4.130  4.208      5.0    4.93E-22
cryneo1  4.083  4.175      5.7    4.99E-30
Lacbi1   4.035  4.319     18.6    0
Mycfi1   4.288  4.776     45.8    1.82E-81
Mycgr1   4.283  4.878     58.9    7.27E-137
Necha2   4.180  4.235      3.7    9.06E-05
Phchr1   4.026  4.069      2.5    8.99E-08
Phybl1   4.584  4.653      7.0    2.87E-12
Picst3   4.428  4.251    -13.6    0.01634488
Pospl1   4.200  4.507     23.9    5.11E-129
Sporo1   4.412  4.529     10.2    7.64E-31
Trire2   4.460  4.527      6.0    0.00102148
Trive1   4.361  4.450      7.3    5.26E-08
ustma1   4.689  4.506    -18.2    0.0001481

Intron lengths in fungi are roughly log-normally distributed, so our analysis was carried out on a log scale. Average intron lengths range from 56 to 109 nt in GCAS and from 58 to 131 nt in SSG. In most genomes, less conserved genes tend to have longer introns. The exceptions are P. stipitis and U. maydis, where less conserved genes have shorter introns; these are the only two genomes where intron loss has reached its maximum extent, and both have smaller genomes. Even with these small-scale changes, fungal introns are short overall. This may explain the lack of correlation between intron size and genome size or amount of RT. It appears that the small size of the introns has excluded transposable elements such as RT.

Reverse transcriptase has divergent effects on exon number

Figure 13. Relationship between the average number of exons and the amount of reverse transcriptase. At the species level there is a positive correlation for 10 out of 16 genomes. At more conserved levels there is a negative correlation. One panel per genome; axes: average number of exons against log (total number of RT), for the conservation groups all, between, phylum, and species.

Summary and Discussion

Why S. roseus has the smallest genome

No intron loss was detected by the phylogenetic tree method, and very little by the relative intron location method. S. roseus has one of the smallest genomes, with the fewest genes and very few RT footprints. It is an exception to the rule of exon number reduction in less conserved genes. S. roseus has the fewest genes and the fourth smallest genome; four genomes are clustered at the lower end.

The carrying capacity of fungal genomes

There is a linear relationship between the number of genes and genome size for most fungi. Two out of the 18 genomes are exceptions to this rule. The genomic sequence of the first, P. placenta, was assembled from diploid DNA. The other, M. fijiensis, is highly populated by RT. Nearly all fungal chromosomes are small and highly condensed. Most fungal genomes contain little repetitive DNA. Fungi are the only major eukaryotic group that is haploid. The linear equation, whose slope of 2.573e-04 genes per nt corresponds to one gene for every 3887 nt, might be a summary of these features. Furthermore, there is a correlation between the number or total length of RT and genome size. It appears that RT and genes compete for the same genome space. It is reasonable to assume that larger genomes can afford extra "useless" intron DNA, so there should be a correlation between the number of exons per gene and genome size, but this is not so straightforward. A correlation is found only within the SSG, and two Basidiomycota, S. roseus and C. neoformans, are exceptions.

Figure 5. Constant carrying capacity of fungal genomes. Genome size was estimated by the total length of scaffolds belonging to each genome. The linear regression equation, $N_g = 2.573 \times 10^{-4}\,G + 1.278 \times 10^{3}$ (p-value 2.765e-07; 3887 nt/gene), was derived excluding Pospl1 and Mycfi1. $N_g$ is the number of genes; $G$ is the genome size. The Pospl1 genome was highly polymorphic and assembled as diploid. Mycfi1 had a large number of retroposons in the genome. Axes: number of genes against genome size.

Figure 6. Correlation between genome size and number of reverse transcriptase (p-value 0.01845). We got a similar plot when using the total length of RT. Axes: log genome size against log number of RT.

S. roseus is an exception to the number of exons and genome size correlation for SSG. It also has few RT footprints

Figure 7. Larger genomes tend to have more exons per gene. We divided the genes into four groups as described in Table 2. There is a linear relationship between the number of exons of SSG and log genome size. The p-value is 0.01588 when Sporo1 and cryneo1 are excluded, and 0.0006588 when Mycfi1 is excluded in addition (line shown). Axes: number of exons against log (genome size), for the conservation groups all, between, phylum, and species.

Number of exons in ancestor genomes and gene birth big bang

Our phylogenetic tree gave 7.25 coding exons for the ancestor, which is an underestimate since intron loss does occur even in conserved genes such as those from the Ascomycota. The relative intron location method gave 7.66 coding exons in the common ancestor of fungi. This method is limited in that it ignores any intron loss mechanism that leaves the uniform distribution of introns undisturbed.

We found a relationship between the average exon length and the number of introns based on the 16 genomes: $L = 1060.1\,e^{-0.7812x} - 1.8961x + 198.5$, where x is the intron number and L is the average exon length. The exponential term provides evidence supporting intron loss. Another use of the equation is to find the most dominant exon length of ancient ancestors. It seems that intron loss operates at the gene level. If each gene has the same probability of intron loss, assumed to be a fixed number, then exons in longer genes are more likely to represent the ancestral state. In this study, 60-70 is the largest number of introns per gene, and the equation tells us that the average exon length for such genes is 66-86 nt. If we look independently at the exon length distribution, the peak of the exon length over all the fungal genomes is also at around 70-80 nt, with a second shoulder at around 160 nt. Except for very short exons, fungal exons are dominated by phase 0 exons.

This supports a hypothesis that there was a gene birth "big bang" early in evolution. In this process, genes were predominantly generated by exon-shuffling of exons mostly shorter than 100 nt, with a peak at 70-80 nt. Phase 0 exons dominate because they can be formed by intron loss between phase 1 and phase 2 exons. The average protein is 400 aa (1200 nt of coding sequence), so at roughly 75-80 nt per exon the genes should have 15-16 exons. After this, intron evolution was dominated by intron loss, although intron gain still happens, an order of magnitude less often, through exon-shuffling or intron insertion into existing genes. Exon-shuffling has become less frequent largely due to selection pressure against disruption of protein function. There could have been a period of intron loss from 16 to 8 exons by a mechanism other than RT.

Conclusion

We have proposed a new approach to look at fungal intron evolution at genome scale. Our novel method to estimate the number of exons in the common ancestor of fungi could be applied to other genomes such as animals or plants. Our proposal of a gene birth big bang, based on solid fungal comparative genomic analysis, could be a useful framework for further analysis. We also observed that reverse transcriptase can have divergent effects on the number of exons in the genome.

Acknowledgement

Statistical and mathematical assistance from Mingkun Li. Annotation efforts from Andrea Aerts, Bobby Otillar, Frank Korzeniewski, and Xueling Zhao. Computer support from IT: Jereme Brad and Ilya Malinov. Tireless support from GDS groups.

This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Berkeley National Laboratory under contract No. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract No. DE-AC02-06NA25396.