The Bulk and the Tail of Minimal Absent Words in Genome Sequences
Total Page:16
File Type:pdf, Size:1020Kb
The Bulk and The Tail of Minimal Absent Words in Genome Sequences Erik Aurell ∗ y, Nicolas Innocenti ∗ and Hai-Jun Zhou z ∗Department of Computational Biology, KTH Royal Institute of Technology, AlbaNova University Center, SE-10691 Stockholm, Sweden,yDepartment of Information and Computer Science, Aalto University, FI-02150 Espoo, Finland, and zState Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China Minimal absent words (MAW) of a genomic sequence are subse- protocol [20]. In these as in other biotechnological applications quences that are absent themselves but the subwords of which are there is an interest in finding short absent words, preferably all present in the sequence. The characteristic distribution of ge- additionally with some tolerance. nomic MAWs as a function of their length has been observed to be Minimal absent words (MAW) are absent words which can- qualitatively similar for all living organisms, the bulk being rather not be found by concatenating a substring to another absent short, and only relatively few being long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects word. All the subsequences of a MAW are present in the a biological mechanism, and what biological information is contained text. MAWs in genomic sequences have been addressed re- in absent words. In this work we demonstrate that the bulk can be peatedly [14, 15, 16, 17, 18] as these obviously form a basis described by a probabilistic model of sampling words from random for the derived set of all absent words. Furthermore, while sequences, while the tail of long MAWs is of biological origin. We in- the number of absent words grows exponentially with their troduce the novel concept of a core of a minimal absent word, which length [21], because new AWs can be built by adding letters are sequences present in the genome and closest to a given MAW. to other AWs, the number of MAWs for genomes shows a dras- We show that in bacteria and yeast the cores of the longest MAWs, tically different behavior, as illustrated below in Fig. 1(a) and which exist in two or more copies, are located in highly conserved re- previously reported in the literature [21, 15, 16]. The behav- gions the most prominent example being ribosomal RNAs (rRNAs). We also show that while the distribution of the cores of long MAWs ior can be summarized as there being one or more shortest is roughly uniform over these genomes on a coarse-grained level, on a minimal absent word of a length which we will denote l0, a more detailed level it is strongly enhanced in 3' untranslated regions maximum of the distribution at a length we will denote lmode, (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that and a very slow decay of the distribution for large l. In human MAWs and associated MAW cores correspond to fine-tuned evolu- l0 is equal to 11, as first found in [13], lmode is equal to 18, tionary relationships, and suggest that they can be more widely used there being about 2:25 billion MAWs of that length, while the as markers for genomic complexity. support of the distribution extends to around 106 (Fig. 1(c) and (d)). The total number of human MAWs is about eight Minimal absent word; copy-mutation evolution model; random sequence billion. As already found in [15] the very end of the distri- bution depends on the genome assembly; for human Genome Abbreviations: AW, absent word; MAW, minimal absent word; rRNA, ribosomal assembly GRCh38.p2 the three longest MAWs are 1475836, RNA; UTR, untranslated region 831973 and 744232 nt in length. Several aspects of this dis- tribution are interesting. First, in a four-letter alphabet there k enomic sequences are texts in languages shaped by evo- are 4 possible subsequences of length k, but in a text of length Glution. The simplest statistical properties of these lan- L only (L − k) subsequences of length k actually appear. If guages are short-range dependencies, ranging from single- the human genome were a completely random string of letters nucleotide frequencies (GC content) to k-step Markov models, one would therefore expect the shortest MAW to be of length both of which are central to gene prediction and many other 15. The fact that l0 is considerably shorter (11) is therefore bioinformatic tasks [1]. More complex characteristics, such as already an indication of a systematic bias, in [22] attributed abundances of k-mers, sub-sequences of length k, have applica- to the hypermutability of CpG sites. We will return to this tions to classification of genomic sequences [2, 3, 4, 5], and e.g. point below. More intriguing is the observation that the over- to fast computations of species abundancies in metagenomic whelming majority of the MAWs lie in a smooth distribution data [6, 7, 8, 9]. around lmode, and then a small minority are found at longer The reverse image of words present are absent words (AWs), lengths. We will call the first part of the distribution the bulk subsequences which actually cannot be found in a text. In and the second the tail. We separate the tail from the bulk by genomics the concept was first introduced around 15 years a cut-off lmax which we describe below; the human lmax is 33, a ago for fragment assembly [10, 11] and for species identifica- typical number for larger genomes, while the Escherichia coli tion [12], and later developed for inter- and intra-species com- lmax is 24. Using this separation there are about 35 million parisons [13, 14, 15, 16, 17] as well as for phylogeny construc- human tail MAWs, about 0:447% of the total, while there are tion [18]. A practical application is to the design of molecular 7632 E. coli tail MAWs, about 0:053% of the total. The ef- arXiv:1509.05188v1 [q-bio.GN] 17 Sep 2015 bar codes such as in the tagRNA-seq protocol recently intro- fect is qualitatively the same for, as far as we are aware of, all duced by us to distinguish primary and processed transcripts eukaryotic, archeal and baterial genomes analyzed in the lit- in bacteria [19]. Short sequences or tags are ligated to tran- erature [21, 15], as well as all tested by us. Only a few viruses script 50 ends, and reads from processed and primary tran- with short genomes are exceptions to this rule and show only scripts can be distinguished in silico after sequencing based the bulk, see Fig. 1(c). on the tags. For this to be possible it is crucial that the tags The questions we want to answer in this work are why the do not match any subsequence of the genome under study, distributions of MAWs are described by the bulk and the tail. i.e. that they correspond to absent words. In a further recent Can these be understood quantitatively? Do they carry bio- study we also showed that the same method allows to sepa- logical information or are they some kind of sampling effects? rate true antisense transcripts from sequencing artifacts giving Can one make further observations? We will show that both a high-fidelity high-throughput antisense transcript discovery the bulk and the tail can be described probabilistically, but in August 22, 2018 1{8 10 Uniform dist. 10 6 Nb. AWs 10 Same nucleotide dist. Nb MAWs Same 2-mers dist. 4 y = 4x 10 Same 3-mers dist. 5 10 Same 4-mers dist. Number 102 Same 5-mers dist. Numberof MAWs Same 6-mers dist. 100 100 E. faecalis 5 10 15 20 25 30 35 40 45 50 5 10 15 20 25 30 35 40 45 50 Length (nt) Length (nt) (a) (b) 1010 6 Picea abies Picea abies 10 9 10 Human Human 8 Mouse M. genitalium 10 Mouse 105 Drosophila Drosophila T. kodakarensis 107 S. cerevisiae S. cerevisiae N. equitans 6 4 E. coli Cafeteria roenbergensis virus 10 E. coli 10 E. faecalis Human Herpers 5 virus 5 E. faecalis 10 Influenza H5N1 103 104 M. genitalium Hepatitis B virus Length (nt) 3 T. kodakarensis Numberof MAWs 10 N. equitans 102 102 Cafeteria roenbergensis virus 1 Human Herpers 5 virus 10 1 Influenza H5N1 10 100 0 2 4 6 8 10 5 10 15 20 25 30 35 40Hepatitis 45 50 B virus 10 10 10 10 10 10 Length (nt) Rank (c) (d) Fig. 1. Distributions of the lengths of absent and minimal absent words in genomes and random texts. (a) Number of AWs and MAWs as a function of word length in the genome of E. faecalis v583. The number of AWs grows exponentially while the distribution of MAWs shows a maximum and a decay. (b) Comparison between the distribution of MAWs in E. faecalis and the ones for a random genome of the same size using different random models. (c) Distributions of MAWs for a few common organisms and viruses. (d) Lengths of a MAW as a function of its rank for the distributions shown in (c). two very different ways. The bulk of the MAW distribution the cores from the longest MAWs are predominantly found in arises from sampling words from finite random sequences and regions coding for ribosomal RNAs (rRNAs), regions present are contained in an interval [lmin; lmax], where lmax was in- in multiple copies on the genome and under high evolutionary troduced above and lmin is a good predictor of l0, the actual pressure as their sequence determines their enzymatic prop- length of the shortest AWs.