arXiv:1809.03950v1 [q-bio.GN] 11 Sep 2018 uloie,wr- n opeso-ae etrsw con on we to features a compression-based the assigning and of task word-, the nucleotide-, in differe collec classifiers of sequence and performance reference features f the viroid compare successfully to and used framework viral been experimental NCBI’s have the sequence use genome Varwe taxonomic viral problem. a a important into of increasingly virus tics an a is placing structure Thus, genome datase agents. metagenomic biological from assembled to are o genomes various viral and new characteristics all morphological of (sub) basis ICTV the highe Historically, on the taxa. is viral Order recognised taxonomy; of erarchy virus universal a maintains and Abstract iue htaekonol rmsqec data. sequence from only coul Keywords known approach u are our previously that assignments, these these of match 17 predictions to uncla Orders our 1,834 invest assigned of We RefSeq Orders difficul of the 0.006). version more predict of are to rate combination viruses error performing which h (mean explore SVM misclassification, be k-NN the of and and 0.137); counts length of 4-mer rate genome ing error of (mean (SV combination performance Machine The respectable, Vector Support kernel. the function and (k-NN) c method the Neighbours ratios; compression DNA-specific and general-purpose aavisualisation data iu eoesqec lsicto sn etrsbsdo based features using classification sequence genome Virus k mrNtrlVco ( Vector Natural -mer h nentoa omte nTxnm fVrss(CV d (ICTV) Viruses of Taxonomy on Committee International The eoedtbss etr ersnain classificatio representation, feature databases, genome : eateto optrSine nvriyCleeLondon College University Science, Computer of Department uloie,wrsadcompression and words nucleotides, ..Ma [email protected] Mian I.S. .Wn [email protected] Wang T. .Hrse [email protected] Herbster M. 6GwrSre,Lno CE6T UK 6BT, WC1E London Street, Gower 66 k 1 = , · · · , )adisdrvtvs euntm itiuin and distribution, time return derivatives, its and 6) 1 omteshv lsie viruses classified have committees hratiue.Tdy virtually Today, attributes. ther asfir sdaetekNearest k the are used lassifiers ocasf,aduetebest the use and classify, to t tpromnei civdus- achieved is performance st fsvnIT res The Orders. ICTV seven of e ossml ecitv statis- descriptive simple ious ie nld eoelength, genome include sider i iooit eln with dealing virologists aid d in(eSq n common a and (RefSeq) tion lble iue.Sne1 of 16 Since viruses. nlabelled sfidvrss subsequent A viruses. ssified ttxni h rnhn hi- branching the in taxon st sadaentlne directly linked not are and ts rvrscasfiain Here, classification. virus or ceesll rmprimary from solely scheme tgnm sequence-derived genome nt )wt ailbasis radial a with M) stewrt e still yet worst, the as gt h ancauses main the igate ,mcielearning, machine n, vlp,refines evelops, n 1 Introduction

Virus classification is the task of placing a virus into a taxonomic scheme, a system that groups organisms together on the basis of shared features such as evolutionary history, genotypic and phenotypic characteristics, and biological properties [40]. For researchers and clinicians, the taxonomy provides insights into virus biology [45, 40, 64], enables the development of reliable diagnostic procedures, enhances epidemiological studies and facili- tates the treatment and prevention of disease [36]. The International Committee on Taxonomy of Viruses (ICTV) is the authority charged with classifying viruses into taxa and naming them [40]. It defines a hierarchy of taxa including Order (most general), Family, Subfamily, Genus and Species (most specific). ICTV (sub)committees use verifiable genotype- and phenotype-related data to propose the (re)assignment of a virus to a taxon. This is a labour-intensive and time-consuming approach, which is ill-suited to handle the influx of metagenomic sequence data produced by modern sequencing technologies. For this reason, the ICTV Executive Committee has proposed new pathways that allow viruses identified from only data to be admitted into the current formal taxonomy [65]. However, although utilising only primary genome structure to classify viruses can be effective and efficient [71, 61, 66], it may not always yield significant biological insight [64, 45]. Computational methods for comparing virus genome sequences and thus classifying viruses fall into two categories according to how the distance between sequences is de- fined. Alignment-based methods align sequences and then use an alignment score (local or global, multiple or pairwise) to determine a distance [14, 55, 37, 27, 47]. Such methods do not perform well on highly diverged viruses whose sequences cannot be aligned reliably, and are hard to scale to massive datasets due to issues of computational complexity [73]. In contrast, alignment-free methods map sequences to points in a feature space and use the distance between points in that space as the distance between sequences. Sequence representations and analytical approaches include descriptive statistics such as nucleobase occurrence and positional information [75, 79, 25, 74], matrix invariants [30], genome signal processing [77], curves [80], and images [8]. The complementary of alignment-based and alignment-free methods has been exploited to classify long and short sequence reads from next-generation sequencing technology [20]. This work focuses on three categories of alignment-free transformations of long complete genome sequences of varying lengths into much shorter fixed length vectors of simple yet intuitive features calculated directly from a single sequence; below we list some specific feature vector representations and their application to questions related to the phylogeny, taxonomy and genome biology of not just viruses but also cellular organisms. Nucleotide-based features: summary statistics of individual nucleotides. Genome length, the number of nucleotides in a sequence, has been shown to distinguish a number of differ- ent organisms [50]. The Natural Vector (NV) [25] comprises the count of each nucleobase, together with the mean and normalised variance of that nucleobase’s position in the se-

2 quence. It has been used to predict the ICTV Order of both non-segmented viruses (whose genetic material exists as a single nucleic acid molecule) [78] and multi-segmented viruses (whose genetic material exists as two or more molecules) [34]. Word-based features: summary statistics of subsequences of k consecutive nucleotides (k-mers). The k-mer NV [74] comprises the count of each k-mer together with the mean and normalised variance of its position. It has been utilised to determine phylogenetic re- lationships between mammalian mitochondrial genome sequences and 18S ribosomal RNA sequences without the need for a model of evolution. An ensemble distance measure of k-mer count vectors and NV has been used for phylogenetic analysis of multi-segmented viruses [33]. The Generalized Vector, a concatenation of the k-mer NV and the Complete Composite Vector, has been employed for virus classification [31]. The Return Time Dis- tribution (RTD) [42, 41, 46] describes the mean and variance of the successive appearances of k-mers. It has been used for phylogenetic analysis of the ICTV Family and subtyping of Dengue viruses [41] as well as genotyping of Mumps viruses [43]. Compression-based features: the compressibility of sequences when treated as “natural language” text or nucleic acid strings. Both general-purpose algorithms (for example, gzip, bzip2, xz and zip) and algorithms tailored to DNA sequences (for example, DELIMINATE, MFCompress, LEON and Nucleotide-Pair Encoding) have been used for storage, retrieval, and transmission of high throughput DNA sequencing data [13, 58, 60, 32, 11]. Here, we employ a uniform experimental framework to investigate the performance of different alignment-free representations of complete viral genome sequences (vectors constructed from nucleotide-, word- and compression-based features) and different classifi- cation methods (k-NN and SVM) on the same virus classification problem (assigning one of seven ICTV Orders to a sequence). The paper is structured as follows. Section 2 reviews the feature vectors analysed: genome length, the k-mer NV and its derivatives, RTD, and compression ratios. We consider k-mers based on both the standard ACGT alphabet and three 2-letter alternatives. Section 3 summarises the two classifiers employed: k-NN and SVM with a radial basis function kernel. Section 4 describes the dataset: the viral and viroid reference sequence (RefSeq) collection from the National Center for Biotechnology Information (NCBI). Sections 5 and 6 discuss our exploratory data analysis (EDA) and classification exper- iments. We visualise the relationship between various underlying statistical properties of complete viral genome sequences and the ICTV Orders assigned to viruses. We determine the error rates of different combinations of feature vectors and classifiers in the task of predicting Order. We ascertain the components of a feature vector important for classi- fication, and the main causes of misclassification. We utilise the best performing feature vector-classifier combination (AGCT 4-mer counts plus SVM) to identify sequences that are difficult to classify and to predict the Order for sequences without this assignment. Section 7 explores the implications of our findings. We discuss the problem of imbalanced classes and the validity of our Order predictions in light of the updated RefSeq dataset. Section 8 suggests directions for future research. We suggest hierarchical classification

3 as a strategy for assigning not only Order but all taxa in the ICTV hierarchy. We discuss distributed representations of biological sequences, a methodology proposed originally in Natural Language Processing to identify terms with a similar linguistic context. We propose graph-based semi-supervised learning as a way to predict the phenotypic properties of viruses and to infer phylogenetic relationships between them from the large and rapidly increasing corpus of viral genome sequences and the small collection of knowledge about the biological properties of virions.

2 Sequences as vectors of features

2.1 Nucleotide-based features 2.1.1 Genome Length Length, the total number of nucleotides, is the simplest descriptive statistic of a genome sequence [50].

2.1.2 Natural Vector (NV) and its derivatives The NV methodology [25] recasts a variable length nucleic acid sequence as a fixed-length low dimensional vector of global composition- and location-based statistics for each nucle- obase present in the sequence. The simple and intuitive statistics used for this mapping are the count of each nucleobase together with the mean and variance of its position in the sequence. Let S = s1s2 · · · sn be a sequence of length n where si, i = 1, 2, · · · ,n is a letter in the nucleobase alphabet, typically s ∈ {A, C, G, T }. For a nucleobase s, the count ns, mean position in the sequence µs and the normalised variance of the position ds are defined as n ns = X I(s = si) i=1 n 1 µs = X i I(s = si) ns i=1 n 1 2 ds = X(i − µs) I(s = si) nsn i=1

where I(s = si) is the indicator function which equals 1 when s = si and 0 otherwise. The 12-dimensional NV representation of a nucleic acid sequence S used in earlier work [78] contains the following variables,

NV 12 = [n, µ, d]

4 where n, µ and d are the following vectors

n = [nA,nC ,nG,nT ]

µ = [µA,µC ,µG,µT ]

d = [dA, dC , dG, dT ]

We extend this previous work by investigating the contribution of components of NV12. We derive two “simpler” feature vectors: NV4 (counts) and NV8 (counts and mean posi- tions).

2.1.3 Summary of nucleotide-based features Table 1 lists the names, dimensions and components of the nucleotide-based feature vector representations of the viral genome sequences we studied.

Table 1: Feature vectors based on nucleotide statistics.

Name Dimension Variable(s) Length 1 n NV4 4 nA,nC ,nG,nT NV8 8 nA,nC ,nG,nT , µA,µC ,µG,µT NV12 12 nA,nC ,nG,nT , µA,µC ,µG,µT , dA, dC , dG, dT

2.2 Word-based features

A word, or k-mer, is a short sequence of k consecutive nucleotides w = sisi+1 · · · si+k−1. The k-mers in a genome sequence can be extracted by sliding a window of length k over the sequence from position s1 to position s(n−k+1).

2.2.1 k-mer NV and its derivatives k-mers and the NV model have been combined [74] by generalising n, µ and d to the counts, mean positions and normalised variances of all possible k-mers,

n = [nw1 ,nw2 , · · · ,nwak ]

µ = [µw1 ,µw2 , · · · ,µwak ]

d = [dw1 , dw2 , · · · , dwak ]

5 For a nucleobase alphabet of size a, there are ak possible k-mers so the dimensionality of the k-mer NV (total number of features) is 3 × ak; the count, mean position and normalised variance of k-mers that are absent from a sequence are set to zero. The 1-mer NV is equivalent to NV12 when the alphabet is {A, C, G, T }.

2.2.2 Return Time Distribution (RTD) The RTD [42, 41, 46] takes into account both the frequency and relative order of k-mers in a sequence. The return time is the number of nucleotides between successive appearances of a particular k-mer and the RTD is the frequency distribution of those times. The RTD feature vector consists of the means and variances of the RTDs of all k-mers. t Let pwi denote the position of the t-th occurrence of k-mer wi in a sequence. If the k-mer occurs T times in total, then the t-th return time is

t t t−1 ∆pwi = pwi − pwi , (2 ≤ t ≤ T ). The RTD is summarised by the mean and variance of the series of return times,

T 1 t t µ∆pw = X ∆pwi , (T ≥ 2), i T − 1 t=2 T 1 t t t 2 d∆pw = X(∆pwi − µ∆pw ) , (T ≥ 3). i T − 1 i t=2

2.2.3 Nucleobase alphabets We investigated the standard 4-letter alphabet and three reduced 2-letter nucleotide al- phabets (Table 2).

Table 2: 4- and 2-letter alphabets used to construct feature vectors.

Alphabet Size Symbol Name Nucleobases ACGT 4 A Adenine A C Cytosine C G Guanine G T Thymine T SW 2 S Strong C,G W Weak A, T RY 2 R puRine A,G Y pYrimidine C,T MK 2 M aMino A, C K Keto G, T

6 2.2.4 Summary of word-based features Table 3 lists the features, alphabets and dimensions of the word-based feature vector repre- sentation of the viral genome sequences we studied. “Concat-k-mer NV” is a concatenation of the NV of k-mers and all their subwords; NV12 is equivalent to the k-mer NV when k = 1 and the alphabet is ACGT.

Table 3: Feature vectors based on k-mer statistics (k = 1, · · · , 6) and different alphabets.

Variables Alphabet Dimension of k-mer feature vector 1234 5 6 Counts SW 2 4 8 16 32 64 Counts RY 2 4 8 16 32 64 Counts MK 2 4 8 16 32 64 Concat- SW, 6 12 24 48 96 192 counts RY, MK Counts ACGT 4 16 64 256 1,024 4,096 RTD ACGT 8 32 128 512 2,048 8,192 Counts, RTD ACGT 12 48 192 768 3,072 12,288 k-mer NV ACGT 12 48 192 768 3,072 12,288 Concat- ACGT 12 60 252 1,020 4,092 16,380 k-mer NV

2.3 Compression-based features Algorithmic information theory has been used to construct a distance metric between (mitochondrial) genome sequences [49]. Kolmogorov complexity [44] is a measure of the computational resources needed to specify an object. It reflects the complexity of the underlying structure of the object and is defined as the length of the shortest computer program (in a predetermined programming language) that produces the object. There are no absolute measures of Kolmogorov complexity so it is estimated using (efficient) compression algorithms [72]. The compression ratio, defined as the ratio between the compressed and original file size, reflects the “complexity” of the content of the file. We construct feature vectors for a viral genome sequence by concatenating compression ratios obtained using four general-purpose compression tools and three DNA-specific compression tools that are specialised for FASTA files, the format of our dataset [13, 58, 60, 32].

2.3.1 General-purpose tools for compressing text Gzip uses Lempel-Ziv coding (LZ77). The Lempel-Ziv algorithm [82] is a dictionary com- pression scheme that works by finding duplicated strings in the input text and replacing

7 those duplicates with a reference (distance and length) to the original string. The dis- tances are compressed with one Huffman tree, and the lengths with another. The gzip implementation limits distances to 32K bytes and lengths to 258 bytes. Bzip2 uses the Burrows-Wheeler (BW) block sorting text compression algorithm [21] along with Huffman coding [35]. The BW algorithm applies a reversible transformation to a block of input text with the aim of bringing the same characters closer together, making the transformed text easier to compress with simple algorithms. Xz uses the Lempel-Ziv-Markov chain (LZMA) compression algorithm [5]. LZMA uses a variant of LZ77 with large dictionary sizes and support for repeatedly used match distances, with the output encoded with a range encoder. Range encoding is an entropy coding method which is similar to arithmetic coding. It differs in that digits can be expressed in any base, rather than just base 2 (as bits). Zip supports a number of compression algorithms. By default (and in our usage), this Unix/Linux file compression and packaging utility uses the DEFLATE compression algorithm, which itself uses a combination of LZ77 and Huffman coding [26].

2.3.2 DNA-specific tools for compressing sequences DELIMINATE[54] (Delta encoding and progressive ELIMINATion of nucleotide charac- ters) is a fast and efficient method for lossless compression of genomic sequences. During compression, the positions of the two most frequently occurring bases are delta encoded to be stored as differences rather than absolute values. These bases are subsequently elim- inated from the sequence, and the remaining bases are then represented with a binary code. This method outperforms general-purpose tools and the compression gains on large sequence datasets are dramatically higher. MFCompress [56] (Multi-fasta and Fasta Compression) is a lossless compression method that is claimed to provide better compression than DELIMINATE for similar compression and decompression times. It relies on multiple competing finite-context models and arith- metic coding [57]. One finite-context model is used for encoding the header and multiple models are employed for the main stream of the DNA sequences. LEON [18] is a method for compression of data generated from high throughput se- quencing techniques. It is based on a reference probabilistic de Bruijn Graph, built from the set of short sequences (called reads) acquired by a sequencing device and stored in a Bloom filter [19], a space-efficient probabilistic data structure used to test whether an element is a member of a set. Each read is encoded as a path in this graph, by memorising an anchoring k-mer and a list of bifurcations, enough to rebuild it from the graph.

2.3.3 Summary of compression-based feature vectors Table 4 lists the names, tools and features, dimensions and variables of the compression- based feature vector representations of viral genome sequences we studied. rtool is the

8 compression ratio computed using tool.

Table 4: Feature vectors based on compression ratios calculated using different compression tools.

Name Tool(s), Feature Dimension Variable(s) CRB bzip2 1 rbzip2 CRL LEON 1 rLEON CRBL bzip2, Length 2 rbzip2, log(Length) CRLL LEON,Length 2 rLEON , log(Length) CRLB LEON,bzip2 2 rLEON , rbzip2 CRDNA DNA-specific 3 rDELIMINATE, rMFCompress, rLEON CRGP General-purpose 4 rbzip2, rgip, rxz, rzip CRA All 7 rbzip2, rgip, rxz, rzip rDELIMINATE, rMFCompress, rLEON

3 Classifiers

3.1 k-Nearest Neighbours (k-NN) The k-NN [9] classifier was used in previous studies of virus classification [78, 31]. To predict the label of a new virus genome sequence, the method first computes the distance between the feature vector of the given sequence and that of every other sequence in the training set. A prediction is made using the majority vote of labels of its k nearest neighbours (closest sequences) where k is a parameter of the model. We used the knn function implemented in the R package class [1].

3.2 Support Vector Machine (SVM) The SVM [23, 24, 62] is a binary classifier that aims to find a separator that best separates samples from different classes. The optimisation goal is to find a hyperplane that maximises the margin, the distance of the samples closest to the hyperplane. Given training samples d (xi,yi), i = 1, · · · ,m, where xi ∈ R is a feature vector of dimension d and yi ∈ {−1, 1} is the label of the i-th sample, the primal form of the classical soft-margin problem is

9 formulated as m 1 2 min ||w||2 + C X ǫi w,b 2 i=1 T s.t. yi(w xi + b) ≥ 1 − ǫi

ǫi ≥ 0, i = 1, · · · , m.

The slack variables ǫi(i = 1,...,m) “soften” inversely with respect to C the optimisation problem of finding the maximum margin hyperplane thereby accommodating incorrectly labelled (noisy) data points [69]. The dual form of the optimisation problem with a kernel function is m 1 max − X αiαjyiyjK(xi, xj)+ X αi αi 2 i,j=1 i

s.t. X yiαi = 0 i

0 ≤ αi ≤ C, i = 1, · · · , m.

The kernel function K(xi, xj) = < φ(xi), φ(xj) > is a measure of similarity between two transformed data points, φ(xi) and φ(xj), where φ(.) is a feature map that projects the original feature vector xi into a higher dimensional space in order to enhance the separa- bility of the data points. One effective kernel function is the radial basis function (RBF) 2 K(x, y) = exp(−γ||xi − xj|| ), where γ is the inverse “width” of the kernel [70]. A new m sample x is classified as sign( i=1 αiyiK(xi, x)+b), where αi are the optimised parameters m P and b = yj − Pi=1 αiyiK(xi, xj) for any j such that 0 < αj

4 Dataset

The ICTV defines a hierarchy of taxa encompassing Order (most general), Family, Sub- family, Genus and Species (most specific). We retrieved the entire viral and viroid RefSeq collection from the “Viruses” directory of the NCBI FTP site on 18th September 2015 [4]. Each RefSeq file contains complete genome sequence data and annotation data for a virus species; any ICTV taxa assigned to the virus are present in the “ORGANISM” field of the annotations. The 4,420 RefSeq files downloaded span 3,910 non-segmented viruses − of which 211 are satellites − and 510 multi-segmented viruses. We analysed the complete genome sequences of the 3,699 non-satellite non-segmented viruses and considered only the ICTV Order assigned to a virus. Table 5 provides details regarding the dataset we studied.

10 Table 5: Non-satellite non-segmented viruses studied. The name and our abbreviation for an ICTV Order plus the number of complete virus genome sequences with that taxonomic label are shown. Labelled: the num- ber of sequences with an Order label. Unlabelled: the number of sequences with- out an Order label. Total: the number of labelled and unlabelled sequences. ICTV Order Abbreviation Number of sequences C 1,281 H 66 Ligamenvirales L 13 Mononegavirales M 156 Nidovirales N 67 Picornavirales P 135 T 147 Labelled 1,865 Unlabelled 1,834 Total 3,699

In 2016 and 2017, the ICTV added two new Orders, and Ortervirales, yielding a current total of 9 named Orders. Since ratification occurred after the research described in this work was performed, we consider only the seven Orders defined in 2015.

5 Exploratory data analysis

We use a variety of techniques to visualise different statistical properties of the complete virus genome sequences stratified by ICTV Order. The intuition and qualitative under- standing gained by uncovering features which contribute to the separability of taxonomic classes will assist us in formulating hypotheses and interpreting the results of the classifi- cation experiments described later.

5.1 Methods 5.1.1 Box plot This non-parametric visualisation is a convenient way to show key properties of a dataset: the bottom, internal and top bands of the box are the first, second (median) and third quartiles with their spacing indicating the degree of dispersion and skewness; the whiskers are the minimum and maximum values of inliers; and the circles are outliers. We used the boxplot function implemented in the R package graphics [6] to visualise the distributions of the log transformed values of various nucleotide-based features.

11 5.1.2 t-distributed Stochastic Neighbour Embedding (t-SNE) plot This is a visualisation of the output of an unsupervised non-linear dimensionality reduc- tion technique that projects high-dimensional data into a low-dimensional space. Similar (dissimilar) objects are modelled as nearby (distant) points. We used the Rtsne function implemented in the R package Rtsne [7] to visualise various feature vectors in a 2-dimension space. Using a random seed and feature vectors for all labelled and unlabelled sequences, we compute the coordinates for every virus in our dataset. We display the resultant embedding as a scatter plot with points coloured and displayed according to the following schemes. Colour can indicate the taxonomic class assigned to a virus: grey represents unlabelled viruses; red, yellow, brown, green, light blue, dark blue, and pink indicate ICTV Order C, H, L, M, N, P, and T respectively (Table 5). Point style indicates the taxonomic class assigned to a virus: circles represent unlabelled viruses; the one-letter abbreviations indicate ICTV Order. Colour can also indicate the normalised length of a sequence: normalisation is performed by scaling the natural log transformed Length to the range between 0 and 1, corresponding to the colours red and blue.

5.1.3 Ternary (simplex) plot or de Finetti diagram This visualisation is a concise way to show the relationship between three variables in a two-dimensional space. The coordinate system is organised as an equilateral triangle whose vertices are associated with three variables; the distance from a point to a vertex is inversely proportional to the magnitude of the value of the variable associated with the vertex. We used the ggtern function implemented in the R package ggtern [3] to visualise the composition of sequences using three 2-letter alphabets. The precise location of a point depends on the physicochemical property used to group the four nucleobases: strong/S (G, C) or weak/W (A, T), purine/R (A, G) or pyrimidine/Y (C, T), and amino/M (A, C) or keto/K (G, T). Vertices labelled S, R, and M indicate the percent of nucleotides in a genome characterised as Strong, puRine and aMino respectively. Like the t-SNE plots, labelled virus genomes are coloured and marked according to ICTV Order whereas unlabelled ones are depicted as open grey circles.

5.2 Nucleotide-based features Figure 1 shows the distribution of the simplest descriptive statistic of a sequence − Length − for sequences assigned to a taxonomic class; the seven box plots are arranged by increas- ing median Length. ICTV Orders can be distinguished fairly well using Length. Figure 2 shows the distributions of slightly more sophisticated summary statistics of a sequence − nucleotide count (nA,nC ,nG,nT ), mean position (µA,µC ,µG,µT ) and nor- 2 2 2 2 malised variance of the position (dA, dC , dG, dT ) − for sequences assigned to each taxonomic

12 n ieetIT resi lorflce nterslso our of results the in reflected imp also The is experiments. Orders details). technical ICTV l different for di the ing high [67] to embedding (see s sequence when space a spectrum) occurring dimensional discontinuity is the a There of to end subject neighbours. (red are sequence sep 1) well shortest Viru fairly (Figure are distributions column). classes (right Length different length together, sequence cluster or Order column) poi (left where class two-dimensions into embedded mos (rows) the derivatives b being variance correlation normalised linear and strong position and a mean sequence for exhibits the that plot in Every position mean position. count, nucleotide usi length, well reasonably distinguished be can Orders ICTV class. sequences genome virus for log(Length) of plot Box 1: Figure iue4sossatrposo h niedtstrepresent dataset entire the of plots scatter shows 4 Figure desc the of pairings various for plots scatter shows 3 Figure

log(Length) 8 10 12 14 H C L N M P T ICTV Orders 13 rtd n lse ihsimilar with classes and arated, tentetovralswith variables two the etween t r oordb taxonomic by coloured are nts geeyoeo hs features. these of one every ng rac fLnt nseparat- in Length of ortance sindt ahIT Order. ICTV each to assigned e sindtesm ICTV same the assigned ses esoa bet noalow a into objects mensional marked. t omlsdvrac fthe of variance normalised ot rniinfo the from transition mooth netsqec (blue) sequence ongest duigteN n its and NV the using ed itv ttsisgenome statistics riptive usqetclassification subsequent − V n V2 Fgr ) ftegnrlproetos bz tools, general-purpose (0.290 the dataset Of entire the 2). across - ratio 1 compression nuc average (Figure than NV12) worse Orders and consistently ICTV rat NV8 are separating compression all at tools, Although better general-purpose somewhat are class. tools taxonomic specific rat each compression different to of assigned distributions the shows 6 Figure features Compression-based (Tabl NV12 5.4 than sequence a of representation richer much a is ewe CVOdr ihls vra ewe hmrflc th reflect them between overlap less with Orders taxonomic by ICTV coloured between are represent points dataset entire where the two-dimensions of into plot scatter a shows 5 Figure features Word-based 5.3 Order. ICTV stat each location-based to and composition- assigned of plots Box 2: Figure ( 2) (µ ) ( ) log dA log A log nA 6 7 8 9 10 11 12 6 7 8 9 10 11 12 6 7 8 9 10 11 12 H C L N M P T H C L N M P T H C L N M P T

( 2 ) (µ ) ( ) log dC log C log nC 6 7 8 9 10 11 12 6 7 8 9 10 11 12 6 7 8 9 10 11 12 H C L N M P T H C L N M P T H C L N M P T 14

( 2 ) (µ ) ( ) log dG log G log nG 6 7 8 9 10 11 12 6 7 8 9 10 11 12 6 7 8 9 10 11 12 H C L N M P T H C L N M P T H C L N M P T duigte6mrN embedded NV 6-mer the using ed u hsi nysihl better slightly only is this but ) sisfrvrsgnm sequences genome virus for istics etd ttsis(egh NV4, (Length, statistics leotide o fasqec o sequences for sequence a of ios ls.Tecenrboundaries cleaner The class. 3). e o optduigDNA using computed ios hntoeotie using obtained those than atta the that fact e p civstelowest the achieves ip2

( 2) (µ ) ( ) log dT log T log nT 6 7 8 9 10 11 12 6 7 8 9 10 11 12 6 7 8 9 10 11 12 H C L N M P T H C L N M P T H C L N M P T k mrNV -mer log(Length) log(Length) log(Length) log(Length) 6 8 10 12 14 6 8 10 12 14 6 8 10 12 14 6 8 10 12 14

4 6 8 10 12 4 6 8 10 12 4 6 8 10 12 4 6 8 10 12

log(nA) log(nC) log(nG) log(nT) ) ) ) ) T A C G µ µ µ µ ( ( ( ( log log log log 6 8 10 12 14 6 8 10 12 14 6 8 10 12 14 6 8 10 12 14

4 6 8 10 12 4 6 8 10 12 4 6 8 10 12 4 6 8 10 12

log(nA) log(nC) log(nG) log(nT) ) ) ) ) 2 T 2 A 2 C 2 G d d d d ( ( ( ( log log log log 4 6 8 10 12 4 6 8 10 12 4 6 8 10 12 4 6 8 10 12

4 6 8 10 12 4 6 8 10 12 4 6 8 10 12 4 6 8 10 12

log(nA) log(nC) log(nG) log(nT) ) ) ) ) 2 T 2 A 2 C 2 G d d d d ( ( ( ( log log log log 4 6 8 10 12 4 6 8 10 12 4 6 8 10 12 4 6 8 10 12

6 8 10 12 14 6 8 10 12 14 6 8 10 12 14 6 8 10 12 14

log(µA) log(µC) log(µG) log(µT)

Figure 3: Scatter plots between different summary statistics of virus genome sequences. than the others (gzip: 0.315, xz: 0.303, zip: 0.359). The DNA specific tools are all better with LEON being the best (0.142, DELIMINATE: 0.283, MFCompress: 0.273). Figure 7 shows histograms of compression ratios for the entire dataset. The distribu- tions tend to contain one major peak and two relatively minor ones and skew towards the left side; the degree of skewness is greater for DNA specific tools.

15 ICTV Orders Length H ooo oo o HCoHCCH oCoo oooooo ooo oHHHHHoHCCCCCHCHCHHCHooCoooCoo ooooooooooooooooooooo HCHC CCoo ooo oooo o 0 CH CoCC o oooo Ho CCooCooo o ooooooo CHooCoCHo oCoo oo oooooooo ooo o CCHCooCC oooooooo o 0.1 HCoH C CoCoCCCC ooo o oooooooo HoCoHCoHoCC CCCCCCCoCCC oooooooo ooooooooooooo HoCHoHCoCCHCoHoooCoooCCC oooooooooooooooooooo o 0.2 oH CoH CCCo ooTTo oo oo oooo ooooo HooHoHH Toooooo ooooo ooooooo oooCo CCCCCCC oo PToPoToPo oo ooooo oooooo oo oooooo oo CCooooooo ooToToo oPooT Toooooooooooo oooooooo ooooo oooo ooooooooooooo o 0.3 oCCHoCooCoCCCCCC ooooooooooToP oP oo oooo oooooooooooooo ooooooooooooo o oo oooo CoHCHoooCo CCCCoCCCC Po oo ooToPo T ooooooo oooooooooo oo oo ooooo o CoCoCoHoC o ooooooooo ooPoooooPPPTPoPo P o ooooo o ooooooooo ooooooooooooo o o o 0.4 Co oooooooooooT oooo PPoooPoPo oooooPoT o ooooooooooo oooo ooooooooo ooooooo ooooooooTo ooo o PToTooToPPPPPPoToooTPoToo ooooooooo ooo o oooooooooooo oooooooo oo ooo o oo TooooPPooPoooT o TTP oo ooo o oo oooooooooooo o ooo ooooooo oooooo oT o Tooo PPooooooooooPoTPoTTooT ooooooo oooooo oo o ooooo oooooooooooooooooo o 0.5 ooo ooooo oooo TToT PoPoPooooooToTTTPTTTMoPTPo ooo ooooo oooo ooo ooooooooooooooooooooo oooo o oooToooooTT PoPoooo TT T Poo oooo o oooooooooo oooooo oo o oo oooooTToooTTT oooooo oTMTP ooooooooo ooooooooooooo oooooo oooo oo0.6ooooooo oooo ooo TToT ooooo oTo PPoToP ooooooooooo oooo ooo ooo ooooo oo oooooo ooooooooooo oooooo o oooo ooo ooo T ooo oToT PoPoPoPPPPP oooooooooo oooooo o oooo ooo ooo oo ooo oo ooooooooooo oooooooooo oooooooooooooo oo o ooooo TT o Tooooo MPooo o PoPoPPPoPoooooooooo oooooooooooooo oo o ooooo oo o ooooo ooo o ooooooooooooooooo ooooo oooooooooo oo oooooooooooo TTTTTT oPoo o PPPMP ooooooooo ooooo oooooooooo oo oooooooooooo oooooo ooo o ooooooooooo0.7ooo ooooooooooooooooo ooo oo oooooooooo o TTTTT PPP oooo Po oooooooo ooooooooooooooooo ooo oo oooooooooo o oooooo ooo oooo o oooooooo o ooooooooooooooooo oo oooooooo ooo Pooo oooo o o ooooooooooooooooo oo ooooooooooo ooooo oooo o 2 o ooooooooooooooooooooo ooooooo o ooo o ooooooooooooooooooooo ooooooo o ooo 2o2 oo oooo ooooooooooooooo o oo ooooo PMoo oo oooo ooooooooooooooo o oo ooooo oooo 0.8 2oo2oo2o2oooooo ooo o ooooooooooooooo ooo MMoooo oo oooooo ooo o ooooooooooooooo ooo oooooo oo 22o22oo2oo2o2o2ooooo o ooo MMo oo ooooooooo o ooo oooo oo 2222o2oooo2o2o2ooo oM MPMM ooooo ooooooooo oo oooo 0.9ooooo 222o2o2oo232oo2oooo N Mo ooooo ooooooooo o o ooooo NV4 2222o2o2oooo M o oooooo o o 22222oo22o2ooooo MPMMoMooNCMooooo oooooooo oooooooooooooo 22o5oo CCoCC CMoM ooMMoMooooMo ooo oooo o o ooooooooo CoCC oMo ooMo oooo oo o 1ooo CCCCCC ooMoMoo oooo ooooooo ooooooo oo CC CCCoCCCoC oMoo oo ooooooo o oooo CooCoooCCCooCoC CC C CCCCCC MoMM oooooooo ooooo oo o oooooo ooo CCCCCCCCCC oCCC Co CoC CCC NNoMoMMMMo ooo ooooooooooooo oooo o oo oooo ooooooooooo ooo C CCCC CCC CoCC oCCooCCC oMMMMMMoCMMoo oo ooooo ooo ooo o ooooooo ooooooooooooo C C C oo CoCo CL CCCCoCoCCCoC CoooCooMooM CoMoMMo oo o o oo oooo o ooooooooooo oooooooo oooooo CC CCCooCC CCCCCo CoL CCoCoCCCoCCL oo oCMoCo oo oooooo ooooo o ooo oooooooo oo oo o CoCoCoCC Co C CCoCo oC Co Co MMMMoM ooooooo oo o ooo oo ooo o oooooo C C CoCCCC Co MMCMC o oo oooooo o oooo CCCCCCCC CC CCCCoCoCCCCCCC CCCCoC NLoCCo ooooooooo o o oooooooooooooooo oooooo ooooo CC CCC CCCCo CoCoCCoC CCC CCoCC CCCoooo oo oo o ooooo oooooo ooo oooooo ooooooo CCC C CCoCCCC CCC C CCC ooooo oo ooooooo ooo o ooo CCCC CCo CCoCCC CCCCCCCC ooooo ooo ooooo oooooooo CCCCC CCoCCCoCoCCCCC CoCL oooooo ooooooooooooooo ooo CC CCCCoCCCCCCCCC CCCoCLC oo ooooooo oooooooo oooooo CC CoCC CCCCCCC oNCCC oo oooo oooooooo ooooo C CooCCoCCC C CN CCLNNoNLoNoNo o oooooooooo o oo oooooooo CC oCC CCCooCoo ooN NNNNN oo ooo oooooo ooo oooooo CC CoCC oCCo CoCoCoo NNNNNN ooo ooo oooo oooooo oooooo CCCCCC C CCooC CCoooooCoCoC oooooooo o oooo ooooooooo CCCCCCCCCCCCC CCCCCCCC oCo ooooooooooooooo ooooooooo oo C Co CCCCoCoCCoCoCCo o ooo ooooooooooooooo CooCCCoCCoC CCCCCCoCCoCCCCCoCoo oooooooo oooooooooooooooo CCCCo oooo CC ooo

Coo oo CooCoo ooooo CCHHHoCCoo oooooo HCHHCCCo ooooooo CHHCHHCoCCCoC ooooooooo ooCHo oooo CoooCoCoH T ooooooooo o oooo Cooo oToooo oo ooooo oooooo CCCoHCo oooooT oooooooo oooooo oCoCCCoCH oPToT oooo ooooooo ooooo oooo CCCCCoCC ooo oo o oToPoPoTooo oToooooo oooooooo ooo oo oooo ooooooo ooooooo CCCCC oooooPTooTooPooPPoPP oPPo Too oooooo oooooooooooooooo ooooo ooo C oooo ooo TooooToo PPPoo o oooo ooo ooooooo ooooo CoCC oooo ooT PoPooT PoPPP ooo oooo oooo ooo ooo ooo ooooo o CC CoCH Ho oooTooooTTT PPooooooooPToPooP o oo oooo o ooooooooooo ooooooooooooooo o CCooCoHCo Ho ooooTTTo oPoPooooooooooooT ooo oooooo oooooooo o oooooooo ooooooooooooooo ooo oooooo HCoCCHoCCooHCoHC oooTT T ooooooooooooooo ooPTPTTo ooo oooooooooooo oooo o ooooooooooooooo oooooo C CoHCoHH HCCoHH ooooo TTTTT o oooo PoPoToTo o oooooo ooooo ooooo ooooo o oooo oooo oCCCHCHo oCH ooooooToT TTTTTT o ooPoooT ooo o ooo oo oooooooo oooooo o ooooooo CCCoCC o Toooooo To o ToTPoTTToPo oooooo o oooooo o o oooooooo CCCCooCo oo o ooo oToTTTTTPTTTT oooooooo oo o ooo ooooooooooo CCCCoooCoHoHo Too oooo PPoooooToToooooTMoPoT ooooooooooo ooo oooo ooooooooooooooooo 2 o oCooCoHooo oooo ooooooo TPPoPoo oT P PTTPTo o ooooooooo oooo ooooooo oooooo o o ooooo 22222o2oooooo CoHoCoHoHo oooooooooooooo oo M PPoPoP ooooooo oooooo oooooooooooooo oo o ooooo 23222oooo oCo ooooo oooo o oooPoPoP PoPoP oooo oo ooooo oooo o ooooooo ooooo 522o22oooo CHoo oooooooooo oooo o TP Poooo ooooo ooo oooooooooo oooo o oo oooo 22ooo CHoHCooC ooo ooo oo PPPooP oooooooooo ooo oooo ooo ooo oo oooooo ooooooooo 222oooooooo oCooCooC oooooo o o ooooooooooo ooooooo oooooo oooooo o o ooooooooooo 22222o2oooooooooo CoCoCCCC oooooooo Pooooooooo ooooooooooo ooooooo oooooooo oooooooooo 2o2o2o2oooooo CC o ooo Po ooooooooo oooooo oo o ooo o ooooooooo 22ooo C CoCoC ooooo PPoPMPoooooo ooo o oooo ooooo oooooooooo CCoC oooooooo Pooooooo oooo oooooooo oooooooo C CCCCC oooooo Po oo o oooooo oooooo o oo CCCoCCCCCCCC o ooooo oo oo ooooooooooooo o ooooo oo CCoC ooooooo oo PM oo oooo ooooooo oo oo oo CC ooooo MoMMMooooooooo oo ooooo ooooooooooooo CCCC oCC oooooo MoMM oooooo oooo oo oooooo ooooo oooooo NV8 2222ooooo CCCC o C oo M ooo ooooo ooooo o o oo o ooo 22222ooooo ooooo CCC CC o ooo ooooo ooooo ooo oo o ooo 24222o2oo2222oooooooooooooooooo CCC MCo MMMoCoMo ooooooooooooooooooo ooooo o o ooooo 42o2o2oo oo CC MoMPM oMooooMo oooo oo oo ooooo oooooo ooooooooo CCC CCoCoCCCC oMMPMoMN o ooooooooo ooo oo oooooo ooooooo o 222ooooooooooooooooo o CCCCCCCC CoCC CCoC o oNo ooooooooooooooooo o ooooo ooo oooo oooo o oo 2222o2o2o2oooooooooooo o CCCCo CCoCCCo Co CooCCCCoCCC o ooooooooooooooo o ooooo ooooooo oo ooooooooooo o 22222o2ooooooooooooooooo ooo C CCo CCCooCo CoCC C CCCCC MoMoMo oooooooooooooooooo ooo oo oo oooooo oooo o oooooooo oooo 2222o222oooooooooooo ooo C C C C CCCoCC ooMoMo ooooooooooooo ooo o o o o ooooo oooooo 22222o22o2ooooooooooo ooooo C CCoCCoCCCoC C CCCCoC Moo ooooooooooooo ooooo o oooooooooo o o oooo ooo 22222oooooooo C CCCCCoCC CCo CCCCo oCooo oooooooo o oooooooo oooooooo ooooo 22oo o CCC CoCCCCC CCooCoCC CMoMoMM oo o ooo ooooooooo oooooooo oooooo CC CCoCCCCCCoCC CCooCC oMM oo ooooooooooo oooo o ooo oCCCCCCC CCCCCCCCC o CCCCCoCC ooMMMMMooNoC ooooooooo oooo oooooo o ooooooo oooooooooooooo oCCCCCCCCCC ooooCCCCCoCCCCCCCCCCCCCC L CoCC MMoN oooooooooooo oooooooooooooooooo ooooo o ooo oooo C CCCCC Co o CCCoCCCCoCooL CCCC oo o oooooo oo o ooo oooooooo oooo oo CCCCoCCCC CCCCCCCCCCCCCCC CCCoCL ooCMoMooo ooooooooooo oooooooooooooooo ooooo oooooo CCCo CCCCCCCC CCCoCo ooooMoCo ooooo oooooooo ooooo ooooooo C C CCC MMMoC o o ooo ooooooo CCCC CoCo CL MCMCoCCCo oooo oo oo ooooooo CCCCCCCoCC CC CoCCCLN oCCooo oooooooooooo oo ooooo ooooo CCCCCCoCoCo C CooCCCNo oCC CC oooooooooo o oooooo oooo ooo CoC CCoCCoooooooCoooo CNCLooCLCCCC ooo oooooooooooooooo ooooo oooooo CCCCCCCooC oo o NNNNNNNNCNoNCoCNLo oooooooooooo oo o ooooooooooooooooo CooCC NNNoNNNCoLNN ooooo o oooooooo

CCC ooo

oCo oo oooo Cooo oo ooo ooCCoHCoC oooooo oCC oo o ooCCoHCHHCHC oooooooooo oooCoCHCCCH oooooooooo CooooCoHo Ho oooooo o CCooHo oT oooooo oo CCCoo ooooo ooooo ooooo oCoCCCoCoHCoC ooooTT ooooooooooo ooooo oCCCoCCCCC ooo o oPTTo ooooo o ooooooooo ooo o ooooo ooooo CCCCoC oooToo PoooPoPPPPoPoPoo Toooo oooooo oooooo ooooooooooooo ooooo oooooooPooToooT oPPPooToo oooooooooooooo o ooooooo CCC oo ooo ToooTo PPPoT PPP ooo ooo oo ooo ooooo ooo o ooo o C CoCCCH oooooooo TTo PPoooooooPPPP o oooo oooooooo oo oooooooooooo CCCooCHCo T ooTT T PPooPooooooooPTooo oo oooooooo o oooo o oooooooooooooo oo HCoCoCHoCCHH oooo oTTTo ooooooooooooo T ooo ooooo ooo ooooooo o oooo ooooo ooooooooooooo o ooo ooooo C CoHCoHoHCoHCCH oooo TT TT ooooooooooooo oPPToPTT o ooooooooooooo oooo oo oo ooooooooooooo ooooo oC CHCCHo HoCoHH ooooooTT TTTTTTT o ooPo oPooTo oo o oooo ooooo oooooooo ooooooo o ooo oooo CCC o o oToooo TTT T ooPoToTPo ooo o o ooooo ooo o oooooo CCCoCCooCo oo oo To ooTToTTTTTT ooooooooo oo oo o ooooooooooo CCCCoooCoHoH Too o ooPoooToToo TPTTTMoo oooooooooooo ooo o ooooooooooo ooooooo 2 o CooooCHoooo oooo o oooooooo PoPooooT PoooPMPTPTo o oooooooooo oooo o oooooooo ooooooo ooooooooo 22222o2oooooo CoCHoCoHoHo oooooooooooooooo TPPPoooo TPTPo ooooooo ooooooo oooooooooooooooo oooooooo oooo 232222o2ooooo Coo ooooooooooo o oMPoP PoPoPPoP oooooo oo oooooo ooooo o ooooo oooooo 5222o22ooooo CHo oooooooooo ooooPoo PT oooooo o oooooooooo oooooo o 2222oooo CHoHCoo ooo ooo ooo ooPT Poooooo o oooo oooo ooo ooo ooo ooo oooooo o 222oo2oooo CoCoCoC oooooo o o PPPooP ooooooooooo oooooo oooooo oooooo o o ooooo ooooooooooo 2222oo2o2ooooooo CoCoCoCCC oooooooo ooooooooooo ooooooooo oooooooo oooooooo ooooooooooo 22o2o2o2ooooooooo CCC ooo o Poooooooo ooooooooo ooo ooo o ooooooooo 222o2ooo C Co oooooo oPo ooooooo oooo o oo oooooo oo ooooooo o CCCoCo oooooo PPPPMPooooooo o ooooo oooooo ooooooooooo CCCCCC ooooooo PoP ooooo oooooo ooooooo oo ooooo CCCCCCCCCCC oooo oo ooo ooooooooooo oooo oo CCCCoC ooooo oo o oo ooooo ooooo oo o CC ooooooooo PoMo oooo oo ooooooooo oo oooo CCC oooooo oooo MMMMoooooooo ooo oooooo oooo oooooo oooooo 2222ooooo o CCCC oC o MoM ooooooo ooooo o ooooo o o o oo ooooooo 22222ooooooooooooo CCCCCCC C M oooo ooooooooooooo ooooooo o o oooo NV12 222oo222o oooo oooooo CC Mo ooo oo oooooo ooo oo 4224o2oo2o2oooooooo CC MCo oMooCMoMo oooooooooooo ooo o oooooooo ooo CC ooC MMoPMM oMoooo ooo oo ooo ooooo oooooo 2 oooooooooooooo CCC CC oCCCCC o PMoMN oM oooooooooooooo ooo oo oooooo o oooo oo 2222o2oooooooooooooooo CCCCo CCCC CCoCC o oNo ooooooooooooooooo ooooo o oooo ooooo o oo 22222o2ooooooooooo ooo CCC CoCCCCoCoCo CCoC CooCCCCoCCC o oooooooooooo oooo ooo ooooooooo oooo oooooooooo o 22222o2o2oooooooooooooooo ooooooooo C Co CCoC CoC C CCCCCC MoMoMo oooooooooooooooooo oooooo o o ooo ooo o ooooooo oooo 222222o2o2o2ooooooooooooo o C CC C C CCoC ooMoMo oooooooooooooooo o o oo o o oooo oooooo 22222ooooooooo o CoCCCoCoCCC CoCCCCoC NoMoo ooooooooo o oooooooooooo o oo oooo oooo 22oo o ooo CC C CoCCC CCCoCCCCo oCoMoo oo o ooo oo o oooo oooooooo ooooo CoCCCoCCCCCoCC CoCoCC CMooMM ooooooooooooo ooooooo oooooo oCCoC CoCCC oCCCC CCoCoCC ooMMM oooooo ooooo ooooo ooooo o oooooo CCCC C ooooCCCCCCCCC CoC C CCoCC ooMMMMMooNC oooooo o ooooooo oooooo oo o ooooo ooooooooooooo CCCCCCC oCCCCCoCCCCCCCCC L CC MMoN ooooooo ooooooooooooooo oooo o oo oooo CCCCCC CCo CCCoCCCCoCooL CCCCL oo ooooooo ooo oooooo ooooo ooooo oo CCCCCoCCCCC CCCCCCCCCCCCCCC CCCoC ooCMoMooo oooooooooooo ooooooooooooooooo oooo oooooo Co CCCCCCC CCCCoCo ooooMoCo oo ooooooooo oooooo ooooooo C C CCC MMMoC o o ooo ooooooo CCCCC CoCo CL MCMCoCCCo oooooo ooo oo ooooooo CCCCCCCCCoC CC CoCCCLNCoCooo ooooooooooo oo ooooo ooooo CCC CoCoCo CC CoooCCCNo oCCCCC ooo ooooo ooo oooooo oooo ooo Co CCoCoooooooCoooo CNCLooCCLCCC o oooooooooooooooo ooooo ooooooo CCCCCCCooCCooo o NNNNNNNNCNoNCoCNLo oooooooooooooooo o ooooooooooooooooo CooCC NNNoNNNCoLNN oooo o oooooooo

CCC ooo

Figure 4: t-SNE plots of the NV4, NV8 and NV12 representations of labelled and unlabelled virus genome sequences with points coloured by ICTV Order or genome length.

16 CCC CCCCCC CCCCCo CC HHH CooCCoooHoCHoooo CCCCooCHCHCooHoooHoooCoHCCCCCCCo CHCoHCHCHoCHoHoC H CCCCCooCCo H CC oHoCoCCH o CCCCCoCHCH CooCoH CCoCCooCCC Co oCooCC CCoCH o ooCCoCHHH CCoCCHoH C HCoCo CC HHCoCo C HoHooHooCoC C HCoCoooooC oHoCoHooCCCHCCC oCoCoCoH C C o CCooCoCCoHoC C ooooooo CCCCC ooooooooooo CCCC ooo oo P P o o o oooo o P P o oPoooooooooo oooo o P P oooPoooooooooooo oo o oo ooooPPoooPoooooooooooooooooPP o o oo o o o o ooooo o ooooMooTooooMPoooooooooooooooP ooo o oooooooooo o o o oooooooooooooooooooPoPooPTToTooPPoooPoooooooooPooPoo o ooooooo ooooooooooooooooo o oo o P oooooooooooooooooPPoToPTooTTTTTTTPPooPoPoooPooooPooPoo oooooooooooooooooooooooooooTo oT ooooo ooooooooooooooooPoooPPToToTToTTPTToToPooTooPMPPooooo ooooooooooooooooooooooooooTooTooooooT ooooooooooooooooooPooPoTPoTooTToTPToPoToPoooooPoooooMM o o oooo oooooooooooToooooTPooooooooPToPooooooooooooooTTooTPooo TPoPP Po oooooMoM M o oo o oo o ooToToooooooTooooPTooToPooooooooooooToooTToPooT M ooo oMoCo o o o oo T ToTooToTToooToTooTooooToPPTooPoooPPoooToTToPPo o C o MMMoPo o TTTTTTTTToToTToPoPTTToPTPPPPTPoPooooooT MMMMMoMMoMoMo oo o ooo TTTTTooTTTToTTTTPTPPPPPPooooo o P MMMMPMoooooooo ooooo ooooooo o oTTTTT PoPoPPPooooo MMMMMooNoPo o oooooooooo oPPPPoo oo MMMMMoMoMMMMooC oo oooooooooooooo o oo oo o oP ooo MMMoMMoMoMoNoo o ooooooooooooooo o oMMMoo oooooo ooooo oooooMM ooooooooooooooo oo o oCoCMM oooooooooo o oNMMM ooo ooooooooooooo ooo CC oo oCoMoMMMMM ooooo ooo ooC oC ooNMMMMMMM oo o o C CCCCCCoCC CoNoMMMMoC ooo oo CCCCCCCCCC oooooMMMCCC oo o CCC CCCCCCCCCCo CCooCooMMoo ooooooooooo o CCCC CCooMoMCMCC CCC oooooooo o CC CCCC oNNoMoCMMoC oooooo oooooo o CCCCC o o CooCoL oooooooooo CCCC oCoC oooooooo o CCCCCoCC C CCNCCNNCNNooL oooooooooooo CCCCoCCC C C C o CCCCNNNNNNoLoCC ooooooooooo CCC CCC C C CCCCCCoLCNNNNNN C ooo CCCCCCC C C C CCCC C CCCooNNNNoN C CCCCCoCCCoCCCCCCC CC CCCC C CCCCCCoCoCooo C CCC CCoCCCCCoCoCCCCCCCCCCCCCCCC C CCC NoCoCCLooLNNNN C CCCCC CCoCCCCCCCoCCCCooCCC CCCCCoCCCCCCCCCCCCCCCoCCCoC oooo CCCoCCCoCoCoCCCoCCCCCCCCCCCCCCCCCCCoCoCCCCCCCCC o oCCCCCCoCCCCCCoCCCCoCCooCCCCooCCCCC CCoCoCCCoCCCCCCCCCCCoCCCCoooCoCCoCCCC C CCCCC CCCCCCCCCoCCCCoCo CCCCC CCC CCC C oCCCCC CCCCoCCCCCC C CC CCC CCCCCCC C o C CCCCooCoCCooCo CCC CC CCCCC CC CC CCCoCoCCCCCC CC CC CCC C oooo CoC CCCCCC C o CCoCo C CCoCCC CCC C LL CC CC

Figure 5: t-SNE plot of the 6-mer NV representation of labelled and unlabelled virus genome sequences with points coloured by ICTV Order.

5.5 Sequence composition based on 2-letter alphabets: SW, RY, and MK Figure 8 shows the distributions of the ratio of characters in each alternative alphabet − Strong/S (G+C) to Weak/W (A+T), Purine/R (A+G) to Pyrimidine/Y (C+T) and Amino/M (A+C) to Keto/K (G+T) (Table 3) − for sequences with and without (”U”) a taxonomic class assignment; the eight box plots are arranged by increasing median Length and the horizontal red dashed line indicates a ratio of 1.0. The median values of these three ratios are useful guides for grouping and distinguishing between different subsets of ICTV Orders. For example, S/W > 1 for H (the reverse is true for the other 6 Orders), S/W < 1 for C and L, S/W < 1 and M/K < 1 for N and P, and S/W < 1, R/Y > 1 and M/K > 1 for M. Orders with the smallest and largest median values are as follows: S/W (L: min, H: max), R/Y (N: min, M: max), and M/K (N: min, T: max).

17 pcfi bto)cmrsintosfralvrsgnm seq genome virus gener all by for achieved tools ratio compression compression (bottom) of specific Histograms 7: Figure sequen genome genera virus by for achieved ratios tools Order. compression compression of (bottom) plots specific Box 6: Figure

Frequency Frequency DELIMINATE bzip2 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0 100 300 0 100 200 300 0.0 0.2 0.4 0.6 0.8 1.0 . . 0.8 0.4 0.0 . . . . . 1.0 0.8 0.6 0.4 0.2 0.0 T P M N L C H bzip2 compressionratio T P M N L C H DELIMINATE compressionratio ICTV Orders ICTV Orders

Frequency gzip 0 100 200 300 0.2 0.3 0.4 0.5 0.6 0.7 0.8 . . 0.8 0.4 0.0 T P M N L C H gzip compressionratio

Frequency MFCompress ICTV Orders 0 100 300 0.0 0.2 0.4 0.6 0.8 1.0 . . . . . 1.0 0.8 0.6 0.4 0.2 0.0 T P M N L C H MFCompress compressionratio 18 ICTV Orders

Frequency xz 0 100 200 300 0.2 0.3 0.4 0.5 0.6 0.7 0.8 . . 0.8 0.4 0.0 T P M N L C H xz compressionratio ICTV Orders uences. lproe(o)adDNA- and (top) al-purpose

Frequency ICTV each to assigned ces LEON -ups tp n DNA- and (top) l-purpose 0 100 300 0.0 0.2 0.4 0.6 0.8 1.0 . . . . . 1.0 0.8 0.6 0.4 0.2 0.0 T P M N L C H

zip

LEON compressionratio Frequency 0 100 200 300 0.2 0.3 0.4 0.5 0.6 0.7 0.8 ICTV Orders . . 0.8 0.4 0.0 T P M N L C H zip compressionratio ICTV Orders atr fSWfrIT resHadCadMKfrNadTetn s u extend codon T and [12] and are categories findings N virus Our for specific least. rou on M/K the are is focus and C S/W that for and works C M/K symmetric most of and the pattern H is alp the R/Y whereas Orders and ratio, Order ICTV large ICTV a for towards most between S/W varies concentrated of sides are pattern both ratios on the dispersion Although of composition alphabet-based class. reduced taxonomic overall nucleob the Amino/Keto stratifies and Purine/Pyrimidine Strong/Weak, of a h ags ainewis n r iia.Different similar. are M and R whilst verti variance between are edge largest the Points the to has S coordinates. vertex S connecting ternary line using median sequence the a of content nucle g and alphabet-based sequences reduced genome virus the unlabelled of and Scatterplots Order-labelled 9: Figure nucleotid sequences. alphabet-based genome reduced virus the unlabelled of and plots labelled Box 8: Figure log(Length) iue9sosasatrlto h aua o rnfre v transformed log natural the of scatterplot a shows 9 Figure iue1 ipasterltosi ewe h vrl Str overall the between relationship the displays 10 Figure 8 10 12 14 o o o o o o o o o o o o o o o o o L o L o o C C C o C o o C o C o o C o o o o oo o C o o C C C C o o o o o o C C C C C o C C o C C o C C C o C o o L o C C C C C o o C C C C C C o C C C C C C o C C C C o C M C o o M o . . . . . 3.0 2.5 2.0 1.5 1.0 0.5 C C C C C o o C o o C C C o N o o C o C o C C o C C C C C C C M C C o C C C C C o M M o o C M C o C C C C C C M C C C C C C C C C C C C C o C o C C o o o o C P o M o o C L o o o o C o M o C C C C C C C o C C o o C C M C o P C C C C o C C C C o C o C C C C N M C C C C H C C P C o M C C C C C C C C C C C C C C C M C C C C o C C C C C C C C C C M o C o N o N o C P C C C C C L o C M C C C o C o C C C C C C C o C C C C C C C C C C C N C o C C C C C C C o C C C P C C o C o C o C o o C o P M M C C o C C P o C C o N C o C o P C o C C C M C C C C C o C C C C C C o C C C C o C o P M C o L o o C o C H C C o o P N C o C o C C C C C C P C C o o C o C C H C C P C C C C C o T C C C C o C P N M C o C o C C o M C C o C M o o o o o N o o C N M o C C N C o N o C T o C o T N o o N o o C o L C o o o C L L o o o o C L C o o N N C P o o o N o C o C C T C P N C T C C o C M o N o C o C C P o C C C C C o o o N C o C P N C C o C o M N C N o o MM o o o N P o N C C C o C o o o C C P M o C N o N N C o o o o o o o o N C N C o oo C M o o o o o o o o P o T o o o o C L P o P o o o M P N o M o o o N C C C o o o M o o C o o M C o o o o C o o M P C H C P o C M o C C M M o C C C C C o o T o C C o o o P C C P o o o C o C o o o P C C o T o o C o o C o o o C C o C o M o C C C C o P o C C o C P N C C C C o o N o o o C M C C N o o C C C C o C C o o o C o o o o P o C o C T C C o o C P o o o o C L C C o o o M P o C o T C o M C P o C o P o P o C M C o C C M o M o C C M N o o C o T C C C o M C C o o o o M M C C o o o C o o o C o C C o C o o o o N o Strong / Weak N o o o o o C o o o T o C C o M o o o o o M M C T M o M o P o o C C M o o o o o M o o o o o o C o C C N C o C o o C C C o H C o o C C C o P o C P o o C o o C o C C o o C C C o o o o o M o C C o P o o o P N o o M o P o P C C P o N M o o o o o o o o o M C oo o T T o C o C o o M o o o o o o C N o o T M o M o o o o o T o o C o o o o C o C o M o C o o T P T o o M C o o o o C o N o o C N o o o M o N o C o o C C C C C o o o o C C o o o o o o o M o o T o o H o o o o o C o M o o H C o M C o P o o o o o M C o o o o o o o M C C o o o C o T C o C o o o o o M C C M o C o C o C o o o T P M M C o o o N M T N C o o o o o o C C o C oo o C o M o C o o o T o o o C o o o P o o C o o o C o o M o o o o o o o M o o o N C N o o C o o C M C o o o P o o M M o o o C o M M C o o o M o C o C o o o o o o o C o o o o o M C C o M o o o C o o o o H o M M T o o o o o T o o o C P o o C C o o o M o H o o C o o CC o C o o o N o M o o o o o o o o o T T o o o o o o o o o o M M C P T o C o o o o o o o o o o o P o o o o C C C o M T T o o C o o M M C C o o o o H C o o C C C o P o o o o o C M o o o C C o o o o P o C o P o o o T P C o o o C P o o C o o o o M C o H P C o o C C o o o o o C oo N C M o o o o o M M N M T o o o C P o o o M o M P o o o o o M o o o o o o o N P o o o o o o o o C o o o o o P o o o o o o o o C o o C o oo C o o o o M o C C o o P C P o o C o o o o o o C o C o o o o o M C o o o o T P P o M o o o o o C o o C C P o o C P o o o o M o o o o o o o o o o o o o o o C o o P o o CC o N T o o o o o P o C C T C o P M o C C o o o o o o C o o o o M C P C o o o o o o o o C o C o o C C o o o o T C H o C o M o o o T T o o o C P o o M o M o M o o o C C o o o o C o o o o o M C o T o P C C C o o o T M o o o o C C M o C o o o o o M o o o o o o o M C o o o o M C o o C o C C o o M o N C o o C o o o T o o M o P o o o o C o P C C o M o o C o o o o o o T o o C o o o o o M P o P T o C o M P o o o C o M C o C C o M o H o o P P o o C o o T P o o o o C o o o P o o o o C o o o T C o o o o o o T o o P o o o o o o H o o o T o o C o o C P o C o o P o T T M P o C o C C P o o o C C P o o o P o T o o C o C o o o o o o o H o C C C C o oo C P C o o o o o o M o T o P o o T P o o o o o P M P o o o o P H o o M o o P o M o o P M T o o o M o C o o o o o H o M oo o o N C o o o C o o o C o C o o M o C o o T P o o o C o o H o P o o o o N o o T o o C C o M o o C P o o oo o o o C o o o o o o P M o o C o H o o o o o C o o o H o o P 0.5 1.0 1.5 2.0 2.5 3.0 o o o o o o T C T C o P N T o T H o o o o o H T T M o o o o o L o o T o P C o o o o o o P C C o o o o C M C o o C o C o o o o o o T C o M H C T o C o C T T o o o o o C C o C o o o o C P o T o T o C C o C o o o C o o o C o o C o o o o H o o o C C C T C o o o C o C o o o C P C T o M o T o o C o T M o C C o o o o o T o T o o o P o o o o o o o o T o o o T o H o o C o C o o o o o o o o o C o o C o o o o o o o o o P C C o o o o o o o o o o o o o C o o o C T o o C P C o o T o o o o o M o o o o o C C o o o T T o C o C o o o o T o o o o o o o o o P o o o C o C o C o o T T o o o C o o M o o o o o o C T o o o T C o T C C o o H o T o C o o o o o o P H C C T o C o M C C C C T C C o o o o C C T C o o o o o C o T o o o C o N o o P o o C C C o C C o o o o T C o C o o P o o C o C T o C o o C C C C C C C o o o C o o o o C o C C o C o C C o o P o C o o C C o M o o C C C C C o C C C C o o C C o C o o o C o o o o C C T C C C o N C M C o o o C C o T C T C o M o o C C o T o C o o o o o o o C C C o P o o o C o o o T o C N o o o N o o o o o H T o o o o o o C C o C o o C C o o C P o C C T o o o o o C o o o M o C o o H C o o o o T C C o C C C C o N C C C o M o o o T C o C o o o o C o T C o C T C C o o C o T T C o o o o o C o o o o o M o o o T M P H C C o T o C T o M C o o o o o o o T o o o o o o C C o o o C o C P P o o o o o o o o o C C o T o o o o H N o C o o o o o C o o o o C o o o o o o o o o C o C C o C C o o C C o o o C o C o o o o C C M N T PP o o o o C o o o C o o o o C T C C o C C o C o C o C o o T H C o o C C o o o T o o o o o o o H o o C C o o o C o o C T o H o o N o o o H o o o o C C o o T o o o C C o C o C o o C C o C o o o o C o o C o o o T o o o C o T C o o o o o H T o C o o o o M o o o T o o o o o H o C o o o o C o o C CC T o o o C o o o o C o C C C CC o o C o o o C o o C C o C o o o C C C o C o C o U T C N o o o C o o o o T o C o o o C H P o o o C o C o o C C o C o C C o o o H o o C C C T o C o P C C T C T T C o o o C o C o C o C C o C o H o o o o o o o o T o o o o o o o o C P o o o C C o o o C o C C ICTV Orders Strong/Weak C P C C C o P P C o C o C C o o C o C o o o o H o o o C T C H C C C C H C o C o o T o o o oo C o o o C C C C o o o o C H o T o o o o C P P o C o C C o C o C T o C C o o o o o C P o o o o T P o o C T o o o C T C T C o o C H H o C C o o C C C C o o o o C o P C o o o o C T C o o o C C o o o o C C o C o o H o C o C C o o C C C C C C P T C o C o oo o o H o T C o H T H C C C o C C o o C C o T o o o H C L N M P T o C C o C C o o o C C C C C C C C C H C C C C C C H C H C C C C o C C C C o o C C C o C C C C C C C C o o C C C C C C C H C o C C o C C C T o C o o H C C C C C C o C o o C C C o o o C H o C C o o T C C C C C o C C C C C C o C C C C C C o C o C C C T C C C o C C C C o C C C o C C C C C o C C C C C C C o C C C C C C C C o C C C C C C C C C C C C C C C C o C C o C C C C C C o C C C C C C C C C C C C C C C C C C C C C C C C o C C C C o C C C C C C o C C o C C C C C C C C C C C C C C C C C C o C C C C C C C C o C C C C C C C C C C C C C C C C C C C C C C o C o o C C C C C C C C C C C C C C C C o C C C C C C C C o o C C C C C C C C C C T C C C C C C C C C C C C C C H C o C C o C C C o C C C C C o H C C C C C C C C C C C C C C o C H H C o C o C C C C C C C C o C C C C CC C C C C o C C C C C C C H C H H H H H

log(Length) 8 10 12 14 Purine / Pyrimidine ...... 1.6 1.4 1.2 1.0 0.8 0.6 0.4 0.4 0.6 0.8 1.0 1.2 1.4 1.6 T T T T T U T T T T T T T T T T P P T T T T T P T T T T T T o P P T P T PP H C L N M P T P P P C o P P o o T P o P o C o o T C o o o P C C o o o o o o o C o o C C o o C C o o T C o o o T T o o C o C o C P o N C C C o C o Purine/Pyrimidine N C N P o o o T C o o o C T o o P T o C P o C C C P C o P C C C o o P C C C o C o T C N o o M o o o o o P o o N N o o N o M o C C P o P N o M o C T o o C o o o o N oo T o o o o o o T o o o o o N o o o o o o o ICTV Orders C o o o C T P o o o o o o o C N o C N o P C o C P C C o C C C o o o C o o H o C C o C C o o N o o P o N o N P o o o 19 o N o N T o o N o o N P o o C C N o o o C C C o o C C C o C C N P C N C o o o N o C o o o C C N oo o C o C N o T o o o N o T o L C o o T N C o o N o o N C o o o N C N C o o o o N T o C o C C o N o o o o T o o N C C o C o o C C N o o o o C N o C C C o L o o o o C P C o o C L P o C o o C C C P o o C o L o P C o C o C o C o P o T C o o T oo C C C N C C o o C o C P o o o C o o o o C o o o C C C C o o o o C C o o o P o C L L C C o N o o N o o o o o o C o o o o o C C C C C o o C C o o o N C P L o o o o o o o C C o T o o o C C C o C o o C T P o o H C o o H o C o o o o C o o C o o o o C C o o o P o o o C o C o C o C o C o C o P o o o C C o o o N P o N o o o o o o C o C P C C o o o H o C o M C o P C C C o C C o C o C P P o P o C C C C o o o C o H o C o o C C o C o o N N C C C o C C C o oo o C C C H N C C C C C o C o o o N C C o o o o C o P C o C C o C o C o C C N o C C C C o C C C o N o o o o o C C C H C C o o C o C o C C o H o P N C C C C C o C C P C C P T C P o C o C C H C o C C o o T H P o o o C C C C o o C o C C C o o o C o o C C o C C o o C o o o H P o o o o C P C o o o C T o C C C N C o C C C o C o o C C C H C o o H C C o o H C C o P H o o o o o C o C o o H C o o o C o H o o C C o H C C C P o C o H C o C o N C H o o o o C C C o o o o C o o H o o o C H H H C o o H C N o o C C oo o C C o o o C o o o o o o o C o o o C H C o C C o H C o o o C o o H o N H C o o o C C o C C C N C o C C o H M o C o C C C o C C o C o o H C C o C C C o o o C H o C C o T C C o C o o o o o o C o o o o C o C o o C o o C C C H T o o H C C C C o o N C o o C C o o o C o o H H o C o o o H o C o o o o o C H N C C o H o o H C o o C o o H o o C o o o C C o o o o o H o o o o o o o o o o C o o o H C o H o o o o o o C C o o o C N o o T C o o o o T H o o C o o o o C C o C o o o o o o C o C o o o C o C o o o o o C P C C oo o o o o C C H H o C C o o C o o o o C o P C o o C o H o o o o o o C C o H o o o o N C C o o o o o o o o o o o C o H o C o o o C C o o o o o C o H o o C C N C o o C o M C C o P o o o C o o C M o o o o o o o N C o o o o o C o C o o o o o H C o o o C C C H C C C C o o o o CC N N o o o o T C o o M C o o o H C o C o C o o C o o o C C o o o o P C o o H C o C C C o C o o o o o H C o C o o N C C C M C o C o o o C o H o T C P C C o C o o H C C o o o C o C oo o C o o C C o C o o C C H T o C C C C C o o C C P C o C C o C C P o C C o H o H C C o o C o C o C o C P o P o o o H C C C o C o C o C o o o o C C C C o C o C o o o o C o o C C C C o C o M C C o P o o o o o o o o C o C M o C o o o C H o C o o o C C o C C o o o o o C o N C CC C C M C o o C C C o C o o C o o o o CC C C T C C o P o o o C C o C o P o o C C C o o C M C C o o o C o o o C C o C C o o C o C o C o o o o o o C C C C C o o o T P o o C C C C o C H P C C o o o C C C o P o C o M P N o o C C o C o C C o C C C o C o o C P P o C o C o T o C C o o C o C P o o o o C o o C o o o C C o o T o T MM C o o o o C o C C C o o C C C o C C T C C M C C P o o C o C o C o C C o o C o C o C o o C C C o T C C C o H o T C P T C o C o C C C C C C o o C C o o C C C C o C o C T o C L o o C C o o C P CC o o o M o o C C o C C M P M C o C o o C C T o M C P o o o o o P o M P o L o o o C o C o o M C C P o o C T C o o C C o o o o C C C o o C M C C C o o C o o o M o T o o M C o o C o o o o o C P o o o C o o C o M C o C C o C T o M o o o o T C o o o C o M o C C C o C C C o C o C o C o C o T C o C o C C M P o C C C C C L C C C o C C o T C o o C C o M o C o o L P C L o o o o C C C T o C o C C C P o o C M o C C o o o C C o C o C C o M T o o M C C C o C C C M T M C M o o C C o o C C o T C C T o C o o o o T C o o o o C C C T C C C C o o o H C C M C C C C M C C o M C P C C o o C o C C C o M C C o C C T C o o C o P o o M C C C o C C C C C o C C T P o N o M C o C C M C C o o C C C C C o o o C C o C C C o C C T C C o C o M o T o C C C C o C o C C o o o C P o C C C o C o M C C T o o P C C o C o C C C T o P C C C T C C C M M C C C C o C o o o o o C C o o o o o C C o o C o o C C C o C C M o C C o P P C o C o C M o T C o C C o P C o C P C o M M C o C M C o C M o C o C C C C o C C C o C o C M M C o o C P o o C C C o C P o o o C o C o o o T C o C o o o o C M P C o T o T M o C o o M C o C o o T o o o C C o o C o o C C o T o T o C C C C C T o C T o C o o o o C o o C C C C o M o T o o C C o o o C o C o C C o o T o C o o C M C o o M C P C o P C o o C oo C o o M C o o o o o o o o C C o C C C C C C o P o o C C o C o M M C o o C o o o M o C o C C o C C o o C o o o o o C C o C o C M C C o M C C o o o C C C C C C P C o o P C o o M o o C o T M o o C M o o o o o o T C C C o o o M C o o o o o o o C M o P o C o o C M o P o o T o o o P C C oo o o o M o P C C C o M o M o C M C o C C C o o o C C P o C C C o C C C o o o C C o o M o o o o C o o C o C M C o o o M o o C o o o C o C o C C C C C M M o o o M C o C C o T o o o C o o C C o o C o T C o M C o C o o C C o C o o P o o C P M o C o o o C o o o C P o o o o o o o o C o o o P o C o o T o o N M o C M o o o o M C o o o o o C o C o C T C o o o o M o o o M C C o o C o o o o C C C o o o C C C o o o o o o T o C C C P o C o o C C P o C o P o M o o C o o o o o C C o o o o C M C o o M C P o o o o o P o o M C T C o o C o o o o C o P M M o o P o o C C C C o o o o o o o o C M o C o o M P C M o o M o C o M o C o o o o o P o o M o C M T o o T o T T M M o o o o o M o o C M o C C o o o o C o C o o o o C C M P o o o T o M o o o o C C o o T o o o o o C o o o M o P o C o o o o M M o C C P C M o T C o C o o C T o C o o M o M o C P oo C o C o o C o M o C o M M C o C o C o o C o o C o o o C o C o o o o o M C C o o o o o C C C M C L o T o C o C o C o C M o C C C o T o o C C o o M C o C o C o o M o o C T C C C P o o C C M T o M C C o o C M C C o o o o C C o o o M o o C C o C C C M P C C o o T T o o o C o C C M o M C o o o o o C T o C o C T o o C o o C M C C o o C T o M T o o T o C o o T C C o o C C o o o C o C o o o o T o o o C o o T o M o C C o o o o C C C o o C T C o C o o o C o o o o o o o o o C o C C o o o C C T o o C C P C C o o o C o C o C o C o o o C o M C M C o o C o o C o o C T M o o o o o C C o o T o C o o C C o C o o T C C o C C C M o o T o C o o o C o C o C o M o o o o C o o o o o C C o o C T o o o M o T o P o o M C o o o o C C o o o o o o o C o o o C o M C C o C o o o C M o o o o o C o o o C o C o o o o o C M o C o C C C o o o o o o o o C C C o o o C o o o o o C C C C M C C o C o C C o C o C o C M o o C o o C C o o o C C o C C C C C C C C o o o o o o o o C o o C C o o C o o C C C o o C C o C o o o o o C o C o C o o o C o C C C o C C o o C T o C C o M o C C o C C o o C C o C C o C o o C C C C o C C M C C o C o C C C C o o C o o M C o o C C T o o C C C o o C o o C o o o M C o o C o C o o o o o T o T o o o o o o o o o C o o C C C o C o C o o C o o o o o o o o o o o o M C o C T o o o M o C C C C o o M C o C M C o C o C C o o C o o C o C o C C o o o o M C o o o o o o o o o C o o C C C o C C M C C o C o o o o o o o o o o o o o o Amino / Keto fvrsgnm eune by sequences genome virus of

nm length. enome 0.6 0.8 1.0 1.2 1.4 1.6 1.8 ae[6 68]. [16, sage opsto CVOrder- ICTV composition e e n niaigthat indicating M and R ces td opsto fICTV of composition otide log(Length) leo eghadteratio the and Length of alue ssi eune Length sequence. a in ases ocnrtdmil along mainly concentrated yaon n,tedegree the one, around ly U 8 10 12 14 one. about symmetric ghly ae.Freape the example, For habet. n,Prn n Amino and Purine ong, H C L N M P T C C C C C C C C C C C C C C C C CC C C CC ossetwt previous with consistent C ...... 1.8 1.6 1.4 1.2 1.0 0.8 0.6 o o lse edt favour to tend classes o N N oo o o N o N o N N N N N o N N o N N N o o o o o P o N N o N o N N M o P N P N o N o N o N N o o o o o o N N P P N o o o o o o o o o N o o o o o N P o P o o o o o o N o o o o o o N o P o N o o o P o o o o o o o o o o o P o P o o P o o o o o P o o o T P o o o o o o o o o o N o o o o o C o o o o P C o o C o C o o o C o o o o o o o o o o o o T o o T o o o P o o o o o o o o o o o o o o N T P o o P o o o o o P o o P o o o o o o o P o o N o o o o o o M o o o o o o o o o o o o o o o o o o P o P P o o o T o o o o N C o o o o T o o o o N o o o o o o o o P o o T o o o o o o T o o o o o o C o o o o o o o N P T o P o o o o o P o o o o o o o o P o oo o o o o o o M o o o o oo T o o o o o o o o o o o o o o C o o o o o o o o o o T P o T o o o o o T P oo o o o P o o o o o o o P o o o o o o o o o T o o o o o o o N o o o P o o o o o o o o o P o o o C o o o o P o N o o o o o o T o o o o o o N o o o P o o o o o o o o o o C C o P o o P o o o o o o o o o o C o P o o o o o o o o C o o o o T o T o P o T o o o o o o o o C o o C o o o o o T T P o P C o C o o o T o o C o o o C C o P o o o T o T P C o C o o C o P P P P o o C o o T o o o C o o o o o o o o T o o o o C C C C o N o o C o o o o o o C o C T o C o oo T o C o C o o o P o P o o o o o C o o o C o o o C T P o o o C o C o o C T o C o o C o o o o o o o T T o P T o C o o o o o o C T o C o o C o C C o C o o o C M C T N o N o o C P o o P o o T o o C o o o o C C C C o o T P C T o T T o T o C C C C o o C C C o o o o C o P P C C o T o o o C o o o o C C o o o C o C C o C o o C o C o o T o o o o C o o o C o o o o o o o o o P o o C P o C C o C C o o P o C o o o o o T o oo C o C C o P o C T C C o o o C o o C C T o o o C o o C C o o C T o o o C C C o C o H C C T N o C C o C o o C C C o C o o C T o C P C C C C C C C C C C o C C C C C T C o T C C o C C o C C C C C o P C C C o o C C C o C H C C o C o C C C C o C C o o C o C o o o o C o L T M o o C C C o N o C o C o o C P C o C o C T P C C o o C C C o C C C o C C o C o o H o o C C C C C o C P C C o C o C C o C C C C C T C o C T o o o o C C P C P P C C o C o o C o T o o C H C C C C o C C o o o C C C C C C C o C C o L o o C C o C C o C C C o o C C o H C C o C C o o o C C o o M C C o C C P C o o T C C C C T C o o o o o C C H C C C C o H C C C ICTV Orders C C o C o o o H o o P o C C L C o C C o H C C o C C H C o o H C o T o M o H C C C C o o o C C o o C o C C C o o C o C o T o o C T o C C C o o o o C C o o o o H C C C C T o C C C C o H C H C o o o o o o C o C C o C C o C o o C C C o o C C C o o C C o C C C C C P o o o o o CC C C o o C C o C o o o o C C M C T H C C C N C C o o o o o C o o o C o C C C C C C C o C C o C C o L C o o C C C C C C o C P C C C o C C C C o C o C C C C H C H C C C o o o C o o H C C C o o C C C o C C o o L o o o o o o C C o C C o C C C P o C C C o C P o C C P o C o C C C C C o o C C o C o o C H o C o C P o C C C C C o o C C o o C C C C C C C C o H o C o C o o C C o C T o o o o o o C C o o C C o C C o P o C C C o o C o C C C C o o o o C o H C C CC o o C o o C o H C o C C C C o C o H C C o o M o C L o o o C o C o C C C C N H C C C o o C C C o C o o o C C C o o C C oo C C o o o N C o o o H C N o H C o M Amino/Keto C o o C o C C C o o o C o o o C o C C H C C o o o o C C o o M C C C M o C C C o C o o C M C o o C C o o o C C o C C C o C C C C o o o C o o C C C C C C C C H C C C C C C M o C o C C C o C o o C o C C o o o C C C o o C o o o C C o C o H C L o o o C o M M o C o o o H H C o o C C C P o o o C o o M o C o C o C o o C o C o o o C C o o o C o o o C C o o o o C C C o C o C C C o C M C o C o o o o o C o o C o o C C C C o o C C C o o o C C C M o C C o o o o o o T o o o o C o C C C C C o C C T C C C C o C C C H C o o C o C C C o o o o H C P o C C C o C C C C C C C H C o o o o C o o o C o o o o C C C o o C M C C C o H C o CC M o C C C C C C o C o o o o H C o M o o C C C C o o C o C o P o C o o o H C o C o o o C C C o C C o o o C o C T o o C C C C C o o o P C C C o o C C C C o o o o C o C C C C C o C o C C H C C C C C C o o C o P C C M M C o C C C o o C H o C C o M o C H M o o o o C N o L o o H C o o C o C C o o C C C H C C o o o C o o C C C C o o H C C C C o C o o C o C o C C C C C C C T o M o o C C o C o H C M o H C o o o o L o C M o L C o o C o C o C C o C o o C o o P o H M C o M C C C o C o o C o o C C C o o o C C C C P C o o o o C o C o o C o o C M o C o C C C o C C C C C o C C C o o o C C C C C o o o C C o o H C M o o o C o o C o C o C o o C C o H C C o o H C C o M o o P C C C C P o C C o C C o P C C o C C o o C o o P o o C C M o C o C C o C C C C C C C o o C o o o C o C C o C C C M o C C o C o C M o M o o o C T o C C M o C C o o o C o C o o C C C C C C C C C C N P C o o o C o o C o C o C o o H o P o C o C C C C L o P o o o C o C C oo C o o C C C o o C M C o C L o C o o M o o C C o C o C o H C M C C C P C C C o C M o C o C o C o o M o C C o C C C o C P o o o H o C M o C o o o M C o C C C C C o C C o C H o o C P o C C o o C C o C C o L C C C C C M o C o M C C C C o C o o M o o o o o o o C H C C o C o C o C C o P C C o o o C C o P C C C M C o C C o C C o C C P C C o C C C o C o C o P o C M C o C M C o C C o C P o o C C C o o C o H C C C C o C C C C H o C C C o C M o C C C C M o o C M o C C o o C o C C M o C o o H o C P o o C C M C o o o C M o o o o C P C C o o C P P o o C o o o C C M o o o M C C o C M M P o C C o C o M o o M P C C C C C C o o M H o o C M o C C o M C C C N M o o oo C M o o C C C C C C C C C C C o C C C o o M C C C o C P C o o o N o o P o C C o o C M C C C o o o C C o o P o o o C C o C o o C C T C C C C o o C C o C C M C M M o C C o o T C N H o o C o o o P M C M C o C o M o C C M o C o C P o M C o o o C C C C C C C C o H C M C o C C H o C o C M o C C C o C o o C M C o H C o M C C o o H o M C C C C o C C o o C o C T o C C o o C C C o o C C P M C o C C o H o C C o C o C M C C o o o MM C o C C C o C M o o o C C M o M o C C C P o P C o C M C H C o C C o C C M C C M o o o P C M C o o C C o o M M C C C o o o M o o C C C o M C M o C o C P o o C C o C C C C o C o C o o M C C C C C o o o C C C o C M C o C o C C C PP o o M C o o C C M C o N C o o o o C o o o o o o o C o o C P C o o C o C o C C P C M o C o C C M o o o C C C C o o C C o o C C C C M o o o C C o C C C C o C C C o o C C o o M C M C o C o C C P o C o o C M M o C o C C o C M C o o o M P C o C o o o o o P o C M C o o M M o C C o o T M M M o o o M o o o M C C C o o o o C o o o C o C C o M C M o o C C o C M C C o C o C C C o o M M C o o o o C o o o C o C C o M o o o o o o o C M M o o o T P o o C M o o o P o o M P o o o o M o T o P o o o o o C o o o o C M o C o o o o C o o o o o o o o C o o o C C o o C o M C C o o o C o o o C M o o C C o M M C C C M o o o C o o o C o C o C o C o o C o o o o o P o o o o o o o T o o o o P o o o T C o o o P o o o o o o o o P o o o o T o o o M o o o C o o o C o M C o o o o o o o o T o o T o o o o o M o o o o o M o o o M o o o o o o o o T o o o o o o o o o o o o o o o o o N o o C o T o T T o o C o o o o o o M o o M o o M o o o o M o o o o M o o C C o o T o o o M o o o o o o o M T M o N o o o o o o T o N o T T o o o M T o o C o o T T o o o o P T o T T o T T o o o o o o o T o o o o T o T o N T o T M ignficantly M o M o o T T o M T o N o o T o o o T o T P o o o T o o o o T T P T o o o o o T T T o o T o o T N T N o o T T o o o o o T o o o o T T o T T T o o o o T T T T T N o T T o o T T o o N o o T o o o o o o o o T o T T o T T T o oo T o T o o o N T T o o T T different ratios and display different patterns. For example, the region M > S > R is dominated by ICTV Order T. ICTV Order H is concentrated at the median line and spans a large range, whereas M and N favour smaller S.

R 80

20

60

CC 40 CCCo o ooo oMCCoCCCCoo o oooCM C o o o CMCMCCoCo o ooToCoCoCCCCooCoMCC o o o oo PCoToCToCCoCCPoCoCoCCoMCoCMoCCoooo oooo ooooCPooPoCoNoCoPCToCoCoTMCoCPCooCMCoCCCCoCCooCoooLoo oooTPoNoNooNToPooTPoCCoToCMCoToPCMooCoPoCMooCToCoCoLCoCMCoCCCMooMCoooLoCooCo oo oooToToToTooTNoNoPTNoooNMoNPoPTMoCMoNoToNCoPMoCNooMCooPCoCoMCoPMooCMoMMCooCooooCMooCoo 40 oo ooTooooPToPoTNTNoTPooCTCoTNoMCoPNoPCoMNoTMooNoMoCTPMoCoTPCMoCoMCooPCMoPCoPMoCPCMPoCCoCoCoooooooCoMo oo oooooooTooooTooCoNToooTCooPNTooMToCoMCNoTPoCPoNCoPToCMoPCMoCTMoTCoPToCPMoCPoLMooCooMCoCoCooCoCoCMoCCoo o oo oTooooToNoPToCoNooCoPoCTPoToPooCoToMCoPTMoCPoMCoPoMNToCPMoCoMoTCooPNCoCoMCoooooPCoHoMPCooLCoMCoLMCoMCoCLCoLHoCoo C CCC ooooCoCCCooCTPooCooCMToCToPoooPMoCoTPoCoMoCPMoCoMCoMoCPoMNCoCooMPCoooMCoMCoCMLooMCoMoCNooCoMCHoCHoo C o o o ooooooCoCooCoCoCoCoooCoPoooCPoCoPCPCoLMoCPoNMPCoNMoCHoMCoMPoCooHCMTHCPooPMCoTPNMHoCMPooMCoCoCooCooNCNoooooo o ooooCCooCCoCoCooCoCPoCooCToCPCooPCoHMCoHPCoPTHoCPHoCoHCoMoCoMCPHoCMoHCPMoCNPMCooCoCPCCoTCCMoCMCTC No oo o oo ooooooCoooCoCoCNoHCoCoCoCooCPoCooCoCoMHoPoCoCoPoHMoCTCHoMHPoooPCNTPooPCoCTCCoCooC Co ooCooo oooooCooCoCooCoCoMCoCoCoHCooCCHoCoMoPooPPoPMCoooMoooCoNoTooTMoN ooo CooMCoCCoNCoMoCooCHCooHooPCoooMoPoTooooCTCC N 60 ooooCCCCoCCoCCoCoCoHoCoNoCoHCooCoHooPoNoCoCCPoooCooPNooooooTTo o o oCCoCCCCCCCCoCoCoHCoHoHCoCHoMCoCoHooCooPCooNPoCoTooooToTCoTTooT o CCoCoCCHoCHoCHCCCHoooCoCHCoooooCHoNoNooPoooToo T oTTo o oCCoCoCCCCoHCoCoHoCoCoCooCooooo o oTPToTToTTo N o CoCCCoCoCoCCoCoCHCoCCCooC No oo T N N CoCCHoCCoCCCCCCoCCCoCCCoo o o o T T CCCHoCCoHCoCCCCCoooooo P P PPo oTTTo C HC o PTPPTT T T H C P oTooTooTT HH C P T T 20 HH o TPPPT T T C T T T TTT T TTT T T T T T 80 S 80 60 40 20 M

Figure 10: Ternary plot of the reduced alphabet nucleotide content of ICTV Order-labelled and unlabelled virus genome sequences .

6 Classification experiments

6.1 Methods We conducted three different types of experiments. Common to all three experiments was the following preprocessing and parameter selection procedure. To improve statistical stability and performance, we normalised features so that each variable in the feature vector has zero mean and unit variance. To tune the parameter(s) of a classifier, we employed 5-fold Cross-Validation over a range of values (Table 6) in a grid search fashion. The combination of values giving the lowest mean error rate on a set of feature vectors was selected. First, to assess the classification performance of a given combination of feature vectors and classifiers, the 1,865 labelled viruses in the dataset were divided at random into training

20 Table 6: Range of parameter values used for cross-validation.

Classifier Range of parameter values k-NN k ∈ {1, 2, 3, · · · , 10} RBF-SVM C ∈ {25, 26, 27, · · · , 217} γ ∈ {2−18, 2−16, 2−14, · · · , 213} and testing sets in the ratio 75% : 25%. The training set was used to tune the parameters of the classifier using 5-fold cross validation and the classifier with optimised parameter values is used to predict the ICTV Order of the testing set. The result is an error rate for this particular split of the dataset. This procedure is repeated 10 times with random seeds, each giving a different training/testing set division. We report the mean and standard deviation of the testing error rates over 10 different 75% : 25% partitions of the dataset. We plot the confusion matrix of the classification results in order to examine the performance of individual classes. The results are reported in sections 6.2, 6.3 and 6.4. Second, to identify viruses that are difficult to classify, we conducted a leave-one-out experiment where the training set contains all the labelled viruses except for the one to be tested, and the testing set contains one single virus. The experiment is repeated until every labelled viruses has appeared in the testing set. The virus is considered to be difficult to classify if the predicted label does not match the true label. The results are reported in section 6.5. Third, to predict the taxonomic class of currently unlabelled viruses, we identified the combination of feature vector and classifier with the smallest mean test errors. We constructed a training set using all 1,865 labelled viruses and used it to learn a classification model. The learned model was used to predict the ICTV Order of every one of the 1,834 unlabelled viruses. The results are reported in section 6.6.

6.2 Nucleotide-based features 6.2.1 Overall classification performance Table 7 shows the mean and standard deviation of the error rate for every combination of classifier and feature vector representation in the task of predicting the ICTV Order of a virus. For a given feature vector, the classifier achieving the lowest mean error rate is highlighted in bold. For both classifiers, the error rate decreases from Length to NV12. Length performs respectably, NV4 is a significant improvement over Length but NV8 and NV12 are only marginally better than NV4 − the benefit of adding more complex features (mean position and normalised variance) to nucleotide count is limited. Although k-NN is better with respect to Length, NV4 and NV8, the best classification performance is achieved by the combination of RBF-SVM and NV12.

21 Table 7: Classification performance of different nucleotide-based feature vectors and clas- sifiers.

Feature Classifier vector k-NN RBF-SVM Length 0.137 ± 0.013 0.144 ± 0.012 NV4 0.059 ± 0.010 0.073 ± 0.009 NV8 0.051 ± 0.011 0.053 ± 0.009 NV12 0.049 ± 0.007 0.038 ± 0.008

6.2.2 Simple features can give respectable classification performance The EDA supports the classification results (Table 7). The plots of Length (Figure 1 and Figure 9) as well as the composition- and location-based statistics underlying the NV and its derivatives (Figure 2) show these features are fairly good at separating ICTV Orders. The improvements of NV8 or NV12 over NV4 is explained by the scatter plots between Length, counts, mean position and normalised variance (Figure 3): all pairs exhibit a strong linear correlation with that between mean position and normalised variance being the strongest. The t-SNE plots (Figure 4) show that even in an unsupervised setting, distinct clus- ters corresponding roughly to ICTV Orders are present. Length plays an important role in the embedding procedure. Classes having similar Length distributions (Figure 1) are neighbours in the t-SNE plots, with a smooth transition from the shortest to the longest sequence. The similarity between the NV8 and NV12 embeddings indicates that nor- malised variance adds a negligible amount of information to mean position, most likely because these features are strongly correlated (Figure 3). Thus, total length conveys a large amount of information about a virus genome sequence and its decomposition into the count of each nucleobase underlies most of the discriminatory power of NV12.

6.2.3 Small classes tend to be confounded with large ones with similar Length distribution Figure 11 breaks down the classification errors (Table 7) by taxonomic class. In each subplot, the x-axis and y-axis represents the true and predicted ICTV Order respectively. Colours represent the percentage of viruses from one class (column) being predicted as a member of another class (row). Diagonal entries represent rates of correct classification (true class equals predicted class) whilst off-diagonal entries are rates of misclassification. From left to right and bottom to top, classes are ordered according to decreasing size (Table 5). The size (number of members) and error rate of a class are clearly correlated. When using Length with k-NN or RBF-SVM, the accuracy of the larger ICTV Orders C, M,

22 T, P are consistently above 0.7 whereas those of the smaller ICTV Orders N, H, L are generally below 0.5. However, similarly sized classes can have significantly different errors. For example, ICTV Orders H and N have 66 and 67 members respectively but H has a much higher chance of being misclassified. This may be caused by the Length distribution (Figure 1) since viruses from H have a higher chance of being misclassified as C. Class size and Length distribution are key contributors to classification performance. Errors arise mainly from small classes, which tend to be confounded by larger ones with similar Length distribution. However, viruses from large classes are unlikely to be misclas- sified into small classes with similar Length distribution because all the confusion matrices are non-symmetric (the bottom right has higher values than the upper left). In summary, larger classes have higher accuracy and higher false positive rates whilst smaller ones have lower accuracy and higher false negative rates. These results echo the proximity of different classes in the t-SNE plots (Figure 4): large classes such as the ICTV Order C are well separated from the other classes but the ICTV Order H and L are largely confounded by C (indicating higher errors in the confusion matrices).

6.2.4 Performance improves from NV12 more for small classes than larger classes When comparing the confusion matrices from Length to NV12 (Figure 11), the values along the diagonals increase (an increase in correct classification) and values in off-diagonal regions decrease (a decrease in misclassification). Like overall classification errors, the improvement is noticeable going from Length to NV4 and modest thereafter. However, the behaviour of small classes (for example, ICTV Orders N, H, and L) and large classes (for example, ICTV Orders C, M, T, and P) differ. For small classes, the change is dramatic from Length to NV4, and still noticeable from NV4 to NV12, especially when using the RBF-SVM. For large classes, the improvement from Length to NV4 exists but is much less noticeable than for small classes, and almost negligible from NV4 to NV12. This suggests that a more complex feature vector improves the performance for small classes in particular, but for large ones only marginally so. Owing to the bias in size, the extra performance obtained from small classes seems negligible in terms of the overall classification errors (Table 7).

6.3 Word- and compression-based features This section extends the previous section on features based on single nucleotides to features derived from k-mers and the entire sequence. The aim is to investigate the predictive power of different features and classifiers and identify the best combination for the task of virus taxonomy.

23 k−NN SVM L L H 0.8 H N 0.6 N P P 0.4 Length T T M 0.2 M C C 0.0 CMTPNHL CMTPNHL

L L H 0.8 H N 0.6 N

NV4 P P 0.4 T T M 0.2 M C C 0.0 CMTPNHL CMTPNHL

L L H 0.8 H N 0.6 N

NV8 P P 0.4 T T M 0.2 M C C 0.0 CMTPNHL CMTPNHL

L L H 0.8 H N 0.6 N P P NV12 0.4 T T M 0.2 M C C 0.0 CMTPNHL CMTPNHL

Figure 11: Confusion matrices for the classification results of different combinations of nucleotide-based feature vectors and classifiers.

24 6.3.1 Overall classification performance Table 8 shows the best performance achieved for feature vectors constructed from vari- ous word- and compression-based features. The best combinations of feature vector and classifier are listed and that with lowest mean error rate is highlighted in bold. Richer alphabets improve classification performance. The k-mer counts of the 4-letter alphabet (“Counts of ACGT”) performs better than that of the concatenated three 2-letter alphabets (“Concat-counts of SW, RY, MK”) which are better than that of the individual 2-letter alphabets (“Counts of SW,” “Counts of RY,” and “Counts of MK”) which in turn are better than the 1-letter alphabet (Length). When ACGT is the alphabet, k-mer counts outperform RTD, possibly because RTD has lost crucial information on genome length. Although the performance of a feature vector constructed by concatenating k-mer counts and RTD (“Concat-counts and RTD of ACGT”) is better than RTD, it is still worse than k-mer counts. The k-mer NV of ACGT significantly outperforms NV12 (1-mer NV) (Table 7). Whilst longer k-mers contain richer information, they do not always improve performance because the 6-mer feature vector is not the best. For k-mer NV of ACGT, as k increases, error rates follow the same trend in that they first decrease and then increase. The cause may be redundancy and noise. Likewise concatenating statistics of subwords of a k-mer (“Concat- k-mer NV of ACGT”) does not improve performance over k-mer NV; k-mer counts of ACGT outperform k-mer NV of ACGT. By analysing SVM weights of “Concat-k-mer NV of ACGT,” we find that counts of the largest k-mers are associated with higher weights. This agrees with our findings in nucleotide-based features that counts play a key part in classification while mean position and variance are far less important. These results are encouraging because the dimension of the feature vector can be reduced by more than a third from the concatenated k-mer NV to the simpler k-mer counts. Compression-based feature vectors underperform k-mer based feature vectors as their features are relatively coarse, but their error rates are still respectable. Compression ratios derived from DNA-specific tools outperform those from general-purpose ones.

6.4 Using the best feature vector – classifier combination to address biological questions A feature vector constructed using 4-mer counts of ACGT and the RF-SVM classifier gives the lowest error rate (Table 8). This feature-classifier combination is used in our experiments in the following sections.

6.5 Identifying viruses that are difficult to classify Table 9 lists the 8 out of 1,865 analysed viruses that are misclassified (the true label does not match our predicted label) and of these, 6 are from small classes (H, N and L). Table 7

25 Table 8: Classification performance of different word- and compression-based features and classifiers.

Features Best combination Error rate Length Length, k-NN 0.137 ± 0.013 Counts of SW 6-mer, k-NN 0.052 ± 0.012 Counts of RY 6-mer, RBF-SVM 0.030 ± 0.006 Counts of MK 6-mer, RBF-SVM 0.044 ± 0.008 Concat-counts of SW, RY, MK 6-mer, RBF-SVM 0.016 ± 0.006 Counts of ACGT 4-mer, RBF-SVM 0.006 ± 0.002 RTD of ACGT 3-mer, RBF-SVM 0.014 ± 0.004 Concat-counts and RTD of ACGT 4-mer, RBF-SVM 0.009 ± 0.004 k-mer NV of ACGT 4-mer, RBF-SVM 0.007 ± 0.002 Concat-k-mer NV of ACGT 4-mer, RBF-SVM 0.007 ± 0.003 General-purpose compression CRGP, RBF-SVM 0.092 ± 0.007 DNA-specific compression CRDNA, RBF-SVM 0.089 ± 0.008 Compression-based CRA, RBF-SVM 0.073 ± 0.011 and Figure 11 provide an explanation in terms of both overall error rates and those of small classes. Table 9: Difficult viruses.

Accession No. Name Label Predicted NC018874.1 Abalone herpesvirus H C Victoria/AUS/2009 NC005830.1 Acidianus filamentous virus 1 L M NC009965.1 Acidianus rod-shaped virus 1 L C NC024709.1 Ball python nidovirus N C NC001609.1 Enterobacteria phage P4 C T NC002515.1 Mycoplasma phage P1 C P NC005881.2 Ostreid herpesvirus 1 H C NC019413.1 Sulfolobales Mexican L C rudivirus 1

6.6 Predicting the ICTV Order of unlabelled viruses We used the best combination of feature vector (4-mer counts of ACGT) and classifier (RBF-SVM) to train a model from all labelled virus genome sequences. We employed the learned model to predict the ICTV Order of all unlabelled virus genome sequences.

26 Table 10 gives the number of currently unlabelled viruses predicted in each taxonomic class. Table 10: Predicted ICTV Order for unlabelled viruses.

CHLMNPT 273 80 0 85 12 403 981

7 Discussion

7.1 Overview of this work In this work, we performed a systematic study aimed at comparing and contrasting the predictive power of various features of virus genome sequences and classifiers in a virus taxonomy task. We find that for nucleotide-based features, the simple Length already gives respectable performance, with a mean error rate of 0.137. NV4 gives a significant im- provement but including higher order moments reduces overall error rates only marginally. Given the importance of Length, smaller classes tend to be confounded with larger classes that have similar Length distributions; performance on these classes can be improved with more sophisticated features and classifiers. We find that a richer alphabet improves performance, as well as a suitable k-mer, but in- cluding higher order moments or using longer k-mers is not always beneficial. Compression- based features perform worse than most k-mer features, with those based on tools tailored for DNA sequences outperforming those based on general-purpose ones. The best feature- classifier combination in our study is 4-mer counts and RBF-SVM (the lowest mean error rate of 0.006). Using the best experimental settings determined in the study, we identified 8 out of 1,865 viruses as difficult to classify. These may warrant further attention from virologists (for example, it is possible that the “true” label present in the RefSeq file may be incorrect). We also predicted the ICTV Order for unlabelled sequences.

7.2 Validation of ICTV Order predictions using an updated dataset The NCBI RefSeq dataset is continuously updated as new genomes are sequenced, taxon- omy criteria are updated and scientific discoveries are made. A number of changes were made to the dataset since we started the work. A version downloaded on 20th August 2017 contains 3,721 non-satellite non-segmented viruses, an increase of 22 compared to the dataset used in this paper. This more recent version of the dataset contained 101 new viruses but 79 viruses in our version had been removed. We found that 17 out of 1,834 viruses that had no ICTV Order in our earlier version of the dataset had an ICTV Order assignment in the later version: 16 from Caudovirales and 1 from Tymovirales. Of these,

27 we had correctly predicted 16 and misclassified one. The taxonomic labels for the misclas- sified virus (Xaxnthomonas phage Cf1c) had changed significantly: the later version of the dataset had assigned it to a completely different ICTV Family and Genus. The changes to the later dataset are small compared to the one we used for the study so the classification models and findings remain relevant. However, the experimental frame- work can handle significant changes in the dataset such as the addition of two new Orders by the ICTV.

7.3 Imbalanced classes The virus genome sequence dataset is imbalanced, with larger classes having significantly more samples than smaller ones. Small classes consistently give higher error rates, a general issue associated with almost any classifier as their aim is to minimise the overall error rate. From a statistical perspective, a small number of samples may not be representative of their population, hence models built on them are less able to generalise to unseen samples. Efforts were made to improve performance for small classes. We tried resampling methods [15, 29] but found that although they can be beneficial to small classes, they are harmful for large ones and degrade overall classification performance. However, we found that more sophisticated features and classifiers are useful especially for small classes; in particular, we observe noticeable improvements when the features are changed from Length to NV12, then from NV12 to k-mer features. Nonetheless, insufficient data is an inherent disadvantage of small classes that is unlikely to be eliminated completely. The same problem also holds for human experts, who are more likely to misclassify unfamiliar or rare viruses. A practical consequence of applying classifiers trained using a biased data set is that label predictions for a small class imply a high level of confidence in the assignment, whereas predictions of a large class do not. Hence, for the currently unlabelled viruses, we would be more confident in the predicted class membership of viruses if they are classified into a small class such as ICTV Order N, H or L.

8 Future work

8.1 Word-based feature vectors We studied features based on the k-mer statistics where a precise match was required between a sequence of k consecutive nucleotides and a given k-mer. A natural extension is to allow mismatches with gapped k-mers [51, 48] which are short sequences similar to the given k-mers but containing different patterns or having different lengths.

28 8.2 Hierarchical classification The current classification is performed only at the Order level. A natural extension is to identify the entire taxonomic path of a virus by classifying from Order to Species. The challenge of classification at deeper levels is that the number of classes is large and the number of samples per class is small. Hierarchical classification is a potential solution as it exploits the hierarchical relationship between classes and thus improves classification performance; however, while there are examples of its application to various biological sequences such as gene expression and protein sequences [63, 28], there is no explicit study demonstrating its use in the classification of viruses into ICTV taxa based on reference genome sequences. Related works, including [78, 31, 33] typically use flat classification for individual ICTV taxonomic levels without accounting for hierarchy.

8.3 Distributed representations of complete viral genome sequences: word and sequence embeddings In our current work, the feature variables were all manually designed. In the future, we can explore the efficacy of feature representations generated by automated methods. For instance, the word2vec methodology [53], originally proposed for NLP, generates vector representations for words using neural network language models. A similar model has been successfully applied to construct features for biological sequences [10]. In this work, the features used to represent complete viral genome sequences were all hand-crafted. In the future, we can explore the efficacy of features produced using auto- mated methods for a variety of virus taxonomy-, phylogeny-, and genome biology-related tasks. For instance, the word2vec methodology trains a (shallow) Neural Network over a large text corpus in order to generate a mapping of words or phrases in the lexicon to vectors of real numbers in a low-dimensional space such that words with similar linguistic context are close by in the Euclidean space. Originally proposed for Natural Language Pro- cessing, the word2vec framework has been used to construct features for biological (BioVec) and protein (ProtVec) sequences [10]. seq2vec, an extension of word2vec that estimates a representation of a complete protein sequence as opposed to k-mers, outperforms ProtVec on protein classification problems [39]. Embedded representations of proteins have been used for protein property prediction tasks [76]. For sequence retrieval and classification tasks, a Mahalanobis distance metric in the embedded feature vector space may be better than the default Euclidean metric [38].

8.4 Graph-based semi-supervised learning The dataset we analysed contains a large proportion of unlabelled virus genomes (Table 5). Because we frame virus classification as a supervised learning (SL) problem, these samples were discarded when training the k-NN and SVM models. Semi-supervised learning (SSL) [81] uses unlabelled data in conjunction with a small amount of labelled data to yield

29 models whose classification performance can surpass that which could be obtained either by discarding the unlabeled data (SL) or by neglecting the labels (unsupervised learning). Graph-based SSL [81] (GSSL) uses a graph reflecting some notion of closeness between samples to infer the class membership of unlabelled samples. In this way, GSSL may be seen to propagate the limited number of initial labels present in the training data through the graph. SSL has been applied to different biological sequences and used for tasks such as protein function prediction [52] and gene function prediction [59]. To date, it has not been applied to virus genome sequences. We suspect the S3VM [17], an SSL version of the SVM based on self-training, will be better at predicting ICTV Order than our SVM with the lowest error rate. We propose that GSSL provides a way to generate new insights about virions from a graph whose topology is informed by a large number of sequence relationships and whose discrete- and/or real-value node labels are informed by sparser information on virus bi- ology. Specifically, consider the following network of objects and their relationships: a node denotes a virus, an edge specifies the phylogenetic distance between the genomes of the connected viruses (for example, a divergence metric between low-dimensional vectors constructed from nucleotide-, word-, compression- or embedding-based features), and a node label indicates a phenotypic property such as taxon (for example, ICTV Order or Baltimore Class), host (eukaryote, archaeon or bacterium), and capsid type (for example, helical or icosahedral). A GSSL-based framework makes it easy to incorporate, integrate and exploit diverse incoming data: the flood of genomic sequences adds new nodes to the graph whilst the trickle of biological properties adds to the small but growing pool of node labels. Together, the two permit inferences to be made about viruses for which there is no or limited knowledge apart from their complete genome sequences. Furthermore, the graph’s community structure provides the basis for learning a data-driven taxonomy of the virosphere where taxa are organised not as a tree but as a network whose structure (the number of nodes and the pattern of edges between individual as well as groups of related viruses) can evolve in light of new data. Acknowledgements: The authors gratefully acknowledge Dr K. Bryson and Dr D. Fernandez-Reyes from University College London for discussions.

References

[1] class: Functions for classification, Augest 2015. Retrieved Augest 11, 2015, from https://cran.r-project.org/web/packages/class/index.html.

[2] e1071: Misc functions of the department of statistics, probability theory group (for- merly: E1071), tu wien, Augest 2015. Retrieved Augest 27, 2015, from https://cran.r- project.org/web/packages/e1071/index.html.

30 [3] ggtern: An extension to ’ggplot2’, for the creation of ternary dia- grams, May 2015. Retrieved November 19, 2015, from https://cran.r- project.org/web/packages/ggtern/index.html.

[4] NCBI ftp site, September 2015. Retrieved September 18, 2015, from ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/.

[5] A quick benchmark: Gzip vs. bzip2 vs. lzma, May 2015. Retrieved May 19, 2016, from https://tukaani.org/lzma/benchmarks.html.

[6] The r graphics package, May 2015. Retrieved November 19, 2015, from https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/00Index.html.

[7] Rtsne: T-distributed stochastic neighbor embedding using barnes-hut imple- mentation, May 2015. Retrieved November 19, 2015, from https://cran.r- project.org/web/packages/Rtsne/index.html.

[8] J. S. Almeida. Sequence analysis by iterated maps, a review. Briefings in Bioinfor- matics, 15(3):369–375, 2014.

[9] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regres- sion. The American Statistician, 46(3):175–185, 1992.

[10] E. Asgari and M. R. K. Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11), 2015.

[11] Ehsaneddin Asgari, Philipp C M¨unch, Till R Lesker, Alice Carolyn McHardy, and Mohammad RK Mofrad. Nucleotide-pair encoding of 16s rrna sequences for host phenotype and biomarker detection. bioRxiv, page 334722, 2018.

[12] Prasert Auewarakul. Composition bias and genome polarity of rna viruses. Virus Research, 109(1):33–37, 2005.

[13] N. S. Bakr and A. A. Sharawi. DNA lossless compression algorithms: Review. Amer- ican Journal of Bioinformatics Research, 3(3):72–81, 2013.

[14] Y. Bao, V. Chetvernin, and T. Tatusova. Improvements to pairwise sequence compar- ison (PASC): a genome-based web tool for virus classification. Archives of Virology, 159(12):3293–3304, 2014.

[15] M. Bekkar and T. A. Alitouche. Imbalanced data learning approaches review. In- ternational Journal of Data Mining and Knowledge Management Process, 3(4):15–33, 2013.

[16] Ilya S Belalov and Alexander N Lukashev. Causes and implications of codon usage bias in rna viruses. PLoS One, 8(2):e56642, 2013.

31 [17] Kristin P Bennett and Ayhan Demiriz. Semi-supervised support vector machines. In Advances in Neural Information processing systems, pages 368–374, 1999.

[18] G. Benoit, C. Lemaitre, D. Lavenier, E. Drezen, T. Dayris, R. Uricaru, and G. Rizk. Reference-free compression of high throughput sequencing data with a probabilistic de bruijn graph. BMC Bioinformatics, 16(288), 2015.

[19] Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Com- munications of the ACM, 13(7):422–426, 1970.

[20] I. Borozan, S. Watt, and V. Ferretti. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification. Bioinformatics, 31(9):1396–1404, 2015.

[21] Michael Burrows and David J Wheeler. A block-sorting lossless data compression algorithm. 1994.

[22] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:3:27, 2011.

[23] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[24] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, 2000.

[25] M. Deng, C. Yu, Q. Liang, R. L. He, and S. S.-T. Yau. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PLoS ONE, 6(3), 2011.

[26] P. Deutsch. Deflate compressed data format specification, May 1996. Retrieved from November 1, 2016, from https://tools.ietf.org/html/rfc1951.

[27] R. C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.

[28] A. A. Freitas and A. Carvalho. A tutorial on hierarchical classification with appli- cations in bioinformatics. In D. Taniar, editor, Research and Trends in Data Mining Technologies and Applications, pages 175–208, 2007.

[29] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.

[30] P. He and J. Wang. Numerical characterization of DNA primary sequence. Internet Electronic Journal of Molecular Design, 1(12):668–674, 2002.

32 [31] T. Hernandez and J. Yang. Descriptive statistics of the genome: phylogenetic classi- fication of viruses. Archive of Cornell University Library, Sep 2013. [32] M. Hosseini, D. Pratas, and A. J. Pinho. A survey on data compression methods for biological sequences. Information, 7(56), 2016. [33] H.-H. Huang. An ensemble distance measure of k-mer and natural vector for the phylogenetic analysis of multiple-segmented viruses. Journal of Theoretical Biology, 398:136–144, 2016. [34] H.-H. Huang, C. Yu, H. Zheng, T. Hernandez, S.-C. Yau, R. L. He, J. Yang, and S. S.- T. Yau. Global comparison of multiple-segmented viruses in 12-dimensional genome space. Molecular Phylogenetics and Evolution, 81:29–36, 2014. [35] David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952. [36] T. Hyypi, T. Hovi, N. J. Knowles, and G. Stanway. Classification of based on molecular and biological properties. Journal of General Virology, 78(1):1– 11, 1997. [37] K. Katoh, K. Misawa, K. Kuma, and T. Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14):3059–3066, 2002. [38] Dhananjay Kimothi, Ankita Shukla, Pravesh Biyani, Saket Anand, and James M Hogan. Metric learning on biological sequence embeddings. In Signal Processing Ad- vances in Wireless Communications (SPAWC), 2017 IEEE 18th International Work- shop on, pages 1–5. IEEE, 2017. [39] Dhananjay Kimothi, Akshay Soni, Pravesh Biyani, and James M Hogan. Distributed representations for biological sequence analysis. Archive of Cornell University Library, 2016. [40] A. M. Q. King, M. J. Adams, E. B. Carstens, and E. J. Lefkowitz, editors. Virus tax- onomy: classification and nomenclature of viruses: Ninth Report of the International Committee on Taxonomy of Viruses. Elsevier/Academic Press, London, 2012. [41] P. Kolekar, M. Kale, and U. Kulkarni-Kale. Alignment-free distance measure based on return time distribution for sequence analysis: Applications to clustering, molecu- lar phylogeny and subtyping. Molecular Phylogenetics and Evolution, 65(2):510–522, 2012. [42] P. S. Kolekar, M. M. Kale, and U. Kulkarni-Kale. ’inter-arrival time’ inspired algorithm and its application in clustering and molecular phylogeny. AIP Conference Proceedings, 1298:307–312, 2010.

33 [43] P. S. Kolekar, M. M. Kale, and U. Kulkarni-Kale. Genotyping of mumps viruses based on sh gene: development of a server using alignment-free and alignment-based methods. Immunome Research, 7(3):1–7, 2011.

[44] Andrei N Kolmogorov. On tables of random numbers. Sankhy¯a: The Indian Journal of Statistics, Series A, pages 369–376, 1963.

[45] M. Krupoviˇcand D. H. Bamford. Order to the viral universe. Journal of Virology, 84(24):12476–12479, 2010.

[46] R. Kumar, B. K. Mishra, T. Lahiri, G. Kumar, N. Kumar, R. Gupta, and M. K. Pal. Pcv: An alignment free method for finding homologous nucleotide sequences and its application in phylogenetic study. Interdisciplinary Sciences, pages 1–11, 2016.

[47] M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, and D. G. Higgins. Clustal w and clustal x version 2.0. Bioinformatics, 23(21):2947–2948, 2007.

[48] Chris-Andr´eLeimeister, Salma Sohrabi-Jahromi, and Burkhard Morgenstern. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinfor- matics, 33(7):971–979, 2017.

[49] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information- based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149–154, 2001.

[50] X.-Q. Li and D. Du. Variation, evolution, and correlation analysis of c+g content and genome or chromosome size in different kingdoms and phyla. PLoS ONE, 9(2), 2014.

[51] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classi- fication using string kernels. The Journal of Machine Learning Research, 2:419–444, 2002.

[52] J. Metz and A. A. Freitas. Extending hierarchical classification with semisupervised learning. In Proceedings of the UK Workshop on Computational Intelligence, pages 1–6, 2009.

[53] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word represen- tations in vector space. Archive of Cornell University Library, September 2013.

[54] M. Mohammed, A. Dutta, T. Bose, S. Chadaram, and S. Mande. Deliminate - a fast and efficient method for loss-less compression of genomic sequences: Sequence analysis. Bioinformatics, 28(19):2527–2529, 2012.

34 [55] B. M. Muhire, A. Varsani, and D. P. Martin. SDT: a virus classification tool based on pair-wise sequence alignment and identity calculation. PLoS One, 9(9), 2014.

[56] A. Pinho and D. Pratas. Mfcompress: A compression tool for fasta and multi-fasta data. Bioinformatics, 30(1):117–118, 2013.

[57] Armando J Pinho, Paulo JSG Ferreira, Ant´onio JR Neves, and Carlos AC Bastos. On the representability of complete genomes by multiple competing finite-context (markov) models. PLoS ONES, 6(6):e21588, 2011.

[58] D. Pratas. Compression and analysis of genomic data. PhD thesis, Universidade de Aveiro, 2016.

[59] A. Santos and A. Canuto. Applying semi-supervised learning in hierarchical multi- label classification. Expert Systems with Applications, 41(14):6075–6085, 2014.

[60] M. Sardaraz, M. Tahir, and A. A. Ikram. Advances in high throughput sequence data compression. Journal of Bioinformatics and Computational Biology, 14(3), 2016.

[61] I. Schwende and T. D. Pham. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Briefings in Bioinformatics, 15(3):354–368, 2014.

[62] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cam- bridge University Press, Cambridge, 2004.

[63] C. N. Silla and A. A. Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1):31–72, 2011.

[64] P. Simmonds. Methods for virus classification and the challenge of incorporating metagenomic sequence data. Journal of General Virology, 96(6):1193–1206, 2015.

[65] Peter Simmonds, Mike J Adams, M´aria Benk˝o, Mya Breitbart, J Rodney Brister, Eric B Carstens, Andrew J Davison, Eric Delwart, Alexander E Gorbalenya, Bal´azs Harrach, et al. Consensus statement: Virus taxonomy in the age of metagenomics. Nature Reviews Microbiology, 15:161–168, 2017.

[66] K. Song, J. Ren, G. Reinert, M. Deng, M. S. Waterman, and F. Sun. New develop- ments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Briefings in Bioinformatics, 15(3):343–353, 2014.

[67] L. J. P. van der Maaten and G. E Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, Nov 2008.

[68] Formijn van Hemert, Antoinette C van der Kuyl, and Ben Berkhout. Impact of the biased nucleotide composition of viral rna genomes on rna structure and codon usage. Journal of General Virology, 97(10):2608–2619, 2016.

35 [69] V. N. Vapnik. Statistical learning theory. Wiley-Interscience, New York, NY, Septem- ber 1998.

[70] J. P. Vert, K. Tsuda, and B. Sch¨olkopf. A primer on kernel methods, pages 35–70. MIT Press, Cambridge, MA, 2004.

[71] S. Vinga and J. Almeida. Alignment-free sequence comparison – a review. Bioinfor- matics, 19(4):513–523, 2003.

[72] Vladimir V V’yugin. Algorithmic complexity and stochastic properties of finite binary sequences. The Computer Journal, 42(4):294–317, 1999.

[73] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1(4):337–348, 1994.

[74] J. Wen, R. H. F. Chan, S.-C. Yau, R. L. He, and S. S. T. Yau. K-mer natural vector and its application to the phylogenetic analysis of genetic sequences. Gene, 546(1):25–34, 2014.

[75] J. Wen, Y. Zhang, and S. S.T. Yau. k-mer sparse matrix model for genetic sequence and its applications in sequence comparison. Journal of Theoretical Biology, 363:145– 150, 2014.

[76] Kevin K Yang, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. Learned protein embeddings for machine learning. Bioinformatics, 1:7, 2018.

[77] C. Yin and S. S. Yau. An improved model for whole genome phylogenetic analysis by Fourier transform. Journal of Theoretical Biology, 382:99–110, October 2015.

[78] C. Yu, T. Hernandez, H. Zheng, S.-C. Yau, H.-H. Huang, R. L. He, J. Yang, and S. S.-T. Yau. Real time classification of viruses in 12 dimensions. PLoS ONE, 8(5), 2013.

[79] C. Yu, Q. Liang, C. Yin, R. L. He, and S. S.-T. Yau. A novel construction of genome space with biological geometry. DNA Research, 17(3):155–168, 2010.

[80] R. Zhang and C. T. Zhang. A brief review: the Z-curve theory and its application in genome analysis. Current Genomics, 15(2):78–94, 2014.

[81] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised Learning. Morgan and Claypool, San Rafael, CA, 2009.

[82] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compres- sion. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

36