
Reference-Based Sequence Classification

ZENGYOU HE1, GUANGYAO XU1, CHAOHUA SHENG1, BO XU1, QUAN ZOU2
1School of Software, Dalian University of Technology, Tuqiang Road, Dalian, China
2Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology, Chengdu, China
Corresponding author: Zengyou He (e-mail: [email protected]).
This work was partially supported by the Natural Science Foundation of China under Grant Nos. 61972066 and 61572094, and the Fundamental Research Funds for the Central Universities (No. DUT20YG106).

ABSTRACT Sequence classification is an important data mining task in many real-world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. Using this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to that of state-of-the-art sequence classification algorithms.

INDEX TERMS Sequence classification, sequential data analysis, hypothesis testing, sequence embedding

I. INTRODUCTION
In many practical applications, we have to conduct data analysis on data sets that are composed of discrete sequences. Each sequence is an ordered list of elements. For instance, such a sequence can be a protein sequence, where each element corresponds to an amino acid. Due to the existence of a large number of discrete sequences in a wide range of applications, sequential data analysis has become an important issue in machine learning and data mining. Compared to non-sequential data mining, sequential data analysis is confronted with new challenges because of the ordering relationship between different elements in the sequences. Similar to the analysis of non-sequential data, there are different sequential data mining problems such as clustering, classification and pattern discovery. In this paper, we focus on the sequence classification problem.

The task of classification is to determine which predefined target class one unknown object should be assigned to [1]. As a specific case of the general classification problem, sequence classification is to assign class labels to new sequences based on the classifier constructed in the training phase. In many real-world applications, we can formulate the data analysis task as a sequence classification problem. For instance, the essential task in numerous applications is to classify biological sequences into existing categories [2].

To tackle the sequence classification problem, many effective methods have been proposed from different aspects. Roughly, existing sequence classification methods can be divided into three categories [3]: feature-based methods, distance-based methods and model-based methods. Feature-based methods first transform sequences into feature vectors and then apply existing vectorial data classification methods. Distance-based methods apply classifiers such as KNN (k Nearest Neighbors) to solve the sequence classification problem, in which the key issue is to specify a proper distance function to measure the distance between two sequences [3]. Model-based methods generally assume that sequences from different classes are generated from different probability distributions, in which the key issue is to estimate the model parameters from the set of training sequences.

In this paper, we focus on the feature-based method since it has several advantages. First of all, various effective classifiers have been developed for vectorial data classification [4]. After transforming sequences into feature vectors, we can choose any one of these existing classification methods to fulfill the sequence classification task. Second, in some popular feature-based methods such as pattern-based methods, each feature has a good interpretability. Last but not least, the extraction of features from sequences has been extensively studied across different fields, making it feasible to generate sequence features in an effective manner.

The k-mer (in bioinformatics) or k-gram (in natural language processing) is a substring that is composed of k consecutive elements, which is probably the most widely used feature in feature-based sequence classification. Such a k-mer based feature construction method is further generalized by the pattern-based method, in which a feature is a sequential pattern (a subsequence) that satisfies some constraints (e.g. frequent pattern, discriminative pattern). Over the past few decades, a large number of pattern-based methods have been presented in the context of sequence classification [5]–[30].

In this paper, we present a reference-based sequence classification framework, which can be considered as a non-trivial generalization of the pattern-based methods. This framework has several key steps: candidate set construction, reference point selection and feature value construction. In the first step, a set of sequences that serve as the candidate reference points are constructed. Then, some sequences from the candidate set are selected as the reference points according to certain criteria. The number of features in the transformed vectorial data will equal the number of selected reference points. In other words, each reference point will correspond to a transformed feature. Finally, a similarity function is used to calculate the similarity between each sequence in the data and every reference point. The similarity to each reference point will be used as the corresponding feature value.

The reference-based sequence classification framework is quite general and flexible since the selection of both reference points and similarity functions is arbitrary. Existing feature-based methods can be regarded as a special variant under our framework by (1) using (frequent or discriminative) sequential patterns (subsequences) as reference points and (2) utilizing a boolean function (output 1 if the reference point is contained in a given sequence and output 0 otherwise) as the similarity function. Besides unifying existing pattern-based methods under the same umbrella, the reference-based sequence classification framework can be used as a general platform for developing new feature-based sequence classification methods. To justify this point, we develop a new feature-based method in which a subset of training sequences are used as the reference points and the Jaccard coefficient is used as the similarity function. In particular, we present two instance selection methods to select a good set of reference points.

To demonstrate the feasibility and advantages of this new framework, we conduct a series of comprehensive performance studies on real sequential data sets. In the experiments, we compare several variants under our framework with some existing sequence classification methods in terms of classification accuracy. Experimental results show that new methods developed under the proposed framework are capable of achieving better classification accuracy than traditional sequence classification methods. This indicates that such a reference-based sequence classification framework is promising from a practical point of view.

The main contributions of this paper can be summarized as follows:
• We present a general reference-based framework for feature-based sequence classification. It offers a unified view for understanding and explaining many existing feature-based sequence classification methods in which different types of sequential patterns are used as features.
• The reference-based framework can be used as a general platform for developing new feature-based sequence classification algorithms. To verify this point, we design new feature-based sequence classification algorithms under this framework and demonstrate its advantages through extensive experimental results on real sequential data sets.

The rest of the paper is structured as follows. Section II gives a discussion on the related work. In Section III, we introduce the reference-based sequence classification framework in detail. In Section IV, we show that many existing feature-based sequence classification algorithms can be reformulated within the reference-based framework. In Section V, we present new feature-based sequence classification algorithms under this framework, which are effective and quite different from available solutions. We experimentally evaluate the proposed reference-based framework through a series of experiments on real-life data sets in Section VI. Finally, we summarise our research and give a discussion on the future work in Section VII.

II. RELATED WORK
In this section, we discuss previous research efforts that are closely related to our method. In Section II-A, we provide a categorization on existing feature-based sequence classification methods. In Section II-B, we discuss several instance-based feature generation methods in the literature of time series classification. In Section II-C, we present a concise discussion on reference-based sequence clustering algorithms. In Section II-D, we provide a short summary on dimension reduction and embedding methods based on landmark points.

A. FEATURE-BASED METHODS
1) Explicit Subsequence Representation without Selection
The naive approach in dealing with discrete sequences is to treat each element as a feature. However, the order information between different elements will be lost and the sequential nature cannot be captured in the classification. Short sequence segments of k consecutive elements called k-grams can be used as features to solve this problem. Given a set of k-grams, a sequence can be represented as a vector of the presence or absence of the k-grams or the frequencies of the k-grams.
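To make this representation concrete, the short sketch below builds such k-gram presence and frequency vectors for a toy sequence. It is only an illustrative sketch under our own naming; it is not part of any method cited above.

```python
from collections import Counter
from itertools import product

def kgram_features(sequence, k, alphabet, binary=True):
    """Map a discrete sequence to a vector indexed by all k-grams over the alphabet."""
    vocab = [''.join(p) for p in product(alphabet, repeat=k)]     # all possible k-grams
    grams = Counter(''.join(sequence[i:i + k]) for i in range(len(sequence) - k + 1))
    if binary:
        return [1 if grams[g] > 0 else 0 for g in vocab]          # presence/absence vector
    return [grams[g] for g in vocab]                              # frequency vector

# Toy sequence over a 3-letter alphabet; 2-grams of "ACCGA" are AC, CC, CG, GA.
print(kgram_features("ACCGA", k=2, alphabet="ACG", binary=False))
```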

In this feature representation method, all k-grams (for a specified k value) are explicitly used as the features without feature selection.

2) Explicit Subsequence Representation with Selection (Classifier-Independent)
Lesh et al. [26] present a pattern-based classification method in which a sequential pattern is chosen as a feature. The selected pattern should satisfy the following criteria: (1) be frequent, (2) be distinctive of at least one class and (3) not redundant. Towards this direction, many pattern-based classification methods have been subsequently proposed, in which different constraints are imposed on the patterns that should be selected as features [5]–[25], [27]–[30]. Note that any classifier designed for vectorial data can be applied to the transformed data generated from such pattern-based methods. In other words, such feature generation methods are classifier-independent.

3) Explicit Subsequence Representation with Selection (Classifier-Dependent)
The above pattern-based methods are universal and classifier-independent. However, some patterns that are critical to the classifier may be filtered out during the selection process. Thus, several methods which can select pattern features from the entire pattern space for a specific classifier have been proposed [31]–[33].
In [31], a coordinate-wise gradient ascent technique is presented for learning the logistic regression function in the space of all k-grams. The method exploits the inherent structure of the k-gram feature space to automatically provide a compact set of highly discriminative k-gram features. In [32], a framework is presented in which linear classifiers such as logistic regression and support vector machine can work directly in the explicit high-dimensional space of all subsequences. The key idea is a gradient-bounded coordinate-descent strategy to quickly retrieve features without explicitly enumerating all potential subsequences. In [33], a novel document classification method using all substrings as features is proposed, in which L1 regularization is applied to a multi-class logistic regression model to fulfill the feature selection task automatically and efficiently.

4) Implicit Subsequence Representation
In contrast to explicit subsequence representation, kernel-based methods employ an implicit subsequence representation strategy. A kernel function is the key ingredient for learning with support vector machines (SVMs) and it implicitly defines a high-dimensional feature space. Some kernel functions K(x, y) have been presented for measuring the similarity between two sequences x and y (e.g. [34]). There are a variety of string kernels which are widely used for sequence classification (e.g. [35]–[38]). A sequence is transformed into a feature space and the kernel function is the inner product of two transformed feature vectors.
Leslie et al. [35] propose a k-spectrum kernel for protein classification. Given a number k ≥ 1, the k-spectrum of an input sequence is the set of all its k-length (contiguous) subsequences. Lodhi et al. [36] present a string kernel based on gapped k-length subsequences for text classification. The subsequences are weighted by an exponentially decaying factor of their full length in the text. In [37], a mismatch string kernel is proposed, in which a certain number of mismatches are allowed in counting the occurrence of a subsequence. Several string kernels related to the mismatch kernel are presented in [38]: restricted gappy kernels, substitution kernels and wildcard kernels.

5) Sequence Embedding
All the methods mentioned above use subsequences as features. Alternatively, the sequence embedding method generates a vector representation in which each feature does not have a clear interpretation. Most existing approaches for sequence embedding are proposed for texts in natural language processing, where word and document embeddings are used as an efficient way to encode the text (e.g. [39], [40]). The basic assumption in these methods is that words that appear in similar contexts have similar meanings.
The word2vec model [39] uses a two-layer neural network to learn a vector representation for each word. The sequence (text) embedding vector can be further generated by combining the feature vectors for words. The doc2vec model [40] extends word2vec by directly learning feature vectors for entire sentences, paragraphs, or documents.
Nguyen et al. [41] propose an unsupervised method (named Sqn2Vec) for learning sequence embeddings by predicting its belonging singleton symbols and sequential patterns (SPs). The main objective of Sqn2Vec is to address the limitations of two existing approaches: pattern-based methods often produce sparse and high-dimensional feature vectors, while sequence embedding methods in natural language processing may fail on data sets with a small vocabulary.

6) Summary of Feature-Based Methods
Roughly, existing feature-based sequence classification methods can be divided into the above five categories. Each of these methods has its pros and cons, which we will discuss briefly next.
First, using k-grams as features without feature selection is simple and effective in practice. However, the feature length k cannot be large and many redundant features may be included.
Second, in the pattern-based method, the length of a feature is not restricted as long as the feature satisfies the given constraints and redundant features can be filtered out in some formulations. However, it is a non-trivial task to efficiently mine patterns that can satisfy the constraints.
Third, sequence classification methods based on adaptive feature selection can automatically select features from the set of all subsequences.

The basic idea is to integrate the feature selection and classifier construction into the same procedure. Hence, these methods are classifier-dependent in the sense that each algorithm is only applicable to a specific classifier.
Fourth, kernel-based methods can implicitly map the sequence into a high-dimensional feature space without explicit feature extraction. The major challenge is how to choose a proper string kernel function and how to handle large data sets efficiently.
Finally, sequence embedding methods generate a new vector representation for each sequence that may achieve better classification accuracy. Unfortunately, the semantic interpretation of each feature becomes a difficult issue.

B. INSTANCE-BASED FEATURE GENERATION METHODS
There are several instance-based feature generation methods for time series classification which are closely related to our method (e.g. [42], [43]).
Iosifidis et al. [42] propose a time series classification method based on a novel vector representation. The vector representation for each time series is generated by calculating its similarities from a subset of training instances. To find a good subset of representative instances, a clustering procedure is further presented. In [43], each time series is represented as a feature vector, where the feature value is its dynamic time warping similarity from one of the training instances. Note that all training instances are used for feature generation.

C. REFERENCE-BASED SEQUENCE CLUSTERING
In the literature of sequence clustering, the idea of using reference/landmark points to accelerate the cluster analysis process has been widely studied (e.g. [44], [45]). In this type of sequence clustering algorithm, a reference point selection method is first employed to obtain a small set of landmark points and then the clustering process is conducted based on the similarities between input sequences and selected reference points. Here, we would like to highlight the following differences between our method and existing research efforts in this field: (1) The objective is different. We focus on the classification issue while these methods aim at the cluster analysis problem. Besides, their main concern is to improve the running efficiency of the sequence clustering procedure; (2) The method is different. We present two reference point selection methods: one unsupervised method and one supervised method (see Section V for the details). In existing reference-based sequence clustering methods, only the unsupervised reference point selection method is applicable since no class label information is available.

D. REFERENCE-BASED DIMENSION REDUCTION
A number of research papers have presented the idea of using the distances to a set of reference points to fulfill the dimension reduction task (e.g. [46], [47]). Our method shares some similarities with these methods since the final objective is the same. However, most of these methods are not developed for the task of sequence classification. As a result, our method is quite different from these methods for both the reference point selection and the similarity computation.

FIGURE 1: The entire workflow of the reference-based sequence classification framework (original data → reference point selection → feature value generation → model construction and prediction → prediction result).

III. REFERENCE-BASED SEQUENCE CLASSIFICATION FRAMEWORK
Let I = {i1, i2, ..., im} be a finite set of m distinct items, which is generally called the alphabet in the literature. A sequence s over I is an ordered list s = ⟨s1, s2, ..., sl⟩, where si ∈ I and l is the length of the sequence s. A sequence t = ⟨t1, t2, ..., tr⟩ is said to be a subsequence of s if there exist integers 1 ≤ i1 < i2 < ... < ir ≤ l such that t1 = si1, t2 = si2, ..., tr = sir, denoted as t ⊆ s (if t ≠ s, written as t ⊂ s). We use maxsize to denote the allowed maximum length of subsequences.
Let C = {c1, c2, ..., cj} be a finite set of j distinct classes. A labeled sequential data set D over I is a set of instances and each instance d is denoted by (s, ck), where s is a sequence and ck ∈ C is a class label; |D| is the number of sequences in D. The set Dci ⊆ D contains all sequences that have the same class label ci (i.e., D = ∪_{i=1}^{j} Dci). Dci(t) is the set of sequences in Dci that contain t, where t is a given sequence. Sequences in D (Dci) are divided into a training set TrainD (TrainDci) and a testing set TestD (TestDci). The set of all subsequences of TrainD is denoted by SubTrainD = {t | t ⊆ s, s ∈ TrainD}.
As shown in Fig. 1, we present a reference-based sequence classification framework. It is composed of three major phases: reference point selection, feature value generation, and model construction and prediction. In the following, we will elaborate on each step in detail.
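Before detailing the three phases, the sketch below summarizes how they fit together in code. It is only a minimal illustration of the workflow in Fig. 1, not our actual implementation: select_reference_points and similarity are placeholders for whichever selection criterion and similarity function a concrete variant plugs in, and the classifier is assumed to follow the scikit-learn fit/predict convention.

```python
def to_feature_vectors(sequences, reference_points, similarity):
    """Turn each sequence into a vector of similarities to the reference points."""
    return [[similarity(s, r) for r in reference_points] for s in sequences]

def reference_based_classify(train_seqs, train_labels, test_seqs,
                             select_reference_points, similarity, classifier):
    # Phase 1: select reference points from a candidate set (e.g. the training sequences).
    refs = select_reference_points(train_seqs, train_labels)
    # Phase 2: generate feature values for training and testing sequences.
    X_train = to_feature_vectors(train_seqs, refs, similarity)
    X_test = to_feature_vectors(test_seqs, refs, similarity)
    # Phase 3: train a vectorial classifier and predict class labels.
    classifier.fit(X_train, train_labels)
    return classifier.predict(X_test)
```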

FIGURE 2: The process of reference point selection: ① extracting the alphabet from the training set, ② generating the set of candidate reference points, ③ reference point selection.
A. REFERENCE POINT SELECTION
In the first stage of the presented framework, a reference point selection procedure is performed to generate a set of pivot sequences. As shown in Fig. 2, this procedure can be further divided into three steps: alphabet extraction, candidate set generation and pivot sequence selection.
In the first step, we scan the training set TrainD to extract the alphabet I that is composed of distinct items. Note that there can be some items that only appear in the testing set TestD. In the forthcoming paragraphs, we will see that this extreme case does not affect our subsequent steps.
In the second step, we generate the set of candidate reference sequences CR from the alphabet I. Note that any sequence over I can be a member of CR. In other words, CR can be an infinite set. In practice, some constraints will be imposed on the potential member in CR. For instance, those pattern-based methods only consider subsequences of TrainD as members of CR under our framework, which will be further discussed in Section IV. Furthermore, the use of different construction methods for building the candidate set CR will lead to the generation of many new feature-based sequence classification methods.
In the third step, we select a subset of sequences R from CR as the landmark sequences for generating features. That is, each reference sequence will correspond to a transformed feature. The critical issue in this step is how to design an effective pivot sequence selection method. To date, existing pattern-based methods typically utilize some simple criteria to conduct the reference sequence selection task. For example, those methods based on frequent subsequences use the minimal support constraint as the criterion for reference sequence selection. Apparently, many new and interesting pivot sequence selection methods remain unexplored under our framework. In the subsequent paragraphs of this subsection, we will list some commonly used criteria for selecting reference sequences from the set of candidate pivot sequences.

Constraint 1. (Gap constraint [11]). Given two sequences s = ⟨s1, s2, ..., sl⟩ and t = ⟨t1, t2, ..., tr⟩, if t is the subsequence of s such that t1 = si1, t2 = si2, ..., tr = sir, the gap between ik and ik+1 is defined as Gap(s, ik, ik+1) = ik+1 − ik − 1. Given two thresholds mingap and maxgap (0 ≤ mingap ≤ maxgap), if mingap ≤ Gap(s, ik, ik+1) ≤ maxgap (1 ≤ k ≤ r − 1), then the occurrence of t in s fulfills the gap constraint.

Constraint 2. (Minsup constraint [12]). Given a set of sequences Dci with the class label ci and a sequence t, countDci(t) is used to denote the number of sequences in Dci that contain t as a subsequence. The support of t in Dci is defined as supDci(t) = countDci(t) / |Dci|. Given a positive threshold minsup, if supDci(t) ≥ minsup, then t satisfies the minsup constraint and t is a frequent sequential pattern in Dci.
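As an illustration of these definitions, the sketch below computes the containment test t ⊆ s and the support supDci(t) used by the minsup constraint. The helper names and the example data are ours.

```python
def is_subsequence(t, s):
    """Return True if t is a (not necessarily contiguous) subsequence of s."""
    it = iter(s)
    return all(item in it for item in t)

def support(t, class_sequences):
    """sup_Dci(t): fraction of sequences in the class that contain t as a subsequence."""
    count = sum(1 for s in class_sequences if is_subsequence(t, s))
    return count / len(class_sequences)

D_c1 = [list("abcde"), list("axbyc"), list("bbbbb")]
print(support(list("abc"), D_c1))   # 2/3: only the first two sequences contain a, b, c in order
```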

Constraint 3. (Mindisc constraint [48]). Given two class labels c1 and c2, a sequence t is said to be a discriminative pattern if it is over-expressed on Dc1 against Dc2 (or vice versa). To evaluate the discriminative power, many measures/functions have been proposed in the literature [48]. If the discriminative function value of t can pass certain constraints, then it satisfies the mindisc constraint. Here we just list some measures that have been used for selecting discriminative patterns in sequence classification.
• Discriminative Function (DF) 1 [12]:
    supDc1(t) > minsup and supDc2(t) ≤ minsup,   (III.1)
where minsup is a given support threshold.
• Discriminative Function (DF) 2 [11]:
    occDc1(t) > mincount and occDc2(t) ≤ mincount,   (III.2)
where occDc1(t) = occountDc1(t) / |Dc1| and mincount is a given threshold. Here occountDc1(t) is the number of non-overlapping occurrences of t in Dc1.
• Discriminative Function (DF) 3 [12]:
    supdiff = supDc1(t) − supDc2(t).   (III.3)
• Discriminative Function (DF) 4 [11]:
    F-ratio = Occbetween / Occwithin,   (III.4)
where
    Occbetween = |Dc1| (occDc1(t) − (occDc1(t) + occDc2(t)) / 2)² + |Dc2| (occDc2(t) − (occDc1(t) + occDc2(t)) / 2)²
and Occwithin is defined as
    Occwithin = Σ_{j=1}^{|Dc1|} (occountDc1j(t) − occDc1(t))² + Σ_{j=1}^{|Dc2|} (occountDc2j(t) − occDc2(t))².
• Discriminative Function (DF) 5 [30]:
    GR(t, c1, c2) ≥ minGR and Sigcon(t, c1, c2) ≥ minSig,   (III.5)
where GR(t, c1, c2) = supc1(t) / supc2(t) is the GrowthRate of t and minGR is a given GrowthRate threshold. Sigcon(t, c1, c2) = min_{q∈Q} {GR(t, c1, c2) / GR(q, c1, c2)} is used to describe the conditional redundancy, where Q is the set of discriminative sub-patterns of t and minSig is a given threshold.
• Discriminative Function (DF) 6 [26]:
The chi-squared test is used as the discriminative function to check if the candidate sequence is correlated with at least one class that it is frequent in.

Constraint 4. (Uniqueness constraint [11]). A sequence is said to satisfy the uniqueness constraint if all its items are unique.

Constraint 5. (Closedness constraint [19]). A sequence t is said to satisfy the closedness constraint if no sequences that contain t as a subsequence have the same support as t.

Constraint 6. (Redundancy constraint [26]). A sequence t is said to satisfy the redundancy constraint if conf(t) ≥ |Dci| / |D|, where conf(t) = countDci(t) / countD(t) is the confidence of t.

Constraint 7. (Interestingness constraint [5]). Given a set of sequences Dci with class label ci and two sequences s = ⟨s1, s2, ..., sl⟩ and t = ⟨t1, t2, ..., tr⟩, if t is the subsequence of s such that t1 = si1, t2 = si2, ..., tr = sir, then Ici(t) = supDci(t) × Cci(t) is used to denote the interestingness of t, where Cci(t) = |t| / Wci(t) is the cohesion of t in Dci(t), Wci(t) = (Σ_{s∈Dci(t)} W(t, s)) / countDci(t) and W(t, s) = min{ir − i1 + 1 | i1 ≤ ir}. And the cohesion of t in a sequence s is C(t, s) = |t| / W(t, s). Given two thresholds minsup and minint, if supDci(t) ≥ minsup and Ici(t) ≥ minint, then t satisfies the interestingness constraint.

Constraint 8. (Level constraint [17]). Given a sequence t and a set of sequences D with j classes, a sequential classification rule π is denoted as π : t → countDc1(t), countDc2(t), ..., countDcj(t), where t is the body of the rule. From a Bayesian point of view, to choose the best rule is equivalent to maximizing p(π|D) = p(π, D) / p(D) = p(π) × p(D|π) / p(D), where p(D) is a constant. Hence cost(π) = −log(p(π) × p(D|π)) is used as the evaluation criterion, and the normalized criterion level is defined as level(π) = 1 − cost(π) / cost(π∅), in which cost(π∅) is the cost of the null model when the sequence body is empty. If 0 < level(π) ≤ 1, then t satisfies the level constraint.

B. FEATURE VALUE GENERATION
In the second stage of the presented framework, a similarity function is used to generate vectorial representations for all sequences in both training data and testing data. As shown in the left part of Fig. 3, this procedure can be further divided into two steps: (1) calculating the similarities between training instances and reference points; (2) calculating the similarities between testing instances and reference points.
In the first step, we utilize a similarity function to transform TrainD into a vectorial training set TrainD′ by calculating the similarity between each sequence in TrainD and every reference point in R. Each similarity value will be used as the corresponding feature value. The critical issue in this step is how to choose a suitable similarity function. Note that the selection of the similarity function is arbitrary. In other words, any feasible similarity function can be used in this step. In fact, many existing feature-based methods utilize a boolean function as the similarity function, which outputs 1 as the feature value if the reference point is a subsequence of the target sequence and 0 otherwise.
In the second step, we use the same similarity function to transform TestD into a vectorial testing set TestD′. Note that the number of features in the transformed vectorial data set is |R|, which is the number of reference points.
The similarity function plays an important role in generating feature values. Accordingly, it will have a great impact on the prediction result. For the purpose of summarizing existing research efforts under our framework with respect to the similarity function, here we list some similarity functions between two sequences s and t that have been deployed in the literature.
• Similarity Function (SF) 1 [26]:
    Sim(s, t) = 1 if t ⊆ s, and 0 otherwise.   (III.6)
• Similarity Function (SF) 2 [12]:
    Sim(s, t) = 1 if α and t are similar, and 0 otherwise.   (III.7)
In Equation (III.7), similar means ed(α, t) ≤ γ × |t| (|s| ≥ |t|), where ed(α, t) is the edit distance between α and t (the minimum number of operations needed to transform α into t, where an operation can be the insertion, deletion, or substitution of a single item), and α is a contiguous subsequence of s with |t| items, which is extracted by using a sliding window of length |t| that starts from the first element of s. If α and t are not similar, then the sliding window will be repeatedly shifted one position to the right until |s| − |t| + 1 subsequences have been checked or a new subsequence α similar to t is encountered. γ is a given maximum difference threshold.
• Similarity Function (SF) 3 [5]:
    Sim(s, t) = C(t, s) if t ⊆ s, and 0 otherwise,   (III.8)
where C(t, s) is the cohesion of t in the sequence s.

FIGURE 3: The process of feature value generation, model construction and prediction: ① calculating the similarities between training instances and reference points, ② calculating the similarities between testing instances and reference points, ③ model construction, ④ prediction, ⑤ output.
• Similarity Function (SF) 4 [18]:
    Sim(s, t) = occnum if t ⊆ s, and 0 otherwise,   (III.9)
where occnum is the number of occurrences of t in s.
• Similarity Function (SF) 5 [11]:
    Sim(s, t) = occounts(t) if t ⊆ s, and 0 otherwise,   (III.10)
where occounts(t) is the number of non-overlapping occurrences of t in s.
• Similarity Function (SF) 6 [19]:
    Sim(s, t) = |LCS(s, t)| / Max{|s|, |t|},   (III.11)
where |LCS(s, t)| is the length of the longest common subsequence, and |s| and |t| are the lengths of s and t respectively.

C. MODEL CONSTRUCTION AND PREDICTION
In the third stage of the presented framework, we construct a prediction model to make predictions. As shown in the right part of Fig. 3, this procedure can be further divided into three steps: model construction, prediction and classification result generation.
In the first step, an existing vectorial data classification method is used to construct a prediction model from the vectorial training set TrainD′ since we have transformed training sequences into feature vectors in the second stage. Numerous classification methods have been designed for classifying feature vectors (e.g. support vector machines and decision trees) [4], [49]. After training a classifier with TrainD′, the prediction model is ready for classifying unknown samples.
In the second step, we forward the vectorial testing set TestD′ to the classifier to make predictions. In the third step, we output the prediction result and compute the classification accuracy by comparing the predicted class labels with the ground-truth labels.
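A minimal sketch of this third stage is given below, assuming scikit-learn and assuming that the vectorial sets TrainD′ and TestD′ (X_train, X_test) have already been produced in the second stage.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def build_and_evaluate(X_train, y_train, X_test, y_test):
    # Step 1: construct a prediction model from the vectorial training set TrainD'.
    model = SVC(kernel="linear")
    model.fit(X_train, y_train)
    # Step 2: forward the vectorial testing set TestD' to the classifier.
    y_pred = model.predict(X_test)
    # Step 3: compare the predicted class labels with the ground-truth labels.
    return accuracy_score(y_test, y_pred)
```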

IV. GENERAL FRAMEWORK FOR FEATURE-BASED CLASSIFICATION
In this section, we show that many existing feature-based sequence classification algorithms can be reformulated within the presented reference-based framework. The differences between these algorithms mainly lie in the selection of reference points and similarity functions. As summarized in Table 1, we can categorize these existing methods according to three criteria: (1) How to construct the candidate set of reference points? (2) How to choose a set of reference points? (3) Which similarity function should be used? Note that the definitions and notations for different constraints and similarity functions have been presented in Section III-A and Section III-B. From Table 1, we have the following observations.

TABLE 1: The categorization of some existing feature-based sequence classification algorithms under our framework

| Algorithm | Construction of Candidate Reference Point Set | Selection of Reference Points | Selection of Similarity Function |
| SCIP [5] | SubTrainD | minsup, minint and maxsize constraints | SF 1/3 |
| Ref. [8] | SubTrainD | minsup constraint | SF 1 |
| Ref. [11] | SubTrainD | uniqueness, gap, mindisc (DF 2 and 4) constraints | SF 5 |
| Ref. [12] | SubTrainD | gap, mindisc (DF 1 and 3) constraints | SF 2 |
| MiSeRe [17] | SubTrainD | level constraint | SF 1 |
| Ref. [18] | SubTrainD | minsup and gap constraints | SF 1/4 |
| PSO-AB [19] | SubTrainD | minsup and closedness constraints | SF 6 |
| FeatureMine [26] | SubTrainD | minsup, redundancy and mindisc (DF 6) constraints | SF 1 |
| CDSPM [30] | SubTrainD | minsup and mindisc (DF 5) constraints | SF 1 |

First of all, any sequence over the alphabet can be a potential member of the candidate set of reference points CR. However, all feature-based sequence classification algorithms in Table 1 use SubTrainD to construct CR since the idea of using subsequences as features is quite natural with a good interpretability. Although SubTrainD is a finite set, its size is still very large and most sequences in SubTrainD are useless and redundant for classification. Therefore, it is necessary to explore alternative methods for constructing the set of candidate reference points. For instance, we may use all original sequences in TrainD to construct CR, so that the size of CR will be greatly reduced and the corresponding features may be more representative.
Second, many sequence selection criteria have been proposed to select R from CR, such as minsup and mindisc. The main objective of applying these criteria is to select a subset of sequences that can generate good features for building the classifier. However, it is not an easy task to set suitable thresholds for these constraints to produce a set of reference sequences with moderate size. More importantly, most of these constraints are proposed from the literature of sequential pattern mining, which may be only applicable to the selection of reference sequences from SubTrainD. In other words, more general reference point selection strategies should be developed.
Last, the most widely used similarity function in Table 1 is SF 1, which is a boolean function based on whether the reference point is a subsequence of the sequence in TrainD. Although some non-boolean functions have been used, the potential of utilizing more elaborate similarity functions between two sequences still needs further investigation.
Overall, our reference-based sequence classification framework is quite generic, in which many existing pattern-based sequence classification methods can be reformulated as its special variants. Meanwhile, there are still many limitations in current research efforts under this framework. Hence, new and effective sequence classification methods should be developed towards this direction.

V. NEW VARIANTS UNDER THE FRAMEWORK
In addition to encompassing existing pattern-based methods, this framework can also be used as a general platform to design new feature-based sequence classification methods. As discussed in Section IV, there are three key ingredients in our framework: the construction of the candidate reference point set, the selection of reference points and the selection of the similarity function. Obviously, we will generate a "new" sequence classification algorithm based on an unexplored combination of these three components. In view of the fact that the number of possible combinations is quite large, it is infeasible to enumerate all these variants. Instead, we will only present two variants that are quite different from existing algorithms to demonstrate the advantage of this framework.

A. THE USE OF TRAINING SET AS THE CANDIDATE SET
With our framework, all previous pattern-based sequence classification methods utilize the set SubTrainD as the candidate reference point set CR in the first step. One limitation of this strategy is that the actual size of CR will be very large. As a result, it poses great challenges for the reference point selection task in the consequent step. To alleviate these issues, we propose to use all original sequences in TrainD to construct the set of candidate reference points. The rationale for this candidate set construction method is based on the following observations.
Firstly, all information given for building the classifier is contained in the original training set. In other words, we will not lose any relevant information for the classification task if TrainD is used as the candidate set of reference sequences. In fact, the widely used candidate set SubTrainD is derived from TrainD.

Secondly, even if we use all the training sequences in TrainD as the reference points, the transformed vectorial data will be a |TrainD| × |TrainD| table. That is, the number of features is still no larger than the number of samples. Therefore, we do not need to analyze a HDLSS (high-dimension, low-sample-size) data set during the classification stage. In contrast, the number of features may be much larger than the number of samples in the vectorial data obtained from SubTrainD if the parameters are not properly specified during the reference point selection procedure. In fact, we have tested the performance when all training sequences are used as reference points. The experimental results show that this quite simple idea is able to achieve comparable performance in terms of classification accuracy.
Finally, the same idea has been employed in the literature of time series classification [42], [43]. Its success motivates us to investigate the feasibility and advantage in the context of discrete sequence classification.

B. TWO REFERENCE POINT SELECTION METHODS
To select reference sequences from TrainD, those existing constraints proposed in the context of sequential pattern mining are not applicable. Therefore, we have to develop new algorithms to choose a subset of representative reference sequences from TrainD. To this end, two different reference sequence selection methods are presented. The first one is an unsupervised method, which selects reference sequences based on cluster analysis without considering the class label information. The second one is a supervised method, which evaluates each candidate sequence according to its discriminative ability across different classes. In the following two sub-sections, we will present the details of these two reference point selection algorithms.

1) Unsupervised Reference Point Selection
As we have discussed in Section V-A, we may choose all sequences in the training set as reference points. However, the number of features in the transformed vectorial data can still be very large if the number of training instances is large. The selection of a small subset of representative training sequences as reference points will greatly reduce the computational burden in the subsequent stage. One natural idea is to divide the training sequences in CR into different clusters using a clustering algorithm [50]. Then, we can select a representative sequence from each cluster as the reference point.
To date, many algorithms have been presented for clustering discrete sequences (e.g. [51]). We can just adopt an existing sequence clustering algorithm in our pipeline. Here we choose the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm [52] to fulfill the sequence clustering task. This algorithm is used because it can often generate a high-quality clustering result and can handle any form of similarity measure. In the following, we will describe the details of the reference point selection method based on GAHC.
In the first stage, the i-th sequence in CR will form a cluster Ci.
In the second stage, a similarity function is used to calculate the similarity between each pair of clusters to produce a similarity matrix Sim, where Sim[i, j] is the similarity between the two clusters Ci and Cj. Many similarity measures have been presented for sequential data (e.g. [53]). Here we choose the Jaccard coefficient. More specific details on the similarity function will be discussed in Section V-C.
In the third stage, we first search the similarity matrix Sim to identify the maximum value maxSim, which corresponds to the most similar pair of clusters Ck and Cl. Then, these two clusters are merged to form a new cluster Ck and the number of clusters in total is decreased by 1. Meanwhile, the entries related to Cl in Sim are set to be 0 and Sim is updated by recalculating the similarity between Ck and each of the remaining clusters. The similarity between the newly generated cluster and each of the remaining clusters is calculated as the average similarity between all members in the two clusters since we use the group-average method. We repeat the third stage until the number of clusters is equal to the number of reference points we want to select.
In the last stage, we select a representative sequence from each cluster. For each cluster, any sequence in this cluster can be used as a representative. To provide a consistent and deterministic output, we use the sequence with the minimum subscript in the cluster as the reference point.
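The sketch below illustrates this clustering-based selection, using scikit-learn's agglomerative clustering with average linkage over a precomputed distance matrix (distance = 1 − similarity) as a stand-in for a dedicated GAHC implementation. It assumes a recent scikit-learn release in which the precomputed option is passed via the metric parameter; the function names are ours.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_based_reference_points(candidate_seqs, similarity, pointnum):
    """Pick one representative sequence per cluster as a reference point."""
    n = len(candidate_seqs)
    sim = np.array([[similarity(a, b) for b in candidate_seqs] for a in candidate_seqs])
    dist = 1.0 - sim                                      # turn similarities into distances
    labels = AgglomerativeClustering(n_clusters=pointnum, metric="precomputed",
                                     linkage="average").fit_predict(dist)
    refs = []
    for c in range(pointnum):
        members = np.where(labels == c)[0]
        refs.append(candidate_seqs[members.min()])        # deterministic: minimum index member
    return refs
```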

2) Supervised Reference Point Selection
To choose a subset of representative reference sequences from TrainD, we can also employ a supervised method in which the class label information is utilized. As we have discussed in Section IV, different mindisc constraints have been widely used to evaluate the discriminative power of sequential patterns. Unfortunately, these constraints are only applicable to the selection of reference points from SubTrainD. In addition, it is not an easy task to set suitable thresholds to control the number of selected reference points. In order to overcome these limitations, we present a reference point selection method based on hypothesis testing, in which the statistical significance in terms of the p-value is used to assess the discriminative power of each candidate sequence.
Hypothesis testing is a commonly used method in statistical inference. The usual line of reasoning is as follows: first, formulate the null hypothesis and the alternative hypothesis; second, select an appropriate test statistic; third, set a significance level threshold; finally, reject the null hypothesis if and only if the p-value is less than the significance level threshold, where the p-value is the probability of getting a value of the test statistic that is at least as extreme as what is actually observed on condition that the null hypothesis is true.
In order to assess the discriminative power of each candidate sequence in terms of the p-value, we can use the null hypothesis that this sequence does not belong to any class and all sequences from different classes are drawn from the same population. If the above null hypothesis is true, then the similarities between the candidate sequence and training sequences are drawn from the same population. Therefore, we can formulate the corresponding hypothesis testing problem as a two-sample testing problem [54], where one sample is the set of similarities between the candidate sequence and the training sequences from one target class and another sample is the set of similarities between the candidate sequence and the training sequences from the remaining classes.
Since we test all candidate sequences in CR at the same time, it is actually a multiple hypothesis testing problem. If no multiple testing correction is conducted, then the number of false positives among reported reference sequences may be very high. To tackle this problem, we adopt the BH procedure to control the FDR (False Discovery Rate) [55], which is the expected proportion of false positives among all reported sequences.
The reference point selection method based on MHT (Multiple Hypothesis Testing) is shown in Algorithm 1. In the following, we will elaborate on this algorithm in detail.

Algorithm 1 Reference Point Selection Based on MHT
Input: Candidate reference sequence set CR, significance level α
Output: Reference point set R
 1: R ← ∅;
 2: for each Dci in CR do
 3:     D+ ← Dci;
 4:     D− ← CR − Dci;
 5:     for each sequence Sk in D+ do
 6:         Sim+ ← ∅;
 7:         Sim− ← ∅;
 8:         for each sequence Sj in D+ do
 9:             calculate Sim[k, j];
10:             Sim+ ← Sim+ ∪ {Sim[k, j]};
11:         end for
12:         for each sequence Sj in D− do
13:             calculate Sim[k, j];
14:             Sim− ← Sim− ∪ {Sim[k, j]};
15:         end for
16:         Sk.pvalue ← Utest(Sim+, Sim−);
17:     end for
18:     sort D+ by p-value in ascending order;
19:     maxindex ← 0;
20:     for each sequence Sk in D+ do
21:         if Sk.pvalue ≤ (k / |D+|) · α then
22:             maxindex ← k;
23:         end if
24:     end for
25:     for k ← maxindex + 1 to |D+| do
26:         D+ ← D+ − {Sk};
27:     end for
28:     R ← R ∪ D+;
29: end for
30: return R;

In the first stage (steps 1–4), we select the set of sequences Dci with the class label ci from CR, regard Dci as the positive data set D+ and use the set of all remaining sequences in CR as the negative data set D−.
In the second stage (steps 5–17), for each sequence Sk in D+, a similarity function is used to calculate the similarity between Sk and each sequence in D+ and D−, where the similarity function is the same as that used in Section V-B1 and Sim[k, j] is the similarity between the two sequences Sk and Sj. Then, the Mann-Whitney U test [56] is used to calculate the p-value based on the two similarity sets Sim+ and Sim−.
In the third stage (steps 18–27), the BH method first sorts the sequences in D+ according to their corresponding p-values in an ascending order, i.e., D+ = {S1, S2, ..., S|D+|} (S1.pvalue ≤ S2.pvalue ≤ ... ≤ S|D+|.pvalue). Then, we sequentially search D+ to identify the maximal sequence index maxindex which satisfies the condition that Sk.pvalue ≤ (k / |D+|) · α, where α is the significance level threshold. Those sequences whose indices are larger than maxindex will be removed from D+.
In the last stage (steps 28–30), we select all sequences from D+ as reference points. The whole process will be terminated after each set of sequences from every class has been regarded as D+.

C. SIMILARITY FUNCTION
In order to measure the similarity between two sequences, we choose the Jaccard coefficient as the similarity function in our method. The larger the Jaccard coefficient between two sequences is, the more similar they are. Given two sequences s = ⟨s1, s2, ..., sl⟩ and t = ⟨t1, t2, ..., tr⟩, the Jaccard coefficient is defined as:
    J(s, t) = |s ∩ t| / (|s| + |t| − |s ∩ t|),   (V.1)
where |s ∩ t| is the number of items in the intersection of s and t. However, this may lose the order information of sequences. To alleviate this issue, we use the LCS (Longest Common Subsequence) between s and t to replace s ∩ t. Then, the Jaccard coefficient is redefined as:
    J(s, t) = |LCS(s, t)| / (|s| + |t| − |LCS(s, t)|).   (V.2)

Example 1. Given two sequences s = ⟨a, b, c, d, e⟩ and t = ⟨e, c, d, c⟩, the LCS(s, t) is ⟨c, d⟩, then the modified Jaccard coefficient is
    J(abcde, ecdc) = 2 / (5 + 4 − 2) ≈ 0.286.
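The following sketch condenses Algorithm 1 for a single class, together with the modified Jaccard coefficient of Equation (V.2), assuming SciPy's Mann-Whitney U test. The helper names are ours and the code is a simplified illustration rather than our exact implementation.

```python
from scipy.stats import mannwhitneyu

def lcs_length(s, t):
    """Length of the longest common subsequence of s and t (dynamic programming)."""
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s, 1):
        for j, b in enumerate(t, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(s)][len(t)]

def jaccard_lcs(s, t):
    """Modified Jaccard coefficient of Equation (V.2)."""
    l = lcs_length(s, t)
    return l / (len(s) + len(t) - l)

def select_for_class(pos_seqs, neg_seqs, alpha=0.05):
    """One iteration of Algorithm 1: keep positive sequences that pass the BH cutoff."""
    pvalues = []
    for sk in pos_seqs:
        sim_pos = [jaccard_lcs(sk, sj) for sj in pos_seqs]
        sim_neg = [jaccard_lcs(sk, sj) for sj in neg_seqs]
        pvalues.append(mannwhitneyu(sim_pos, sim_neg, alternative="two-sided").pvalue)
    order = sorted(range(len(pos_seqs)), key=lambda i: pvalues[i])
    maxindex = 0
    for rank, i in enumerate(order, 1):            # BH condition: p_(k) <= (k / m) * alpha
        if pvalues[i] <= rank / len(order) * alpha:
            maxindex = rank
    return [pos_seqs[i] for i in order[:maxindex]]

print(jaccard_lcs("abcde", "ecdc"))   # Example 1: 2 / (5 + 4 - 2) ≈ 0.286
```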

Note that we can also use other similarity functions in the literature, such as those methods summarized and reviewed in [53]. The choice of a more appropriate similarity function may yield better performance than the modified Jaccard coefficient. In order to check the effect of the similarity function on the classification performance, we also consider the following two alternative similarity functions.
The first one is the String Subsequence Kernel (SSK) [36]. The main idea of SSK is to compare two sequences by means of the subsequences they contain in common. That is, the more subsequences in common, the more similar they are. Given two sequences s = ⟨s1, s2, ..., sl⟩ and t = ⟨t1, t2, ..., tr⟩ and a parameter n, the SSK is defined as:
    Kn(s, t) = ⟨Φ(s), Φ(t)⟩
             = Σ_{u∈I^n} φu(s) · φu(t)
             = Σ_{u∈I^n} (Σ_{u⊆s} λ^{ls(u)}) (Σ_{u⊆t} λ^{lt(u)})
             = Σ_{u∈I^n} Σ_{u⊆s} Σ_{u⊆t} λ^{ls(u)+lt(u)},   (V.3)
where φu(s) is the feature mapping for the sequence s and each u ∈ I^n, I is a finite alphabet, I^n is the set of all subsequences of length n, u is a subsequence of s such that u1 = si1, u2 = si2, ..., un = sin, ls(u) = in − i1 + 1 is the length of u in s, and λ ∈ (0, 1) is a decay factor which is used to penalize the gap. The calculation steps are as follows: enumerate all subsequences of length n, compute the feature vectors for the two sequences, and then compute the similarity. The normalized kernel value is given by
    K̂n(s, t) = Kn(s, t) / √(Kn(s, s) Kn(t, t)).   (V.4)

Example 2. Given two sequences s = ⟨a, b, c, d, e⟩ and t = ⟨e, c, d, c⟩, the subsequences of length 1 (n = 1) are a, b, c, d, e. The corresponding feature vectors for the two sequences can be denoted as φ1(s) = ⟨λ, λ, λ, λ, λ⟩ and φ1(t) = ⟨0, 0, 2λ, λ, λ⟩, then the normalized kernel value is
    K̂1(abcde, ecdc) = K1(abcde, ecdc) / √(K1(abcde, abcde) K1(ecdc, ecdc)) = 4λ² / √(5λ² × 6λ²) ≈ 0.73.

When this function is employed in our method, n = 1 is used as the default parameter setting. Although the setting of n = 1 may lose the order information, it will greatly reduce the computational cost and can provide satisfactory results in practice.
Another alternative similarity function is the normalized LCS. The larger the normalized LCS between two sequences is, the more similar they are. Given two sequences s = ⟨s1, s2, ..., sl⟩ and t = ⟨t1, t2, ..., tr⟩, the normalized LCS is defined as:
    Sim(s, t) = |LCS(s, t)| / Min{|s|, |t|},   (V.5)
where |LCS(s, t)| is the length of the longest common subsequence, |s| is the length of s, and |t| is the length of t.

Example 3. Given two sequences s = ⟨a, b, c, d, e⟩ and t = ⟨e, c, d, c⟩, the LCS(s, t) is ⟨c, d⟩, then the normalized LCS is
    Sim(abcde, ecdc) = 2 / 4 = 0.5.

VI. EXPERIMENTS
To demonstrate the feasibility and advantages of this new framework, we conducted experiments on fourteen real sequential data sets. We compared our two algorithms derived under the reference-based framework with other sequence classification algorithms in terms of classification accuracy. All experiments were conducted on a PC with an Intel(R) Xeon(R) CPU 2.40GHz and 12G memory. All the reported accuracies in the experiments were the average accuracies obtained by repeating the 5-fold cross-validation 5 times, except SCIP (accuracies in SCIP were obtained using 10-fold cross-validation because this is a fixed setting in the software package provided by the author).

A. DATA SETS
We choose fourteen benchmark data sets which are widely used for evaluating sequence classification algorithms: Activity [57], Aslbu [14], Auslan2 [14], Context [58], Epitope [12], Gene [59], News [5], Pioneer [14], Question [60], Reuters [5], Robot [5], Skating [14], Unix [5] and Webkb [5]. The main characteristics of these data sets are summarized in Table 2, where |D| represents the number of sequences in the data set, #items denotes the number of distinct elements, minl, maxl and avgl are used to denote the minimum length, maximum length and average length of the sequences respectively, and #classes represents the number of distinct classes in the data set.

TABLE 2: Summary of the Sequential Data Sets Used in the Experiments

| Dataset | |D| | #items | minl | maxl | avgl | #classes |
| Activity | 35 | 10 | 12 | 43 | 21.14 | 2 |
| Aslbu | 424 | 250 | 2 | 54 | 13.05 | 7 |
| Auslan2 | 200 | 16 | 2 | 18 | 5.53 | 10 |
| Context | 240 | 94 | 22 | 246 | 88.39 | 5 |
| Epitope | 2392 | 20 | 9 | 21 | 15 | 2 |
| Gene | 2942 | 5 | 41 | 216 | 86.53 | 2 |
| News | 4976 | 27884 | 1 | 6779 | 139.96 | 5 |
| Pioneer | 160 | 178 | 4 | 100 | 40.14 | 3 |
| Question | 1731 | 3612 | 4 | 29 | 10.17 | 2 |
| Reuters | 1010 | 6380 | 4 | 533 | 93.84 | 4 |
| Robot | 4302 | 95 | 24 | 24 | 24 | 2 |
| Skating | 530 | 82 | 18 | 240 | 48.12 | 7 |
| Unix | 5472 | 1697 | 1 | 1400 | 32.34 | 4 |
| Webkb | 3667 | 7736 | 1 | 20628 | 129.37 | 3 |

B. PARAMETER SETTINGS
Our two algorithms are denoted by R-MHT (Reference Point Selection Based on MHT) and R-GAHC (Reference Point Selection Based on GAHC), respectively. In addition, the method that uses all sequences in TrainD as reference points is denoted as R-A, which is also included in the performance comparison. We compare our algorithms with five existing sequence classification algorithms: MiSeRe [17] (http://www.misere.co.nf), Sqn2Vec [41], SCIP [5], FSP (the algorithm based on frequent sequential patterns) and DSP (the algorithm based on discriminative sequential patterns).

TABLE 3: Performance comparison of different algorithms in terms of the classification accuracy

| Dataset | Classifier | R-A | R-MHT | R-GAHC | MiSeRe | FSP | DSP | Sqn2VecSEP | Sqn2VecSIM | SCIP variant | SCIP |
| Activity | NB | 0.966 | 0.977 | 0.811 | 1.000 | 0.960 | 0.960 | 1.000 | 1.000 | SCII_HAR | 0.663 |
| Activity | DT | 0.931 | 0.931 | 0.794 | 0.960 | 1.000 | 1.000 | 0.900 | 0.800 | SCII_MA | 0.675 |
| Activity | SVM | 0.977 | 0.926 | 0.629 | 1.000 | 0.994 | 0.994 | 1.000 | 0.950 | SCIS_HAR | 0.967 |
| Activity | KNN | 0.983 | 0.811 | 0.800 | 1.000 | 0.886 | 0.886 | 1.000 | 0.950 | SCIS_MA | 1.000 |
| Aslbu | NB | 0.574 | 0.561 | 0.449 | 0.548 | 0.527 | 0.420 | 0.298 | 0.554 | SCII_HAR | 0.540 |
| Aslbu | DT | 0.523 | 0.527 | 0.480 | 0.565 | 0.542 | 0.459 | 0.405 | 0.484 | SCII_MA | 0.526 |
| Aslbu | SVM | 0.638 | 0.625 | 0.483 | 0.571 | 0.581 | 0.455 | 0.498 | 0.633 | SCIS_HAR | 0.553 |
| Aslbu | KNN | 0.563 | 0.310 | 0.479 | 0.574 | 0.531 | 0.464 | 0.544 | 0.591 | SCIS_MA | 0.536 |
| Auslan2 | NB | 0.322 | 0.330 | 0.334 | 0.304 | 0.292 | 0.145 | 0.260 | 0.290 | SCII_HAR | 0.100 |
| Auslan2 | DT | 0.317 | 0.308 | 0.308 | 0.314 | 0.330 | 0.170 | 0.270 | 0.270 | SCII_MA | 0.095 |
| Auslan2 | SVM | 0.328 | 0.323 | 0.326 | 0.300 | 0.318 | 0.170 | 0.290 | 0.310 | SCIS_HAR | 0.200 |
| Auslan2 | KNN | 0.327 | 0.304 | 0.333 | 0.302 | 0.309 | 0.166 | 0.310 | 0.200 | SCIS_MA | 0.175 |
| Context | NB | 0.780 | 0.778 | 0.772 | 0.938 | 0.812 | 0.572 | 0.900 | 0.900 | SCII_HAR | 0.613 |
| Context | DT | 0.791 | 0.798 | 0.745 | 0.841 | 0.873 | 0.575 | 0.600 | 0.542 | SCII_MA | 0.617 |
| Context | SVM | 0.939 | 0.937 | 0.755 | 0.927 | 0.868 | 0.577 | 0.933 | 0.900 | SCIS_HAR | 0.796 |
| Context | KNN | 0.871 | 0.853 | 0.813 | 0.896 | 0.839 | 0.585 | 0.900 | 0.858 | SCIS_MA | 0.867 |
| Epitope | NB | 0.675 | 0.663 | 0.751 | 0.588 | 0.696 | 0.671 | 0.779 | 0.761 | SCII_HAR | 0.684 |
| Epitope | DT | 0.839 | 0.842 | 0.815 | 0.842 | 0.814 | 0.750 | 0.813 | 0.800 | SCII_MA | 0.712 |
| Epitope | SVM | 0.855 | 0.838 | 0.769 | 0.834 | 0.758 | 0.716 | 0.802 | 0.800 | SCIS_HAR | 0.705 |
| Epitope | KNN | 0.932 | 0.925 | 0.918 | 0.924 | 0.898 | 0.801 | 0.863 | 0.841 | SCIS_MA | 0.721 |
| Gene | NB | 0.998 | 0.997 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | SCII_HAR | 1.000 |
| Gene | DT | 0.997 | 0.997 | 0.998 | 1.000 | 1.000 | 1.000 | 0.967 | 0.976 | SCII_MA | 1.000 |
| Gene | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 1.000 | SCIS_HAR | 1.000 |
| Gene | KNN | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | SCIS_MA | 1.000 |
| News | NB | 0.754 | 0.765 | 0.645 | 0.866 | 0.595 | 0.714 | 0.946 | 0.967 | SCII_HAR | 0.936 |
| News | DT | 0.719 | 0.724 | 0.654 | 0.837 | 0.793 | 0.767 | 0.600 | 0.555 | SCII_MA | 0.942 |
| News | SVM | 0.975 | 0.969 | 0.874 | 0.905 | 0.771 | 0.775 | 0.972 | 0.974 | SCIS_HAR | 0.910 |
| News | KNN | 0.856 | 0.350 | 0.756 | 0.732 | 0.544 | 0.761 | 0.918 | 0.912 | SCIS_MA | 0.918 |
| Pioneer | NB | 0.977 | 0.931 | 0.796 | 0.891 | 0.932 | 0.852 | 0.975 | 0.963 | SCII_HAR | 0.963 |
| Pioneer | DT | 0.883 | 0.870 | 0.826 | 0.988 | 0.964 | 0.926 | 0.788 | 0.825 | SCII_MA | 0.963 |
| Pioneer | SVM | 0.980 | 0.952 | 0.638 | 0.989 | 0.983 | 0.930 | 0.950 | 0.988 | SCIS_HAR | 0.975 |
| Pioneer | KNN | 0.990 | 0.477 | 0.828 | 0.975 | 0.933 | 0.926 | 0.975 | 0.988 | SCIS_MA | 0.975 |
| Question | NB | 0.846 | 0.828 | 0.840 | 0.870 | 0.763 | 0.763 | 0.736 | 0.762 | SCII_HAR | 0.833 |
| Question | DT | 0.889 | 0.881 | 0.877 | 0.885 | 0.822 | 0.759 | 0.717 | 0.809 | SCII_MA | 0.785 |
| Question | SVM | 0.949 | 0.947 | 0.868 | 0.902 | 0.814 | 0.763 | 0.789 | 0.881 | SCIS_HAR | 0.846 |
| Question | KNN | 0.895 | 0.879 | 0.889 | 0.897 | 0.819 | 0.762 | 0.828 | 0.874 | SCIS_MA | 0.837 |
| Reuters | NB | 0.892 | 0.893 | 0.808 | 0.903 | 0.831 | 0.765 | 0.921 | 0.905 | SCII_HAR | 0.951 |
| Reuters | DT | 0.878 | 0.878 | 0.843 | 0.912 | 0.903 | 0.897 | 0.826 | 0.741 | SCII_MA | 0.953 |
| Reuters | SVM | 0.976 | 0.970 | 0.858 | 0.962 | 0.933 | 0.915 | 0.984 | 0.974 | SCIS_HAR | 0.957 |
| Reuters | KNN | 0.958 | 0.452 | 0.918 | 0.899 | 0.894 | 0.900 | 0.960 | 0.960 | SCIS_MA | 0.956 |
| Robot | NB | 0.871 | 0.866 | 0.832 | 0.826 | 0.735 | 0.718 | 0.808 | 0.822 | SCII_HAR | 0.795 |
| Robot | DT | 0.880 | 0.879 | 0.871 | 0.900 | 0.843 | 0.742 | 0.811 | 0.778 | SCII_MA | 0.822 |
| Robot | SVM | 0.955 | 0.952 | 0.902 | 0.913 | 0.780 | 0.723 | 0.840 | 0.834 | SCIS_HAR | 0.817 |
| Robot | KNN | 0.947 | 0.947 | 0.942 | 0.937 | 0.860 | 0.743 | 0.945 | 0.949 | SCIS_MA | 0.819 |
| Skating | NB | 0.281 | 0.271 | 0.197 | 0.290 | 0.240 | N/A | 0.336 | 0.321 | SCII_HAR | 0.181 |
| Skating | DT | 0.258 | 0.215 | 0.204 | 0.258 | 0.272 | N/A | 0.226 | 0.230 | SCII_MA | 0.181 |
| Skating | SVM | 0.375 | 0.277 | 0.208 | 0.293 | 0.299 | N/A | 0.370 | 0.321 | SCIS_HAR | 0.189 |
| Skating | KNN | 0.290 | 0.203 | 0.245 | 0.241 | 0.191 | N/A | 0.302 | 0.340 | SCIS_MA | 0.191 |
| Unix | NB | 0.718 | 0.764 | 0.772 | 0.768 | 0.699 | 0.607 | 0.703 | 0.566 | SCII_HAR | 0.837 |
| Unix | DT | 0.887 | 0.883 | 0.874 | 0.899 | 0.819 | 0.750 | 0.776 | 0.756 | SCII_MA | 0.838 |
| Unix | SVM | 0.927 | 0.921 | 0.906 | 0.915 | 0.820 | 0.748 | 0.899 | 0.872 | SCIS_HAR | 0.857 |
| Unix | KNN | 0.869 | 0.822 | 0.865 | 0.873 | 0.803 | 0.745 | 0.892 | 0.891 | SCIS_MA | 0.842 |
| Webkb | NB | 0.710 | 0.720 | 0.641 | 0.845 | 0.701 | 0.833 | 0.858 | 0.886 | SCII_HAR | 0.897 |
| Webkb | DT | 0.820 | 0.821 | 0.788 | 0.874 | 0.869 | 0.862 | 0.629 | 0.635 | SCII_MA | 0.911 |
| Webkb | SVM | 0.954 | 0.952 | 0.880 | 0.927 | 0.895 | 0.869 | 0.934 | 0.940 | SCIS_HAR | 0.894 |
| Webkb | KNN | 0.887 | 0.544 | 0.851 | 0.779 | 0.843 | 0.858 | 0.772 | 0.691 | SCIS_MA | 0.901 |

TABLE 4: The average classification accuracies of different methods over all data sets used in the experiment

Classifier  R-A    R-MHT  R-GAHC  MiSeRe  FSP    DSP    Sqn2VecSEP  Sqn2VecSIM  SCIP variant  Accuracy
NB          0.740  0.739  0.689   0.760   0.699  0.694  0.751       0.764       SCII_HAR      0.714
DT          0.758  0.754  0.720   0.791   0.775  0.743  0.666       0.657       SCII_MA       0.716
SVM         0.845  0.828  0.721   0.817   0.772  0.741  0.804       0.813       SCIS_HAR      0.762
KNN         0.812  0.634  0.760   0.788   0.739  0.738  0.801       0.789       SCIS_MA       0.767
AVG         0.789  0.739  0.722   0.789   0.746  0.729  0.756       0.756       AVG           0.740

TABLE 5: The average classification accuracies of different similarity functions over all data sets used in the experiment

Classifier  R-A-J  R-A-S  R-A-N  R-MHT-J  R-MHT-S  R-MHT-N  R-GAHC-J  R-GAHC-S  R-GAHC-N
NB          0.740  0.675  0.718  0.739    0.677    0.711    0.689     0.672     0.674
DT          0.758  0.733  0.747  0.754    0.723    0.744    0.720     0.702     0.703
SVM         0.845  0.793  0.837  0.828    0.779    0.829    0.721     0.746     0.762
KNN         0.812  0.752  0.802  0.634    0.738    0.773    0.760     0.706     0.749
AVG         0.789  0.738  0.776  0.739    0.730    0.764    0.722     0.706     0.722
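All of the R-A, R-MHT and R-GAHC results in Tables 3-5 come from the same reference-based pipeline: choose a set of reference sequences, represent every sequence by its vector of similarities to those references, and pass the resulting vectors to an off-the-shelf classifier. The Python sketch below is a minimal illustration of that pipeline and is not the released code of any compared method; the similarity used here is a simple Jaccard coefficient over the distinct symbols of each sequence (one possible instantiation only; Section V-C gives the definitions actually used), and all function and variable names are ours.

```python
import numpy as np
from sklearn.svm import SVC


def jaccard(s, t):
    """Illustrative Jaccard-style similarity: overlap of the distinct symbols
    occurring in the two sequences (an assumption for this sketch only)."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def embed(sequences, references, sim=jaccard):
    """Reference-based embedding: one feature per reference sequence."""
    return np.array([[sim(s, r) for r in references] for s in sequences])


# Toy data: each sequence is a list of symbols with a class label.
train_seqs = [list("abcab"), list("ababa"), list("cdecd"), list("ddeec")]
train_y    = [0, 0, 1, 1]
test_seqs  = [list("abbba"), list("ceddd")]

# R-A style: every training sequence serves as a reference point.
references = train_seqs

X_train = embed(train_seqs, references)
X_test  = embed(test_seqs, references)

clf = SVC().fit(X_train, train_y)
print(clf.predict(X_test))  # expected [0 1] on this toy example
```

R-MHT and R-GAHC differ from R-A only in how the reference set is chosen (reference selection via hypothesis testing and via clustering, respectively); the embedding and classification steps are unchanged.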

In MiSeRe, num_of_rules is specified to be 1024 and execution_time is set to be 5 minutes for all data sets.

Sqn2Vec is an unsupervised method for learning sequence embeddings from both singleton symbols and sequential patterns. It has two variants, Sqn2VecSEP and Sqn2VecSIM, where Sqn2VecSEP (Sqn2VecSIM) generates sequence representations from singleton symbols and sequential patterns separately (simultaneously). In these two variants, minsup = 0.05, maxgap = 4 and the embedding dimension d is set to be 128 for all data sets.

SCIP is a sequence classification method based on interesting patterns, which has four variants: SCII_HAR, SCII_MA, SCIS_HAR and SCIS_MA. In the experiments, the following parameter setting is used for all data sets: minsup = 0.05, minint = 0.02, maxsize = 3, conf = 0.5 and topk = 11.

Frequent sequential patterns have been widely used as features in sequence classification. To include the algorithm based on frequent sequential patterns (denoted by FSP) in the comparison, we employ the PrefixSpan algorithm [61] as the frequent sequential pattern mining algorithm. The parameters are specified as follows: maxsize = 3 and minsup = 0.3 for all data sets except Context, where minsup is set to be 0.9 in order to avoid the generation of too many patterns.

Similarly, discriminative sequential patterns are widely used as features in many sequence classification algorithms and applications. To include the algorithm based on discriminative sequential patterns (denoted by DSP) in the comparison, we first use the PrefixSpan algorithm to mine a set of frequent sequential patterns and then detect discriminative patterns from the frequent pattern set. The parameters for PrefixSpan are identical to those used in FSP, and minGR = 3 is used as the threshold for filtering discriminative sequential patterns.

C. RESULTS
Table 3 presents the detailed performance comparison results in terms of classification accuracy. Note that the result of DSP on the Skating data set is N/A because no discriminative patterns can be found in this data set under the given parameter setting. In the experiments, α = 0.05 is used for R-MHT and pointnum is specified to be 1/10 of the size of TrainD for R-GAHC. After transforming sequences into feature vectors, we chose NB (Naive Bayes), DT (Decision Tree), SVM (Support Vector Machine) and KNN (k Nearest Neighbors) as the classifiers. The implementation of each classifier was obtained from WEKA [62], except for Sqn2Vec, whose classifiers were obtained from scikit-learn [63] since its source code is written in Python.

In order to obtain a global picture of the overall performance of the different algorithms, we calculate the average accuracy over all data sets for each classifier. The corresponding average accuracies are recorded in Table 4. The results show that, of our two methods, R-MHT achieves better performance than R-GAHC when NB, DT and SVM are used as the classifier. However, R-MHT performs poorly when KNN is used as the classifier. Since we select a representative sequence from each cluster in R-GAHC and any sequence in a cluster can be used as a representative, we may miss the most representative sequence. Meanwhile, the choice of clustering method and the specification of the number of clusters also influence the results.

In addition, the R-A method outperforms R-MHT and R-GAHC, since no information relevant to the classification task is lost when all training sequences are used as reference points. However, the feature dimension in R-A is very high, which incurs a high computational cost in practice.

Compared with the other classification methods, our methods are able to achieve comparable performance. In particular, R-A and MiSeRe [17] achieve the highest average classification accuracies among all competitors; for R-A, this is because all information available for building the classifier is contained in the reference point set. The reason why R-MHT and R-GAHC are slightly worse may be that their reference points are less distinct from each other across different classes and some sequences that are important for classification are missed. It is quite remarkable that R-A, a very simple algorithm derived from our framework, performs so well. This indicates that the proposed reference-based sequence classification framework is quite useful in practice, and it can be expected that more accurate feature-based sequence classification methods will be developed under this framework in the future. From Table 3 and Table 4, it can also be observed that none of the algorithms in the comparison always achieves the best performance across all data sets. Therefore, more research effort should still be devoted to the development of effective sequence classification algorithms.

The use of different similarity functions may affect the performance of our algorithms. To investigate this issue, we use two additional similarity functions in the experiments for comparison: SSK and the normalized LCS, whose details have been introduced in Section V-C.

Table 5 presents the average classification accuracies of different similarity functions over all data sets. The Jaccard coefficient, SSK and the normalized LCS are denoted as J, S and N, respectively. In Table 5, R-A-J means that the Jaccard coefficient is used as the similarity function in R-A; the other notations in this table can be interpreted in a similar manner. The results show that the use of different similarity functions can indeed affect the performance of our algorithms. Among these three similarity functions, the Jaccard coefficient achieves better performance in most cases. However, R-MHT-J has unsatisfactory performance when KNN is used as the classifier. It can also be observed that none of the similarity functions is always the best performer. Therefore, more suitable similarity functions should be developed.

The above experimental results and analysis show that the proposed new methods based on our framework can achieve comparable performance to state-of-the-art sequence classification algorithms, which demonstrates the feasibility and advantages of our framework. Moreover, our framework is quite general and flexible, since the selection of both reference points and similarity functions is arbitrary. However, since feature selection and classifier construction are separate in our framework and any existing vectorial data classification method can be used to tackle the sequence classification problem, some features that are critical to the classifier may be filtered out during the selection process.

VII. CONCLUSION
In this paper, we present a reference-based sequence classification framework by generalizing pattern-based methods. This framework is quite general and flexible, and it can be used as a general platform to develop new algorithms for sequence classification. To verify this point, we present several new feature-based sequence classification algorithms under this new framework. A series of comprehensive experiments on real data sets shows that our methods are capable of achieving better classification accuracy than existing sequence classification algorithms. Thus, the reference-based sequence classification framework is quite promising and useful in practice.

In future work, we intend to explore more appropriate reference sequence selection methods and similarity functions to improve the performance and reduce the computational cost. As a result, more accurate feature-based sequence classification methods would be derived under this framework.

REFERENCES
[1] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Elsevier, 2011.
[2] M. Deshpande and G. Karypis, "Evaluation of techniques for classifying biological sequences," in Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2002, pp. 417–431.
[3] Z. Xing, J. Pei, and E. Keogh, "A brief survey on sequence classification," ACM SIGKDD Explorations Newsletter, vol. 12, no. 1, pp. 40–48, 2010.
[4] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of classifiers to solve real world classification problems?" Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, 2014.
[5] C. Zhou, B. Cule, and B. Goethals, "Pattern based sequence classification," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, pp. 1285–1298, 2016.
[6] T. P. Exarchos, M. G. Tsipouras, C. Papaloukas, and D. I. Fotiadis, "A two-stage methodology for sequence classification based on sequential pattern mining and optimization," Data & Knowledge Engineering, vol. 66, no. 3, pp. 467–487, 2008.
[7] D. Lo, H. Cheng, J. Han, S.-C. Khoo, and C. Sun, "Classification of software behaviors for failure detection: a discriminative pattern mining approach," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 557–566.
[8] R. She, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F. S. Brinkman, "Frequent-subsequence-based prediction of outer membrane proteins," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003, pp. 436–445.
[9] T. Hopf and S. Kramer, "Mining class-correlated patterns for sequence labeling," in Proceedings of the 2010 International Conference on Discovery Science. Springer, 2010, pp. 311–325.
[10] H. Haleem, P. K. Sharma, and M. S. Beg, "Novel frequent sequential patterns based probabilistic model for effective classification of web documents," in Proceedings of the 5th International Conference on Computer and Communication Technology. IEEE, 2014, pp. 361–371.
[11] K. Deng and O. R. Zaïane, "An occurrence based approach to mine emerging sequences," in Proceedings of the 2010 International Conference on Data Warehousing and Knowledge Discovery. Berlin, Germany: Springer, 2010, pp. 275–284.
[12] ——, "Contrasting sequence groups by emerging sequences," in Proceedings of the 12th International Conference on Discovery Science. Berlin, Germany: Springer, 2009, pp. 377–384.
[13] Z. He, S. Zhang, and J. Wu, "Significance-based discriminative sequential pattern mining," Expert Systems with Applications, vol. 122, pp. 54–64, 2019.
[14] D. Fradkin and F. Mörchen, "Mining sequential patterns for classification," Knowledge and Information Systems, vol. 45, no. 3, pp. 731–749, 2015.
[15] H. Yahyaoui and A. Al-Mutairi, "A feature-based trust sequence classification algorithm," Information Sciences, vol. 328, pp. 455–484, 2016.
[16] C.-H. Lee, "A multi-phase approach for classifying multi-dimensional sequence data," Intelligent Data Analysis, vol. 19, no. 3, pp. 547–561, 2015.
[17] E. Egho, D. Gay, M. Boullé, N. Voisine, and F. Clérot, "A user parameter-free approach for mining robust sequential classification rules," Knowledge and Information Systems, vol. 52, no. 1, pp. 53–81, 2017.

[18] A. N. Ntagiou, M. G. Tsipouras, N. Giannakeas, and A. T. Tzallas, "Protein structure recognition by means of sequential pattern mining," in 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE). IEEE, 2017, pp. 334–339.
[19] C.-Y. Tsai and C.-J. Chen, "A PSO-AB classifier for solving sequence classification problems," Applied Soft Computing, vol. 27, pp. 11–27, 2015.
[20] I. Batal, H. Valizadegan, G. F. Cooper, and M. Hauskrecht, "A temporal pattern mining approach for classifying electronic health record data," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 4, no. 4, p. 63, 2013.
[21] C.-Y. Tsai, C.-J. Chen, and C.-J. Chien, "A time-interval sequence classification method," Knowledge and Information Systems, vol. 37, no. 2, pp. 251–278, 2013.
[22] T. P. Exarchos, M. G. Tsipouras, C. Papaloukas, and D. I. Fotiadis, "An optimized sequential pattern matching methodology for sequence classification," Knowledge and Information Systems, vol. 19, no. 2, pp. 249–264, 2009.
[23] V. S. Tseng and C.-H. Lee, "CBS: A new classification method by using sequential patterns," in Proceedings of the 2005 SIAM International Conference on Data Mining. SIAM, 2005, pp. 596–600.
[24] ——, "Effective temporal data classification by integrating sequential pattern mining and probabilistic induction," Expert Systems with Applications, vol. 36, no. 5, pp. 9524–9532, 2009.
[25] Z. Syed, P. Indyk, and J. Guttag, "Learning approximate sequential patterns for classification," Journal of Machine Learning Research, vol. 10, no. 8, pp. 1913–1936, 2009.
[26] N. Lesh, M. J. Zaki, and M. Ogihara, "Mining features for sequence classification," in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1999, pp. 342–346.
[27] P. Rani and V. Pudi, "RBNBC: Repeat based naïve Bayes classifier for biological sequences," in Proceedings of the 8th IEEE International Conference on Data Mining. IEEE, 2008, pp. 989–994.
[28] P. Holat, M. Plantevit, C. Raïssi, N. Tomeh, T. Charnois, and B. Crémilleux, "Sequence classification based on delta-free sequential patterns," in Proceedings of the 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 170–179.
[29] J. K. Febrer-Hernández, R. Hernández-León, C. Feregrino-Uribe, and J. Hernández-Palancar, "SPaC-NF: A classifier based on sequential patterns with high netconf," Intelligent Data Analysis, vol. 20, no. 5, pp. 1101–1113, 2016.
[30] Z. He, S. Zhang, F. Gu, and J. Wu, "Mining conditional discriminative sequential patterns," Information Sciences, vol. 478, pp. 524–539, 2019.
[31] G. Ifrim, G. Bakir, and G. Weikum, "Fast logistic regression for text categorization with variable-length n-grams," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 354–362.
[32] G. Ifrim and C. Wiuf, "Bounded coordinate-descent for biological sequence classification in high dimensional predictor space," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 708–716.
[33] D. Okanohara and J. Tsujii, "Text categorization with all substring features," in Proceedings of the 2009 SIAM International Conference on Data Mining. SIAM, 2009, pp. 838–846.
[34] S. Sonnenburg, G. Rätsch, and C. Schäfer, "Learning interpretable SVMs for biological sequence classification," in Proceedings of the 9th Annual International Conference on Research in Computational Molecular Biology. Berlin, Germany: Springer, 2005, pp. 389–407.
[35] C. Leslie, E. Eskin, and W. S. Noble, "The spectrum kernel: A string kernel for SVM protein classification," in Proceedings of the 2002 Pacific Symposium on Biocomputing. World Scientific, 2002, pp. 564–575.
[36] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, "Text classification using string kernels," Journal of Machine Learning Research, vol. 2, no. 2, pp. 419–444, 2002.
[37] E. Eskin, J. Weston, W. S. Noble, and C. S. Leslie, "Mismatch string kernels for SVM protein classification," in Advances in Neural Information Processing Systems, 2003, pp. 1441–1448.
[38] C. Leslie and R. Kuang, "Fast string kernels using inexact matching for protein sequences," Journal of Machine Learning Research, vol. 5, no. 11, pp. 1435–1455, 2004.
[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[40] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in Proceedings of the 2014 International Conference on Machine Learning, 2014, pp. 1188–1196.
[41] D. Nguyen, W. Luo, T. D. Nguyen, S. Venkatesh, and D. Phung, "Sqn2Vec: Learning sequence representation via sequential patterns with a gap constraint," in Proceedings of the 2018 Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham, Switzerland: Springer, 2018, pp. 569–584.
[42] A. Iosifidis, A. Tefas, and I. Pitas, "Multidimensional sequence classification based on fuzzy distances and discriminant analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 11, pp. 2564–2575, 2013.
[43] R. J. Kate, "Using dynamic time warping distances as features for improved time series classification," Data Mining and Knowledge Discovery, vol. 30, no. 2, pp. 283–312, 2016.
[44] G. Blackshields, M. Larkin, I. M. Wallace, A. Wilm, and D. G. Higgins, "Fast embedding methods for clustering tens of thousands of sequences," Computational Biology and Chemistry, vol. 32, no. 4, pp. 282–286, 2008.
[45] K. Voevodski, M.-F. Balcan, H. Röglin, S.-H. Teng, and Y. Xia, "Active clustering of biological sequences," Journal of Machine Learning Research, vol. 13, no. 1, pp. 203–225, 2012.
[46] C. Faloutsos and K.-I. Lin, "FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets," in Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, 1995, pp. 163–174.
[47] G. R. Hjaltason and H. Samet, "Properties of embedding methods for similarity searching in metric spaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 530–549, 2003.
[48] X. Liu, J. Wu, F. Gu, J. Wang, and Z. He, "Discriminative pattern mining and its applications in bioinformatics," Briefings in Bioinformatics, vol. 16, no. 5, pp. 884–900, 2015.
[49] C. Zhang, C. Liu, X. Zhang, and G. Almpanidis, "An up-to-date comparison of state-of-the-art classification algorithms," Expert Systems with Applications, vol. 82, pp. 128–150, 2017.
[50] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[51] T. Xiong, S. Wang, Q. Jiang, and J. Z. Huang, "A novel variable-order Markov model for clustering categorical sequences," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 10, pp. 2339–2353, 2014.
[52] P. Willett, "Recent trends in hierarchic document clustering: a critical review," Information Processing and Management, vol. 24, no. 5, pp. 577–597, 1988.
[53] K. Rieck and P. Laskov, "Linear-time computation of similarity measures for sequential data," Journal of Machine Learning Research, vol. 9, no. 1, pp. 23–48, 2008.
[54] J. D. Gibbons and S. Chakraborti, Nonparametric Statistical Inference. Berlin, Germany: Springer, 2011.
[55] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.
[56] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," The Annals of Mathematical Statistics, vol. 18, no. 1, pp. 50–60, 1947.
[57] M. Lichman, "UCI machine learning repository," University of California, Irvine, School of Information and Computer Sciences, 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[58] J. Mäntyjärvi, J. Himberg, P. Kangas, U. Tuomela, and P. Huuskonen, "Sensor signal data set for exploring context recognition of mobile devices," in Proceedings of the 2nd International Conference on Pervasive Computing, 2004, pp. 18–23.
[59] L. Wei, M. Liao, Y. Gao, R. Ji, Z. He, and Q. Zou, "Improved and promising identification of human microRNAs by incorporating a high-quality negative set," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 1, pp. 192–201, 2014.
[60] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2014, pp. 1746–1751.
[61] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth," in Proceedings of the 17th International Conference on Data Engineering. IEEE, 2001, pp. 215–224.

[62] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[63] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, no. 10, pp. 2825–2830, 2011.

ZENGYOU HE received the BS, MS, and PhD degrees in computer science from Harbin Institute of Technology, China, in 2000, 2002, and 2006, respectively. He was a research associate in the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, from February 2007 to February 2010. He is currently a professor in the School of Software, Dalian University of Technology. His research interests include data mining and bioinformatics.

QUAN ZOU (M'13-SM'17) received the BSc, MSc and PhD degrees in computer science from Harbin Institute of Technology, China, in 2004, 2007 and 2009, respectively. He worked at Xiamen University and Tianjin University from 2009 to 2018 as an assistant professor, associate professor and professor. He is currently a professor in the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China. His research is in the areas of bioinformatics, machine learning and parallel computing. Several related works have been published in Science, Briefings in Bioinformatics, Bioinformatics, IEEE/ACM Transactions on Computational Biology and Bioinformatics, etc. Google Scholar shows that his more than 100 papers have been cited more than 5000 times. He is the editor-in-chief of Current Bioinformatics, an associate editor of IEEE Access, and an editorial board member of Computers in Biology and Medicine, Scientific Reports, etc. He was selected as one of the Clarivate Analytics Highly Cited Researchers in 2018.

GUANGYAO XU received the BS degree in electronic and information engineering from Dalian Maritime University, China, in 2018. He is currently working toward the MS degree in the School of Software at Dalian University of Technology. His research interests include data mining and its applications.

CHAOHUA SHENG received the BS degree in software engineering from Dalian University of Technology, China, in 2018. He is currently working toward the MS degree in the School of Software at Dalian University of Technology. His research interests include data mining and its applications.

BO XU received the BSc and PhD degrees from Dalian University of Technology, China, in 2007 and 2014, respectively. She is currently an associate professor in the School of Software at Dalian University of Technology. Her current research interests include biomedical literature data mining, information retrieval, and natural language processing.
