
Information Granules Filtering for Inexact Sequential Pattern Mining by Evolutionary Computation

Enrico Maiorino¹, Francesca Possemato¹, Valerio Modugno² and Antonello Rizzi¹
¹Dipartimento di Ingegneria dell'Informazione, Elettronica e Telecomunicazioni (DIET), SAPIENZA University of Rome, Via Eudossiana 18, 00184, Rome, Italy
²Dipartimento di Ingegneria Informatica, Automatica e Gestionale (DIAG), SAPIENZA University of Rome, Via Ariosto 25, 00185, Rome, Italy

Keywords: Granular Modeling, Sequence Data Mining, Inexact Sequence Matching, Frequent Subsequences Extraction, Evolutionary Computation

Abstract: Nowadays, the wide development of techniques to communicate and store information of all kinds has raised the need to find new methods to analyze and interpret big quantities of data. One of the most important problems in sequential data analysis is frequent pattern mining, which consists in finding frequent subsequences (patterns) in a sequence database in order to highlight and extract interesting knowledge from the data at hand. Usually real-world data are affected by several noise sources and this makes the analysis more challenging, so that approximate methods are required. A common procedure employed to identify recurrent patterns in noisy data is based on clustering algorithms relying on some edit distance between subsequences. When facing inexact mining problems, this plain approach can produce many spurious patterns due to multiple pattern matchings on the same sequence excerpt. In this paper we present a method to overcome this drawback by applying an optimization-based filter that identifies the most descriptive patterns among those found by the clustering process, returning clusters that are more compact and easier to interpret. We evaluate the mining system's performance on synthetic data with variable amounts of noise, showing that the algorithm performs well in synthesizing the retrieved patterns with acceptable information loss.

1 INTRODUCTION

Nowadays, sequence data mining is a very interesting field of research that is going to be central in the next years due to the growth of the so called "Big Data" challenge. Moreover, available data in different application fields consist of sequences (for example over time or space) of generic objects. Generally speaking, given a set of sequences defined over a particular domain, a data mining problem consists in searching for possible frequent subsequences (patterns), relying on inexact matching procedures. In this work we propose a possible solution for the so called approximate subsequence mining problem, in which we admit some noise in the matching process. For instance, searching for recurrent patterns is a critical task in the study of DNA, aiming to identify some genetic mutations or to classify proteins according to some structural properties. Sometimes the process of pattern extraction returns sequences that differ from the others in a few positions. It is then not difficult to understand that the choice of an adequate dissimilarity measure becomes a critical issue when we want to design an algorithm able to deal with this kind of problems. Handling sequences of objects is another challenging aspect, especially when the data mining task is defined over a structured domain of sequences. Thinking of data mining algorithms as a building block of a wider system facing a classification task, a reasonable way to treat complex sequential data is to map sequences to ℝ^d vectors by means of some feature extraction procedures, in order to use classification techniques that deal with real valued vectors as input data. The Granular Computing (GrC) (Bargiela and Pedrycz, 2003) approach offers a valuable framework to fill the gap between the input sequence domain and the feature space ℝ^d, and relies on the so-called information granules that play the role of indistinguishable features at a particular level of abstraction adopted for system description. The main objective of granular modeling consists in finding the correct level of information granulation that best describes the input data.


The problem of pattern data mining is similar to the problem of mining frequent item sets and subsequences. For this type of problem many works (Zaki, 2001; Yan et al., 2003) describe search techniques for non-contiguous sequences of objects. For example, the first work (Agrawal and Srikant, 1995) of this sub-field of data mining is related to the analysis and prediction of consumer behaviour. In this context, a transaction consists in the sale of a set of items and a sequence is an ordered set of transactions. If we consider, for instance, a sequence of transactions ⟨a,b,a,c,a,c⟩, a possible non-contiguous subsequence could be ⟨a,b,c,c⟩. However, this approach is not ideal when the objective is to extract patterns where the contiguity of the component objects inside a sequence plays a fundamental role in the extraction of information. The computational biology community has developed many methods for detecting frequent patterns, which in this field are called motifs. Some works (Sinha and Tompa, 2003; Pavesi et al., 2004) use the Hamming distance to search for recurrent motifs in data. Other works employ suffix trees (Zhu et al., 2007) or suffix arrays (Ji and Bailey, 2007) to store and organize the search space, or use a GrC framework for the extraction of frequent patterns in data (Rizzi et al., 2013). Most methods focus only on the recurrence of patterns in data without taking into account the concept of "information redundancy", or, in other words, the existence of overlapping among retrieved patterns. In this paper we present a new approximate subsequence mining algorithm called FRL-GRADIS (Filtered Reinforcement Learning-based GRanular Approach for DIscrete Sequences) aiming to reduce the information redundancy of RL-GRADIS (Rizzi et al., 2012) by executing an optimization-based refinement process on the extracted patterns. In particular, this paper introduces the following contributions:

1. our approach finds the patterns that maximize the knowledge about the process that generates the sequences;
2. we employ a dissimilarity measure that can extract patterns despite the presence of noise and possible corruptions of the patterns themselves;
3. our method can be applied to every kind of sequence of objects, given a properly defined similarity or dissimilarity function in the objects domain;
4. the filtering operation produces results that can be interpreted more easily by the application field's experts;
5. considering this procedure as an inner module of a more complex classification system, it allows to further reduce the dimension of the feature space, thus better addressing the curse of dimensionality problem.

This paper consists of three parts. In the first part we provide some useful definitions and a proper notation; in the second part we present FRL-GRADIS as a two-step procedure, consisting of a subsequences extraction step and a subsequences filtering step. Finally, in the third part, we report the results obtained by applying the algorithm to synthetic data, showing a good overall performance in most cases.

2 PROBLEM DEFINITION

Let D = {α_i} be a domain of objects α_i. The objects represent the atomic units of information. A sequence S is an ordered list of n objects that can be represented by the set of pairs

$$S = \{ (i \to \beta_i) \mid i = 1, \dots, n;\ \beta_i \in D \},$$

where the integer i is the order index of the object β_i within the sequence S. S can also be expressed with the compact notation

$$S \equiv \langle \beta_1, \beta_2, \dots, \beta_n \rangle.$$

A sequence database SDB is a set of sequences S_i of variable lengths n_i. For example, the DNA sequence S = ⟨G,T,C,A,A,T,G,T,C⟩ is defined over the domain of the four DNA bases D = {A,C,G,T}.

A sequence S_1 = ⟨β'_1, β'_2, ..., β'_{n_1}⟩ is a subsequence of a sequence S_2 = ⟨β''_1, β''_2, ..., β''_{n_2}⟩ if n_1 ≤ n_2 and S_1 ⊆ S_2. The position π_{S_2}(S_1) of the subsequence S_1 with respect to the sequence S_2 corresponds to the order index of its first element (in this case the order index of the object β'_1) within the sequence S_2. The subsequence S_1 is also said to be connected if

$$\beta'_j = \beta''_{j+k} \quad \forall\, j = 1, \dots, n_1,$$

where k = π_{S_2}(S_1). Two subsequences S_1 and S_2 of a sequence S are overlapping if

$$S_1 \cap S_2 \neq \emptyset.$$

In the example described above, the complete notation for the sequence S = ⟨G,T,C,A,A,T,G,T,C⟩ is

$$S = \{ (1 \to G), (2 \to T), (3 \to C), \dots \}$$

and a possible connected subsequence S_1 = ⟨A,T,G⟩ corresponds to the set

$$S_1 = \{ (5 \to A), (6 \to T), (7 \to G) \}.$$

Notice that the objects of the subsequence S_1 inherit the order indices from the containing sequence S, so that they are univocally referred to their original positions in S. From now on we will focus only on connected subsequences, therefore the connection property will be implicitly assumed.
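As a concrete illustration of these definitions, the following toy sketch (ours, not part of the original paper) encodes a sequence as a Python list and checks the connection property at a given position; note that Python offsets are 0-based, while the paper's order indices are 1-based.

```python
# Toy illustration of the notation above (our sketch, not the authors' code):
# a sequence is an ordered list of objects, and a connected subsequence
# occurs contiguously inside its containing sequence.

def is_connected(sub, seq, k):
    """True if `sub` occurs contiguously in `seq` starting at offset k
    (0-based here; the paper's order indices are 1-based)."""
    if k + len(sub) > len(seq):
        return False
    return all(sub[j] == seq[k + j] for j in range(len(sub)))

S  = list("GTCAATGTC")   # S  = <G,T,C,A,A,T,G,T,C>
S1 = list("ATG")         # S1 = <A,T,G>, position pi_S(S1) = 5 (1-based)
print(is_connected(S1, S, 4))   # True at 0-based offset 4
```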




Figure 1: Coverage of the pattern Ω over the subsequence C ⊆ S with tolerance δ. Black boxes and gray boxes represent respectively the covered and the uncovered objects of the sequence S. Notice that if δ > 0 the sequences Ω and C need not be of the same length.



Figure 2: Coverage examples in the case of DNA sequences. The searched pattern ⟨A,G,G,T⟩ is found 3 times with tolerance δ ≤ 1 using the Levenshtein distance. The three occurrences show all the edit operations allowed by the considered edit distance, respectively object substitution, deletion and insertion.

2.1 Pattern Coverage

The objective of this algorithm is to find a set of frequent subsequences of objects named patterns. A pattern Ω is a subsequence of objects ⟨ω_1, ω_2, ..., ω_|Ω|⟩, with ω_i ∈ D, that is more likely to occur within the dataset SDB. Patterns are unknown a priori and represent the underlying information of the dataset records. Moreover, each sequence is subject to noise, whose effects include the addition, substitution and deletion of objects in a random uncorrelated fashion, and this makes the recognition of recurrent subsequences more challenging.

Given a sequence S ∈ SDB and a set of patterns Γ = {Ω_1, ..., Ω_m}, we want to determine a quality criterion for the description of S in terms of the pattern set Γ. A connected subsequence C ⊆ S is said to be covered by a pattern Ω ∈ Γ iff d(C, Ω) ≤ δ, where d(·,·) is a properly defined distance function and δ is a fixed tolerance (Fig. 1). The coverage C^(δ)_Ω(S) of the pattern Ω over the sequence S is the union set of all non-overlapping connected subsequences covered by the pattern. We can write

$$C^{(\delta)}_{\Omega}(S) = \bigcup_i \left\{ C_i \subseteq S \ \text{s.t.}\ d(C_i, \Omega) \le \delta \ \wedge\ C_i \cap C_j = \emptyset \ \forall\, i \ne j \right\}. \quad (1)$$

Formally, this set is still not well defined until we expand on the meaning of the property

$$C_i \cap C_j = \emptyset, \quad (2)$$

which is the requirement for the covered subsequences to be non-overlapping. Indeed, we need to include additional rules on how to deal with these overlappings when they occur. To understand better, let us recall the example of the DNA sequences presented above, where the dissimilarity measure between two sequences is the Levenshtein distance. The set of all subsequences C_i covered by the pattern Ω over the sequence S (in this context referred to as candidates) will consist only of sequences with lengths between |Ω| − δ and |Ω| + δ. Indeed, these bounds correspond respectively to the extreme cases of deleting and adding δ objects to the subsequence. In case of two overlapping candidates C_i and C_j, in order to satisfy property (2) of the coverage C^(δ)_Ω(S), we have to define a rule to decide which subsequence belongs to the set C^(δ)_Ω(S) and which does not. Candidates with smaller distances from the searched pattern Ω are chosen over overlapping candidates with higher distances. If two overlapping candidates have the same distance, the first starting from the left is chosen; if also their starting position is the same, the shorter one (i.e. the one with smaller length) has the precedence.

A coverage example in the context of the DNA sequences is shown in Fig. 2. The coverage of the pattern Ω = ⟨A,G,G,T⟩ over the sequence S is C^(δ)_Ω(S) = ⟨A,C,G,T⟩ ∪ ⟨G,G,T⟩ ∪ ⟨A,C,G,G,T⟩. Similarly, the compound coverage of the pattern set Γ is defined as

$$C^{(\delta)}_{\Gamma}(S) = \bigcup_{\Omega \in \Gamma} C^{(\delta)}_{\Omega}(S). \quad (3)$$

It is important to notice that, in this case, this set can include overlapping subsequences only if they belong to coverages of different patterns (i.e. it is assumed that different patterns can overlap). For example, consider the case shown in Fig. 3: the coverage C^(δ)_{Ω_1,Ω_2}(S) for the patterns Ω_1 = ⟨A,G,G,T⟩ and Ω_2 = ⟨G,T,C⟩ is equal to C^(δ)_{Ω_1,Ω_2}(S) = ⟨A,G,G,T,C⟩.
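The candidate-selection rules above translate directly into a greedy procedure. The following sketch (our reading of the definitions, not the authors' code) computes the coverage of a single pattern over a string using the Levenshtein distance, ranking candidates by distance, then start position, then length.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                        # deletion
                         cur[j - 1] + 1,                     # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[len(b)]

def coverage(pattern, seq, delta):
    """Greedy construction of the coverage of Eq. (1): candidate substrings
    have length in [|pattern|-delta, |pattern|+delta] and distance <= delta;
    overlaps are resolved by smallest distance, then leftmost start,
    then shortest length, as described in the text."""
    cands = []
    for start in range(len(seq)):
        for length in range(len(pattern) - delta, len(pattern) + delta + 1):
            cand = seq[start:start + length]
            if len(cand) < length:
                continue                      # ran past the end of seq
            d = levenshtein(cand, pattern)
            if d <= delta:
                cands.append((d, start, length))
    covered, chosen = set(), []
    for d, start, length in sorted(cands):    # (distance, start, length) order
        span = set(range(start, start + length))
        if span.isdisjoint(covered):          # non-overlap requirement, Eq. (2)
            covered |= span
            chosen.append(seq[start:start + length])
    return chosen

print(coverage("AGGT", "GGTTACGTCCGGG", 1))   # ['GGT', 'ACGT']
```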


Figure 3: Example of the compound coverage of multiple symbols, where the symbols ⟨G,T,C⟩ and ⟨A,G,G,T⟩ have Levenshtein distances from the corresponding subsequences equal to 0 and 1, respectively. Notice that different symbols can cover overlapping subsequences, while competing coverages of the same symbol are not allowed and only the most similar subsequence is chosen.

3 THE MINING ALGORITHM

In this section we describe FRL-GRADIS, a clustering-based sequence mining algorithm. It is able to discover clusters of connected subsequences of variable lengths that are frequent in a sequence dataset, using an inexact matching procedure. FRL-GRADIS consists of two main steps:

• the Symbols Alphabet Extraction step, which addresses the problem of finding the most frequent subsequences within a SDB. It is performed by means of the clustering algorithm RL-GRADIS (Rizzi et al., 2012), which identifies frequent subsequences as representatives of dense clusters of similar subsequences. These representatives are referred to as symbols and the pattern set as the alphabet. The clustering procedure relies on a properly defined edit distance between the subsequences (e.g. Levenshtein distance, DTW, etc.). However, this approach alone has the drawback of extracting many superfluous symbols which generally dilute the pattern set and deteriorate the interpretability of the produced pattern set.

• the Alphabet Filtering step, which deals with the problem stated above. The objective is to filter out all the spurious or redundant symbols contained in the alphabet produced by the symbols extraction step. To accomplish this goal we employ a heuristic approach based on evolutionary optimization over a validation SDB.

One of the distinctive features of this algorithm is its generality with respect to the kind of data contained in the input sequence database (e.g., sequences of real numbers or characters as well as sequences of complex data structures). Indeed, both steps outlined above take advantage of a dissimilarity-based approach, with the dissimilarity function being an arbitrarily complex measure between two ordered sequences, not necessarily a metric.

In the following, we first describe the main aspects of the symbols alphabet extraction procedure, then we present the new filtering method. For more details on the symbols alphabet construction we refer the reader to (Rizzi et al., 2012).

3.1 Frequent Subsequences Identification

Consider the input training dataset of sequences T = {S_1, S_2, ..., S_|T|} and a properly defined dissimilarity measure d : T × T → ℝ between two objects of the training dataset (e.g., Levenshtein distance for strings of characters). The goal of the subsequences extraction step is the identification of a finite set of symbols A_e = {Ω_1, Ω_2, ..., Ω_|A_e|} (the subscript "e" stands for "extraction", as in extraction step), computed using the distance d(·,·) in a free clustering procedure. The algorithm we chose to accomplish this task is RL-GRADIS, which is based on the well-known Basic Sequential Algorithmic Scheme (BSAS) clustering algorithm (Rizzi et al., 2012). Symbols are found by analysing a suited set of variable-length subsequences of T, also called n-grams, that are generated by expanding each input sequence S ∈ T. The expansion is done by listing all n-grams with lengths varying between the values l_min and l_max. The parameters l_min and l_max are user-defined and are respectively the minimum and maximum admissible length for the mined patterns. The extracted n-grams are then collected into the SDB N. At this point, the clustering procedure is executed on N. For each cluster we compute its representative, defined by the Minimum Sum of Distances (MinSOD) technique (Rizzi et al., 2012) as the element having the minimum total distance from the other elements of the cluster. This technique allows to represent the corresponding clusters by means of their most characteristic elements.

The quality of each cluster is measured by its firing strength f, where f ∈ [0,1]. Firing strengths are used to track the dynamics describing the updating rate of the clusters when the input stream of subsequences N is analyzed. A reinforcement learning procedure is used to dynamically update the list of candidate symbols based on their firing strength.


Clusters with a low rate of update (low firing strength) are discarded in an on-line fashion, along with the processing of the input data stream N. RL-GRADIS maintains a dynamic list of candidate symbols, named receptors, which are the representatives of the active clusters. Each receptor's firing strength (i.e. the firing strength of its corresponding cluster) is dynamically updated by means of two additional parameters, α, β ∈ [0,1]. The α parameter is used as a reinforcement weight factor each time a cluster R is updated, i.e., each time a new input subsequence is added to R. The firing strength update rule is defined as follows:

$$f(R) \leftarrow f(R) + \alpha\,(1 - f(R)). \quad (4)$$

The β parameter, instead, is used to model the speed of forgetfulness of receptors according to the following formula:

$$f(R) \leftarrow (1 - \beta)\, f(R). \quad (5)$$

The firing strength updating rules shown in Eqs. (4) and (5) are performed for each currently identified receptor, soon after the analysis of each input subsequence. Therefore, receptors/clusters that are not updated frequently during the analysis of N will likely have a low strength value, and this will cause the system to remove the receptor from the list.
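For illustration, the two update rules condense into a few lines; the simulation below (ours, with arbitrary parameter values) shows how an occasionally reinforced receptor keeps a moderate firing strength while an idle one decays towards removal.

```python
import random

# Toy illustration (ours) of the receptor dynamics of Eqs. (4) and (5):
# alpha reinforces a receptor whose cluster absorbs a new subsequence,
# beta makes every receptor decay at each step.

def reinforce(f, alpha):
    return f + alpha * (1.0 - f)    # Eq. (4): pushes f towards 1

def forget(f, beta):
    return (1.0 - beta) * f         # Eq. (5): decays f towards 0

random.seed(0)
f_active = f_idle = 0.5
for _ in range(100):
    if random.random() < 0.3:       # the active receptor fires ~30% of the time
        f_active = reinforce(f_active, alpha=0.1)
    f_active = forget(f_active, beta=0.05)
    f_idle   = forget(f_idle,   beta=0.05)
print(round(f_active, 3), round(f_idle, 4))   # idle strength ~0: pruned
```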
3.2 Subsequences Filtering

As introduced above, the output alphabet A_e of the clustering procedure is generally redundant and includes many spurious symbols that make the recognition of the true alphabet quite difficult. To deal with this problem, an optimization step is performed to reduce the alphabet size, aiming at retaining only the most significant symbols, i.e. only those that best resemble the original, unknown ones. Since this procedure works like a filter, we call the output of this optimization the filtered alphabet A_f and, clearly, A_f ⊂ A_e holds. Nevertheless, it is important for the filtered alphabet's size not to be smaller than the size of the true alphabet, since in this case useful information would be lost.

Let Γ ⊂ A_e be a candidate subset of symbols of the alphabet A_e and S ∈ V a sequence of a validation SDB V. We assume the descriptive power of the symbols set Γ, with respect to the sequence S, to be proportional to the quantity |C^(δ)_Γ(S)| (cf. Eq. (3)), i.e. the number of objects β_i ∈ S covered by the symbols set Γ. In fact, intuitively, a lower number of objects left uncovered by the symbols of Γ in the whole SDB can be considered as a clue that Γ itself will likely contain the true alphabet. The normalized number of uncovered objects in a sequence S by a pattern set Γ corresponds to the quantity

$$P = \frac{|S| - |C^{(\delta)}_{\Gamma}(S)|}{|S|}, \quad (6)$$

where the operator |·| stands for the cardinality of the set. The term P assumes the value 0 when the sequence S is completely covered by the pattern set Γ, and the value 1 when none of the symbols in Γ are present in the sequence S. Notice that C^(δ)_Γ(S) depends on the parameter δ, which represents the tolerance of the system towards the corruption of symbols' occurrences caused by noise.

On the other hand, a bigger pattern set is more likely to contain spurious patterns, which tend to hinder the interpretability of the obtained results, so smaller set sizes are to be preferred. This property can be described with the normalized alphabet size

$$Q = \frac{|\Gamma|}{|A_e|}, \quad (7)$$

where A_e is the alphabet of symbols extracted by the clustering procedure described in the last section. Clearly, the cardinality of A_e represents an upper bound for the size of the filtered alphabet, so the term Q ranges from 0 to 1. The terms P and Q generally show opposite trends, since a bigger set of symbols is more likely to cover a bigger portion of the sequence and vice versa. Finding a tradeoff between these two quantities corresponds to minimizing the convex objective function

$$G^{(\delta)}_{S}(\Gamma) = \lambda Q + (1 - \lambda) P, \quad (8)$$

where λ ∈ [0,1] is a meta-parameter that weighs the relative importance between the two contributions. It is easy to verify that

$$0 \le G^{(\delta)}_{S}(\Gamma) \le 1. \quad (9)$$

More generally, for a validation SDB V, the global objective function is the mean value of G^(δ)_S(Γ) over all sequences S_i ∈ V, hence

$$G^{(\delta)}_{V}(\Gamma) = \frac{\sum_{1 \le i \le |V|} G^{(\delta)}_{S_i}(\Gamma)}{|V|}, \quad (10)$$

and the best symbols set after the optimization procedure is

$$A_f = \operatorname*{argmin}_{\Gamma \subset A_e} G^{(\delta)}_{V}(\Gamma). \quad (11)$$
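Combining the coverage of Section 2.1 with Eqs. (6)-(10) gives the following sketch (ours, not the authors' implementation); `levenshtein` and the greedy candidate selection come from the earlier sketch, and per-pattern coverages are unioned as in Eq. (3).

```python
# Sketch (ours) of the objective of Eqs. (6)-(10), reusing `levenshtein`
# and the greedy selection rule from the Section 2.1 sketch.

def pattern_positions(pattern, seq, delta):
    """Set of 0-based positions covered by one pattern (cf. Eq. (1))."""
    cands = []
    for start in range(len(seq)):
        for length in range(len(pattern) - delta, len(pattern) + delta + 1):
            cand = seq[start:start + length]
            if len(cand) == length and levenshtein(cand, pattern) <= delta:
                cands.append((levenshtein(cand, pattern), start, length))
    covered = set()
    for d, start, length in sorted(cands):
        span = set(range(start, start + length))
        if span.isdisjoint(covered):
            covered |= span
    return covered

def objective(gamma, extracted_size, validation, delta, lam):
    """G_V of Eq. (10): mean over the validation SDB of lam*Q + (1-lam)*P.
    Coverages of different patterns may overlap, so they are unioned (Eq. (3))."""
    Q = len(gamma) / extracted_size                     # Eq. (7)
    total = 0.0
    for S in validation:
        covered = set()
        for omega in gamma:
            covered |= pattern_positions(omega, S, delta)
        P = (len(S) - len(covered)) / len(S)            # Eq. (6)
        total += lam * Q + (1.0 - lam) * P              # Eq. (8)
    return total / len(validation)                      # Eq. (10)
```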


To solve the optimization problem described by Eq. (11) we employ a standard genetic algorithm, where each individual of the population is a subset Γ of the extracted alphabet A_e = {Ω_1, ..., Ω_|A_e|}. The genetic code of the individual is encoded as a binary sequence E_Γ of length |A_e| of the form

$$E_{\Gamma} = \langle e_1, e_2, \dots, e_{|A_e|} \rangle, \quad (12)$$

with

$$e_i = \begin{cases} 1 & \text{iff } \Omega_i \in \Gamma \\ 0 & \text{otherwise.} \end{cases}$$

It is important not to mistake genetic codes for the SDB sequences described earlier, even if they are both formally defined as ordered sequences. Given a validation dataset V and a fixed tolerance δ, the fitness value F(E_Γ) of each individual E_Γ is inversely proportional to the value of the objective function introduced in the last paragraph, hence

$$F(E_{\Gamma}) = 1 - G^{(\delta)}_{V}(\Gamma). \quad (13)$$

The computation is then performed with standard crossover and mutation operators between the binary sequences, and the stop condition is met when the maximum fitness does not change for a fixed number N_stall of generations or after a given maximum number N_max of iterations. When the evolution stops, the filtered alphabet A_f = Γ̃ is returned, where Γ̃ is the symbols subset corresponding to the fittest individual E_Γ̃.
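A minimal sketch (ours) of the encoding of Eq. (12) and the fitness of Eq. (13) is given below, building on the `objective` function sketched above; the evolutionary loop itself (selection, crossover, mutation, elitism) is standard and omitted.

```python
import random

def random_individual(n):
    """Genetic code E_Gamma of Eq. (12): bit i selects symbol Omega_i."""
    return [random.randint(0, 1) for _ in range(n)]

def decode(code, extracted_alphabet):
    """Map a genetic code back to the corresponding symbol subset Gamma."""
    return [sym for bit, sym in zip(code, extracted_alphabet) if bit]

def fitness(code, extracted_alphabet, validation, delta, lam):
    """Eq. (13): F(E_Gamma) = 1 - G_V(Gamma), with `objective` from above."""
    gamma = decode(code, extracted_alphabet)
    return 1.0 - objective(gamma, len(extracted_alphabet),
                           validation, delta, lam)
```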

4 TESTS AND RESULTS

In this section, we present results from different experiments that we designed to test the effectiveness and performance of FRL-GRADIS in facing problems with varying complexity.

4.1 Data Generation

We tested the capabilities of FRL-GRADIS on synthetic sequence databases composed of textual strings. For this reason, the domain of the problem is the English alphabet

D = {A, B, C, ..., Z}.

Modelled noise consists of random character insertions, deletions and substitutions in the original string. For this reason a natural choice of dissimilarity measure between sequences is the Levenshtein distance, which measures the minimum number of edit steps necessary to transform one string of characters into another. The dataset generator works as follows (a sketch of this procedure is given after the list):

1. the true symbols alphabet A_t is generated. This alphabet consists of N_sym symbols with lengths normally distributed around the mean value L_sym. Each character is chosen in D with uniform probability and repeated characters are allowed;

2. a training SDB T and a validation SDB V, respectively composed of N_tr and N_val sequences, are generated. Each of these sequences is built by concatenating N_symseq symbols chosen randomly from A_t. Notice that generally N_symseq > N_sym, so there will be repeated symbols;

3. in each sequence, every symbol is subject to noise with probability µ. The application of noise to a symbol in a sequence corresponds to the deletion, substitution or insertion of one character in that single instance of the symbol. This kind of noise is referred to as intra-pattern noise;

4. a user-defined quantity of random characters is added between instances of symbols in each sequence. This noise is called inter-pattern noise. Such quantity depends on the parameter η, which corresponds to the ratio between the number of characters belonging to actual symbols and the total number of characters of the sequence after the application of inter-pattern noise, that is,

   η = (# symbol characters) / (# total characters).

   Notice that the amount of inter-pattern noise is inversely proportional to the value of η.

The generated datasets T and V are then ready to be used as input of the FRL-GRADIS procedure. Notice that the true alphabet A_t is unknown in real-world applications and here is used only to quantify the performance of the algorithm.
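As anticipated above, the four steps condense into the following sketch (ours, slightly simplified: symbol lengths are drawn with an assumed unit standard deviation, and inter-pattern characters are injected uniformly at random until the target η is reached).

```python
import random
import string

# Sketch (ours) of the synthetic data generator, steps 1-4 above.

def make_alphabet(n_sym, l_sym, sd=1.0):
    """Step 1: true symbols, lengths ~ Normal(l_sym, sd), characters uniform."""
    return [''.join(random.choices(string.ascii_uppercase,
                                   k=max(1, round(random.gauss(l_sym, sd)))))
            for _ in range(n_sym)]

def corrupt(sym):
    """Step 3 (intra-pattern noise): delete, substitute or insert one character."""
    i = random.randrange(len(sym))
    op = random.choice('dsi')
    if op == 'd':
        return sym[:i] + sym[i + 1:]
    if op == 's':
        return sym[:i] + random.choice(string.ascii_uppercase) + sym[i + 1:]
    return sym[:i] + random.choice(string.ascii_uppercase) + sym[i:]

def make_sequence(alphabet, n_symseq, mu, eta):
    """Steps 2-4: concatenate symbols, corrupt each with probability mu,
    then inject random characters so that eta ~ (#symbol chars)/(#total chars)."""
    parts = [corrupt(s) if random.random() < mu else s
             for s in random.choices(alphabet, k=n_symseq)]
    n_sym_chars = sum(len(p) for p in parts)
    n_noise = max(0, round(n_sym_chars / eta) - n_sym_chars)
    for _ in range(n_noise):
        parts.insert(random.randrange(len(parts) + 1),
                     random.choice(string.ascii_uppercase))
    return ''.join(parts)

A_t = make_alphabet(n_sym=5, l_sym=6)
T = [make_sequence(A_t, n_symseq=10, mu=0.5, eta=0.8) for _ in range(50)]
V = [make_sequence(A_t, n_symseq=10, mu=0.5, eta=0.8) for _ in range(25)]
```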


4.2 Quality Measures

We now introduce the quality measures used in the following tests to evaluate the mining capabilities of the FRL-GRADIS algorithm. These measures are computed for the resulting alphabets obtained from both the extraction and the filtering steps presented in Section 3, in order to highlight the improvement made by the filtering procedure (i.e. the improvement of FRL-GRADIS over RL-GRADIS).

The redundance R corresponds to the ratio between the cardinality of the alphabet A and that of the true alphabet A_t, that is,

$$R = \frac{|A|}{|A_t|}. \quad (14)$$

Clearly, since the filtering step selects a subset A_f (the filtered alphabet) of the extracted alphabet A_e, we always have that R_f < R_e. The redundance measures the amount of unnecessary symbols that are found by a frequent pattern mining procedure and it ranges from zero to infinity. When R > 1 some redundant symbols have been erroneously included in the alphabet, while when R < 1 some have been missed, the ideal value being R = 1.

It is important to notice that the redundance depends only on the number of symbols reconstructed, but not on their similarity with respect to the original alphabet. For this purpose we also introduce the mining error E, defined as the mean distance between each symbol Ω_i of the true alphabet A_t and its best match within the alphabet A, where the best match means the symbol with the least distance from Ω_i. In other words, considering A_t = {Ω_1, ..., Ω_|A_t|} and A = {Ω̃_1, ..., Ω̃_|A|}, the mining error corresponds to

$$E = \frac{\sum_i d(\Omega_i, \tilde{\Omega}(i))}{|A_t|}, \quad (15)$$

where

$$\tilde{\Omega}(i) = \operatorname*{argmin}_{\tilde{\Omega} \in A} d(\Omega_i, \tilde{\Omega}).$$

This quantity has the opposite role of the redundance: it keeps track of the general accuracy of reconstruction of the true symbols regardless of the generated alphabet size. It assumes non-negative values and the ideal value is 0. For the same reasons stated above, the inequality E_f ≥ E_e holds, so the extraction procedure's mining error constitutes a lower bound for the mining error obtainable with the filtering step.
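Both measures are straightforward to compute once the true and mined alphabets are available; a short sketch (ours), with `levenshtein` from the Section 2.1 sketch standing in for the distance d(·,·):

```python
def redundance(mined, true_alphabet):
    """Eq. (14): |A| / |A_t|; the ideal value is 1."""
    return len(mined) / len(true_alphabet)

def mining_error(mined, true_alphabet):
    """Eq. (15): mean distance between each true symbol and its best
    match in the mined alphabet; the ideal value is 0."""
    return sum(min(levenshtein(t, m) for m in mined)
               for t in true_alphabet) / len(true_alphabet)
```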

Table 1: Fixed parameters adopted for the tests. The parameter δ corresponds to the tolerance of the Levenshtein distance considered when calculating the coverage as in Eq. (1), while λ weighs the two terms of the objective function of Eq. (8). The values shown in the second part of the table refer to the genetic algorithm's parameters. N_pop corresponds to the population size, N_elite is the fraction of individuals who are guaranteed to survive and be copied to the new population in each iteration, p_cross and p_mut are respectively the crossover and mutation probabilities. The evolution terminates if N_max iterations have been performed or if for a number N_stall of iterations the maximum fitness has not changed.

    Parameter   Value     Parameter   Value
    N_tr        50        N_val       25
    N_sym       5         N_symseq    10
    l_min       4         l_max       12
    δ           1         λ           0.5
    N_pop       100       N_elite     0.1
    p_cross     0.8       p_mut       0.3
    N_max       100       N_stall     50

4.3 Results

We executed the algorithm multiple times for different values of the noise parameters, to assess the response of FRL-GRADIS to increasing amounts of noise. Most parameters have been held fixed for all the tests and they are listed in Table 1.

We present the results of tests performed with µ = 0.5 and variable amounts of inter-pattern noise η. This means that about half of the symbols in a sequence are subject to the alteration of one character, and increasing amounts of random characters are added between symbols in each sequence. The results obtained with this configuration are shown in Figs. 4 and 5.

Figure 4: Plot of the redundance R of the extraction (RL-GRADIS) and filtering (FRL-GRADIS) steps for µ = 0.5.

The redundance plot in Fig. 4 shows an apparently paradoxical trend of the extraction procedure's redundance: with decreasing amounts of inter-pattern noise (i.e. increasing values of η) the extraction algorithm performs more poorly, leading to higher redundances. This can be easily explained by recalling how the clustering procedure works. Higher amounts of inter-pattern noise mean that the frequent symbols are more likely to be separated by random strings of characters. These strings of uncorrelated characters generate very sparse clusters with negligible cardinality that are very likely to be deleted during the clustering's reinforcement step. Clusters corresponding to actual symbols, instead, are more active and compact, their bounds being clearly defined by the noise characters, and so they are more likely to survive the reinforcement step.


Figure 5: Plot of the mining error E of the extraction (RL-GRADIS) and filtering (FRL-GRADIS) steps for µ = 0.5.

In case of negligible (or non-existent) inter-pattern noise, instead, different symbols are more likely to occur in frequent successions, which causes the generation of many clusters corresponding to spurious symbols, obtained from the concatenation of parts of different symbols. The filtering procedure overcomes this inconvenience, and it can be seen from Fig. 4 that it is nearly unaffected by the amount of inter-pattern noise. As is evident, the filtering procedure becomes fundamental for higher values of the parameter η, where the clustering produces highly redundant alphabets that would be infeasible to handle in a real-world application. Fig. 5 shows that the mining error after the filtering procedure remains mostly the same for all values of η, which means that the system is robust to moderate alteration of the input signal. In general, we can conclude that the system allows for a remarkable synthesis of the extracted alphabet at the cost of a modest additional mining error.

5 CONCLUSIONS

In this work we have presented a new approach to sequence data mining, focused on improving the interpretability of the frequent patterns found in the data. For this reason, we employed a two-step procedure composed of a clustering algorithm, which extracts the frequent subsequences in a sequence database, and a genetic algorithm that filters the returned set to retrieve a smaller set of patterns that best describes the input data. For this purpose we introduced the concept of coverage, which helps in recognizing the true presence of a pattern within a sequence affected by noise. The results show a good overall performance and lay the foundations for further tests and improvements.

REFERENCES

Agrawal, R. and Srikant, R. (1995). Mining sequential patterns. In Data Engineering, 1995. Proceedings of the Eleventh International Conference on, pages 3-14. IEEE.

Bargiela, A. and Pedrycz, W. (2003). Granular computing: an introduction. Springer.

Ji, X. and Bailey, J. (2007). An efficient technique for mining approximately frequent substring patterns. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pages 325-330. IEEE.

Pavesi, G., Mereghetti, P., Mauri, G., and Pesole, G. (2004). Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research, 32(suppl 2):W199-W203.

Rizzi, A., Del Vescovo, G., Livi, L., and Frattale Mascioli, F. M. (2012). A new granular computing approach for sequences representation and classification. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 2268-2275.

Rizzi, A., Possemato, F., Livi, L., Sebastiani, A., Giuliani, A., and Mascioli, F. M. F. (2013). A dissimilarity-based classifier for generalized sequences by a granular computing approach. In IJCNN, pages 1-8. IEEE.

Sinha, S. and Tompa, M. (2003). YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research, 31(13):3586-3588.

Yan, X., Han, J., and Afshar, R. (2003). CloSpan: mining closed sequential patterns in large datasets. In Proceedings of SIAM International Conference on Data Mining, pages 166-177. SIAM.

Zaki, M. J. (2001). SPADE: an efficient algorithm for mining frequent sequences. Machine Learning, 42(1-2):31-60.

Zhu, F., Yan, X., Han, J., and Yu, P. S. (2007). Efficient discovery of frequent approximate sequential patterns. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 751-756. IEEE.
