Formal Concept Analysis Based Association Rules Extraction
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 4, No. 2, July 2011 ISSN (Online): 1694-0814 www.IJCSI.org 490

Ben Boubaker Saidi Ourida1 and Tebourski Wafa2

1 Computer Science Department, High Institute of Management, University of Tunis, Bouchoucha, Bardo-Tunis, Tunisie
2 Computer Science Department, High Institute of Management, University of Tunis, Bouchoucha, Bardo-Tunis, Tunisie

ABSTRACT

Generating a huge number of association rules reduces their utility in the decision-making process carried out by domain experts. In this context, based on the theory of Formal Concept Analysis, we propose to extend the notion of formal concept by generalizing the notion of itemset: the itemset is considered as an intent, its support as the cardinality of the extent, and its relevance is related to the confidence of the rule. Accordingly, we propose a new approach to extract interesting itemsets through the concept coverage. This approach uses a new quality criterion for a rule, the relevance, which brings a semantic added value to the formal concept analysis approach to discovering association rules.

KEYWORDS

Association rules, formal concept analysis, quality measure.

1. INTRODUCTION

Given the density of information and the mass accumulation of data, it has become crucial to explore this information in order to extract meaningful knowledge.

A "Concept" is a couple of intent and extent that aims to represent nuggets of knowledge. Recently, researchers have been striving to build theoretical foundations for data mining based on Formal Concept Analysis [1,2,3]. Several interesting proposals have appeared, related to association rules [4].

In this paper, we introduce a novel approach to association rule mining based on Formal Concept Analysis.

The remainder of the paper is organized as follows. Section 2 outlines the association rules derivation problem. Section 3 introduces the mathematical background of FCA and its connection with the derivation of association rule bases. Section 4 presents a heuristic algorithm to calculate the optimal itemsets from rows of data. Section 5 describes the results of the experimental study. Illustrative examples are given throughout the paper. Section 6 concludes the paper and points out future research directions.

2. ASSOCIATION RULES DERIVATION

Commonly, the number of generated association rules grows exponentially with the number of data rows and attributes. It can reach hundreds of thousands of rules from only a few thousand data rows, so comprehending and interpreting them becomes a hard task. To remedy this problem, several methods have been proposed [17].

The thousands and even millions of rules commonly generated, among which many are redundant (Bastide et al., 2000; Stumme et al., 2001; Zaki, 2004) [5, 6, 7, 8], encouraged the proposal of more discriminating techniques to reduce the number of reported rules.

This pruning can be based on patterns defined by the user (user-defined templates) or on Boolean operators (Meo et al., 1996; Ng et al., 1998; Ohsaki et al., 2004; Srikant et al., 1997) [9,10,11,12]. The number of rules can also be greatly reduced through pruning based on additional information, namely the taxonomy of items (Han & Fu, 1995), or on a specific metric of interest (Brin et al., 1997) (e.g., Pearson's correlation or the χ2-test) [13, 14]. More advanced techniques produce only a limited number of rules from the entire set, without loss of information; these are called generic bases [20]. The generation of such generic bases draws heavily on a battery of results provided by formal concept analysis (FCA) (Ganter & Wille, 1999) [15].

Basically, the pruning of association rules relies on two criteria: the frequency of the generated pattern, by discarding all itemsets whose support is less than MinSup, and the strength of the dependency between premise and conclusion, by pruning all rules whose confidence is less than MinConf.

To prune the extracted association rules effectively, some authors [16] introduce other measures. Bayardo et al. propose the conviction measure. Moreover, Cherfi et al. [17] suggest five different measures, such as the benefit (interest) and the satisfaction. Maddouri et al. provide the gain measure [18].

In this paper, we introduce a new measure: the relevance. It is grounded in Formal Concept Analysis [19, 20]. Assuming that an itemset is completely represented by a formal concept, as a couple of intent (the classic itemset) and extent (its support), the relevance combines the support of the rule with the length of the itemset. Thus, we propose to include a semantic aspect in association rules extraction by taking the confidence measure into account when frequent itemsets are selected during association rules generation.

3. MATHEMATICAL BACKGROUND

We recall some crucial results from the Galois lattice-based paradigm in FCA and its interesting applications to association rules extraction.

3.1. Preliminary notions

In the remainder of the paper, we use the theoretical framework presented in (Bastide et al., 2000). Let O be a set of objects, P a set of properties and R a binary relation defined between O and P [19, 20].

TABLE 1. FORMAL CONTEXT

O \ I   A  B  C  D
o1      1  1  0  0
o2      1  1  0  0
o3      0  1  1  0
o4      0  1  1  1
o5      0  0  1  1

Definition 1 [19]: A formal context (O, P, R) consists of two sets O and P and a relation R between O and P. The elements of O are called the objects and the elements of P are called the properties of the context. In order to express that an object o is in a relation R with a property p, we write oRp or $(o, p) \in R$ and read it as "the object o has the property p".

Definition 2 [19]: For a set $A \subseteq O$ of objects and a set $B \subseteq P$ of properties, we define:

The set of properties common to the objects in A:
$A' = \{p \in P \mid oRp \text{ for all } o \in A\}$

The set of objects which have all properties in B:
$B' = \{o \in O \mid oRp \text{ for all } p \in B\}$

The couple of operators $(', ')$ is a Galois connection.

Definition 3 [19]: A formal concept of the context (O, P, R) is a pair (A, B) with $A \subseteq O$, $B \subseteq P$, $A' = B$ and $B' = A$. We call A the extent and B the intent of the concept (A, B).

Definition 4 [19]: The set of all concepts of the context (O, P, R) is denoted by $\mathfrak{B}(O, P, R)$. An ordering relation ($\ll$) is easily defined on this set of concepts by:
$(A_1, B_1) \ll (A_2, B_2) \Leftrightarrow A_1 \subseteq A_2 \Leftrightarrow B_2 \subseteq B_1$.

FIGURE 1. CONCEPT LATTICE OF THE CONTEXT (O, P, R)

We recall the basic theorem on concept lattices [19]: $(\mathfrak{B}(O, P, R), \ll)$ is a complete lattice. It is called the concept lattice or Galois lattice of (O, P, R), for which infimum and supremum can be described as follows:

$\sup_{i \in I} (A_i, B_i) = \left( \left( \bigcup_{i \in I} A_i \right)'', \bigcap_{i \in I} B_i \right)$

$\inf_{i \in I} (A_i, B_i) = \left( \bigcap_{i \in I} A_i, \left( \bigcup_{i \in I} B_i \right)'' \right)$

Example [18]: Table 1 illustrates the notion of formal context (O, P, R). The latter is composed of five objects {o1, o2, o3, o4, o5} and four properties {A, B, C, D}. The concept lattice of this context, drawn in Figure 1, contains eight formal concepts.

Definition 5 [19]: Let (o, p) be a couple in the context (O, P, R). The pseudo-concept PC containing the couple (o, p) is the union of all the formal concepts containing (o, p).

Definition 6 [20]: A coverage of a context (O, P, R) is defined as a set of formal concepts CV = {RE1, RE2, ..., REn} in (O, P, R), such that any couple (o, p) in the context (O, P, R) is included in at least one concept of CV.

FIGURE 2. ILLUSTRATIVE EXAMPLE OF PSEUDO-CONCEPT, OPTIMAL CONCEPT, AND NON OPTIMAL CONCEPT CONTAINING THE COUPLE (o3, B). Panels: a. Pseudo-concept of (o3, B); b. Optimal concept of (o3, B); c. Non optimal concept of (o3, B). Each panel is drawn on the sub-context of objects o1, o2, o3 and properties A, B, C.

Example [18]: Considering the formal context (O, P, R) depicted in Table 1, Figure 2.a represents the pseudo-concept containing the couple (o3, B), which is the union of the concepts FC2 and FC5. A coverage of the context is formed by the three concepts {FC4, FC5, FC6}, where:

- FC4 is the concept ({o1, o2}, {A, B});
- FC5 is the concept ({o3, o4}, {B, C});
- FC6 is the concept ({o4, o5}, {C, D}).

The lattice itself also constitutes a concept coverage.

4. DISCOVERY OF OPTIMAL ITEM-SETS

The most expensive step in deriving association rules is the computation of the frequent itemsets [4]. Indeed, this step consists of iteratively applying some heuristics to calculate candidate itemsets.

- Width of a concept FCi: the number of objects in the extent Ai of the concept.
- Conf of a concept FCi: the maximum confidence of the set of rules generated from the concept FCi.
- Relevance of a concept: is a function of the width, the length and the
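The notions recalled in Section 3 can be checked on the context of Table 1 with a short Python sketch. It is only an illustration under stated assumptions and is not the heuristic algorithm of Section 4: the names `CONTEXT`, `common_properties`, `common_objects`, `all_concepts`, `is_coverage` and `width` are hypothetical, and the concepts are enumerated by brute force over all subsets of properties, which is only practical for very small contexts.

```python
from itertools import combinations

# Formal context of Table 1: each object mapped to the properties it has.
CONTEXT = {
    "o1": {"A", "B"},
    "o2": {"A", "B"},
    "o3": {"B", "C"},
    "o4": {"B", "C", "D"},
    "o5": {"C", "D"},
}
PROPERTIES = {"A", "B", "C", "D"}


def common_properties(objects):
    """A': properties shared by every object in `objects` (Definition 2)."""
    result = set(PROPERTIES)
    for o in objects:
        result &= CONTEXT[o]
    return frozenset(result)


def common_objects(properties):
    """B': objects that have every property in `properties` (Definition 2)."""
    wanted = set(properties)
    return frozenset(o for o, ps in CONTEXT.items() if wanted <= ps)


def all_concepts():
    """Enumerate the pairs (A, B) with A' = B and B' = A (Definition 3).

    Brute force (an assumption of this sketch): close every subset of
    properties and collect the distinct (extent, intent) pairs.
    """
    concepts = set()
    props = sorted(PROPERTIES)
    for size in range(len(props) + 1):
        for subset in combinations(props, size):
            extent = common_objects(subset)
            intent = common_properties(extent)
            concepts.add((extent, intent))
    return concepts


def is_coverage(concepts):
    """Definition 6: every couple (o, p) of R lies in at least one concept."""
    couples = {(o, p) for o, ps in CONTEXT.items() for p in ps}
    covered = {(o, p) for extent, intent in concepts
               for o in extent for p in intent}
    return couples <= covered


def width(concept):
    """Width of a concept: number of objects in its extent."""
    extent, _intent = concept
    return len(extent)


CONCEPTS = all_concepts()
FC4 = (frozenset({"o1", "o2"}), frozenset({"A", "B"}))
FC5 = (frozenset({"o3", "o4"}), frozenset({"B", "C"}))
FC6 = (frozenset({"o4", "o5"}), frozenset({"C", "D"}))

if __name__ == "__main__":
    print(len(CONCEPTS))                 # number of formal concepts
    print(is_coverage([FC4, FC5, FC6]))  # coverage check of the example
```

On this context the enumeration yields the eight formal concepts mentioned in the example, and {FC4, FC5, FC6} passes the coverage check, whereas dropping FC6 leaves the couples of o5 uncovered.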