A Logical Framework for Frequent Pattern Discovery in Spatial Data

Donato Malerba Floriana Esposito Francesca A. Lisi

Dipartimento di Informatica - Università degli Studi di Bari, via Orabona 4 - 70126 Bari
{malerba | esposito | lisi}@di.uniba.it

Abstract

In recent times, several extensions of data mining methods and techniques have been explored to deal with advanced databases. Many promising applications of inductive logic programming (ILP) to knowledge discovery in databases have also emerged in order to benefit from the semantics and inference rules of first-order logic. In this paper, an ILP framework for frequent pattern discovery in spatial data is presented. The pattern discovery algorithm operates on first-order logic descriptions computed by an initial step of feature extraction from a spatial database. The algorithm benefits from the available background knowledge on the spatial domain and systematically explores the hierarchical structure of task-relevant geographic layers. Preliminary results have been obtained by running the algorithm SPADA on spatial data from an Italian province.

1 Introduction

In recent times, several extensions of data mining methods and techniques have been explored to deal with advanced databases such as spatial databases, temporal databases, object-oriented databases and multimedia databases. Progress in spatial databases, such as spatial data structures (Güting 1994), spatial reasoning (Egenhofer 1991), computational geometry (Preparata and Shamos 1985), etc., paved the way for the study of knowledge discovery in spatial databases, which aims at the extraction of implicit knowledge, spatial relations, or other patterns not explicitly stored in spatial databases (Koperski, Adhikary and Han 1996). Generally speaking, a spatial pattern is a pattern showing the interaction of two or more spatial objects or space-depending attributes according to a particular spacing or set of arrangements (DeMers 2000). For instance, cities across nations are often clustered near lakes, oceans and streams. Actually such an arrangement reveals a spatial association, meaning that one spatial pattern is totally or partially related to some other spatial pattern. Furthermore, questions can be raised about the causes not only of single distributions but also of spatially correlated distributions of phenomena. For instance, we may explain that the tendency of cities to cluster near water bodies is driven by the need for sources of drinking water and recreation and for commerce facilities. Thus, once the language of geography has been acquired, the major tasks among geographers are to observe the relevant spatial features, to identify spatial patterns, to describe and quantify spatial associations and to elicit explanations for pattern interactions. With the advent of geographical information systems (GIS), advanced functionalities in spatial data mining such as frequent pattern discovery are of great interest to GIS users.

The design of algorithms for frequent pattern discovery has turned out to be a popular topic in data mining. This is not surprising given the relevance of data and patterns in the definition of data mining as a core step in the KDD process (Fayyad, Piatetsky-Shapiro and Smyth 1996). The blueprint for most algorithms proposed in the literature is the levelwise method by Mannila and Toivonen (1997), which is based on a breadth-first search in the lattice spanned by a generality order between patterns. The space is searched one level at a time, starting from the most general patterns and iterating between candidate generation and candidate evaluation phases. Frequent patterns are commonly not considered useful for presentation to the user as such. They can be efficiently post-processed into rules that exceed given threshold values. In the case of association rules the threshold values of support and confidence offer a natural way of pruning weak and rare rules (Agrawal and Srikant 1994).

In this paper, we propose a logical framework for frequent pattern discovery in spatial data. The main novelty with respect to previous contributions to spatial data mining (Koperski and Han 1995) is the expressive power of the language chosen for representing both data and patterns. Indeed, the research to date in the field has generally taken the path of merely embedding spatial constructs on the top of well-established statistical techniques in order to accommodate the space dimension (Roddick and Spiliopoulou 1999). We advocate the application of Inductive Logic Programming (ILP) methods and techniques (Lavrac and Dzeroski 1994) to knowledge discovery in spatial databases in order to benefit from the semantics and inference rules of first-order logic.

The paper is organized as follows. Section 2 will introduce the task of mining spatial association rules viewed as context for frequent pattern discovery in spatial data. In Section 3, representation, problem and algorithmic issues in the ILP approach to the task at hand will be discussed and illustrated by means of a sample task of frequent pattern discovery in data of an Italian province. Conclusions and future work are given in Section 4.

Copyright © 2000, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
From: FLAIRS-01 Proceedings. Copyright © 2001, AAAI (www.aaai.org). All rights reserved.

2 The mining task

The discovery of spatial association rules is a descriptive mining task aiming at the detection of associations between reference objects and some task-relevant objects, the former being the main subject of the description and the latter being spatial objects that are relevant for the task at hand and spatially related to the former. The discovery process may be activated by a user query expressed in a database mining query language such as

    MINE ASSOCIATIONS DESCRIBING "large_towns"
    WITH RESPECT TO topology(T.geo, R.geo), R.name, topology(T.geo, W.geo), W.name, topology(T.geo, B.geo), B.admin_region2
    FROM town T, road R, water W, boundary B
    WHERE T.type = "large" AND distance(T.geo, R.geo) < "5 km" AND distance(T.geo, W.geo) < "5 km" AND distance(T.geo, B.geo) < "30 km"

where large towns play the role of reference objects while roads, water bodies and boundaries play the role of geographic layers from which task-relevant objects are taken. Query processing involves massive spatial computation to extract spatial relations from the underlying spatial database. Some kind of taxonomic knowledge on task-relevant geographic layers may also be taken into account to get descriptions at different concept levels (multiple-level association rules). As usual in the problem setting of association rule mining, we search for associations with large support and high confidence (strong rules).

Formally, the problem can be stated as follows:
Given
• a spatial database SDB,
• a set of reference objects S,
• some task-relevant geographic layers R_k, [...]

[...] 1998). Anyway, no insight into the algorithmic issues has been provided. A proposal of a logical framework inspired by the work on mining association rules from multiple relations by Dehaspe and De Raedt (1997) is sketched in the following Section.

3 The logical framework

The basic idea in our proposal of a logical framework is that a spatial database boils down to a deductive relational database (DDB) once the spatial relationships between reference objects and task-relevant objects have been extracted. Indeed, DDBs define relations both extensionally as ground facts (extensional database, EDB) and intensionally as rules (intensional database, IDB). Thus, the expressive power of first-order logic in databases makes it possible to specify background knowledge (BK) such as spatial hierarchies, spatial constraints and rules for spatial qualitative reasoning.

3.1 Representation issues

Let L = {a_1, a_2, ..., a_t} be a set of Datalog atoms of the form p(t_1, ..., t_n), where each term t_j may be either a variable or a constant (Ceri, Gottlob and Tanca 1989). A conjunction of atoms is called an atomset. In our framework patterns are represented as atomsets. Since the ILP approach operates in the context of a DDB, we denote the DDB at hand by D(S) to mean that it is obtained by adding the spatial relations extracted from SDB for the set of reference objects S to the previously supplied BK. The tuples in D(S) can be grouped into distinct subsets: each group, uniquely identified by the corresponding reference object s∈S, is called a spatial observation and denoted O[s]. Actually, a spatial observation is multi-key, namely it contains not only spatial relations between the reference object s∈S and some task-relevant object r_j∈R_k but also spatial relations between r_j and some s'∈S. Thus, a spatial observation is given by

    O[s] = O[s|s] ∪ {O[r_j|s] | ∃ tuple θ∈D(S): θ(s, r_j)}
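As a rough illustration of the definition of O[s], the grouping of D(S) into spatial observations can be sketched in Python. The sketch assumes that D(S) is available as a set of binary ground facts encoded as (predicate, first argument, second argument) triples; the function name spatial_observation, the triple encoding and the explicit sets of reference and task-relevant objects are choices made for this example only, not part of the framework.

```python
from typing import Iterable, Set, Tuple

# Assumed encoding: a binary ground fact p(a, b) is the triple (p, a, b).
Fact = Tuple[str, str, str]


def spatial_observation(
    facts: Iterable[Fact],
    s: str,
    reference_objects: Set[str],
    task_objects: Set[str],
) -> Set[Fact]:
    """Collect O[s] from the ground facts of D(S) (illustrative sketch)."""
    facts = set(facts)
    # O[s|s]: every fact whose first argument is the reference object s.
    own = {f for f in facts if f[1] == s}
    # Task-relevant objects r_j for which some fact theta(s, r_j) exists.
    linked = {f[2] for f in own if f[2] in task_objects}
    observation = set(own)
    for pred, a, b in facts:
        if a in linked:
            # Facts describing r_j itself, e.g. is_a(ss16, road).
            observation.add((pred, a, b))
        elif b in linked and a in reference_objects and a != s:
            # Relations between r_j and another reference object s'.
            observation.add((pred, a, b))
    return observation


towns = {"barletta", "bari", "trani"}
layers = {"ss16", "ss96"}
d_s = {
    ("is_a", "barletta", "large_town"),
    ("intersects", "barletta", "ss16"),
    ("is_a", "ss16", "road"),
    ("intersects", "bari", "ss16"),
    ("intersects", "trani", "ss16"),
    ("is_a", "ss96", "road"),
    ("intersects", "bari", "ss96"),  # hypothetical fact, used only for this toy example
}

obs = spatial_observation(d_s, "barletta", towns, layers)
for fact in sorted(obs):
    print(fact)
```

In this toy database the facts about ss96 are left out of the observation, because no fact links barletta to ss96, while the facts relating bari and trani to ss16 are included, which is the multi-key behaviour described above.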

    spatial_hierarchy(road, 1, null, [road]).
    spatial_hierarchy(road, 2, road, [motorway, main_trunk_road, regional_road]).
    spatial_hierarchy(road, 3, motorway, [a14]).
    spatial_hierarchy(road, 3, main_trunk_road, [ss16, ss16bis, ss96, ss98, ss99, ss100]).
    spatial_hierarchy(road, 3, regional_road, [r16, r93, r97, r170, r171, r172, r271, r378]).
    spatial_hierarchy(water, 1, null, [water]).
    spatial_hierarchy(water, 2, water, [sea, river]).
    spatial_hierarchy(water, 3, sea, [adriatico]).
    spatial_hierarchy(water, 3, river, [, lacone]).
    spatial_hierarchy(boundary, 1, null, [boundary]).
    spatial_hierarchy(boundary, 2, boundary, [fg_boundary, ta_boundary, br_boundary, mr_boundary, pz_boundary]).
    is_a(X,Y) :- spatial_hierarchy(_, _, Y, Nodes), member(X, Nodes).
    is_a(X,Y) :- spatial_hierarchy(Root, _, Father, Nodes), member(X, Nodes), is_a(Father, Y).

Here, the is-a relationship is overloaded, namely it may stand for kind-of as well as for instance_of depending on the context.

[Figure 1. A spatial hierarchy for the layer of roads: road is the root; its children are main_trunk_road, motorway and regional_road, with leaves such as ss96, ss16, ss16bis, a14 and r93.]

Spatial relations between objects in S and objects in any of R_1, R_2 and R_3 are extracted by means of spatial computation and transformed into facts of the kind [...](RefObj, TaskRelevantObj) to be added to D(S). Spatial observations are portions of D(S) concerning a reference object. In our case, there are eleven distinct spatial observations, one for each large town. For instance, O[barletta] is given by the union of the following sets of ground facts:

    O[barletta | barletta]
    is_a(barletta, large_town).
    adjacent_to(barletta, adriatico).
    intersects(barletta, a14).
    intersects(barletta, ss16).
    intersects(barletta, ss16bis).
    intersects(barletta, r170).
    intersects(barletta, r193).
    close_to(barletta, fg_boundary).

    O[adriatico | barletta]
    is_a(adriatico, water).
    adjacent_to(bari, adriatico).
    adjacent_to(trani, adriatico).
    adjacent_to(molfetta, adriatico).
    adjacent_to(monopoli, adriatico).

    O[a14 | barletta]
    is_a(a14, road).
    intersects(bari, a14).
    intersects(trani, a14).
    intersects(bitonto, a14).
    intersects(gioia_del_colle, a14).
    intersects(molfetta, a14).

    O[ss16 | barletta]
    is_a(ss16, road).
    intersects(bari, ss16).
    intersects(trani, ss16).
    intersects(monopoli, ss16).
    intersects(molfetta, ss16).

    O[ss16bis | barletta]
    is_a(ss16bis, road).
    intersects(bari, ss16bis).
    intersects(trani, ss16bis).
    intersects(molfetta, ss16bis).

    O[r170 | barletta]
    is_a(r170, road).
    intersects(andria, r170).

    O[r193 | barletta]
    is_a(r193, road).

    O[fg_boundary | barletta]
    is_a(fg_boundary, boundary).
    adjacent_to(trani, fg_boundary).

By definition, the observation encompasses not only spatial relations between the reference object barletta∈S and task-relevant objects in R_1 (adriatico, etc.), R_2 (a14, etc.), R_3 (fg_boundary, etc.), but also spatial relations between each of these task-relevant objects and some other s'∈S such as adjacent_to(bari, adriatico), where bari∈S.

To the atomset C we assign an existentially quantified conjunctive formula eqc(C).

Definition (coverage) An atomset C covers an observation O[s] if eqc(C) is true in O[s] ∪ BK.

Example 2 Let us suppose that BK includes the rule

    diff(X,Y) :- X \= Y.

where \= is the ISO Prolog Standard built-in predicate for non-unifiability of two variables. The pattern

    C ≡ is_a(X, large_town), intersects(X,Y), intersects(Z,Y), diff(X,Z), is_a(Y, road)

covers the spatial observation O[barletta] shown in Example 1 because the corresponding existentially quantified conjunctive formula

    eqc(C) ≡ ∃ is_a(X, large_town) ∧ intersects(X,Y) ∧ intersects(Z,Y) ∧ diff(X,Z) ∧ is_a(Y, road)

is satisfied by O[barletta] ∪ BK. □

Definition Let O be the set of spatial observations in D(S) and O_C denote the subset of O containing the spatial observations covered by the atomset C. The support of C is defined as

    σ(C) = |O_C| / |O|

Definition A spatial association rule in D(S) is an implication of the form

    A → B (s%, c%),

where A ⊆ L, B ⊆ L, A ∩ B = ∅, and at least one atom in A ∪ B represents a spatial relationship. The percentages s% and c% are respectively called the support and the confidence of the rule, meaning that s% of spatial observations in D(S) are covered by A ∪ B and c% of spatial observations in D(S) that are covered by A are also covered by A ∪ B.

Definition The support and the confidence of a spatial association rule A → B are given by

    s = σ(A ∪ B) and c = φ(B|A) = σ(A ∪ B) / σ(A).

The frequency of a pattern depends on the level currently explored in the hierarchical structure of task-relevant geographic layers.

Definition Let minsup[l] and minconf[l] be two thresholds setting respectively the minimum support and the minimum confidence at level l in the spatial hierarchies. An atomset C is large (or frequent) at level l if σ(C) ≥ minsup[l] and all ancestors of C with respect to the taxonomies are large at their corresponding levels. The confidence of a spatial association rule A → B is high at level l if φ(B|A) ≥ minconf[l]. A spatial association rule A → B is strong at level l if the atomset A ∪ B is large and the confidence is high at level l.

3.2 Problem issues

Within the ILP approach, the problem of mining spatial association rules can be decomposed into four sub-problems:
1) Extract spatial relationships between reference objects and task-relevant objects
2) Represent each extracted relationship as an atom
3) Find large (or frequent) atomsets
4) Generate highly-confident spatial association rules
Both the problem statement and the problem solution are quite complicated since the spatial domain is inherently complex. The preliminary feature extraction step may be performed by the two-step spatial computation proposed by Koperski and Han (1995). Such pre-processing is necessary for saving computational effort both in time (on-line computation of spatial relations) and in space (materialization of spatial relations). The relations returned by the spatial computation are represented as facts to be inserted into D(S). A solution to the third sub-problem (frequent pattern discovery) is illustrated in the following Section. The sub-problem of generating highly-confident rules from frequent patterns is solved as usual in the problem setting of association rule mining (Agrawal and Srikant 1994).

3.3 Algorithmic issues

The algorithm SPADA (Spatial Pattern Discovery Algorithm) being proposed for frequent pattern discovery in spatial data implements the aforementioned levelwise method (see Figure 2). It can be considered as an extension of WARMR (Dehaspe and De Raedt 1997) to explore systematically the hierarchical structure of task-relevant geographic layers. The pattern space is structured according to θ-subsumption (Plotkin 1970). The candidate generation phase consists of a refinement step followed by a pruning step. The former applies a specialization operator under θ-subsumption to patterns previously found frequent by preserving the property of linkedness (Helft 1987). The latter involves verifying that candidate patterns do not θ-subsume any infrequent pattern. The candidate evaluation phase is performed by comparing the support of the candidate pattern with the minimum support threshold set for the level being explored. If the pattern turns out not to be a large one, it is rejected. As for the support count, the candidate is transformed into an existential query whose answer set supplies all the substitutions that make the pattern true in D(S). In particular, the number of different bindings for the variable which is the placeholder for reference objects is assumed as the absolute frequency of the pattern in D(S). It is noteworthy that the property of linkedness guarantees the equivalence between the absolute frequency of a pattern and the number of observations covered by the pattern. The support is obtained as the relative frequency of the pattern in D(S).

    Cycle on the level (l ≥ 1) of the spatial hierarchies
        Find large 1-atomsets at level l
        Cycle on the size (k > 1) of the atomsets
            Generate candidate k-atomsets at level l from large (k-1)-atomsets
            Generate large k-atomsets at level l from candidate k-atomsets
        until no more large atomsets are found.

Figure 2. The algorithm SPADA

A rough preliminary remark on the computational complexity of SPADA leads to the notorious trade-off between expressivity and efficiency in first-order representations. Indeed, it is well known that a simple matching of two expressions with commutative and associative operators (such as the logical OR of atoms in a clause) is NP-complete (Garey and Johnson 1979). Therefore, any known algorithm that checks the coverage of an atomset or equivalently that evaluates a query with respect to a relational database has an exponential complexity. Nevertheless, it has also been proved that queries with up to k atoms, where each atom contains at most j terms, can be evaluated in polynomial time (De Raedt and Dzeroski 1994). Whether these constraints are applicable to the domain of spatial data analysis is still under investigation.

Example 3 The algorithm SPADA has been run on the mining task in Example 1 with support thresholds minsup[1]=70%, minsup[2]=68%, and minsup[3]=50%. Some interesting patterns have been discovered. For instance, at level l=2 in the spatial hierarchies, the following candidate C:

    is_a(X, large_town), intersects(X,R), is_a(R, main_trunk_road), intersects(Y,R), diff(Y,X), is_a(Y, large_town)

has been generated after k=5 refinement steps and evaluated with respect to D(S) by means of the query:

    ?- is_a(X, large_town), intersects(X,R), is_a(R, main_trunk_road), intersects(Y,R), diff(Y,X), is_a(Y, large_town)

The answer set includes two substitutions, θ_1={X\barletta, R\ss16, Y\bari} and θ_2={X\barletta, R\ss16bis, Y\bari}. Therefore, the spatial observation O[barletta], shown in Example 2, is covered. However, while computing the support, the two substitutions count as only one because both refer to the same large town. Since ten of eleven spatial observations are covered and all the ancestor patterns are large at their level (l<2), the pattern is a large one at level l=2 with support 91%. For the sake of clarity, the following pattern discovered after k=5 refinement steps at level l=1

    is_a(X, large_town), intersects(X,R), is_a(R, road), intersects(Y,R), diff(Y,X), is_a(Y, large_town)

is one of the large ancestors for the pattern C. Such a way of taking the taxonomies into account during the pattern discovery process implements what we referred to as the systematic exploration of the hierarchical structure of task-relevant geographic layers. Furthermore, it is noteworthy that the use of variables and the addition of the atom diff(Y,Z) derived from the BK allow the algorithm to distinguish between multiple instances of the same class of spatial objects (e.g. the class large town).

During the transformation of frequent patterns into rules, the following strong rule (91% support, 91% confidence)

    is_a(X, large_town), is_a(Y, large_town), diff(Y,X) -> intersects(X,R), is_a(R, main_trunk_road), intersects(Y,R)

has been derived from the pattern C. It states that "Given that 91% of large towns intersect a main trunk road which in turn is intersected by another large town distinct from the previous one, 91% of pairs of distinct large towns are crossed by the same main trunk road". □

4 Conclusions and future work

A logical framework for pattern discovery in spatial data has been sketched. The sample task shows that the expressive power of first-order logic enables us to tackle applications that cannot be handled by the attribute-value (AV) approach. The work being presented in this paper is in partial fulfillment of the research objectives set by the project SPIN! (Spatial Mining for Data of Public Interest) funded by the European Union.

For the future, we plan to optimize and test the algorithm SPADA on real-world data sets. Besides the issues of efficiency and scalability that are of great interest to the data mining community, the issue of robustness (noise handling, for instance) will be faced. It is noteworthy that very few works have tackled this problem in data mining, generally because huge amounts of data to be mined are available. In this case, the presence of low levels of noise can be easily kept under control by tuning the two main parameters of the association rule mining algorithms, namely support and confidence. In spatial data mining, robustness has another facet. Indeed, while the discovery of association rules in transactions requires little transformation of stored data, the task of mining spatial association rules relies on a more complex data pre-processing which is error-prone. For instance, the generation of the predicates close_to or adjacent_to is based on the user-defined semantics of the closeness and adjacency relations, which should necessarily be approximated. Further work on the automated extraction of symbolic descriptions from vectorised maps is expected to give some hints on this issue.

Acknowledgements. This work is part of the European Commission Fifth Framework IST project no. 10536 (SPIN!) on spatial data mining for data of public interest. We would like to thank Willi Klösgen for his useful comments and Luigi Rubino for his help in implementing SPADA.

References

Agrawal, R.; and Srikant, R. 1994. Fast Algorithms for Mining Association Rules. In Proceedings of the Twentieth VLDB Conference, Santiago, Chile.
Ceri, S.; Gottlob, G.; and Tanca, L. 1989. What you Always Wanted to Know About Datalog (And Never Dared to Ask). IEEE Transactions on Knowledge and Data Engineering 1(1): 146-166.
Dehaspe, L.; and De Raedt, L. 1997. Mining Association Rules in Multiple Relations. In Lavrac, N.; and Dzeroski, S. (Eds.), Inductive Logic Programming, LNCS 1297, Springer-Verlag, 125-132.
DeMers, M.N. 2000. Fundamentals of Geographic Information Systems. 2nd ed., John Wiley & Sons.
De Raedt, L.; and Dzeroski, S. 1994. First Order jk-clausal Theories Are PAC-Learnable. Artificial Intelligence 70: 375-392.
Egenhofer, M.J. 1991. Reasoning about Binary Topological Relations. In Proceedings of the Second Symposium on Large Spatial Databases, Zurich, Switzerland, 143-160.
Fayyad, U.M.; Piatetsky-Shapiro, G.; and Smyth, P. 1996. From Data Mining to Knowledge Discovery: An Overview. In Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; and Uthurusamy, R. (Eds.), Advances in Knowledge Discovery in Databases, AAAI Press/The MIT Press, 1-34.
Garey, M.R.; and Johnson, D.S. 1979. Computers and Intractability. W.H. Freeman & Co, San Francisco, California.
Güting, R.H. 1994. An introduction to spatial database systems. VLDB Journal 3(4): 357-400.
Helft, N. 1987. Inductive generalization: a logical framework. In Bratko, I.; and Lavrac, N. (Eds.), Progress in Machine Learning, Sigma Press, 149-157.
Koperski, K.; Adhikary, J.; and Han, J. 1996. Spatial Data Mining: Progress and Challenges. In Proceedings of the Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada.
Koperski, K.; and Han, J. 1995. Discovery of Spatial Association Rules in Geographic Information Databases. In Proceedings of the International Symposium on Large Spatial Databases, 47-66.
Mannila, H.; and Toivonen, H. 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1(3): 259-289.
Popelinsky, L. 1998. Knowledge Discovery in Spatial Data by means of ILP. In Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, LNAI 1510, Springer-Verlag, 185-193.
Plotkin, G. 1970. A note on inductive generalization. Machine Intelligence 5: 153-163.
Preparata, F.; and Shamos, M. 1985. Computational Geometry: An Introduction. Springer-Verlag, New York.
Roddick, J.F.; and Spiliopoulou, M. 1999. A bibliography of temporal, spatial and spatio-temporal data mining research. SIGKDD Explorations 1(1): 34-38.
