An Update Algorithm for Restricted Random Walk Clusters
DISSERTATION approved by the Fakultät für Wirtschaftswissenschaften of the Universität Fridericiana zu Karlsruhe for the award of the academic degree of Doktor der Wirtschaftswissenschaften (Dr. rer. pol.)

by Dipl.-Inform.Wirt. Markus Franke

Date of the oral examination: 28.02.2007
Referent: Prof. Dr. Andreas Geyer-Schulz
Korreferent: Prof. Dr. Karl-Heinz Waldmann
Karlsruhe: 24.01.2007

Contents

1 Motivation
  1.1 Definitions
2 Related Work
  2.1 Stochastic Processes
    2.1.1 Random Walks
    2.1.2 The Restricted Random Walk Used for Clustering
    2.1.3 Other Concepts of "Restricted Random Walks"
    2.1.4 Markov Chains
  2.2 Random Graphs
  2.3 Cluster Algorithms
    2.3.1 k-Means Clustering
    2.3.2 Single Pass Clustering
    2.3.3 Hierarchical Agglomerative Clustering
  2.4 Dynamic Clustering
    2.4.1 Integration of New Objects into an Existing Clustering
    2.4.2 Handling Changing Similarities
    2.4.3 Mobile Scenarios
  2.5 Randomized Algorithms
    2.5.1 Randomized k-Clustering
    2.5.2 Evolutionary Clustering
    2.5.3 Clustering with Simulated Annealing
    2.5.4 Random Walk Circuit Clustering
    2.5.5 Randomized Generation of Hierarchical Clusters in Ad Hoc Networks
    2.5.6 High-Speed Noun Clustering by Hash Functions
  2.6 Random Walk Theory and Markov Processes for Clustering
    2.6.1 Clustering by Markovian Relaxation
    2.6.2 Text Classification with Random Walks
    2.6.3 Clustering by Flow Simulation
    2.6.4 Separation by Neighborhood Similarity and Circular Escape
  2.7 Summary
3 Restricted Random Walk Clustering
  3.1 RRW Clustering
    3.1.1 The Walks
    3.1.2 Cluster Construction Methods
  3.2 Properties of the Method
    3.2.1 Complexity
    3.2.2 Asymptotic Behavior of the Walk Process
  3.3 Excursus: Applications
    3.3.1 Library Usage Histories
    3.3.2 Indexing Library Corpora
    3.3.3 Giving Recommendations
  3.4 Summary
4 Updating Restricted Random Walk Clusters
  4.1 The Basic Update Cases
    4.1.1 New Paths
    4.1.2 Illegal Successors
    4.1.3 New Paths and Illegal Successors
    4.1.4 New Nodes
    4.1.5 Deletion of Nodes
  4.2 Concurrent Updates of the Similarity Matrix
  4.3 Complexity of the Update Procedure
    4.3.1 New Successors
    4.3.2 Illegal Successors
    4.3.3 Cluster Construction
    4.3.4 Simulation
  4.4 Evaluation
5 Conclusion and Outlook
A A Sample Raw Basket
B The Deep South Raw Data

Chapter 1

Motivation

Being able to quickly evaluate newly arriving data or information is a crucial capability in today's economy. But this evaluation may be difficult, especially if the new data cannot be interpreted in isolation but rather is added to a database, where it may lead to the update of preexisting records or the creation of new ones. An excellent example of such a situation is a purchase database maintained by a retailer: at regular intervals, new purchases are integrated into the database, which, in turn, may influence the marketing and product portfolio decisions made by the retailer.

In a static scenario, cluster algorithms are used, among other methods, to reduce the complexity of such a data set and to identify, for instance, groups of customers with similar purchase histories. For large data sets, the execution of a cluster algorithm may take considerable time and computing resources [Vie97]. These time and resource requirements may be acceptable for static information, when the data set only needs to be clustered once. But what if the data is updated frequently?
In most cases, reclustering the whole data set is both impractical and uneconomical, especially for small updates. Since large parts of the data set are unchanged by the update, valuable time is lost while the cluster algorithm recomputes clusters that were not affected by it.

This problem constitutes the motivation for this thesis. In earlier contributions [Fra03, FGS04, FGS05, FT05], the restricted random walk (RRW) cluster algorithm developed by Schöll and Schöll-Paschinger [SP02] was evaluated on a large data set of library purchase histories and found to work well for that data set in static scenarios, both in terms of the quality of the clusters and of the computation time required. The challenge for this work is the integration of new data into the cluster structure with minimal computational effort. As a further condition, the cluster quality should remain the same whether the new data is integrated using the update algorithm or by reclustering the complete data set.

New data in this context can mean one of three different things:

1. New objects may enter the data set, along with information about their similarity or distance to other, existing objects.

2. An object may change its similarity or distance to other objects, either by physically moving to another location or by changing the characteristics that determine its similarity to other objects.

3. An object may be removed from the set.

We will see that, contrary to the other methods reviewed in chapter 2, the update procedure proposed here is able to handle all three cases within reasonable computational time while maintaining the cluster quality.

This thesis is structured as follows: in the remainder of the first chapter, the terms used in this work are defined. Chapter 2 contains an overview of the current state of research in the area of stochastic processes and cluster algorithms.
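The three update cases operate on the underlying similarity data. As a minimal illustration only (this is not the RRW update algorithm of chapter 4; the class and method names are hypothetical), the cases can be sketched over a pairwise similarity store in Python:

```python
from dataclasses import dataclass, field


@dataclass
class SimilarityData:
    """Toy container for an object set with pairwise similarities."""
    sim: dict = field(default_factory=dict)  # (a, b) -> similarity

    def _key(self, a, b):
        # Store each unordered pair under a canonical key.
        return (a, b) if a <= b else (b, a)

    def add_object(self, obj, similarities):
        # Case 1: a new object arrives with similarities to existing objects.
        for other, s in similarities.items():
            self.sim[self._key(obj, other)] = s

    def change_similarity(self, a, b, s):
        # Case 2: an existing pair changes its similarity.
        self.sim[self._key(a, b)] = s

    def remove_object(self, obj):
        # Case 3: an object leaves the set; drop all pairs involving it.
        self.sim = {k: v for k, v in self.sim.items() if obj not in k}


data = SimilarityData()
data.add_object("a", {})
data.add_object("b", {"a": 0.9})
data.add_object("c", {"a": 0.1, "b": 0.2})
data.change_similarity("a", "b", 0.8)
data.remove_object("c")
# data.sim is now {("a", "b"): 0.8}
```

An update algorithm then has to propagate such changes to the cluster structure without recomputing the unaffected clusters.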
In chapter 3, the RRW clustering method is introduced as a Markov chain on a similarity or distance graph. In addition, the chapter presents applications for which the algorithm has been successfully used, as well as some considerations about the algorithm's complexity and the asymptotic behavior of the walk process.

The core of this thesis, the update algorithm, is developed in chapter 4, together with a proof of its correctness and a comparison with the algorithm classes introduced in chapter 2. Chapter 5 concludes the thesis with an outlook.

1.1 Definitions

A very fitting definition of cluster analysis has been given by Kaufman and Rousseeuw [KR90]: "Cluster analysis is the art of finding groups in data." Although this view of cluster analysis may seem informal, the informality is justified. There exists, to the practitioner's regret, no "best" cluster algorithm that copes with all applications on all data sets equally well. Rather, what constitutes a good cluster depends on the characteristics of the data set as well as on the requirements of the concrete application at hand; consequently, a plethora of criteria [HK01, HBV02b, HBV02a, PL02, War63, Wat81] has been proposed to measure the quality of different cluster algorithms.

As we are not able to give an exact global definition, let us at least consider a sort of least common denominator. A cluster, according to the Oxford Advanced Learner's Dictionary [Cow89, p. 215], is a

1 number of things of the same kind growing closely together: a cluster of berries, flowers, curls ◦ ivy growing in thick clusters. 2 number of people, animals or things grouped closely together: a cluster of houses, spectators, bees, islands, diamonds, stars ◦ a consonant cluster, e.g. str in strong.

Figure 1.1: Cluster shapes: (a) elongated, (b) compact, (c) ring, (d) sickle
In the context of this work, the following definition (modified from [Fra03]) is used:

Definition 1.1.1 (cluster) A cluster is a set of objects that are either (a) similar to each other or (b) close, in the sense of some metric, to each other. In particular, objects in one cluster should be more similar to each other (intra-cluster homogeneity) than to objects in any other cluster (inter-cluster heterogeneity).

In this context, the intra-cluster criterion can be based, for example, on the diameter, the radius (cf. page 25), the variance (cf. page 53), or the sum of squared errors, i.e. the variance multiplied by the number of elements in the cluster. The actual quality of a clustering can consequently be given as a function of this criterion. However, the definition does not make any assumptions concerning the actual shape of a cluster. In the literature (e.g. [KR90]), the shapes depicted in Fig. 1.1 are often considered as standard cases. Depending on the shape of the clusters, the performance of different cluster algorithms, in terms of the chosen quality criterion, may vary considerably. For instance, there are algorithms like k-means clustering that are especially fit to detect spherical clusters like the ones in case (b), and that thus cannot cope well with elongated clusters like case (a) in Fig. 1.1. Others, like single linkage clustering, suffer from bridging: if there exists a "bridge" of outliers between two clusters, the two clusters may be connected via this weak link, even though it does not accurately reflect the natural grouping.

In addition to a cluster, a clustering is defined as follows:

Definition 1.1.2 (clustering) A clustering is the result of the execution of a cluster algorithm on an object set X. It describes the assignment of objects to clusters.
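The intra-cluster criteria named above are simple to compute. A minimal sketch in Python, assuming one-dimensional points and a caller-supplied distance function (the function names are hypothetical, chosen to match the terms in the text):

```python
def diameter(points, dist):
    # Largest pairwise distance inside the cluster.
    return max(dist(p, q) for p in points for q in points)


def variance(points):
    # Mean squared distance of the points to the cluster centroid.
    c = sum(points) / len(points)
    return sum((p - c) ** 2 for p in points) / len(points)


def sum_of_squared_errors(points):
    # The variance multiplied by the number of elements, as in the text.
    return variance(points) * len(points)


cluster = [1.0, 2.0, 3.0]
d = diameter(cluster, lambda p, q: abs(p - q))  # 2.0
v = variance(cluster)                           # 2/3
sse = sum_of_squared_errors(cluster)            # 2.0
```

A quality function for a whole clustering would then aggregate such a criterion over all clusters, e.g. by summing the per-cluster values.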
X, the set of objects to be clustered, is said to be covered by the clustering C consisting of clusters C_i iff

X = ∪_{C_i ∈ C} C_i   (1.1)

Furthermore, a clustering created by a cluster algorithm can either be disjunctive or not, and it can be hierarchical or partitional:

Definition 1.1.3 (disjunctive clusters) We call a set of clusters disjunctive if each object belongs to exactly one cluster, i.e. if the clusters do not overlap. In this case, the following property holds:

∀ C_i, C_j ∈ C, i ≠ j : C_i ∩ C_j = ∅   (1.2)

If a cluster algorithm produces disjunctive clusters for every object set, it is also called disjunctive.
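Both properties can be checked mechanically. A minimal sketch in Python, representing a clustering as a list of sets (the function names are hypothetical):

```python
def is_covering(clustering, X):
    # Property (1.1): the union of all clusters equals the object set X.
    union = set().union(*clustering) if clustering else set()
    return union == set(X)


def is_disjunctive(clustering):
    # Property (1.2): distinct clusters do not overlap, so every
    # object belongs to exactly one cluster.
    seen = set()
    for cluster in clustering:
        if seen & cluster:
            return False
        seen |= cluster
    return True


C = [{"a", "b"}, {"c"}]
is_covering(C, {"a", "b", "c"})       # True: C covers X
is_disjunctive(C)                      # True: no overlap
is_disjunctive([{"a"}, {"a", "b"}])    # False: "a" is in two clusters
```

A disjunctive, covering clustering is exactly a partition of X; a non-disjunctive one allows overlapping clusters.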