Algorithms for Multi-Sample Cluster Analysis

University of Tennessee, Knoxville
TRACE: Tennessee Research and Creative Exchange
Doctoral Dissertations, Graduate School
8-2007

Fahad Almutairi, University of Tennessee - Knoxville

Part of the Management Sciences and Quantitative Methods Commons

Recommended Citation
Almutairi, Fahad, "Algorithms for Multi-Sample Cluster Analysis." PhD diss., University of Tennessee, 2007. https://trace.tennessee.edu/utk_graddiss/114

This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange.

To the Graduate Council:
I am submitting herewith a dissertation written by Fahad Almutairi entitled "Algorithms for Multi-Sample Cluster Analysis." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Management Science.

Kenneth C. Gilbert, Major Professor

We have read this dissertation and recommend its acceptance:
Hamparsum Bozdogan, Kenneth B. Kahn, Charles E. Noon

Accepted for the Council:
Carolyn R. Hodges, Vice Provost and Dean of the Graduate School

(Original signatures are on file with official student records.)

Algorithms for Multi-Sample Cluster Analysis

A Dissertation Presented for the Doctor of Philosophy Degree
The University of Tennessee, Knoxville

Fahad Almutairi
August 2007

Copyright © 2007 by Fahad F. Almutairi. All rights reserved.

Dedication

I am honored to dedicate my dissertation to my beloved country Kuwait.

Acknowledgment

I am thankful to Dr. H. Bozdogan for introducing me to the MSCA problem and for his comments and suggestions. I would like to thank Dr. C. Noon and Dr. K. Kahn for sharing their ideas and insights with me. I cannot thank my advisor, Dr. K. Gilbert, enough. This work would not have been possible without his help and support. Finally, I would like to thank the faculty, staff, and graduate students of the SOMS department for their kindness and professionalism.

Abstract

In this study, we develop algorithms to solve the Multi-Sample Cluster Analysis (MSCA) problem. This problem arises when we have multiple samples and need to find the statistical model that best fits the cluster structure of these samples. One important area in which our algorithms can be used is international market segmentation. In this area, samples of customers' preferences and characteristics are collected from different regions of the market. The goal in this case is to join the regions with similar customer characteristics into clusters (segments). We develop branch and bound algorithms and a genetic algorithm. In these algorithms, any of the available information criteria (AIC, CAIC, SBC, and ICOMP) can be used as the objective function to be optimized. Our algorithms use the Clique Partitioning Problem (CPP) formulation.
They are the first algorithms to use information criteria with the CPP formulation. When the branch and bound algorithms are allowed to run to completion, they converge to the optimal MSCA alternative. These methods also found good solutions when they were stopped short of convergence. In particular, we develop a branching strategy that uses a "look-ahead" technique. We refer to this strategy as the complete adaptive branching strategy. This strategy makes the branch and bound algorithm quickly search for the optimal solution in multiple branches of the enumeration tree before using a depth-first branching strategy. In computational tests, this method's performance was superior to other branching methods as well as to the genetic algorithm.

Contents

List of Tables
List of Figures
List of Algorithms

1 Introduction
  1.1 Multi-Sample Cluster Analysis (MSCA)
  1.2 Information Criteria
    1.2.1 General Structure
    1.2.2 AIC
    1.2.3 CAIC
    1.2.4 SBC
    1.2.5 ICOMP
    1.2.6 MANOVA Model
    1.2.7 Varying Means and Varying Covariances Model
    1.2.8 The Monotonic Conditions
  1.3 International Market Segmentation

2 MSCA Formulations and Approaches
  2.1 Uncapacitated Facility Location Formulation
    2.1.1 Formulation
    2.1.2 Algorithms
  2.2 Set Partitioning Formulation
    2.2.1 Formulation
    2.2.2 Algorithms
  2.3 The Clique Partitioning Problem (CPP) Formulation
    2.3.1 Formulation
    2.3.2 Algorithms
    2.3.3 Linear Formulation for the Number of Clusters

3 MSCA Branch and Bound Algorithms Using the CPP Formulation
  3.1 Introduction
  3.2 Branching Strategies
    3.2.1 Sequential Branching
    3.2.2 Adaptive Branching
    3.2.3 Reordering
    3.2.4 Complete Adaptive Branching
  3.3 Bounding Strategies
    3.3.1 Upper Bound
    3.3.2 Extension of Bao et al. (2005) Lower Bounds
    3.3.3 New Upper and Lower Bounds
  3.4 Complete Enumeration Algorithm
  3.5 Sequential Branch and Bound Algorithm
    3.5.1 Introduction
    3.5.2 Agglomerative Sequential Branch and Bound Algorithm
    3.5.3 Divisive Sequential Branch and Bound Algorithm
  3.6 Adaptive Branch and Bound Algorithm
  3.7 Adaptive Branch and Bound Algorithm With Reordering
  3.8 Complete Adaptive Branch and Bound Algorithm With Reordering
  3.9 The Lower Bounds Modules
    3.9.1 The Modules
    3.9.2 Computational Remarks
  3.10 Experimental Results
    3.10.1 Preliminary Experiments
    3.10.2 Evaluation of Strategies Using the IRIS Data Set
    3.10.3 Other Data Sets
    3.10.4 Upper Bound Improvement Charts
  3.11 Computational Remarks
    3.11.1 Sequential Branching
    3.11.2 The A Matrices
  3.12 Conclusions and Future Work

4 Adaptive Clustering Genetic Algorithm With Re-initialization
  4.1 Introduction
  4.2 The Genetic Algorithm
    4.2.1 Overview
    4.2.2 Encoding
    4.2.3 Guided Random Initialization
    4.2.4 Roulette Wheel Selection
    4.2.5 Crossover
    4.2.6 Adaptive Mutation
    4.2.7 Elitism
    4.2.8 Re-initialization
    4.2.9 Recommended Parameters' Values
  4.3 Experimental Results
    4.3.1 IRIS Data Set
    4.3.2 Other Data Sets
    4.3.3 Simulation Experiment
  4.4 Conclusions and Future Work

Bibliography
Vita

List of Tables

1.1 The size of the MSCA problem.
3.1 Results of the preliminary experiments.
3.2 Branch and bound strategies that performed well on the IRIS data set.
3.3 Branch and bound strategies that did not perform well on the IRIS data set.
3.4 The complete adaptive branching performance on the IRIS data set.
3.5 Branch and bound algorithms' results on other data sets.
4.1 Recommended values of GA parameters.
4.2 Results of experiments using GA strategies.
4.3 Results of experiments on other data sets using the repetitive adaptive clustering GA.
4.4 Branch and bound algorithms' results on the simulated data set.
4.5 Results of the experiment on the simulated data set using the repetitive adaptive clustering GA.

List of Figures

2.1 Two redundant alternatives that join the same objects together.
2.2 Triangle Constraint: Node 2 is connected to nodes 1 and 3, which forces nodes 1 and 3 to be connected to each other.
3.1 A simple branch and bound example.
3.2 Adaptive branching complete tree.
3.3 Expected performance of adaptive and complete adaptive branching strategies.
3.4 The log graph.
3.5 The log effect.
3.6 The heuristic local upper bound idea.
3.7 Upper bound improvement using the complete algorithm on the IRIS data set.
3.8 Upper bound improvement using the heuristic algorithm on the IRIS data set.
3.9 Upper bound improvement using the complete algorithm on the Bank (15 samples) data set.
3.10 Upper bound improvement using the heuristic algorithm on the Bank (15 samples) data set.
3.11 Upper bound improvement using the complete algorithm on the Bank (30 samples) data set.
3.12 Upper bound improvement using the heuristic algorithm on the Bank (30 samples) data set.
4.1 An example of a split and a join crossover.
4.2 The GA flowchart.
4.3 One run of the repetitive adaptive GA.
4.4 One run of the basic GA.
4.5 Best repetitive adaptive clustering GA run on the Bank data set (15 samples).
4.6 Best repetitive adaptive clustering GA run on the Bank data set (30 samples).
