The Computer Journal Advance Access published January 27, 2008
doi:10.1093/comjnl/bxm121

Wallace's Approach to Unsupervised Learning: The Snob Program

MURRAY A. JORGENSEN1,* AND GEOFFREY J. MCLACHLAN2

1 Department of Statistics, University of Waikato, Hamilton, New Zealand
2 Department of Mathematics, University of Queensland, St. Lucia, Brisbane, Qld 4072, Australia
*Corresponding author: [email protected]

We describe the Snob program for unsupervised learning as it has evolved from its beginnings in the 1960s to its present form. Snob uses the minimum message length principle expounded by Wallace and Freeman (Wallace, C.S. and Freeman, P.R. (1987) Estimation and inference by compact coding. J. Roy. Statist. Soc. Ser. B, 49, 240–252), and we indicate how Snob estimates class parameters using the approach of that paper. We survey the evolution of Snob from these beginnings to the state it has reached as described by Wallace and Dowe (Wallace, C.S. and Dowe, D.L. (2000) MML mixture modelling of multi-state, Poisson, von Mises circular and Gaussian distributions. Stat. Comput., 10, 73–83). We pay particular attention to the revision of Snob in the 1980s, in which definite assignment of things to classes was abandoned.

Keywords: computer program, EM algorithm, minimum message length

1. INTRODUCTION

In this article, we consider the work of Chris Wallace, his students and collaborators, in unsupervised learning. This area is also known as clustering, cluster analysis or numerical taxonomy. We focus our attention on the pioneering Snob program, wryly so-called because it places individuals in classes (C.S. Wallace, personal communication).

The Snob program [1, 2] represents a pioneer contribution to a model-based approach to unsupervised learning. At the same time as Snob made an early contribution to model-based clustering (that is, unsupervised learning based on probability distributions for the clusters), it also provided early evidence that a form of inference based on coding theory could tackle non-trivial applications involving substantial amounts of data.

Other approaches to clustering that appeared at approximately the same time in the statistical literature [3–9] also considered a mixture model-based approach, primarily focussing on the assumption of normality for the component distributions. In a related approach, Hartley and Rao [10] and Scott and Symons [11] considered the so-called classification-likelihood method of clustering.

As will be discussed later in more detail, the distinction between the mixture and classification approaches to clustering lies in how they are formulated and subsequently implemented. Both work with the joint likelihood formed on the basis of the observed data and the unobservable class-indicator variables. However, with the classification approach, these indicator variables are treated as unknown parameters to be estimated along with the unknown parameters in the assumed distributional forms for the class densities corresponding to the clusters to be imposed on the data. In contrast, with the mixture approach via the EM algorithm, these class indicator variables are treated as 'missing data' and at each iteration of that algorithm are replaced by the current values of their conditional expectations, given the observed data. The mixture approach thus circumvents the biases in the parameter estimates produced by the classification approach due to outright (hard) assignments of the observations during the iterative process. The Snob program initially used hard assignments, but later switched to soft (partial) assignments in its implementation of the minimum message length (MML) approach.

2. AN INFORMATION MEASURE

Wallace and Boulton [1] do not present their work as a new method for grouping cases into classes but rather as a criterion for evaluating the results of such clusterings, as they state in their introduction:

    The aim in this paper is to propose a measure of the goodness of a classification, based on information theory, which is completely independent of the process used to generate the classification.


Of course, once such a criterion is proposed, it is a small step to seek a clustering that optimizes it. That there was and is a need for such a criterion cannot be denied. Consulting works such as [12] or [13] reveals an embarrassment of clustering methods, and the alternatives have only increased with the passage of time. Consider, for example, traditional hierarchical clustering methods based on a distance or similarity matrix. The method of calculating the distance or similarity measure must be specified, as must the criterion used for combining or dividing classes, and the level of similarity at which the resulting dendrogram is cut. Each of these decision points offers a profusion of choices, which multiply up to give the final number of available methods.

The measure that Wallace and Boulton [1] proposed would later be developed, in a more general context, into the principle of MML. They considered a digital encoding of a message the purpose of which is to describe the attribute values of each observation. A useful classification will divide all the observations into a finite number of concentrated classes. This usefulness is reflected in the coding scheme by allowing a short encoding of observations within a class. In their words:

    If the expected density of points in the measurement space is everywhere uniform, the positions of the points cannot be encoded more briefly than by a simple list of the measured values. However, if the expected density is markedly non-uniform, application of Shannon's theorem will allow a reduction in the total message length by using a brief encoding for positions in those regions of high expected density.

The criterion for evaluating a classification of points into classes will be the length of a message, constructed with the assistance of the classification, that describes all the attribute values of all the data points.
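As a brief aside (our gloss, not part of [1]), the quantitative basis for the appeal to Shannon's theorem is the code-length relation

```latex
% An outcome of probability p admits a code word of length about -log2 p bits,
\[
  \ell(x) \;\approx\; -\log_2 \Pr(x) \quad \text{bits},
\]
% so a classification that concentrates probability within classes (making the
% within-class attribute values highly predictable) permits shorter code words
% for those values, and hence a shorter total message.
```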
The message is divided into five parts communicating (i) the number of classes; (ii) a dictionary of class names; (iii) a description of the distribution function for each class; (iv) for each point, the name of the class to which it belongs and (v) for each point, its attribute values in the code set up for its class.

We remark parenthetically that in all his writings on classification, Wallace eschewed terms like 'observation', 'case' or 'operational taxonomic unit' in favour of the pithy 'thing'. It is regrettable that this sensible lead was not followed by most of the literature.

We refer the reader to the original paper for the details of the message construction, but we will make brief comments on the different components of the message.

2.1. Number of classes

Wallace and Boulton [1] effectively assume an equal probability for any number of classes up to some arbitrary cut-off. This gives a constant length for this part of the message, which is therefore disregarded.

2.2. Class name dictionary

The receiver of the message is assumed to be in possession of a 'code book' containing a number of possible sets of class names. This part of the message tells the receiver which set of names will be used in the message. If the code book contains a large number of sets of names, then it will be possible for the sender to select one such set in which the large classes receive short names, which will be useful when sending the class name for every observation. However, too large a code book means that the part of the message that specifies which set to use must itself be very long. Balancing these two requirements leads to a fairly long technical discussion in the appendix to [1].

2.3. Description of the class distribution function

It is assumed that within each class the attributes are independently distributed. This permits the multivariate distribution to be described by simply concatenating the descriptions of the distributions of each attribute. (Recent versions of Snob relax this requirement.) The considerations for encoding the distribution of a categorical attribute are similar to those discussed for the class name dictionary. The encoding of a continuous variable within a class is based on the assumption of a normal distribution for the variable. Decisions must be made on the values of the mean and standard deviation to state, and the precision with which to state these. This leads to another technical section of the paper to determine approximately optimal values.

2.4. Class and attribute values for each observation

These merely need to be stated in the coding schemes described earlier in the message.

2.5. The form of the message

Putting all the components of the message together leads to a large and not particularly elegant expression for message length. While a drawback for analytical work, this is not a particular problem for comparing the message lengths corresponding to various proposed clusterings of a data set, as the lengths would in practice be computed by a computer program.
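To illustrate how such message lengths might be computed in practice, the sketch below (our own toy calculation, not the actual coding scheme of [1]) compares the cost of encoding a continuous attribute under a single global normal model against a two-part message that first names a class for each thing (part (iv)) and then encodes its value under that class's normal model (part (v)). The function names, the crude measurement precision `eps` and the neglect of the costs of parts (i)–(iii) are our simplifications.

```python
import numpy as np

def nats_to_bits(x):
    return x / np.log(2)

def normal_code_length(values, mu, sigma, eps=0.01):
    """Approximate bits to encode values measured to precision eps under a
    Normal(mu, sigma) model: -log2 of the probability mass of each eps-wide
    cell, approximated by eps times the density."""
    log_density = -0.5 * np.log(2 * np.pi * sigma**2) - (values - mu)**2 / (2 * sigma**2)
    return nats_to_bits(-(log_density + np.log(eps))).sum()

def two_part_length(values, labels):
    """Bits for a message that states each thing's class (part iv) and then
    its attribute value under that class's fitted normal model (part v)."""
    total = 0.0
    n = len(values)
    for k in np.unique(labels):
        members = values[labels == k]
        p_k = len(members) / n
        total += len(members) * -np.log2(p_k)                    # class names
        total += normal_code_length(members, members.mean(),
                                    members.std(ddof=0) + 1e-9)  # attribute values
    return total

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 500)])
labels = (x > 3).astype(int)

print("one class  :", normal_code_length(x, x.mean(), x.std()), "bits")
print("two classes:", two_part_length(x, labels), "bits")
```

With two well-concentrated classes the second message is markedly shorter, even though it must spend a bit per thing naming the class.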

3. THE SNOB PROGRAM

The Snob program itself gets only a brief mention by Wallace and Boulton [1], as a program that attempts to minimize the measure defined in the paper. We need to refer to [2] for a description of the structure of Snob itself.

Beginning from a message length defined by an initial classification, Snob employs a number of 'tactics' by which the classification is improved so that it has a reduced message length.

3.1. Distribution adjustment

The number of classes and the assignment of observations to classes are left unchanged, and it is sought to optimize the parameters of the class distributions and the proportion parameters for the classes.

3.2. Reclassifying

The number of classes, the proportion of observations assumed to be in each class and the class distribution parameters are held constant, and the observations are reassigned to their most probable class.

3.3. Splitting

A single class is split into two, and the optimal proportion and distribution parameters are determined for the new classes.

3.4. Merging

Two classes are combined and the optimal proportion and distribution parameters are determined for the new class.

3.5. Swapping

A class is split into two, and one of its parts is added to another class; the optimal proportion and distribution parameters are determined for the affected classes.

The distribution adjustment process is an MML estimation applied separately to each class as it is currently constituted.

The reclassifying process proceeds observation-by-observation. For each observation, it works out the message length for describing the attribute values of the observation according to the encoding for each class. It assigns the observation to the class for which this length is smallest.

Snob considers each class to be divided into two subclasses. If the program were hypothetically stopped at a T-class solution, there would also be information about a 2T-class solution. In the splitting procedure, all possible (T + 1)-class solutions generated by splitting one class are evaluated in terms of the consequent reduction in message length. The best choice is made, and the two subclasses of the chosen class are promoted to full classes and endowed with randomly chosen subclasses (at the following iteration).

Merging brings two classes together, and the distribution adjustment procedure is carried out for the new class. The old classes become the two subclasses of the new class.

In the swapping process, one of the four subclasses of the two classes is made into a full class and gets two random subclasses. The other three subclasses form a new class with subclasses given by the transferred subclass and the old class.

Snob is initialized either by starting from an initial classification, which will then be improved in terms of Snob's message length criterion, or by a random start with a given number of classes.

A feature of Snob added at the revision described in [14] is similar to attribute selection, but more flexible. A facility is provided whereby an attribute may be declared 'significant for a class' and distributional parameters estimated specifically for that class, but a common form of the attribute's distribution is assumed for those classes in which it is not declared significant. A class, by default, must have existed for five iterations before attributes are tested to see whether being made insignificant for that class results in a shortened message length.
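As a minimal illustration of the reclassifying tactic as originally implemented (full assignment of each thing to the class giving it the shortest code), consider the following sketch. It is our own toy version for a single continuous attribute; the helper names and the crude measurement precision are assumptions, and the real program of course handles many attributes and attribute types.

```python
import numpy as np

def class_code_length(x, mu, sigma, p, eps=0.01):
    """Bits to name a class of probability p and then encode value x,
    measured to precision eps, under that class's Normal(mu, sigma)."""
    log_density = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return -np.log2(p) - (log_density + np.log(eps)) / np.log(2)

def reclassify(x, mus, sigmas, props):
    """One pass of the 'reclassifying' tactic: each observation is assigned
    to the class giving it the shortest code length."""
    lengths = np.stack([class_code_length(x, m, s, p)
                        for m, s, p in zip(mus, sigmas, props)])
    return lengths.argmin(axis=0)

x = np.array([-1.2, 0.3, 4.8, 5.5, 2.6])
print(reclassify(x, mus=[0.0, 5.0], sigmas=[1.0, 1.0], props=[0.5, 0.5]))
```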

4. PARTIAL OR FULL ASSIGNMENT OF OBSERVATIONS TO CLASSES?

The approach of Snob to unsupervised learning, as noted above, involves the fitting of finite mixtures of probability distributions to data, more briefly: mixture modelling.

There are two ways in which this kind of model may be formalized. Let y_i, i = 1, ..., S, be the collection of things or observations. Each y_i is a vector of D attributes y_{i1}, y_{i2}, ..., y_{iD} and may belong to one of T classes.

In the first formalism we write the probability density of y_i as

\[ f(y_i) = \sum_{j=1}^{T} p_j f(y_i; \phi_j), \]

where the proportion parameters p_j sum to 1.

In the second formalism, we introduce ST additional binary parameters z_{ij} ∈ {0, 1} with \sum_j z_{ij} = 1 and write the probability density of y_i as

\[ f_C(y_i) = \prod_{j=1}^{T} p_j^{z_{ij}} f(y_i; \phi_j)^{z_{ij}}, \]

where the proportion parameters p_j sum to 1 as before.


In this second formalism we interpret z_{ij} by

\[ z_{ij} = \begin{cases} 1 & \text{if } y_i \in \text{Class } j, \\ 0 & \text{otherwise.} \end{cases} \]

In the first formalism, when the model is fitted to data by MML, maximum likelihood or some other optimization criterion, we will speak of partial assignment; in corresponding situations for the second formalism, we speak of full assignment. This is because fitting under the second formalism involves making a specific choice of class for each observation. Under partial assignment no class is definitely specified for an observation, although its membership probabilities in the T classes may be worked out by a Bayes rule calculation and may result in one class being strongly favoured.

When the EM algorithm [15] is adopted to fit mixture models with full assignment by maximum (possibly penalized) likelihood, real-valued quantities similar to the z_{ij} are introduced and estimated, but these are not actual model parameters.

Scott and Symons [11] note that many classical cluster analysis methods for observations with continuous attributes can be seen as mixture modelling with full assignment. Banfield and Raftery [16] describe a program for clustering with full assignment that builds on the work of Scott and Symons [11]. McLachlan and Basford [17, pp. 31–35] discuss maximum likelihood estimation under the full assignment mechanism under the name classification likelihood, citing earlier literature and drawing attention to the presence of bias in the distributional parameters φ_j when estimated this way.

It is not difficult to see why full assignment leads to bias in the distributional parameters. Consider a mixture of two univariate normal distributions in similar proportions where both distributions have equal scale parameters. Suppose that the two distributions substantially overlap. In this situation, under full assignment, there will be a critical value k such that all observations greater than k are assigned to one distribution, and all observations less than k are assigned to the other. Thus the lower distribution loses its upper tail and the upper distribution loses its lower tail. The separation of the means is exaggerated and the variance of each distribution is underestimated.

As the early version of Snob fully assigned observations to classes, it is subject to this sort of bias. For this reason, Snob was revised to work under partial assignment. In Section 6.8.2 of [18] Wallace also considers a mixture of two univariate normal distributions to show that full assignment leads to inconsistent parameter estimates.
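The truncation effect just described is easy to demonstrate numerically. The following small simulation (ours, not taken from the paper) hard-assigns observations from an overlapping two-component normal mixture using a cut-off midway between the means and reports the within-class estimates, which show exaggerated separation and understated spread.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# Equal-proportion mixture of N(0, 1) and N(1.5, 1): substantial overlap.
component = rng.integers(0, 2, size=n)
x = rng.normal(loc=1.5 * component, scale=1.0)

# Full (hard) assignment by the critical value k midway between the means.
k = 0.75
low, high = x[x < k], x[x >= k]

print("hard-assigned means:", low.mean(), high.mean())  # pushed apart from 0 and 1.5
print("hard-assigned sds  :", low.std(), high.std())    # both well below the true value 1
```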
5. PARTIAL ASSIGNMENT FOR SNOB

Wallace [14] describes some revisions to Snob which changed Snob from a full assignment to a partial assignment clustering program. This was done in order to avoid the bias problems associated with full assignment. Wallace [14] shows that with partial assignment a shorter message length can be obtained than with full assignment, giving partial assignment an MML justification. The technique is known in the MML community as the coding trick.

The ingenious construction that is carried out reorganizes only parts (iv) and (v) of the message as described in Section 2. Initially we reorder these so that the encoded class and attribute values for each observation are together. Wallace [14] describes how to proceed next, and we repeat this now with only light editing.

Consider the message segment encoding the class and attribute values of a particular observation which is not the last such segment to be encoded in the message. According to the choice of class for each observation there are T ways of encoding the class and attribute values. Let the lengths of the several possible code segments be l_1, ..., l_T. Define

\[ p_j = 2^{-l_j}, \qquad j = 1, \ldots, T. \]

These p_j values may be identified with the probabilities of getting the data by each of several mutually exclusive routes, all consistent with the mixture model. Define

\[ P = \sum_{j=1}^{T} p_j \quad\text{and}\quad q_j = \frac{p_j}{P}, \qquad j = 1, \ldots, T. \]

To choose the encoding for the data segment, first construct, according to some standard algorithm, a Huffman code optimized for the discrete distribution {q_j : j = 1, ..., T}. Note that this distribution is the Bayes posterior distribution over the mutually exclusive routes, given the model and the data segments. From the standard theory of optimal codes, the length m_j of the code word in this Huffman code for route j will be m_j = −log q_j, the code will have the prefix property, and every sufficiently long binary string will have some unique word of the code as its prefix. Now examine the binary string encoding the remainder of the data, that is, the data following the segment being considered. This string must begin with some word of the Huffman code, say the word for route k. Then encode the data segment using route k, hence using a code segment of length l_k. Then the first m_k bits of the binary string for the remainder of the data need not be included in the explanation, as they may be recovered by a receiver after decoding the present data segment.


Consider the net length of the string used to encode the data segment, that is, the length of the string used minus the length which need not be included for the remaining data. The net length is

\[ l_k - m_k = -\log p_k + \log q_k = -\log (p_k / q_k) = -\log P = -\log \Bigl( \sum_{j=1}^{T} p_j \Bigr). \]

Merely choosing the shortest of the possible encodings for the data segment would give a length of

\[ -\log \max_{j=1}^{T} p_j. \]

The coding device, therefore, has little effect when one possible coding is much shorter (more probable a posteriori) than the rest, but can shorten the explanation by as much as log T if they are all equally long.

Still following [14], but less closely, we note that the net length of the message describing the data is the same as would be obtained by assigning no observations to classes and using the mixture density

\[ f(y_i) = \sum_{j=1}^{T} p_j f(y_i; \phi_j) \]

directly to encode the attribute values. However, direct optimization of the mixture density is difficult.

The coding trick uses the following part of the message in order to select which of the T classes is used to do the encoding of the attribute values for the observation being considered. As that code segment has nothing to do with the current observation, it is like making a random choice of the class used, with probability given by the posterior class probability for that observation (as that is how the classes are encoded).

If the above procedure were used directly to assign observations to clusters, there would be some similarity between Snob and the stochastic EM method [19, 20]. However, instead the {q_j} for the ith thing are used to define weights w_ij, and the 'distribution adjustment' for the jth class is carried out with all data but with weights w_ij. (This is not entirely obvious from [14] but is discussed in [21].) The 'coding trick' for MML estimation of mixture models, and the reason why it leads to a form of the EM algorithm, are again expounded by Wallace in Section 6.8.3 of [18].
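The bookkeeping of the coding trick can be seen numerically in the sketch below (our own illustration, not Wallace's implementation). It takes hypothetical per-route code lengths l_j, forms p_j = 2^{-l_j} and q_j = p_j / P, uses the idealized code-word lengths m_j = −log2 q_j, and compares the net length −log2 P with the naive choice −log2 max_j p_j.

```python
import numpy as np

# Hypothetical code lengths (bits) for encoding one thing's class and
# attribute values via each of T = 3 routes (classes).
l = np.array([10.0, 11.0, 14.0])

p = 2.0 ** (-l)          # p_j = 2^{-l_j}
P = p.sum()
q = p / P                # Bayes posterior over the routes
m = -np.log2(q)          # idealized Huffman-style code-word lengths

net_lengths = l - m      # net cost, whichever route k the trick selects
print("q:", q.round(3))
print("net length (any route):", net_lengths.round(6))  # all equal to -log2 P
print("-log2 P               :", -np.log2(P))
print("naive shortest route  :", l.min(), "bits")
```

Here the trick saves about 0.6 bits over simply using the shortest route; the saving approaches log2 T bits when the routes are equally plausible.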
6. THE EM ALGORITHM FOR MIXTURE MODELS

Outside the MML community, mixture models such as

\[ f(y_i) = \sum_{j=1}^{T} p_j f(y_i; \phi_j) \]

are commonly fitted by maximum likelihood using the EM algorithm (see, e.g. [22]). Here we seek to maximize the likelihood

\[ L(\theta) = \prod_{i=1}^{S} \Biggl[ \sum_{j=1}^{T} p_j f(y_i; \phi_j) \Biggr], \]

where θ is the vector of unknown parameters, containing the mixing proportions p_j and the component parameters φ_j for j = 1, ..., T. With respect to this likelihood, we may define the observed information matrix I(θ; y) to be the negative Hessian of the log-likelihood for θ evaluated at the data vector y and the parameter vector θ. The expected or Fisher information matrix F(θ) is then defined by

\[ F(\theta) = E_{\theta}\bigl[ I(\theta; y) \bigr]. \]

In the EM approach, it is also common to introduce the class assignment indicator variables z_{ij}, i = 1, ..., S, j = 1, ..., T, considered above. These are not observed, but a function

\[ L_C(\theta) = \prod_{i=1}^{S} f_C(y_i) = \prod_{i=1}^{S} \prod_{j=1}^{T} p_j^{z_{ij}} f(y_i; \phi_j)^{z_{ij}} \]

is introduced that would be a likelihood if the z_{ij} had been observed. Note that ℓ_C(θ) = log L_C(θ) splits into a part involving the T proportion parameters p_j and, for each class j, a part involving the parameters φ_j. Bayesian maximum posterior estimation with prior h(θ) may be accomplished just as easily if log h(θ) also decomposes in a corresponding way.

The EM algorithm proceeds iteratively from initial estimates for the parameters θ. Each iteration comprises two steps: a step involving an expectation (the E-step) and a step involving a maximization (the M-step).

In the E-step we take the conditional expectation of the z_{ij} given the data and the current parameters, obtaining

\[ q_{ij} = \frac{p_j f(y_i; \phi_j)}{\sum_{j=1}^{T} p_j f(y_i; \phi_j)}. \]

In the M-step, ℓ_C(θ), with the z_{ij} replaced by the q_{ij}, is maximized with respect to the parameters, and the maximizing values of the p_j and the φ_j become the updated parameter estimates.


The new p_j are thus given by

\[ p_j = \frac{\sum_{i=1}^{S} q_{ij}}{S}, \]

and for j = 1, ..., T the new φ_j are obtained by T separate weighted maximum likelihood estimations in which the data y_i with weight q_ij come from the distribution f(y_i; φ_j). Maximum posterior estimation may be obtained similarly if the log prior splits in the way mentioned above. In this case, we would estimate the T + 1 parameter sets by T + 1 separate weighted maximum posterior estimations.
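A minimal runnable sketch of these E- and M-step updates for a univariate normal mixture, in the notation above, is given below. It is our own illustration (helper names and the simple initialization are ours), not the Snob program.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_normal_mixture(y, T=2, n_iter=200, seed=0):
    """EM for a T-component univariate normal mixture: the E-step computes
    q_ij, the M-step does one weighted estimation per class."""
    rng = np.random.default_rng(seed)
    p = np.full(T, 1.0 / T)
    mu = rng.choice(y, T, replace=False).astype(float)
    sigma = np.full(T, y.std())
    for _ in range(n_iter):
        # E-step: q_ij = p_j f(y_i; phi_j) / sum_j p_j f(y_i; phi_j)
        dens = np.stack([p[j] * normal_pdf(y, mu[j], sigma[j]) for j in range(T)], axis=1)
        q = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates, one separate estimation per class
        p = q.mean(axis=0)                                   # new p_j = sum_i q_ij / S
        mu = (q * y[:, None]).sum(axis=0) / q.sum(axis=0)
        sigma = np.sqrt((q * (y[:, None] - mu) ** 2).sum(axis=0) / q.sum(axis=0))
    return p, mu, sigma

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 700)])
print(em_normal_mixture(y))
```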
With respect to L_C, referred to as the complete-data likelihood, we may define the complete-data observed information matrix I_C(θ; y, z) and the complete-data expected information matrix F_C(θ) in the usual way. We can also define the complete-data conditional expected information matrix

\[ \bar{I}_C(\theta; y) = E_{\theta}\bigl[ I_C(\theta; y, z) \mid y \bigr]. \]

7. THE WALLACE–FREEMAN APPROACH TO INFERENCE

Wallace and Freeman [23] present what seems to be the most comprehensive approach to MML inference published prior to [18]. They motivate and present the following estimate of message length:

\[ -\log h(\theta) + \tfrac{1}{2}\log\lvert F(\theta)\rvert - \log f(y; \theta) + \tfrac{n_p}{2}\log k_{n_p} + \tfrac{1}{2} n_p, \tag{1} \]

where h(θ) is a prior distribution for the parameter values, F(θ) the expected (Fisher) information matrix, f(y; θ) the likelihood function, n_p the number of parameters being estimated and k_n the n-dimensional optimal quantizing lattice constant ([24], Table 2.3).
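For the simplest case, expression (1) can be evaluated directly. The toy sketch below (our illustration, not the mixture-model application discussed next) computes it for a single normal mean with known variance, assuming a uniform prior of width `prior_range` and using k_1 = 1/12; the function name and these choices are ours.

```python
import numpy as np

def wf_message_length(y, mu, sigma=1.0, prior_range=100.0):
    """Wallace-Freeman expression (1) for a single normal mean mu with known
    sigma (toy case).  Here n_p = 1, the Fisher information for mu is
    n / sigma^2, the prior is uniform on an interval of width prior_range,
    and k_1 = 1/12.  Lengths are in nats."""
    n = len(y)
    n_p = 1
    k1 = 1.0 / 12.0
    log_h = -np.log(prior_range)                          # log prior density
    log_F = np.log(n / sigma**2)                          # log |F(mu)|
    log_f = (-0.5 * n * np.log(2 * np.pi * sigma**2)
             - ((y - mu) ** 2).sum() / (2 * sigma**2))    # log-likelihood
    return -log_h + 0.5 * log_F - log_f + 0.5 * n_p * np.log(k1) + 0.5 * n_p

rng = np.random.default_rng(0)
y = rng.normal(3.0, 1.0, 50)
for mu in (2.5, y.mean(), 3.5):
    print(round(float(mu), 3), round(float(wf_message_length(y, mu)), 2))
```

The length is minimized near the sample mean, as expected when the Fisher information does not depend on the parameter.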

The approximated expected message length (1) is very similar to the negative of the following approximation to the logarithm of the integrated likelihood, log g(y) = log ∫ h(θ) f(y; θ) dθ, obtained by Laplace's method [22, Section 6.9.2]:

\[ \log g(y) \approx \log f(y; \tilde{\theta}) + \log h(\tilde{\theta}) - \tfrac{1}{2}\log\lvert H(\tilde{\theta})\rvert + \tfrac{n_p}{2}\log(2\pi). \tag{2} \]

Here θ̃ is the posterior mode, and H(θ̃) is the negative Hessian of the log-posterior for θ evaluated at θ = θ̃.

An important variant on (2) is

\[ \log g(y) \approx \log f(y; \hat{\theta}) + \log h(\hat{\theta}) - \tfrac{1}{2}\log\lvert I(\hat{\theta}; y)\rvert + \tfrac{n_p}{2}\log(2\pi), \tag{3} \]

where the posterior mode is replaced by the MLE θ̂ and H(θ̃) is replaced by I(θ̂; y), the observed information matrix evaluated at θ̂. The right-hand side of Equation (3) is termed the Laplace empirical criterion (LEC) by McLachlan and Peel [22]. The apparent similarity of the LEC and MML criteria is confirmed by a recent study [25], which found, in a simulation study involving generalized Dirichlet mixtures, that MML and LEC performed similarly in determining the number of clusters and better than the other alternatives considered.

Expression (1) cannot be used directly in Snob because the expected information matrix F(θ) is very difficult to calculate for mixture models. Even its commonly used approximation, the observed information matrix I(θ̂; y), is difficult to obtain for mixture models. Nor can the EM algorithm be adapted in the way it can be adapted for maximum posterior estimation, because log|F(θ)| cannot be written as the sum of T + 1 parts each involving only one of the parameter subsets p and the φ_j for j = 1, ..., T.

In Snob, (1) is not applied directly to the whole mixture model but, in a weighted form, to the individual models for each component. This is similar to approximating the matrix F(θ) by the complete-data expected information matrix F_C(θ) [26].

We note that the determinants of the complete-data expected information matrix F_C(θ) and the (incomplete-data) information matrix F(θ) can be quite different. To see this, note that the rate of convergence of the EM algorithm towards the maximum likelihood estimate (MLE) θ̂ depends on the smallest eigenvalue of

\[ \bar{I}_C(\theta; y)^{-1}\, I(\theta; y). \]

The EM algorithm can converge very slowly, in particular when the components of the mixture are not well separated, which indicates that these two matrices, Ī_C(θ; y) and I(θ; y), can be quite different. Now the observed information matrix I(θ̂; y) should be quite similar to the Fisher information matrix F(θ̂); and for component distributions belonging to the regular exponential family, Ī_C(θ̂; y) is equal to F_C(θ̂). This suggests that the determinants of F_C(θ̂) and F(θ̂) can be quite different.

Despite this, there are situations in which the determinant of F_C(θ̂) may replace the determinant of F(θ̂) in (1) with little effect on the estimated parameters. Jorgensen [27] shows that, in the case of the single-factor model studied by Wallace and Freeman [28], using an EM algorithm to implement a version of MML in which the determinant of F(θ̂) is replaced by that of F_C(θ̂) yields very similar results to those of [28]. Jorgensen [27] also shows that the evaluation of the determinant of F(θ̂) can be very intricate.
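The eigenvalue statement above rests on the standard missing-information decomposition of EM theory (a reminder we add here; see [15] and [19]):

```latex
\[
  I(\theta;\,y) \;=\; \bar{I}_C(\theta;\,y) \;-\; I_{\mathrm{mis}}(\theta;\,y),
\]
% where $I_{\mathrm{mis}}$ is the information about $\theta$ carried by the
% unobserved indicators $z_{ij}$.  The EM map contracts at rate
% $\bar{I}_C^{-1} I_{\mathrm{mis}} = \mathrm{Id} - \bar{I}_C^{-1} I$, so EM is
% slow exactly when some eigenvalue of $\bar{I}_C(\theta;\,y)^{-1} I(\theta;\,y)$
% is near zero, i.e. when the complete-data and incomplete-data information
% differ greatly.
```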

8. SNOB TODAY

Another strand in MML work in unsupervised learning began in 1998 with Wallace's article [29], in which he considers strategies for incorporating spatial information into mixture model clustering. The basic setup is in terms of Markov random fields. A contribution to the research programme envisaged in [29] appears in [30], which gives more information about the message length approximations used.

The most recent reference on the Snob program as such is [31]. Snob has been extended to allow univariate Poisson and von Mises circular variables (attributes) in addition to the normal and discrete variables originally allowed. Because Snob assumes that all variables are independent within each cluster, it should not be difficult to extend Snob to cope with other types of variable distribution once the problem of MML estimation for such variables has been solved. There seem to be relatively few examples of successful MML approaches to genuine multivariate distributions, so extending Snob to have the ability to fit more complicated component distributions is a harder problem. However, Agusta and Dowe [32] have succeeded in developing an MML approach to fitting mixtures of multivariate normal distributions. It appears that Snob's model search strategy may have to be rethought, as the possibility of groups of multivariate normal vectors of attributes opens up a much larger model space to be searched.

Hunt and Jorgensen [33] consider the maximum likelihood fitting of mixture models similar to Snob for continuous and categorical variables. The continuous variables may be assumed to have block-diagonal covariance structure within mixture components. In the prostate cancer data studied by Hunt and Jorgensen [33], there was strong evidence of within-component associations in the two-, three- and four-component mixtures fitted. However, the actual most probable assignments of the observations to clusters did not change markedly from the independent-within-clusters model when the stronger associations were added to the model, suggesting that Snob would have done well with these data. It is likely, though, that data sets exist in which allowing a more complex model in the components allows one to use substantially fewer components. Of course, it is just such trade-offs that MML is designed to evaluate, so perhaps one day there may be a Super-Snob with such capabilities.

ACKNOWLEDGEMENTS

The authors thank the late Professor Chris Wallace and Dr David Dowe for many helpful conversations about minimum message length inference and the Snob program.

REFERENCES

[1] Wallace, C.S. and Boulton, D.M. (1968) An information measure for classification. Comput. J., 11, 185–194.
[2] Boulton, D.M. and Wallace, C.S. (1970) A program for numerical classification. Comput. J., 13, 63–69.
[3] Wolfe, J. (1965) A Computer Program for the Computation of Maximum Likelihood Analysis of Types. Research Memo SRM 65-12. U.S. Naval Personnel Research Activity, San Diego.
[4] Wolfe, J. (1967) NORMIX: Computations for Estimating the Parameters of Multivariate Normal Mixtures of Distributions. Research Memo SRM 68-2. U.S. Naval Personnel Research Activity, San Diego.
[5] Wolfe, J.H. (1970) Pattern clustering by multivariate mixture analysis. Multivariate Behav. Res., 5, 329–350.
[6] Wolfe, J. (1971) A Monte Carlo Study of the Distribution of the Likelihood Ratio for Mixtures of Multinormal Distributions. Technical Bulletin STB 72-2. U.S. Naval Personnel Research Activity, San Diego.
[7] Hasselblad, V. (1966) Estimation of parameters for a mixture of normal distributions. Technometrics, 8, 431–444.
[8] Hasselblad, V. (1969) Estimation of finite mixtures of distributions from the exponential family. J. Am. Statist. Assoc., 64, 1459–1471.
[9] Day, N. (1969) Estimating the components of a mixture of two normal distributions. Biometrika, 56, 463–474.
[10] Hartley, H. and Rao, J. (1968) Classification and estimation in analysis of variance problems. Int. Statist. Rev., 36, 141–147.
[11] Scott, A.J. and Symons, M.J. (1971) Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387–397.
[12] Anderberg, M.R. (1973) Cluster Analysis for Applications. Academic Press, New York.
[13] Hartigan, J.A. (1975) Clustering Algorithms. Wiley, New York.
[14] Wallace, C.S. (1986) An Improved Program for Classification. Proc. 9th Australian Computer Science Conf., pp. 357–366. Australian Computer Science Society.
[15] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B, 39, 1–38.
[16] Banfield, J.D. and Raftery, A.E. (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.
[17] McLachlan, G.J. and Basford, K.E. (1988) Mixture Models: Inference and Applications to Clustering. Dekker, New York.
[18] Wallace, C.S. (2005) Statistical and Inductive Inference by Minimum Message Length. Information Science and Statistics. Springer, New York.


[19] McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Extensions. Wiley, New York.
[20] Jorgensen, M.A. (2001) EM algorithm. In El-Shaarawi, A.H. and Piegorsch, W.W. (eds), Encyclopedia of Environmetrics, Vol. 2, pp. 637–653. Wiley, New York.
[21] Wallace, C.S. (1984) An Improved Program for Classification. Technical Report 47. Department of Computer Science, Monash University, Melbourne.
[22] McLachlan, G.J. and Peel, D. (2000) Finite Mixture Models. Wiley, New York.
[23] Wallace, C.S. and Freeman, P.R. (1987) Estimation and inference by compact coding. J. Roy. Statist. Soc. Ser. B, 49, 240–252.
[24] Conway, J.H. and Sloane, N.J.A. (1988) Sphere Packings, Lattices and Groups. Springer, London.
[25] Bouguila, N. and Ziou, D. (2007) High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Trans. Pattern Anal. Mach. Intell., 29, 1716–1731.
[26] Baxter, R.A. and Oliver, J.J. (2000) Finding overlapping components with MML. Stat. Comput., 10, 5–16.
[27] Jorgensen, M.A. (2005) Minimum message length estimation using EM methods: a case study. Comput. Stat. Data Anal., 49, 147–167.
[28] Wallace, C.S. and Freeman, P.R. (1992) Single factor analysis by minimum message length estimation. J. Roy. Statist. Soc. Ser. B, 54, 195–209.
[29] Wallace, C.S. (1998) Intrinsic classification of spatially correlated data. Comput. J., 41, 602–611.
[30] Visser, G. and Dowe, D.L. (2007) Minimum Message Length Clustering of Spatially-Correlated Data with Varying Inter-Class Penalties. Proc. 6th IEEE Int. Conf. on Computer and Information Science (ICIS 2007), pp. 17–22.
[31] Wallace, C.S. and Dowe, D.L. (2000) MML mixture modelling of multi-state, Poisson, von Mises circular and Gaussian distributions. Stat. Comput., 10, 73–83.
[32] Agusta, Y. and Dowe, D.L. (2003) Unsupervised Learning of Correlated Multivariate Gaussian Mixture Models Using MML. In Gedeon, T.D. and Fung, L.C.C. (eds), Lecture Notes in Computer Science, Vol. 2903, pp. 477–489. Springer, New York.
[33] Hunt, L.A. and Jorgensen, M.A. (1999) Mixture model clustering using the Multimix program. Aust. N. Z. J. Statist., 41, 901–919.
