Self-Tuned Descriptive Document Clustering using a Predictive Network

Austin J. Brockmeier, Member, IEEE, Tingting Mu, Member, IEEE, Sophia Ananiadou, and John Y. Goulermas, Senior Member, IEEE

A. J. Brockmeier and J. Y. Goulermas are with the School of Electrical Engineering, Electronics & Computer Science, University of Liverpool, Liverpool, L69 3BX, United Kingdom (e-mail: [email protected]). T. Mu and S. Ananiadou are with the School of Computer Science, University of Manchester, Manchester, M1 7DN, United Kingdom. This research was partially supported by Medical Research Council grants MR/L01078X/1 and MR/N015665/1. Date of current version: November 18, 2017.

Abstract—Descriptive clustering consists of automatically organizing data instances into clusters and generating a descriptive summary for each cluster. The description should inform a user about the contents of each cluster without further examination of the specific instances, enabling the user to rapidly scan for relevant clusters. Selection of descriptions often relies on heuristic criteria. We model descriptive clustering as an auto-encoder network that predicts features from cluster assignments and predicts cluster assignments from a subset of features. The subset of features used for predicting a cluster serves as its description. For text documents, the occurrence or count of words, phrases, or other attributes provides a sparse feature representation with interpretable feature labels. In the proposed network, cluster predictions are made using logistic regression models, and feature predictions rely on logistic or multinomial regression models. Optimizing these models leads to a completely self-tuned descriptive clustering approach that automatically selects the number of clusters and the number of features for each cluster. We applied the methodology to a variety of short text documents and show that the selected clusterings, as evidenced by the selected feature subsets, are associated with a meaningful topical organization.

Index Terms—Descriptive clustering, feature selection, logistic regression, model selection, sparse models

1 INTRODUCTION

Exploratory data analysis techniques such as clustering can be used to identify subsets of data instances with common characteristics. Users can then explore the data by examining some instances in each cluster, rather than examining instances from the full dataset. This enables users to efficiently focus on relevant subsets of large datasets, especially for collections of documents [1]. In particular, descriptive clustering consists of automatically grouping sets of similar instances into clusters and automatically generating a human-interpretable description or summary for each cluster. Each cluster's description allows a user to ascertain the cluster's relevance without having to examine its contents. For text documents, a suitable description for each cluster may be a multi-word label, an extracted title, or a list of characteristic words [2].

The quality of the clustering is important, such that it aligns with a user's idea of similarity, but it is equally important to provide the user with an informative and concise summary that accurately reflects the contents of each cluster. However, objective criteria for evaluating the descriptions as a whole, which do not resort to human evaluation, have been largely unexplored.

With the aim of defining an objective criterion, we consider a direct correspondence between description and prediction. We assume each instance is represented with sparse features (such as a bag of words), and each cluster will be described by a subset of features. A cluster's description should summarize its contents, such that the description alone should enable a user to predict whether an arbitrary instance belongs to a particular cluster. Likewise, a machine classifier trained using the feature subset should also be predictive of the cluster membership. The classification accuracy provides an objective and quantitative criterion to compare among different feature subsets.

To serve as a concise description, the number of features used by the classifier must be limited (e.g., a linear classifier that uses all features is not easily interpretable). A relatively small set of predictive features can be identified using various feature selection methods [3], [4]. In particular, we identify feature subsets by various statistical and information-theoretic criteria [5] and by training linear classifiers with additional sparsity-inducing regularizations [6], [7], [8], e.g., the ℓ1-norm for the Lasso [9] or a combination of the ℓ1- and ℓ2-norms for the Elastic Net [10], such that only a small set of features have non-zero coefficients. In a similar spirit, the Lasso has been used for selecting predictive features to explain classification models [11].
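As an illustration of this selection mechanism, the following minimal sketch fits an ℓ1-regularized logistic regression for a single cluster using scikit-learn; the toy corpus, cluster indicator, and regularization strength are invented for the example and are not taken from this work:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy corpus and a binary indicator for a "book" cluster.
docs = ["the movie was thrilling", "a dull movie sequel",
        "this book is a classic novel", "an absorbing novel and book"]
in_cluster = np.array([0, 0, 1, 1])

vec = CountVectorizer(binary=True)  # binary bag-of-words features
X = vec.fit_transform(docs)

# The l1 penalty drives most coefficients to exactly zero, so the
# surviving features form a small candidate description of the cluster.
# C is set weakly here only because the toy corpus is tiny.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, in_cluster)

selected = vec.get_feature_names_out()[clf.coef_.ravel() != 0]
print(selected)  # note: some survivors may have negative coefficients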
Note that an ℓ1 penalty alone may retain features with negative coefficients, i.e., words whose absence signals the cluster; the positivity constraint discussed next rules these out.

In addition to the cardinality constraint on the number of features, we only permit features that are positively correlated with a given cluster, i.e., features whose presence is indicative of the cluster. This constraint ensures that no cluster is described by the absence of features that are present in other clusters. For instance, given a corpus of book and movie reviews, the positivity constraint prevents a cluster consisting mainly of book reviews from being described as ¬movie, i.e., the absence of the word feature movie. In general, this constraint can be enforced by admitting only features that are positively correlated with a particular cluster; for linear classifiers, this can be done by enforcing the constraint that the coefficients are non-negative [12].

[Fig. 1. Proposed approach and network architecture for descriptive clustering (the notation for the decoupled auto-encoder is described in Table 1). The diagram shows sparse features feeding an encoder that outputs cluster predictions, and cluster assignments feeding a decoder that outputs feature predictions, annotated with five steps: 1. Generate a set of candidate clusterings. 2. Select the clustering with the best model for predicting features. 3. Select features for predicting cluster assignment, ensuring positive correlation. 4. Choose the most predictive feature subset. 5. For each cluster, the description consists of the feature subset associated with the encoder's non-zero coefficients.]

Constraining the number of features and enforcing positivity will inevitably limit the performance of the classifier. The natural question is exactly how many features are needed to ensure a 'reasonable' classification performance. We use a model order selection criterion as a principled approach to answer this question. We model the probability of class membership given the different subsets of features using logistic regression and use the Bayesian information criterion (BIC) [13] to select the model corresponding to a feature subset that is both predictive and parsimonious.
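To make these two ingredients concrete, the sketch below first admits only positively correlated features, then grows nested logistic models and scores each candidate subset with BIC = (k+1)·ln(n) - 2·ln(L). The correlation-based ranking heuristic and the helper describe_cluster are our own illustrative choices under these assumptions, not the exact algorithm of this paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def describe_cluster(X, y, vocab, max_features=20):
    """Sketch: choose a positively correlated feature subset for one
    cluster and pick its size with BIC. X is a dense 0/1 array, y a
    binary cluster indicator, and vocab maps columns to word labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]

    # Positivity constraint: admit only features whose presence is
    # positively correlated with cluster membership, best first.
    cov = ((X - X.mean(0)) * (y - y.mean())[:, None]).mean(0)
    ranked = np.argsort(-cov)
    ranked = ranked[cov[ranked] > 0][:max_features]

    best_bic, best_subset = np.inf, ranked[:1]
    for k in range(1, len(ranked) + 1):
        cols = ranked[:k]
        clf = LogisticRegression().fit(X[:, cols], y)
        # Total log-likelihood of the fitted logistic model.
        ll = -log_loss(y, clf.predict_proba(X[:, cols]), normalize=False)
        bic = (k + 1) * np.log(n) - 2 * ll  # k weights + intercept
        if bic < best_bic:
            best_bic, best_subset = bic, cols
    return [vocab[j] for j in best_subset]
```

With X, in_cluster, and vec from the earlier sketch, describe_cluster(X.toarray(), in_cluster, vec.get_feature_names_out()) returns a short list of cluster-indicative words.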
Given a clustering of the instances, we apply the aforementioned approach to automatically generate an interpretable classifier for each cluster. The features of the classifier form the description of the cluster, and the network of classifiers is a compact approximation of the original cluster assignments. The two benefits of this paradigm versus previous descriptive clustering approaches [2], [14], [15], [16] are the ability to quantify the description accuracy, as well as a principled and automatic way to select the size of the feature subsets that serve as the descriptors for each cluster.

Similarly, predictive performance can be used to select the clustering itself and, in particular, the number of clusters. Although a variety of criteria for selecting the number of clusters exist [17], they are often specific to the underlying objective of the clustering algorithm. Independent from the clustering algorithm, we propose to select the number of clusters that is most predictive of the original feature occurrences. For each clustering, […] as features, but the approach is applicable to any data with sparse count-valued features.

Our main contributions are (1) a new quantitative evaluation for descriptive clustering; (2) PDC as a unified framework to automatically select both the clustering itself (across different numbers of clusters) and also a set of features (including its size) to describe each cluster; (3) a comparison of feature selection algorithms including regularized and constrained logistic regression; (4) a direct comparison to an existing topic modeling approach [18], which shows that PDC both performs better and is more efficient; and (5) motivating examples of PDC on a variety of text datasets.

We begin by discussing related work in exploratory data analysis for text datasets in Section 2. We then present the proposed methodology in Section 3, detailing the selection of the clustering, the selection of candidate feature subsets, and the selection of model size. In Section 4, we apply the proposed approach to publicly available text datasets (movie, book, and product reviews, along with Usenet newsgroup posts, news summaries, academic grant abstracts, and recipe ingredient lists) and show that meaningful and descriptive feature subsets are selected for the clusters. Furthermore, we show that the interpretable classifiers that use the feature subsets are accurate.

2 RELATION TO EXISTING APPROACHES

There has been extensive research on clustering and other