Distributed Document and Phrase Co-embeddings for Descriptive Clustering

Motoki Sato1, Austin J. Brockmeier2, Georgios Kontonatsios1, Tingting Mu1, John Y. Goulermas2, Jun’ichi Tsujii3,1 and Sophia Ananiadou1

1University of Manchester, National Centre for (NaCTeM), Manchester, UK 2Department of Computer Science, University of Liverpool, Liverpool, UK 3Artificial Intelligence Research Center, AIST, Tokyo, Japan [email protected], [email protected] [email protected]

Abstract methods have been shown to be useful in a va- riety of scenarios, including Descriptive document clustering aims to (Bharambe and Kale, 2011), analysis of social net- automatically discover groups of seman- works (Zhao and Zhang, 2011), and large-scale tically related documents and to assign exploration (Nassif and Hruschka, 2013) and visu- a meaningful label to characterise the alisation of text collections (Kandel et al., 2012). content of each cluster. In this paper, A number of previously proposed descriptive we present a descriptive clustering ap- clustering techniques work by extending a stan- proach that employs a distributed repre- dard document clustering approach. Documents sentation model, namely the paragraph are typically clustered based on a bag-of-words vector model, to capture semantic similar- (BoW) representation (i.e., the occurrence counts ities between documents and phrases. The of the words that appear in each document). proposed method uses a joint representa- Then, each cluster is labelled using the most com- tion of phrases and documents (i.e., a co- monly occurring word or phrase within the clus- embedding) to automatically select a de- ter (Weiss, 2006). In contrast to this approach, scriptive phrase that best represents each the recently proposed descriptive clustering ap- document cluster. We evaluate our method proach (CEDL) (Mu et al., 2016) maps documents by comparing its performance to an ex- and candidate cluster labels into a common se- isting state-of-the-art descriptive cluster- mantic vector space (i.e., co-embedding). The ing method that also uses co-embedding co-embedding space facilitates the straightforward but relies on a bag-of-words represen- assignment of descriptive labels to document clus- tation. Results obtained on benchmark ters. The CEDL method has been shown to gener- datasets demonstrate that the paragraph ate accurate cluster labels and achieved improved vector-based method obtains superior per- clustering performance when compared to stan- formance over the existing approach in dard descriptive clustering methods. Nonetheless, both identifying clusters and assigning ap- the co-embedding is based solely on a BoW rep- propriate descriptive labels to them. resentation of the documents and is thus limited in its ability to accurately represent the semantic 1 Introduction similarity between documents. Document clustering is a well-established tech- In this paper, we investigate a specific case of nique whose goal is to automatically organise a descriptive clustering that selects a single multi- collection of documents into a number of seman- word phrase to characterise each cluster of doc- tically coherent groups. Descriptive document uments (Li et al., 2008). Firstly, we assume de- clustering goes a step further, in that each iden- scriptive phrases are to be selected from a can- tified document cluster is automatically assigned didate phrase set extracted from the corpus dur- a human-readable label (either a word or phrase) ing preprocessing. The proposed method then that characterises the semantic content of the doc- follows the co-embedding descriptive clustering uments within the cluster. Descriptive clustering paradigm of the CEDL algorithm. However, in-

991 Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 991–1001, Valencia, Spain, April 3-7, 2017. c 2017 Association for Computational Linguistics stead of using a BoW representation, we employ followed the description-comes-first (DCF) the paragraph vector (PV) (Le and Mikolov, 2014) paradigm (Osinski´ et al., 2004; Weiss, 2006; model to learn a distributed vector representa- Zhang, 2009). DCF-based methods work by tions of phrases and documents. These distributed firstly identifying a set of cluster labels, and representations move beyond unstructured BoW subsequently forming document clusters by representations by considering the local contexts measuring the relevance of each document to a in which words and phrases appear within docu- potential cluster label. DCF-based approaches ments, which provides a more precise estimate of have several shortcomings, which include poor . clustering performance and low readability of In particular, we present two extensions to cluster descriptors (Lee et al., 2008; Carpineto et the initial PV-based method that enable mod- al., 2009). els that learn a common co-embedding space More recent developments in descriptive clus- of documents and phrases. The first extension tering have proposed alternative techniques, which jointly learns co-embeddings of documents and approach the problems of improving clustering phrases. The second extension constructs ‘pseudo- performance and descriptive label quality from documents’ consisting of the lexical context sur- various different angles. For instance, Scaiella et rounding each occurrence of a particular phrase. al. (2012) identifies Wikipedia concepts in doc- Each of these contexts are treated as separate doc- uments and then computes relatedness between ument instance that are associated with a single documents according to the linked structure of embedded vector. In both cases, after clustering Wikipedia. Navigli and Crisafulli (2010) pro- the document embedding vectors, each embedded pose a method that takes into account synonymy phrase is a candidate cluster label. To select the and polysemy. Their method utilises the Google most appropriate descriptive label amongst these Web1T corpus to identify word senses based on candidates, we first rank the documents according word co-occurrences and computes the similarity to their proximity to each candidate label’s embed- between documents using the extracted sense in- ding vector and then select the phrase whose rank- formation. ing maximises the average precision for a given More recently, Mu et al. (2016) presented their cluster. co-embedding based descriptive clustering ap- We compare the results obtained by our PV- proach that learns a common co-embedding vec- based descriptive clustering method against two tor space of documents and candidate descriptive methods: spectral clustering (Shi and Malik, phrases. The co-embedded space simplifies the 2000), which only identifies clusters (but does not clustering and cluster labelling task into a more assign labels to them), and the previously intro- straightforward process of computing similarity duced CEDL method (Mu et al., 2016), which car- between pairs of documents and between docu- ries out both clustering and labelling. Experimen- ments and candidate cluster labels. tal results based on publicly available benchmark text collections demonstrate the effectiveness and 2.2 Distributed Representation superiority of our methods in both clustering per- formance and labelling quality. Distributed representation techniques are becom- ing increasingly important in a number of su- 2 Related Work pervised learning tasks, e.g., (Dai et al., 2015), text classification (Dai et al., 2.1 Descriptive Clustering 2015; Ma et al., 2015) and named entity recog- Descriptive clustering methods typically use an nition (Turian et al., 2010). A number of mod- unsupervised approach to firstly group documents els have been proposed to learn distributed word into flat or hierarchical clusters (Steinbach et or phrase representations in order to predict word al., 2000). Document clusters are then charac- occurrences given a local context (Mnih and Hin- terised using a set of informative and discrimina- ton, 2009; Mikolov et al., 2013b; Mikolov et al., tive words (Zhu et al., 2006), phrases (Mu et al., 2013a; Pennington et al., 2014). Subsequently, the 2016; Li et al., 2008) or sentences (Kim et al., PV model was proposed to learn representations 2015). of both words and documents (Le and Mikolov, Early approaches to descriptive clustering 2014; Dai et al., 2015). The PV model has been

992 shown to be capable of learning a semantically local context: richer representation of documents compared to 1 unstructured BoW models. To our knowledge, our log p(pt dt) + log p(pt c) (1) | t | t c t work constitutes the first attempt to use distributed X∈TP |C | X∈C representation models to co-embed documents and 1 + log p(ws ds) + log p(ws c) phrases for unsupervised descriptive clustering. | s | s c s X∈TW |C | X∈C where is the set of training phrase instances; 3 Proposed Descriptive Clustering TP p is the t-th phrase instance; d denotes Method t ∈ P t the document corresponding to the t-th training As outlined above, the descriptive clustering task instance; c denotes a member of the local con- (i.e., grouping documents according to semantic text t = [qt L, . . . , qt 1, qt+1, . . . , qt+L], which C − − relatedness and characterising the cluster content occurs within a window size of L of the training using a representative descriptive phrase) relies instance ( t = 2L) and consists of both words |C | heavily on learning a representation of documents and phrases qt ; likewise, is the set ∈ P ∪ W TW and phrases that can accurately capture relevant of training word instances; ws is the s-th tar- ∈ W semantic information. A particularly effective get word instance; ds denotes the document cor- strategy for descriptive clustering is to jointly map responding to the s-th training instance; and s is C documents and descriptive phrases together into a its local context with s = 2L. To summarise, |C | common embedding space (Mu et al., 2016). The the probability terms p(pt dt) and p(ws dt) model | | clustering of documents and selection of descrip- the document content information from a global tive phrases for each cluster is then carried out by level, while p(pt c) and p(ws c) model the local | | calculating the cosine similarities between docu- context. There are 2L + 1 conditional probabili- ments (to form clusters), and between documents ties estimated for each training instance. and descriptive phrases (to determine descriptive The probability of a given lexical unit labels) in the learned space. Instead of relying qt (either a word or a phrase) is mod- ∈ P ∪ W on the commonly used BoW model, we propose a elled using the vector embeddings of the + |P| |W| novel descriptive clustering approach. Our method words and phrases and the softmax function as fol- uses similarities computed from distributed joint lows: embeddings of documents and phrases, which are exp(uq>t zdt ) learned by considering both the global context p(qt dt) = (2) | q exp(uq>zdt ) provided by the document and the local context of ∈P∪W the descriptive phrases. We propose two different P exp(uq>t zc) p(qt c) = (3) strategies to learn these embeddings, as described | q exp(uq>zc) below. ∈P∪W P where uqt is a weight vector specific to the tar- z 3.1 Joint Learning of Document and Phrase get word or phrase, dt is the embedding vector Embeddings of the document corresponding to instance t, and zc is the embedding vector of a word or phrase in The first strategy jointly learns the distributed rep- the context of qt. Since the document, phrase and resentations for documents and phrases by rep- word vectors all use the same weight vector uqt resenting phrases, words, and documents as vec- to predict the target phrase, they are necessarily in tors that are used both to predict the occurrence of the same vector space. words in given documents (reflecting global doc- ument content information), and to predict the co- 3.2 Phrase Embeddings via Local Context occurrences of words and phrases within a sliding Pseudo-Documents window, to reflect the local context information. The previous model considers learning an em- We extend the PV model described in Dai et al. bedding as a multi-objective problem by trying (2015) to simultaneously generate word, phrase to predict phrases and words based on the global and document embeddings. The objective func- and local context. Besides indexing, Equation (1) tion is to maximise the log probability of words treats words and descriptive phrases interchange- and phrases conditioned on either their global or ably. An alternative approach is to treat phrases

993 as ‘pseudo-documents’ by using the sets of words solving the following optimisation problem: appearing in the local context of each phrase oc- curence. Specifically, training instances for a exp(u> xt + v> zp) max log wt wt (5) phrase’s embedding vector are constructed by ex- z p w exp(uw>xt + vw>zp) t p tracting the local context around each occurrence X∈T of a phrase in the document collection. Using P the augmented training set, consisting of both xt = [xw>t L ,..., xw>t 1 , xw>t+1 ,..., xw>t+L ]> the original documents and the additional pseudo- − − documents, we can then employ any existing PV where xt is the concatenation of all word vectors in the context of word w and u and v model (Le and Mikolov, 2014; Dai et al., 2015) to { w}w { w}w learn the document-phrase co-embeddings. for w. To find an approximate solution, the pa- rameters of the embedding vector are randomly However, due to the significant differences in initialised and optimised using stochastic gradient the sizes and numbers of documents and pseudo- descent; the gradient is calculated via backpropa- documents, there is a danger that the addition of gation (Rumelhart et al., 1986). the pseudo-documents can have a detrimental ef- fect on the performance of the model. Thus, we The PV-DBOW model simplifies the PV-DM adopt a two-stage training procedure. Firstly, an model by ignoring the local context of words in embedding model is trained using only the docu- the log probability function. The probability that ments. Then, we fix the weights of the model and a target word will appear in a given lexical context optimise the phrase embeddings by providing the is conditioned solely by the document. Dai et al. pseudo-documents as the input to the model. (2015) introduced a modified version of the PV- DBOW model that treats words and documents We have integrated the above-mentioned pro- as interchangeable inputs to the neural network. cess with two PV approaches, namely the dis- This enables the model to jointly learn word and tributed memory model (PV-DM) Le and Mikolov document embeddings in the same space; we de- (2014), and the extension of the distributed BoW note the model as PV-DBOW-W. Essentially, the model (PV-DBOW) in Dai et al. (2015). objective of the PV-DBOW-W model is a combi- In the PV-DM model, the probability that a tar- nation of both the skip-gram model (Mikolov et get word will appear in a given lexical context al., 2013b) that generates word embeddings and is conditioned on the surrounding co-occurring the PV-DBOW method which is used for learning words and also the document: document embeddings:

1 log p(wt t, dt), (4) log p(wt dt) + log p(wt c). (6) |C | t | t t W |C | c t X∈TW X∈T X∈C To optimise the embedding of a specific phrase, where wt is the target word for instance t, is the set of training word instances, denoted p, the existing word embeddings remain TW fixed, and the objective function is simplified as t = [wt L, . . . , wt 1, wt+1, . . . , wt+L] are con- C − − log p(wt p) where wt p are word in- text words that occur within a window size of L t p | ∈ T stances∈T that appear in the local contexts around words around wt, and dt denotes the document P corresponding to the t-th training instance. The the phrase. The optimal embedding vector for this probability is modelled using a softmax function. phrase is determined by solving the following op- timisation problem: For phrase p, the objective is to maximise the log (w , p) sum of the log probabilities t p p t t ∈T |C exp(uw>t zp) where wt p are the word instances that appear max log (7) z ∈ T P p w exp(uw>zp) in local context around the phrase, i.e., is the set t p Tp X∈T of word instances across all pseudo-documents, P and is the set of words that occur around the t-th where the weight vectors u are fixed. As in Ct { w}w word instance which also occur within the pseudo- the previous model, the parameters of the embed- documents for the phrase. Explicitly, the optimal ding vector are randomly initialised and optimised embedding vector for the phrase is determined by using stochastic gradient descent.

994 3.3 Descriptive Phrase Selection 4.1 Datasets Given co-embeddings of documents and phrases, We use two well-known, publicly available any clustering algorithm can be applied. We datasets: “Reuters-21578 Text Categorization Test use k-means, with the cosine similarity-based dis- Collection” from the Reuters newswire (Lewis, tance metric, to cluster the documents. Given 1997), and the “20 Newsgroups” email dataset1. the set of documents within each identified clus- We pre-process the 20 Newsgroups corpus to re- ter ,..., , the document embedding vec- move email header information while for both G1 GK tors z N and the descriptive phrase embed- datasets we extract candidate phrases using Ter- { d}d=1 ding vectors z P , we then select a descriptive mine (Frantzi et al., 2000), an automatic term ex- { p}p=1 phrase that best represents the documents assigned traction tool. to a cluster. For the Reuters corpus, we use the complete A baseline approach for descriptive phrase se- document collection for training the PV models. lection is to select the phrase whose embedding For evaluation, we use both the training and test- vector is nearest to the cluster centroid; however, ing sets of the modApte split, and select the 10 proximity to the cluster centroid is not always a categories with the largest number of documents. good indicator of cluster membership, as it ignores Moreover, we remove documents that belong to the location of documents belonging to other clus- multiple categories, this process results in an eval- ters. An ideal phrase vector should lie closer to uation set of 8, 009 documents. For the 20 News- documents within the cluster than documents out- groups dataset, we use the complete set of 18, 846 side of the cluster. Accordingly, we rank doc- documents for training the PV models. We remove uments based on their proximity to a candidate words and phrases that only appear in a single phrase and calculate the average precision of this document and then remove any empty documents. ranking (where documents belonging to the given This process results in an evaluation set of 18, 813 cluster are the true positives). documents with 20 categories, organised into 4 For cluster , we define the cluster membership higher level parent categories. Table 1 summarises G indicator for each document as: various characteristics of the employed datasets, including: a) number of documents, b) number of 1 d candidate phrases and c) category labels. y = ∈ G (8) d 0 otherwise  Table 1: Categories included in the evaluation For a given phrase p, let πp(1) be the index of subsets. ‘R10’ corresponds to the 10 largest cat- the nearest document to the phrase, and πp(i) be egories after removing documents with multiple the index of the i-th nearest neighbour. The preci- categories; the number of documents is in paren- sion after the k-nearest documents are retrieved is theses. All 20-Newsgroups categories have be- 1 k Pπp (k) = k i=1 yπp(i). The phrase which max- tween 628 and 997 documents. imises the average precision P is selected as the P p cluster descriptor Reuters - 8,009 docs - 9,984 words - 11,732 phrases R10 earn(3923), acq(2292), crude(374), trade(327), money-fx(293), interest(271), money-supply(151), 1 ship(144), sugar(122), coffee(112) p∗ = arg max P p = Pπp (k) , (9) 20 News - 18,813 docs - 43,285 words - 36,041 phrases p   |G| k sci crypt, electronics, med, space  X∈|G|  comp os.ms-windows.misc, sys.ibm.pc.hardware, graphics, windows.x, sys.mac.hardware where is the number of documents in the clus- rec autos, motorcycles, sport.baseball, sport.hockey |G| ter. mix comp.os.ms-windows.misc, rec.autos, rec.sport.baseball, sci.med, sci.space all * 4 Results

We evaluate the proposed PV-based descriptive 4.2 Paragraph Vector Models clustering methods in terms of cluster quality and descriptive phrase selection. Additionally, we In this section, we provide implementation de- show a visualisation of the co-embedding space in tails for the three PV models (PV-DBOW-WP, the supplementary material. 1http://qwone.com/˜jason/20Newsgroups/

995 PV-DBOW-W, and PV-DM), introduce a fourth Moreover, we follow the process described in Le model (PV-CAT) and explain the different settings and Mikolov (2014) to tune the learning rate. For that we use throughout the experiments. The PV- this, we set the initial learning rate to 0.025 and DBOW-WP model is used to jointly train phrase, decrease it linearly during 10 training epochs such word and document co-embeddings. For the PV- that the learning rate is 0.001 during the last train- DBOW-W and PV-DM models, we use the two- ing epoch. stage training approach, in which the document embeddings and softmax weights are trained first, 4.3 Baseline Methods and then the phrase co-embeddings are trained As our first baseline, we perform spectral clus- using pseudo-documents. A window size of 10 tering based on the affinity matrix produced ac- words around the target phrase is used as the lo- cording to the cosine similarity between the stan- cal context to create the pseudo-documents. dard term-frequency inverse document (tf-idf) rep- Each PV model has a number of parameters, in- resentation of the documents. We used the nor- cluding the dimension of the embedded spaces and malised cut (NC) spectral clustering algorithm the size of the context window. We set all em- proposed by Shi and Malik (2000). bedding dimensions to 100. For the PV-DBOW- We also compare our proposed method to W and PV-DBOW-WP model, we use a context the CEDL algorithm (Mu et al., 2016), which window of 10 words/phrases while for the PV-DM uses a measure of second-order similarity be- model a window size of 2 words (we tuned the tween phrases and documents, based on their co- size of the context window by applying the two PV occurrences at the document level, to obtain a models to a small development set of the Reuters spectral co-embedding. We use the same pa- corpus). This disparity in window size is not sur- rameters suggested in the original publication, prising since the PV-DM model considers the or- but carried out minor changes to the algorithm der of words within the local context and uses dif- to allow the method to be scaled up to larger ferent parameters for the vectors at each location datasets. To compare clustering performance, we in the context window, Equation (5), whereas an also run the CEDL algorithm without the phrase increased window size does not add additional pa- co-embeddings. rameters to the PD-DBOW model. We create an additional model, namely PV- 4.4 Evaluation of Cluster Quality CAT, by concatenating the vector representations In this experiment, we evaluate the clustering per- induced by the PV-DBOW-W and the PV-DM formance of the methods by comparing automati- models. This is performed after training the docu- cally generated document clusters against the gold ment and the phrase vectors. Intuitively, the con- standard categories. For all methods, we use k- catenation of the PV-DBOW-W and PV-DM fea- means clustering with cosine similarity as the dis- ture vectors can provide complimentary informa- tance metric. Following previous approaches (Xie tion given that the two models are trained using and Xing, 2013), we set the number of clusters a different size of context window (i.e., 10 and 2 equal to the number of gold standard categories. words, respectively). As evaluation metrics, we use the macro-averaged 2 3 Given that the size of the vocabulary is very F1 score , and normalised mutual information . large, computing the softmax function during Table 2 compares the clustering performance stochastic gradient descent is computationally ex- achieved by four PV models (PV-DBOW-WP, PV- pensive. For faster training, different optimisa- DBOW-W, PV-DM, and PV-CAT) against the per- tion algorithms can be used to approximate the formance of the baselines (i.e., the two versions log probability function. We use a combina- of the CEDL algorithm and spectral clustering via tion of negative sampling and hierachical soft- normalised cut). max via backpropagation (Mnih and Hinton, 2009; 2For each category, the maximum F1 score across the Mikolov et al., 2013b). Specifically, we use neg- clusters is used. 3 MI(G,C) ative sampling and then further optimise the em- Normalised mutual information max H(G),H(C) is de- { } beddings using hierarchical softmax. Although, fined as the mutual information between the automatically generated clusters and gold standard categories MI(G, C) these are different optimisation approaches, both divided by the maximum of the entropy of the clusters H(G) methods can be applied in this ad-hoc manner. or the categories H(C).

996 4.5 Evaluation of Cluster Labelling Table 2: Clustering performance assessed by correspondence measures to gold standard cate- In this section, we evaluate the cluster labels (i.e., gories. NC: spectral clustering via normalised multi-word phrases) selected by the proposed PV- cut, CE1: CEDL, CE2: CEDL without phrase based descriptive clustering methods. As a base- co-embeddings, PV1: PV-DBOW-WP, PV2: PV- line approach, we use the CEDL algorithm that DBOW-W, PV3: PV-DM, PV4: PV-CAT. produces a co-embedded space of documents and phrases5. NC CE1 CE2 PV1 PV2 PV3 PV4 For each document cluster, we apply the phrase F1 (macro-averaged; higher is better) selection criterion, Equation (9), to identify the R10 0.60 0.54 0.56 0.54 0.66 0.48 0.66 sci 0.89 0.89 0.88 0.91 0.91 0.90 0.92 phrase that best describes the underlying cluster. comp 0.48 0.46 0.55 0.67 0.66 0.63 0.65 Then for each gold standard category, the cluster rec 0.87 0.86 0.90 0.92 0.92 0.88 0.92 having the highest proportion of documents be- mix 0.93 0.92 0.93 0.95 0.94 0.93 0.95 all 0.56 — — 0.69 0.69 0.66 0.69 longing to the category is determined. This pro- all† 0.52 0.48 0.55 0.67 0.67 0.64 0.67 cess means some clusters are assigned to multi- Normalised mutual information (higher is better) ple categories while other clusters are left unas- R10 0.42 0.41 0.42 0.47 0.50 0.40 0.51 sci 0.68 0.68 0.66 0.72 0.71 0.70 0.74 signed. For each assigned cluster, we rank all comp 0.25 0.22 0.30 0.40 0.39 0.36 0.39 documents according to their similarity to the au- rec 0.69 0.67 0.74 0.78 0.78 0.70 0.78 tomatically selected phrase (in the co-embedding mix 0.80 0.77 0.79 0.84 0.82 0.80 0.83 all 0.51 — — 0.62 0.62 0.60 0.63 space), where documents within the cluster have all† 0.50 0.44 0.52 0.63 0.62 0.60 0.64 precedence over documents outside the cluster.

† Average of 10 random subsets using 10% of each category’s We evaluate the quality of the cluster label by documents. computing the average precision of this ranking in recalling the gold standard category. The average precision is maximised when documents closest to The PV-DBOW-W and PV-CAT methods yield the selected phrase belong to the gold standard cat- the best clustering performance on the Reuters egory. Table 3 shows the selected cluster descrip- dataset. Performance gains over the three baseline tors aligned to the gold standard categories, the av- methods range between 6% 12% (F1 score) and − erage precision and mean average precision scores 8% 9% (normalised mutual information). On − achieved by the CEDL method and PV-based de- the 20 Newsgroups dataset, the PV-DBOW-WP scriptive clustering models when applied to the and PV-CAT models outperformed the baseline Reuters and the 20 Newsgroups dataset. methods4 by approximately 2% 21% (F1 score) − The Reuters dataset presents a challenging case and 3% 18% (normalised mutual information). − for descriptive clustering methods given that the Moreover, we note that the performance achieved distribution of gold standard categories is highly by the PV-CAT model exceeds the best results re- skewed, i.e., the majority categories (e.g., ‘earn’ ported in Xie and Xing (2013) (normalised mutual and ‘acq’) correspond to more than one clus- information of 0.6159 using a multi-grain cluster- ters while the remaining clusters cover multiple ing ). smaller categories. Nonetheless, we observe that Finally, we note that co-embedding is not the automatically selected cluster descriptors are designed as a mechanism for improving clus- related to the corresponding gold standard cate- ter quality. For CEDL, the co-embedding of gories (e.g., ‘import coffee’ and ‘oil export’ for phrases reduced the clustering performances in gold standard category ‘ship’). In practice, the most datasets. This reduction in performance is skewed distribution of gold standard categories equally observed for the paragraph vector models can be addressed by using a larger number of clus- when applied to the Reuters dataset: the jointly ters in k-means, or by using a cluster algorithm trained co-embedding model (i.e., PV-DBOW- more amenable to heterogeneously sized clusters. WP) achieved a lower performance than the two- The 20 Newsgroups dataset shows a more bal- stage approach PV-DBOW-W (F1 score of 0.56 anced distribution of categories than the Reuters and 0.66, respectively). 5The spectral clustering via normalised cut and the CEDL 4Our implementation of the CEDL algorithm did not scale algorithm without document and phrase co-embeddings are up to the entire dataset (‘all’), but average results on random document clustering methods but not descriptive clustering subsets were consistent, as shown in the table. models and thus they are excluded from this experiment.

997 Table 3: Cluster descriptions and average precision (as percentages) achieved by descriptive clustering methods. CE1: CEDL, PV1: PV-DBOW-WP, PV2: PV-DBOW-W, PV3: PV-DM, PV4: PV-CAT. The average precision metric depends on not only the phrase but also the location of documents relative to the selected phrase; consequently, the average precision of a phrase may vary among the embeddings.

Category CE1 PV1 PV2 PV3 PV4 earn memotec datum 85 payable april 76 mln oper rev 89 mth jan 77 mln oper rev 88 acq undisclosed sum 80 tender offer 58 undisclosed sum 73 tender offer 70 definitive agreement 70 crude opec oil 92 venezuelan crude oil 47 oil production 88 oil export 79 oil production 91 trade european community 49 trade relation 48 trade problem 75 representative clayton 77 trade problem 79 money-fx bank rate 18 commercial lending rate 13 trade problem 17 deposit rate 22 trade problem 16 interest bank rate 68 commercial lending rate 36 week t 49 deposit rate 53 week t 45 ship european community 8 import coffee 7 raw sugar 15 oil export 11 costa rica 14 sugar european community 67 import coffee 17 raw sugar 79 representative clayton 8 costa rica 34 money-supply bank rate 9 commercial lending rate 15 week t 33 deposit rate 21 week t 37 coffee european community 8 import coffee 43 raw sugar 25 representative clayton 8 costa rica 68 Unused clusters: furman selz report improve earning share takeover offer definitive agreement share takeover offer loss nil undisclosed term high earning april record revenue growth bid null group turnover tax profit exclude loss april record record today current qtr april record mln net mln dividend present intention net shr profit Mean average precision 48 36 54 42 54

sci.crypt clipper key 93 key registration 94 back door 95 encryption method 95 back door 96 sci.electronics transistor circuit 87 power amp 90 voltage divider 89 radio shack 86 circuit board 90 sci.med other doctor 93 tech people 95 other treatment 94 other symptom 94 other treatment 95 sci.space earth orbit 94 deep space 95 first spacecraft 95 low earth orbit 94 low earth orbit 96 comp.os.ms-windows.misc trident 8900c 24 enhance mode 62 ms speaker sound driver 60 window version 55 dos app 59 comp.graphics image file 44 art scene 70 art scene 74 swim chip 68 facet based modeller 70 comp.sys.ibm.pc.hardware scsi hard drive 47 ide drive 60 tape drive 51 cmos setup 55 tape drive 52 comp.windows.x application code 56 other widget 85 return value 77 widget name 79 return value 78 comp.sys.mac.hardware apple price 59 mac lc 69 apple price 63 extra box 47 apple price 63 rec.autos auto car 92 luxury sedan 94 same car 92 new car 89 sport car 94 rec.motorcycles other bike 95 cruiser rider 95 same site 94 dod ama icoa nia 89 waterski bike 95 rec.sport.baseball padded bat 88 leadoff hitter 94 playoff team 95 total baseball 91 playoff team 95 rec.sport.hockey hockey playoff 92 cup final 94 nhl results 96 cup final 90 cup final 95 comp.os.ms-windows.misc dos window font 95 auto show 97 dos window 97 dos window 97 dos window 98 rec.autos bmw car 96 other car 97 same car 96 new car 97 other car 97 rec.sport.baseball baseball fan 98 worst team 99 more game 98 last season 98 baseball season 99 sci.med other doctor 94 other medical problem 96 many patient 94 other symptom 93 many patient 95 sci.space earth orbit 94 japanese space agency 95 solar power 94 lunar surface 94 solar power 96 Mean average precision 80 88 86 84 87 corpus, and we note that all descriptive cluster- for the “20 Newsgroups” dataset. It can be noted ing methods were able to identify meaningful clus- that although neither of the two documents explic- ter descriptors that have a clear correspondence to itly contain the input phrase, the first discusses a the gold standard categories (e.g., ‘window ver- semantically similar topic, and the second uses the sion’ and ‘dos app’ for the category ‘comp.os.ms- acronym GUI, i.e., graphical user interface. windows.misc’). As another example, we generate a two- With regard to the mean average precision, we dimensional visualisation of the document-phrase observe that the PV-DBOW-W and PV-CAT mod- co-embeddings using t-SNE (van der Maaten els obtained the best performance. Moreover, the and Hinton, 2008) that demonstrates how co- PV-CAT model achieved statistically significant embedded phrases can be used as ‘landmarks’ for improvements over the CEDL baseline in terms exploring a corpus. For this example, we use the of the average precision across the 28 categories ‘sci’ categories from the 20 Newsgroup corpus and while no significant6 improvement was observed select the 200 most frequent phrases in this subset. for the remaining three PV-based models. As input to t-SNE, we use the chordal distance de- The results that we obtained demonstrate that fined by the cosine similarity in the co-embedding the PV-based co-embedding space can effectively space and set the perplexity level to 40. Figure 1 capture semantic similarities between documents in the supplementary material shows the visuali- and phrases. An illustrative example of this is sation with the cluster boundaries, location of the shown in Table 4. In this example, we selected two documents and co-embedded phrases. documents that neighbour the phrase “user inter- face” in the PV-CAT co-embedded feature space 5 Conclusion Descriptive document clustering helps informa- 6 For significance testing, we used a paired sign-test, with tion retrieval tasks by automatically organising a significance threshold of 0.05 and Bonferroni multiple test correction for the 4 tests; the uncorrected p-value for the PV- document collections into semantically coherent CAT model is 0.0009. groups and assigning descriptive labels to each

998 web clustering engines. ACM Computing Surveys Table 4: Two documents whose vector embed- (CSUR), 41(3):17. dings were the 5th and 6th nearest neighbours (ac- cording to the cosine of the angle of the corre- Andrew M. Dai, Christopher Olah, and Quoc V. Le. sponding vectors) to the phrase “user interface” in 2015. Document embedding with paragraph vec- the PV-CAT based co-embedded space. tors. CoRR, abs/1507.07998. train/comp.windows.x 67337 Does anyone know the difference between MOOLIT and Katerina Frantzi, Sophia Ananiadou, and Hideki OLIT? Does Sun support MOOLIT? Is MOOLIT avail- Mima. 2000. Automatic recognition of multi-word able on Sparcstations? MoOLIT (Motif/Open Look Intrin- terms: The c-value/nc-value method. International sic Toolkit allows developers to build applications that can Journal on Digital Libraries, 3(2):115–130. switch between Motif and Open Look at run-time, while OLIT only gives you Open Look. Sean Kandel, Ravi Parikh, Andreas Paepcke, Internet: [email protected] Joseph M. Hellerstein, and Jeffrey Heer. 2012. Pro- test/comp.windows.x 68238 filer: Integrated statistical analysis and visualization Hi there, I’m looking for tools that can make X programming easy. I for data quality assessment. In Proceedings of the would like to have a tool that will enable to create X mo- International Working Conference on Advanced tif GUI Interactivly. Currently I’m Working on a SGI with Visual Interfaces, pages 547–554. forms. A package that enables to create GUI with no cod- ing at all (but the callbacks). Sun Kim, Lana Yeganova, and John W. Wilbur. 2015. Any help will be appreciated. Summarizing topical contents from PubMed docu- Thanks Gabi. ments using a thematic analysis. In Proceedings of the 2015 Conference on Empirical Methods in Nat- ural Language Processing, pages 805–810. Associ- group. In this paper, we have presented a descrip- ation for Computational Linguistics. tive clustering method that uses paragraph vec- tor models to support accurate clustering of doc- Quoc Le and Tomas Mikolov. 2014. Distributed repre- sentations of sentences and documents. In Proceed- uments and selection of meaningful and precise ings of the 31st International Conference on Ma- cluster descriptors. Our PV-based approach maps chine Learning (ICML-14), pages 1188–1196. phrases and documents to a common feature space to enable the straightforward assignment of de- Yeha Lee, Seung-Hoon Na, and Jong-Hyeok Lee. scriptive phrases to clusters. We have compared 2008. Search result clustering using label language model. In Proceedings of the Third International our approach to another state-of-the-art algorithm Joint Conference on Natural Language Processing: employing a co-embedding based on bag-of-word Volume-II. representations. The PV-based descriptive clus- tering method achieved superior clustering perfor- David D. Lewis. 1997. Reuters-21578 text catego- rization test collection, distribution 1.0. Available mance on both the Reuters and the 20 Newsgroups at https://archive.ics.uci.edu/ml. datasets. An evaluation of the selected cluster de- scriptors showed that our method selects informa- Yanjun Li, Soon M. Chung, and John D. Holt. 2008. tive phrases that accurately characterise the con- Text document clustering based on frequent word tent of each cluster. meaning sequences. Data & Knowledge Engineer- ing, 64(1):381–404.

Acknowledgments Chenglong Ma, Weiqun Xu, Peijia Li, and Yonghong Proceedings of the 1st Workshop on Vec- This research was partially funded by the Medi- Yan, 2015. tor Space Modeling for Natural Language Process- cal Research Council grant MR/L01078X/1 “Sup- ing, chapter Distributional Representations of Words porting Evidence-based Public Health Interven- for Short Text Classification, pages 33–38. Associ- tions using Text Mining”. ation for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word represen- References tations in vector space. Workshop at International Ujwala Bharambe and Archana Kale. 2011. Land- Conference on Learning Representations (ICLR). scape of web search results clustering algorithms. In Advances in Computing, Communication and Con- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- trol, pages 95–107. Springer. rado, and Jeff Dean. 2013b. Distributed representa- tions of words and phrases and their compositional- Claudio Carpineto, Stanislaw Osinski,´ Giovanni Ro- ity. In Advances in Neural Information Processing mano, and Dawid Weiss. 2009. A survey of Systems, pages 3111–3119.

999 Andriy Mnih and Geoffrey E. Hinton. 2009. A scal- Laurens van der Maaten and Geoffrey Hinton. 2008. able hierarchical distributed language model. In Ad- Visualizing data using t-SNE. Journal of Machine vances in Neural Information Processing Systems, Learning Research, 9:2579–2605. pages 1081–1088. Dawid Weiss. 2006. Descriptive Clustering as a Tingting Mu, John Y. Goulermas, Ioannis Korkontze- Method for Exploring Text Collections. Ph.D. thesis, los, and Sophia Ananiadou. 2016. Descriptive doc- Poznan University of Technology, Poznan,´ Poland. ument clustering via discriminant learning in a co- embedded space of multilevel similarities. Journal Pengtao Xie and Eric P. Xing. 2013. Integrating docu- of the Association for Information Science and Tech- ment clustering and topic modeling. In Proceedings nology, 67(1):106–133. of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence. C. Filipe Nassif and Eduardo Raul Hruschka. 2013. Chengzhi Zhang. 2009. Document clustering descrip- Document clustering for forensic analysis: An ap- tion based on combination strategy. In Fourth Inter- proach for improving computer inspection. Infor- national Conference on Innovative Computing, In- mation Forensics and Security, IEEE Transactions formation and Control (ICICIC), pages 1084–1088. on, 8(1):46–54. IEEE. Roberto Navigli and Giuseppe Crisafulli. 2010. Induc- Peixin Zhao and Cun-Quan Zhang. 2011. A new clus- ing word senses to improve web search result clus- tering method and its application in social networks. tering. In Proceedings of the 2010 Conference on Pattern Recognition Letters, 32(15):2109–2118. Empirical Methods in Natural Language Process- ing, pages 116–126. Association for Computational Ye-Hang Zhu, Guan-Zhong Dai, Benjamin C.M. Fung, Linguistics. and De-Jun Mu. 2006. Document cluster- ing method based on frequent co-occurring words. Stanisław Osinski,´ Jerzy Stefanowski, and Dawid In Proceedings of the 20th Pacific Asia Confer- Weiss. 2004. Lingo: Search results clustering al- ence on Language, Informatics, and Computation gorithm based on singular value decomposition. In (PACLIC), pages 442–445. Intelligent Information Processing and Web Mining, pages 359–368. Springer. A Supplementary Material Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Associa- tion for Computational Linguistics.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back- propagating errors. Nature, 323:533–536.

Ugo Scaiella, Paolo Ferragina, Andrea Marino, and Massimiliano Ciaramita. 2012. Topical clustering of search results. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 223–232.

Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905.

Michael Steinbach, George Karypis, and Vipin Kumar. 2000. A comparison of document clustering tech- niques. In KDD Workshop on Text Mining, pages 525–526.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-. In Proceed- ings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. As- sociation for Computational Linguistics.

1000 law enforcement clipper chip Cluster 1 escrow agency Cluster 2 session key Cluster 3 encryption algorithm Cluster 4 key escrow sci.crypt escrow agent sci.electronics court order law enforcement agency sci.med legal authorization cellular phone sci.space government official other word legitimate user back door law enforcement block escrow house serial number phone company unit key phone conversation enforcement block drug dealer law enforcement field possible key encryption scheme screw thread phone call key escrow system other end enforcement agency security device public key secret key private key strong crypto unauthorized release laissez passer product cipher escrowe key phone line or bl electronic mail crypto system assemble byte hash function wiretap chip mov cx black market floppy disk encryption technology prejudice ucc first step anonymous ftp encryption device press release more information government agency fact sheet anyone know new technology source code strong cryptography public domain general public ftp site encryption product nuclear weapon pgp signature voice encryption civil liberty high speed begin pgp encryption system general purpose available today commercial machine copy protection anonymous posting few year email address other hand phone number good luck power supply anonymous post more info bad idea same thing system administrator good thing ground conductor good way night sky other thing light pollution ground wire cd player hot wire mailing list first place same time breast cancer other people only way good reason duty cycle such thing few day tremendous goalkeeping lunar base more detail last year few month few week other way space station little bit radar detector news group ham radio good example perfect safety space program high frequency other side nice nm oleo first time original poster good point commercial space make subscriber launch system low cost launch vehicle good idea space shuttle human stupidity many people standard disclaimer commercial launch many year new one first flight corona discharge laser printer long time same way solar sail solar array kirlian photography nuclear plant fossil fuel concrete floor solar system lunar surface whole thing lunar orbit long way heavy water other day large amount medical school large number high school small amount magnetic field physical universe black hole short time smokeless tobacco gamma ray lyme disease neutron star chinese food hiv infection chinese restaurant immune system chinese restaurant syndrome infectious disease side effect private sector oxalic acid health care high dose several year public health weight gain tobacco use kidney stone cancer center conventional wisdom heart disease immune response amino acid clinical trial

candida bloom mucus membrane yeast infection anecdotal evidence

Figure 1: t-SNE visualisation of the PV-CAT co-embedding of the 20 Newsgroups ‘sci’ categories. Documents are represented by markers corresponding to the gold standard categories. Text boxes show the 200 most frequent phrases with nearby co-embedded phrases aggregated together. To show the correspondence between the categories and the phrase embedding, the text boxes are coloured based on the majority category of the documents nearest to each phrase. Sets of phrases with no majority category are left white. Hatch lines in the background denote the boundaries of each cluster, where the hatch angle (and colour) is based on the cluster. In the embedding space, each cluster is a convex set, but the t-SNE algorithm preserves local neighbourhoods and may fragment the clusters. (Best viewed with digital magnification.)

1001