Journal of Theoretical and Applied Information Technology 15th September 2020. Vol.98. No 17 © 2005 – ongoing JATIT & LLS

ISSN: 1992-8645  E-ISSN: 1817-3195  www.jatit.org

A COMPARISON STUDY OF DOCUMENT CLUSTERING USING DOC2VEC VERSUS TFIDF COMBINED WITH LSA FOR SMALL CORPORA

1AMALIA AMALIA, 2OPIM SALIM SITOMPUL, 3ERNA BUDHIARTI NABABAN, 4TEDDY MANTORO

1 Department of Computer Science, Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Medan, Indonesia
2,3 Department of Information Technology, Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Medan, Indonesia
4 Department of Computer Science, Universitas Sampoerna, Jakarta, Indonesia

E-mail: [email protected]

ABSTRACT

The selection of a suitable word vector representation is one of the essential parameters in document clustering because it affects clustering performance. An excellent word vector representation will generate a good clustering result, even when using a simple clustering algorithm like K-Means. Doc2Vec, one of the word vector representations, has been extensively studied on large text datasets and proven to outperform traditional word vector representations in document categorization. However, only a few studies analyze word vector representations of small corpora. Observing the learning process on a small corpus is also crucial because, in some cases, a large corpus is not available, particularly for low-resource languages like Bahasa Indonesia. Moreover, the clustering of small datasets also plays an essential role in pattern recognition and can be an initial step before applying the analysis to a larger corpus. This experimental study aims to explore in more depth the comparison of document clustering using Doc2Vec versus TFIDF-LSA for small corpora in Bahasa Indonesia. In this study, the quality of a word vector representation is measured by the cluster performance using intrinsic and extrinsic measurements. The study also measures each word representation's time and memory consumption, and it tunes the appropriate hyper-parameters to obtain an optimal word vector representation. The word vector representations were tested on various sizes of small corpora using the K-Means algorithm. As a result of this study, TFIDF-LSA obtains better cluster performance, while the Doc2Vec model achieves better time and memory usage efficiency.

Keywords: Clustering, Word Vector Representation, Clustering Comparison, Small Corpora

1. INTRODUCTION

The growth of the internet has made the amount of text data on the internet grow explosively as well. Therefore, automating document categorization tasks such as classification and clustering has become essential. The drawback of supervised text categorization such as classification is the need for annotated linguistic resources. This is a challenge for languages that do not have adequate annotated linguistic resources available, like Bahasa Indonesia. Manually annotating linguistic resources from scratch, such as building an annotated corpus, requires a great deal of time and human labor. Therefore, text categorization using a clustering approach can be the best solution. Document clustering is an unsupervised machine learning technique that automatically groups documents into clusters based on document similarity [1], [2]. Clustering is one of the most important tasks in data analysis [3], and it does not need labels pre-defined for each group by human labor. Many algorithms for document clustering have been proposed, for instance K-Means [4], K-Means++ [5], and the Self-Organizing Map [6], among others. In addition to the choice of clustering algorithm, the other essential aspect of generating good clustering performance is selecting an appropriate word vector representation. Word vector representation is a feature representation for data input in textual form.


Word vector representation is also known as text representation; in this study, we use these two terms interchangeably. Text representation is a pre-processing stage in machine learning for data input in textual form, so that the data can be processed further by machine learning. One of the simple methods of word vector representation is Bag of Words (BoW). Instances of BoW methods are one-hot encoding and term frequency-inverse document frequency (TFIDF) [7]. BoW is a count-based text representation method that treats documents as a set of distinct atomic words; therefore, this method cannot preserve semantic and syntactic information. It means that relationships between words, such as synonyms, are not incorporated [1]. Another drawback of BoW is sparsity and the generation of high-dimensional vectors. Despite these deficiencies, BoW is still widely used because of its simplicity and surprising accuracy. Many previous studies have tried to compensate for the shortcomings of BoW. For instance, a study by [8] implemented Latent Semantic Analysis (LSA) on top of TFIDF to capture semantics and reduce dimensionality; implementing LSA on TF-IDF was proven to increase cluster performance. Another previous study by [9] added a synonym vocabulary check in the pre-processing stage to compensate for the synonym deficiency of the TFIDF representation. Both of these studies were implemented on small corpora and gained a Purity of about 75%.

Besides BoW, another alternative concept of word vector representation is word embeddings. A word embedding is a low-dimensional, dense, and real-valued vector representation [10]. A word embedding is generated using a prediction-based neural network approach that learns representations from a large text corpus. This process generates word vector representations that capture syntactic and semantic aspects [11]. Therefore, words that have similar meanings and a close correlation will have similar word vector representations as well. Many previous studies in Natural Language Processing (NLP) stated that word embedding representations increase the performance of NLP tasks, including document categorization [11]. Using word embeddings is certainly an added advantage for the document clustering process, because the same words in documents with a similar topic tend to have similar word vector representations. Word embedding models have been researched in previous studies by [12], [13] and proven to outperform BoW for various NLP tasks. Instances of word embedding models are word2vec for word-level representation and Doc2Vec for document-level representation. In document categorization, Doc2Vec yields higher classification accuracy than other document representation methods in various domains, such as sentiment classification, news categorization, and forum question duplication [14]. However, most studies stating that word embedding outperforms BoW used a large corpus [15] for the learning process. Compared to the BoW model, word embedding certainly has an advantage when implemented on a large dataset: since word embedding does not consume as much memory as classic methods like TFIDF and LSA [15], many researchers use as much data as possible for training. Moreover, the assumption that more data is better for capturing semantic and syntactic information means that learned representations are typically trained on large corpora containing billions of tokens. However, it is still unclear how many documents an embedding model needs to generate a good word embedding representation. There are only a few studies that analyze semantic representations of small corpora [15]. There should be more observation of small corpora, particularly because, in some cases, large corpora are not always available [16], for example in some low-resource languages [17] or in domain-specific areas like medical corpora [18]. Moreover, the clustering of small datasets also plays an important role in pattern recognition [19], where we can predict the behavior of unseen data from the training data; this task is also referred to as the learning process [20]. Observation on a small dataset can also be an initial step before applying the analysis result to a bigger corpus. Even though clustering does not need labeled data for training, ground truth labels are still needed to measure cluster performance accuracy; therefore, analyzing a small corpus first is the right decision.

The research question is: do word embeddings outperform the TF-IDF and LSA representation for small corpora? It is still not clear whether word embedding outperforms classical models on a small corpus. A previous study by [21] found that LSA produced better performance than word embedding models on small to medium sizes of the training corpus. However, a previous study by [22] concluded that corpus size is not always the main parameter in generating a good word embedding. This study also revealed that the model could extract linguistic information from a small domain-specific corpus and obtain a satisfactory result.

Based on these two contrary statements, we want to explore in more depth which text representation method is better for small corpora.


Unfortunately, obtaining the most suitable word representation is not easy; there is no universally good representation, and the choice of representation is also determined by domain knowledge [20]. However, we can measure the quality of a word representation based on the performance of the clustering result. If the word vector representation is right, even a simple clustering algorithm like K-Means will find the data cluster pattern and generate a good clustering result [20].

In this study, to explore in more depth which text representation method is better for small corpora, we compare two word vector representation models: a classical model, TF-IDF combined with LSA, versus a word embedding model, Doc2Vec. To simplify, we use the term TFIDF-LSA to refer to TF-IDF combined with LSA. We generated a small corpus that contains only 500 articles in Bahasa Indonesia. We also split the corpus into three smaller sub-corpora that contain 60, 125, and 250 articles. Further, to find the clustering performance, the K-Means algorithm was implemented for each model. To measure the cluster performance, we used intrinsic and extrinsic cluster evaluation measurements. We implemented the Silhouette Coefficient for intrinsic measurement. For extrinsic cluster performance measurement, we implemented Purity and the adjusted Rand index (ARI). Besides the clustering performance, the comparison in this study is also concerned with the clustering processing time and memory consumption of each model. This study also tunes various parameter initializations for each model to get the best model.

Our contribution is a comparison result based on a quantitative experiment that other researchers can consider when choosing a suitable word representation for a small corpus, particularly for Bahasa Indonesia, based on cluster performance, time, and memory consumption. The small dataset observation can also reveal the quality of word representations for a low-resource language like Bahasa Indonesia. As far as we know, there is still no previous observation comparing word vector representations on a small corpus for Bahasa Indonesia.

The rest of this paper is organized as follows: In Section 2, we describe the literature review. In Section 3, the related works by previous studies are described. In Section 4, we explain the methodology of this study, the parameter initialization, and the cluster performance measures. The experimental results are discussed in Section 5. Finally, in Section 6, we conclude the current work with a few future research directions.

2. LITERATURE REVIEW

Word vector representation for text data in machine learning is a process to transform input objects into numbers or vectors. In NLP tasks such as classification and clustering, this process is essential because it affects the performance and the result. In this study, we compare two kinds of feature representations in the document clustering task, which are the TFIDF-LSA and Doc2Vec models.

2.1 TFIDF-LSA

TFIDF [7] is one method to represent text documents as vectors. TFIDF contains two calculations, which are TF and IDF. TF, or term frequency, is the number of times that the term t occurs in document d; the more frequent the term, the higher the TF value. Meanwhile, IDF, or inverse document frequency, is a calculation based on the number of documents in the collection D that contain the term t. TFIDF is described in formula (1).

tfidf(t, d, D) = tf(t, d) . idf(t, D)     (1)

TFIDF generates high values for the essential terms in a document and filters out the common terms that occur in many documents. The feature representation in this method comes as a feature matrix with as many dimensions as the total number of unique words in the whole corpus. Some pre-processing stages are applied to reduce the dimension of the vector; for instance, a threshold cut-off can be used to keep only those words with high values. TFIDF is one of the BoW methods, which treat a word as an atomic unit without considering the relationship of the word to others in the corpus. Therefore, this method cannot handle meaning representation; for example, synonyms are not incorporated. To account for meaning representation, TFIDF can be combined with Latent Semantic Analysis (LSA) [23], [24]. LSA is one of the most used methods for word meaning representation in BoW representations. Besides handling meaning representation, LSA can also be used to reduce the dimensionality of TFIDF matrices. The dimensionality reduction is implemented by a truncated Singular Value Decomposition (SVD), which projects every word into a subspace with a pre-defined number of dimensions. Once the vectorial representation of words is obtained, the similarity between two terms is typically computed by the cosine of the angle between them [15].
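As an illustration of the TFIDF-LSA representation described above, the following minimal sketch builds it with scikit-learn's TfidfVectorizer and TruncatedSVD. The toy documents and parameter values (including the number of LSA dimensions) are illustrative assumptions, not the configuration used in this study.

```python
# Minimal sketch (not the authors' code): a TFIDF representation followed by
# truncated SVD (LSA) and length normalization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

documents = [                                   # toy Bahasa Indonesia documents
    "pemerintah menaikkan suku bunga acuan",
    "bank sentral menahan suku bunga",
    "tim nasional memenangkan pertandingan final",
    "pelatih memuji penampilan tim nasional",
]

tfidf = TfidfVectorizer(ngram_range=(1, 2))     # unigrams and bigrams (n = 2)
lsa = TruncatedSVD(n_components=2, random_state=0)  # pre-defined number of LSA dimensions

pipeline = make_pipeline(tfidf, lsa, Normalizer(copy=False))
doc_vectors = pipeline.fit_transform(documents)  # dense, low-dimensional LSA vectors
print(doc_vectors.shape)
```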


2.2 Doc2Vec

Word embedding is a method to transform text into a dense vector of real numbers using a neural network approach. A study by [10], in the form of a feed-forward neural network language model, was one of the pioneers of word embedding. A study by [25] proposed a simpler method that produces high-quality vectors, called word2vec. Word embedding with word2vec begins with a learning process on a collection of text documents, or corpus. The learning process aims to exploit the statistical properties of the textual structure in a vector space. Words with similar meanings tend to be located close to each other in this vector space. This hypothesis relies on the idea that words with similar meanings tend to occur in similar contexts [26]. The learning process collects information about the neighboring words (context words) of each word (target word) in the corpus.

Further, fully connected feed-forward neural networks are implemented to predict context words based on the target word, or vice versa. The aim is to get the optimal prediction based on the information in the learning corpus. Word embeddings are trained using stochastic gradient descent, and the gradient is obtained via backpropagation. The weights are adjusted to get the optimal prediction, and the word embeddings are these adjusted weights. Once the neural network has been trained, the learned linear transformation in the hidden layer is considered the word representation [15]. There are two architectures in word2vec, which are Continuous Bag-of-Words (CBOW) and Skip-gram. The CBOW architecture predicts the target word based on the context words. Meanwhile, Skip-gram is an architecture that predicts context words based on the target word. As a result of word embedding, words with a similar meaning are mapped to a similar position in the vector space [27]. Word2vec is a tool that provides an efficient implementation of the CBOW and Skip-gram architectures for computing vector representations of words [28]. This tool learns to project words into a latent d-dimensional space with an n-sized window, where the window size is a parameter that determines the number of context words.

Some researchers also try to build embedding models at the document level to extend the level of feature representation in some NLP tasks. However, when using word embedding models to create document-level representations, the word vectors need to be aggregated somehow. A general approach to calculating a document embedding is to simply take the mean of the word vectors of all terms in the document [29]. Another method to train document-level representations is the Doc2Vec algorithm. Doc2Vec is an extension of the word2vec algorithm by Le and Mikolov [30]. Doc2Vec was built based on the paragraph vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents [30].

The primary task of document embedding is to determine an appropriate distributed representation for a single document by learning a neural network with a word's information and its surrounding words in the document. The vector representation is trained to predict words in a paragraph. The Doc2Vec algorithm concatenates the paragraph vector with several word vectors from a paragraph and predicts the following word in the given context. In the Doc2Vec approach, each document has its own vector in the same space as the words. Thus, the distributed representations for both words and documents are learned simultaneously. There are two algorithms of Doc2Vec, which are the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW). PV-DM is a document-embedding algorithm that randomly takes a sequence of words from a document and tries to predict a target word from the randomly sampled set of context words given as input. Meanwhile, PV-DBOW is an algorithm that generates document embeddings without considering the word order in documents. There are some hyper-parameters that need to be tuned to generate a good vector for document embedding, such as the window size, the dimension, and the minimum word count. With the right parameters, the feature representation of the documents will improve document clustering.
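As an illustration only, a PV-DBOW Doc2Vec model of the kind described above can be trained with the Gensim library roughly as follows. The toy documents and parameter values are assumptions, not the settings of this study, and the Gensim 4.x API is assumed (document vectors live in model.dv; older versions expose model.docvecs instead).

```python
# Minimal sketch (assumed setup, not the authors' code): training PV-DBOW with Gensim.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenized_docs = [                                    # toy tokenized documents
    ["pemerintah", "menaikkan", "suku", "bunga"],
    ["tim", "nasional", "memenangkan", "pertandingan"],
]

# Each training document is wrapped in a TaggedDocument with a unique tag.
corpus = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(tokenized_docs)]

# dm=0 selects PV-DBOW; vector_size, window, min_count and epochs are the
# hyper-parameters discussed above (values here are illustrative).
model = Doc2Vec(corpus, dm=0, vector_size=300, window=8, min_count=1, epochs=100)

doc_vecs = [model.dv[i] for i in range(len(corpus))]  # learned document vectors
new_vec = model.infer_vector(["harga", "minyak", "naik"])  # vector for an unseen document
```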
2.3 Document Clustering with K-Means

Clustering is an unsupervised process that groups a set of objects based on the similarity between the objects. A good clustering will split the collection of data objects into k clusters: data objects that are similar to one another are grouped in one cluster, and different data objects are put in different clusters. One of the methods for document clustering is the K-Means algorithm [20]. K-Means clustering is one of the partitioning hard clustering methods, which splits a given dataset into a fixed number (k) of clusters. In K-Means, we have to set the value of k (the number of clusters) as the first step. Based on the value of k, K-Means chooses k points randomly as centroids, which are used as the starting points of the clusters, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. The process of centroid adjustment is repeated until the values of the centroids stabilize. The final centroids are used to produce the final clustering of the input data, each with a class identity.
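A minimal sketch of this K-Means step with scikit-learn is shown below; the random document vectors stand in for real TFIDF-LSA or Doc2Vec vectors, and the parameter values are illustrative.

```python
# Minimal sketch (illustrative values): clustering document vectors with K-Means.
import numpy as np
from sklearn.cluster import KMeans

doc_vectors = np.random.rand(20, 50)    # stand-in for TFIDF-LSA or Doc2Vec vectors

# k must be chosen up front; the centroids are refined iteratively until stable.
kmeans = KMeans(n_clusters=5, n_init=10, max_iter=300, random_state=0)
labels = kmeans.fit_predict(doc_vectors)   # one cluster id per document
centroids = kmeans.cluster_centers_        # final centroid positions
```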


2.4 Cluster Evaluation Measure

Cluster evaluation measurement is as essential as the clustering algorithm itself. A clustering evaluation demands an independent and reliable measure for assessing and comparing clustering experiments and results. However, cluster evaluation is not as trivial as classification evaluation. Evaluating a clustering algorithm's performance is not just counting the number of correct assignments, like the precision and recall measurements of a supervised classification algorithm. In clustering, a good cluster is judged on internal and external criteria. Text documents grouped in the same cluster (intra-cluster) should have high similarity; on the other hand, documents in different clusters (inter-cluster) should be dissimilar. In cluster evaluation, the evaluation metric should not take the absolute values of the cluster labels into account; rather, the evaluation is about whether the separation of the data resembles some ground truth set of classes, in which members belonging to the same class are more similar than members of different classes according to some similarity metric [31]. There are two kinds of evaluation methods in document clustering, namely intrinsic and extrinsic measures.

Intrinsic or internal measures of quality, such as distortion or log-likelihood, indicate how well an algorithm optimized a particular representation. Intrinsic comparisons are inherently limited by the given representation, in other words dependent on the feature representation; therefore, intrinsic measures cannot compare different representations [32]. Intrinsic measures calculate the cluster separation and cohesion. The advantage of this method is that it does not require ground truth labels. An example of this method is the silhouette coefficient. The silhouette coefficient calculates how similar an object is to its own cluster (cohesion) compared to the other clusters (separation). In other words, the silhouette coefficient s(i) is calculated from the mean intra-cluster distance a(i) and the mean nearest-cluster distance b(i) for each sample i in its cluster C. The best value of the silhouette coefficient is 1, and the worst value is -1. The silhouette is described in formula (2).

s(i) = (b(i) - a(i)) / max(a(i), b(i)),  if |C| > 1     (2)

On the other hand, extrinsic or external measures are able to compare clustering results from different feature representation methods. These measures of quality compare a clustering to an external knowledge source, such as ground truth labels. Examples of extrinsic measures are Purity, the Rand index (RI), and the adjusted Rand index (ARI). Purity is a simple and transparent evaluation measure: it sums, over clusters, the number of objects of the most frequent class in each cluster and divides by the total number of objects N. The formula of Purity can be seen in formula (3), where M is the set of clusters and D is the set of classes.

Purity(M, D) = (1/N) sum over m in M of max over d in D of |m intersect d|     (3)

The highest Purity score is 1, which means each cluster contains documents of a single class, and the worst Purity score is 0. Purity cannot be used to compare the quality of clusterings with different numbers of clusters.

The Rand index is a method to calculate cluster quality against a benchmark classification; in other words, RI is the percentage of correct decisions made by the algorithm. The formula of RI is described in formula (4), where the denominator is the total number of element pairs, n(n - 1)/2.

RI = (a + b) / (a + b + c + d)     (4)

Here, S is a set of n elements, and X and Y are the two partitions being compared (the clustering and the ground truth classes); a is the number of pairs of elements in S that are in the same subset in X and the same subset in Y, b is the number of pairs of elements in S that are in different subsets of X and different subsets of Y, c is the number of pairs of elements in S that are in the same subset of X and different subsets of Y, and d is the number of pairs of elements in S that are in different subsets of X and the same subset of Y. The RI score is between 0 and 1. The drawback of the RI measurement is that the RI does not have a constant baseline, implying that the measure is not comparable across clustering methods with different numbers of clusters. The raw RI score is therefore adjusted for chance into the ARI score using the following scheme:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)     (5)

The ARI value is between 0 and 1, with 1 representing identical partitions, and it is adjusted for the number of partitions in X and Y.
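The three scores used in this study can be computed roughly as in the following sketch. Silhouette and ARI are available in scikit-learn, while Purity is implemented here directly from formula (3). The data are random stand-ins, so this is an illustration of the measures, not the authors' evaluation script.

```python
# Minimal sketch (not the authors' code): silhouette, Purity, and ARI.
import numpy as np
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def purity(true_labels, cluster_labels):
    # Sum, over clusters, the size of the dominant class, then divide by N (formula 3).
    cm = contingency_matrix(true_labels, cluster_labels)
    return cm.max(axis=0).sum() / cm.sum()

rng = np.random.default_rng(0)
X = rng.random((30, 10))                        # stand-in document vectors
cluster_labels = rng.integers(0, 3, 30)         # stand-in K-Means output
true_labels = rng.integers(0, 3, 30)            # ground-truth categories

sil = silhouette_score(X, cluster_labels)       # intrinsic: cohesion vs separation
pur = purity(true_labels, cluster_labels)       # extrinsic, formula (3)
ari = adjusted_rand_score(true_labels, cluster_labels)  # extrinsic, formula (5)
print(round(sil, 3), round(pur, 3), round(ari, 3))
```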
3. RELATED WORKS

A study by [8] experimented with various pre-processing stages and various clustering algorithms to increase clustering accuracy.


This study is an experimental study to observe a suitable clustering process for Bahasa Indonesia with a small corpus that contains only 100 documents. The experiments were conducted with several clustering algorithms, such as K-Means, K-Means++, and Agglomerative clustering, and with various pre-processing stages. The results of this study indicate that the K-Means and K-Means++ algorithms are superior. This study also concludes that the TFIDF-LSA pre-processing produces the best cluster purity for k = 2, followed by k = 3 and k = 4; it shows that the more clusters are formed, the lower the cluster purity value. The study implemented the TFIDF word vector representation combined with LSA at various percentages.

Another previous related study by [9] implemented document clustering using the BoW representation approach and aimed to anticipate a BoW drawback, namely being unable to identify synonymous words. The study proposed an additional pre-processing stage, which is a dictionary synonym checking function. The study also implemented LSA to reduce the dimension of the BoW representation. The experimental results indicated that the addition of the synonym checking function could increase the cluster quality by up to 13%.

A study by [14] aims to categorize documents using a classification approach. This study also focuses on various document representation methods such as LSA, Latent Dirichlet Allocation (LDA), and Doc2Vec. The study implemented semi-supervised learning (SSL) to improve classification performance, particularly in the labeling process. The study used five popular English corpora, such as 20 Newsgroups, Reuters, SemEval, and OhSumed. LSA requires the highest dimension, followed by LDA and Doc2Vec. In the results of this study, Doc2Vec yields the highest classification performance.

Another previous related research by [1] evaluated several document clustering and topic modeling models for online social networks. The study used a large corpus collected from social media such as Twitter and Reddit. This study implemented word vector representations like TFIDF, LDA, and word embedding. This research shows that word embedding combined with K-Means clustering delivered the best performance. The study was also concerned with observing several hyper-parameter settings for word embedding, particularly for short text. It found that 100 dimensions are sufficient for short text representation like Twitter, the size of the context window is 5, and the minimum word count is 1. This study also shows K-Means is the best algorithm.

A study by [15] proposed a comparison study to analyze semantic representations of small corpora. This study compared the LSA representation versus Skip-gram word2vec. The finding of this study is that LSA showed better performance than Skip-gram on a small training corpus. This study also stated that LSA could capture relevant word associations in the training corpus, even for a small number of low-frequency words. The study investigated the optimality of different methods to achieve reliable semantic mappings when only medium to small corpora are available for training.

A study by [21] investigates corpus specificity and corpus size in word embeddings, as well as the suitable number of word embedding dimensions. In this study, the observation was done on a whole corpus and sub-corpora; the sub-corpora were generated from the primary corpus, which was split into a half-size corpus, a third-size corpus, and a quarter-size corpus. The study found that word2vec obtained its best performance when it was trained with the whole corpus. In this study, the contrary fact was also found that specialization (removal of out-of-domain documents) of the training corpus, accompanied by a decrease in dimensionality, can improve the LSA word representation. From a cognitive-modeling point of view, the study points out that LSA representation acquisition may not efficiently exploit higher-order co-occurrences and global relations, whereas word2vec does.

A study by [22] observed the applicability of word2vec to extracting similar words from small, domain-specific data. The study found that the corpus's specificity has much more influence on word2vec results than the corpus size; in other words, the specificity of the corpus is more important than its size. Word2vec was used to extract domain-specific related terms from very small corpora, and the study obtained satisfactory results on small data for domain-specific words using a word embedding approach.

A study by [2] observed document clustering on a large public domain. This study used various algorithms and various word vector representations, and obtained the best result with the Doc2Vec algorithm.

A study by [16] proposed a method to generate effective word embeddings from a limited dataset. This study expanded the small text corpus by generating multiple versions of the sentences in the corpus.


Other previous studies by [17], [18], and [19] stated that observation of a small corpus is important. Following these previous studies, our study proposes an experimental comparison of the TFIDF-LSA and Doc2Vec algorithms on small corpora. Different from most previous studies, which use popular publicly available English corpora, in our study the observed language is Bahasa Indonesia; we generated the corpus from Bahasa Indonesia newspapers. Our study also focuses on tuning the hyper-parameters of each word vector representation to get the optimal result. We are also concerned with several cluster performance criteria using various cluster evaluations, and we observe the intra-cluster and inter-cluster quality using the silhouette coefficient.

4. METHODOLOGY

The methodology of this study is described in Figure 1.

Figure 1. Methodology (pipeline: Data Preparation, consisting of Generate Corpus, Generate Sub-Corpora, Generate Ground Truth Labels, and Pre-Processing; then Word Vector Representation & Parameter Initialization, K-Means Clustering Implementation, Hyper-parameter Optimization, Data Visualization, and Cluster Evaluation)

4.1 Data Preparation
4.1.1 Generate a Corpus

The first step of data preparation is to prepare a corpus. In this study, a corpus is a set of documents. We used a partial corpus from the work by [33]. The corpus was generated from Indonesian online newspapers. However, the corpus by [33] has already been transformed into one big file in txt format, whereas we need a corpus that contains various topics in separate files. Therefore, we took the original form obtained after the crawling process, which is still in JSON format. Each JSON file results from crawling many articles on one topic from one newspaper. There are seven newspapers and five topics, which are economics, politics, law, health, and technology. Each JSON file contains some metadata like URL, author, date, title, and article content. In this study, we only took the article content. The number of articles in these JSON files reaches thousands. The purpose of this study is to compare word vector representations on small corpora; therefore, we collected 500 articles. A corpus with 500 articles can be considered a small corpus compared to the billions of articles in the big corpora of previous studies.

4.1.2 Generate Sub-Corpora

The purpose of this study is to compare the cluster performance of two word vector representations on small corpora. Therefore, in addition to processing the whole corpus of 500 documents, we also split this corpus into another 3 sub-corpora. To ease corpus identification, we named the corpora corpus-500, corpus-250, corpus-125, and corpus-60. The details of each corpus can be seen in Table 1; a sketch of the corpus-building step is given after Table 1.

Table 1: The Corpus Information

Corpus      Number of articles   Total number of words   Total number of unique words
Corpus-500  500                  110521                  9629
Corpus-250  250                  57773                   6615
Corpus-125  125                  46559                   5770
Corpus-60   60                   14346                   2639
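A rough sketch of the corpus-building and balanced-sampling step of Sections 4.1.1 and 4.1.2 is shown below. The directory layout and the JSON field names ("content", "category") are hypothetical, since the paper does not specify them.

```python
# Minimal sketch (assumed file layout and field names, not the authors' code).
import json
import glob
import random

articles = []
for path in glob.glob("crawl/*.json"):          # one JSON file per topic/newspaper
    with open(path, encoding="utf-8") as f:
        for item in json.load(f):
            # keep only the article body, as done in this study
            articles.append({"text": item["content"], "category": item["category"]})

def balanced_sample(articles, per_category, categories):
    # draw the same number of articles per category so ground-truth labels stay balanced
    sample = []
    for cat in categories:
        pool = [a for a in articles if a["category"] == cat]
        sample.extend(random.sample(pool, per_category))
    return sample

categories = ["economics", "politics", "law", "health", "technology"]
corpus_500 = balanced_sample(articles, 100, categories)  # 50/25/12 per category for the sub-corpora
```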


4.1.3 Determine Ground Truth Labels

In this study, determining ground truth labels is a process to annotate each document with an appropriate news category. Some of the cluster evaluation measures, such as Purity and ARI, need ground truth labels to calculate the cluster performance. We determined the ground truth label for each document based on the category assigned by the newspapers. To ease ground truth label determination, we collected exactly the same number of articles for each category in each corpus. For example, corpus-500 contains 100 health articles, 100 law articles, 100 economics articles, 100 technology articles, and 100 sports articles.

4.1.4 Pre-Processing

Pre-processing is a step to transform raw data into an understandable and efficient format. In this stage, we implemented tokenization, data cleaning, stopword removal, and stemming. Tokenization is a process to split the documents in a corpus word by word. Data cleaning is a process to remove unwanted symbols like HTML tags. Stopword removal is the step to exclude unimportant words of a language; since this study analyzes Bahasa Indonesia, we implemented the stopword list for Bahasa Indonesia by Tala [34]. Stemming is a process to get the root form of the words; in this study, we implemented a stemming algorithm for Bahasa Indonesia following [35].
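The pre-processing pipeline could be sketched as follows. The PySastrawi stemmer is used here only as a stand-in implementation of the confix-stripping approach of [35], and "stopwords_tala.txt" is a hypothetical file holding Tala's stopword list [34]; the authors' exact tooling is not specified in the paper.

```python
# Minimal sketch (assumed tooling): tokenization, cleaning, stopword removal, stemming.
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory  # stand-in for the stemmer of [35]

stemmer = StemmerFactory().create_stemmer()
with open("stopwords_tala.txt", encoding="utf-8") as f:     # hypothetical stopword file [34]
    stopwords = set(f.read().split())

def preprocess(raw_text):
    text = re.sub(r"<[^>]+>", " ", raw_text.lower())      # cleaning: drop HTML tags
    tokens = re.findall(r"[a-z]+", text)                   # tokenization
    tokens = [t for t in tokens if t not in stopwords]     # stopword removal
    return [stemmer.stem(t) for t in tokens]               # stemming to root words

print(preprocess("<p>Pemerintah menaikkan harga BBM.</p>"))
```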
4.2 Word Vector Representation and Parameter Initialization

This stage is a process to transform the cleaned corpus into the observed word vector representations, which are TFIDF-LSA and Doc2Vec. To get an optimal word vector representation, we have to choose a set of optimal hyper-parameters for the learning algorithm. There are some parameters to be tuned for each text representation. For instance, one of the parameters that affects TF-IDF quality is the initialization of n in the n-grams; an n-gram is a contiguous sequence of n items, and n is the parameter that sets the maximum number of words in a text sequence that is converted into a token. Meanwhile, the hyper-parameters of the Doc2Vec model include the choice of Doc2Vec algorithm, the window size, the dimensions, the minimum word count, and the number of epoch iterations.

4.3 K-Means Clustering

Each model from the previous stage is subsequently clustered with the K-Means algorithm. In this study, we implemented the K-Means algorithm from the Scikit-learn library [36] in Python. Since in this study we focus on the word representation, not on the clustering algorithm, we used the default values provided by the Scikit-learn Python package for all experiments. We used 300 iterations and n-init = 10. We set the number of clusters to 5; this number is consistent with the number of news categories in the training corpus.

4.4 Hyper-parameter Optimization

Despite the many parameters that could be tuned for each word vector representation and for each corpus, in this study the hyper-parameter initialization is tuned only on corpus-60, with the consideration that corpus-60 is the smallest corpus, so the training time is shorter. For each hyper-parameter tuning, the best model is the one that generates the best Purity, silhouette, and ARI scores. This model then becomes the representative of that word vector representation.

4.4.1 Tuning Hyper-parameters of the TFIDF-LSA Model

For TFIDF-LSA, we tested variations of n (range 1-3) in the n-grams and implemented them with the K-Means algorithm. The result can be seen in Table 2.

Table 2: The Result of the TFIDF-LSA Model

Model                        Silhouette   Purity   ARI
TFIDF-LSA (n = 1 / unigram)  0.51         0.5      0.45
TFIDF-LSA (n = 2 / bigram)   0.52         0.5      0.518
TFIDF-LSA (n = 3 / 3-gram)   0.46         0.48     0.49

Based on the results in Table 2, we can conclude that the optimal hyper-parameter of TF-IDF combined with LSA is the bigram model, or n = 2.

4.4.2 Tuning Hyper-parameters of the Doc2Vec Model

For the Doc2Vec model, we tested various parameters to look for the best performance. We implemented the 2 algorithms of Doc2Vec, which are PV-DBOW and PV-DM. We tested context window sizes for each algorithm with the values 5, 8, and 15, and dimensions with the values 50, 100, and 300. We set the epoch number to 300 and min-count to 5; the other hyper-parameters were initialized with the default values provided by Gensim. The result can be seen in Table 3; a sketch of this sweep is shown below.
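The sketch mirrors the grid of Table 3 (PV-DM vs PV-DBOW, window sizes 5/8/15, dimensions 50/100/300), but uses toy data and smaller epoch and min-count values so that it runs as written; it is an illustration, not the authors' script.

```python
# Minimal sketch (assumed toy data): scoring each Doc2Vec configuration by clustering.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

tokenized_docs = [                                   # 2 toy documents per category
    ["ekonomi", "pasar", "saham"], ["ekonomi", "bank", "bunga"],
    ["politik", "pemilu", "partai"], ["politik", "presiden", "kabinet"],
    ["hukum", "pengadilan", "kasus"], ["hukum", "jaksa", "sidang"],
    ["teknologi", "aplikasi", "data"], ["teknologi", "internet", "ponsel"],
    ["kesehatan", "rumah", "sakit"], ["kesehatan", "dokter", "pasien"],
]
ground_truth = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokenized_docs)]

results = {}
for dm in (0, 1):                                    # 0 = PV-DBOW, 1 = PV-DM
    for window in (5, 8, 15):
        for dim in (50, 100, 300):
            model = Doc2Vec(tagged, dm=dm, vector_size=dim, window=window,
                            min_count=1, epochs=30)  # the study used min_count=5, epochs=300
            vecs = [model.dv[i] for i in range(len(tagged))]
            labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(vecs)
            results[(dm, window, dim)] = (silhouette_score(vecs, labels),
                                          adjusted_rand_score(ground_truth, labels))

best = max(results, key=lambda k: results[k][1])     # configuration with the best ARI
print(best, results[best])
```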


Based on Table 3, the best performance of the Doc2Vec model is obtained with the following parameters: algorithm PV-DBOW, dimension (vector size) = 300, context window size = 8, min-count = 5, and epoch = 300.

Table 3: Hyper-parameters of the Doc2Vec Model (silhouette, Purity, and ARI on corpus-60 for PV-DM and PV-DBOW with window sizes 5, 8, and 15 and dimensions 50, 100, and 300; the individual cell values are not reproduced here)

Further, the best model of each word representation from the hyper-parameter optimization stage is tested on the different corpus sizes. To ease identification, we named the processes with different codes: the feature representations for TFIDF-LSA on the different corpora are coded as TF1 - TF4, and the feature representations for Doc2Vec on the different corpora are coded as PV1 - PV4 (labeled DV-1 - DV-4 in Figure 2).

TF1 = 2-grams, TFIDF-LSA, corpus-60
TF2 = 2-grams, TFIDF-LSA, corpus-125
TF3 = 2-grams, TFIDF-LSA, corpus-250
TF4 = 2-grams, TFIDF-LSA, corpus-500
PV1 = algorithm PV-DBOW, vector size = 300, min-count = 5, epoch = 300, window size = 8, corpus-60
PV2 = algorithm PV-DBOW, vector size = 300, min-count = 5, epoch = 300, window size = 8, corpus-125
PV3 = algorithm PV-DBOW, vector size = 300, min-count = 5, epoch = 300, window size = 8, corpus-250
PV4 = algorithm PV-DBOW, vector size = 300, min-count = 5, epoch = 300, window size = 8, corpus-500

The experiment results for the various processes for TFIDF-LSA can be seen in Table 4; meanwhile, the results for Doc2Vec can be seen in Table 5.

Table 4: Cluster Evaluation for TF-IDF Combined with LSA

Model   Silhouette   Purity   ARI    Time proc. (s)   Memory usage (MiB)
TF1     0.52         0.50     0.51   0.6              243.17
TF2     0.49         0.51     0.56   0.6              248.82
TF3     0.43         0.54     0.61   0.8              269.14
TF4     0.51         0.51     0.59   1.13             357

Table 5: Cluster Evaluation for Doc2Vec

Model   Silhouette   Purity   ARI     Time proc. (s)   Memory usage (MiB)
PV1     0.60         0.37     0.012   0.4              238.09
PV2     0.63         0.39     0.090   0.46             239.68
PV3     0.57         0.32     0.014   0.46             244.54
PV4     0.58         0.29     0.019   0.5              249

4.5 Data Visualization

Data visualization is an essential means to describe the clustering pattern. To adjust to the limitation of the human eye, which can only see 2 dimensions, we have to reduce the word vector representation to an x and y layer. The original TFIDF representation generates a sparse matrix that has as many dimensions as the total number of words in the corpus; however, in this study we implemented LSA to get a meaning representation, and this implementation also reduces the dimension of the TFIDF-LSA model. For the Doc2Vec model, the word vector representation dimension is 300. In this study, we implemented Principal Component Analysis (PCA) to reduce the models' dimensions. PCA can also emphasize variation and bring out strong patterns in the corpus. The cluster result visualization for each model can be seen in Figure 2.
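A minimal sketch of this PCA-based visualization step is given below; the random vectors stand in for the TFIDF-LSA or Doc2Vec document vectors, and the plotting details are illustrative rather than the authors' figure code.

```python
# Minimal sketch (illustrative): project document vectors to 2-D with PCA and plot
# them coloured by their K-Means cluster, as in Figure 2.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

doc_vectors = np.random.rand(100, 300)            # stand-in for Doc2Vec / TFIDF-LSA vectors
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(doc_vectors)

points_2d = PCA(n_components=2).fit_transform(doc_vectors)
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, cmap="tab10", s=15)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Cluster visualization after PCA")
plt.show()
```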


There are many aspects that determine the quality of a cluster. Based on Table 4 and Table 5, for cluster performance, each model has 3 kinds of cluster evaluation scores, which are the silhouette, Purity, and ARI.

4.6 Cluster Evaluation
4.6.1 Cluster Performance Evaluation

Intrinsic Evaluation

The silhouette coefficient score is an intrinsic measurement that indicates how well an algorithm optimized a particular representation. The TF-IDF model gained a silhouette score of about 0.5 for every corpus size, while Doc2Vec got a silhouette score of about 0.6.

Figure 2. Data Visualization for Each Model of Word Vector Representation (panels TF-1 - TF-4 for TFIDF-LSA and DV-1 - DV-4 for Doc2Vec)


Even though we cannot compare intrinsic evaluations between different representations, we can see that both models, TFIDF-LSA and Doc2Vec, only gained about half of the maximum score of 1 for the number of clusters k = 5. Still using k = 5, we calculated the silhouette score for each model; the result tends to be the same value across the various corpus sizes. This result indicates that the clustering is not optimal when clustering the objects into 5 clusters. Because this intrinsic measurement calculates its score based only on the word vector representation and does not need other external information, the experiment can be repeated for various numbers of clusters k. The silhouette scores for various k using the Doc2Vec model can be seen in Figure 3, and for the TFIDF-LSA model in Figure 4. Based on this experiment, we found that the highest silhouette scores were obtained for k = 2, 3, and 4. This fact made us re-examine the documents in the corpus, and we found that some categories are very similar to each other. For instance, in our corpus, the Economics category is very similar to the Politics and Law categories. These categories share many of the same words, so the boundaries between the clusters are not so obvious. Because K-Means clustering is a hard clustering, if we initialize k to a number bigger than the optimal k obtained by the intrinsic measurement, then the intrinsic score of the cluster quality will not be optimal. These findings show that the quality of the clusters is influenced not only by the corpus size but also by the quality of the corpus. Deeper observation is needed in further research that focuses on the quality of the data corpus, such as a comparison of small domain-specific and general corpora. We assume the cluster performance can be improved if various corpus sizes and various small domain-specific corpora are used.
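The per-k silhouette scan behind Figures 3 and 4 can be sketched as follows; the stand-in vectors and the range of k values are illustrative.

```python
# Minimal sketch (illustrative): silhouette score of K-Means for k = 2 ... 7.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

doc_vectors = np.random.rand(60, 300)             # stand-in for one corpus representation

for k in range(2, 8):                             # k = 2 to 7 as in the figures
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(doc_vectors)
    print(k, round(silhouette_score(doc_vectors, labels), 3))
```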

Figure 3. Silhouette Score for the Doc2Vec Representation (k = 2 to 7 for corpus-60, corpus-125, corpus-250, and corpus-500)

Figure 4. Silhouette Score for the TFIDF-LSA Representation (k = 2 to 7 for corpus-60, corpus-125, corpus-250, and corpus-500)


Extrinsic Evaluation

In this study, the extrinsic measurements of cluster performance are Purity and ARI. An extrinsic measurement needs ground truth labels as a reference to indicate how well the clusters are formed. For both extrinsic evaluations, TFIDF-LSA gets better Purity and ARI scores than the Doc2Vec model. The ARI measurement focuses on the capability of the K-Means algorithm to separate the elements belonging to different classes; based on the ARI score, the TFIDF-LSA representation is better than the Doc2Vec model at separating elements that do not belong to the same class. Based on Table 4 and Table 5, the comparison of the extrinsic cluster performance evaluation is shown in Figure 5.

Figure 5. Cluster Performance Comparison (Purity and ARI scores of TFIDF-LSA and Doc2Vec for corpus-60, corpus-125, corpus-250, and corpus-500)

Time and Memory Consumption Evaluation

In addition to cluster performance, the other consideration is the efficiency of the clustering process. Based on the experiments in this study, the time and memory consumption measurements are shown in Figure 6. Doc2Vec needs less time and less memory in the clustering process compared to the TFIDF-LSA model; in other words, the Doc2Vec model has better time and memory usage efficiency than the TFIDF-LSA model. In the TFIDF-LSA model, time and memory usage increase as the corpus size increases. Meanwhile, in the Doc2Vec model, time and memory usage tend to be stable even as the corpus size increases.
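One possible way to collect such measurements is sketched below. Timing with time.perf_counter and reading the process's resident memory in MiB with psutil are assumptions about tooling, since the paper does not state how time and memory were measured.

```python
# Minimal sketch (assumed tooling, not the authors' script): timing the clustering
# step and reading the process memory in MiB.
import os
import time
import numpy as np
import psutil
from sklearn.cluster import KMeans

doc_vectors = np.random.rand(500, 300)            # stand-in document vectors

process = psutil.Process(os.getpid())
start = time.perf_counter()
KMeans(n_clusters=5, n_init=10, random_state=0).fit(doc_vectors)
elapsed = time.perf_counter() - start             # processing time in seconds
mem_mib = process.memory_info().rss / (1024 ** 2) # resident memory in MiB
print(f"time: {elapsed:.2f} s, memory: {mem_mib:.1f} MiB")
```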

Figure 6. Time and Memory Consumption Comparison (processing time in seconds and memory usage in MiB for TFIDF-LSA and Doc2Vec across the four corpora)


5. CONCLUSION

The purpose of this study is to observe a suitable word vector representation for text clustering on small corpora. In this study, the comparison is based on cluster performance, time, and memory consumption. For the cluster performance, which is based on extrinsic measurement, TFIDF-LSA gets better performance than the Doc2Vec model; it means that, for all the observed corpora, the TFIDF-LSA representation is better than the Doc2Vec model at separating elements that do not belong to the same class. On the contrary, Doc2Vec is better than TFIDF-LSA in time and memory consumption. The time and memory usage of the TFIDF-LSA model increase along with the corpus size, whereas in the Doc2Vec model, time and memory usage tend to be stable even as the corpus size increases. As further work, the same experiments should be done with a bigger corpus size that is still classified as a small corpus, to find out at what point Doc2Vec tends to get better performance than TFIDF-LSA. Moreover, deeper observation is needed in further research that focuses on the quality of the data corpus.

ACKNOWLEDGMENT

The authors gratefully acknowledge that the present research is supported by Lembaga Penelitian Universitas Sumatera Utara. The support is under the research grant TALENTA USU of the Year 2020.

REFERENCES:

[1] S. A. Curiskis, B. Drake, T. R. Osborn, and P. J. Kennedy, "An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit," Inf. Process. Manag., vol. 57, no. 2, p. 102034, Mar. 2020.
[2] F. Role, S. Morbieu, and M. Nadif, "Unsupervised evaluation of text co-clustering algorithms using neural word embeddings," in International Conference on Information and Knowledge Management, Proceedings, 2018, pp. 1827-1830.
[3] E. Rendón, I. Abundez, A. Arizmendi, and E. M. Quiroz, "Internal versus External cluster validation indexes," Int. J. Comput. Commun., vol. 5, no. 1, pp. 27-34, 2011.
[4] K. Alsabti, S. Ranka, and V. Singh, "An efficient k-means clustering algorithm," 1997.
[5] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2007.
[6] T. Kohonen, "The Self-Organizing Map," Proc. IEEE, 1990.
[7] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," 1996.
[8] A. Amalia, M. S. Lydia, S. D. Fadilla, and M. Huda, "Perbandingan Metode Klaster dan Preprocessing Untuk Dokumen Berbahasa Indonesia," J. Rekayasa Elektr., vol. 14, no. 1, pp. 35-42, Apr. 2018.
[9] A. Amalia, M. S. Lydia, S. D. Fadilla, M. Huda, and D. Gunawan, "Document Clustering Optimization with Synonym Dictionary Check Function Case Study: Documents in Bahasa Indonesia," in 2017 International Conference on Electrical Engineering and Informatics (ICELTICs), 2017, pp. 286-291.
[10] Y. Bengio et al., "A Neural Probabilistic Language Model," 2003.
[11] Q. Li, S. Shah, X. Liu, and A. Nourbakhsh, "Data Sets: Word Embeddings Learned from Tweets and General Data," 2017.
[12] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural Language Processing (Almost) from Scratch," J. Mach. Learn. Res., vol. 12, pp. 2493-2537, 2011.
[13] T. Mikolov, W.-T. Yih, and G. Zweig, "Linguistic Regularities in Continuous Space Word Representations," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 746-751.
[14] D. Kim, D. Seo, S. Cho, and P. Kang, "Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec," Inf. Sci., vol. 477, pp. 15-29, Mar. 2019.
[15] E. Altszyler, S. Ribeiro, M. Sigman, and D. Fernández Slezak, "Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database," Conscious. Cogn., vol. 56, pp. 178-187, Oct. 2017.
[16] A. Silva and C. Amarathunga, "On Learning Word Embeddings From Linguistically Augmented Text Corpora," in Proceedings of the 13th International Conference on Computational Semantics, 2019, pp. 52-58.


[17] C. Jiang, C. J. Hsieh, H. F. Yu, and K. W. Chang, "Learning word embeddings for low-resource languages by PU learning," in NAACL HLT 2018 - Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 2018, vol. 1, pp. 1024-1034.
[18] Y. Gu, G. Leroy, S. Pettygrove, M. K. Galindo, and M. Kurzius-Spencer, "Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD)," AMIA Annu. Symp. Proc., vol. 2018, pp. 508-517, 2018.
[19] P. Tao, J. Minghua, and H. Ming, "A dynamic clustering algorithm based on small data set," in Proceedings of the 2009 6th International Conference on Computer Graphics, Imaging and Visualization: New Advances and Trends, CGIV2009, 2009, pp. 410-413.
[20] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognit. Lett., vol. 31, no. 8, pp. 651-666, 2010.
[21] E. Altszyler, M. Sigman, and D. Fernández Slezak, "Corpus Specificity in LSA and Word2vec: The Role of Out-of-Domain Documents," pp. 1-10, Jun. 2019.
[22] E. Dusserre, "Bigger does not mean better! We prefer specificity," in 12th International Conference on Computational Semantics (IWCS), 2017, pp. 1-6.
[23] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic analysis," Discourse Process., vol. 25, no. 2-3, pp. 259-284, Jan. 1998.
[24] N. E. Evangelopoulos, "Latent semantic analysis," Wiley Interdiscip. Rev. Cogn. Sci., vol. 4, no. 6, pp. 683-692, 2013.
[25] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Advances in Neural Information Processing Systems, 2013.
[26] M. Sahlgren, "The distributional hypothesis," Ital. J. Linguist., vol. 20, pp. 33-53, 2008.
[27] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," pp. 1-12, 2013.
[28] Google, "Google Code Archive - Long-term storage for Google Code Project Hosting," Code.Google.com, pp. 1-6, 2013.
[29] J. Manotumruksa, C. Macdonald, and I. Ounis, "A deep recurrent collaborative filtering framework for venue recommendation," in International Conference on Information and Knowledge Management, Proceedings, 2017.
[30] Q. Le and T. Mikolov, "Distributed Representations of Sentences and Documents," in International Conference on Machine Learning, 2014.
[31] "2.3. Clustering - scikit-learn 0.22.2 documentation," 2020. [Online]. Available: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation. [Accessed: 10-May-2020].
[32] C. M. De Vries, S. Geva, and A. Trotman, "Document Clustering Evaluation: Divergence from a Random Baseline," 2012.
[33] A. Amalia, O. Salim Sitompul, E. Budhiarti Nababan, M. Silvi Lydia, and N. Rahmatunnisa, "Bahasa Indonesia Text Corpus Generation Using Web Corpora Approaches," J. Theor. Appl. Inf. Technol., vol. 31, p. 24, 2019.
[34] F. Z. Tala, "A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia," M.Sc. Thesis, Universiteit van Amsterdam, 2003.
[35] M. Adriani, J. Asian, B. Nazief, and H. E. Williams, "Stemming Indonesian: A Confix-Stripping Approach," ACM Trans. Asian Lang. Inf. Process., vol. 6, no. 4, pp. 1-33, 2007.
[36] J. Montiel, J. Read, A. Bifet, and B. Kegl, "Scikit-Multiflow: A Multi-output Streaming Framework," J. Mach. Learn. Res., vol. 19, pp. 1-5, 2018.
