Classification using Word Embedding Techniques

Author: Prachi Bhalerao ([email protected])

Abstract

Word embedding is one of the most commonly used techniques for finding word similarity or identifying text context. In this paper, I illustrate the classification process with the help of word vectors obtained from word embedding techniques to produce interesting classified results. The process here is to generate word vectors for every role through word embedding, followed by the implementation of clustering algorithms. Thus, this paper focuses on various word embedding algorithms and clustering approaches that can be used hand in hand in applications like subtopic detection, single-document summarization, etc.

Introduction

Word embedding is essential in solving most of today's NLP problems. In many algorithms, especially clustering, there is a need to understand how similar two words are. This can be done using 'word embedding': a technique for representing words as vectors of real numbers, which helps in comparing the semantics of different words and in efficient representation of data. The idea behind implementing word embedding algorithms is to capture as much semantic or contextual information as possible. Some techniques and tools for implementing word embedding are word2vec, GloVe, fastText, and the Gensim library. Taking the concept of word embedding forward, this paper also covers how sentence and document embeddings can be applied in NLP problem-solving.

Word embedding can be of two types: frequency-based embedding and prediction-based embedding. Count vectors, TF-IDF vectors, and co-occurrence vectors are all frequency-based. In this paper, I mostly focus on prediction-based embedding, which relies on neural networks. This type of embedding is prediction based, i.e. it assigns probabilities to words, and it has become the standard choice for tasks such as word analogies and word similarities.


Word2vec:

Word2vec is an example of prediction-based embedding. It can further be categorized into two variants, viz. skip-gram and Continuous Bag of Words (CBOW). In skip-gram, the input is the target word and the outputs are the words surrounding the target word. For example, in the sentence "Information retrieval is the activity of obtaining information system resources", the input can be 'Retrieval' and the outputs will be 'Information', 'is', 'the', and 'activity', considering a window size of 5.

In CBOW, the current target word (the center word) is predicted based on the source context words (surrounding words).
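As an illustration, the sketch below trains both variants with the Gensim library (assuming the gensim 4.x API); the toy corpus, parameter values, and query word are placeholders chosen only for demonstration, not the settings used in any referenced paper.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of lowercase tokens (placeholder data).
corpus = [
    ["information", "retrieval", "is", "the", "activity", "of",
     "obtaining", "information", "system", "resources"],
    ["word", "embedding", "maps", "words", "to", "dense", "vectors"],
]

# sg=1 selects the skip-gram variant; sg=0 selects CBOW.
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

# Look up a word vector and its most similar words (a toy model, so results are noisy).
vector = skipgram.wv["retrieval"]            # 100-dimensional numpy array
print(skipgram.wv.most_similar("retrieval", topn=3))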

Fig. 1.a. CBOW model architecture; Fig. 1.b. Skip-gram model architecture (Source: Mikolov et al., https://arxiv.org/pdf/1301.3781.pdf)

FastText:

Rather than feeding target/context words into the neural network, FastText breaks words into several character n-grams. For example, the tri-grams for the word apple are app, ppl, and ple. The word embedding vector for apple will be the sum of the vectors of these n-grams. After training the neural network, word vectors for all the n-grams in the training dataset will have been generated. Thus, the word embedding representation of apple is obtained by combining the word vectors of all its n-grams (i.e. app, ppl, and ple in this case).
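A minimal sketch of subword-based embeddings with Gensim's FastText implementation is given below (again assuming the gensim 4.x API); the corpus and the tri-gram-only setting are illustrative placeholders.

from gensim.models import FastText

# Toy corpus of tokenized sentences (placeholder data).
corpus = [
    ["i", "ate", "an", "apple"],
    ["apples", "and", "oranges", "are", "fruits"],
]

# min_n / max_n control the character n-gram lengths (here: tri-grams only).
model = FastText(corpus, vector_size=100, window=5, min_count=1, min_n=3, max_n=3)

# The vector for "apple" is built from its character n-gram vectors,
# so even an out-of-vocabulary word such as "appleton" still gets a vector.
print(model.wv["apple"].shape)
print(model.wv["appleton"].shape)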

Fig. 2 FastText model (Source: Google Images)

Agglomerative Clustering:

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up clustering treats each element as a singleton cluster and then repeatedly merges the pair of most similar (closest) clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up clustering is therefore called hierarchical agglomerative clustering. Top-down clustering requires a method for splitting a cluster; it proceeds by splitting clusters recursively until individual documents are reached. In this paper, I focus on the bottom-up approach.
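As a rough sketch, bottom-up agglomerative clustering of word vectors can be run with scikit-learn as below; the word list, vectors, Ward linkage, and cluster count are placeholder assumptions chosen only for illustration.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder word vectors: 6 words in a 4-dimensional embedding space.
words = ["attack", "breach", "malware", "bank", "loan", "credit"]
vectors = np.random.RandomState(0).rand(6, 4)

# Bottom-up (agglomerative) clustering: start with singleton clusters
# and merge the closest pair until n_clusters remain.
clusterer = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = clusterer.fit_predict(vectors)

for word, label in zip(words, labels):
    print(word, "->", "cluster", label)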


Fig. 3. Bottom-up Hierarchical agglomerative clustering

(Source: https://www.geeksforgeeks.org/ml-hierarchical-clustering-agglomerative-and-divisive-clustering/)

In the survey section, I discuss an interesting paper from the references that shows how sentence embeddings can be used for sub-topic detection. I also discuss reference papers that demonstrate the application of word embeddings in the classification and clustering of arguments and in single-document summarization. Next is the 'Compare and Contrast Relevant Work' section, where I compare the word embedding applications in the referred papers and discuss the pros and cons of their approaches. Last is the conclusion section, where I summarize my study and analysis results.

Contextualized word embedding:

Contextualized word embeddings address the issue that words can have different senses depending on context. These methods compute a vector for a given word based on the specific context in which the word is used in a sentence. This is in contrast with static word embedding methods like word2vec or FastText, where a word is always mapped to the same vector, no matter what context it is used in.
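A minimal sketch of this idea with the Hugging Face transformers library is shown below; the model checkpoint and the example sentences are assumptions chosen only to show that the same surface word ('bank') receives different vectors in different contexts.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; any BERT-style encoder would illustrate the point.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    # Return the contextual vector of the first occurrence of `word`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = vector_for("she sat on the bank of the river", "bank")
v2 = vector_for("he deposited cash at the bank", "bank")

# Unlike word2vec/FastText, the two vectors differ because the contexts differ.
print(torch.cosine_similarity(v1, v2, dim=0).item())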


Survey of related work

My current Research module:

The topic under consideration is part of my ongoing research, where I am trying to apply word embeddings and clustering algorithms to the results obtained from one of the Ebiquity lab's projects, the 'Cyber Attack Sensing Information Extraction System (CASIE)'. CASIE is a system that extracts information about cybersecurity events from text and populates a semantic model to integrate into a knowledge graph of cybersecurity data; my goal is to classify the extracted 'roles' into sub-categories.

Papers Descriptions:

One of the papers in the references, titled 'A Method based on Sentence Embeddings for the Sub-Topics Detection' [1], talks about sentence embeddings for sub-topic detection. Since there is a huge similarity between sub-topics that belong to the same topic, most of the available methods cannot be directly applied to the task of sub-topic discovery within a given topic. This issue is addressed in the paper by introducing a new technique based on sentence embeddings. In sub-topic detection, the same word may have different meanings under different sub-topics. To enhance differentiation, the authors' idea was to allow each word to have different vector representations under different topics.

Approach: Initially, data from Weibo, a popular Chinese social media platform used for blogging and topic discussion, was fetched and pre-processed. Then latent Dirichlet allocation (LDA) was used to obtain the topics. A Topic Word Embedding (TWE) model was used to train the Weibo data set under a topic: the word and the corresponding topic were trained together to get the word embeddings and the topic embeddings. Later, by taking the cosine similarity between the word embeddings and the topic embeddings as the weight value, the word embeddings of the target words under all topics were weighted and added, thus extending the topic information into the word embeddings and enhancing their semantics. Next comes the p-means (power mean) method. It is used to merge a blog into a sentence embedding, which serves as the characteristic value of the blog; it is a generalized form of averaging word embeddings. The obtained sentence embeddings are then passed to the k-means clustering algorithm, and finally, the sub-topic clusters are obtained after completion of the k-means process.
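The sketch below gives a simplified, assumed version of this pipeline: a few power means (here only the mean, minimum, and maximum, i.e. p = 1, -inf, +inf) of the word vectors are concatenated into a sentence embedding, which is then clustered with k-means. The word vectors, blog texts, and cluster count are placeholders, and the TWE training step is not reproduced.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder topic-enhanced word vectors: token -> 4-dimensional vector.
rng = np.random.RandomState(0)
word_vectors = {w: rng.rand(4) for w in
                ["price", "drop", "stock", "rally", "match", "goal", "team"]}

def p_mean_embedding(tokens):
    # Concatenate several power means of the word vectors (p = 1, -inf, +inf).
    vecs = np.array([word_vectors[t] for t in tokens if t in word_vectors])
    return np.concatenate([vecs.mean(axis=0), vecs.min(axis=0), vecs.max(axis=0)])

blogs = [["stock", "price", "drop"], ["stock", "rally"], ["team", "goal", "match"]]
embeddings = np.array([p_mean_embedding(b) for b in blogs])

# Cluster the blog embeddings; each cluster is treated as one sub-topic.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)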

The second paper, 'Adapting word2vec to Named Entity Recognition' [2], applies word embeddings to classification in the task of Named Entity Recognition. Named Entity Recognition (NER) is a sequence prediction problem: given a tokenized text, the task is to predict which words belong to which predefined category.

Approach: Words were tokenized using the NLTK toolkit. Word2vec and FastText models were trained on different subsets of the tokenized data, and the authors used these models to build experimental analysis results. Models were generated using the default setting for vector dimensionality (say 100), as well as defaults for the other parameters such as the number of training iterations and the window size. The word embeddings were then applied to the classification process using a greedy implementation of the Linear Support Vector algorithm in the scikit-learn software with default values.
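A rough sketch of this classification step is shown below, under the simplifying assumption that each token is represented only by its embedding vector (the original work uses a richer greedy feature setup); the tokens, labels, and vectors are placeholders.

import numpy as np
from sklearn.svm import LinearSVC

# Placeholder embeddings with dimensionality 100, as in the default setting.
rng = np.random.RandomState(0)
embeddings = {w: rng.rand(100) for w in
              ["john", "works", "at", "google", "in", "london"]}

# Each training example is one token's vector with its NER label.
tokens = ["john", "works", "at", "google", "in", "london"]
labels = ["PER", "O", "O", "ORG", "O", "LOC"]
X = np.array([embeddings[t] for t in tokens])

# Linear Support Vector classifier with scikit-learn default values.
clf = LinearSVC().fit(X, labels)
print(clf.predict(X[:2]))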

Addressing cluster granularity - A practical challenge in generating clusters from word embeddings lies in choosing the relevant cluster granularity, i.e. the granularity that maintains the information relevant to the classification task at hand. Through their first experiment, the authors obtained a rough idea of a task-adequate cluster granularity: they evaluated three granularities (100, 1000, and 5000 clusters) extrinsically by considering their effect on the performance of the NER system. In this scenario, the granularity of 1000 performed best, suggesting that this is the right range. In the next step, the authors manually inspected the clusters built at this granularity; though rather noisy, they seemed to capture some regularities. Hierarchical clustering over multiple granularities is their solution to the granularity problem: they found considerable improvement in performance, with a gain for every added granularity. The paper's conclusion regarding granularity was that combining multiple cluster granularities led to their best improvement, but it did not improve performance for smaller data sets.
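One possible way to realize multiple cluster granularities, sketched below under the assumption that word-cluster IDs are used as categorical features, is to run k-means over the vocabulary vectors at several values of k and attach one cluster ID per granularity to every word. The vocabulary, vectors, and toy granularities here are placeholders (the paper's values were 100, 1000, and 5000).

import numpy as np
from sklearn.cluster import KMeans

# Placeholder vocabulary of 50 words with 20-dimensional vectors.
rng = np.random.RandomState(0)
vocab = ["word%d" % i for i in range(50)]
vectors = rng.rand(50, 20)

# Toy granularities standing in for the paper's 100 / 1000 / 5000 clusters.
granularities = [5, 10, 25]
cluster_ids = {g: KMeans(n_clusters=g, n_init=10, random_state=0).fit_predict(vectors)
               for g in granularities}

# Each word gets one categorical feature per granularity.
features = {w: tuple(cluster_ids[g][i] for g in granularities)
            for i, w in enumerate(vocab)}
print(features["word0"])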


Effect of unlabeled corpus size on performance - The authors studied what effect the size of the unlabeled corpus has on the performance of the NER system. Their experiments indicated that the size of the unlabeled corpus does not correlate directly with the percentage of word types occurring in the test data that are also covered by the word embeddings model. The conclusion was that the performance of the NER model improved as the unlabeled data set grew, but only up to a limit (in this case, around 300,000 types), at which point it even started to drop.

Yet another paper, titled 'Classification and Clustering of Arguments with Contextualized Word Embeddings' [3], explains the implementation of contextualized word embeddings using two methods, ELMo (Embeddings from Language Models) and BERT (Bidirectional Encoder Representations from Transformers). It explains how the power of contextualized word embeddings can be used to classify and cluster topic-dependent arguments.

The last paper I referred to, the IEEE paper 'Single-Document Summarization Using Sentence Embeddings and K-Means Clustering' [4], explains an idea for document summarization using sentence embeddings. It describes a K-means clustering implementation that creates clusters containing sentences with similar meaning. Sentence embeddings are processed by the K-means algorithm into a number of clusters that depends on the required summary size. Sentences in a given cluster contain similar information, and for each cluster the most appropriate sentence is picked and included in the summary by a ridge regression sentence scoring model.
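The sketch below illustrates only the clustering side of this pipeline under simplifying assumptions: the sentence embeddings are random placeholders, the number of clusters equals the desired summary length, and the sentence closest to each centroid is picked instead of the paper's ridge regression scoring model.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

sentences = ["Sentence one.", "Sentence two.", "Sentence three.",
             "Sentence four.", "Sentence five."]
# Placeholder sentence embeddings; a real system would compute them from the text.
embeddings = np.random.RandomState(0).rand(len(sentences), 8)

summary_size = 2                      # number of sentences in the summary
km = KMeans(n_clusters=summary_size, n_init=10, random_state=0).fit(embeddings)

# For each cluster, pick the sentence closest to the centroid as its representative.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
summary = [sentences[i] for i in sorted(closest)]
print(summary)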

Compare and Contrast Relevant Work

Word Embedding –

The first paper uses Latent Dirichlet Allocation (LDA) for topic modeling together with Topic Word Embedding (TWE) for the initial embedding phase; the resulting word embeddings are then merged into sentence embeddings and used for clustering. The second paper uses pre-trained word2vec models directly from the Gensim library. The third paper demonstrates the use of the contextualized word embedding methods ELMo and BERT, whereas the last paper utilizes sentence embeddings for creating a ridge regression sentence scoring model.


Clustering and classification -

The first paper utilizes k-means clustering, where each cluster represents a sub-topic of the topic under consideration. The second paper implements the Linear Support Vector algorithm using scikit-learn. The third paper uses agglomerative hierarchical clustering; the results were also tested with the k-means and DBSCAN clustering algorithms, but agglomerative hierarchical clustering provided better results. In the last paper, the K-means algorithm was used: sentences in the same cluster tend to describe similar information, and the most appropriate sentences were picked from the clusters to create a summary. Since the applications of all these algorithms differ, we cannot directly compare their performance; the choice of a particular method depends on the data and the requirements.

Comparison based on the performance of the word embedding techniques –

Based on paper [1], Fig. 4 below shows a comparison of the performance of different word embedding and clustering approaches for sub-topic detection in particular. We can achieve better results for sub-topic detection using the approach described first in the 'Survey of related work' section above. This approach is indicated in the table as 'Our Model', since the figure is taken from that paper.

Fig. 4. Comparison based on the performance of sub-topic detection


Conclusion

In this paper, I compared different word embedding techniques and explained their usage in clustering algorithms. There are some issues in word embedding that especially affect clustering techniques, such as homographs (different words sharing the same spelling) and inflection (alterations of a word to express different grammatical categories); these challenges need to be addressed. The paper also discussed the reference papers on sub-topic detection, Named Entity Recognition (NER), argument classification and clustering, and single-document summarization, where word embeddings are applied to clustering and classification models.

References

1. Yu Xie et al., 'A Method based on Sentence Embeddings for the Sub-Topics Detection', 2019 J. Phys.: Conf. Ser. 1168 052004; https://iopscience.iop.org/article/10.1088/1742-6596/1168/5/052004/pdf

2. Scharolta Katharina Siencnik, ‘Adapting word2vec to Named Entity Recognition’, Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015); https://www.ep.liu.se/ecp/109/030/ecp15109030.pdf

3. Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, Iryna Gurevych. ‘Classification and Clustering of Arguments with Contextualized Word Embeddings’ ACL 2019; https://arxiv.org/abs/1906.09821

4. Sanchit Agarwal, Nikhil Kumar Singh, Priyanka Meel, ‘Single-Document Summarization Using Sentence Embeddings and K-Means Clustering’, IEEE conf. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8748762
