Classification using Word Embedding Techniques

Author: Prachi Bhalerao ([email protected])

Abstract

Word embedding is one of the most commonly used techniques for finding word similarity or identifying text context. In this paper, I illustrate the classification process with the help of word vectors obtained from word embedding techniques to produce interesting classified results. The process here is to generate word vectors for every role through word embedding, followed by the implementation of clustering algorithms. Thus, this paper focuses on various word embedding algorithms and clustering approaches that can be used hand in hand in applications like subtopic detection, single-document summarization, etc.

Introduction

Word embedding is essential in solving most of today's NLP problems. In many algorithms, especially clustering, there is a need to understand how similar two words are. This can be done using 'word embedding': a technique for representing words as vectors of real numbers, which helps in comparing the semantics of different words and in efficient representation of data. The idea behind implementing word embedding algorithms is to capture as much semantic or contextual information as possible. Some techniques and tools for implementing word embedding are word2vec, GloVe, fastText, and the Gensim library. Taking the concept of word embedding forward, this paper also covers how sentence and document embeddings can be applied in NLP problem-solving.

Word embedding can be of two types: frequency-based embedding and prediction-based embedding. Count vectors, TF-IDF vectors, and co-occurrence vectors are all frequency-based. In this paper, I mostly focus on prediction-based embedding, which relies on neural networks. This type of embedding is prediction based, i.e. it assigns probabilities to words, and it has become the standard choice for tasks such as word analogies and word similarities.


Word2vec:

Word2vec is an example of prediction-based embedding. It can further be categorized into two variants, viz. skip-gram and Continuous Bag of Words (CBOW). In skip-gram, the input is the target word and the outputs are the words surrounding the target word. For example, in the sentence "Information retrieval is the activity of obtaining information system resources", the input can be 'Retrieval' and the outputs will be 'Information', 'is', 'the', and 'activity', considering a window size of 5.

In CBOW, the current target word (the center word) is predicted based on the source context words (surrounding words).
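As an illustration, the sketch below trains both variants with the Gensim library (assuming the gensim 4.x API); the toy corpus, parameter values, and query word are placeholders chosen only for demonstration, not the settings used in any referenced paper.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of lowercase tokens (placeholder data).
corpus = [
    ["information", "retrieval", "is", "the", "activity", "of",
     "obtaining", "information", "system", "resources"],
    ["word", "embedding", "maps", "words", "to", "dense", "vectors"],
]

# sg=1 selects the skip-gram variant; sg=0 selects CBOW.
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

# Look up a word vector and its most similar words (a toy model, so results are noisy).
vector = skipgram.wv["retrieval"]            # 100-dimensional numpy array
print(skipgram.wv.most_similar("retrieval", topn=3))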

Fig. 1.a. CBOW model architecture; Fig. 1.b. Skip-gram model architecture (Source: Mikolov et al., https://arxiv.org/pdf/1301.3781.pdf)

FastText:

Rather than feeding target/context words into the neural network, FastText breaks words into several character n-grams. For example, the tri-grams for the word apple are app, ppl, and ple. The word embedding vector for apple will be the sum of the vectors of these n-grams. After training the neural network, word vectors for all the n-grams in the training dataset will have been generated. Thus, the word embedding representation of apple is obtained by combining the word vectors of all its n-grams (i.e. app, ppl, and ple in this case).
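A minimal sketch of subword-based embeddings with Gensim's FastText implementation is given below (again assuming the gensim 4.x API); the corpus and the tri-gram-only setting are illustrative placeholders.

from gensim.models import FastText

# Toy corpus of tokenized sentences (placeholder data).
corpus = [
    ["i", "ate", "an", "apple"],
    ["apples", "and", "oranges", "are", "fruits"],
]

# min_n / max_n control the character n-gram lengths (here: tri-grams only).
model = FastText(corpus, vector_size=100, window=5, min_count=1, min_n=3, max_n=3)

# The vector for "apple" is built from its character n-gram vectors,
# so even an out-of-vocabulary word such as "appleton" still gets a vector.
print(model.wv["apple"].shape)
print(model.wv["appleton"].shape)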

Fig. 2 FastText model (Source: Google Images)

Agglomerative Clustering:

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up clustering treats each element as a singleton cluster and then repeatedly merges the pair of most similar (closest) clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up clustering is therefore called hierarchical agglomerative clustering. Top-down clustering requires a method for splitting a cluster; it proceeds by splitting clusters recursively until individual documents are reached. In this paper, I focus on the bottom-up approach.
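As a rough sketch, bottom-up agglomerative clustering of word vectors can be run with scikit-learn as below; the word list, vectors, Ward linkage, and cluster count are placeholder assumptions chosen only for illustration.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder word vectors: 6 words in a 4-dimensional embedding space.
words = ["attack", "breach", "malware", "bank", "loan", "credit"]
vectors = np.random.RandomState(0).rand(6, 4)

# Bottom-up (agglomerative) clustering: start with singleton clusters
# and merge the closest pair until n_clusters remain.
clusterer = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = clusterer.fit_predict(vectors)

for word, label in zip(words, labels):
    print(word, "->", "cluster", label)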


Fig. 3. Bottom-up Hierarchical agglomerative clustering

(Source: https://www.geeksforgeeks.org/ml-hierarchical-clustering-agglomerative-and-divisive-clustering/)

In the survey section, I discuss an interesting paper from the references that shows how sentence embeddings can be used for sub-topic detection. I also discuss reference papers that demonstrate the application of word embeddings in the classification and clustering of arguments and in single-document summarization. Next is the 'Compare and Contrast Relevant Work' section, where I compare the word embedding applications in the referred papers and discuss the pros and cons of their approaches. Last is the conclusion section, where I summarize my study and analysis results.

Contextualized word embedding:

Contextualized word embeddings address the issue that words can have different senses depending on context. These methods compute a vector for a given word based on the specific context in which the word is used in a sentence. This is in contrast with static word embedding methods like word2vec or FastText, where a word is always mapped to the same vector, no matter what context it is used in.
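A minimal sketch of this idea with the Hugging Face transformers library is shown below; the model checkpoint and the example sentences are assumptions chosen only to show that the same surface word ('bank') receives different vectors in different contexts.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; any BERT-style encoder would illustrate the point.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    # Return the contextual vector of the first occurrence of `word`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = vector_for("she sat on the bank of the river", "bank")
v2 = vector_for("he deposited cash at the bank", "bank")

# Unlike word2vec/FastText, the two vectors differ because the contexts differ.
print(torch.cosine_similarity(v1, v2, dim=0).item())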


Survey of related work

My current Research module:

The topic under consideration is part of my ongoing research, where I am trying to apply word embeddings and clustering algorithms to the results obtained from one of the Ebiquity lab's projects, the 'Cyber Attack Sensing Information Extraction System (CASIE)'. CASIE is a system that extracts information about cybersecurity events from text and populates a semantic model to integrate into a knowledge graph of cybersecurity data; my goal is to classify the extracted 'roles' into sub-categories.

Papers Descriptions:

One of the papers in the references, titled 'A Method based on Sentence Embeddings for the Sub-Topics Detection' [1], talks about sentence embeddings for sub-topic detection. Since there is a huge similarity between sub-topics that belong to the same topic, most of the available methods cannot be directly applied to the task of sub-topic discovery within a given topic. This issue is addressed in the paper by introducing a new technique based on sentence embeddings. In sub-topic detection, the same word may have different meanings under different sub-topics. To enhance differentiation, the authors' idea was to allow each word to have different vector representations under different topics.

Approach: Initially, data from Weibo, a popular Chinese social media platform used for blogging and topic discussion, was fetched and pre-processed. Then latent Dirichlet allocation (LDA) was used to obtain the topics. A Topic Word Embedding (TWE) model was used to train the Weibo data set under a topic: the word and the corresponding topic were trained together to get the word embeddings and the topic embeddings. Later, by taking the cosine similarity between the word embeddings and the topic embeddings as the weight value, the word embeddings of the target words under all topics were weighted and added, thus extending the topic information into the word embeddings and enhancing their semantics. Next comes the p-means (power mean) method. It is used to merge a blog into a sentence embedding, which serves as the characteristic value of the blog; it is a generalized form of averaging word embeddings. The obtained sentence embeddings are then passed to the k-means clustering algorithm, and finally, the sub-topic clusters are obtained after completion of the k-means process.
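The sketch below gives a simplified, assumed version of this pipeline: a few power means (here only the mean, minimum, and maximum, i.e. p = 1, -inf, +inf) of the word vectors are concatenated into a sentence embedding, which is then clustered with k-means. The word vectors, blog texts, and cluster count are placeholders, and the TWE training step is not reproduced.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder topic-enhanced word vectors: token -> 4-dimensional vector.
rng = np.random.RandomState(0)
word_vectors = {w: rng.rand(4) for w in
                ["price", "drop", "stock", "rally", "match", "goal", "team"]}

def p_mean_embedding(tokens):
    # Concatenate several power means of the word vectors (p = 1, -inf, +inf).
    vecs = np.array([word_vectors[t] for t in tokens if t in word_vectors])
    return np.concatenate([vecs.mean(axis=0), vecs.min(axis=0), vecs.max(axis=0)])

blogs = [["stock", "price", "drop"], ["stock", "rally"], ["team", "goal", "match"]]
embeddings = np.array([p_mean_embedding(b) for b in blogs])

# Cluster the blog embeddings; each cluster is treated as one sub-topic.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)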

The second paper, 'Adapting word2vec to Named Entity Recognition' [2], applies word embeddings to classification in the task of Named Entity Recognition. Named Entity Recognition (NER) is a sequence prediction problem: given a tokenized text, the task is to predict which words belong to which predefined category.

Approach: Words were tokenized using the NLTK toolkit. Word2vec and FastText models were trained on different subsets of the tokenized data, and the authors used these models to build experimental analysis results. Models were generated using the default setting for vector dimensionality (say 100), as well as defaults for the other parameters such as the number of training iterations and the window size. The word embeddings were then applied to the classification process using a greedy implementation of the Linear Support Vector algorithm in the scikit-learn software with default values.
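A rough sketch of this classification step is shown below, under the simplifying assumption that each token is represented only by its embedding vector (the original work uses a richer greedy feature setup); the tokens, labels, and vectors are placeholders.

import numpy as np
from sklearn.svm import LinearSVC

# Placeholder embeddings with dimensionality 100, as in the default setting.
rng = np.random.RandomState(0)
embeddings = {w: rng.rand(100) for w in
              ["john", "works", "at", "google", "in", "london"]}

# Each training example is one token's vector with its NER label.
tokens = ["john", "works", "at", "google", "in", "london"]
labels = ["PER", "O", "O", "ORG", "O", "LOC"]
X = np.array([embeddings[t] for t in tokens])

# Linear Support Vector classifier with scikit-learn default values.
clf = LinearSVC().fit(X, labels)
print(clf.predict(X[:2]))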

Addressing cluster granularity - A practical challenge in generating clusters from word embeddings lies in choosing the relevant cluster granularity, i.e. the granularity that maintains the information relevant to the classification task at hand. Through their first experiment, the authors obtained a rough idea of a task-adequate cluster granularity: they evaluated three granularities (100, 1000, and 5000 clusters) extrinsically by considering their effect on the performance of the NER system. In this scenario, the granularity of 1000 performed best, suggesting that this is the right range. In the next step, the authors manually inspected the clusters built at this granularity; though rather noisy, they seemed to capture some regularities. Hierarchical clustering over multiple granularities is their solution to the granularity problem: they found considerable improvement in performance, with a gain for every added granularity. The paper's conclusion regarding granularity was that combining multiple cluster granularities led to their best improvement, but it did not improve performance for smaller data sets.
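One possible way to realize multiple cluster granularities, sketched below under the assumption that word-cluster IDs are used as categorical features, is to run k-means over the vocabulary vectors at several values of k and attach one cluster ID per granularity to every word. The vocabulary, vectors, and toy granularities here are placeholders (the paper's values were 100, 1000, and 5000).

import numpy as np
from sklearn.cluster import KMeans

# Placeholder vocabulary of 50 words with 20-dimensional vectors.
rng = np.random.RandomState(0)
vocab = ["word%d" % i for i in range(50)]
vectors = rng.rand(50, 20)

# Toy granularities standing in for the paper's 100 / 1000 / 5000 clusters.
granularities = [5, 10, 25]
cluster_ids = {g: KMeans(n_clusters=g, n_init=10, random_state=0).fit_predict(vectors)
               for g in granularities}

# Each word gets one categorical feature per granularity.
features = {w: tuple(cluster_ids[g][i] for g in granularities)
            for i, w in enumerate(vocab)}
print(features["word0"])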


Effect of unlabeled corpus size on performance - The authors studied what effect the size of the unlabeled corpus has on the performance of the NER system. Their experiments indicated that the size of the unlabeled corpus does not correlate directly with the percentage of word types occurring in the test data that are also covered by the word embeddings model. The conclusion was that the performance of the NER model improved as the unlabeled data set grew, but only up to a limit (in this case, around 300,000 types), at which point it even started to drop.

Yet another paper, titled 'Classification and Clustering of Arguments with Contextualized Word Embeddings' [3], explains the implementation of contextualized word embeddings using two methods, ELMo (Embeddings from Language Models) and BERT (Bidirectional Encoder Representations from Transformers). It explains how the power of contextualized word embeddings can be used to classify and cluster topic-dependent arguments.

The last paper I referred to, the IEEE paper 'Single-Document Summarization Using Sentence Embeddings and K-Means Clustering' [4], explains an idea for document summarization using sentence embeddings. It describes a K-means clustering implementation that creates clusters containing sentences with similar meaning. Sentence embeddings are processed by the K-means algorithm into a number of clusters that depends on the required summary size. Sentences in a given cluster contain similar information, and for each cluster the most appropriate sentence is picked and included in the summary by a ridge regression sentence scoring model.
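The sketch below illustrates only the clustering side of this pipeline under simplifying assumptions: the sentence embeddings are random placeholders, the number of clusters equals the desired summary length, and the sentence closest to each centroid is picked instead of the paper's ridge regression scoring model.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

sentences = ["Sentence one.", "Sentence two.", "Sentence three.",
             "Sentence four.", "Sentence five."]
# Placeholder sentence embeddings; a real system would compute them from the text.
embeddings = np.random.RandomState(0).rand(len(sentences), 8)

summary_size = 2                      # number of sentences in the summary
km = KMeans(n_clusters=summary_size, n_init=10, random_state=0).fit(embeddings)

# For each cluster, pick the sentence closest to the centroid as its representative.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
summary = [sentences[i] for i in sorted(closest)]
print(summary)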

Compare and Contrast Relevant Work

Word Embedding –

The first paper uses Latent Dirichlet Allocation (LDA) for topic modeling together with Topic Word Embedding (TWE) for the initial embedding phase; the resulting word embeddings are then merged into sentence embeddings and used for clustering. The second paper uses pre-trained word2vec models directly from the Gensim library. The third paper demonstrates the use of the contextualized word embedding methods ELMo and BERT, whereas the last paper utilizes sentence embeddings for creating a ridge regression sentence scoring model.


Clustering and classification -

The first paper utilizes k-means clustering, where each cluster represents a sub-topic of the topic under consideration. The second paper implements the Linear Support Vector algorithm using scikit-learn. The third paper uses agglomerative hierarchical clustering; the results were also tested with the k-means and DBSCAN clustering algorithms, but agglomerative hierarchical clustering provided better results. In the last paper, the K-means algorithm was used: sentences in the same cluster tend to describe similar information, and the most appropriate sentences were picked from the clusters to create a summary. Since the applications of all these algorithms differ, we cannot directly compare their performance; the choice of a particular method depends on the data and the requirements.

Comparison based on the performance of the word embedding techniques –

Based on paper [1], Fig. 4 below shows a comparison of the performance of different word embedding and clustering approaches for sub-topic detection in particular. We can achieve better results for sub-topic detection using the approach described first in the 'Survey of related work' section above. This approach is indicated in the table as 'Our Model', since the figure is taken from that paper.

Fig. 4. Comparison based on the performance of sub-topic detection


Conclusion

In this paper, I compared different word embedding techniques and explained their usage in clustering algorithms. There are some issues in word embedding that especially affect clustering techniques, such as homographs (different words sharing the same spelling) and inflection (alterations of a word to express different grammatical categories); these challenges need to be addressed. The paper also discussed the reference papers on sub-topic detection, Named Entity Recognition (NER), argument classification and clustering, and single-document summarization, where word embeddings are applied to clustering and classification models.

References

1. Yu Xie et al., 'A Method based on Sentence Embeddings for the Sub-Topics Detection', 2019 J. Phys.: Conf. Ser. 1168 052004; https://iopscience.iop.org/article/10.1088/1742-6596/1168/5/052004/pdf

2. Scharolta Katharina Siencnik, ‘Adapting word2vec to Named Entity Recognition’, Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015); https://www.ep.liu.se/ecp/109/030/ecp15109030.pdf

3. Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, Iryna Gurevych. ‘Classification and Clustering of Arguments with Contextualized Word Embeddings’ ACL 2019; https://arxiv.org/abs/1906.09821

4. Sanchit Agarwal, Nikhil Kumar Singh, Priyanka Meel, ‘Single-Document Summarization Using Sentence Embeddings and K-Means Clustering’, IEEE conf. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8748762
