EasyChair Preprint № 5404


April 28, 2021

Comparative Analysis of NLP models for Google Meet Transcript Summarization

Yash Agrawal1,a) Atul Thakre1,b) Tejas Tapas1,c) Ayush Kedia1,d) Yash Telkhade1,e) Vasundhara Rathod1,f)

1) Computer Science & Engineering, Shri Ramdeobaba College of Engineering and Management, Nagpur, India a) [email protected] , +91 7083645470 b) [email protected] , +91 8956758226 c) [email protected] , +918380073925 d) [email protected] , +91 8459811323 e) [email protected] , +91 9021067230 f) [email protected], +918055225407

Abstract. Manual transcription and summarization is a cumbersome process, necessitating the development of an efficient automatic text summarization technique. In this study, a Chrome extension is used to make the transcription process hassle-free. It uses text summarization techniques to generate concise and succinct summaries. The tool is also integrated with Google Translate to convert the processed text into the user's desired language. This paper illustrates how captions can be captured from online meetings and how the corresponding meeting transcript is sent to the backend, where it is summarized using an NLP model. It also walks through three different NLP models and presents a comparative study among them. The NLTK model uses a sentence-ranking technique for extractive summarization. The Word Embedding model uses pre-trained GloVe embeddings for extractive summarization. The T5 model performs abstractive summarization using the transformer architecture. The models are tested on meeting texts taken from various sources, and the results show that the NLTK model has an edge over the Word Embedding model based on ROUGE-1, ROUGE-2, and ROUGE-L scores. However, our analysis finds that T5 generates a more concise summary.

INTRODUCTION

In today's world, an enormous amount of textual material is generated, and the volume is only growing every single day. On average, 2.5 quintillion bytes of data are produced by humans every day. In such a scenario, manually analyzing and interpreting text becomes difficult. Consuming the data in its original unstructured form is time-consuming and inefficient.

This paper focuses on data generated from online video conferencing platforms like Google Meet. Statistics show an 87% increase in people using video conferencing systems for daily communication. In addition, $37 billion is wasted annually in the U.S. on unproductive meetings, giving rise to the need for automatic text summarization. Automatic text summarization is an AI-driven process that creates a concise text from the most important information without changing its meaning. This reduces manual effort and is much needed in online meetings. It also helps generate summaries that are less biased than human-generated summaries.

This paper uses Natural Language Processing (NLP) models as the tool for automatic text summarization. NLP, a sub-field of Artificial Intelligence, enables a computer to read, hear, and interpret text in human language, and to determine which parts of the text are important. Hence, NLP is the most appropriate tool for producing a summary.

Types of Text Summarization

Automatic text summarization [1] is mainly divided into two categories, extractive and abstractive summarization, based on the nature of the generated summarized text. Extractive summarization summarizes text by extracting the most important sentences of the entire text; the selected subset of sentences appears as-is in the summarized text. It is a selection-based approach. Abstractive summarization attempts to simulate the human ability to generate entirely new sentences without misrepresenting the meaning of the original text. This paper presents a comparative analysis of these two approaches.

METHODOLOGY USED

Data Preprocessing

The text received from the Chrome extension contains redundant information such as the name of the speaker, the date, and other unnecessary details. This data needs to be cleaned, integrated, transformed, and loaded before it passes through the NLP model for summarization. Text preprocessing also includes the removal of stop words (trivial words such as "a", "the", "is", "are", etc.). Hence, data pre-processing holds great importance before the main processing step begins.

Tokenization

Tokenization is a key step in every summarization technique, whether abstractive or extractive, and whether it uses a bag-of-words technique or an advanced neural network architecture. The next step that follows tokenization is to represent each token in mathematical form. Some encoding techniques use simple numbering to represent words, while complex word embeddings use multidimensional vector representations of each word. These representations can then be used to find the semantic relationship between different tokens. Sentence tokenization is useful for finding similarities between different sentences and for establishing the importance of a particular sentence in the given text. This process of sentence selection is the basis of many extractive summarization techniques.
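As a concrete illustration, the following sketch uses NLTK's tokenizers and stop-word list on a small sample text; the sample text and variable names are illustrative assumptions only.

```python
# Minimal sketch of tokenization and stop-word removal with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("stopwords", quiet=True)   # stop-word list

text = "A child goes to the park. The child starts playing in the park."

sentences = sent_tokenize(text)                      # sentence tokens
words = word_tokenize(text.lower())                  # word tokens
stop_words = set(stopwords.words("english"))
content_words = [w for w in words if w.isalpha() and w not in stop_words]

print(sentences)      # ['A child goes to the park.', 'The child starts playing in the park.']
print(content_words)  # ['child', 'goes', 'park', 'child', 'starts', 'playing', 'park']
```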

Stemming and Lemmatization

The basic idea of stemming and lemmatization [2] is to generate the root word of inflected words. Stem words may or may not have a meaning, whereas lemmatization always gives a meaningful word representation. For words like 'finally', 'final', and 'finalized', stemming will generate the root word 'fina', which has no meaning in the English language, while lemmatization will generate 'final' as a meaningful root word [3].
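The sketch below contrasts NLTK's PorterStemmer and WordNetLemmatizer on a few words; note that the exact stems produced depend on the stemming algorithm used, so the outputs may differ slightly from the example above.

```python
# Contrasting stemming and lemmatization with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["finally", "final", "finalized", "studies"]:
    # The stem may not be a valid English word; the lemma always is.
    print(f"{word:10s} stem: {stemmer.stem(word):10s} lemma: {lemmatizer.lemmatize(word)}")
```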

VARIOUS MACHINE LEARNING MODELS

NLTK Model

Natural Language Toolkit (NLTK) is a Python library that is a powerful tool for computational linguistics. It is a text processing library for human language data and is known for supporting powerful functions like text tokenization, stemming, and text classification. The summarizer built on it is an extractive text summarization model and works on the principle of generating summaries by selecting top-ranked sentences.

Example: Let the input text be: "A child goes to the park. The child starts playing in the park. In the evening, the child went home."

It goes through the following five processes:

Data Cleaning
Cleaned text: child goes park. child starts playing park. evening, child went home.

Tokenization
Word Tokenization: ['child', 'goes', 'park', '.', 'child', 'starts', 'playing', 'park', '.', 'evening', ',', 'child', 'went', 'home', '.']
Sentence Tokenization: ['child goes park.', 'child starts playing park.', 'evening, child went home.']

Generating Word Frequency Table
This step involves finding the frequency of each word in the text [refer Table 1]. This step is necessary because the frequencies are used to determine the sentence scores.

TABLE 1

Word       Frequency
child      3
goes       1
park       2
starts     1
playing    1
evening    1
went       1
home       1

Sentence Scoring
Here a sentence score is calculated and allocated to each sentence in the text as the sum of the word frequencies of the words present in that sentence [refer Table 2].

TABLE 2

Sentence                      Sentence Score
child goes park               3 + 1 + 2 = 6
child starts playing park     3 + 1 + 1 + 2 = 7
evening child went home       1 + 3 + 1 + 1 = 6

Summary
Sentences are prioritized according to their sentence scores, and the summary is formed by selecting the top-ranking sentences based on a predefined compression ratio that can be specified by the user. A code sketch of this pipeline is given below.
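The following is a compact sketch of the five-step pipeline described above, assuming NLTK's tokenizers and stop-word list are available; the compression_ratio parameter and helper names are illustrative assumptions rather than the exact implementation.

```python
# Frequency-based extractive summarization: clean, tokenize, build a
# word-frequency table, score sentences, and keep the top-ranked ones.
import heapq
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def summarize(text, compression_ratio=0.4):
    stop_words = set(stopwords.words("english"))
    sentences = sent_tokenize(text)

    # Word-frequency table over non-stop-words (Table 1 in the example).
    freq = {}
    for word in word_tokenize(text.lower()):
        if word.isalpha() and word not in stop_words:
            freq[word] = freq.get(word, 0) + 1

    # Sentence scores: sum of frequencies of the words in each sentence (Table 2).
    scores = {}
    for sent in sentences:
        for word in word_tokenize(sent.lower()):
            if word in freq:
                scores[sent] = scores.get(sent, 0) + freq[word]

    # Keep the top-ranked sentences, preserving their original order.
    n_keep = max(1, int(len(sentences) * compression_ratio))
    top = set(heapq.nlargest(n_keep, scores, key=scores.get))
    return " ".join(s for s in sentences if s in top)

text = ("A child goes to the park. The child starts playing in the park. "
        "In the evening, the child went home.")
print(summarize(text))
```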

Word Embedding Model

GloVe stands for Global Vectors for Word Representation, a pre-trained set of word embeddings [4]. It is trained on 6 billion words with a vocabulary size of 400,000, and each word is represented as a 100-dimensional vector. These vectors are obtained by an unsupervised machine learning algorithm operating on aggregated global word-word co-occurrence statistics. The limitation of models based on the bag-of-words or TF-IDF (Term Frequency – Inverse Document Frequency) approach is that they use simple incremental encoding, which does not convey any semantic relationship or similarity between words.

Consider four words: "King", "Queen", "Man", and "Woman". Let V-King be the 100-dimensional vector representation of the word "King", and similarly for the other words. Then, V-Queen ≈ V-King − V-Man + V-Woman. This example clearly shows the correlation between vectors of words that are closely related to each other. Thus, using these vectors to represent tokens can prove to be quite effective.

Just like the bag-of-words model, the first step is to obtain a one-hot-style encoding, which creates a vocabulary of the given text. The GloVe vectors can then be used to create an embedding for this vocabulary, forming the base representation of the corpus to be summarized. A consolidated vector representation for each tokenized sentence can be created using the embedding matrix. To build a semantic relationship between different sentences of the corpus, a mathematical measure is needed for calculating the similarity between each pair of sentences. This paper uses cosine similarity as that measure.

If X and Y are two vectors of the form (X_1, X_2, ..., X_{100}) and (Y_1, Y_2, ..., Y_{100}), then the cosine similarity between the two vectors can be calculated as:

\text{Cosine similarity}(X, Y) = \frac{X \cdot Y}{|X| \times |Y|}

where |X| is the magnitude of the vector X and can be calculated as:

|X| = \sqrt{X \cdot X}

The similarity matrix is converted into a graph, with sentences as nodes and similarity values as weighted edges between those nodes. This representation can then be passed to an algorithm that calculates a score for each sentence in the graph. This paper uses the TextRank algorithm for score calculation. Sentences are then ranked on the basis of their scores, and the top N sentences are picked. The value of N can be selected based on the specific use case and the amount of compression required for the original text. A sketch of this pipeline is given below.
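The following is a minimal sketch of this extractive pipeline. It assumes the pre-trained 100-dimensional GloVe file "glove.6B.100d.txt" has been downloaded locally and that NLTK's tokenizers are available; the file path, helper names, and the choice of averaging word vectors per sentence are illustrative assumptions rather than the exact implementation used in the paper.

```python
# Sentence vectors from GloVe embeddings, a cosine-similarity graph, and
# TextRank (PageRank) scores for extractive summarization.
import numpy as np
import networkx as nx
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

def load_glove(path="glove.6B.100d.txt"):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

def sentence_vector(sentence, embeddings, dim=100):
    words = [w.lower() for w in word_tokenize(sentence) if w.isalpha()]
    vectors = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(x, y):
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom else 0.0

def textrank_summary(text, embeddings, top_n=2):
    sentences = sent_tokenize(text)
    vectors = [sentence_vector(s, embeddings) for s in sentences]

    # Similarity matrix -> graph: sentences as nodes, similarities as edge weights.
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = cosine(vectors[i], vectors[j])

    scores = nx.pagerank(nx.from_numpy_array(sim))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(ranked))
```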

T5 Model

The Transformer architecture addresses sequence-to-sequence tasks, handling the dependencies between input and output with attention rather than recurrence. T5 [5], the Text-to-Text Transfer Transformer, plays a significant role in abstractive summarization and gives effective results. The capability of transfer learning comes from pre-training a model on large amounts of available text with self-supervised tasks, such as language modelling or filling in missing words. The model can then be fine-tuned on smaller labelled datasets to achieve greater performance on the labelled text itself. In recent times, XLNet, Reformer, and BERT have become well-known transfer learning methods, and the abundance of such methods makes it difficult to predict which among them performs best. Unlike BERT-style models, T5 reframes all Natural Language Processing tasks into a text-to-text format in which the input and output are always text strings. T5 performs four main tasks: summarization, translation, classification, and regression. The Chrome extension makes use of the summarization module of T5, which works in four steps (sketched in the code below):
1. Mapping the words of the text to unique identifiers.
2. Mapping those unique identifiers to vectors learned during training; this representation of words is known as encoding.
3. Generating the output using 'T5ForConditionalGeneration' with 't5-small' as the pre-trained checkpoint.
4. Decoding the output back to human-readable summarized text.
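Below is a minimal sketch of these four steps using the Hugging Face transformers library with the 't5-small' checkpoint; the generation parameters (beam size and length limits) are illustrative assumptions.

```python
# T5 abstractive summarization with the Hugging Face transformers library.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

transcript = "..."  # cleaned meeting transcript from the Chrome extension

# Steps 1-2: map words to identifiers and encode them; T5 is text-to-text,
# so the summarization task is signalled with a "summarize:" prefix.
inputs = tokenizer("summarize: " + transcript, return_tensors="pt",
                   max_length=512, truncation=True)

# Step 3: generate the summary with the pre-trained conditional-generation model.
summary_ids = model.generate(inputs["input_ids"], num_beams=4,
                             min_length=30, max_length=120, early_stopping=True)

# Step 4: decode the identifiers back to human-readable summarized text.
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```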

The various other uses of T5 include closed-book question answering and even fill-in-the-blank and one-word questions. The flexibility provided by T5 makes it versatile and adaptable to many different applications.

ROUGE ANALYSIS

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric for evaluating summaries generated by NLP models. It counts the number of overlapping units between a human-generated (reference) summary and the automatically generated summary produced by an NLP model. It is used to measure Precision, Recall, and F-measure.

Precision: It measures how much of the automatically generated summary is relevant or needed.

\text{Precision} = \frac{\text{Number of overlapping words}}{\text{Number of words in the system summary}}

Recall: It measures how much of the human-generated summary is captured by the automatically generated summary.

\text{Recall} = \frac{\text{Number of overlapping words}}{\text{Number of words in the reference summary}}

F measure: It is the harmonic mean of Precision and Recall.

F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

There are different methods to calculate the overlap between summaries [6]:

ROUGE-1: It measures the presence of individual words (unigrams) of the automatically generated summary in the human-generated summary.
ROUGE-2: It detects the presence of bigrams (pairs of adjacent words) of the automatically generated summary in the human-generated summary.
ROUGE-L: It calculates overlap using the Longest Common Subsequence algorithm and does not require a predefined n-gram length.

Example:
Model Summary: Rahul was found near the park
Reference Summary: Rahul was near the park
Model Summary Unigrams: Rahul, was, found, near, the, park
Reference Summary Unigrams: Rahul, was, near, the, park

ROUGE-1 (Precision) = 5/6        ROUGE-1 (Recall) = 5/5

Model Summary Bigrams: Rahul was, was found, found near, near the, the park
Reference Summary Bigrams: Rahul was, was near, near the, the park

ROUGE-2 (Precision) = 3/5        ROUGE-2 (Recall) = 3/4

ROUGE-L (Precision) = 5/6        ROUGE-L (Recall) = 5/5
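A minimal sketch of the overlap counting behind these numbers is shown below; it computes clipped n-gram overlap directly rather than calling a ROUGE package, and reproduces the unigram and bigram values of the worked example.

```python
# Clipped n-gram overlap for ROUGE-1 / ROUGE-2 precision and recall.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(model_summary, reference_summary, n):
    model_ngrams = Counter(ngrams(model_summary.lower().split(), n))
    ref_ngrams = Counter(ngrams(reference_summary.lower().split(), n))
    # Each n-gram counts at most as often as it appears in both summaries.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in model_ngrams.items())
    precision = overlap / max(sum(model_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    return precision, recall

model = "Rahul was found near the park"
reference = "Rahul was near the park"
print(rouge_n(model, reference, 1))  # ROUGE-1: precision 5/6, recall 5/5
print(rouge_n(model, reference, 2))  # ROUGE-2: precision 3/5, recall 3/4
```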

For the comparison of the extractive models, i.e., NLTK and Word Embedding, the BBC News Summary dataset [7] was used (a dataset intended for extractive text summarization models, containing around 417 political news articles, each paired with a reference summary). Both models were used to generate a summary, and ROUGE-1, ROUGE-2, and ROUGE-L values were calculated for each summary. Figure 1 shows a comparative analysis of the different ROUGE values for a particular text in the dataset. It is evident from Figure 1 that the ROUGE-1 and ROUGE-L values are greater than the ROUGE-2 values. A more or less similar trend was observed for the other articles as well. It is observed that the NLTK model performed better in the ROUGE analysis than the Word Embedding model.

TEST 1 ANALYSIS

Model    Rouge-1                             Rouge-2                             Rouge-L
         Recall   Precision  F-Measure       Recall   Precision  F-Measure       Recall   Precision  F-Measure
NLTK     0.767    0.731      0.713           0.701    0.759      0.651           0.769    0.710      0.732
Glove    0.646    0.579      0.363           0.367    0.417      0.435           0.573    0.635      0.521

FIGURE 1

Anomaly of Rouge for T5

T5 is based on abstractive summarization, which introduces new words; hence the use of overlapping words as a metric is not appropriate. Till now, no concrete parameter for the evaluation of abstractive summaries exists; however, some preliminary research is going on in the field of ROUGE-AR [8], where AR stands for Anaphora Resolution.

TEST 2 ANALYSIS

Model    Rouge-1                             Rouge-2                             Rouge-L
         Recall   Precision  F-Measure       Recall   Precision  F-Measure       Recall   Precision  F-Measure
T5       0.312    0.442      0.365           0.116    0.166      0.137           0.302    0.471      0.368

FIGURE 2

CONCLUSION

The NLP models presented in this paper were tested over 100 Google Meet sessions with different topics and intents in every meeting. The meet transcript was generated with relevant information such as the host name, meeting ID, and date of the meeting. After data pre-processing, the key points from the meet transcripts were successfully extracted and converted to human-readable text. The paper covered three models for text summarization: two of them, the NLTK and GloVe models, were extractive summarizers, and T5 was an abstractive text summarizer. After comparing the results generated by these summarizers, it was found that T5 gave the best results, which were suitable for the Chrome extension. Hence the system successfully extracted the meet transcripts and converted them to human-readable text using the abstractive summarizer T5. The future scope of this project is to expand it further to other video conferencing tools such as Cisco WebEx, BlueJeans, and other platforms.

REFERENCES

1. Deepali K. Gaikwad and C. Namrata Mahender, "A Review Paper on Text Summarization," International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 3, March 2016.
2. Divya Khyani, Siddhartha, Niveditha N M, and Divya B M, "An Interpretation of Lemmatization and Stemming in Natural Language Processing," Journal of University of Shanghai for Science and Technology.
3. Giorgio Maria Di Nunzio and Federica Vezzani, "A Study on Stemming vs Lemmatization: A Linguistic Failure Analysis of Classification of Medical Publications."
4. Jeffrey Pennington, Richard Socher, and Christopher D. Manning, "GloVe: Global Vectors for Word Representation," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, October 25-29, 2014.
5. Adam Roberts and Colin Raffel, "Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer."
6. Chin-Yew Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).
7. Pariza Sharif, "BBC News Summary: Extractive Summarization of BBC News Articles."
8. Sydney Maples, "The ROUGE-AR: A Proposed Extension to the ROUGE Evaluation Metric for Abstractive Text Summarization," Symbolic Systems Department, Stanford University.