ITC 4/49
Information Technology and Control
Vol. 49 / No. 4 / 2020, pp. 495-510
DOI: 10.5755/j01.itc.49.4.27118 (http://dx.doi.org/10.5755/j01.itc.49.4.27118)
Received 2020/07/04; accepted after revision 2020/08/08
HOW TO CITE: Mansoor, M., Rehman, Z. U., Shaheen, M., Khan, M. A., Habib, M. Deep Learning Based Semantic Similarity Detection Using Text Data. Information Technology and Control, 49(4), 495-510. https://doi.org/10.5755/j01.itc.49.4.27118

Deep Learning Based Semantic Similarity Detection Using Text Data
Muhammad Mansoor, Zahoor Ur Rehman
Computer Science Department, COMSATS University Islamabad, Attock Campus, Pakistan

Muhammad Shaheen
Faculty of Engineering & IT, Foundation University Islamabad, Pakistan

Muhammad Attique Khan
Department of Computer Science, HITEC University Taxila, Pakistan

Mohamed Habib
College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia; Faculty of Engineering, Port Said University, Port Fuad City, Egypt
Corresponding author: [email protected]
Similarity detection in text is the main task for a number of Natural Language Processing (NLP) applications. As textual data are comparatively larger in quantity and volume than numeric data, measuring textual similarity is one of the important problems. Most similarity detection algorithms are based upon word-to-word matching, sentence/paragraph matching, or matching of the whole document. In this research, a novel approach is proposed using deep learning models, combining a Long Short-Term Memory network (LSTM) with a Convolutional Neural Network (CNN), for measuring the semantic similarity between two questions. The proposed model takes sentence pairs as input to measure the similarity between them. The model is tested on the publicly available Quora dataset. In comparison to the existing techniques, the proposed model gave 87.50% accuracy, which is better than the previous approaches.

KEYWORDS: Deep Learning, Semantics, Similarity, Quora, question duplication, LSTM and CNN.
1. Introduction

In this digital era, the use of the web has increased due to the advancement of the internet. A huge amount of user-generated short text comes through the Internet as well as social networks, e.g. user opinions about products, blogs, and services. Such short text is basically subjective and semantics oriented. Detecting similarity is the problem of finding the score to which an input pair of texts has similar content. Words are similar either lexically, in which words have a similar sequence of characters, or semantically, in which words are used in the same context or have the same meaning. Measuring the text similarity between words, sentences, etc. plays a vital role in information retrieval, document clustering, short answer grading, automatic essay scoring, text summarization, and machine translation. In terms of research, it is valuable to detect and match short text at different granularities.

The volume of data available in the form of text is increasing rapidly. Therefore, the importance of NLP applications has increased for improving the analysis and retrieval of textual data. The computation of semantic similarity among sentences is the main task in many NLP applications like text classification and summarization [24], topic extraction [45], information retrieval [39], the automatic grading of short answers, finding duplicate questions [36], document clustering [22], etc. Different methods are available to measure semantic similarity among sentences. Most of these methods mainly depend on features extracted using different techniques. These features are then given to machine learning algorithms to find similarity scores between the sentences. However, their accuracy still needs to improve.

The similarity between texts of varying length produces better results when the length of the sentences is minimal. The techniques of similarity detection are based on capturing strong relationships in two sentences. For example, two sentences from the paper [35] are given below:
– In jewelry, a gem is a stone that is used.
– In the rings and necklaces, a jewel is a precious stone.

The above sentences are about jewels, and almost the same concept is explained in two different ways which can easily be interpreted by humans. Computing the similarity between two texts is very difficult for a machine due to the huge variance in natural languages. The authors in [14] presented a conceptual space framework, which permits the model to profess similarity [53] which "changes with changes in selective attention to specific perceptual properties." Meanwhile, short text has its own sole characteristics. Classifying short text is a difficult task since it presents many challenges related to the standard accuracy score. Due to these challenges and needs, text classification is in high demand. Recently, deep learning has shown huge advancement in many machine learning research areas such as text/signature [6], object classification [21, 47], visual surveillance [26, 51], biometrics [3], medical imaging [27, 29, 42, 48, 52], and many more [4, 25, 28, 30-32, 41]. In this research, a deep learning model is proposed for detecting similarity between short sentences such as question pairs. The contributions of the paper are the following:
1. A deep learning model is developed by using LSTM and CNN models to detect semantic similarity among short text pairs, specifically Quora question pairs.
2. The proposed method is utilized for finding similar questions, and its effectiveness in the detection of similar questions is validated.
3. A wide range of comparative studies for the semantic similarity detection problem is shown.

The rest of the paper is organized as follows. In Section 2, we discuss the related work comprising deep learning and semantic similarity. In Section 3, we propose our model in detail. The experimental setup is described in Section 4. Finally, Section 5 gives the conclusion.

1.1. Problem Statement

Online question-answering is gaining popularity in different IT-based web applications. Quora is an online platform for people to ask questions and connect with others who give quality answers. Every month, around 100 million people visit Quora and post their questions, so there are many chances that people may ask similar or existing questions. It will be beneficial for both users and the community to detect similar questions and to extract the relevant answers in less time. This research is focused on finding duplicate questions to improve the performance and efficiency of the questions asked on the Quora community. Let there be two questions, denoted by Q1 and Q2, respectively. These two questions are semantically equivalent or duplicates of each other if they convey the same meaning; otherwise, they are non-duplicate.

The problems addressed in this research are formulated as questions below:
Q1: What is the significance of similarity detection in textual data?
Q2: How to detect duplicate question pairs?
Q3: How can the efficiency of the existing algorithms be enhanced?

2. Related Work

Semantic similarity detection is a critical task for most NLP applications such as question answering, paraphrase detection, and Information Retrieval (IR). Many approaches have been proposed for the calculation of semantic similarity between sentences. Semantic similarity is the problem of identifying how similar a given pair of texts is.

2.1. Semantic Similarity Using Deep Learning

In NLP tasks, deep learning has played an important role in the past few years. Many researchers used a variety of features for paraphrase identification tasks, such as n-gram features [40], syntactic features [10], linguistic features [49], etc. The use of deep learning methods has moved researchers' consideration towards semantic representation of text [2], [8]. The use of CNNs for learning sentence representations achieved remarkable improvement in classification results. Recurrent Neural Networks (RNN), which can learn the relationships among different elements in an input sequence, were used by researchers to represent text features [34]. The authors in [33] gave a simple model that relied upon character-level inputs; however, for predictions they used word-level inputs. Although they utilized few parameters, their system outperformed some benchmark scores that employ word-level embedding. In addition, the authors in [11] proposed a CNN architecture for analyzing sentiments among short text pairs. Their proposed model used both networks, combining CNN and RNN for sentiment analysis [55].

2.2. Short Text Similarity

The authors in [59] proposed a MULTIP model to detect paraphrases in Twitter's short text messages. They modelled paraphrase relations for both sentences and words. In research conducted in [12], the overlap in unigram and bigram features is evaluated for paraphrase detection. The work did not consider the semantic similarity score. The F1 score for paraphrase detection in the said study was 0.674. In [58], the authors proposed a new recurrent model named Gated Recurrent Averaging Network (GRAN), which is based on AVG and LSTM and gave better results than both of them. The authors of [54] proposed simple features to execute PI and STS tasks on a Twitter dataset. Based on the experiments, they concluded that by overlapping word/n-gram features, word alignment by METEOR, BLEU scores, and Edit Distance, one can find semantic information in the Twitter dataset at a low cost. In [20], the authors presented a novel deep learning model for detecting paraphrases and semantic similarity among short texts like tweets. They also studied how different levels (character, word and sentence) of features could be extracted for modeling sentence representation. A character-level CNN is used for finding character n-gram features, a word-level CNN for word n-gram features, and an LSTM for finding sentence-level features. The model was tested on a Twitter dataset, and the results showed that character-level features play an important role in finding similarity between tweets. They also concluded that the combination of the two models gave better results than the individual ones.
2.3. Convolutional and Recurrent Neural Networks

The authors in [9] proposed an ensemble method that represents text features in an efficient way. The experimental results showed that the accuracy of the proposed approach largely depends on the size of the dataset. When the dataset is small, the system is vulnerable to overfitting; however, training on a large dataset gave good results. In [57], a model based on similarities and dissimilarities in the sentences is proposed. The model denoted each word as a single vector and calculated a matching vector with respect to every word in the other sentence; based on this matching score, the word vectors are separated into similar and dissimilar component matrices. A two-channel CNN model is then applied to capture feature vectors from the component matrices. The model was applied to the answer selection problem and showed good performance. In [38], an interpretability layer called iSTS is added, which depends on the arrangement of sentences among pairs of segments. The authors in [61] proposed an attention-based CNN for modeling sentence pairs for different NLP tasks and found that the attention-based CNN outperformed simple CNNs. A new deep learning model named BiMPM was proposed in [56]. In this model, for a given pair of sentence segments, the model first encoded the sentences using a BiLSTM and then matched the encoded sentences in both directions. In each matching direction, BiMPM compared sentence segments from different perspectives. A BiLSTM layer calculated the matching score, which is then used for making the final decision. In [50], deep learning techniques to re-rank short text pairs are proposed. They used a CNN to learn sentence representations and build a similarity matrix. The experimental results showed that deep learning models greatly improved the results.

In [19], a deep neural model is proposed for modelling sentences hierarchically. The proposed model achieved better results for sentence completion and response matching tasks, but it did not perform well on the paraphrase detection task. In [18], a Siamese gated neural network is proposed to find similar questions on Quora. The authors of [7] tried to detect semantically similar questions on an online user forum and used the data from the Stack Exchange forum. They presented a CNN-based approach which generated vector representations of given question pairs and scored them using a similarity matrix. The proposed CNN-based method was tested on data from Stack Exchange forums. The results of the study were compared with standard machine learning algorithms like support vector machines. Domain word embeddings were evaluated against Wikipedia word embeddings, and the accuracy of the CNN method was high. In [1], the authors attempted to detect paraphrases in short text using deep learning techniques. Some of the existing approaches for finding paraphrases give good results on clean text but fail to perform well on noisy and short texts. Therefore, a new, generic and robust architecture, which they called Deep Paraphrase, is proposed, combining CNN and RNN models. The experiments are conducted on a Twitter dataset, and the results showed that the proposed architecture performed well on both types of texts. In [43], semantic textual similarity in paraphrases of Arabic tweets is found by applying several text processing phases and extracting different features (lexical, syntactic and semantic) to overcome the limitations of existing approaches. Machine learning classifiers like support vector regressions were trained by using these extracted features. The accuracy of the model in PI and STS tasks was high. In [15], the authors proposed Siamese gated recurrent units together with machine learning algorithms like support vector machines, AdaBoost and random forest for predicting similar questions on the Quora dataset; they did not calculate accuracy, but rather cross-entropy loss. In [5], the authors used an Attentive Siamese LSTM for calculating semantic similarity among sentences. A different architecture, a bi-directional LSTM network with an attention mechanism, has been proposed in [37]. They performed their experiments on six different datasets for sentiment classification and one for question classification, and proved the model to be more accurate. In [46], the authors proposed a deep learning architecture which used attention mechanisms. In their method, they decomposed the large problem into smaller sub-problems that can easily be solved separately. Their proposed method gave better accuracy and outperformed complex deep neural models. A summary of the existing approaches is shown in Table 1.

The literature revealed that a lot of methods have been proposed for detecting similarity between short sentences, but these methods have some problems that need to be addressed. Different researchers applied different techniques to find similarity scores and tried to improve the accuracy. In this paper, our focus is to use deep learning models and develop a robust similarity detection system for two sentences. The model only requires a pair of sentences and finds the similarity between them. For this purpose, a combination of CNN and LSTM is used to find the similarity between such sentence pairs.
Table 1
Overview of deep learning approaches and their features

| Work | Method | Features | Dataset |
|------|--------|----------|---------|
| Mohammad, A.-S. et al. [43] | Multi-layer Neural Network | POS tagger and pre-trained NN | MSPR, Twitter |
| Agarwal, B. et al. [1] | Hybrid model (deep learning and statistical features) | Pre-trained word embedding and POS tagger | MSPR |
| Huang, J. et al. [20] | Multi-layer Perceptron | Pre-trained word embedding | Twitter |
| Yin, W. et al. [61] | Logistic Regression | Word2vec-based embedding | MSPR |
| Godbole, A. et al. [15] | Gated Recurrent Unit (GRU) Network | GloVe vectors pre-trained on Wikipedia | Quora Dataset |
| Severyn, A. and A. Moschitti [50] | Convolutional neural network | Word embedding | TREC: QA and Microblog Retrieval |
| Chen, G. et al. [9] | CNN-RNN model | word2vec embedding | Reuters-21578 and RCV1-v2 |
| Bogdanova, D. et al. [7] | Convolutional neural network | Skip-gram neural network architecture | Quora Dataset |
| Homma, Y. et al. [18] | Siamese Gated Recurrent Network | 300-dimensional GloVe vectors pre-trained on Wikipedia | Quora Dataset |
| Wang, Z. et al. [57] | Two-channel CNN model | Pre-trained word embedding by Mikolov et al. (2013) | QASent, WikiQA |
| Parikh, A.P. et al. [46] | Simple attention-based model | 300-dimensional GloVe vectors | SNLI |
| Bao, W. et al. [5] | Attentive Siamese LSTM | 300D word embedding (English and Chinese) | SemEval 2014 |
| Liu, G. and J. Guo [37] | AC-BiLSTM | Word embedding | TREC, SST1, SST2, IMDB etc. |
| Deep Similarity Model (proposed) | LSTM and CNN | Google pre-trained word embedding | Quora Dataset |
3. Proposed Work

In the field of text similarity detection, techniques based on deep learning have moved researchers towards semantically distributed representations [2], [5], [37], [13], [23], [60], [17]. To find the semantic relatedness among question pairs, we propose a technique based on deep learning. The goal is to find similar questions with reduced time and complexity. The method based on deep learning requires minimal preprocessing and takes only a pair of sentences, hence reducing the time complexity. Figure 1 shows the diagram for our proposed model.

Figure 1
Proposed flow diagram for similarity detection
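The building blocks of Figure 1 (word embedding, a CNN branch, an LSTM branch and a final classifier) can be made concrete with a short sketch. The code below is a minimal illustration only, assuming a Siamese arrangement in which both questions pass through one shared encoder; the layer sizes, kernel width and concatenation-based merge are illustrative assumptions, not the exact configuration of the proposed model. The 30-word cap, the 200000-word vocabulary and the 300-dimensional embeddings follow Sections 3.1-3.3.

```python
# Minimal sketch of an LSTM + CNN question-pair model (illustrative
# assumptions: shared Siamese encoder, layer sizes, concatenation merge).
from tensorflow.keras import layers, models

MAX_LEN = 30        # per-question word cap (Section 3.1)
NB_WORDS = 200000   # vocabulary cap (Section 3.1)
EMBED_DIM = 300     # Google word2vec dimensionality (Section 3.3)

def build_encoder():
    # Shared branch: embedding -> Conv1D n-gram features -> LSTM sequence features.
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(NB_WORDS, EMBED_DIM)(inp)  # weights from Algorithm 4
    x = layers.Conv1D(64, 3, activation="relu")(x)  # local n-gram features (CNN)
    x = layers.LSTM(64)(x)                          # sequence-level features (LSTM)
    return models.Model(inp, x)

encoder = build_encoder()
q1_in = layers.Input(shape=(MAX_LEN,), dtype="int32")
q2_in = layers.Input(shape=(MAX_LEN,), dtype="int32")
merged = layers.concatenate([encoder(q1_in), encoder(q2_in)])
output = layers.Dense(1, activation="sigmoid")(merged)  # duplicate / non-duplicate
model = models.Model([q1_in, q2_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```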
3.1. Pre-Processing

Cleansing Text: Preprocessing is one of the essential steps, especially in text mining. Regex plays an important role during the whole process. Regex is used to find interesting patterns in raw text and then perform the desired action. For our task, 200000 unique words were used in the whole process due to computational expenses.

In the first step (as shown in Algorithm 1), all the questions are tokenized and converted into sequences. A threshold of no more than 30 words per question is applied. Stop words, punctuation and noise are removed from the text by using NLTK, and the words are converted into lower case. A regex function is applied to convert words into their original form. Stemming is performed through the Snowball Stemmer¹ to remove the unnecessary portion of the text. A few examples of regex are given in Table 2.

Table 2
Regex examples

| Original | Converted |
|----------|-----------|
| what's | what is |
| 've | have |
| can't | cannot |
| 'nt | not |
| i'm | i am |
| 'll | will |
| 0s | 0 |
| e - mail | email |
| u s | American |
| 're | are |

Algorithm 1: Process texts in datasets
Input: Text
Output: List of Words
Notation: T: text, S: stops, SW: stopwords, STW: stem words, st: stemmer, stw: stemmed_words
Initialisation: T, S, SW, STW, st, stw
1: T ← T.lower().split()
2: if SW removes
3:   S ← set SW to English
4:   for w in T do
5:     if not w in S
6:     end if
7:   end for
8:   T ← w
9: end if
10: T ← join T
11: Clean the text using Regex
12: if STW
13:   T ← split T
14:   st ← english
15:   stw ← stem w in T
16:   T ← join stw
17: end if
18: return T

1 http://www.nltk.org/howto/stem.html
Algorithm 2: Data Processing
Input: Questions
Output: Labels
Notation: t1: texts_1, t2: texts_2, lb: labels, ti: test_ids, rd: reader, hd: header, tt1: test_texts_1, tt2: test_texts_2, qd: q_dict, trf: train_feat, tsf: test_feat
Initialisation: t1, t2, lb, ti, rd, hd, tt1, tt2
1: with open TDF as f (training file)
2:   rd ← read csv
3:   hd ← next of rd
4:   for v in rd
5:     append t1, t2, lb
6:   end for
7: end with
8: with open TDF as f (test file)
9:   rd ← read csv
10:   hd ← next of rd
11:   for v in rd
12:     append tt1, tt2, ti
13:   end for
14: end with
15: qd ← default dictionary
16: for i in questions
17:   qd ← concatenate questions
18: end for
19: trf ← intersect questions
20: tsf ← intersect questions

Algorithm 2 shows the process of collecting all the vocabulary words and making a dictionary of words. First, the training file is opened and all the questions in the training file are concatenated. After concatenation, the dictionary of words is formed using this text.

3.2. Tokenization and Sequence Creation

Tokenization simply divides a sentence into a list of words. We used the Keras tokenizer function to tokenize the strings and another important function, 'texts_to_sequences', to make sequences of the words. After converting the texts to sequences, the sequences are padded to the desired length, which is done by using the Max Length argument (in this case, the max length is 30). Input sequences that are shorter than the desired length are padded, while the longer ones are truncated.

Algorithm 3: Converting Text to Sequences
Input: Text
Output: Number Sequence
Notation: token: tokenizer, seq: sequences, tsseq: test_sequences, wi: word_index
Initialisation: token, seq, tsseq, wi
1: token ← tokenize words
2: seq ← tokenize words to sequences
3: tsseq ← tokenize words to test sequences
4: wi ← token.wi

3.3. Word Embedding

Word embedding converts words into real-valued vectors (known as embeddings) that retain different types of information about the words. They are used to represent the words in an N-dimensional vector space. Word embedding is of great importance as it denotes the meaning of the words, i.e. how words are semantically identical. Google's² word2vec representation of text is used in our study. The main advantage of using Google's word2vec is that it represents a word as a vector having 300 dimensions. It is used to learn word embeddings for all the words in the dataset.

Algorithm 4: Embedding Matrix Preparation
Input: Text
Output: Vectors
Notation: nbw: nb_words, em: embedding_matrix, w: word, wi: word_index, voc: vocab
Initialisation: nbw, em, w, wi, voc
1: nbw ← maximum number of words
2: em ← matrix of zeros
3: for w, i in wi
4:   if w in voc
5:     em[i] ← word vector
6:   end if
7: end for

2 http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
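Algorithms 2-4 map directly onto standard Keras and gensim calls. The sketch below is a minimal illustration, assuming texts_1 and texts_2 hold the two question columns collected in Algorithm 2 and that the pre-trained Google News vectors (see the footnote link) have been downloaded; the file path and the placeholder questions are assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_NB_WORDS = 200000  # vocabulary cap (Section 3.1)
MAX_LEN = 30           # pad shorter sequences, truncate longer ones
EMBED_DIM = 300        # Google word2vec dimensionality (Section 3.3)

# Placeholder question columns; in the paper these come from Algorithm 2.
texts_1 = ["how do i improve my english speaking"]
texts_2 = ["what is the best way to improve spoken english"]

# Algorithm 3: fit the tokenizer and convert question texts to padded sequences.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts_1 + texts_2)
seq_1 = pad_sequences(tokenizer.texts_to_sequences(texts_1), maxlen=MAX_LEN)
seq_2 = pad_sequences(tokenizer.texts_to_sequences(texts_2), maxlen=MAX_LEN)
word_index = tokenizer.word_index

# Algorithm 4: build the embedding matrix from Google's word2vec vectors
# (file name is an assumption; see the footnote for the pre-trained model).
word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
nb_words = min(MAX_NB_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((nb_words, EMBED_DIM))  # start with zeros (step 2)
for word, i in word_index.items():
    if i < nb_words and word in word2vec:           # unknown words stay zero
        embedding_matrix[i] = word2vec[word]
```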