
ITC 4/49
Information Technology and Control
Vol. 49 / No. 4 / 2020, pp. 495-510
DOI 10.5755/j01.itc.49.4.27118 (http://dx.doi.org/10.5755/j01.itc.49.4.27118)
Received 2020/07/04; accepted after revision 2020/08/08

HOW TO CITE: Mansoor, M., Rehman, Z. U., Shaheen, M., Khan, M. A., Habib, M. Deep Learning Based Semantic Similarity Detection Using Text Data. Information Technology and Control, 49(4), 495-510. https://doi.org/10.5755/j01.itc.49.4.27118

Deep Learning Based Semantic Similarity Detection Using Text Data

Muhammad Mansoor, Zahoor Ur Rehman
Computer Science Department, COMSATS University, Attock Campus
Muhammad Shaheen
Faculty of Engineering & IT, Foundation University Islamabad, Pakistan
Muhammad Attique Khan
Department of Computer Science, HITEC University, Pakistan
Mohamed Habib
College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia; Faculty of Engineering, Port Said University, Port Fuad City, Egypt

Corresponding author: [email protected]

Similarity detection in text is a main task for a number of Natural Language Processing (NLP) applications. As textual data are comparatively larger in quantity and volume than numeric data, measuring textual similarity is one of the important problems. Most similarity detection algorithms are based upon word-to-word matching, sentence/paragraph matching, or matching of the whole document. In this research, a novel approach is proposed using deep learning models, combining a Long Short-Term Memory network (LSTM) with a Convolutional Neural Network (CNN) for measuring semantic similarity between two questions. The proposed model takes sentence pairs as input to measure the similarity between them. The model is tested on Quora's publicly available dataset. In comparison to the existing techniques, the proposed model achieved an accuracy of 87.50 %, which is better than the previous approaches.

KEYWORDS: Deep Learning, Semantics, Similarity, Quora, question duplication, LSTM and CNN.

1. Introduction

In this digital era, the use of the web has increased due to the advancement of the Internet. A huge amount of user-generated short text comes through the Internet as well as social networks, e.g. user opinions about products, blogs, and services. Such short text is basically subjective and semantics oriented. Detecting similarity is the problem of finding the score to which the input pair of texts has similar content. Words are similar either lexically, in which words have a similar sequence of characters, or semantically, in which words are used in the same context or have the same meaning. Measuring the text similarity between words, sentences, etc. plays a vital role in information retrieval, document clustering, short answer grading, automatic essay scoring, text summarization, and machine translation. In terms of research, it is valuable to detect and match short text at different granularities.

The volume of data available in the form of text is increasing rapidly. Therefore, the importance of NLP applications has increased for improving the analysis and retrieval of textual data. The computation of semantic similarity among sentences is the main task in many NLP applications like text classification and summarization [24], topic extraction [45], information retrieval [39], the grading of automatically generated short answers, finding duplicate questions [36], document clustering [22], etc. Different methods are available to measure semantic similarity among sentences. Most of the methods mainly depend on features extracted by using different techniques. These features are then given to machine learning algorithms to find similarity scores between the sentences. However, their accuracy still needs to improve.

Similarity detection between texts of varying length produces better results when the length of the sentences is minimal. The techniques of similarity detection are based on capturing strong relationships in two sentences. For example, two sentences from the paper [35] are given below:
_ In jewelry, a gem is a stone that is used.
_ In the rings and necklaces, a jewel is a precious stone.
The above sentences are about jewels, and almost the same concept is explained in two different ways, which can easily be interpreted by humans. Computing the similarity between two texts is very difficult for the machine due to the huge variance in natural languages. The authors in [14] presented a conceptual space framework, which permits the model to profess similarity [53] which "changes with changes in selective attention to specific perceptual properties." Meanwhile, short text has its own sole characteristics. Classifying short text is a difficult task since it provides many challenges related to the standard accuracy score. Due to these challenges and needs, text classification is in high demand. Recently, deep learning has shown huge advancement in many machine learning research areas such as text/signature [6], object classification [21, 47], visual surveillance [26, 51], biometrics [3], medical imaging [27, 29, 42, 48, 52], and many more [4, 25, 28, 30-32, 41]. In this research, a deep learning model is proposed for detecting similarity between short sentences such as question pairs.

The contributions of the paper are the following:
1 A deep learning model is developed by using LSTM and CNN models to detect semantic similarity among short text pairs, specifically Quora question pairs.
2 The proposed method is utilized for finding similar questions, and its effectiveness is validated in the detection of similar questions.
3 A wide range of comparative studies is presented for the semantic similarity detection problem.

The rest of the paper is organized as follows. In Section 2, we discuss the related work comprising deep learning and semantic similarity. In Section 3, we propose our model in detail. The experimental setup is described in Section 4. Finally, Section 5 gives the conclusion.

1.1. Problem Statement

Online question-answering is gaining popularity in different IT based web applications. Quora is an online platform for people to ask questions and connect with others who give quality answers. Every month, around 100 million people visit Quora and post their questions, so there are many chances that people may ask similar or existing questions. It will be beneficial for both users and the community to detect similar questions and to extract the relevant answers in less time. This research is focused on finding duplicate questions to improve the performance and efficiency of the questions asked on the Quora community. Let there be two questions denoted by Q1 and Q2, respectively. These two questions are semantically equivalent, or duplicates of each other, if they convey the same meaning; otherwise they are non-duplicates.

The problems addressed in this research are formulated as questions below:
Q1: What is the significance of similarity detection in textual data?
Q2: How to detect duplicate question pairs?
Q3: How can the efficiency of the existing algorithms be enhanced?

2. Related Work

Semantic similarity detection is a critical task for most NLP applications such as question answering, paraphrase detection and Information Retrieval (IR). Many approaches have been proposed for the calculation of semantic similarity between sentences. Semantic similarity is the problem of identifying how similar a given pair of texts is.

2.1. Semantic Similarity Using Deep Learning

In NLP tasks, deep learning has played an important role in the past few years. Many researchers used a variety of features for paraphrase identification tasks, like n-gram features [40], syntactic features [10], linguistic features [49], etc. The use of deep learning methods has moved researchers' attention towards semantic representation of text [2], [8]. The use of CNN for learning sentence representations achieved remarkable improvement in classification results. Recurrent Neural Networks (RNN), which can learn the relationship among different elements in an input sequence, were used by researchers to represent text features [34]. The authors in [33] gave a simple model that relied upon character level inputs; however, for predictions they used word level inputs. Although they utilized few parameters, their system outperformed some benchmark scores that employ word level embedding. In addition, the authors in [11] proposed a CNN architecture for analyzing sentiments among short text pairs. Another proposed model used both networks and combined CNN and RNN for sentiment analysis [55].

2.2. Short Text Similarity

The authors in [59] proposed a MULTIP model to detect paraphrases in Twitter's short text messages. They modelled paraphrase relations for both sentences and words. In research conducted in [12], the overlap in unigram and bigram features is evaluated for paraphrase detection. The work did not consider the semantic similarity score. The F1 score for paraphrase detection in the said study was 0.674. In [58], the authors proposed a new recurrent model named Gated Recurrent Averaging Network (GRAN), which is based on AVG and LSTM and gave better results than both of them. The authors of [54] proposed simple features to execute PI and STS tasks on a Twitter dataset. Based on the experiments, they concluded that by overlapping words/n-grams, word alignment by METEOR, and BLEU and Edit Distance scores, one can find semantic information of the Twitter dataset at a low cost. In [20], the authors presented a novel deep learning model for detecting paraphrases and semantic similarity among short texts like tweets. They also studied how different levels (character, word and sentence) of features could be extracted for modeling sentence representation. A character level CNN is used for finding character n-gram features, a word level CNN for word n-gram features and an LSTM for finding sentence level features. The model was tested on a Twitter dataset, and the results showed that character level features play an important role in finding similarity between tweets. They also concluded that the combination of the two models gave better results than the individual ones.

2.3. Convolutional and Recurrent Neural Network

The authors in [9] proposed an ensemble method that represents text features in an efficient way. The experimental results showed that the accuracy of the proposed approach largely depends on the size of the dataset. When the dataset is small, the system is vulnerable to overfitting. However, training on a large dataset gave good results. In [57], a model based on similarities and dissimilarities in the sentences is proposed. The model denoted each word as a single vector and calculated a matching vector with respect to every word in the other sentence; based on this matching score, the word vectors are separated into similar and dissimilar component matrices. A two-channel CNN model is then applied to capture feature vectors from the component matrix. The model was applied on the answer selection problem and showed good performance. In [38], an interpretability layer called iSTS is added, which depends on the arrangement of sentences among pairs of segments. The authors in [61] proposed an attention-based CNN for modeling sentence pairs for different NLP tasks and found that the attention-based CNN outperformed simple CNNs. A new deep learning model named BiMPM was proposed in [56]. In this model, for a given pair of sentence segments, the model first encoded the sentences by using BiMPM and then matched the encoded sentences in both directions. In each matching direction, BiMPM compared sentence segments from different perspectives. The BiLSTM layer calculated the matching score, which is then used for making the final decision. In [50], deep learning techniques to re-rank short text pairs are proposed. They used CNN to learn sentence representations and build a similarity matrix. The experimental results showed that deep learning models greatly improved results.

In [19], a deep neural model is proposed for modelling sentences hierarchically. The proposed model achieved better results for sentence completion and response matching tasks. The model did not perform well on the paraphrase detection task. In [18], a Siamese gated neural network is proposed to find similar questions on Quora. The authors of [7] tried to detect semantically similar questions on an online user forum and used data from the Stack Exchange forum. They presented a CNN-based approach which generated vector representations of given question pairs and scored them using a similarity matrix. The proposed CNN-based method was tested on data from Stack Exchange forums. The results of the study were compared with standard machine learning algorithms like support vector machines. Domain word embeddings were evaluated against Wikipedia word embeddings, and the accuracy of the CNN method was high. In [1], the authors attempted to detect paraphrases in short text using deep learning techniques. Some of the existing approaches to finding paraphrases gave good results on clean text but failed to perform well on noisy and short texts. Therefore, a new, generic and robust architecture which they called Deep Paraphrase is proposed, which combines CNN and RNN models. The experiments were conducted on a Twitter dataset, and the results showed that the proposed architecture performed well on both types of texts. In [43], semantic textual similarity in paraphrases of Arabic tweets is found by applying several text processing phases and extracting different features (lexical, syntactic and semantic) to overcome the limitations of existing approaches. Machine learning classifiers like support vector regressions were trained by using these extracted features. The accuracy of the model in PI and STS tasks was high. In [15], the authors proposed Siamese Gated units with machine learning algorithms like support vector machines, adaboost and random forest for predicting similar questions on the Quora dataset; they did not calculate accuracy but they calculated cross entropy loss. In [5], the authors used Attentive Siamese LSTM for calculating semantic similarity among sentences. A different architecture having a bi-directional LSTM network with an attention mechanism has been proposed [37]. They performed their experiments on six different datasets for the classification of sentiments and one for question classification and proved the model to be more accurate. In [46], the authors proposed a deep learning architecture which used attention mechanisms. In their method, they decomposed the large problem into smaller subproblems that can be easily solved separately. Their proposed method gave better accuracy and outperformed complex deep neural models. A summary of the existing approaches is shown in Table 1.

The literature revealed that a lot of methods have been proposed for detecting similarity between short sentences, but these methods have some problems that need to be addressed. Different researchers applied different techniques to find similarity scores and tried to improve the accuracy. In this paper, our focus is to use deep learning models and develop a robust similarity detection system for two sentences. The model only requires a pair of sentences and finds the similarity between them. For this purpose, a combination of CNN and LSTM is used to find similarity between such sentence pairs.

Table 1
Overview of Deep Learning Approaches and their Features

Work | Method | Features | Dataset
Mohammad, A.-S. et al. [43] | Multi-layer Neural Network | POS Tagger and pre-trained NN | MSPR, Twitter
Agarwal, B. et al. [1] | Hybrid model (deep learning and statistical features) | Pre-trained word embedding and POS Tagger | MSPR
Huang, J. et al. [20] | Multi-layer Perceptron | Pre-trained word embedding | Twitter
Yin, W. et al. [61] | Logistic Regression | Word2vec based embedding | MSPR
Godbole, A. et al. [15] | Gated Recurrent Unit (GRU) Network | GloVe vectors pre-trained on Wikipedia | Quora Dataset
Severyn, A. and Moschitti, A. [50] | Convolutional neural network | Word embedding | TREC: QA and Microblog Retrieval
Chen, G. et al. [9] | CNN-RNN model | word2vec embedding | Reuters-21578 and RCV1-v2
Bogdanova, D. et al. [7] | Convolutional neural network | Skip-gram neural network architecture | Quora Dataset
Homma, Y. et al. [18] | Siamese Gated Recurrent Network | 300-dimensional GloVe vectors pre-trained on Wikipedia | Quora Dataset
Wang, Z. et al. [57] | Two-channel CNN model | Pre-trained word embedding by Mikolov et al. (2013) | QASent, WikiQA
Parikh, A.P. et al. [46] | Simple attention-based model | 300-dimensional GloVe vectors | SNLI
Bao, W. et al. [5] | Attentive Siamese LSTM | 300D word embedding (English and Chinese) | SemEval 2014
Liu, G. and Guo, J. [37] | AC-BiLSTM | Word embedding | TREC, SST1, SST2, IMDB, etc.
Deep Similarity Model | LSTM and CNN | Google pre-trained word embedding | Quora Dataset

3. Proposed Work

In the field of text similarity detection, techniques based on deep learning have moved researchers towards semantically distributed representations [2], [5], [37], [13], [23], [60], [17]. To find the semantic relatedness among question pairs, we propose a technique based on deep learning. The goal is to find similar questions with reduced time and complexity. The deep learning based method requires minimal preprocessing and takes only a pair of sentences, hence reducing the time complexity. Figure 1 shows the diagram of our proposed model.

Figure 1
Proposed flow diagram for similarity detection

3.1. Pre-Processing

Cleansing Text: Preprocessing is one of the essential steps, especially in text mining. Regex plays an important role during the whole process. Regex is used to find interesting patterns in raw text and then perform the desired action. For our task, 200000 unique words were used in the whole process due to computational expenses.

In the first step (as shown in Algorithm 1), all the questions are tokenized and converted into sequences. A threshold of no more than 30 words per question is applied. Stop words, punctuation and noise are removed from the text by using NLTK, and the words are converted into lower case. A regex function is applied to convert words into their original form. Stemming is performed through the Snowball Stemmer (http://www.nltk.org/howto/stem.html) to remove the unnecessary portion of the text. A few examples of regex conversions are given in Table 2.

Table 2
Regex examples

Original | Converted
what's | what is
've | have
can't | cannot
'nt | not
i'm | i am
'll | will
0s | 0
e - mail | email
u s | American
're | are

Algorithm 1: Process texts in datasets
Input: Text
Output: List of words
Notation: T: text, S: stops, SW: stopwords, STW: stem words, st: stemmer, stw: stemmed_words
Initialisation: T, S, SW, STW, st, stw
1:  T ← T.lower().split()
2:  if SW removal is enabled
3:      S ← set of English SW
4:      for w in T do
5:          if w not in S
6:              keep w
7:          end if
8:      end for
9:      T ← kept words
10: end if
11: T ← join T
12: clean the text using regex
13: if STW (stemming) is enabled
14:     T ← split T
15:     st ← English stemmer
16:     stw ← st(w) for each w in T
17:     T ← join stw
18: end if
19: return T

Algorithm 2: Data Processing
Input: Questions
Output: Labels
Notation: t1: texts_1, t2: texts_2, lb: labels, ti: test_ids, rd: reader, hd: header, tt1: test_texts_1, tt2: test_texts_2, qd: q_dict, trf: train_feat, tsf: test_feat
Initialisation: t1, t2, lb, ti, rd, hd, tt1, tt2
1:  with open training data file as f
2:      rd ← read csv
3:      hd ← next of rd
4:      for v in rd
5:          append t1, t2, lb
6:      end for
7:  end with
8:  with open test data file as f
9:      rd ← read csv
10:     hd ← next of rd
11:     for v in rd
12:         append tt1, tt2, ti
13:     end for
14: end with
15: qd ← default dictionary
16: for i in questions
17:     qd ← concatenate questions
18: end for
19: trf ← intersect questions
20: tsf ← intersect questions

Algorithm 2 shows the process of collecting all the vocabulary words and making a dictionary of words. First, the training file is opened. All the questions in the training file are concatenated. After concatenation, the dictionary of words is formed using this text.

3.2. Tokenization and Sequence Creation

Tokenization simply divides a sentence into a list of words. We used the Keras tokenizer function to tokenize the strings and another important function, 'texts_to_sequences', to make sequences of the words. After converting texts to sequences, the sequences are padded to the desired length, which is done by using the Max Length argument (in this case the max length is 30). The input sequences that are smaller than the desired length are padded, while the longer ones are truncated.

Algorithm 3: Converting Text to Sequences
Input: Text
Output: Number sequence
Notation: token: tokenizer, seq: sequences, tsseq: test_sequences, wi: word_index
Initialisation: token, seq, tsseq, wi
1: token ← tokenize words
2: seq ← tokenize words to sequences
3: tsseq ← tokenize words to test sequences
4: wi ← token.wi

3.3. Word Embedding

Word embedding is used to convert words, which are then assigned to real valued vectors (known as embeddings), to retain different types of information about the words. They are used to represent the words in an N-dimensional vector space. Word embedding is of great importance as this embedding denotes the meaning of the words, i.e. how words are semantically identical. Google's word2vec representation of text (http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/) is used in our study. The main advantage of using Google's word2vec is that it represents a word as a vector having 300 dimensions. These vectors are used to learn word embeddings for all the words in the dataset.

Algorithm 4: Embedding Matrix Preparation
Input: Text
Output: Vectors
Notation: nbw: nb_words, em: embedding_matrix, w: word, wi: word_index, voc: vocab
Initialisation: nbw, em, w, wi, voc
1: nbw ← maximum number of words
2: em ← matrix of zeros
3: for w, i in wi
4:     if w in voc
5:         em[i] ← word's vector
6:     end if
7: end for
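Algorithms 3 and 4 correspond closely to the Keras Tokenizer API and a NumPy embedding matrix. The following sketch is one possible realisation under the settings stated in the text (vocabulary capped at 200,000 words, sequences padded or truncated to 30 tokens, 300-dimensional Google News word2vec vectors loaded with gensim); the toy question lists and the file name GoogleNews-vectors-negative300.bin are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_NB_WORDS = 200000      # vocabulary cap (Table 4)
MAX_SEQUENCE_LENGTH = 30   # max words per question
EMBEDDING_DIM = 300        # Google word2vec dimensionality

# texts_1 / texts_2 hold the cleaned first and second questions of each pair
# (as produced by the cleaning step and Algorithm 2); toy values shown here.
texts_1 = ["how do i learn python", "what is machine learning"]
texts_2 = ["what is the best way to learn python", "what is deep learning"]

# Algorithm 3: fit one tokenizer on all questions and turn them into index sequences.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts_1 + texts_2)
seq_1 = pad_sequences(tokenizer.texts_to_sequences(texts_1), maxlen=MAX_SEQUENCE_LENGTH)
seq_2 = pad_sequences(tokenizer.texts_to_sequences(texts_2), maxlen=MAX_SEQUENCE_LENGTH)
word_index = tokenizer.word_index

# Algorithm 4: build the embedding matrix from the pre-trained word2vec vectors.
word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

nb_words = min(MAX_NB_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i < nb_words and word in word2vec:
        embedding_matrix[i] = word2vec[word]  # rows for unknown words stay zero
```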

3.4. Recurrent Neural Network

RNNs are used for learning objects which occur repeatedly, e.g. time series, repeating words, etc. Long Short-Term Memory (LSTM) networks are explicit types of RNNs that can learn the relationship among different elements in an input sequence; in our scenario, these elements are words. Given adequate information, deep neural networks have been shown to be successful for a wide variety of machine learning tasks, including text learning tasks based on larger datasets. For this task, Google's word2vec is one of the good examples, because it provides an automated means of mining semantic representations from big data [16]. With the use of a large-scale text corpus or any other large dataset, word2vec develops a vocabulary of a fixed size and figures out how to depict words outside the vocabulary by building vector portrayals utilizing words from inside the vocabulary [44]. Two popular network types are CNNs and RNNs, like the LSTM network.

3.5. Convolutional Neural Network

Convolutional neural networks, also called convnets, are an interesting and important development in the machine learning field. A standard Convolutional Neural Network (CNN) model encompasses single or multiple layers of subsampling and convolution, followed by one or more fully connected layers and an output layer.

3.5.1. Max Pooling Layers

Convolutional networks may utilize max pooling layers. Max pooling is a type of non-linear down-sampling which minimizes the longitudinal size of the convolutional layer's output by subdividing the output into rectangular boxes and keeping only the best value for each filter. Additionally, by bringing down the number of connections, and hence the computational cost, max pooling lessens overfitting by making features more spatially free. Max pooling layers can be placed after every convolutional layer or only after some layers.

3.5.2. Dropout Layers

In the dropout technique, randomly selected neurons are ignored during the training process. We can say that we "dropped out" some neurons randomly. In simple words, the contribution of these neurons is temporarily removed on the forward pass during the activation process of downstream neurons, and no weight updates are applied to them on the backward pass.

3.5.3. Batch Normalization

The next layer used is the batch normalization layer (https://arxiv.org/pdf/1502.03167v3.pdf). It is used for normalizing the inputs of each layer to overcome the covariate shift problem. We normalize an input layer by scaling and shifting the activations to speed up the learning process. During our training process, batch normalization performs the following tasks:

1 First, we calculate the mean and variance of the input layer:

mu_B = (1/m) * sum_{i=1..m} x_i    (Batch Mean)    (1)

sigma_B^2 = (1/m) * sum_{i=1..m} (x_i - mu_B)^2    (Batch Variance)    (2)

2 In the second step, normalization of the input layer takes place using the previously calculated statistics:

x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2 + epsilon)    (3)

3 Lastly, we shift and scale the normalized input to obtain the output of the layer:

y_i = gamma * x_hat_i + beta    (4)

3.5.4. Setting-up Model

A model is proposed in this study for the detection of similar questions on Quora, using a Convolutional Neural Network (CNN) and LSTM. CNNs extract either local or deep features from natural language. Studies have shown that CNNs have produced good results in sentence classification. In this work, we have used CNN followed by dropout, dense and max pooling layers, in combination with an LSTM network. The model comprises one translation layer for each question, two LSTM layers initialized by word embedding and two CNNs which are also initialized by word embedding. The inputs (known as input tensors) from all the layers are concatenated and then passed to a series of dense layers, followed by an activation function to make the final predictions. Based on these predictions, accuracy, precision, recall and F1 score were calculated.
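The layer arrangement of Section 3.5.4 can be written down with the Keras functional API. The sketch below is a plausible reading of that description (a shared embedding initialised with the word2vec matrix, two LSTM layers and two Conv1D/max-pooling branches, one of each per question, concatenation, then batch normalization, dropout and dense layers ending in a sigmoid). Layer sizes not listed in Table 4, such as the LSTM units and the dense width, are assumptions, and this is not the authors' published code.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import (Embedding, LSTM, Conv1D, GlobalMaxPooling1D,
                                     Dense, Dropout, BatchNormalization, concatenate)

def build_similarity_model(nb_words, embedding_matrix,
                           max_len=30, emb_dim=300,
                           lstm_units=128, filters=64, kernel_size=5):
    # Shared embedding layer initialised with the pre-trained word2vec matrix.
    embedding = Embedding(nb_words, emb_dim,
                          embeddings_initializer=Constant(embedding_matrix),
                          trainable=False)

    # Two LSTM layers and two CNN branches, one of each per question.
    lstm_q1, lstm_q2 = LSTM(lstm_units), LSTM(lstm_units)
    conv_q1 = Conv1D(filters, kernel_size, activation="relu")
    conv_q2 = Conv1D(filters, kernel_size, activation="relu")
    pool = GlobalMaxPooling1D()

    q1_in, q2_in = Input(shape=(max_len,)), Input(shape=(max_len,))
    e1, e2 = embedding(q1_in), embedding(q2_in)

    # Concatenate the LSTM (sentence-level) and CNN (local) feature tensors.
    features = concatenate([lstm_q1(e1), lstm_q2(e2),
                            pool(conv_q1(e1)), pool(conv_q2(e2))])

    x = BatchNormalization()(features)
    x = Dropout(0.2)(x)
    x = Dense(128, activation="relu")(x)
    x = Dropout(0.2)(x)
    out = Dense(1, activation="sigmoid")(x)   # probability that the pair is a duplicate
    return Model(inputs=[q1_in, q2_in], outputs=out)
```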

4. Experimental Results

In this section, we report the performance of our proposed deep neural network-based model that we have implemented for detecting similarity among questions. We performed and evaluated multiple experiments, shown in Table 3, to reach the best accuracy results and to develop the best fit technique for similarity detection.

We performed our experiments on Quora's dataset, in which duplicate data is labeled as positive samples and non-duplicate data is labeled as negative samples. First, we train our deep learning model on Quora's dataset to detect similarity between short questions. We investigated the best hyperparameters for our model. Our model is tested on the testing dataset to calculate evaluation scores and compare them with previous models.

4.1. Dataset

Quora made the first dataset public in 2017. This dataset contains pairs of duplicate and non-duplicate questions. Duplicate means that the given question pairs have the same meaning. The dataset contained 404290 question pairs with approx. 37% positive or duplicate samples and 63% negative or non-duplicate samples.

4.2. Implementation

The proposed method was implemented in Keras using the TensorFlow backend. The model was trained for 100 epochs, while the patience value was set to 5: the process is set to stop if the performance stops improving. There are in total 149262 duplicate pairs and 255028 non-duplicate pairs. The dataset is split into 80/20 (training/validation), featuring 323432 instances for training and 64687 for the validation set, which is further split into validation and testing sets, with 16171 instances for the testing set. The model is evaluated at the end of each epoch. By doing this, the model with the best score on the validation set is used to make predictions for the testing data.

4.3. Hyperparameter Setting

In this experiment, we have selected hyperparameters by first investigating the training data and searching for the best hyperparameters. We chose the optimizer, activation unit and dropout with which our model performed best. The setting of hyperparameters varies from dataset to dataset. We tested different hyperparameters on this dataset. To our knowledge, these selected hyperparameters performed best and gave better results than others. Table 4 shows the hyperparameter settings for our model. While designing our network, the ReLU activation unit was used, which is widely used as an activation unit in many deep learning models. We applied different optimizers for training our model and evaluated the performances. Figure 2 shows that the nAdam optimizer gave the best accuracy on our dataset. The dropout value was set to 0.2 and applied to every layer. The learning rate patience hyperparameter was set to 5, which means that the model waited for 5 epochs before reducing the learning rate. Likewise, when investigating the size of the training data, we have seen that if the size of the training data becomes smaller, the accuracy and F1 score slightly drop.

4.4. Evaluation Metrics

For the evaluation of our proposed model, we have used standard evaluation metrics that are widely used.

Precision: Precision indicates the proportion of predicted positive cases that are real positives.

Precision = TP / (TP + FP)    (5)

Recall: Recall is the proportion of actual positive cases that were correctly predicted.

Recall = TP / (TP + FN)    (6)

F1 score: The harmonic mean between precision and recall.

F1 = 2 * (Precision * Recall) / (Precision + Recall)    (7)

Accuracy: Accuracy represents the ratio of true negative and true positive samples to the total number of samples.

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (8)
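Sections 4.2-4.4 amount to a fairly standard Keras training and evaluation loop. The sketch below reuses names from the earlier sketches (seq_1, seq_2, build_similarity_model, nb_words, embedding_matrix) and assumes a label array of 0/1 duplicate flags; the callback choices and the train_test_split call are assumptions consistent with the description (nadam, binary cross entropy, batch size 2048, up to 100 epochs, patience 5), not the authors' exact script.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# seq_1, seq_2 and labels come from the preprocessing steps of Section 3;
# labels is a NumPy array of 0/1 duplicate flags (assumed here).
q1_tr, q1_val, q2_tr, q2_val, y_tr, y_val = train_test_split(
    seq_1, seq_2, labels, test_size=0.2, random_state=42)

model = build_similarity_model(nb_words, embedding_matrix)
model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", patience=5, factor=0.5),
]
model.fit([q1_tr, q2_tr], y_tr,
          validation_data=([q1_val, q2_val], y_val),
          epochs=100, batch_size=2048, callbacks=callbacks)

# Evaluation with the metrics of Equations (5)-(8).
pred = (model.predict([q1_val, q2_val]) > 0.5).astype(int).ravel()
print("accuracy ", accuracy_score(y_val, pred))
print("precision", precision_score(y_val, pred))
print("recall   ", recall_score(y_val, pred))
print("F1 score ", f1_score(y_val, pred))
```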


4.5. Results

Our CNN and LSTM models are combined in this study, for which the accuracy gained is better than existing methods. For the extraction of local features, CNN is used, while LSTM is used to learn long term dependencies and get sentence level representations. Our experimental results on Quora's dataset are summarized in Tables 5-6. Table 5 shows the accuracy obtained through the proposed model, whereas the 83.15, 87.02 and 85.04 values for precision, recall and F1 score, respectively, are shown in Table 6. As we can see, the accuracy of our presented model is 87.50, which is better than that of existing approaches. In addition, the experimental results showed that the proposed model obtained the best precision, recall and F1 score: 83.15, 87.02 and 85.04, respectively. The results are better than those of previous textual similarity models.

We have applied different models while doing our research. Table 3 shows the different methods applied during our research. We have used machine learning algorithms like Naïve Bayes and Random Forest.

Table 3
Different models applied on the dataset

Name | Method | Accuracy
Naïve Bayes | Features + Distance measures | 76.33 %
Random Forest | Features + Distance measures | 77.14 %
CNN 1D | Severyn, A. and A. Moschitti [50], CNN only | 82.23 %
LSTM | LSTM with GloVe | 82.60 %
LSTM | LSTM with pre-trained Google's word to vector | 85.90 %
LSTM + CNN | LSTM and CNN with pre-trained Google's word to vector | 87.50 %

Table 4
Hyperparameter values for our model

Hyperparameter | Value | Remarks
Max Sequence Length | 30 | -
Num of filters (CNN) | 64 | -
Filter length (CNN) | 5 | -
Batch size | 2048 | -
Max number of Words | 200,000 | -
Epochs | 100 | -
Loss | Binary cross entropy | -
Optimizer | nAdam | nAdam is a good optimizer to use
Activation unit | ReLU (Rectified Linear Units) | Applied in almost all deep learning models
Dropout | 0.2 | Drops out visible or hidden units in neural networks to avoid overfitting
LR patience | 5 | Epochs to wait before reducing the learning rate

For using our data with machine learning algorithms, we need a feature set. For this purpose, different features were collected, like the length of each question, the difference in length, common words, etc. In addition to these features, distance measures, e.g. Cosine distance, Jaccard distance, etc., were used to measure the distance between the vectors of questions 1 and 2. After the collection of all features, Naïve Bayes and Random Forest algorithms were applied to build a model and calculate results. Random Forest performed better than Naïve Bayes and gave an accuracy of 77.14 %, greater than Naïve Bayes with 76.33 % accuracy on these features.

In addition to these models, we have also applied the CNN and LSTM models individually, and their results are shown in Table 3. A simple LSTM model using pre-trained Google's word2vec gave an accuracy of 85.90 %. We have seen that the performance of CNN + LSTM combined is better than that of the individual models.

Table 6 shows the results of precision, recall and F1 score. According to this table, we have seen that our method performed best (compared with these techniques of sentence similarity) and achieved better values of precision, recall and F1 score.

Figure 2
Comparison of different optimizers applied on the dataset

Table 5
Comparison of proposed model with existing approaches

Model | Accuracy
Agarwal, B. et al. [1] | 77.70 %
Yin, W. and H. Schütze [60] | 78.10 %
Parikh, A.P. et al. [46] | 86.00 %
Wang, Z. et al. [57] | 77.14 %
Jiang, J.-Y. et al. [23] | 82.19 %
Homma, Y. et al. [18] | 86.80 %
DeepSimilarity Model | 87.50 %

Table 6
Comparison of precision, recall and F1 score

Approaches | Precision | Recall | F1 Score
Ferreira, R. et al. [13] | 82.00 | 80.20 | 83.10
Jiang, J.-Y. et al. [23] | 79.87 | 86.77 | 83.18
Yin, W. et al. [61] | - | - | 84.80
Yin, W. and H. Schütze [60] | - | - | 84.40
Homma, Y. et al. [18] | - | - | 81.73
Huang, J. et al. [20] | 72.00 | 74.80 | 69.40
DeepSimilarity Model | 83.15 | 87.02 | 85.04

The authors in [7] presented a CNN-based approach which generates vector representations of given question pairs and scores them using a similarity metric. Their proposed CNN-based method was tested on data from Stack Exchange forums. However, they did not mention F1 score, precision and recall. The authors of [57] applied a two-channel CNN to capture feature vectors from the component matrix. They applied their model on answer selection problems. They calculated mean average precision (MAP) and mean reciprocal rank (MRR) for the QASent and WikiQA datasets. In [15], the authors proposed Siamese Gated units in combination with machine learning algorithms like support vector machines, adaboost and random forest for predicting similar questions on the Quora dataset, but they did not calculate accuracy; rather, they calculated cross entropy loss. Their measured cross entropy loss was 0.29568. Likewise, a deep learning model was presented by [20] for detecting paraphrases and semantic similarity among short texts like tweets. For evaluating their model, they used a Twitter dataset, and their results showed that character level features play an important role in finding similarity between tweets. Their max F1 score was 72.0%, whereas the Pearson correlation was 62.6%. The authors of [18] tried to detect semantic similarity between Quora question pairs. They applied deep learning techniques and used the Siamese GRU network. They used data augmentation techniques to improve the performance of the model. Their model was trained on an augmented dataset using a two-layer similarity network and achieved an accuracy of 86.80% and 85.0% on the validation and test sets; however, the F1 score was 81.73% and 84.18% for the validation and test sets, respectively.

Compared to existing approaches and experimental results, we have seen that our model gave better accuracy. In addition, we explored that the joint model of LSTM and CNN gave better results than CNN and LSTM alone in similarity detection tasks. We take advantage of both these models and therefore achieve greater accuracy than existing approaches.

Figure 3
Model accuracy curve

Figure 4
Performance of models

5. Conclusion

Measuring textual similarity between short sentences is a renowned problem in the area of NLP. In this paper, we have presented a novel deep learning approach to predict duplicate question pairs. We have used CNN and an LSTM (Long Short-Term Memory) network, which is good at storing long term dependencies. We take advantage of both CNN and LSTM models and thus gained accuracy better than existing methods. For our method, we have used Google's word2vec representation of text. Our proposed work on finding duplicate questions using deep learning models performed very well and gave better results than existing approaches. The proposed method was evaluated on Quora's dataset and gained an accuracy of 87.50%. The main advantage of using deep learning is that it only takes sentences as inputs, thus requiring minimal preprocessing and saving time by avoiding complex feature engineering. For our future research, we will investigate how to apply our proposed model to similar tasks like information retrieval, document matching, etc. Moreover, we will apply word embeddings other than Google's word2vec and investigate their impact.
Moreover, we will apply word embed- orest or redtn slar uestons on Quora ding other than Google’s Word to vectors and investi- dataset ut te dd not alulate aura rater gate their impact. te alulate ross entro loss er easured ross entro loss as ese a dee learnn odel resented or detetn ararases and seants slart aon sort tets le teets or ealuatn ter odel te used a ttter dataset and ter results soed tat arater leel eatures la an ortant role n ndn slart eteen teets er a sore as ereas earson orrelaton as erorane o odels e autors o tred to detet seant slart eteen Quora ueston ars e aled dee learnn tenues and used te aese netor e used data auentaton tenues to roe te erorane o te odel er odel as traned on an auented dataset usn to laers slart netor and aeed an aura o and on aldaton and test set Information Technology and Control 2020/4/49 507

References

1. Agarwal, B., Ramampiaro, H., Langseth, H., Ruocco, M. A Deep Network Model for Paraphrase Detection in Short Text Messages. Information Processing & Management, 2018, 54(6), 922-937. https://doi.org/10.1016/j.ipm.2018.06.005
2. Arora, S., Liang, Y., Ma, T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICLR, 2017.
3. Arshad, H., Khan, M. A., Sharif, M. I., Yasmin, M., Tavares, J. M. R., Zhang, Y. D., Satapathy, S. C. A Multilevel Paradigm for Deep Convolutional Neural Network Features Selection with an Application to Human Gait Recognition. Expert Systems, 2020, e12541. https://doi.org/10.1111/exsy.12541
4. Aurangzeb, K., Akmal, F., Khan, M. A., Sharif, M., Javed, M. Y. Advanced Machine Learning Algorithm Based System for Crops Leaf Diseases Recognition. In 2020 6th IEEE Conference on Data Science and Machine Learning Applications (CDMA), 2020. https://doi.org/10.1109/CDMA47397.2020.00031
5. Bao, W., Bao, W., Du, J., Yang, Y., Zhao, X. Attentive Siamese LSTM Network for Semantic Textual Similarity Measure. In 2018 IEEE International Conference on Asian Language Processing (IALP), 2018. https://doi.org/10.1109/IALP.2018.8629212
6. Batool, F. E., Attique, M., Sharif, M., Javed, K., Nazir, M., Abbasi, A. A., Iqbal, Z., Riaz, N. Offline Signature Verification System: A Novel Technique of Fusion of GLCM and Geometric Features Using SVM. Multimedia Tools and Applications, 2020, 1-20. https://doi.org/10.1007/s11042-020-08851-4
7. Bogdanova, D., dos Santos, C., Barbosa, L., Zadrozny, B. Detecting Semantically Equivalent Questions in Online User Forums. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, 2015. https://doi.org/10.18653/v1/K15-1013
8. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 2017, 5, 135-146. https://doi.org/10.1162/tacl_a_00051
9. Chen, G., Ye, D., Xing, Z., Chen, J., Cambria, E. Ensemble Application of Convolutional and Recurrent Neural Networks for Multi-Label Text Categorization. In IEEE 2017 International Joint Conference on Neural Networks (IJCNN), 2017. https://doi.org/10.1109/IJCNN.2017.7966144
10. Das, D., Smith, N. A. Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, 2009. Association for Computational Linguistics. https://doi.org/10.3115/1687878.1687944
11. Dos Santos, C., Gatti, M. Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014.
12. Eyecioglu, A., Keller, B. Twitter Paraphrase Identification with Simple Overlap Features and SVMs. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015. https://doi.org/10.18653/v1/S15-2011
13. Ferreira, R., Cavalcanti, G. D., Freitas, F., Lins, R. D., Simske, S. J., Riss, M. Combining Sentence Similarities Measures to Identify Paraphrases. Computer Speech & Language, 2018, 47, 59-73. https://doi.org/10.1016/j.csl.2017.07.002
14. Gardenfors, P. Conceptual Spaces: The Geometry of Thought. Cambridge, MA, 2000. https://doi.org/10.7551/mitpress/2076.001.0001
15. Godbole, A., Dalmia, A., Sahu, S. K. Siamese Neural Networks with Random Forest for Detecting Duplicate Question Pairs. arXiv preprint arXiv:1801.07288, 2018.
16. Goldberg, Y., Levy, O. word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. arXiv preprint arXiv:1402.3722, 2014.
17. Hasan, S. A., Farri, O. Clinical Natural Language Processing with Deep Learning. In Data Science for Healthcare, 2019, 147-171. https://doi.org/10.1007/978-3-030-05249-2_5
18. Homma, Y., Sy, S., Yeh, C. Detecting Duplicate Questions with Deep Learning. In Proceedings of the International Conference on Neural Information Processing Systems, NIPS, 2016.
19. Hu, B., Lu, Z., Li, H., Chen, Q. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In Advances in Neural Information Processing Systems, 2014.
20. Huang, J., Yao, S., Lyu, C., Ji, D. Multi-Granularity Neural Sentence Model for Measuring Short Text Similarity. In International Conference on Database Systems for Advanced Applications, 2017. https://doi.org/10.1007/978-3-319-55753-3_28
21. Hussain, N., Khan, M. A., Sharif, M., Khan, S. A., Albesher, A. A., Saba, T., Armaghan, A. A Deep Neural Network and Classical Features Based Scheme for Objects Recognition: An Application for Machine Inspection. Multimedia Tools and Applications, 2020. https://doi.org/10.1007/s11042-020-08852-3
22. Janani, R., Vijayarani, S. Text Document Clustering Using Spectral Clustering Algorithm with Particle Swarm Optimization. Expert Systems with Applications, 2019, 134, 192-200. https://doi.org/10.1016/j.eswa.2019.05.030
23. Jiang, J.-Y., Zhang, M., Li, C., Bendersky, M., Golbandi, N., Najork, M. Semantic Text Matching for Long-Form Documents. In The World Wide Web Conference, 2019. https://doi.org/10.1145/3308558.3313707
24. Joshi, A., Fidalgo, E., Alegre, E., Fernández-Robles, L. SummCoder: An Unsupervised Framework for Extractive Text Summarization Based on Deep Auto-Encoders. Expert Systems with Applications, 2019, 129, 200-215. https://doi.org/10.1016/j.eswa.2019.03.045
25. Khan, M. A., Akram, T., Sharif, M., Javed, K., Raza, M., Saba, T. An Automated System for Cucumber Leaf Diseased Spot Detection and Classification Using Improved Saliency Method and Deep Features Selection. Multimedia Tools and Applications, 2020, 1-30.
26. Khan, M. A., Javed, K., Khan, S. A., Saba, T., Habib, U., Khan, J. A., Abbasi, A. A. Human Action Recognition Using Fusion of Multiview and Deep Features: An Application to Video Surveillance. Multimedia Tools and Applications, 2020, 1-27. https://doi.org/10.1007/s11042-020-08806-9
27. Khan, M. A., Khan, M. A., Ahmed, F., Mittal, M., Goyal, L. M., Hemanth, D. J., Satapathy, S. C. Gastrointestinal Diseases Segmentation and Classification Based on Duo-Deep Architectures. Pattern Recognition Letters, 2020, 131, 193-204. https://doi.org/10.1016/j.patrec.2019.12.024
28. Khan, M. A., Lali, M. I. U., Sharif, M., Javed, K., Aurangzeb, K., Haider, S. I., Altamrah, A. S., Akram, T. An Optimized Method for Segmentation and Classification of Apple Diseases Based on Strong Correlation and Genetic Algorithm Based Feature Selection. IEEE Access, 2019, 7, 46261-46277. https://doi.org/10.1109/ACCESS.2019.2908040
29. Khan, M. A., Sharif, M. I., Raza, M., Anjum, A., Saba, T., Shad, S. A. Skin Lesion Segmentation and Classification: A Unified Framework of Deep Neural Network Features Fusion and Selection. Expert Systems, 2019, e12497. https://doi.org/10.1111/exsy.12497
30. Khan, S. A., Hussain, A., Usman, M. Reliable Facial Expression Recognition for Multi-Scale Images Using Weber Local Binary Image Based Cosine Transform Features. Multimedia Tools and Applications, 2018, 77(1), 1133-1165. https://doi.org/10.1007/s11042-016-4324-z
31. Khan, S. A., Hussain, S., Yang, S. Contrast Enhancement of Low-Contrast Medical Images Using Modified Contrast Limited Adaptive Histogram Equalization. Journal of Medical Imaging and Health Informatics, 2020, 10(8), 1795-1803. https://doi.org/10.1166/jmihi.2020.3196
32. Khan, S. A., Ishtiaq, M., Nazir, M., Shaheen, M. Face Recognition Under Varying Expressions and Illumination Using Particle Swarm Optimization. Journal of Computational Science, 2018, 28, 94-100. https://doi.org/10.1016/j.jocs.2018.08.005
33. Kim, Y., Jernite, Y., Sontag, D., Rush, A. M. Character-Aware Neural Language Models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
34. Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S. Skip-Thought Vectors. In Advances in Neural Information Processing Systems, 2015.
35. Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., Crockett, K. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(8), 1138-1150. https://doi.org/10.1109/TKDE.2006.130
36. Liang, D., Zhang, F., Zhang, W., Zhang, Q., Fu, J., Peng, M., Gui, T., Huang, X. Adaptive Multi-Attention Network Incorporating Answer Information for Duplicate Question Detection. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019. https://doi.org/10.1145/3331184.3331228
37. Liu, G., Guo, J. Bidirectional LSTM with Attention Mechanism and Convolutional Layer for Text Classification. Neurocomputing, 2019, 337, 325-338. https://doi.org/10.1016/j.neucom.2019.01.078
38. Lopez-Gazpio, I., Maritxalar, M., Gonzalez-Agirre, A., Rigau, G., Uria, L., Agirre, E. Interpretable Semantic Textual Similarity: Finding and Explaining Differences Between Sentences. Knowledge-Based Systems, 2017, 119, 186-199. https://doi.org/10.1016/j.knosys.2016.12.013
39. Ltaifa, I. B., Hlaoua, L., Romdhane, L. B. Hybrid Deep Neural Network-Based Text Representation Model to Improve Microblog Retrieval. Cybernetics and Systems, 2019, 1-25.
40. Madnani, N., Tetreault, J., Chodorow, M. Re-examining Machine Translation Metrics for Paraphrase Identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012. Association for Computational Linguistics.
41. Mahmood, A., Khan, S. A., Hussain, S., Almaghayreh, E. M. An Adaptive Image Contrast Enhancement Technique for Low-Contrast Images. IEEE Access, 2019, 7, 161584-161593. https://doi.org/10.1109/ACCESS.2019.2951468
42. Majid, A., Khan, M. A., Yasmin, M., Rehman, A., Yousafzai, A., Tariq, U. Classification of Stomach Infections: A Paradigm of Convolutional Neural Network Along with Classical Features Fusion and Selection. Microscopy Research and Technique, 2020. https://doi.org/10.1002/jemt.23447
43. Mohammad, A.-S., Jaradat, Z., Mahmoud, A.-A., Jararweh, Y. Paraphrase Identification and Semantic Text Similarity Analysis in Arabic News Tweets Using Lexical, Syntactic, and Semantic Features. Information Processing & Management, 2017, 53(3), 640-652. https://doi.org/10.1016/j.ipm.2017.01.002
44. Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., Muharemagic, E. Deep Learning Applications and Challenges in Big Data Analytics. Journal of Big Data, 2015, 2(1), 1. https://doi.org/10.1186/s40537-014-0007-7
45. Onan, A. Two-Stage Topic Extraction Model for Bibliometric Data Analysis Based on Word Embeddings and Clustering. IEEE Access, 2019, 7, 145614-145633. https://doi.org/10.1109/ACCESS.2019.2945911
46. Parikh, A. P., Täckström, O., Das, D., Uszkoreit, J. A Decomposable Attention Model for Natural Language Inference. arXiv preprint arXiv:1606.01933, 2016. https://doi.org/10.18653/v1/D16-1244
47. Rashid, M., Khan, M. A., Sharif, M., Raza, M., Sarfraz, M., Afza, F. Object Detection and Classification: A Joint Selection and Fusion Strategy of Deep Convolutional Neural Network and SIFT Point Features. Multimedia Tools and Applications, 2019, 78(12), 15751-15777. https://doi.org/10.1007/s11042-018-7031-0
48. Rehman, A., Khan, M. A., Mehmood, Z., Saba, T., Sardaraz, M., Rashid, M. Microscopic Melanoma Detection and Classification: A Framework of Pixel-Based Fusion and Multilevel Features Reduction. Microscopy Research and Technique, 2020. https://doi.org/10.1002/jemt.23429
49. Sahi, M., Gupta, V. A Novel Technique for Detecting Plagiarism in Documents Exploiting Information Sources. Cognitive Computation, 2017, 9(6), 852-867. https://doi.org/10.1007/s12559-017-9502-4
50. Severyn, A., Moschitti, A. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015. https://doi.org/10.1145/2766462.2767738
51. Sharif, M., Akram, T., Raza, M., Saba, T., Rehman, A. Hand-crafted and Deep Convolutional Neural Network Features Fusion and Selection Strategy: An Application to Intelligent Human Action Recognition. Applied Soft Computing, 2020, 87, 105986. https://doi.org/10.1016/j.asoc.2019.105986
52. Sharif, M. I., Li, J. P., Khan, M. A., Saleem, M. A. Active Deep Neural Network Features Selection for Segmentation and Recognition of Brain Tumors Using MRI Images. Pattern Recognition Letters, 2020, 129, 181-189. https://doi.org/10.1016/j.patrec.2019.11.019
53. Smith, L. B., Heise, D. Perceptual Similarity and Conceptual Structure. In Advances in Psychology, 1992, 233-272. https://doi.org/10.1016/S0166-4115(08)61009-2
54. Vo, N. P. A., Magnolini, S., Popescu, O. Paraphrase Identification and Semantic Similarity in Twitter with Simple Features. In Proceedings of the Third International Workshop on Natural Language Processing for Social Media, 2015. https://doi.org/10.3115/v1/W15-1702
55. Wang, X., Jiang, W., Luo, Z. Combination of Convolutional and Recurrent Neural Network for Sentiment Analysis of Short Texts. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016.
56. Wang, Z., Hamza, W., Florian, R. Bilateral Multi-Perspective Matching for Natural Language Sentences. arXiv preprint arXiv:1702.03814, 2017. https://doi.org/10.24963/ijcai.2017/579
57. Wang, Z., Mi, H., Ittycheriah, A. Sentence Similarity Learning by Lexical Decomposition and Composition. arXiv preprint arXiv:1602.07019, 2016.
58. Wieting, J., Gimpel, K. Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings. arXiv preprint arXiv:1705.00364, 2017. https://doi.org/10.18653/v1/P17-1190
59. Xu, W., Ritter, A., Callison-Burch, C., Dolan, W. B., Ji, Y. Extracting Lexically Divergent Paraphrases from Twitter. Transactions of the Association for Computational Linguistics, 2014, 2, 435-448. https://doi.org/10.1162/tacl_a_00194
60. Yin, W., Schütze, H. Convolutional Neural Network for Paraphrase Identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015. https://doi.org/10.3115/v1/N15-1091
61. Yin, W., Schütze, H., Xiang, B., Zhou, B. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. Transactions of the Association for Computational Linguistics, 2016, 4, 259-272. https://doi.org/10.1162/tacl_a_00097

This article is an Open Access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 (CC BY 4.0) License (http://creativecommons.org/licenses/by/4.0/).