Comparing the Quality and Speed of Sentence Classification with Modern Language Models
Applied Sciences (Article)

Krzysztof Fiok 1, Waldemar Karwowski 1, Edgar Gutierrez 1,2,* and Mohammad Reza-Davahli 1

1 Department of Industrial Engineering and Management Systems, University of Central Florida, Orlando, FL 32816, USA; fi[email protected] (K.F.); [email protected] (W.K.); [email protected] (M.R.-D.)
2 Center for Latin-American Logistics Innovation, LOGyCA, 110111 Bogota, Colombia
* Correspondence: [email protected]; Tel.: +1-515-441-5519

Received: 3 April 2020; Accepted: 11 May 2020; Published: 14 May 2020

Abstract: After the advent of GloVe and Word2vec, the dynamic development of language models (LMs) used to generate word embeddings has enabled the creation of better text classifier frameworks. With the vector representations of words generated by newer LMs, embeddings are no longer static but context-aware. However, the quality of results provided by state-of-the-art LMs comes at the price of speed. Our goal was to present a benchmark providing insight into the speed–quality trade-off of a sentence classifier framework based on word embeddings provided by selected LMs. We used a recurrent neural network with gated recurrent units to create sentence-level vector representations from the word embeddings provided by an LM, and a single fully connected layer for classification. Benchmarking was performed on two sentence classification data sets: The Sixth Text REtrieval Conference (TREC6) set and a 1000-sentence data set of our own design. Our Monte Carlo cross-validated results on these two data sources demonstrated that the newest deep learning LMs provide improvements over GloVe and FastText in terms of weighted Matthews correlation coefficient (MCC) scores. We postulate that progress in LMs is more apparent when more difficult classification tasks are addressed.
Keywords: natural language processing; text classification; language model; deep learning; classification data set

Appl. Sci. 2020, 10, 3386; doi:10.3390/app10103386; www.mdpi.com/journal/applsci

1. Introduction

Recent years have seen substantial development in natural language processing (NLP), owing to the creation of vector representations of text called embeddings. The breakthrough language models (LMs) GloVe [1] and Word2Vec [2] enabled the creation of static word-level embeddings, which have the same form regardless of the context in which the word is found. Over time, researchers developed LMs capable of creating vector representations of text entities of different lengths, e.g., by analyzing text at the character [3–5] or sub-word level [6]. Recent progress in this field is exemplified by the development of many LMs that create contextualized word embeddings, i.e., embeddings whose form depends on the context in which the token is found, as in Embeddings from Language Models (ELMo) [7], Bidirectional Encoder Representations from Transformers (BERT) [8], and Generalized Autoregressive Pretraining for Language Understanding (XLNet) [9].

The modern sentence classification approach relies on token-level vector representations provided by a chosen LM, from which sentence-level embeddings are created. LMs providing more sophisticated contextualized embeddings should increase the quality of sentence classification, but at the expense of speed. Maintaining a balance between quality and speed is considered important by many researchers, as evidenced by the efforts devoted to developing efficient NLP frameworks [10–12]. However, the time cost and the extent to which newer LMs (rather than static embeddings) improve results in downstream tasks remain unclear.
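The static-versus-contextualized distinction described above can be illustrated with a deliberately simplified sketch. The lookup table and the toy "contextualizer" below are invented for illustration only; real LMs such as ELMo or BERT learn these functions from data.

```python
import numpy as np

# Toy static embedding table (invented values, for illustration only).
STATIC = {
    "river": np.array([1.0, 0.0]),
    "money": np.array([0.0, 1.0]),
    "bank":  np.array([0.5, 0.5]),
}

def static_embed(sentence):
    """Static LMs (GloVe-style lookup): one vector per word, context ignored."""
    return [STATIC[w] for w in sentence]

def contextual_embed(sentence):
    """Toy stand-in for a contextual LM: each word's vector is mixed with its
    neighbours', so the same word gets different vectors in different contexts."""
    vecs = static_embed(sentence)
    out = []
    for i, v in enumerate(vecs):
        neighbours = [vecs[j] for j in (i - 1, i + 1) if 0 <= j < len(vecs)]
        out.append(0.5 * v + 0.5 * np.mean(neighbours, axis=0))
    return out

s1 = ["river", "bank"]
s2 = ["money", "bank"]

# Static: "bank" is identical in both sentences.
assert np.allclose(static_embed(s1)[1], static_embed(s2)[1])
# Contextual: "bank" differs depending on its neighbours.
assert not np.allclose(contextual_embed(s1)[1], contextual_embed(s2)[1])
```

The point of the sketch is only the observable contrast: a static table maps a token to the same vector everywhere, while any context-dependent function does not.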
Determining this information is not trivial because when a new LM is published, the authors usually provide comparisons of quality but no information regarding the training time (TT) and inference time (IT). Comparing results from various frameworks can also be difficult because of the use of different methods for creating sentence embeddings from lower-level embeddings, e.g., by naïvely computing averages of word embeddings, by computing the element-wise sum of the representations at each word position [13], or by using a deep neural network [13], recurrent neural network (RNN) [14], bidirectional long short-term memory RNN (BiLSTM) [15], or hierarchy by max pooling (HBMP) [16] model. Therefore, two factors often change at one time and influence the quality of sentence-level embeddings, namely the LM providing the token embeddings and the method of creating the sentence-level embeddings.

Another hindrance when comparing the quality of LMs is that authors often choose to publish evaluations carried out on a variety of data sets yet provide results for only the best model obtained after many training runs, without reporting relevant statistical information regarding the average models trained in their applied frameworks.
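The two simplest sentence-embedding strategies mentioned above, naïve averaging and element-wise summation of word vectors, amount to the following (the token vectors are toy values invented for illustration):

```python
import numpy as np

# Token-level embeddings for a 3-word sentence (3 tokens x 4 dimensions, toy values).
word_embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
])

# Naive average of the word embeddings.
sentence_mean = word_embeddings.mean(axis=0)

# Element-wise sum of the representations at each word position [13].
sentence_sum = word_embeddings.sum(axis=0)

# Both collapse a variable-length sentence into one fixed-size vector,
# but discard word order -- which is why RNN/BiLSTM/HBMP encoders can do better.
assert sentence_mean.shape == sentence_sum.shape == (4,)
assert np.allclose(sentence_sum, 3 * sentence_mean)
```

Because mean and sum pooling differ only by a constant factor for a fixed sentence length, the practically important contrast is between such order-free pooling and the learned sequential encoders listed above.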
In the present study, we compared the embeddings created by various LMs in a single sentence classification framework that used the same RNN to create sentence-level embeddings, to ensure that the overall performance was unbiased by the choice of algorithm for the creation of sentence-level embeddings. Additionally, to improve the comparability of the selected LMs, we performed the benchmarking analysis on the same data sets, namely The Sixth Text REtrieval Conference (TREC6) set [17] and our meetings preliminary data set (MPD) [18] consisting of 1000 labeled sentences, which is published along with this paper. Furthermore, we performed Monte Carlo cross-validation (MCCV) [19] to assess the quality of predictions and the TT and IT achieved with the selected LMs. All tests were conducted on the same computing machine.
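Monte Carlo cross-validation repeatedly draws random train/test splits and aggregates the chosen metric over the repeats. A minimal sketch is given below; for brevity it uses a binary MCC (the study reports a weighted multiclass variant) and a trivial majority-class predictor standing in for the real classifier.

```python
import numpy as np

def binary_mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def monte_carlo_cv(X, y, n_splits=10, test_frac=0.2, seed=0):
    """Repeated random train/test splits; returns mean and std of the MCC.
    (The placeholder majority-class model below ignores the features X.)"""
    rng = np.random.default_rng(seed)
    n_test = int(len(y) * test_frac)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        test, train = idx[:n_test], idx[n_test:]
        majority = int(np.bincount(y[train]).argmax())
        y_pred = np.full(len(test), majority)
        scores.append(binary_mcc(y[test], y_pred))
    return float(np.mean(scores)), float(np.std(scores))

# Toy data: 100 examples with random binary labels.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5))
y = rng.integers(0, 2, size=100)
mean_mcc, std_mcc = monte_carlo_cv(X, y)
# A constant majority-class predictor carries no information, so MCC is about 0.
assert -0.5 < mean_mcc < 0.5
```

Unlike k-fold cross-validation, the test sets of different MCCV repeats may overlap, which is why reporting both the mean and the spread of the metric across repeats, as done in this study, is the informative summary.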
To further improve comparability, for all compared LMs we used the same fine-tuning regimen, developed during a preliminary study.

2. General Framework

2.1. Text Classification

In this research, we used Flair v. 0.4.4 [11], an NLP framework written in PyTorch and published by the authors of the Flair LM [20]; this framework enables the creation of text classification models based on various state-of-the-art LMs. With this framework, we instantiated sentence-level classifiers by using a unidirectional RNN with gated recurrent units (GRU RNN) with a single fully connected layer for final sentence classification. The task of the GRU RNN was to create a constant-length, sentence-level vector representation from the token-level embeddings provided by the selected LMs. An overall scheme presenting this workflow is shown in Figure 1. In the Flair framework, if we chose an LM enabling fine-tuning on the downstream task, e.g., BERT, the process of training the GRU RNN along with the classification
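The classifier architecture described in this subsection can be sketched in plain numpy: a unidirectional GRU collapses a variable-length sequence of token embeddings into one fixed-length sentence vector, and a single fully connected layer maps that vector to class logits. The weights below are random and untrained, and the toy dimensions are assumptions; the study's actual implementation uses Flair/PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HID_DIM, N_CLASSES = 8, 16, 6   # toy sizes; TREC6 has 6 coarse classes

# Randomly initialised parameters stand in for trained weights (illustration only).
Wz, Wr, Wh = (0.1 * rng.standard_normal((HID_DIM, EMB_DIM)) for _ in range(3))
Uz, Ur, Uh = (0.1 * rng.standard_normal((HID_DIM, HID_DIM)) for _ in range(3))
W_out = 0.1 * rng.standard_normal((N_CLASSES, HID_DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_sentence_vector(token_embeddings):
    """Unidirectional GRU over token embeddings; the final hidden state serves
    as the constant-length sentence-level representation."""
    h = np.zeros(HID_DIM)
    for x in token_embeddings:
        z = sigmoid(Wz @ x + Uz @ h)               # update gate
        r = sigmoid(Wr @ x + Ur @ h)               # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h))    # candidate state
        h = z * h + (1.0 - z) * h_cand             # interpolate old/new state
    return h

def classify(token_embeddings):
    """Single fully connected layer on top of the sentence vector."""
    logits = W_out @ gru_sentence_vector(token_embeddings)
    return int(np.argmax(logits))

# Sentences of different lengths yield the same fixed-size representation.
short_sent = rng.standard_normal((3, EMB_DIM))    # 3 token embeddings
long_sent = rng.standard_normal((12, EMB_DIM))    # 12 token embeddings
assert gru_sentence_vector(short_sent).shape == (HID_DIM,)
assert gru_sentence_vector(long_sent).shape == (HID_DIM,)
```

The key property exercised here is the one the text relies on: whatever LM supplies the token embeddings, the GRU output has a fixed dimensionality, so the same classification head can be reused across all compared LMs.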