DilBERT2: Humor Detection and Sentiment Analysis of Comic Texts Using Fine-Tuned BERT Models

Krzysztof Garbowicz, Lydia Y. Chen, Zilong Zhao
TU Delft

Abstract

The field of Natural Language Processing (NLP) has progressed rapidly over recent years. With new advancements in transfer learning and the creation of open-source projects like BERT, solutions and research projects have emerged implementing new ideas in a variety of domains, for tasks including text classification or question answering. This research focuses on the task of humor detection and sentiment analysis of comic texts with the use of a fine-tuned BERT model. Never before has a fine-tuned BERT model been used for the task of humor detection in text coming from an artwork. Comic text features domain-specific language, affecting the meaning and structure of commonly used grammar and vocabulary. This may differ from the language people use every day, and from the language the existing humor classifiers used for training. This research contributes to the NLP field with new models and datasets for humor detection and sentiment analysis, and reports on techniques to improve the training times and the accuracy of a pre-trained BERT model on small datasets. The proposed solution, trained on comic datasets, outperforms the chosen baselines and could be used as a reliable classifier in the given domain. Moreover, the results indicate that techniques reducing the training time of the model can positively contribute to its performance.

I. Introduction

The recent developments in natural language processing (NLP) techniques using pre-trained language models have outperformed previously used methods for tasks related to language understanding [1], [2]. One of the most popular and successful models created in recent years, BERT (Bidirectional Encoder Representations from Transformers) [1], introduced a technique of unsupervised pre-training of deep, contextual, and bidirectional language representations. The pre-training was done on a dataset of over 3 billion words [1], which allows complex linguistic features to be learned in the training phase; the output layers are adjusted afterward to achieve a network able to perform different language understanding tasks. Calibration of the output layers, also called fine-tuning, is performed on an already pre-trained language model and can be used to get predictions for text classification or question answering tasks. Fine-tuning can be performed with domain-specific, labeled data in much smaller quantities than the data in the pre-training phase, and it can be done in much less time.

This research tries to answer whether it is possible to fine-tune a pre-trained BERT model on the domain of comic text, for the tasks of humor detection and sentiment analysis. The performance of the proposed models is compared across different configurations and against selected baselines, and is evaluated on datasets from selected comics and from other humor-detection studies. A further goal is to answer whether the classification results can be used as a condition for comic-image-generating generative adversarial networks (GANs), developed concurrently with the presented study. The goal of the surrounding research is to extract images and text from phdcomics.com and dilbert.com, perform pre-processing or classification of the obtained data, and feed selected information into the GANs, which would create new comic illustrations. The classification of dialogues and descriptions in the comics can deliver additional information about the topic or the sentiment of a statement. An example use of that information could be varying the background color of the generated comic based on the sentiment or the humor of the provided text.

Previous contributions include ColBERT [3], a classifier using BERT tokens trained for the humor detection task, and ref. [4], which uses a fine-tuning approach. Additional research on solving NLP tasks with a pre-trained BERT model at the core includes data pre-processing techniques that can significantly decrease the time of training a classifier [5]. Other studies focus on improving the accuracy when training on smaller datasets [6][7].

This research combines the techniques of fine-tuning a pre-trained BERT model on a small dataset [6][7] and training the model using batches of data of dynamic size [5]. The proposed models are pre-trained BERT models with adjusted output layers. The input text is tokenized to the format required by BERT and fed into the model, which produces a classification label. By training the models with different training parameters, on datasets from comics and from previous humor-detection studies, the accuracy and F1-score are measured. The goal is to achieve a reliable humor and sentiment classifier in the new domain of comic texts, whose results could be used as one of the input conditions of the comic-generating GAN.
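To make this pipeline concrete, the following minimal sketch shows how a text sample could be tokenized and classified with the PyTorch and HuggingFace Transformers stack used in this study. The generic bert-base-cased checkpoint and the untrained two-label head are stand-ins for the fine-tuned DilBERT2 weights, not the model itself.

```python
# A minimal sketch of the tokenize-and-classify pipeline described above.
# Assumption: the generic "bert-base-cased" checkpoint and a fresh two-label
# head stand in for the actual fine-tuned DilBERT2 weights.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)  # 2 labels: non-humorous / humorous
model.eval()

sample = "I love deadlines. I like the whooshing sound they make as they fly by."
inputs = tokenizer(sample, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, 2)
label = logits.argmax(dim=-1).item()         # meaningful only after fine-tuning
print(label)
```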

The report is laid out in the following structure. Section II describes the previous work in the domain of fine-tuning a BERT model and humor detection. Section III describes the creation of the datasets and the proposed models. Section IV explains how the experiments were performed and describes the experimental setup. Section V presents the achieved results. Next, in Section VI the aspects of ethics and reproducibility are reflected upon. Section VII contains the discussion and analysis of the obtained results. In Section VIII conclusions and suggestions for future work are made.

II. Related Work

Over the recent years, research surrounding the application of pre-trained BERT models has brought novel approaches allowing for trimming the initial architecture and for faster training without losing accuracy in a given task [6][7]. Concurrently, studies in different text domains were performed to explore the possibilities of applying BERT embeddings to solve a variety of tasks like text classification [2], question answering [8], or named entity recognition (NER) [9]. This section describes the BERT model, the humor detection methodology, existing humor detection models, and the technique of fine-tuning a pre-trained BERT model.

BERT Model
The Bidirectional Encoder Representations from Transformers (BERT) [1] is a model trained in an unsupervised fashion on the tasks of masked language modeling and next sentence prediction, on a dataset consisting of BookCorpus [10] and English Wikipedia. Once pre-trained, the model was distributed as an open-source project and has been one of the state-of-the-art models in NLP since. Two versions of the model exist: BERT-Large (24 encoder layers, 340M parameters) and BERT-Base (12 encoder layers, 110M parameters). The model learns the context from tokenized representations of the input text, and stores different information at different encoder layers [11]. The format of the input tokens has a specified structure: the encodings begin with a [CLS] token, followed by tokens representing sub-words of the input text, and might end with [PAD] tokens in case the encoded text is shorter than the specified batch length (max = 512). By using transfer learning and attention [12], after pre-training the architecture distinguishes between complex linguistic features. That knowledge comes at a high cost in financial and computing resources. However, since pre-training needs to be done only once, models implementing different architectures or pre-training techniques, such as RoBERTa [13], TaBERT [14], CamemBERT [15], or ALBERT [16], were developed and broadened the understanding of the architecture and capabilities of the BERT model.

Humor Detection
The structure of humor chosen for this study follows the Script-based Semantic Theory of Humor (SSTH) [17]. By its main hypothesis, a text can be assumed to be humorous if the contexts of its parts are compatible with each other and it contains a punchline, characterized by some parts being opposite to each other. The punchline can be distinguished by a change in the use of grammar, vocabulary, or punctuation. There are different theories of humor that rely on more complex data, such as monitoring human reactions or laughter. On the contrary, SSTH focuses exclusively on linguistics.

Humor Classifiers
Previous studies based on a pre-trained BERT model solving the task of humor detection, ColBERT [3] and "Humor Detection: A Transformer Gets the Last Laugh" (TGtLL) [4], achieved state-of-the-art results when trained on large, prepared datasets. ColBERT uses BERT sentence embeddings and an eight-layer classification network. It achieves state-of-the-art results, being trained on a dataset of 200K examples of humorous and non-humorous text. TGtLL uses a BERT model fine-tuned on big collections of jokes and non-humorous text. In contrast with these previous contributions, the new datasets for this study are much smaller and consist of text scraped from popular comics.

Fine-Tuning
By adjusting the output layers of pre-trained BERT models, architectures were developed to solve different NLP tasks. Datasets of around 10,000-20,000 training examples were often used for fine-tuning. The effects of using smaller datasets, different training functions, and training parameters have recently been investigated. Studies show that a dataset of 1000 examples is often sufficient in different text classification tasks [6][7]. Other studies focused on discovering techniques enabling a faster training process [5]. The new comic datasets for this study comprise around 1000 training examples, and experiments with the newly proposed techniques are performed.
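As an illustration of the input format described in the BERT Model paragraph above, the short sketch below encodes a sample and prints its token sequence; the checkpoint name and the padding length of 12 are arbitrary choices for the example.

```python
# A minimal sketch of the BERT input format: [CLS], sub-word tokens, [SEP],
# and [PAD] tokens up to a chosen length (the model accepts at most 512).
# Assumptions: the "bert-base-cased" checkpoint and a pad length of 12.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("Garfield hates Mondays",
                     padding="max_length", max_length=12, truncation=True)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', ..., '[SEP]', '[PAD]', '[PAD]', ...]
# (the exact sub-word splits depend on the vocabulary)
```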
III. Methodology

This section explains the chosen approaches and describes the process of creating the datasets and building the models for the classification tasks.

A. Datasets

Three different comics were proposed as sources of new data for the study: PhdComics.com, dilbert.com, and Garfield.

Fig. 1: Samples from the dataset of Dilbert comics for humor detection, with text with correct grammatical structure.

Fig. 2: Samples from the dataset of Dilbert comics for humor detection, with text with incorrect grammatical structure.

Due to the varying form of the language used, the commonly important connections between the text and the illustration, and text data that is difficult to scrape, illustrations from PhdComics.com were not selected to be a part of the dataset. Instead, dilbert.com and Garfield became the sources of the comic texts. Both comics exist in only a few different structures and use simpler forms of text than PhdComics.com. In comparison to the often long and complex text from PhdComics.com, dilbert.com and Garfield consist mostly of short monologues, dialogues, or descriptions. What is more, transcriptions of Garfield comics can be found online, which allowed for creating a dataset following the exact formatting of the original comic.

Both datasets for the study consist of over 1300 texts extracted from single panes of the comics. Each text sample in the Dilbert dataset is labeled for binary classification of humor detection and multi-label classification of sentiment analysis. Text samples from the Garfield dataset are labeled for the task of humor detection. The humor labels for the data follow the description of the SSTH [17], which was also chosen by previous studies [18]. The sentiment analysis uses three emotions to categorize the text: negative, represented with a 0; neutral, represented with a 1; and positive, represented with a 2.

Text extracted from comics contains many abbreviations and onomatopoeias, like "pow" or "ghaa". The vocabulary of a pre-trained BERT model includes only real English words and names. That is why two versions of the Dilbert dataset are created to experiment with the input format. One version uses the structure of text resulting from the data scraping: the text is all lowercase, and all symbols other than punctuation are omitted. Sentences in the other dataset are rewritten with expanded contractions and use uppercase formatting. Table I presents general information about the grammatically correct dataset. Both datasets are used when training the model, to perform experiments and observe the influence of the use of correct grammar and vocabulary.

TABLE I: General statistics of the grammatically correct Dilbert comics dataset used for the study.

       | # Characters | # Words | # Unique Words | # Punctuation | # Duplicate Words | # Sentences
mean   | 65           | 12      | 12             | 2             | 0                 | 1
std    | 22.73        | 4.482   | 4.460          | 1.515         | 0.854             | 0.685
min    | 4            | 1       | 1              | 0             | 0                 | 1
median | 65           | 12      | 12             | 2             | 0                 | 1
max    | 145          | 30      | 30             | 19            | 6                 | 5

For comparison with previous humor detection classifiers, the proposed model is evaluated on the datasets from the TGtLL and ColBERT studies. The Pun of the Day and Short Jokes datasets were published alongside the TGtLL study. The Pun of the Day dataset is composed of one-sentence humorous and non-humorous texts. The Short Jokes dataset consists of texts reaching a maximum of a couple of sentences and is comprised of jokes from the Short Jokes Kaggle dataset [19] and sentences from news websites. To compare the proposed model with ColBERT, and to experiment with using small datasets for fine-tuning a pre-trained BERT model, training is done on 1000 examples from the ColBERT dataset.

B. Models

Fig. 4: The architecture of the fine-tuned models.

The models for the study are fine-tuned pre-trained BERT models, with output layers whose number of outputs depends on the task. The classifiers consist of a dropout and a fully connected linear layer. The proposed classifier is coined DilBERT2. During training, the data is tokenized, and batches for the training phase are organized in a fixed and a dynamic way [5] for comparison. The dynamic way of organizing data is characterized by sorting the input by its length and creating batches of input of similar size. The text within a batch is then padded to the length of the longest sequence in the batch or to the chosen maximum length. The BERT model requires the data within a batch to be of the same length; however, this length can vary between batches. The dynamic approach leads to better run-time due to the lower number of padding tokens. The padding tokens do not carry the meaning of the analyzed text and can be seen as unnecessary use of memory when padding all the samples to the same length. The choices of the optimizer, the scheduler, and the parameters were influenced by previous research [6][7].
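A minimal sketch of the described architecture, assuming the HuggingFace BertModel API, could look as follows; using the pooled [CLS] representation is a common convention and an assumption here, not a detail confirmed by the study.

```python
# A sketch of the DilBERT2-style classifier described above: a pre-trained
# BERT encoder followed by a dropout and a fully connected output layer.
# Assumption: the pooled [CLS] output is used as the sentence representation.
import torch.nn as nn
from transformers import BertModel

class ComicTextClassifier(nn.Module):
    def __init__(self, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")
        self.dropout = nn.Dropout(dropout)
        # num_labels = 2 for humor detection, 3 for sentiment analysis
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output  # representation of the [CLS] token
        return self.classifier(self.dropout(pooled))
```

Instantiated with num_labels=2 the head matches the humor detection task; with num_labels=3 it matches the sentiment analysis task.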
Fig. 3: Samples from the dataset of Garfield comics for humor detection, with text with correct grammatical structure.

IV. Experimental Setup

This section describes the experimental setup and the baselines chosen for the comparison with the DilBERT2 model.

All of the experiments performed for this study were done on the Google Colab platform, hosting a Jupyter Notebook instance with free access to several different GPUs. The code for the models was written in the PyTorch framework, with the HuggingFace Transformers library providing the tokenizers and pre-trained models.

Experiments were executed on the cased and uncased BERT-Large and BERT-Base models with a training batch size of 16. The output layers of the developed fine-tuned models consist of a dropout and a dense output layer, with the number of output labels dependent on the task. The probability of dropout is set to p = 0.1. The optimizer for the model is AdamW with a learning rate of 2e-5, weight decay λ = 0.01, and bias correction. Warmup of the learning rate lasts for the first 10% of the training steps. The datasets are split into a 75%-25% train-test split. The model is trained for a maximum of 20 epochs, with varying random seeds.
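The optimizer and scheduler described above could be configured as in the sketch below, reusing the classifier from the previous sketch; num_training_steps is a placeholder, and bias correction is the default behavior of the PyTorch AdamW implementation.

```python
# A sketch of the training setup described above: AdamW with lr = 2e-5,
# weight decay 0.01, and linear warmup over the first 10% of the steps.
# Assumption: num_training_steps is a placeholder (epochs * batches/epoch).
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

num_training_steps = 1000  # placeholder value
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)
# per step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```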
The model tasked with humor detection is compared against the following classifiers; a minimal sketch of one such baseline follows the list.

1) Decision Tree. A classification technique based on distinguishing between different numerical features of the input, and predicting the target label by using the knowledge from the training phase when the tree was built. CountVectorizer is used to encode text into numerical vectors.

2) SVM. The Support Vector Machine classifier tries to find a separation line between two sets of points representing the mapped data. Label predictions for new examples are obtained by encoding the input into the corresponding space and looking at which side of the separation the point falls. CountVectorizer is used to map the input text into numerical representations.

3) Multinomial Naive Bayes. The Multinomial Naive Bayes algorithm uses the Bayes theorem on encoded text to predict the label. Text is encoded into vectors with the use of CountVectorizer.

4) XGBoost. The XGBoost library provides efficient implementations of the gradient boosting technique to make predictions based on the decision tree algorithm. CountVectorizer is used to map the input text into numerical vectors.

5) A Transformer Gets the Last Laugh (TGtLL). One of the previous studies on the task of humor detection that published its datasets as well as the results achieved by its proposed model. The datasets include the Pun of the Day dataset [20], composed of short phrases, and the Short Jokes dataset published alongside the study. The performance of DilBERT2, trained in the described approaches on each of the datasets, is compared against the results of the TGtLL model.

6) ColBERT. ColBERT is a state-of-the-art model in the humor detection task, scoring 98% accuracy and F1-score on its test set. Achieving such a good result could be explained by the big quantity of training data (160K examples). The data, however, consists of jokes and non-humorous texts like news headlines, which is different from the monologues and dialogues used to train the DilBERT2 model. A pre-trained ColBERT model is compared against DilBERT2 to assess whether it is beneficial to use the state-of-the-art model instead.
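As referenced above, the sketch below assembles one such CountVectorizer-based baseline with scikit-learn; the train/test variables are placeholders for the comic-text splits, and the Decision Tree stands in for any of the first four classifiers.

```python
# A sketch of a CountVectorizer-based baseline (here: Decision Tree).
# Assumption: train_texts / train_labels / test_texts / test_labels are
# placeholders for the comic-text splits described in the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

baseline = make_pipeline(CountVectorizer(), DecisionTreeClassifier())
baseline.fit(train_texts, train_labels)   # bag-of-words counts -> tree
predictions = baseline.predict(test_texts)
print(accuracy_score(test_labels, predictions),
      f1_score(test_labels, predictions))
```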

V. Results

This section presents the results obtained from the performed experiments. All reported results of the base models were recorded after training the model for 6 epochs; the large models were trained for 20 epochs. The rest of the settings follow the description in the Experimental Setup section. This section covers the results of the task of humor detection, sentiment analysis, and the use of dynamic batching.

A. Humor Detection

Tables II-IV compare the performance of the chosen methods in the humor detection task on the comic text domain. DilBERT2 significantly outperforms the chosen baselines, achieving scores of almost 80% when trained on the base-cased version, and a score of 76% when using the large-cased model. When the uncased version of the model is used, the accuracy of DilBERT2 decreases by 2-3%. Table IV presents the overview of the achieved scores on the Garfield dataset. The chosen baselines are also inferior to DilBERT2 in the Garfield domain. The found accuracy of DilBERT2 in different configurations is around 80%, with an F1-score of around 70%.

TABLE II: Comparisons of different methods on the task of humor detection on the grammatically correct version of the Dilbert dataset.

Method         | Configuration                | Accuracy | Precision | Recall | F1
Decision Tree  |                              | 0.598    | 0.668     | 0.708  | 0.684
SVM            | sigmoid, gamma=1.0           | 0.491    | 0.563     | 0.606  | 0.581
Multinomial NB |                              | 0.506    | 0.576     | 0.619  | 0.595
XGBoost        |                              | 0.550    | 0.621     | 0.701  | 0.656
ColBERT        |                              | 0.591    | 0.740     | 0.994  | 0.740
DilBERT2       | Base-Cased: Fixed Batching   | 0.790    | 0.853     | 0.787  | 0.819
DilBERT2       | Base-Cased: Dynamic Batching | 0.799    | 0.829     | 0.810  | 0.819
DilBERT2       | Large-Cased                  | 0.769    | 0.816     | 0.753  | 0.784

TABLE III: Comparisons of different methods on the task of humor detection on the dataset composed of parsed text from Dilbert comics.

Method         | Configuration                  | Accuracy | Precision | Recall | F1
Decision Tree  |                                | 0.577    | 0.640     | 0.617  | 0.627
SVM            | sigmoid, gamma=1.0             | 0.546    | 0.617     | 0.608  | 0.612
Multinomial NB |                                | 0.566    | 0.621     | 0.681  | 0.649
XGBoost        |                                | 0.621    | 0.659     | 0.707  | 0.682
ColBERT        |                                | 0.588    | 0.586     | 0.997  | 0.738
DilBERT2       | Base-Uncased: Fixed Batching   | 0.756    | 0.801     | 0.732  | 0.765
DilBERT2       | Base-Uncased: Dynamic Batching | 0.778    | 0.826     | 0.748  | 0.785
DilBERT2       | Large-Uncased                  | 0.739    | 0.755     | 0.761  | 0.758

TABLE IV: Comparisons of different methods on the task of humor detection on the dataset composed of text from Garfield comics.

Method         | Configuration                | Accuracy | Precision | Recall | F1
Decision Tree  |                              | 0.603    | 0.416     | 0.233  | 0.284
SVM            | sigmoid, gamma=1.0           | 0.594    | 0.427     | 0.332  | 0.363
Multinomial NB |                              | 0.632    | 0.481     | 0.236  | 0.303
XGBoost        |                              | 0.618    | 0.440     | 0.233  | 0.297
DilBERT2       | Base-Cased: Fixed Batching   | 0.813    | 0.779     | 0.646  | 0.706
DilBERT2       | Base-Cased: Dynamic Batching | 0.814    | 0.757     | 0.641  | 0.694
DilBERT2       | Large-Cased                  | 0.809    | 0.821     | 0.652  | 0.727

The results achieved by DilBERT2 trained and evaluated on the datasets from other humor-detection studies outperform, or fall within a close margin of, the previously reported scores. Table V presents the results of training DilBERT2 on the Pun of the Day dataset. The found accuracy and F1-score of DilBERT2 are 99%, an increase of 7% over the results reported by the previously fine-tuned model. Table VI shows the scores achieved by the models when trained on the Short Jokes dataset; DilBERT2 was trained with the dynamic batching technique. The found accuracy and F1-scores are equal to 98%, which matches the previously achieved result. The results on the ColBERT dataset (Table VII) are very similar, although the DilBERT2 model was trained on only 1000 examples. The found accuracy and F1-score of DilBERT2 are 96%, which is 2% less than the reported scores of the ColBERT model.

TABLE V: Comparison of DilBERT2 against the reported performance of TGtLL on the Pun of the Day dataset.

Model    | Configuration    | Time     | Accuracy | Precision | Recall | F1
DilBERT2 | Fixed Batching   | 5min 48s | 0.991    | 0.990     | 0.992  | 0.991
DilBERT2 | Dynamic Batching | 2min 15s | 0.991    | 0.987     | 0.992  | 0.990
TGtLL    |                  |          | 0.930    | 0.930     | 0.931  | 0.931

TABLE VI: Comparison of DilBERT2 against the reported performance of TGtLL on the Short Jokes dataset.

Method   | Configuration | Accuracy | Precision | Recall | F1
DilBERT2 |               | 0.983    | 0.985     | 0.980  | 0.983
TGtLL    |               | 0.986    | 0.986     | 0.986  | 0.986

TABLE VII: Comparison of performance on the task of humor detection on the ColBERT dataset.

Model    | Configuration    | Time     | Accuracy | F1
DilBERT2 | Fixed Batching   | 1min 12s | 0.957    | 0.960
DilBERT2 | Dynamic Batching | 47s      | 0.962    | 0.960
ColBERT  |                  |          | 0.982    | 0.982

B. Sentiment Analysis

The comparison of the scores achieved by DilBERT2 and the other baselines on the sentiment analysis task is presented in Table VIII. The scores achieved by the DilBERT2 models vary between 65% and 80%. The model still outperforms the other baselines by a big margin; however, it is not as reliable as in the humor detection task.

TABLE VIII: Comparisons of different methods on the task of sentiment analysis on the grammatically correct version of the Dilbert dataset.

Method         | Configuration                | Accuracy | Precision | Recall | F1
Decision Tree  |                              | 0.560    | 0.560     | 0.560  | 0.560
SVM            | sigmoid, gamma=1.0           | 0.543    | 0.542     | 0.543  | 0.541
Multinomial NB |                              | 0.580    | 0.569     | 0.580  | 0.570
XGBoost        |                              | 0.616    | 0.602     | 0.611  | 0.591
DilBERT2       | Base-Cased: Fixed Batching   | 0.700    | 0.699     | 0.700  | 0.699
DilBERT2       | Base-Cased: Dynamic Batching | 0.799    | 0.829     | 0.810  | 0.819
DilBERT2       | Large-Cased                  | 0.649    | 0.649     | 0.649  | 0.646

C. Dynamic Batching

Tables V and VII present the differences in training times and the observed performance when training the DilBERT2 models on datasets with around 3600 and 1000 examples, respectively. When training on the Pun of the Day dataset, dynamic batching reduces the total number of padding tokens from 180,950 to 61,429 (66% less). That translates into a training time that is 61% shorter than with the fixed approach. When training on the ColBERT dataset, the use of dynamic batching decreases the number of padding tokens from 50,600 to 20,328 (60% less), and the model using dynamic batching finished training 35% faster. In both cases, the use of dynamic batching does not indicate any negative effects on the final performance of the model.
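A minimal sketch of the dynamic batching scheme measured here, assuming a HuggingFace tokenizer: samples are sorted by token length, grouped, and each batch is padded only to its own longest sequence. The batch size mirrors the experimental setup; the helper name is illustrative.

```python
# A sketch of dynamic batching as described in Section III-B: sort samples
# by token length, group them into batches, and pad each batch only to the
# length of its longest member. Assumptions: batch_size=16 and a HuggingFace
# tokenizer; the function name is illustrative.
def dynamic_batches(texts, tokenizer, batch_size=16):
    # Sort by tokenized length so each batch holds similarly long samples.
    ordered = sorted(texts, key=lambda t: len(tokenizer.tokenize(t)))
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        # padding="longest" pads only up to the longest sequence in the
        # chunk, which is what cuts the padding-token counts reported above.
        yield tokenizer(chunk, padding="longest", truncation=True,
                        max_length=512, return_tensors="pt")
```

Summing the zeros in each batch's attention_mask over a full epoch reproduces the kind of padding-token counts compared above.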
VI. Responsible Research

Taking into account the ethical aspects of the conducted study is of utmost importance, especially in the domain of artificial intelligence (AI). For language classification tasks, it is important to have varied and inclusive datasets to increase the chances of correct detection of text given by people from different backgrounds, using different sentence structures or vocabulary. In the case of creating a model in a specific domain, it is important to provide many examples of the domain-specific text, for the model to learn how to operate with commonly used sentence structure. Ethical and legal aspects enforce following strict scenarios in which the data can be used. To allow for comparison between different models, and for using this research as a baseline, there is a need to present the procedure of the conducted study in a detailed and objective way.

A. Research Integrity

It is necessary to state that the Dilbert and Garfield datasets do not serve the purpose of copying the artwork for financial gain or publicity, but are used for purely research applications. The data in the Dilbert dataset has been modified for the needs of the study: special characters were removed for sentence clarity when scraping the data. In the data pre-processing phase of training the model, two versions of the dataset were used. One was obtained from the scraper; the text in the other dataset was rewritten to be grammatically correct, with expanded contractions. Some of the text data was removed due to its strong correlation with the comic image; this, however, does not affect the ethical aspects of the dataset. What is important is the inclusion of domain-specific data, in this case comic language from software development offices and adult life. When creating AI language models, it is crucial to reason about the ethical concerns of the developed project and reflect on its possible reactions to new input provided by different people. Human language is complex, and features like humor or sentiment can be expressed in a variety of ways in the different daily-life settings of different people.

B. Research Reproducibility

To provide a good starting point for future research, the descriptions of the methodology and the experimental process were written with precise accounts of the decisions carrying consequences that influence the output of the models. As a result, it is expected that this research can be a better source of knowledge and provide insight into the task of humor and sentiment detection in comic texts.

VII. Discussion

DilBERT2 for humor detection matches or outperforms the chosen baselines on different datasets. The difference the model achieves in the domain of comic text is significant when compared to the basic classifiers. That indicates the need for a complex NLP model when a linguistic expression like humor needs to be detected. When trained and tested on the grammatically correct dataset, DilBERT2 gives more accurate predictions than when the parsed version of the dataset is used. That finding confirms that the BERT model is capable of learning more complex linguistic features when the input encodes more of their characteristics. Nevertheless, the scores achieved by DilBERT2 on the comic domain are significantly lower than the scores achieved on the jokes datasets. The results of training DilBERT2 on the portion of the ColBERT dataset suggest that correctly interpreting comic texts is more difficult than classifying between jokes and news headlines, and that the lower score may not purely be a consequence of the small training dataset.

The adapted choices of hyperparameters and model functions for the training benefit the accuracy. The use of the dynamic batching technique indeed does not seem to affect the quality of the predictions given by the model. Moreover, the benefits in time and energy savings provide a good incentive to use the technique instead of the fixed batching approach. In contrast with previous studies [6], training the models for a higher number of epochs did not improve the quality of the predictions given by the model.

The use of the pre-trained ColBERT model does not allow for equally reliable humor detection in comic texts. Hence, a conclusion can be drawn that DilBERT2 for humor detection is a reliable humor classifier for the domain of comic text. Therefore, the model could be used to bring improvements to the concurrently developed comic-generating GANs by providing an auxiliary classification of the text included in the comics.

However, the proposed model does not perform as well on the sentiment analysis task. The performance measured on the comic-domain test set varied across multiple runs between 65% and 80%. Because of the multi-label nature of the classification problem, dataset expansion could lead to improved performance.

VIII. Conclusions and Future Work

This research has proven the effectiveness of using a fine-tuned BERT model for the task of humor detection on the new domain of comic texts. DilBERT2 for humor detection, using the chosen parameters and training techniques, achieves 80% accuracy on a comic-text dataset. The frequency of correct predictions could increase with a larger dataset of training examples, as the model is able to perform better on datasets that vary less in grammar and vocabulary when the training is done on the same number of examples. The potential benefits of using DilBERT2 with the comic GANs could include affecting the background color or the positioning of the characters within the generated pictures, based on the humor of the classified text. A similar use of DilBERT2 for sentiment analysis requires improving its performance; that could also potentially be solved with more examples provided to the model. The results achieved by DilBERT2 suggest that modern NLP models are able to perform difficult linguistic classification in varied domains, for which new applications could be found in literature, music, or cinema.

References

[1] Jacob Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: arXiv preprint arXiv:1810.04805 (2018).
[2] Chi Sun et al. "How to fine-tune BERT for text classification?" In: China National Conference on Chinese Computational Linguistics. Springer, 2019, pp. 194-206.
[3] Issa Annamoradnejad. "ColBERT: Using BERT sentence embedding for humor detection". In: arXiv preprint arXiv:2004.12765 (2020).
[4] Orion Weller and Kevin Seppi. "Humor detection: A transformer gets the last laugh". In: arXiv preprint arXiv:1909.00252 (2019).
[5] Michaël Benesty. Divide Hugging Face Transformers training time by 2 or more. June 2020. URL: https://towardsdatascience.com/divide-hugging-face-transformers-training-time-by-2-or-more-21bf7129db9e.
[6] Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. "On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines". In: arXiv preprint arXiv:2006.04884 (2020).
[7] Tianyi Zhang et al. "Revisiting few-sample BERT fine-tuning". In: arXiv preprint arXiv:2006.05987 (2020).
[8] Wei Yang et al. "End-to-end open-domain question answering with BERTserini". In: arXiv preprint arXiv:1902.01718 (2019).
[9] Kai Hakala and Sampo Pyysalo. "Biomedical named entity recognition with multilingual BERT". In: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks. 2019, pp. 56-61.
[10] Yukun Zhu et al. "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 19-27.
[11] Amil Merchant et al. "What happens to BERT embeddings during fine-tuning?" In: arXiv preprint arXiv:2004.14448 (2020).
[12] Ashish Vaswani et al. "Attention is all you need". In: arXiv preprint arXiv:1706.03762 (2017).
[13] Yinhan Liu et al. "RoBERTa: A robustly optimized BERT pretraining approach". In: arXiv preprint arXiv:1907.11692 (2019).
[14] Pengcheng Yin et al. "TaBERT: Pretraining for joint understanding of textual and tabular data". In: arXiv preprint arXiv:2005.08314 (2020).
[15] Louis Martin et al. "CamemBERT: a tasty French language model". In: arXiv preprint arXiv:1911.03894 (2019).
[16] Zhenzhong Lan et al. "ALBERT: A lite BERT for self-supervised learning of language representations". In: arXiv preprint arXiv:1909.11942 (2019).
[17] Victor Raskin. Semantic Mechanisms of Humor. Vol. 24. Springer Science & Business Media, 2012.
[18] Issa Annamoradnejad and Gohar Zoghi. "ColBERT-Using-BERT-Sentence-Embedding-for-Humor-Detection". URL: https://github.com/Moradnejad/ColBERT-Using-BERT-Sentence-Embedding-for-Humor-Detection.
[19] Short Jokes Kaggle Dataset. https://www.kaggle.com/abhinavmoudgil95/short-jokes.
[20] Diyi Yang et al. "Humor recognition and humor anchor extraction". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 2367-2376.