Computational Analysis of Humour

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Exact Humanities by Research

by

Vikram Ahuja 201256040 [email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2019

Copyright © Vikram Ahuja, 2019
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Computational Analysis of Humour” by Vikram Ahuja, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                                     Adviser: Prof. Radhika Mamidi

To my Parents and Late Prof. Navjyoti Singh

Acknowledgments

I would like to thank Prof. Radhika Mamidi for accepting me to complete my thesis under her guidance. I would like to thank Late Prof. Navjyoti Singh, my advisor, for accepting me into IIIT-H and for his constant support, guidance and motivation. Working under him was a great learning experience. He promoted free thought and exploration and pushed me to think out of the box. I thank my parents and Rubal for their unconditional love and support throughout the journey. I would like to extend my warm regards to all the research members at CEH for their help. I would specially like to thank Taradheesh Bali for his inputs, for being an awesome research partner and for being my co-author. I would like to thank all my friends and my hostel mates for making my journey in IIIT-H more exciting. A special shoutout to Manas Tewari for helping me review my thesis and most of my research work. I am also grateful to my friends Gaurav Singh, Durgesh Pandey and Amit Kumar Jha for our late night discussions and cardio sessions in meta states. Special mention to VRV, Vinit, Rathi, Maju, Tau, Priyanka and the whole of the silent wing for helping me throughout this journey. I would also like to thank Dr. Albert Hofmann for all his contributions in the field of consciousness and finally the greens on the foothills of the Himalayas.

Abstract

In this thesis we mainly focus on three major aspects of computational humour recognition. We start with categorizing humour based on the classical theories of humour along with features like theme, emotions and topics. We then look at the problem of recognizing humour in conversations and broadcasted speeches, which are more complex and longer than short jokes. Finally, we try to differentiate between different types of off-colour humour and to detect insulting remarks within off-colour humour, where dark humour is often misclassified as insulting humour.

Most scholarly works in the field of computational detection of humour derive their inspiration from the incongruity theory. Incongruity is an indispensable facet in drawing a line between humorous and non-humorous occurrences but is immensely inadequate in shedding light on what actually made the particular occurrence a funny one. Classical theories like the Script-based Semantic Theory of Humour (SSTH) and the General Verbal Theory of Humour (GVTH) try and achieve this feat to an adequate extent. We adhere to a more holistic approach towards classification of humour based on these classical theories, with a few improvements and revisions. Through experiments based on our linear approach and performed on large datasets of short jokes, we are able to demonstrate the adaptability and componentizability of our model, and show that a host of classification techniques can be used to overcome the challenging problem of distinguishing between various categories and sub-categories of jokes.

Almost all the studies done in the field of computational humour recognition have been done on datasets consisting of short jokes, tweets and puns. We try to detect humour in conversations and broadcasted speeches as they are complex and contain more contextual information when compared to short jokes. For the purpose of automatic humour detection in monologues we built a corpus containing humorous utterances from TED talks, and for dialogues we analysed data from the popular TV show Friends, whose canned laughter gives an indication of when the audience would react. We classified dialogues/monologues into humorous and non-humorous using multiple deep learning methods. Our experiments on the data show that such deep learning methods outperform the baseline by 21 accuracy points on the TED Talk dataset.

Off-colour humour is a category of humour which is considered by many to be in poor taste or overly vulgar. Most commonly, off-colour humour contains remarks on a particular ethnic group or gender, violence, domestic abuse, acts concerned with sex, excessive swearing or profanity. Blue humour, black humour and insult humour are types of off-colour humour. Blue and black humour, unlike insult humour, are not outrightly insulting in nature but are often misclassified because of the presence of insults and harmful speech. We then provide an original dataset consisting of nearly 15,000 instances and a novel approach towards resolving the problem of separating black and blue humour from offensive humour, which is essential so that free speech on the internet is not curtailed. Our experiments show that deep learning methods outperform other n-gram based approaches like SVMs, Naive Bayes and Logistic Regression by a large margin.

Contents


Abstract

1 Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Organisation of thesis

2 Related Work
  2.1 Theories Of Humor
    2.1.1 Classical Theories of Humour
    2.1.2 Linguistic Theories of Humour
  2.2 Computational Humour
    2.2.1 Humour Recognition

3 Automatic Humour Classification of Jokes
  3.1 Overview
  3.2 Related Work
  3.3 Proposed Framework
  3.4 Dataset
  3.5 Experiments
  3.6 Analysis
  3.7 Future Work

4 Humour Detection in Conversations
  4.1 Overview
  4.2 Related Work
  4.3 Dataset
  4.4 Methodology
  4.5 Experiment and Results
  4.6 Discussion
  4.7 Future Work

5 Computational Analysis of Off-Colour Humour
  5.1 Overview
  5.2 Related Work
  5.3 Proposed Framework


  5.4 Dataset
  5.5 Experiments
  5.6 Analysis
  5.7 Future Work

6 Conclusions

Related Publications

Bibliography

List of Figures


5.1 Differentiating between Insulting and Non-Insulting Humor
5.2 Jokes and Insults differentiation
5.3 Graph showing the results of the classifiers used

List of Tables


3.1 Computationally Detectable Characteristics in Jokes
3.2 Example of few jokes in the dataset
3.3 Result for Topic Detection
3.4 Result for Sarcastic Jokes
3.5 Result for Dark Jokes
3.6 Result for Adult Slang/Sexual Jokes
3.7 Result for Gross Jokes
3.8 Result for Insult Jokes

4.1 An excerpt from the TED talk "Tim Urban: Inside the mind of a master procrastinator"
4.2 An excerpt from S01E01 of the "Friends" TV show
4.3 Results of various classifiers on the dataset

5.1 Few examples of jokes used in our dataset
5.2 Table showing the accuracies of various classifiers used

Chapter 1

Introduction

Humor is the tendency of particular cognitive experiences to provoke laughter and provide amusement. Humor is an essential element of all verbal communication. The Oxford English Dictionary defines humour as "that quality of action, speech, or writing which excites amusement; oddity, jocularity, facetiousness, comicality, fun" [65]. Humour has been studied by philosophers for a long time, starting from the classical Greek thinkers Plato [13] and Aristotle [26], leading to multiple different definitions of humour as different disciplines see humour differently.

Plato is considered the first theorist of humour [4] and described humour as a mixed feeling of the soul [51, 4], i.e., a mixture of pleasure and pain. Some linguists describe humour as any object or event that elicits laughter, amuses or is felt to be funny [4], while others describe humour as something which sometimes elicits laughter and sometimes a smile [48]. Many researchers have proposed different types of humour; some researchers reduce humour to just one mechanism, i.e., incongruity and its resolution [58], while some believe that a global anthropological theory of humour and laughter is not possible [3] and that it is impossible to define humour [61]. Humour is considered to be a multidisciplinary field, as its research has had contributions from various disciplines including psychology, anthropology, sociology, literature, philosophy, philology, semiotics and linguistics [4].

The Greek thinkers highly influenced Latin authors like Cicero [16] and Quintilian [52]. Cicero's work on humour is considered an important work and a first attempt at a taxonomy of humour from a linguistic perspective [4]. Cicero describes the distinction between verbal and referential humour, which has been used by many other theorists like Freud [21] and Raskin [53]. Verbal humour is the type of humour which is expressed verbally using a language or text, unlike physical and visual humour, which does not need a language to be represented. Verbal humour is of interest to linguists and NLP researchers.

There can be multiple definitions and categorizations of humour, which makes it a very challenging and interesting domain to work in. On top of that, humour is also incredibly subjective and highly contextual. A joke can be hysterical to one person while another person might find it offensive. Also, there can be types of humour which make sense in one scenario but can be inappropriate in another, for example dark humour and self-deprecating humour.

1.1 Motivation

In past works in the domain of humour, consensus is yet to be achieved regarding the definition and categorization of humour, and there is no agreement on a single theoretical framework to define humour. Amongst the classical theories of humour, the incongruity theory is an indispensable facet in drawing a line between humorous and non-humorous occurrences but is immensely inadequate in shedding light on what actually made the particular occurrence a funny one, i.e., the factors that motivate us to laugh. Linguistic theories of verbal humour like the Script-based Semantic Theory of Humour and the General Verbal Theory of Humour try and achieve this feat to an adequate extent. We posit that a different framework based on theme and emotions can adequately help us understand humour along these lines.

Over the past 4-5 years, chat-bots and artificial and virtual assistants have grown larger, and their capabilities have improved and grown more complex. One of the most ambitious and useful applications of computational humour is to include comical elements in chatbots, virtual assistants and robots: adding humour in conversations can make them more interesting and human-computer interaction more natural [20]. Humour is an important part of our conversations and interactions with others, and by incorporating humour in our machines and AI we improve our conversational systems by providing them more empathetic power; it is thus an important part of Human-Computer Interaction (HCI). There have been attempts to incorporate humour into machines like Siri, Alexa and Google Home, but they are very far from perfect. In order to achieve this feat there is a need to study how humour is evoked in conversations, so that it can be replicated or detected using current computational models.

In the last decade, there has been an exponential increase in the volume of social media interactions (Twitter, Reddit, Facebook etc.). It took over three years, until the end of May 2009, for the billionth tweet to be sent1. Today, it takes less than two days for one billion tweets to be sent. Social media has increasingly become the staple medium of communication via the internet. However, due to the non-personal nature of online communication, it presents a unique set of challenges. Social media has become a breeding ground for hate speech and insults, as there is a lack of accountability that can be abused. Humour is an essential part of communication and allows us to convey our emotions and feelings. We need to discern between content which is an honest attempt at humour and content which is purely derogatory and insulting, which is essential so that free speech on the internet is not curtailed.

1#Numbers, Twitter Official Blog. March 14, 2011

1.2 Contribution

The major contributions of this thesis are as follows:

• Describing a theoretical framework for understanding humour which also provides the base for the task of computational classification of a vast array of types of jokes into categories and sub-categories based on the theme they express and the emotion that they evoke. We then present a comparative study of a wide range of topic detection as well as classification methods on large datasets of one-liner jokes.

• Analysing the use of deep learning methods to automatically detect humour in spoken utterances like dialogues and monologues. Our study beat the state-of-the-art humour recognition system by a margin of 21 accuracy points.

• Describing a theoretical framework for analysing off-colour humour and, to the best of our knowledge, the first computational study towards resolving the problem of separating black and blue humour, both of which are part of off-colour humour, from insulting and offensive humour. We also present a dataset of nearly 15,000 jokes belonging to these categories.

1.3 Organisation of thesis

This chapter gives an introduction to the thesis work and defines the problem statement. The different parts of this thesis work are organised in the following chapters.

Chapter 2 gives a background on the various theories and perspectives of humour given by researchers and philosophers. The chapter also describes the different publications and research work done in the field of computational humour detection which have been crucial to the experiments we have worked on.

Chapter 3 describes our attempt at a more holistic approach towards classification of humour based on classical theories like GVTH and SSTH, along with a few improvements and revisions, to analyse jokes based on the theme that they express and the emotion that they evoke. Multiple experiments performed on large datasets of short jokes show the adaptability and componentizability of our model, and that a host of classification techniques can be used to overcome the challenging problem of distinguishing between various categories and sub-categories of jokes.

Chapter 4 describes our model of automatic humour recognition in dialogues and monologues. Humour is evoked differently in utterances compared to short jokes due to the presence of more context. This chapter also elucidates the dataset that we created and describes multiple experiments using state-of-the-art deep learning methods to automatically predict humour in such utterances.

Chapter 5 describes the problem of creating computational recognition models to detect off-colour humour. Off-colour humour is a type of humour which contains remarks on a particular ethnic group or gender, violence, domestic abuse, acts concerned with sex, excessive swearing or profanity. Blue and black humour are not outrightly insulting in nature but are often misclassified because of the presence of taboo, insulting and hurtful words.

Chapter 6 gives a brief summary of the work presented in this thesis and the plausible future work that can be done in this field. It consists of the conclusions and inferences drawn on the basis of the experiments conducted by us.

Chapter 2

Related Work

2.1 Theories Of Humor

2.1.1 Classical Theories of Humour

Theories of humour can be categorised into three main groups. They are as follows:

1. Relief Theory
Relief/release-based theory maintains that humour operates as a psychological tension and psychic energy release model [6, 4]. Instead of defining humour, this theory connects humour to laughter. The two most prominent relief theorists are Herbert Spencer and Sigmund Freud.

Herbert Spencer developed his theory of laughter as a hydraulic expression, in which laughter does in the nervous system the same thing that a pressure-relief valve does in a steam boiler [64]. He argues that "nervous excitation always tends to beget muscular motion." As a form of physical movement, laughter can serve as the expressive route of various forms of nervous energy [64], i.e., laughter releases the built-up energy. One of the main drawbacks of this theory was that it was not able to explain types of humour which do not involve any particular form of built-up energy, e.g., anti-jokes, witty humour and visual humour [1].

Freud also developed a relief theory in his book Jokes and Their Relation to the Unconscious [21]. His theory of laughter is based on the idea that emotions take the physical form of nervous energy. He describes three sources of laughter: der Witz (the joke), the comic and humour [21]. The idea of relief theory was used by Freud to explain why we find forbidden or taboo topics humorous to acknowledge.

Relief theories of humour also have important linguistic implications, because they account for the liberation from the rules of language, for wordplay-based jokes like puns [4], and for humour created through infractions of Grice's principle of cooperation [25], typical of humour at large.

Thriller and adventure movies or plays are a good way to see the use of relief theory. The plots of such movies often tend to include comic relief when there is a high-tension scene in play. The tension builds up suspense, which is then broken down, allowing the viewers to relieve themselves from the high-tension emotions.

2. Superiority Theory
The superiority theory states that we usually find the misfortunes and shortcomings of others humorous because they make us feel better about ourselves [27]. Plato (Philebus) and Aristotle (Poetics) were the first to study this class of humour and developed a theory of laughter [42]. In Philebus, Plato suggested that vice is what makes a person laughable and that laughter is related to pain and pleasure. Aristotle in Poetics further developed Plato's theory and related laughter to ugliness [47].

Thomas Hobbes can be credited with developing the most famous version of the superiority theory in his book Leviathan, which built further upon the theories of Plato and Aristotle. Hobbes stated that the passion of laughter is nothing else but sudden glory arising from some sudden conception of some eminency in ourselves, by comparison with the infirmity of others, or with our own formerly [28]. He proposed that laughter arises from a sense of superiority of the laugher towards some object (the butt of the joke) [4]. Critics of the superiority theory have stated that feelings of superiority are neither necessary nor sufficient for laughter, as one can often feel superior to animals without laughing at them [30].

In the superiority theory, speakers feel themselves to be at a higher standard, or at least on the same level as the listener, when the listener's misfortunes or mistakes are pointed out. There is also a sense of detachment from the situation due to the superior feeling that is experienced, and that is shown in the form of laughter.

3. Incongruous Juxtaposition Theory
Incongruity/surprise theory refers to an incompatibility between the expected pattern of relationships among the components of an object, event or idea and social expectation. This incompatibility, disconnect and expectancy violation leads to humour [4]. Incongruity-based theories have been discussed by Aristotle and Cicero, as well as during the Renaissance period. Aristotle, alongside the superiority theory, also thought that humour can be invoked by creating an expectation and then violating it; thus it is the disappointment which makes something humorous [47].

Kant and Schopenhauer were the main proponents of this class of humour. Kant described humour as an affection arising from the sudden transformation of a strained expectation into nothingness [33, 47]. The attention is drawn from the set-up line, which sets up the expectation, to the punchline, which violates it; thus the expectation is turned into nothingness. Schopenhauer was the first person to use the term incongruity to describe his theory of humour. Schopenhauer stated that the cause of laughter in every case is simply the sudden perception of the incongruity between a concept and the real objects which have been thought through it in some relation, and the laughter is in itself the expression of this incongruity [47, 62]. Most of the cognitive theories of humour are based on the incongruity theory because of the presence of a mismatch of ideas [4].

Incongruity theory is the most widely accepted theory of humour in philosophy, psychology and linguistics [47], and most of the computational humour recognition work has been done on the basis of the incongruity theory.

2.1.2 Linguistic Theories of Humour

1. Script-based Semantic Theory of Humour (SSTH)
Victor Raskin introduced the Script-based Semantic Theory of Humour [54], which is the first formal theory of verbal humour [57]. Each concept expressed by a word which is internalized by the native speaker of a language is related, via some cognitive architecture, to a semantic script encompassing all the surrounding pieces of information. Thereafter, he posits that in order to produce the humour of a verbal joke, the following two conditions must be met:

• The text is compatible, fully or in part, with two different (semantic) scripts. [54]
• The two scripts with which the text is compatible are opposite. The two scripts with which the text is compatible are said to overlap fully or in part on this text. [54]

Humor is evoked when a trigger at the end of the joke, the punchline, causes the audience to abruptly shift its understanding from the primary (or more obvious) script to the secondary, opposing script.

2. General Verbal Theory of Humour (GVTH)
The key idea behind GVTH is the six levels of independent Knowledge Resources (KRs) defined by Attardo [5]. These KRs can be used to model individual jokes and act as the distinguishing factors in order to determine the similarity or differences between types of jokes. The KRs are ranked below in the order of their ability to determine/restrict the options available for the instantiation of the parameters below them:

• Script Opposition (SO)
• Logical Mechanism (LM)
• Situation (SI)
• Target (TA)

• Narrative Strategy (NS)
• Language (LA)

2.2 Computational Humour

Computational humour can be categorised into two main groups: computational humour generation and computational humour recognition. Over the years researchers have worked on humour generation by studying jokes, making templates from them, and using those templates and structures to generate new instances of humour. One of the earliest models of humour generation is the Joke Analysis and Production Engine (JAPE) [10, 9, 56], developed in the mid-1990s, later followed by the Homonym Common Phrase Pun Generator [67] developed by Christopher Venour in 1999; both are out of the scope of this thesis.

2.2.1 Humour Recognition

Julia Taylor’s [66] work was the first study in the domain of computational humour detection. Their work mainly focused on all possible jokes that are wordplay based(especially Knock Knock jokes). Wordplay jokes are the types of jokes in which there are words which have similar sounds but have different meanings or Homonyms. The comical effect is created because of this conflict of different meanings or subject. A typical Knock Knock (KK) joke is a dialogue that uses wordplay in the punchline and can be summarised using the below structure.

Template:
Line 1: "Knock, Knock"
Line 2: "Who is there?"
Line 3: any phrase
Line 4: Line 3 followed by "who?"
Line 5: One or several sentences containing one of the following:
  Type 1: Line 3
  Type 2: a wordplay on Line 3
  Type 3: a meaningful response to Line 3

Joke Example:
Knock, Knock
Who's there?
Water
Water who?
Water you doing tonight?

Their method uses Raskin's theory (Semantic Theory of Verbal Humor) [54] as its theoretical foundation and works on the assumption that a wordplay joke can be divided into two parts, a setup and a punchline. In the case of Knock Knock jokes, the original sentence is the setup and the wordplay sentence is the punchline, and they both have different scripts, which leads to a comical effect. The task of recognizing humour is thus reduced to detecting wordplay in a joke. The paper uses an n-gram based approach to store sequences of words from a dataset of Knock Knock jokes, which they generated on their own, in order to recognize them as wordplay. They used a wordplay generator which creates new utterances/words based on similarity of sound. In the above example, if the letter 'w' in the word 'water' is replaced with 'wh', 'e' is replaced with 'a', and 'r' is replaced with 're', the new utterance, 'what are', sounds similar to 'water'. Then a wordplay recognizer detects whether the newly created utterance is meaningful with respect to the whole text, or to the utterance it is trying to replace, by using a bigram table of every two-word sequence along with its count. All the valid new utterances are considered and a joke recognizer is used to determine if the newly created text is a joke. This work was able to recognize wordplays, as the found wordplay matched the intended wordplay, but it was not able to recognize jokes.
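A minimal sketch of the kind of sound-alike wordplay check described above, assuming a toy corpus and hand-coded letter substitution rules (a real system would derive substitutions from phoneme similarity and use a much larger bigram table):

```python
from collections import Counter
from itertools import product

# Toy corpus and bigram table; a real system would build these from a large text collection.
corpus = "what are you doing tonight".split()
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

# Illustrative sound-alike substitution rules (hypothetical, hard-coded for this example).
substitutions = {"w": ["w", "wh"], "e": ["e", "a"], "r": ["r", "re"]}

def respellings(word):
    """All naive letter-by-letter respellings of `word`."""
    options = [substitutions.get(ch, [ch]) for ch in word]
    return {"".join(combo) for combo in product(*options)} - {word}

def wordplay_readings(word):
    """Readings of a respelling as a known word or as a bigram seen in the corpus."""
    readings = []
    for cand in respellings(word):
        if cand in vocab:
            readings.append(cand)
        for i in range(1, len(cand)):            # try every split point
            left, right = cand[:i], cand[i:]
            if bigrams[(left, right)] > 0:       # e.g. "water" -> "what are"
                readings.append(left + " " + right)
    return readings

print(wordplay_readings("water"))                # ['what are']
```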

Mihalcea and Strapparava [45] were the first to use empirical methods to investigate the use of automatic classifiers in computational humour recognition. Most of the computational humour work before them focused mainly on the task of humour generation. They considered the humour recognition problem as a traditional classification task and built a binary classifier to distinguish humorous from non-humorous content. They restricted their dataset to one-liner jokes only, as these can produce a good comical effect in very few words, and they also noticed that all one-liners have a similar linguistic structure. Their dataset consisted of one-liner jokes (16,000) mined from various jokes websites as humorous data; for non-humorous data they collected news headlines from Reuters, proverbs and British National Corpus sentences, because these have the same size as one-liner jokes. Their classification model was based on heuristics over various humour-specific stylistic features like alliteration, antonymy and adult slang, as well as content-based features, through experiments where the humour recognition task is formulated as traditional text classification, using specifically Naive Bayes and SVM. They found that one-liner jokes make use of words such as man, woman, person and you more often, and also found that one-liners often use negative word forms like don't, isn't and can't to give a deprecating connotation and are based on human weakness, negation, negative orientations, professional communities and human-centric vocabularies.
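A rough sketch of this style of classifier, combining a bag-of-words representation with a few hand-crafted stylistic cues; the word lists and the two training examples are purely illustrative placeholders, not the features or data of the original study:

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative stand-ins for the stylistic cues mentioned above.
NEGATIONS = {"don't", "isn't", "can't", "doesn't", "won't"}
HUMAN_CENTRIC = {"man", "woman", "person", "you", "i"}

def stylistic_features(text):
    tokens = text.lower().split()
    first_letters = [t[0] for t in tokens if t.isalpha()]
    alliteration = max([first_letters.count(c) for c in set(first_letters)] or [0])
    negation = sum(t in NEGATIONS for t in tokens)
    human_centric = sum(t in HUMAN_CENTRIC for t in tokens)
    return [alliteration, negation, human_centric]

def featurize(texts, vectorizer, fit=False):
    bow = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    style = csr_matrix([stylistic_features(t) for t in texts])
    return hstack([bow, style])

# Hypothetical tiny training set: 1 = one-liner joke, 0 = news headline.
texts = ["I can't believe it's not better", "Stock markets fall on weak data"]
labels = [1, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(featurize(texts, vec, fit=True), labels)
print(clf.predict(featurize(["A man walks into a bar"], vec)))
```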

Mihalcea et al. [46] experimented with various linear computational models for the task of incongruity resolution. Incongruity theory says that the comical effect produced by a joke is because of the surprise element generated by the incoherency between the punchline and the setup line. They performed their experiments on one-liners because these have creative language construction so as to produce a good comical effect in very few words. The dataset consisted of 150 setups, each of which had 4 different punchlines, of which only one created a comical effect. Example:

Don't drink and drive. You might hit a bump and

1. spill your drink. (Comical)

2. get a flat tire.

3. have an accident.

4. hit your head.

Their approach consisted of testing two classes of models: the first consisted of a knowledge-based semantic model, a corpus-based semantic model to identify the degree of relatedness between the setup and the punchline, and a domain-based (e.g. medicine, sports) semantic model; the second model was based on joke-specific features such as polysemy, alliteration and adult slang. They found that features such as polysemy and alliteration in the punchline boost the performance and considered them to be important features of humour. They also concluded that the LSA model used was able to capture the surprise element produced by the incongruity, and its result was nearly similar to that obtained when using an SVM.
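As a simplified stand-in for the corpus-based relatedness models discussed above (TF-IDF cosine similarity instead of a full LSA space), one can score how related each candidate punchline is to the setup; the data here is the example just given:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

setup = "Don't drink and drive. You might hit a bump and"
candidates = ["spill your drink", "get a flat tire", "have an accident", "hit your head"]

# A real model would be fitted on a large corpus (or use LSA); here we fit on the joke itself.
vec = TfidfVectorizer().fit([setup] + candidates)
scores = cosine_similarity(vec.transform([setup]), vec.transform(candidates))[0]

# These relatedness scores are the signal such models feed into a downstream classifier.
for cand, score in zip(candidates, scores):
    print(f"{cand!r}: relatedness to setup = {score:.3f}")
```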

In Yang et al. [71] the authors used 4 aspects or features which they considered the indicators of humour. The four features used in their study for humour recognition are:

• Incongruous structure - The presence of a disconnect (semantic distance) between the set-up line and the punchline leads to laughter. In this paper, disconnection (the maximum distance in meaning of a word pair) and repetition (the minimum distance in meaning of a word pair) are used to determine incongruity.

• Ambiguity - The presence of ambiguity in the setup and punchline also leads to humour, as the listener is expecting another meaning. To detect this, sense combination, sense farmost and sense closest features, which are based on the WordNet tree, were used.

• Interpersonal features - The authors used the hypothesis that humour is associated with strong sentiment and subjectivity.

• Phonetic features - Alliteration and rhyming words were used as features.

They used one-liner jokes and puns as their dataset, which had been used in previous studies of humour [45]. In their experiments they concluded that the incongruity features give the best result of all the latent semantic features mentioned above for distinguishing non-humorous sentences from humorous ones, while the ambiguity and phonetic features gave reasonably good results. The interpersonal features did not provide substantial performance, which was because of the nature of the dataset used. They also concluded that the nature of the dataset is a very important factor and that humour characteristics are expressed differently across datasets.
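The disconnection and repetition features above can be approximated as the largest and smallest pairwise semantic distances between content words; the toy vectors below are placeholders for pretrained embeddings such as Word2Vec:

```python
import numpy as np
from itertools import combinations

# Toy word vectors; a real system would load pretrained embeddings.
emb = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.1]),
    "piano": np.array([0.1, 0.9, 0.3]),
}

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def incongruity_features(tokens):
    """Disconnection = largest pairwise distance, repetition = smallest."""
    known = [t for t in tokens if t in emb]
    dists = [cosine_distance(emb[a], emb[b]) for a, b in combinations(known, 2)]
    if not dists:
        return {"disconnection": 0.0, "repetition": 0.0}
    return {"disconnection": max(dists), "repetition": min(dists)}

print(incongruity_features(["cat", "dog", "piano"]))
```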

Kiddon and Brun’s[34] study investigated euphemisms that are often used in humour and identified double entendre in humour. A double entendre is a type of expressionism used in jokes which have

10 two meanings, a straightforward meaning and an indecent meaning. They conducted experiments on That’s what she said(TWSS) types on jokes. They considered this problem as a metaphor detection problem and also mapped the straightforward/innocuous meaning to it’s erotic/indecent meaning. Their method consisted of detecting nouns that are euphemisms for sexually explicit words and they used the hypothesis that TWSSs share common structure with sentences in the erotic domain. They concluded that euphemism and erotic-domain-structure features contribute to improving the precision of TWSS identification and that their SVM based model gives a better result than using ngram based model to de- tect TWSS jokes and the metaphorical mapping created can be used to detect other types of euphemisms or humour.

Chen and Soo’s[15] study investigated the use of CNN in humour recognition. It is one of the first study to use deep learning methods to draw a distinction between humour and non humorous content in english as well as chinese language on a very large dataset. Their dataset consisted of 16000 one- liners, pun of the day dataset(used in previous studies of humour detection), short jokes dataset collected from Kaggle1 and Chinese jokes. Their proposed deep learning approach outperformed all the previous humour detection work by a significant margin and were able to achieve mean accuracy of 90% on a very large dataset of jokes.

Chen and Lee’s[14] work was the first investigation in humour detection which uses conversations as a dataset rather than tweets and jokes which have been used in most of the humour recognition studies so far. They investigated the use of convolutional neural network in humour detection. Their dataset consisted of Ted Talk transcripts, which is a form of a conversation along with a pun dataset. They tried and tested out two methods for the task of Humor Detection in Ted Talk dataset. The first method consisted of using a random forest classifier to detect humorous sentences by using stylistic human- engineered humour features such as Incongruity , Ambiguity , Interpersonal Effect , and Phonetic Pattern along with semantic distance features which were learned from a KNN for this model. The second model consisted of using an end to end CNN to predict two possible labels, i.e, laughter and non-laughter. They achieved an accuracy of 86% in pun detection which was at par with the previous studies and achieved an accuracy of 56% for the task of laughter detection in Ted Talks. They concluded that CNN achieved a much better performance than its counterpart and CNN is an efficient method when encountering a new dataset because of its representation learning.

Bertero and Fung [7] investigated the use of Long Short-Term Memory networks for the task of detecting humour in funny dialogues in the transcripts of TV sitcom shows. Humour in TV sitcom shows is indicated by the canned laughter or background laughter included in the audio. The background canned laughter is used to create a distinction between humorous and non-humorous dialogues. Their model consisted of using a convolutional neural network for each utterance, which was then followed by a Long Short-Term Memory framework to model the linear sequence of sentences/utterances. For the convolutional neural network they used three input-specific features, word tokens and character trigrams along with word2vec, which is then used to model the likeliness of humour. Their LSTM model included a set of high-level features like structural features (average word length, sentence length), part-of-speech proportions (nouns, verbs, adjectives and adverbs), antonyms, sentiment score and speaker identity. They achieved an accuracy of 70% and an F-score of 63% using the LSTM with the high-level features. They concluded that using an n-gram model to detect jokes and punchlines has a problem of false positives, which the LSTM model overcomes by filtering them out, and that the CNN and LSTM models have far better accuracy in detecting canned laughter than the classical n-gram approach.

1 https://www.kaggle.com/abhinavmoudgil95/short-jokes
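Returning to the CNN-plus-LSTM pipeline described above, the sequence-level part can be sketched as follows, assuming each utterance has already been encoded into a fixed-length feature vector by the CNN (all dimensions are placeholders):

```python
import torch
import torch.nn as nn

class UtteranceLSTM(nn.Module):
    """LSTM over per-utterance feature vectors, predicting laughter after each utterance."""

    def __init__(self, feat_dim=300, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)              # laughter vs. no laughter

    def forward(self, utterance_feats):              # (batch, n_utterances, feat_dim)
        h, _ = self.lstm(utterance_feats)
        return self.out(h)                           # (batch, n_utterances, 2)

# Two scenes of ten utterances each, with random placeholder feature vectors.
model = UtteranceLSTM()
print(model(torch.randn(2, 10, 300)).shape)          # torch.Size([2, 10, 2])
```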

Chapter 3

Automatic Humour Classification of Jokes

3.1 Overview

In this chapter we describe an approach towards classification of humour based on the classical theories of humour. Incongruity is an indispensable facet in drawing a line between humorous and non-humorous occurrences but is immensely inadequate in shedding light on what actually made the particular occurrence a funny one. Classical theories like the Script-based Semantic Theory of Humour (SSTH) and the General Verbal Theory of Humour (GVTH) try and achieve this feat to an adequate extent. With a few improvements and revisions to these theories, a theoretical framework is provided for the task of performing experiments on large datasets of jokes to demonstrate the adaptability and componentizability of our model, and to use a host of classification techniques to overcome the challenging problem of distinguishing between various categories and sub-categories of jokes.

We formulate the problem of determining different types of humour as a traditional classification task by feeding positive and negative datasets to a classifier. The dataset consists of one-liner jokes of different types collected from many jokes websites, multiple subreddits and multiple Twitter handles.

In short, the contributions of this chapter can be summarized as follows:

• We present a theoretical framework which also provides the base for the task of computational classification of a vast array of types of jokes into categories and sub-categories.

• We present a comparative study of a wide range of topic detection methods on large data sets of one-liner jokes.

• We analyze jokes based on the theme that they express and the emotion that they evoke.

The rest of the chapter is divided into six more sections. Section 3.2 provides an overview of related work and its shortcomings. Section 3.3 presents the framework. Section 3.4 presents the dataset along with some pre-processing steps. Section 3.5 presents the various experiments conducted on the dataset. Section 3.6 discusses the results, while Section 3.7 concludes the chapter.

3.2 Related Work

Research in humour is a field of interest pertaining not only to linguistics and literature but to neuroscience and evolutionary psychology as well. Research in humour has been done to understand the psychological and physiological effects, both positive and negative, on a person or groups of people. Research in humour has given rise to many different theories as well as different types of humour (dark humour, anti-humour etc.) and studies of their influence and effect on human lives and society.

Historically, humour has been synonymous with laughter, but major empirical findings suggest that laughter and humour do not always have a one-to-one association, for example non-Duchenne laughter [22]. At the same time it is also well documented that even though humour might not have a direct correlation with laughter, it certainly has an influence in evoking certain emotions as a reaction to something that is considered humorous [60]. Through the ages there have been many theories of humour which attempt to explain what humour is, what social functions it serves, and what would be considered humorous. Though, among the three main rival theories of humour, the incongruity theory is more widely accepted than the relief and superiority theories, it is necessary but not sufficient in containing the scope of what constitutes humour. Between the Script-based Semantic Theory of Humour (SSTH) by Raskin and the General Verbal Theory of Humour (GVTH) by Attardo, owing to its use of Knowledge Resources GVTH has a much higher coverage as a theory of humour than SSTH, but there still are a few aspects where GVTH comes up short. In prior sections we have established that humour has a direct correlation with the emotions that it evokes. In a similar manner, emotions also act as a trigger to a humorous event. In such events, because the reason for the inception of the humorous content lies with the post-facto realization/resolution of the incongruity caused by the emotion rather than the event itself, applying script opposition is out of line. For example, consider fear, a negative emotion that can stem from some incongruity in the expected behaviour of our surroundings. Our primary emotion in such a situation is fear. Even so, the result of this incongruity caused in our emotional state, which incipiently was caused by the incongruity in our physical surroundings, can lead to humour. It must be noted that the trigger here is neither the situation nor any logical mechanism or script opposition, but the emotional incongruity.

Correspondingly, humour can also prompt itself in the form of meta-humour, just as emotions do. For example, one way to appreciate a bad joke can be the poorness of the joke itself. Another major point of contention in GVTH is the Logical Mechanism. Here, logical does not stand for deductive logic or strict formal logicality but rather should be understood in some looser, quotidian sense of 'rational thinking and acting' or even 'ontological possibility'.

Krikmann [37] correctly points out that in both SSTH and GVTH, Raskin's concept of script is merely a loose and coarse approximation, borrowed from cognitive psychology, which attempts to explain what actually happens in human consciousness. Such scripts encapsulate not only direct word meanings, but also semantic information presupposed by linguistic units as well as the encyclopaedic knowledge associated with them. Even so, in order to explain certain instances where direct or indirect script opposition is missing, we need to inject an inference mechanism and a script elaborator into the current cognitive model, which would work off the pre-existing scripts and ones newly formed through the inference mechanism. These two features become indispensable, as it is not always the case that opposing scripts are readily available to us.

3.3 Proposed Framework

Having Script Opposition as the only derivative bedrock behind the start of a humorous event proves deleterious to SSTH and GVTH's ability to adapt to different kinds of incongruities. Further, the inability of GVTH to accommodate emotions at any level, the uncertainty surrounding the Logical Mechanism with its really vague identity, and the ordering of the Knowledge Resources instigate us to diverge from SSTH and GVTH as the foundation for our computational setup. Rather, in order to address such shortcomings, we have kept the structure of our theory much more consequence driven.

An approach solely derived from the existing types of humour would be subject to changes and alterations with the addition of every new type of humour, and would add the limitation of the model being either too rigid, which might lead to overfitting while performing computational analysis, or unstable, as it becomes unable to sustain new types after more and more changes. In preference to this, we proceed with caution keeping in mind the scope of this problem, drawing from the successes of previous theories such as SSTH and GVTH with a more holistic approach in mind.

From the outset, Attardo’s and Raskin’s theories of humour, i.e., GVTH and SSTH had their features focused towards recognizing the distinguishing parameters of various degrees of similarity among jokes. In a similar manner we recognize three major marked characteristics which are reflected across all types of jokes, viz.

1. Mode: Each joke, whether verbal, textual or graphic, has a way in which it is put across to the respective audience. This mode of delivery of a joke can be (but is not always) decided upon by the performer of the humorous act. The mode can be a matter of conscious choice or the spontaneous culmination of a dialogue. Different situations might warrant different modes of delivery, leading to varied effects after the humour behind the joke is resolved. For example, the delivery of a joke can be sarcastic, where the speaker might want to retort to someone in a conversation, or it can be one where the triviality of the speaker's reaction becomes the source of humour. As compared to SSTH and GVTH, which investigate the reason behind the incongruity (incongruity being the single source of humour) in the scripts or situations in such scenarios, we embrace incongruity as one of the many mechanisms that are possible and keep the scope open for all categories, which encompass far greater types of humour, including and not limited to the juxtaposition of opposing scripts. Thus, the tools at our disposal to bring about variations in the mode become more than mere language-based artifacts like puns, alliterations etc. The mode can be based on the phonetics of the words, such as in a limerick. Two unique sub-categories that can be addressed here, which would otherwise cause problems in SSTH and GVTH due to their structure of logical mechanism, are Anti-Humour and Non-Sequitur. Both are unconventional forms of humour and pose a stringent challenge to such theories. Non-Sequitur is difficult to accommodate even for GVTH due to its reliance on Logical Mechanisms (LM). While all the jokes which follow some sort of logical structure could have been classified according to GVTH due to LM, a Non-Sequitur does not follow any logical structure whatsoever. The entire point of a non-sequitur is that it is absurd in its reason and it also makes no sense according to semantics or meaning. The case with anti-humour could not be more different, as it is not a play on the logical structure of the normal conversation but on that of the joke. Hence, as we have also mentioned in the criticisms section, there does not exist a mechanism in the previous theories to deal with such second-order humour and meta-jokes.

2. Theme: Each joke, through the use of its language and subject matter, conveys a feeling or an emotion along with it. As we have discussed at length in the previous sections, emotion plays a very important role in a humorous event. It can by itself spur a new thread for a joke as well as act as the conclusive feeling that we get along with the humorous effect, for example the feeling of disgust on hearing a joke about a gross situation or thing. Hence, the function that the 'theme' of a joke can serve is as a pointer towards the overall affect the joke has during its delivery and after its resolution. In this way we are able to tackle the aspects of a humorous event which are content and language dependent.

3. Topic: Most jokes have some central element, which can be regarded as the butt of the joke. This element is the key concept around which the joke revolves. It can be based on stereotypes, such as in blonde jokes, or can be based off of a situation such as "walks into a bar". As can be observed in the latter case, it is mostly but not always the case that the central element is a single object or person. The "walks into a bar" situation might further lead to a topic or a situation which ends up with the punchline being on the dumb blonde stereotype. Hence, a single joke, without such restrictions on its definition, can have multiple topics at the same time. Also, by not restricting ourselves to only stereotypes about things, situations and beings, we can also play with cases where the topic is the stereotype of a particular type of joke itself, leading to humour about stereotypes of humour, for example a joke about a bad knock knock joke.

Categories | Sub-Categories
Mode | Sarcastic; Exaggeration/Hyperbole; Phonetics Assisted; Semantic Opposites; Secondary Meaning
Theme | Dark Joke; Gross Joke; Adult/Sexual Joke; Insults
Topics | Animal, Blonde, Fat, Food, Profession, Kids, Marriage, Money, Nationality, Sports, News/Politics, Police/Military, Technology, Height, Men/Women, Celebrities, Pop Culture, Travel, Doctor, Lawyer, God/Religion, Pick-up Lines, School, Parties, Walks into a Bar, Yo-mama

Table 3.1 Computationally Detectable Characteristics in Jokes

On inspection of the aforementioned categories we can clearly observe that, unlike in GVTH, giving a hierarchical structure to these metrics is unsustainable. This works in our favour as we get rid of establishing problematic dependencies like ontological superiority for each category. Instead, we provide a flatter approach where a joke can be bred out of various combinations from each category and belong to multiple sub-categories at the same time.

The culmination of our work towards creating computationally detectable entities leads us to recognizing a sub-set in each of the categories that we have defined above. In the coming sections we venture towards testing our theoretical framework in real-life scenarios extracted through various social media. Table 3.1 provides a catalogue of the sub-categories that we detect in each category.

3.4 Dataset

The following types of jokes were mined from various websites. Table 3.2 contains a few examples of jokes in our dataset.

i. Topic Detection: For the task of topic detection in jokes we mined many jokes websites, collected their tags, and considered those our topics. We restricted our jokes to the following categories: Animal, Blonde, Fat, Food, Profession, Kids, Marriage, Money, Nationality,

Joke | Category
I asked my North Korean friend how it was there? He said he couldn't complain. | Sarcastic Joke
You know what, we need a huge spoon to take care of this. - Guy who invented shovels | Exaggeration/Hyperbole
Coca Cola went to town, Diet Pepsi shot him down. Dr. Pepper fixed him up, Now we are drinking 7up. | Phonetics Assisted
Humpty Dumpty had a great fall - and a pretty good spring and summer, too. | Semantic Opposites
Those who like the sport fishing can really get hooked | Secondary Meaning
Why don't black people go on cruises? They are not falling for that one. | Dark Jokes
Q: Why did the skeleton burp? A: It didn't have the guts to fart. | Gross Joke
Does time fly when you're having sex or was it really just one minute? | Adult/Sexual Joke
You are proof that evolution can go in reverse. | Insults

Table 3.2 Example of few jokes in the dataset

Sports, News/Politics, Police/Military, Technology, Height, Men/Women, Celebrities/Pop Culture, Travel, Doctor, Lawyer, God/Religion, Pick-up Lines, School, Party, Walks into a Bar, Yo-mama. Most of the jokes websites had the above topics as common topics. We mined nearly 40,000 one-liner jokes belonging to these 25 categories for use in topic detection. Since they were collected automatically, it is possible that there is noise in the dataset.

ii. Sarcastic Jokes: For the task of sarcasm detection we mined sarcastic jokes (positive) from Reddit and other jokes websites which had sarcasm tags. For negative data we considered data under tags other than sarcasm and manually verified the jokes. We created a dataset of 5000 jokes, with 2500 belonging to the positive set and an equal number of negative instances, and manually verified them.

iii. NSFW Jokes: These are the types of jokes which are most common on online media. They are mainly associated with heavy nudity, sexual content, heavy profanity and adult slang. We collected multiple one-liner jokes from the subreddit /r/dirtyjokes and took jokes from various jokes websites with the tags NSFW, dirty, adult and sexual. We created a dataset of 5000 jokes, with 2500 positive instances and an equal number of negative instances, verified manually.

iv. Insults: These kinds of jokes consist mainly of offensive insults directed at someone else or towards the speaker themselves [44]. Typical targets for insult include individuals in the show's audience or the subject of a roast. The speaker of an insult joke often maintains a competitive relationship with the listener. We collected multiple jokes from the subreddit /r/roastme and

after manual verification we had 2000 jokes as positive instances, and for negative instances we manually created a dataset of 2000 one-liner jokes.

v. Gross: A joke having to do with disgusting acts or other things people might find grotesque. We extracted 500 jokes from various jokes websites which had a "gross" category/tag. We selected an equal number of non-gross jokes from the above dataset. After manual verification we had a total of 1000 jokes in this category, 500 in each of the positive and negative sets.

vi. Dark Humor : It’s a form of humor involving a twist or joke making the joke seen as offensive, harsh, horrid, yet the joke is still funny. We collected multiple jokes from subreddit /r/darkjokes as well as as many jokes websites containing the tag Dark Humor. After removing duplicates we had a dataset of 3500 dark jokes. For negative samples we randomly selected 3500 jokes from the jokes websites which did not contain Dark Humor in their tags and manually verified them.

The content of user-created jokes on Twitter and Reddit can be noisy. They may contain elements like @RT, links, dates, IDs, names of users, HTML tags and hashtags, to name a few. To reduce the amount of noise before the classification task, the data is subjected to the following pre-processing tasks.

• Tokenization: In a raw post, terms can be combined with any sort of punctuation and hyphenation and can contain abbreviations, typos, or conventional word variations. We use the NLTK tokenizer package to extract tokens from the joke, removing stop words, punctuation, extra white space and hashtags, removing mentions, i.e., IDs or names of other users included in the joke, and converting to lowercase.

• Stemming: Stemming is the process of reducing words to their root (or stem), so that related words map to the same stem or root form. This process naturally reduces the number of words associated with each document, thus simplifying the feature space. We used the NLTK Porter stemmer in our experiments. (A small sketch of this pre-processing is given below.)
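A minimal sketch of this pre-processing pipeline using NLTK; the regular expression for mentions and hashtags is a simplification of the clean-up steps described above, and it assumes the 'punkt' and 'stopwords' resources have been downloaded:

```python
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(joke):
    """Lowercase, drop mentions/hashtags/punctuation/stop words, then Porter-stem."""
    text = re.sub(r"[@#]\w+", " ", joke.lower())     # strip @mentions and #hashtags
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOP and t not in string.punctuation]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("RT @user: Humpty Dumpty had a great fall - and a pretty good spring too! #jokes"))
```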

3.5 Experiments

We performed various experiments on our dataset. For the evaluation we randomly divided our dataset into 90% training and 10% testing. All the experiments were conducted over 10 such runs and the final performance is reported by averaging the results.
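The sketch below shows one way such an evaluation could be wired up with scikit-learn; the TF-IDF/LinearSVC pipeline is a generic stand-in for the per-category feature sets described in the list that follows:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate(texts, labels, runs=10, test_size=0.1):
    """Average accuracy over repeated random 90/10 splits, mirroring the protocol above."""
    accuracies = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=test_size, random_state=seed, stratify=labels)
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
        clf.fit(X_tr, y_tr)
        accuracies.append(clf.score(X_te, y_te))
    return float(np.mean(accuracies))
```

Here `texts` would be the positive and negative jokes for a given sub-category and `labels` their binary tags.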

i. Topic Detection: There are a wide variety of methods and variables, and they greatly affect the quality of results. We compare the results of three topic detection methods on our dataset to detect the topics of these jokes. We use LDA, Naive Bayes and SVM along with lexical and pragmatic features and compare their results. We also augment the used approaches by boosting proper nouns and then recalculating the experiment results on the same dataset. The boosting technique that we have used is duplication of proper nouns. This boosting technique was chosen keeping in mind the need to give priority to the semantics of the joke.

ii. Sarcastic: We treat sarcasm detection as a classification problem. After pre-processing the data we extracted n-grams, more precisely unigrams and bigrams, from the dataset, which were then added to the feature dictionary. Along with this we used Brown clustering, which helped us put similar kinds of words in the same cluster. Along with these features we also took the sentiment values of the different parts of the joke (here 3) as a feature, because there is usually a great difference in sentiment scores between different parts of a sarcastic joke or tweet. Using these lexical as well as pragmatic features, as in Gonzales-Ibanez et al. [23], we train a logistic regression model and an SVM to distinguish sarcastic jokes from non-sarcastic jokes.

iii. Exaggeration: These are types of statements that represent something as better or worse than it really is. They can create a comical effect when used appropriately. For example, in the joke "Your grandma is as old as mountains", the intensity of the statement is increased by using a phrase like "as old as". We detect such intense phrases in jokes to categorize them under this category by getting the sentiment score of every token. The individual sentiment score of every token in the phrase as well as the combined sentiment score must be in the positive range to generate an exaggeration effect.

iv. Antonyms/Semantic Opposites: An antonym is one of a pair of words with opposite meanings. Each word in the pair is the antithesis of the other. We use the antonym relation in WordNet among nouns, adjectives and verbs, and use an approach similar to Mihalcea et al. [45].

v. Phonetic Features: Rhyming words can also create a joke. For instance, the joke "Coca Cola went to town, Diet Pepsi shot him down. Dr. Pepper fixed him up, Now we are drinking 7up" creates a comical effect due to the fact that town and down, and up and 7up, are rhyming words. Similar rhetorical devices play an important role in such wordplay jokes and are often used to create a humorous effect. We used the CMU Pronouncing Dictionary1 to detect rhyming words (a small sketch of such a rhyme check appears after this list).

vi. Secondary Meaning: These are the types of jokes where we find that there is a semantic relation among words in a joke, and that relation could be of the form located in, part of, type of, related to, has, etc. For example, in the joke "Those who like the sport fishing can really get hooked", the comical effect is created due to the relation between "hook" and "fishing". In order to detect these relations in a joke we use ConceptNet [63]. It is a multilingual knowledge base, representing words and phrases that people use and the common-sense relationships between them. Using ConceptNet we are able to derive a "used in" relationship between hook and fishing. We go up to three levels to detect secondary relationships between different terms in a joke.

vii. Dark Humour: It is a comic style that makes light of subject matter that is generally considered taboo, particularly subjects that are normally considered serious or painful to discuss, such

1http://www.speech.cs.cmu.edu/cgi-bin/cmudict

as death. Some use it as a tool for exploring vulgar issues, thus provoking discomfort and serious thought as well as amusement in their audience. Popular themes of the genre include violence, discrimination, disease, religion and barbarism. Treating it as a classification problem, we extracted unigrams from the dataset. We also extracted sentiment scores of the sentences, following the hypothesis that dark humour tends to have a very negative sentiment score throughout the joke. We then compared the accuracies of classification techniques such as SVM and Logistic Regression.

viii. Adult Slangs/Sexual Jokes: These types of jokes are among the most popular on the internet. After pre-processing we extracted unigrams and bigrams. To detect these jokes we used a slang dictionary called SlangSD [69], which contains over 90,000 slang words and phrases along with their sentiment scores. Using these features, we compared the accuracies of classification methods such as SVM and Logistic Regression.

ix. Gross: Treating the detection of gross jokes as a classification problem, unigrams are extracted after pre-processing. We kept a list of the top 100 gross words according to their TF-IDF score; this feature indicates the presence of gross words. Along with this we also use sentiment scores, based on the hypothesis that gross jokes tend to have a negative sentiment. Using all these features we compare accuracies using SVM and Logistic Regression.

x. Insults: After pre-processing we extract unigrams and bigrams from the dataset. Along with this we create a list of insulting words using the top 100 words according to their TF-IDF score. We also compute sentiment scores for each joke and use these features in a Naive Bayes classifier and an SVM.

3.6 Analysis

In Tables 3.3, 3.4, 3.5, 3.6, 3.7 and 3.8 we can see the results of our classifiers. We see that SVM achieves better accuracy than Naive Bayes and Logistic Regression in all cases. In the case of topic detection, proper-noun boosting increases the accuracy further. In the case of sarcasm detection, sentiment scores together with unigrams and bigrams given to an SVM gave the best result. In the case of dark humour detection there is a significant increase in accuracy once sentiment values are introduced; this may be because the sentiment values of the negative instances are opposite to those of the positive instances, and the result is expected because dark jokes tend to have negative sentiment values. In adult slang detection we obtain very good accuracy as soon as a slang dictionary is introduced, and in the detection of gross jokes the accuracy increases as soon as sentiment and common gross words are introduced. In short, we find that sentiment values prove to be a very important feature in the detection of various sub-categories. We are also able to detect intense phrases which lead to exaggeration, as well as jokes in which there is some kind of semantic relation among different terms. Using these sub-categories we have covered a lot of ground in the categorization of jokes. The results we achieve act as binary indicators for each sub-category in our experiment, thus giving a joke multiple tags according to topic, theme and mode, and making our approach more extensive and unique compared to its counterparts.

Classifier                     | Accuracy
LDA                            | 59%
Naive Bayes                    | 63%
SVM                            | 72%
SVM + Proper Noun Boosting     | 76%

Table 3.3 Result for Topic Detection

Features                                               | Accuracy
Logistic Regression (LR)                               | 68%
LR + (1,2)-grams                                       | 71%
LR + (1,2)-grams + Brown Clustering                    | 71.5%
LR + (1,2)-grams + Brown Clustering + Sentiment Scores | 75.2%
SVM + Sentiment Scores + n-grams                       | 77%

Table 3.4 Result for Sarcastic Jokes

3.7 Future Work

Given the constraints of the scope of our work, we have tried to assimilate as many sub-categories as possible into our computational framework, but at the same time we make the ambitious yet modest assumption that it is still possible to add a few more sub-categories. As our model is versatile enough to handle the addition of such sub-categories seamlessly, the only impediment would be the feasibility of the effort and the availability of the computational tools needed to integrate them. With the addition of more and diverse data the model can also be made more robust and accurate. In the future, the framework can be extended to distinguish between humorous and non-humorous events, allowing us to use the complete tool on various types of data, such as movie or television show scripts, to detect occurrences of various types of humour and hence arrive at a more holistic classification of said media.

Features                     | Accuracy
Logistic Regression (LR)     | 59%
LR + Sentiment Scores        | 63%
SVM + Sentiment Scores       | 64%

Table 3.5 Result for Dark Jokes

Features                     | Accuracy
Logistic Regression (LR)     | 71%
LR + (1,2)-grams + SlangSD   | 85%
SVM + (1,2)-grams + SlangSD  | 88%

Table 3.6 Result for Adult Slang/Sexual Jokes

Features                              | Accuracy
Logistic Regression (LR)              | 56%
LR + Common Gross Words + Sentiment   | 65%
SVM + Common Gross Words + Sentiment  | 67%

Table 3.7 Result for Gross Jokes

Features                                                  | Accuracy
Naive Bayes + (1,2)-grams                                 | 72%
SVM + (1,2)-grams                                         | 72%
SVM + insulting words + sentiment values + (1,2)-grams    | 79%

Table 3.8 Result for Insult Jokes

Chapter 4

Humour Detection in Conversations

4.1 Overview

In this chapter we analyse and detect humour in conversations, which are more complex and contain more contextual information than short jokes. For this purpose we built a dataset of monologues and dialogues: for automatic humour detection in monologues we built a corpus containing humorous utterances from TED talks, and for dialogues we analysed data from the popular TV sitcom Friends, whose canned laughter gives an indication of when the audience would react. We classified dialogues/monologues into humorous and non-humorous using multiple deep learning methods.

Although many studies have shown promising results in predicting humour in short jokes and tweets, such short jokes often lack context and do not replicate real conversations. This is the motivation for trying to detect humour in dialogues and monologues, as they follow a pattern similar to human conversations. We experiment on two different datasets with classifiers such as SVMs and with multiple deep neural networks (DNNs), including Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), BiLSTMs and FastText. We use Chen and Lee's study [14] as the baseline for detecting humour in monologues and Bertero and Fung's study [7] as the baseline for detecting humour in dialogues. In short, our contributions can be summarised as follows:

• We investigate the use of deep learning classification algorithms for the task of detecting humour in large datasets of utterances and conversations.

• Our experiments using such deep learning methods outperform the baseline method by 21 accu- racy points.

The rest of the chapter is divided into six more sections. Section 4.2 provides details of related work along with its criticisms. Section 4.3 presents the dataset used, Section 4.4 presents the methodology and Section 4.5 presents the experiments conducted and their results. Section 4.6 discusses the results and Section 4.7 outlines future work.

4.2 Related Work

The task of automatic humour classification is to decide whether a sentence or spoken utterance expresses a certain degree of humour. The study by Mihalcea and Strapparava [45] was one of the first to investigate computational humour detection by treating it as a binary classification problem. They created a dataset of 16,000 short jokes from various joke websites as humorous instances and used Reuters news headlines, proverbs and British National Corpus sentences as non-humorous instances, with humour-specific stylistic features such as alliteration, antonyms and adult slang. Their dataset has been used in multiple later studies, such as Yang et al. [71] and Chen and Soo [15].

Yang et al. [71] used the same 16,000-joke dataset along with a pun dataset they created, and used features such as incongruity structure, ambiguity, interpersonal features and phonetic features, which they considered indicators of humour; they concluded that the incongruity features give the best results among the latent semantic features. The pun dataset they created was further used by Chen and Soo [15] and Chen and Lee [14] in their respective papers.

Since Convolutional Neural Networks (CNNs) have been used successfully in many text categorization tasks, deep learning methods have recently been applied to humour detection as well. Chen and Soo [15] investigated the use of deep learning methods to detect humour in short jokes and puns in English and Chinese, working on the 16,000-joke dataset as well as the pun dataset. Their experiments showed improved results across the complete dataset and currently set the benchmark for detecting humour in short jokes.

The study by Bertero and Fung [7] was one of the first in computational humour detection on a dataset consisting of utterances and conversations. They showed that LSTMs model sequential information better than Conditional Random Fields (CRFs), but their work had limitations: they did not compare against the state-of-the-art CNN method examined in [71], and their dataset is not publicly available, so the results cannot be replicated. Chen and Lee [14] studied humour in conversations and broadcast speeches using TED Talk transcripts as a dataset. Their proposed CNN method predicted humour with an accuracy of 59%. Although their dataset had advantages over short-joke datasets, they did not use any sequential model, even though such models have been shown to provide better results since conversations are better modelled by such frameworks [72]. The present study is meant to address these limitations.

Utterance | Label
He has no memory of the past, no knowledge of the future, and he only cares about two things: easy and fun. | 0
Now, in the animal world, that works fine | 0
If you're a dog and you spend your whole life doing nothing other than easy and fun things, you're a huge success! (Laughter) | 1
And to the Monkey, humans are just another animal species. | 0
Now ... here's my brain. (Laughter) | 1
There is a difference. Both brains have a Rational Decision-Maker in them, but the procrastinator's brain also has an Instant Gratification Monkey. Now, what does this mean for the procrastinator? Well, it means everything's fine until this happens. | 0

Table 4.1 An excerpt from TED talk Tim Urban: Inside the mind of a master procrastinator

4.3 Dataset

For the task of humour detection in monologues we decided to use TED Talks which are freely available and created a corpus from it and for the task of humour detection in dialogues we used the Friends TV series Transcript.

i. TED Talk dataset: TED is a media organization that posts talks online for free distribution under the slogan "ideas worth spreading"1. Over 2,600 TED Talks are freely available on the website and have previously been used in studies on computational humour. Since a TED talk is mainly a speech broadcast to an audience, we consider it a monologue. We focused on the transcripts of the TED Talks in order to detect humorous utterances/sentences. All TED Talk transcripts contain tags such as a laughter tag, for whenever the audience laughed aloud during the talk, and an applause tag, for whenever the audience clapped during or at the end of the talk. We used the laughter tag present in the transcript to mark humorous sentences and built a dataset from it; the remaining sentences are considered non-humorous. We collected transcripts of 2,085 TED talks consisting of 48,666 sentences, of which 8,914 were humorous and 39,752 were non-humorous (not having the laughter tag). Table 4.1 contains an example of the utterances used in the dataset. A sketch of how the laughter tag can be turned into labels is given below.
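The sketch below labels a sentence as humorous when it is immediately followed by the laughter marker. The exact transcript format (the literal string "(Laughter)") and the use of NLTK's sentence splitter are assumptions; the thesis only states that the laughter tag marks humorous sentences.

```python
# Label TED-talk sentences by whether they are immediately followed by "(Laughter)".
import re
from nltk.tokenize import sent_tokenize        # nltk.download('punkt')

def label_transcript(text):
    examples = []
    parts = re.split(r"\(Laughter\)", text)
    for i, part in enumerate(parts):
        sentences = [s.strip() for s in sent_tokenize(part) if s.strip()]
        for j, sent in enumerate(sentences):
            followed_by_laughter = i < len(parts) - 1 and j == len(sentences) - 1
            examples.append((sent, int(followed_by_laughter)))
    return examples

sample = ("Now ... here's my brain. (Laughter) "
          "And to the Monkey, humans are just another animal species.")
print(label_transcript(sample))
```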

ii. Friends TV Series dataset: Friends2 is a famous American television sitcom which aired from September 22, 1994 to May 6, 2004, lasting ten seasons. The sitcom has multiple characters whose utterances create comical and humorous effects of different types; the jokes in this TV series include insults, sarcasm, dark jokes, puns and adult jokes as well as abuse. Since the utterances represent a wide variety of jokes and resemble day-to-day conversation, we decided to use the series for humour detection in dialogues. We built a corpus from seasons 1 to 10. All the humorous content/jokes in this sitcom are marked by canned background laughter

1http://www.ted.com 2https://en.wikipedia.org/wiki/Friends

Speaker | Utterance | Label
Monica | There's nothing to tell | 0
Joey | C'mon, you're going out with the guy! There's gotta be something wrong with him! | 1
Chandler | All right Joey, be nice. So does he have a hump? A hump and a hairpiece? | 0
Phoebe | Wait, does he eat chalk? | 1
Phoebe | Just, 'cause, I don't want her to go through what I went through with Carl- oh! | 0
Monica | Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex. | 0
Chandler | Sounds like a date to me. | 1

Table 4.2 An excerpt from S01E01 from ”Friends” TV Show

that the audience would react to. We used those utterances as positive (humorous) examples, and the utterances without canned laughter are considered non-humorous. The dataset was manually annotated after watching the whole TV series, and the transcript3 was given the necessary labels. Table 4.2 contains an example of the utterances used in the dataset.

4.4 Methodology

We investigate different architectures for the task of humour detection, namely SVM, CNN, LSTM, BiLSTM and FastText, which are described in detail below. For each of the neural network methods described below, we initialize the word embeddings with either random embeddings or GloVe embeddings. The GloVe embeddings [50], trained on a large tweet corpus (2B tweets, 27B tokens, 1.2M vocabulary, uncased), are used to obtain the vector representations of words. The methods are as follows:

i. SVM: Support Vector Machines [17] are a class of supervised binary classification methods that find the hyperplane which best separates the positive and negative examples in a dataset. We used TF-IDF (term frequency–inverse document frequency) [59] features to classify the sentences/utterances in our dataset.
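A minimal sketch of this baseline is shown below, assuming `texts` and `labels` hold the utterances and their 0/1 humour tags; the 80/20 split matches the evaluation set-up reported later in this chapter.

```python
# TF-IDF features + linear SVM baseline (scikit-learn).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def tfidf_svm_baseline(texts, labels):
    """Random 80/20 split, TF-IDF features, linear SVM; returns test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))
```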

ii. CNN: Convolutional Neural Networks are widely used in NLP for text classification and have been shown to give good results [31]. We take the CNN model proposed by [35] for text classification and use it for humour detection. We first convert the tokenized text into a 2D matrix of shape L × d, where L is the maximum sentence length and d is the word embedding dimension, using either GloVe or random embeddings in the embedding layer depending on the experiment. These embeddings are then fed to a convolutional layer, where we experiment with different filter sizes. With n filters, the output after the convolutional and max-pooling layers is flattened to a 1D vector of n dimensions.
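A much-simplified, single-filter-width sketch of this classifier is given below; `vocab_size`, `max_len` and the optional GloVe weight matrix are assumed to come from the preprocessing step, and the filter count of 128 is an assumption.

```python
# Kim-style CNN text classifier, reduced to one filter width (Keras / TensorFlow).
from tensorflow.keras import layers, models

def build_cnn(vocab_size, max_len, embedding_dim=200, embedding_matrix=None):
    embedding = layers.Embedding(
        vocab_size, embedding_dim, input_length=max_len,
        weights=[embedding_matrix] if embedding_matrix is not None else None)
    model = models.Sequential([
        embedding,                          # each sentence becomes an L x d matrix
        layers.Conv1D(128, 5, activation="relu"),
        layers.GlobalMaxPooling1D(),        # max-pool down to a fixed-size vector
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_cnn(vocab_size=20000, max_len=50)
# model.fit(X_train, y_train, batch_size=64, epochs=500, validation_split=0.1)
```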

3Transcripts collected from https://fangj.github.io/friends/

iii. LSTM: Long Short-Term Memory networks [29] are a type of RNN which, unlike feed-forward neural networks, use an internal memory to process sequences of input. Since our dataset consists of monologues and dialogues, one utterance can lead to a humorous effect in another utterance. Such long-range dependencies can be captured by an LSTM for the task of humour detection in a context-dependent dataset.

iv. BiLSTM: In a unidirectional LSTM only information from the past is preserved, whereas a bidirectional LSTM [24] runs over the input both from past to future and from future to past. A BiLSTM therefore has complete sequential information about both the past and the future at a given time step.
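A sketch covering both recurrent variants is given below: wrapping the recurrent layer in `Bidirectional` turns the LSTM into the BiLSTM just described. The embedding size of 200 matches the settings reported later in this chapter; the hidden size of 128 units is an assumption.

```python
# LSTM / BiLSTM classifier over padded token sequences (Keras / TensorFlow).
from tensorflow.keras import layers, models

def build_lstm(vocab_size, max_len, bidirectional=False):
    rnn = layers.LSTM(128)
    if bidirectional:
        rnn = layers.Bidirectional(rnn)     # BiLSTM variant
    model = models.Sequential([
        layers.Embedding(vocab_size, 200, input_length=max_len),
        rnn,
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```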

v. FastText: The FastText classifier [32], created by Facebook's AI Research (FAIR) lab, has been used for text classification and has proven efficient: it is as accurate as classical deep learning classifiers while being much faster to train and evaluate. FastText represents a document as the average of its word vectors, similar to using bag-of-words and bag-of-n-gram features for text classification. FastText updates the word vectors through backpropagation during training, allowing the model to fine-tune the word representations for the task.
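A sketch of the fastText classifier is shown below. The library expects a plain-text training file with one "__label__&lt;tag&gt; &lt;text&gt;" example per line; the file name is an assumption, and only the hyper-parameters reported later in this chapter (dimension, epochs, learning rate, window size) are set here.

```python
# Supervised fastText classifier on a labelled utterance file.
import fasttext

# train.txt, one example per line, e.g.:  __label__1 Sounds like a date to me.
model = fasttext.train_supervised(input="train.txt", dim=200, epoch=15, lr=0.1, ws=5)
labels, probabilities = model.predict("Now ... here's my brain.")
print(labels, probabilities)
```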

4.5 Experiment and Results

We performed various experiments on both of our datasets. For evaluation we randomly divided our dataset into 80% training and 20% testing. The experiment settings and the results are given below.

i. SVM: We used an SVM with TF-IDF scores, after pre-processing the data, to classify unseen sentences into two categories, humorous and non-humorous. The SVM was implemented using the scikit-learn library [49]. We averaged the results after performing the experiment 10-fold. Table 4.3 shows the results.

ii. CNN: The described CNN model was implemented using TensorFlow. We experimented with a word embedding size of 200 and a batch size of 64 for 500 epochs, and used the Adam optimizer [36]. The results are shown in Table 4.3.

iii. LSTM: We implemented our LSTM model using an LSTM layer on the Keras sequential model. We experimented with a word embedding size of 200 and a batch size of 128, and used the Adam optimizer. Table 4.3 shows the results of our LSTM classifier on the dataset.

iv. BiLSTM: We implemented our BiLSTM model using a bidirectional layer on the Keras sequential model. We experimented with a word embedding size of 200 and a batch size of 128, and used the Adam optimizer. Table 4.3 shows the results of our BiLSTM classifier on the dataset.

Method | Accuracy
TED Talk Dataset
Baseline | 58.9%
CNN + Random Embedding | 64.28%
CNN + GloVe | 63.12%
LSTM + Random Embedding | 80.96%
LSTM + GloVe | 78.86%
FastText + Random Embedding | 76.93%
FastText + GloVe | 76.81%
BiLSTM | 78.10%
Friends TV Series Transcript
CNN + Random Embedding | 57.34%
CNN + GloVe | 59.67%
LSTM + Random Embedding | 75.93%
LSTM + GloVe | 71.60%
FastText + Random Embedding | 65.51%
FastText + GloVe | 66.32%
BiLSTM | 66.89%

Table 4.3 Results of various classifiers on the dataset

v. FastText: We implemented our FastText model using the fastText library [11]. We used 200-dimensional word vectors with a batch size of 64 for 15 epochs, a learning rate of 0.1, a context window of size 5 and an RMSProp optimizer. Table 4.3 shows the results.

4.6 Discussion

For the TED Talk dataset, with the baseline at 58.9%, we implemented our own CNN model to verify the dataset and achieved nearly the same result. As shown in the table, we achieve the best performance using an LSTM with random embeddings, which outperforms the baseline by a margin of 22 accuracy points. For the Friends TV series dataset we find the best performance using an LSTM with GloVe embeddings. An LSTM allows us to put each utterance in relation to past utterances, thus filtering out many false positives, while a CNN is better at modelling the lexical content of the utterances. We also find that a bidirectional LSTM and FastText show a drastic improvement over the baseline but are not able to outperform the LSTM. In terms of training time, FastText performs best, giving good results with the shortest training time. The accuracies on the Friends TV series dataset are not on par with the TED Talk dataset because Friends uses canned laughter to make the audience laugh, and an utterance with canned laughter may sometimes contain no humour at all.

4.7 Future Work

As future work we plan to conduct a more rigorous comparative evaluation against other humour recognition methods, and to look into the generation of humorous utterances, incorporating both frameworks to create a humorous chat-bot capable of generating humour based on the ongoing conversation.

Chapter 5

Computational Analysis of Off-Colour Humour

5.1 Overview

In this chapter we analyse off-colour humour, a category of humour that is considered by many to be in poor taste or overly vulgar. Most commonly, off-colour humour contains remarks about a particular ethnic group or gender, violence, domestic abuse, acts concerned with sex, excessive swearing or profanity. Blue humour, dark humour and insult humour are types of off-colour humour. Blue and dark humour, unlike insult humour, are not outrightly insulting in nature but are often misclassified because of the presence of insults and harmful speech.

Humour that is sometimes considered to be purely offensive, insulting or a form of hate speech is described as off-colour humour. Off-colour humour (also known as vulgar humour or crude humour) deals with topics that may be considered to be in poor taste or overly vulgar. It primarily consists of three sub-categories: dark humour, blue humour and insult humour.

Dark humour has been discussed frequently in literary and social research as well as in psychology, but not much attention has been given to it in linguistics. As defined by Wikipedia, it is a comic style that makes light of subject matter that is generally considered taboo, particularly subjects that are normally considered serious or painful to discuss, such as death. Dark humour aims at making fun of situations usually regarded as tragic, such as death, sickness, disability and extreme violence, or of the people involved in or subject to them [12]. It is inspired by or related to these tragic events but does not directly make fun of them [39]. In dark humour a gruesome or tragic topic is mixed with an innocuous one, which creates shock and inappropriateness; this invoked inappropriateness or shock is generally amusing to listeners [2]. Dynel [19] shows that dark humour inspired by tragic events such as terrorist attacks addresses only topics tangential to them and does not in any way make fun of them directly. Blue humour is a style of humour that is indecent or profane and is largely about sex; it contains profanity or sexual imagery that may shock. Insult humour is the kind of humour which consists of offensive insults directed at a person or a group. Roasting is a form of insult comedy in which a specific individual, a guest of honour, is subjected to jokes at their expense, intended to amuse the

31 event’s wider audience as defined by Wikipedia. All the three categories mentioned above seem to be interrelated to each other but have very fine differences. Dark humour is different from straightforward obscenity (blue humour) in the way that it is more subtle. Both dark and blue humour are different from insult humour in the sense that there is no intent of offending someone in the former two whereas in insult humour the main aim is to jokingly offend or insult the other person or a group of people [38].

People often get more offended by such misunderstood instances of humour than would otherwise be the case. The significance of our contribution can be fully appreciated only when we realise that such occurrences can lead to gratuitous censorship and therefore to the curtailment of free speech. It is in this context that we formulate our problem of separating dark humour and blue humour from insult humour. In short, the contributions of this chapter can be summarised as follows:

• We present a dataset of nearly 15,000 jokes belonging to different categories of off-colour humour, of which 4,000 are positive instances.

• A novel approach towards resolving the problem of separating dark and blue humour from offensive humour.

The rest of the chapter is divided into six more sections. Section 5.2 provides details of related work along with its criticisms. Section 5.3 presents the proposed framework, Section 5.4 presents the dataset used and Section 5.5 presents the experiments conducted. Section 5.6 gives the results and analysis of the experiments, and Section 5.7 outlines future work.

5.2 Related Work

Humour has always been an important topic for researchers. There has been a lot of work on humour in linguistics, literature, neuroscience, psychology and sociology. Research on humour has produced many different theories of humour and identified many different kinds of humour, including their functions and effects personally, in relationships and in society. For the scope of this chapter we restrict ourselves to off-colour humour as explained in the sections above.

There have been some studies on offensive humour, which is often used as a form of resistance in tragic situations. Weaver [68] discusses how racism can be undermined through the use of racial stereotypes by black and minority-ethnic comedians. Lockyer [40] analyses how disabled comedians have ridiculed stereotypes of the disabled by reversing the offensive comments of the non-disabled.

The study by Billig [8] examines the relationship between humour and hatred, a topic which it claims is often ignored by researchers of prejudice. It analyses websites that present racist humour and display sympathies with the Ku Klux Klan. The analysis emphasizes the importance of

examining the metadiscourse, which presents and justifies the humour, and also suggests that the extreme language of racist hatred is indicated to be a matter for enjoyment.

In the book Jokes and their Relation to the Unconscious [21], Freud refers to off-colour humour as the economy of pity and claims that it is one of the most frequent sources of humorous pleasure, and that such jokes provide a socially accepted means of breaking taboos, particularly in relation to sex and aggression. Dynel [19] analyses the importance of off-colour (dark) humour in post-terrorist-attack discourse, claiming that dark humour is a coping mechanism under oppressive regimes and in crisis situations. Davies [18] argues that those who engage in racist and sexist jokes do not necessarily believe the stereotypes that the jokes express. Maxwell [43] brings out the importance of dark humour as a cognitive and/or behavioural coping strategy, considered a reaction to a traumatic event, and proposes a model of progressive steps of humour ranging from respectful to sarcastic.

Similarly, there have been many studies in the field of insult detection. Mahmud et al. [41] create a set of rules to extract the semantic information of a given sentence from its general semantic structure in order to separate information from abusive language, but their system is limited in that it can annotate and distinguish abusive or insulting sentences only when they contain words or phrases present in the lexicon; it only looks at insulting words and not at sentences used in an insulting manner.

Xiang et al. [70] dealt with offensive tweets with the help of topical feature discovery over a large-scale Twitter corpus using a Latent Dirichlet Allocation model. The work by Razavi et al. [55] describes an automatic flame detection method which extracts features at different conceptual levels and applies multi-level classification for flame detection, but it is limited by the dataset used and does not explicitly consider the syntactic structure of the messages.

These works show that there has been quite a lot of research in both fields, but no computational study at the intersection of the two topics. This study is the first attempt to draw a separating boundary between the different types of off-colour humour and insults. We discuss the framework used to separate the different types of off-colour humour in the next section.

5.3 Proposed Framework

The domains of jokes we are dealing with, viz. dark humour, blue humour and insult humour, are all generally classified under the umbrella category of NSFW or off-colour humour. Owing to their apparent similarities, at first glance they can be dismissed as being of one and the same type. As we go to finer levels of granularity, it becomes evident that two separate buckets can be defined even

inside off-colour humour: one pertaining to insults, resulting in insult humour, and the other consisting of dark humour and blue humour, as shown in Figure 5.1. Insults being the common denominator does not mean that all insults and non-jokes can be classified as insult humour, so a demarcation separating the two is also required.

Figure 5.1 Differentiating between Insulting and Non-Insulting Humour

As mentioned above, we are able to define clear boundaries within off-colour humour between insulting and non-insulting humour. In order to further differentiate between dark and blue humour, we identify additional features that give a clear distinction between the two. One of the primary indicators is the ability to detect and extract sexual terms, leading us to blue humour, and dark themes such as violence (murder, abuse, domestic violence, rape, torture, war, genocide, terrorism, corruption), discrimination (chauvinism, racism, sexism, homophobia, transphobia), disease (anxiety, depression, suicide, nightmares, drug abuse, mutilation, disability, terminal illness, insanity), sexuality (sodomy, homosexuality, incest, infidelity, fornication), religion and barbarism1, leading us to dark humour. In order to define strong outlines we also ensure that even if an insulting joke or an insult contains sexual content or dark themes, the primary focus of such content is on the insult and not on its sexual content or dark theme (which more than anything provides a backdrop).

The focus of this chapter is to classify between these three categories of off-colour humour; integrating the task of separating humorous from non-humorous text has been deemed out of scope, and the dataset we create therefore consists only of humorous content.

1https://en.wikipedia.org/wiki/Black_comedy

Figure 5.2 Jokes and Insults differentiation

Dark Joke:
1. My girlfriend is a porn star. She will kill me if she finds out.
2. When someone says, "Rape jokes are not funny," I don't care. It's not like I asked for their consent anyway.
Blue Joke:
1. Sex is not the answer. Sex is the question. "Yes" is the answer.
2. How can you tell if a man is sexually excited? He's breathing.
Insult Joke:
1. Your mama so fat when she stepped on the weighing scale it said: "I need your weight, not your phone number."
2. You were beautiful in my dreams, but a fucking nightmare in reality.
Normal/Safe Joke:
1. What's red and bad for your teeth? A brick.
2. What did the German air force eat for breakfast during WW2? Luftwaffle.

Table 5.1 Few examples of jokes used in our dataset

5.4 Dataset

To test our hypothesis that automatic classification models can distinguish between dark humour, blue humour, insult humour and normal (safe) humour, we needed a dataset containing examples of all the types mentioned above. Since no such corpus is available, owing to the limited study in this field, we collected and labelled our own data. The only related data source available was the Twitter dataset of [70], which has been used to detect offensive tweets, but it is very limited in terms of themes and could not be used for our study.

The dataset we created consists of one-liners. We used one-liners because they are generally very short and must produce a humorous effect on their own, unlike longer jokes which usually have a relatively complex narrative structure; this makes them suitable for our models. Table 5.1 contains examples of jokes used in our dataset. The dataset is composed as follows:

i. Insult jokes: We collected one-liner jokes from the subreddits /r/insults and /r/roastme. Apart from those, we also mined various joke websites and collected jokes tagged insult. After removing duplicates and verifying manually, we were left with nearly 4,000 jokes in the insult category.

ii. Dark jokes: We collected jokes from the subreddits /r/darkjokes and /r/sickipedia, which contain highly moderated one-liner jokes. Apart from that, we mined various joke websites and collected jokes tagged dark, darkjokes and darkhumour. After removing duplicates and manual verification, we were left with a final set of approximately 3,500 jokes in the dark category.

iii. Blue jokes: Blue jokes are among the most popular joke types on the internet. Since these jokes are mainly associated with heavy nudity, sexual content and slang, we collected one-liner jokes from the subreddit /r/dirtyjokes, and in addition took jokes from various joke websites tagged NSFW, dirty, adult and sexual. After duplicate removal and manual verification, we were left with approximately 2,500 jokes in the blue category.

iv. Normal/safe jokes: We collected jokes from the subreddits /r/cleanjokes and /r/oneliners, which contain clean, non-offensive and non-disrespectful jokes. Jokes in this category do not belong to any of the above categories; they are referred to as SFW (safe for work) jokes in all future references. After collecting these jokes we additionally screened them for insulting words. After duplicate removal we were left with approximately 5,000 jokes in this category.

The dataset collected is important because it is multidimensional: it contains both insulting and non-insulting jokes, as well as jokes with the taboo topics mentioned in the section above and jokes without them. This leaves us with a combined total of nearly 4,000 jokes in the insult category, 3,500 in the dark (non-insult) category, 2,500 in the blue (non-insult) category and 5,000 in the safe (non-insult) category. Thus we have 4,000 positive examples and 11,000 negative examples. All the data has been mined from websites with strict moderation policies, leaving very little room for error in our dataset.

5.5 Experiments

Treating the problem of separating the different types of jokes as a classification problem, there is a wide variety of methods that can be used, and the choice can greatly affect the results. Some of the features we explicitly used were:

i. Dark jokes are usually limited to, or have as their main topic, violence (murder, abuse, domestic violence, rape, torture, war, genocide, terrorism, corruption), discrimination (chauvinism, racism, sexism, homophobia, transphobia), disease (anxiety, depression, suicide, nightmares, drug abuse, mutilation, disability, terminal illness, insanity), sexuality (sodomy, homosexuality, incest, infidelity, fornication), religion and barbarism. In order to detect these topics we used the common-sense engine ConceptNet [63], a multilingual knowledge base representing words and phrases that

people use and the common-sense relationships between them. ConceptNet was used to see whether a joke contains words related to the topics mentioned above.

ii. The sentiment score of every joke was calculated, based on the hypothesis that off-colour humour tends to have a more negative sentiment than jokes without vulgarity or the topics mentioned above.

iii. It is our hypothesis that most insult jokes contain mainly first-person (self-deprecating humour) and second-person (directed towards someone) feature words like I, you and your. This is done in order to detect insulting jokes containing phrases like your mother or your father, which are usually meant as insults. A rough sketch of these hand-crafted features is given below.
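The sketch below computes the three hand-crafted features just listed: overall sentiment, first/second-person pronoun counts, and a dark-theme flag. The short keyword list merely stands in for the ConceptNet lookup, and VADER for the unnamed sentiment scorer; both are assumptions made for illustration.

```python
# Hand-crafted features: sentiment, pronoun counts, dark-theme flag.
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
PRONOUNS = {"i", "me", "my", "you", "your", "yours"}
DARK_TERMS = {"murder", "war", "suicide", "racism", "terrorism", "disease"}   # illustrative only

def handcrafted_features(joke):
    tokens = [t.strip(".,!?\"'").lower() for t in joke.split()]
    return [
        sia.polarity_scores(joke)["compound"],      # sentiment of the whole joke
        sum(t in PRONOUNS for t in tokens),         # first/second-person words
        int(any(t in DARK_TERMS for t in tokens)),  # dark-theme indicator
    ]

print(handcrafted_features("You were beautiful in my dreams, but a nightmare in reality."))
```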

After pre-processing the data we extracted n-grams from the dataset, precisely speaking unigrams, bigrams and trigrams, and a feature dictionary was created from these collected n-grams. We compare the results of the five classification algorithms listed below, in different settings, along with the features mentioned above.

i. We used LDA, which provides a transitive relationship between words: each document has a set of words, and each word belongs to multiple topics (here, categories of jokes), which transitively means that each document is a mixture of topics.

ii. We trained a logistic regression classifier and a Naive Bayes classifier on the n-gram features.

iii. Along with this we used Brown clustering, which helped us put similar kinds of words in the same cluster, and trained an SVM with these features.

iv. We also experimented with CNNs, following Kim's work on CNNs for sentence classification [35]. A model as described in [35] was created with pre-trained word2vec vectors, which were trained on the Google News corpus and have a dimensionality of 300.
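A sketch of preparing the CNN's embedding matrix from the pre-trained Google News word2vec vectors with gensim is given below. The file path, the vocabulary mapping and the uniform initialisation of out-of-vocabulary words are assumptions.

```python
# Build an embedding matrix from pre-trained word2vec vectors (gensim).
import numpy as np
from gensim.models import KeyedVectors

def embedding_matrix(vocab, path="GoogleNews-vectors-negative300.bin", dim=300):
    """vocab maps each word to an integer index; index 0 is reserved for padding."""
    vectors = KeyedVectors.load_word2vec_format(path, binary=True)
    matrix = np.random.uniform(-0.25, 0.25, (len(vocab) + 1, dim))
    matrix[0] = 0.0
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix
```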

Various experiments were performed on our dataset. For evaluation, the dataset was randomly divided into 90% training and 10% testing. All experiments were performed 10-fold and the final result was taken as the average of these runs.

5.6 Analysis

We can see the results of our classifiers in Table 5.2 below. In the case of Logistic Regression, the introduction of the suggested features proved to be an important factor, as it increased the accuracy by nearly 10%. LDA had a slightly better result than Logistic Regression without any features, but is outperformed by Logistic Regression when features such as sentiment scores, first- and second-person features and dark words are added. Naive Bayes outperforms both LDA and Logistic Regression when these features are not added, but achieves equal results once they are introduced.

Features                                     | Accuracy
Logistic Regression (LR)                     | 59%
LR + n-grams                                 | 62%
LR + n-grams + features mentioned            | 69%
LDA                                          | 61%
Naive Bayes + n-grams + features mentioned   | 69%
SVM                                          | 68%
SVM + n-grams + features                     | 74%
CNN + word2vec                               | 81%

Table 5.2 Table showing the accuracies of various classifiers used

Figure 5.3 Graph showing the results of the classifiers used

SVM outperforms LDA, Naive Bayes and Logistic Regression by a large margin, proving to be a better classification algorithm here. The feature set used proved to be a very valuable addition to our experiments, with an increase in accuracy in every case where these features are introduced. Finally, a CNN with word2vec embeddings outperforms every other classifier used in this study. We thus achieve our best accuracy of 81% using a CNN with word2vec.

5.7 Future Work

Given the constraints of the scope of this work, we have not attempted to integrate the task of differentiating humorous from non-humorous text into our study; this could be incorporated into the pipeline to match other studies. Also, in this work we have restricted sexual topics in dark humour (not to be confused with the sexuality in blue humour), but in reality some dark jokes share features with blue jokes or talk about nudity and profanity. This can be taken up as future work, and the whole system could then be deployed on various social media platforms to give a more holistic classification of insults in social media and help preserve free speech.

Chapter 6

Conclusions

This thesis consists of three parts.

To understand different categories and sub-categories of humour, we presented a theoretical framework to study and categorize textual jokes on the basis of the theme they express and the emotion they evoke, and performed experiments on the collected dataset. The mined jokes range over a variety of topics and were collected from various online sources. The 51,000 extracted jokes were labelled with tags like sarcastic, nsfw, insults and gross. We then performed experiments to classify these jokes into various topics and sub-categories, including sarcasm, exaggeration, antonyms/semantic opposites, secondary-meaning jokes, gross, dark humour, nsfw and phonetic-feature jokes, using their specific features and classification methods like SVM, Logistic Regression and Naive Bayes. The experiments gave a binary indicator for each sub-category, thus giving a joke multiple tags and providing a more unique classification of humour than its counterparts.

We then analysed humour in utterances and conversations, which, unlike short jokes and one-liners, contain more contextual information. This was followed by the creation of a dataset of conversations from dialogues and monologues: we mined conversations from TED Talks and the Friends TV series, and our combined final dataset consists of around 110,000 utterances in total. We then performed experiments using deep learning methods on the dataset. Our experiments showed that a sequence model like the LSTM is the best method in terms of accuracy for predicting humour in conversations, as it is able to model a sequence. We achieved an accuracy of 81%, beating the previous study on the same dataset by 21 accuracy points.

We also presented the first theoretical and computational analysis of off-colour humour aimed at drawing a line between what constitutes an attempt at humour and an insulting remark. This was followed by the creation of a dataset of jokes from multiple online sites belonging to the categories of dark jokes, blue jokes, insulting jokes and safe jokes; the dataset consists of nearly 15,000 jokes. Our experiments showed that CNNs outperform other classification methods such as LDA, SVM and Naive Bayes by a large margin, and we achieved an accuracy of nearly 81%.

Related Publications

1. What makes us laugh? Investigations into Automatic Humor Classification, Vikram Ahuja, Taradheesh Bali and Navjyoti Singh, Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media (PEOPLES), North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018

2. From Humour to Hatred: A Computational Analysis of Off-Colour Humour, Vikram Ahuja, Radhika Mamidi, and Navjyoti Singh, International Conference on Natural Language Processing and Chinese Computing (NLPCC), 2018

3. Humour Detection in Conversations, Vikram Ahuja and Radhika Mamidi, Student Research Workshop (SRW) of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019 (Under Review)

Bibliography

[1] V. Ahuja, T. Bali, and N. Singh. What makes us laugh? investigations into automatic humor classification. In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 1–9, 2018.

[2] M. Aillaud and A. Piolat. Influence of gender on judgment of dark and nondark humor. Individual Differences Research, 10(4):211–222, 2012.

[3] M. L. Apte. Humor and laughter: An anthropological approach. Cornell Univ Pr, 1985.

[4] S. Attardo. Linguistic theories of humor, volume 1. Walter de Gruyter, 2010.

[5] S. Attardo and V. Raskin. Script theory revis (it) ed: Joke similarity and joke representation model. Humor-International Journal of Humor Research, 4(3-4):293–348, 1991.

[6] D. E. Berlyne. Humor and its kin. The psychology of humor: Theoretical perspectives and empirical issues, pages 43–60, 1972.

[7] D. Bertero and P. Fung. A long short-term memory framework for predicting humor in dialogues. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 130–135, 2016.

[8] M. Billig. Humour and hatred: The racist jokes of the ku klux klan. Discourse & Society, 12(3): 267–289, 2001.

[9] K. Binsted and G. Ritchie. An implemented model of punning riddles. Technical report, University of Edinburgh, Department of Artificial Intelligence, 1994.

[10] K. Binsted and G. Ritchie. Computational rules for generating punning riddles. HUMOR- International Journal of Humor Research, 10(1):25–76, 1997.

[11] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. ISSN 2307-387X.

[12] C. Bucaria. Dubbing dark humour: A case study in audiovisual translation. Lodz Papers in Pragmatics, 4(2):215–240, 2008.

[13] R. G. Bury et al. The Philebus of Plato. University Press, 1897.

[14] L. Chen and C. M. Lee. Predicting audience’s laughter using convolutional neural network. arXiv preprint arXiv:1702.02584, 2017.

[15] P.-Y. Chen and V.-W. Soo. Humor recognition using deep learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 113–117, 2018.

[16] M. T. Cicero and K. W. Piderit. De oratore. BG Teubner, 1886.

[17] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[18] C. Davies. Ethnic humor around the world: A comparative analysis. Indiana University Press, 1990.

[19] M. Dynel and F. I. Poppi. In tragoedia risus: Analysis of dark humour in post-terrorist attack discourse. Discourse & Communication, 12(4):382–400, 2018.

[20] H. C. Foot. 10 humour and laughter. The handbook of communication skills, page 259, 1997.

[21] S. Freud and J. Strachey. Jokes and their relation to the unconscious. the standard edition of the complete psychological works of sigmund freud. Ed. James Strachey, 8, 1905.

[22] M. Gervais and D. S. Wilson. The evolution and functions of laughter and humor: A synthetic approach. The Quarterly review of biology, 80(4):395–430, 2005.

[23] R. González-Ibáñez, S. Muresan, and N. Wacholder. Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 581–586. Association for Computational Linguistics, 2011.

[24] A. Graves, N. Jaitly, and A.-r. Mohamed. Hybrid speech recognition with deep bidirectional lstm. In 2013 IEEE workshop on automatic speech recognition and understanding, pages 273–278. IEEE, 2013.

[25] H. P. Grice, P. Cole, J. L. Morgan, et al. Logic and conversation. 1975, pages 41–58, 1975.

[26] S. Halliwell et al. Aristotle’s poetics. University of Chicago Press, 1998.

[27] D. Heyd. The place of laughter in hobbes’s theory of emotions. Journal of the History of Ideas, pages 285–295, 1982.

[28] T. Hobbes. Leviathan (1651). Glasgow 1974, 1980.

[29] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[30] F. Hutcheson. Reflections upon Laughter, and Remarks on the Fable of the Bees by B. de Mandeville. Robert & Andrew Foulis, 1758.

[31] R. Johnson and T. Zhang. Semi-supervised convolutional neural networks for text categorization via region embedding. In Advances in neural information processing systems, pages 919–927, 2015.

[32] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.

[33] I. Kant. Kant’s Critique of judgement. Createspace Independent Publishing Platform, 1892.

[34] C. Kiddon and Y. Brun. That’s what she said: double entendre identification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 89–94. Association for Computational Linguistics, 2011.

[35] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[36] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[37] A. Krikmann. Contemporary linguistic theories of humour. Folklore, 33(2006):27–58, 2006.

[38] G. Kuipers. Where was king kong when we needed him? Public discourse, digital disaster jokes, and the functions of laughter after, 9(11):20–46, 2011.

[39] P. Lewis. Three jews and a blindfold: The politics of gallows humor. Semites and Stereotypes: Characteristics of Jewish Humor, pages 47–58, 1993.

[40] S. Lockyer. From comedy targets to comedy-makers: disability and comedy in live performance. Disability & Society, 30(9):1397–1412, 2015.

[41] A. Mahmud, K. Z. Ahmed, and M. Khan. Detecting flames and insults in text. 2008.

[42] R. A. Martin. Approaches to the sense of humor: A historical review. The sense of humor: Explorations of a personality characteristic, 15, 1998.

[43] W. Maxwell. The use of gallows humor and dark humor during crisis situation. International journal of emergency mental health, 2003.

[44] J. Mendrinos. The Complete Idiot’s Guide to Comedy Writing. Penguin, 2004.

[45] R. Mihalcea and C. Strapparava. Making computers laugh: Investigations in automatic humor recognition. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 531–538. Association for Computational Linguistics, 2005.

[46] R. Mihalcea, C. Strapparava, and S. Pulman. Computational models for incongruity detection in humour. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 364–374. Springer, 2010.

[47] J. Morreall. The philosophy of laughter and humor. 1986.

[48] L. Olbrechts-Tyteca. Le comique du discours, volume 16. Université de Bruxelles, 1974.

[49] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[50] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[51] R. Piddington. The psychology of laughter. a study in social adaptation. 1933.

[52] M. F. Quintilianus and K. Halm. Institutio oratoria, volume 2. Teubner, 1869.

[53] V. Raskin. Linguistic heuristics of humor: a script-based semantic approach. International journal of the sociology of language, 1987(65):11–26, 1987.

[54] V. Raskin. Semantic mechanisms of humor, volume 24. Springer Science & Business Media, 2012.

[55] A. H. Razavi, D. Inkpen, S. Uritsky, and S. Matwin. Offensive language detection using multi-level classification. In Canadian Conference on Artificial Intelligence, pages 16–27. Springer, 2010.

[56] G. Ritchie. The jape riddle generator: technical specification. Institute for Communicating and Collaborative Systems, 2003.

[57] W. Ruch. The perception of humor. In Emotions, qualia, and consciousness, pages 410–425. World Scientific, 2001.

[58] W. Ruch and A. Carrell. Trait cheerfulness and the sense of humour. Personality and Individual Differences, 24(4):551–558, 1998.

[59] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523, 1988.

[60] A. C. Samson and J. J. Gross. Humour as emotion regulation: The differential consequences of negative versus positive humour. Cognition & emotion, 26(2):375–384, 2012.

[61] N. Schaeffer. The art of laughter. 1981.

[62] A. Schopenhauer. The world as will and idea, volume 1. Library of Alexandria, 1891.

[63] R. Speer, J. Chin, and C. Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, pages 4444–4451, 2017.

[64] H. Spencer. The physiology of laughter. Macmillan’s magazine, 1859-1907, (5):395–402, 1860.

[65] A. Stevenson. Oxford dictionary of English. Oxford University Press, USA, 2010.

[66] J. M. Taylor and L. J. Mazlack. Computationally recognizing wordplay in jokes. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 26, 2004.

[67] C. Venour. The computational generation of a class of pun. 2000.

[68] S. Weaver. Developing a rhetorical analysis of racist humour: Examining anti-black jokes on the internet. Social Semiotics, 20(5):537–555, 2010.

[69] L. Wu, F. Morstatter, and H. Liu. Slangsd: Building and using a sentiment dictionary of slang words for short-text sentiment classification. arXiv preprint arXiv:1608.05129, 2016.

[70] G. Xiang, B. Fan, L. Wang, J. Hong, and C. Rose. Detecting offensive tweets via topical feature discovery over a large scale twitter corpus. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 1980–1984. ACM, 2012.

[71] D. Yang, A. Lavie, C. Dyer, and E. Hovy. Humor recognition and humor anchor extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2367–2376, 2015.

[72] M. Zhang, Y. Zhang, and G. Fu. Tweet sarcasm detection using deep neural network. In Proceedings of COLING 2016, The 26th International Conference on Computational Linguistics: Technical Papers, pages 2449–2460, 2016.
