
Linguistic Cues to Deception and Perceived Deception in Interview Dialogues

Sarah Ita Levitan, Angel Maredia, and Julia Hirschberg
Department of Computer Science, Columbia University, New York, NY, USA
{sarahita@cs, asm2221, julia@cs}.columbia.edu

Abstract

We explore deception detection in interview dialogues. We analyze a set of linguistic features in both truthful and deceptive responses to interview questions. We also study the perception of deception, identifying characteristics of statements that are perceived as truthful or deceptive by interviewers. Our analysis shows significant differences between truthful and deceptive question responses, as well as variations in deception patterns across gender and native language. This analysis motivated our selection of features for machine learning experiments aimed at classifying globally deceptive speech. Our best classification performance is 72.74 F1-score (about 27% better than human performance), which is achieved using a combination of linguistic features and individual traits.

1 Introduction

Deception detection is a critical problem studied by psychologists, criminologists, and computer scientists. In recent years the NLP and speech communities have increased their interest in deception detection. Language cues are inexpensive and easy to collect, and examining text-based and speech-based cues to deception has been quite promising. Prior work has examined deceptive language in several domains, including fake reviews, mock crime scenes, and opinions about topics such as abortion or the death penalty. In this work we explore the domain of interview dialogues, which are similar to many real-world deception conditions.

Previous work has presented the results of classification experiments using linguistic features, attempting to identify which features contribute most to classification accuracy. However, studies often do not include an empirical analysis of features. We might know that a particular feature set (e.g. LIWC categories) is useful for deception classification, but lack insight about the nature of the deceptive and truthful language that makes the feature set useful, and whether the differences in language use are statistically significant. In this work we conduct an empirical analysis of feature sets and report on the different characteristics of truthful and deceptive language. In addition, previous work has focused on the characteristics of deceptive language, and not on the characteristics of perceived deceptive language. We are also interested in human perception of deception; that is, what are the characteristics of language that listeners perceive as truthful or deceptive? We examine a unique dataset that includes information about both the deceiver and the interviewer, along with interviewer judgments of deception. Along with an analysis of deceptive and truthful speech, we analyze believed and disbelieved speech, according to reported interviewer judgments. Finally, previous work has focused on general inferences about deception; here we include analysis of gender and native language, to study their effect on deceptive behavior, and also their effect on perception of deception. This work contributes to the critical problem of automatic deception detection, and increases our scientific understanding of deception, deception perception, and speaker differences in deceptive behavior.

The paper is organized as follows: In Section 2 we review related work in language-based cues to deception. Section 3 describes the dataset used for this work, and Section 4 details the different feature sets we employ. In Section 5, we report on the results of our empirical study of indicators of deception and perceived deception, as well as gender and native language differences. Section 6 presents our machine learning classification results using the deception indicator feature sets. We conclude in Section 7 with a discussion and ideas for future work.

2 Related Work

Language-based cues to deception have been analyzed in many genres. Ott et al. (2011) compared approaches to automatically detecting deceptive opinion spam, using a crowdsourced dataset of fake hotel reviews. Several studies use a fake opinion paradigm for collecting data, instructing subjects to write or record deceptive and truthful opinions about controversial topics such as the death penalty or abortion, or about a person that they like/dislike (Newman et al., 2003; Mihalcea and Strapparava, 2009). Other research has focused on real-world data obtained from court testimonies and depositions (Fornaciari and Poesio, 2013; Bachenko et al., 2008; Pérez-Rosas et al., 2015). Real-world deceptive situations are high-stakes, where there is much to be gained or lost if deception succeeds or fails; it is hypothesized that these conditions are more likely to elicit strong cues to deception. However, working with such data requires extensive research to annotate each utterance for veracity, so such datasets are often quite small and not always reliable.

Linguistic features such as n-grams and language complexity have been analyzed as cues to deception (Pérez-Rosas and Mihalcea, 2015; Yancheva and Rudzicz, 2013). Syntactic features such as part of speech tags have also been found to be useful for structured data (Ott et al., 2011; Feng et al., 2012). Statement Analysis (Adams, 1996) is a text-based deception detection approach that combines lexical and syntactic features. An especially useful resource for text-based deception detection is the Linguistic Inquiry and Word Count (LIWC) (Pennebaker and King, 1999), which groups words into psychologically motivated categories. In addition to lexical features, some studies have examined acoustic-prosodic cues to deception (Rockwell et al., 1997; Enos, 2009; Mendels et al., 2017). Benus et al. (2006) studied pause behavior in deceptive speech. This work is very promising, but it is more difficult to obtain large, cleanly recorded speech corpora with deception annotations than to obtain text corpora. An excellent meta-study of verbal cues to deception can be found in DePaulo et al. (2003).

3 Data

3.1 Corpus

For this work, we examined the Columbia X-Cultural Deception (CXD) Corpus (Levitan et al., 2015a), a collection of within-subject deceptive and non-deceptive speech from native speakers of Standard American English (SAE) and Mandarin Chinese (MC), all speaking in English. The corpus contains dialogues between 340 subjects. A variation of a fake resume paradigm was used to collect the data. Previously unacquainted pairs of subjects played a "lying game" with each other. Each subject filled out a 24-item biographical questionnaire and was instructed to create false answers for a random half of the questions. They also reported demographic information including gender and native language, and completed the NEO-FFI personality inventory (Costa and McCrae, 1989).

The lying game was recorded in a sound booth. For the first half of the game, one subject assumed the role of the interviewer, while the other answered the biographical questions, lying for half and telling the truth for the other; questions chosen in each category were balanced across the corpus. For the second half of the game, the subjects' roles were reversed, and the interviewer became the interviewee. During the game, the interviewer was allowed to ask the 24 questions in any order s/he chose; the interviewer was also encouraged to ask follow-up questions to aid them in determining the truth of the interviewee's answers. Interviewers recorded their judgments for each of the 24 questions, providing information about human perception of deception. The entire corpus was orthographically transcribed using the Amazon Mechanical Turk (AMT) crowd-sourcing platform (https://www.mturk.com/mturk/), and the speech was segmented into inter-pausal units (IPUs), defined as pause-free segments of speech separated by a minimum pause length of 50 ms. The speech was also segmented into turn units, where a turn is defined as a maximal sequence of IPUs from a single speaker without any interlocutor speech that is not a backchannel. There are two forms of deception annotations in the corpus: local and global. Interviewees labeled their responses with local annotations by pressing a "T" or "F" key for each utterance as they spoke. These keypresses were automatically aligned with speaker IPUs and turns. Global labels were provided by the biographical questionnaire, where each of the 24 questions was labeled as truthful or deceptive.
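To make the IPU definition above concrete, the sketch below segments a word-aligned transcript into IPUs using the 50 ms pause threshold. It is a minimal illustration, assuming word-level timestamps are available; the function and data layout are ours, not the corpus tools'.

```python
# A minimal sketch of inter-pausal unit (IPU) segmentation, assuming we
# already have word-level time alignments (an assumption for illustration).

MIN_PAUSE = 0.050  # 50 ms minimum pause separating IPUs

def segment_ipus(words):
    """words: list of (token, start_sec, end_sec), sorted by start time.
    Returns a list of IPUs, each a list of tokens."""
    ipus = []
    current = []
    prev_end = None
    for token, start, end in words:
        # Start a new IPU when the silence since the last word is >= 50 ms.
        if prev_end is not None and start - prev_end >= MIN_PAUSE:
            ipus.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        ipus.append(current)
    return ipus

# Example: the 60 ms gap between "doctor" and "she" splits two IPUs.
print(segment_ipus([("my", 0.00, 0.10), ("mother", 0.12, 0.40),
                    ("is", 0.42, 0.50), ("a", 0.52, 0.55),
                    ("doctor", 0.57, 1.00), ("she", 1.06, 1.20)]))
```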

Consider the following dialogue:

Interviewer: What is your mother's job?
Interviewee: My mother is a doctor (F). She has always worked very late hours and I felt neglected as a child (T).

Is the interviewee response true or false? We differentiate between global and local deception. Globally, the response to the question is deceptive. However, it contains local instances of both truth and deception. In this work we focus on dialogue-based deception, using global deception labels.

3.2 Global Segmentation

Previous work with the CXD corpus has focused on IPU-level and turn-level analysis and classification of local deception, mostly with acoustic-prosodic features (Levitan et al., 2015b; Mendels et al., 2017). Here we are interested in exploring global deception at the dialogue level for the first time in this corpus. We define response-segments as sets of turns that are related to a single question (of the 24 interview questions). In order to annotate these segments, we first used a question detection and identification system (Maredia et al., 2017) that uses word embeddings to match semantically similar variations of questions to a target question list. This was necessary because interviewers asked the 24 questions using different wording from the original list of questions. On this corpus, Maredia et al. (2017) obtained an F1-score of .95.
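The following is a rough sketch of this kind of embedding-based question matching: each interviewer turn is compared to every target question by cosine similarity of averaged word vectors, and matched to the best-scoring question above a threshold. The embedding table `emb`, the vector dimension, and the threshold are illustrative assumptions, not details of the Maredia et al. (2017) system.

```python
import numpy as np

def avg_embedding(text, emb, dim=300):
    # Average the word vectors of all in-vocabulary tokens.
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def match_question(turn, target_questions, emb, threshold=0.75):
    """Return the target question most similar to `turn`, or None.
    `emb` is an assumed word -> vector lookup; `threshold` is a guess."""
    v = avg_embedding(turn, emb)
    best, best_sim = None, threshold
    for q in target_questions:
        u = avg_embedding(q, emb)
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        sim = float(u @ v / denom) if denom else 0.0
        if sim > best_sim:
            best, best_sim = q, sim
    return best
```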
The intuition behind numeric features capturing hedge words, filled this was that those turns were responses to follow pauses, laughter, complexity, contractions, and up questions related to q1, and while the ques- . We include Dictionary of Affect Lan- tion detection and identification system discussed guage (DAL) (Whissell et al., 1986) scores that above did not identify follow up questions, we measure the emotional meaning of texts, and a found that most of the follow up questions after specificity score which measures level of detail an interviewer question q1 would be related to q1 (Li and Nenkova, 2015). The full list of features in our hand annotation. We evaluated this global is: ’hasAbsolutelyReally’, ’hasContraction’, segmentation on a hand-annotated test set of 17 ’hasI’, ’hasWe’, ’hasYes’, ’hasNAposT’ (turns interviews (about 10% of the corpus) consisting 2A full description of the features is found here: https: of 2,671 interviewee turns, 408 interviewer ques- //s3-us-west-2.amazonaws.com/downloads. tions, and 977 follow up questions. Our global liwc.net/LIWC2015_OperatorManual.pdf 3A detailed explanation of these linguistic features and segmentation approach resulted in 77.8% accuracy how they were computed is found here: http://www.cs. on our hand-labeled test set (errors were mostly columbia.edu/speech/cxd/features.html

Response Length: Previous work has found that response length, in seconds, is shorter in deceptive speech, and that the difference in number of words in a segment of speech is insignificant between deceptive and truthful speech (DePaulo et al., 2003). For our question-level analysis, we used four different measures for response length: the total number of seconds of an interviewee response-segment, the total number of words in an interviewee response-segment, the average response time of a turn in an interviewee response-segment, and the average number of words per turn in an interviewee response-segment.

Individual Traits: We analyzed gender and native language of the speakers to determine if these traits were related to ability to deceive and to detect deception. We also analyzed linguistic cues to deception across gender and native language, and used gender and native language information in our classification experiments. All speakers were either male or female, and their native language was either Standard American English or Mandarin Chinese. In addition, we used the NEO-FFI (5 factor) personality inventory scores as features in classification experiments, but not for the statistical analysis in this paper.

Follow-up Questions: Follow-up questions are questions that an interviewer asks after they ask a question from the original prescribed set of questions. We hypothesized that if an interviewer asked more follow-up questions, they were more likely to identify deceptive responses, because asking follow-up questions indicated interviewer doubt of the interviewee's truthfulness. For each interviewee response-segment, we counted the number of follow-up questions interviewees were asked by the interviewer.
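A small sketch of the four response-length measures follows, assuming each interviewee turn in a response-segment is available as a (transcript, duration in seconds) pair; this data layout is an assumption for illustration, not the corpus format.

```python
def response_length_features(turns):
    """turns: list of (transcript, duration_sec) for one response-segment."""
    n_words = [len(t.split()) for t, _ in turns]
    durations = [d for _, d in turns]
    return {
        "response_length": sum(durations),                  # total seconds
        "num_words": sum(n_words),                          # total words
        "avg_response_length": sum(durations) / len(turns), # seconds per turn
        "avg_num_words": sum(n_words) / len(turns),         # words per turn
    }
```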
5 Analysis

In order to analyze the differences between deceptive and truthful speech, we extracted the above features from each question response-segment, and calculated a series of paired t-tests between the features of truthful speech and deceptive speech. All tests for significance correct for family-wise Type I error by controlling the false discovery rate (FDR) at α = 0.05. The kth smallest p value is considered significant if it is less than kα/n.
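The sketch below illustrates this procedure: a paired t-test per feature, followed by Benjamini-Hochberg selection of significant features at α = 0.05 (the kth smallest p value is significant if it is below kα/n). The matrix layout (paired truthful vs. deceptive feature means, speakers × features) is an assumption for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

def fdr_significant(truth_feats, decep_feats, names, alpha=0.05):
    """truth_feats, decep_feats: paired (speakers x features) arrays.
    Returns the set of feature names significant under BH at level alpha."""
    pvals = np.array([ttest_rel(truth_feats[:, j], decep_feats[:, j]).pvalue
                      for j in range(len(names))])
    order = np.argsort(pvals)  # ranks features by ascending p value
    n = len(pvals)
    # Largest k with p_(k) <= k*alpha/n; all ranks up to k are significant.
    below = [k for k in range(1, n + 1) if pvals[order[k - 1]] <= k * alpha / n]
    k_max = max(below) if below else 0
    return {names[order[i]] for i in range(k_max)}
```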

Lexical
  Deception: DAL.activate, DAL.imagery, noYesOrNo, numCuePhrase, numFilledPauses, numHedgePhrases, specScores, thirdPersonPronouns
  Truth: DAL.pleasant, isJustYes, specificDenial
  Neutral: isJustNo, complexity, DAL.wc, numLaugh

LIWC
  Deception: achieve, adj, adverb, affiliation, analytic, article, authentic, cause, clout, compare, conj, dash, discrep, drives, family, feel, focusfuture, focuspast, friend, health, interrog, ipron, male, motion, percept, ppron, prep, pronoun, power, relativ, reward, shehe, social, space, swear, verb, WC, we, WPS, you
  Truth: auxverb, body, certain, cogproc, dic, function, negate, netspeak
  Neutral: affect, apostro, assent, colon, comma, death, differ, female, filler, i, ingest, insight, leisure, posemo, quant, quote, relig, sad, see, sixltr, they, tone, work

Response length
  Deception: num words, response length, avg response length, avg num words

Followup
  Neutral: num turns

Table 1: Statistically significant indicators of truth and deception in interviewee response-segments consisting of multiple turns related to a single question.

Lexical
  Deception: DAL.imagery, numCuePhrases, numFilledPauses, numHedgePhrases, specificDenial, specScores, thirdPersonPronoun
  Truth: DAL.pleasant, isJustNo, numLaugh
  Neutral: DAL.activate, complexity, DAL.wc, isJustYes, noYesOrNo

LIWC
  Deception: adverb, article, authentic, body, conj, focuspast, interrog, ipron, prep, pronoun, WC, WPS
  Truth: negate
  Neutral: apostro, bio, cause, certain, clout, cogproc, compare, discrep, focusfuture, function, insight, money, motion, negemo, nonflu, number, posemo, ppron, relativ

Response length
  Deception: num words
  Neutral: response length, avg num words, avg response length

Followup
  Deception: num turns

Table 2: Statistically significant indicators of perceived truth and deception in interviewer judgments of interviewee responses.

5.1 Interviewee Responses

Table 1 shows the features that were statistically significant indicators of truth and deception in interviewee response-segments consisting of multiple turns. Below, we highlight some interesting findings.

In contrast to DePaulo et al. (2003), we found that the total duration of an interviewee response-segment was longer for deceptive speech than for truthful speech. Additionally, while DePaulo et al. (2003) showed that the number of words in a segment of speech was not significantly different between deceptive and truthful speech, we found that deceptive response-segments had more words than truthful response-segments. Furthermore, we found that longer average response time per turn and more words per sentence were significant indicators of deception. These results show that when interviewees are trying to deceive, not only is their aggregate response longer in duration and number of words, but their individual responses to each follow-up question are also longer.

Consistent with DePaulo et al. (2003), we found that more filled pauses in an interviewee response-segment was a significant indicator of deception. Deceivers are hypothesized to experience an increase in cognitive load (Vrij et al., 1996), and this can result in difficulties in speech planning, which can be signaled by filled pauses. Although Benus et al. (2006) found that, in general, the use of pauses correlates more with truthful than with deceptive speech, we found that filled pauses such as "um" were correlated with deceptive speech. The LIWC cogproc (cognitive processes) dimension, which includes words such as "cause", "know", and "ought", was significantly more frequent in truthful speech, also supporting the theory that cognitive load is increased while practicing deception.

We found that increased DAL-imagery scores, which compute words often used in speech to create vivid descriptions, were indicators of deception. We also found that the LIWC language summary variables of authenticity and adjectives were indicators of deception: in an effort to sound more truthful and authentic, interviewees may have provided a level of detail that is uncharacteristic of truthful speech. Similarly, the specificity metric was indicative of deception: deceptive responses contained more detailed language. Words in the LIWC clout category, a category describing words that indicate power of influence, were more prevalent in deceptive responses, suggesting that subjects sounded more confident while lying. Interrogatives were an indicator of deception. In the context of the interviewer-interviewee paradigm, these are interviewee questions to the interviewer. Perhaps this was a technique used to stall so that they had more time to develop an answer (e.g. "Can you repeat the question?"), or to deflect the interviewer's attention from their deception and put the interviewer on the spot. We observed that hedge words and phrases, which speakers use to distance themselves from a proposition, were more frequent in deceptive speech. This is consistent with Statement Analysis (Adams, 1996), which posits that hedge words are used in deceptive statements to intentionally create vagueness that obscures facts. Consistent with this finding, certainty in language (words such as "always" or "never") was a strong indicator of truthfulness.

It is also interesting to note the features that were not significant indicators of truth or deception. For example, there was no significant difference in laughter frequency or apostrophes (used for contractions in this corpus) between truthful and deceptive responses.

When we compared indicators of truth vs. deception across multiple turns to indicators of truth vs. deception in just the first turns of interviewee response-segments, we found that, generally, indicators in first turns are a subset of indicators across multiple turns. In some cases there were interesting differences. For example, although tone (emotional tone: higher numbers indicate more positive, and lower indicate negative) was not a significant indicator of deception for the entire interviewee response-segment, negative tone was a moderate indicator of deception in first turns. This suggests that the tone of interviewees, when they have just started their lie, is different from when they are given the opportunity to expand on that lie. The findings from our analysis of first turns suggest that there might be enough information in the first response alone to distinguish between deceptive and truthful speech; we test this in our classification experiments in Section 6.

5.2 Interviewer Judgments of Deception

In addition to analyzing the linguistic differences between truthful and deceptive speech, we were interested in studying the characteristics of speech that is believed or disbelieved. Since the CXD corpus includes interviewer judgments of deception for each question asked, we have the unique opportunity to study human perception of deception on a large scale. Table 2 shows the features that were statistically significant indicators of truth and deception in interviewee responses, consisting of multiple turns, that were perceived as true or false by interviewers. Here we highlight some interesting findings.

There were many features that were prevalent in speech that interviewers perceived as deceptive, which were in fact cues to deception. For example, speech containing more words in a response-segment and more words per sentence was generally perceived as deceptive by interviewers, and indeed, this perception was correct. Disbelieved answers had a greater frequency of filled pauses and hedge words, and greater specificity, all of which were increased in deceptive speech.

There were also several features that were indicators of deception, but were not found in higher rates in statements that were perceived as false. For example, the LIWC dimensions clout and certain were not significantly different in believed vs. disbelieved interviewee responses, but clout was increased in deceptive speech and certain language was increased in truthful speech. There were also features that were significantly different between believed and disbelieved statements, but were not indicators of deception. For example, statements that were perceived as false by interviewers had a greater proportion of specificDenials (e.g. "I did not") than those that were perceived as true; this was not a valid cue to deception. Number of turns was increased in dialogue segments where the interviewer did not ultimately believe the interviewee response. That is, more follow-up questions were asked when an interviewer did not believe their interlocutor's response, which is an intuitive behavior. When we compared indicators of speech that was perceived as deceptive across multiple turns to indicators of speech that was perceived as deceptive in just the first turns, we found that, generally, indicators in first turns are a subset of indicators across multiple turns.

On average, human accuracy at judging truth and deception in the CXD corpus was 56.75%, and accuracy at judging deceptive statements only was 47.93%. The average F1-score for humans was 46. Thus, although some cues were correctly perceived by interviewers, humans were generally poor at deception perception. Nonetheless, characterizing the nature of speech that is believed or not believed is useful for applications where we would ultimately like to synthesize speech that is trustworthy.

5.3 Gender and Native Language Differences in Deception Behavior

Having discovered many differences between deceptive and truthful language across all speakers, we were interested in analyzing differences in deceptive language across groups of speakers. Using gender and native language (English or Mandarin Chinese) as group traits, we conducted two types of analysis. First, we directly compared deception performance measures (ability to deceive as interviewee, and ability to detect deception as interviewer) between speakers with different traits, to assess the effect of individual characteristics on deception abilities. In addition, we compared the features of deceptive and truthful language in subsets of the corpus, considering only people with a particular trait, in order to determine group-specific patterns of deceptive language. As before, tests for significance correct for family-wise Type I error by controlling the false discovery rate (FDR) at α = 0.05. The kth smallest p value is considered significant if it is less than kα/n.

Male
  Deception: analytic, friend, interrog
  Truth: posemo

Female
  Deception: achieve, adverb, article, compare, discrep, family, feel, focusfuture, percept, power, relativ, we
  Truth: authentic, cause

English
  Deception: achieve, adverb, affiliation, compare, interrog, power, relativ, space, swear

Chinese
  Deception: analytic, bio, cause, discrep, feel, health, percept, (filler)
  Truth: certain, (informal), (netspeak)

Table 3: Gender-specific and language-specific indicators of deception and truth. We consider a result to approach significance if its uncorrected p value is less than 0.05 and indicate this with () in the table.

5.3.1 Gender

There were no significant differences in deception ability between male and female speakers. However, there were many differences in language between male and female speakers. Further, some features were only discriminative between deception and truth for a specific gender. Table 3 shows linguistic features that were significantly different between truthful and deceptive speech, but only for one gender. In some cases the feature was found in different proportions in males and females, and in other cases there was no significant difference. For example, family words were indicative of deception only in female speakers, and these words were also used more frequently by female speakers than male speakers. The LIWC category of compare was also indicative of deception for females only, and this feature was generally found more frequently in female speech. Article usage was only significantly different between truthful and deceptive speech in females (more articles were found in deceptive speech), but articles were used more frequently in male speech. On the other hand, the LIWC category of posemo (positive emotion) was increased in truthful speech for male speakers only, and there was no significant difference of posemo frequency across gender.

5.3.2 Native Language

Interviewees were more successful at deceiving native Chinese speakers than at deceiving native English speakers (t(170) = −2.13, p = 0.033). This was true regardless of interviewee gender and native language, and slightly stronger for female interviewers (t(170) = −2.22, p = 0.027). When considering only female interviewers, interviewees were more successful at deceiving non-native speakers than native speakers, but this difference was not significant when considering only male interviewers. As with gender, there were several features that were discriminative between deception and truth for only native speakers of English, or only native speakers of Mandarin. Table 3 shows LIWC categories and their relation to deception, broken down by native language. For example, power words were found more frequently in deceptive statements, when considering native English speakers only. In general, power words were used more by native Mandarin speakers than by native English speakers. LIWC categories of compare, relativ, and swear were more prevalent in deceptive speech, only for English speakers. On the other hand, feel and percept dimensions were only indicators of deception for native Mandarin speakers, although there was no significant difference in the use of these word categories across native language. Informal and netspeak word dimensions tended to be more frequent in truthful speech for native Chinese speakers only (approaching significance), and these word categories were generally more frequent in native Mandarin speech. Finally, filler words tended to be more frequent in deceptive speech (approaching significance) only for native Mandarin speakers, and these were used more frequently by native Mandarin speakers than native English speakers.

Overall, our findings suggest that deceptive behavior in general, and deceptive language in particular, are affected by a person's individual characteristics, including gender and native language. When building a deception classification system, it is important to account for this variation across speaker groups.

Features                  | Segmentation   | Accuracy | Precision | Recall | F1-score
Human baseline            | Multiple turns | 56.75    | 56.50     | 40.00  | 46.50
LIWC                      | Single turn    | 65.75    | 65.79     | 65.74  | 65.72
LIWC                      | Multiple turns | 72.78    | 72.84     | 72.74  | 72.74
Lexical                   | Single turn    | 66.95    | 66.97     | 66.95  | 66.94
Lexical                   | Multiple turns | 70.33    | 70.46     | 70.28  | 70.25
LIWC+lexical              | Single turn    | 68.35    | 68.36     | 68.35  | 68.35
LIWC+lexical              | Multiple turns | 71.66    | 71.77     | 71.60  | 71.58
LIWC+individual           | Single turn    | 67.50    | 67.50     | 67.50  | 67.49
LIWC+individual           | Multiple turns | 71.85    | 71.93     | 71.80  | 71.79
Lexical+individual        | Single turn    | 69.32    | 69.33     | 69.32  | 69.31
Lexical+individual        | Multiple turns | 69.95    | 70.06     | 69.89  | 69.86
LIWC+lexical+individual   | Single turn    | 70.87    | 70.87     | 70.87  | 70.87
LIWC+lexical+individual   | Multiple turns | 72.40    | 72.50     | 72.34  | 72.33

Table 4: Random Forest classification of single turn and multiple turn segmentations, using text-based features and individual traits (gender, native language, NEO-FFI personality scores).

6 Deception Classification

Motivated by our analysis showing many significant differences in the language of truthful and deceptive responses to interview questions, we trained machine learning classifiers to automatically distinguish between truthful and deceptive text, using the feature sets described in Section 4. We compared classification performance for the two segmentation methods described in Section 3.2: first turn and multiple turns. This allowed us to explore the role of context in automatic deception detection. When classifying interviewee response-segments, should the immediate response only be used for classification, or is inclusion of surrounding turns helpful? This has implications not only for deception classification, but for practitioners as well. Should human interviewers make use of responses to follow-up questions when determining response veracity, or should the initial response receive the most consideration?

We compared the performance of 3 classification algorithms: Random Forest, Logistic Regression, and SVM (sklearn implementation). In total, there were 7,792 question segments for both single turn and multiple turns segmentations. We divided this into 66% train and 33% test, and used the same fixed test set in experiments for both segmentations in order to directly compare results. The random baseline performance is 50, since the dataset is balanced for truthful and deceptive statements. Another baseline is human performance, which is 46.0 F1 in this corpus. The Random Forest classifier was consistently the best performing, and we only report those results due to space constraints. Table 4 displays the classification performance for each feature set individually, as well as feature combinations, for both single turn and multiple turn segmentations. It also shows the human baseline performance, obtained from the interviewers' judgments of deception in the corpus, which were made after asking each question along with related follow-up questions (i.e. multiple turn segmentation).

The best performance (72.74 F1-score) was obtained using LIWC features extracted from multiple turns. This is a 22.74% absolute increase over the random baseline of 50, and a 26.74% absolute increase over the human baseline of 46. The performance of classifiers trained on multiple turns was consistently better than those trained on single turns, for all feature sets. For multiple turns, LIWC features were better than the lexical feature set, and combining lexical with LIWC features did not improve over the performance of LIWC features alone. Adding individual traits information was also not beneficial. However, when considering the first turn only, the best results (70.87 F1-score) were obtained using a combination of LIWC+lexical+individual features. Using the first turns segmentation, lexical features were slightly better than LIWC features, and interestingly, adding individual traits helped both feature sets. A combination of LIWC and lexical features was better than each on its own.
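A sketch of this experimental setup using sklearn is shown below. The feature matrix and labels are stood in by random placeholders; in the real experiments they would be the per-segment LIWC, lexical, and individual-trait features with the global truth/deception labels.

```python
# A sketch of the classification setup described above: Random Forest on
# a fixed 66/33 split. X and y below are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.normal(size=(7792, 116))    # placeholder: 93 LIWC + 23 lexical features
y = rng.integers(0, 2, size=7792)   # placeholder: 1 = deceptive, 0 = truthful

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), average="weighted")
print(f"Precision={prec:.2f} Recall={rec:.2f} F1={f1:.2f}")
```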

These results suggest that contextual information, in the form of follow-up questions, is beneficial for deception classification. It seems that individual traits, including gender, native language, and personality scores, are helpful in deception classification under the condition where contextual information is not available. When the contextual information is available, the additional lexical content is more useful than individual traits.

7 Conclusions and Future Work

In this paper we presented a study of deceptive language in interview dialogues. Our analysis of linguistic characteristics of deceptive and truthful speech provides insight into the nature of deceptive language. We also analyzed the linguistic characteristics of speech that is perceived as deceptive and truthful, which is important for understanding the nature of trustworthy speech. We explored variation across gender and native language in linguistic cues to deception, highlighting cues that are specific to particular groups of speakers. We built classifiers that use combinations of linguistic features and individual traits to automatically identify deceptive speech. We compared the performance of using cues from the single first turn of an interviewee response-segment with using cues from the full context of multiple interviewee turns, achieving performance as high as 72.74 F1-score (about 27% better than human detection performance).

This work contributes to the critical problem of automatic deception detection, and increases our scientific understanding of deception, deception perception, and individual differences in deceptive behavior. In future work, we plan to conduct similar analysis in additional deception corpora in other domains, in order to identify consistent domain-independent deception indicators. In addition, we plan to conduct cross-corpus machine learning experiments, to evaluate the robustness of these and other feature sets in deception detection. We also would like to explore additional feature combinations, such as adding acoustic-prosodic features. Finally, we plan to conduct an empirical analysis of deception behavior across personality types.

Acknowledgments

This work was partially funded by AFOSR FA9550-11-1-0120 and by NSF DGE-11-44155. Thank you to Bingyan Hu for her assistance with feature extraction. We thank the anonymous reviewers for their helpful comments.

References

Susan H Adams. 1996. Statement analysis: What do suspects' words really reveal. FBI Law Enforcement Bulletin 65:12.

Joan Bachenko, Eileen Fitzpatrick, and Michael Schonwetter. 2008. Verification and implementation of language-based deception indicators in civil and criminal narratives. In Proceedings of the 22nd International Conference on Computational Linguistics (Volume 1). Association for Computational Linguistics, pages 41–48.

Stefan Benus, Frank Enos, Julia Hirschberg, and Elizabeth Shriberg. 2006. Pauses in deceptive speech. In Speech Prosody, volume 18, pages 2–5.

PT Costa and RR McCrae. 1989. NEO Five-Factor Inventory (NEO-FFI). Odessa, FL: Psychological Assessment Resources.

Bella M DePaulo, James J Lindsay, Brian E Malone, Laura Muhlenbruck, Kelly Charlton, and Harris Cooper. 2003. Cues to deception. Psychological Bulletin 129(1):74–118.

Frank Enos. 2009. Detecting deception in speech. Ph.D. thesis, Citeseer.

Song Feng, Ritwik Banerjee, and Yejin Choi. 2012. Syntactic stylometry for deception detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (Volume 2). Association for Computational Linguistics, pages 171–175.

Tommaso Fornaciari and Massimo Poesio. 2013. Automatic deception detection in Italian court cases. Artificial Intelligence and Law 21(3):303–340.

Sarah I Levitan, Guzhen An, Mandi Wang, Gideon Mendels, Julia Hirschberg, Michelle Levine, and Andrew Rosenberg. 2015a. Cross-cultural production and detection of deception from speech. In Proceedings of the 2015 ACM Workshop on Multimodal Deception Detection. ACM, pages 1–8.

Sarah I Levitan, Guzhen An, Mandi Wang, Gideon Mendels, Julia Hirschberg, Michelle Levine, and Andrew Rosenberg. 2015b. Cross-cultural production and detection of deception from speech. In Proceedings of the 2015 ACM Workshop on Multimodal Deception Detection. ACM, pages 1–8.

Junyi Jessy Li and Ani Nenkova. 2015. Fast and accurate prediction of sentence specificity. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pages 2281–2287.

Angel S Maredia, Kara Schechtman, Sarah I Levitan, and Julia Hirschberg. 2017. Comparing approaches for automatic question identification. In *SEM.

Gideon Mendels, Sarah Ita Levitan, Kai-Zhan Lee, and Julia Hirschberg. 2017. Hybrid acoustic-lexical deep learning approach for deception detection. Proc. Interspeech 2017, pages 1472–1476.

Rada Mihalcea and Carlo Strapparava. 2009. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, pages 309–312.

Matthew L Newman, James W Pennebaker, Diane S Berry, and Jane M Richards. 2003. Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin 29(5):665–675.

Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Volume 1). Association for Computational Linguistics, pages 309–319.

James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. Mahway: Lawrence Erlbaum Associates 71:2001.

James W Pennebaker and Laura A King. 1999. Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology 77(6):1296.

James W Pennebaker, Tracy J Mayne, and Martha E Francis. 1997. Linguistic predictors of adaptive bereavement. Journal of Personality and Social Psychology 72(4):863.

Verónica Pérez-Rosas, Mohamed Abouelenien, Rada Mihalcea, and Mihai Burzo. 2015. Deception detection using real-life trial data. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, pages 59–66.

Verónica Pérez-Rosas and Rada Mihalcea. 2015. Experiments in open domain deception detection. In Proceedings of EMNLP 2015. ACL, pages 1120–1125.

Patricia Rockwell, David B Buller, and Judee K Burgoon. 1997. The voice of deceit: Refining and expanding vocal cues to deception. Communication Research Reports 14(4):451–459.

Aldert Vrij, Gun R Semin, and Ray Bull. 1996. Insight into behavior displayed during deception. Human Communication Research 22(4):544–562.

Cynthia Whissell, Michael Fournier, René Pelland, Deborah Weir, and Katherine Makarec. 1986. A dictionary of affect in language: IV. Reliability, validity, and applications. Perceptual and Motor Skills 62(3):875–888.

Maria Yancheva and Frank Rudzicz. 2013. Automatic detection of deception in child-produced speech using syntactic complexity features. In ACL (1), pages 944–953.
