
Top-down effect of apparent humanness on vocal alignment toward human and device interlocutors

Georgia Zellou ([email protected]), Phonetics Lab, Department of Linguistics, UC Davis, 1 Shields Avenue, Davis, CA 95616 USA

Michelle Cohn ([email protected]), Phonetics Lab, Department of Linguistics, UC Davis, 1 Shields Avenue, Davis, CA 95616 USA

Abstract

People are now regularly speaking to voice-activated artificially intelligent (voice-AI) assistants. Yet, our understanding of the cognitive mechanisms at play during speech interactions with a voice-AI, relative to a real human, interlocutor is an understudied area of research. The present study tests whether a top-down guise of "apparent humanness" affects vocal alignment patterns to human and text-to-speech (TTS) voices. In a between-subjects design, participants heard either 4 naturally-produced or 4 TTS voices. Apparent humanness guise varied within-subject. Speaker guise was manipulated via a top-down label with images, either of two pictures of voice-AI systems (Amazon Echos) or two human talkers. Vocal alignment in vowel duration revealed top-down effects of apparent humanness guise: participants showed greater alignment to TTS voices when presented with a device guise (the "authentic guise"), but lower alignment in the two inauthentic guises. Results suggest a dynamic interplay of bottom-up and top-down factors in human and voice-AI interaction.

Keywords: vocal alignment; apparent guise; voice-activated artificially intelligent (voice-AI) systems; human-computer interaction

Introduction

Humans use speech as a way of conveying our abstract thoughts and intentions. Yet, speech is more than just a signal to emit words and phrases; our productions are shaped by intricate cognitive and social processes underlying and unfolding over the interaction. One large mediator of these processes is who we are talking to: their social characteristics (e.g., gender in Babel, 2012), humanness (e.g., computer or human in Burnham et al., 2010), and even our attitudes toward our interlocutors (e.g., speech accommodation in Chakrani, 2015) can shape our productions. Examining speech behavior can serve as a window into cognitive-social dimensions of communication and is particularly relevant for examining different types of interlocutors.

Recently, humans have begun interacting with voice-activated artificially intelligent (voice-AI) assistants, such as Amazon's Alexa and Apple's Siri. For example, over the past several years, tens of millions of "smart speakers" have been brought into people's homes all over the world (Bentley et al., 2018). Current voice-AI technology has advanced dramatically in recent years; common systems can now generate highly naturalistic speech in a productive way and respond to spontaneous verbal questions and commands by users. In some speech communities, voice-AI systems are omni-present and used on a daily basis to perform a variety of tasks, such as sending and reading text messages, making phone calls, answering queries, setting timers and reminders, and controlling internet-enabled devices around the home (Hoy, 2018). Despite the prevalence of voice-AI, our scientific understanding of the socio-cognitive mechanisms shaping human-device interaction is still limited. Specifically, as humans engage in dyadic speech behaviors with these non-human entities, how are our speech behaviors toward them similar or dissimilar from how we talk with humans? In this paper, we examine the effect of apparent "humanness" category — i.e., top-down information that the interlocutor is a device or a human — on people's vocal alignment toward text-to-speech (TTS) synthesized and naturally-produced voices.

Computer personification

On the one hand, classic theoretical frameworks of technology personification, such as the "Computers are social actors" (CASA) theory (Nass et al., 1997), hypothesize that our attitudes and behavior toward technological agents mirror those toward real human interactors. This has been tested by examining whether social patterns of behavior observed in human-human interaction apply when people interact with a computer. For instance, people's responses to questions vary based on the characteristics of the interviewer; for example, biases in responses are observed based on the gender and ethnicity of the (human) interviewer (Athey et al., 1960; Hutchinson & Wegge, 1991). This phenomenon has been described as a social desirability effect: people seek to align their responses to the perceived preferences of their interviewer because to do otherwise would be impolite (Finkel et al., 1991). Nass and colleagues (1999) examined how people apply this politeness pattern to interactive machines. After performing a task (either text- or voice-based) with a computer system, consisting of a tutoring session by the computer on facts about culture, and then subsequently being tested on those facts, participants were asked questions about the performance of the computer.

There were two critical conditions: either the same computer that performed the tutoring asked for an evaluation of its performance as a tutor, or the evaluation of that computer was solicited by a different computer located in another room. Nass and colleagues found that people provided more positive evaluations in the same-computer condition, relative to the different-computer condition, indicating that the social desirability effect applies when we interact with a machine. From empirical observations such as this, the CASA framework argues that our interactions with computers include a social component, positing that social responses to computers are automatic and unconscious, especially when they display cues associated with being human, in particular "the use of language" (Nass et al., 1999, p. 1105, emphasis ours).

Yet, the extent to which individuals display differences in their socio-cognitive representations of real human versus computer interlocutors has resulted in mixed findings in the literature. For example, studies manipulating interlocutor guise – as human or computer – are one way to compare these interactions while holding features of the interaction, such as error rate and conversational context, constant. For example, Wizard-of-Oz studies, where a human experimenter is behind different interactions that are apparently with either a "computer system" or a "real person", demonstrate that people do produce different speech patterns toward humans and technological systems. In particular, Burnham and colleagues (2010) found increased vowel space hyperarticulation and slower productions in speech toward an apparent computer avatar, relative to an apparent human interlocutor; this suggests that people have distinct representations for the two interlocutor types, computer versus human. Others have observed greater overlap: Heyselaar and colleagues (2017) found similar priming effects for real humans and human-like avatars (presented in virtual reality, VR); yet, they additionally found less priming for the computer avatar in general. Together, these results suggest that people's cognitive representations for humans and computerized interlocutors appear to be different and, further, may be gradiently, rather than categorically, distinct.

Manipulating top-down "humanness" guise

In many studies designed to compare human-human and human-computer interactions, there are multiple features that co-vary: the computer interlocutor has both a different form (e.g., a digital avatar in Burnham et al., 2010) and a synthetic voice. This has been true for recent work exploring human-voice-AI interaction as well, where naturally produced and TTS voices are confounded with "apparent humanness" (Cohn et al., 2019; Snyder et al., 2019). One way to probe whether people have a distinct social category for voice-AI is to observe their speech behavior toward a set of voices of the same type while varying the top-down label, either device or human, provided with each voice. Manipulating apparent humanness can speak to the impact of both bottom-up (acoustic) and top-down (guise) factors on speech behavior toward humans and voice-AI: whether behavior is driven by the characteristics of the speech (e.g., naturally produced vs. TTS) and/or by the extent to which we "believe" we are interacting with a human or a device voice. At the same time, presenting an inauthentic top-down label while presenting the original audio (e.g., a "human" label with a TTS voice, and vice versa) results in cue incongruency. In the present study, we test whether a match or mismatch in voice (real human vs. TTS) and image (human vs. device) shapes speech behavior — and whether speakers are more sensitive to incongruous cues for one guise than the other: apparent human vs. apparent device talker.

While prior work has examined guise using moving computer avatars (e.g., Burnham et al., 2010; Heyselaar et al., 2017), most modern voice-AI systems lack an avatar representation (i.e., "Siri" and "Alexa" have no faces). As such, we ask whether presenting a guise via images (either of a device or a human face) can trigger differences in speech behavior toward AI. Prior matched guise experiments have found that participants' perception of a given voice shifts according to the minimal social information available. For example, seeing a photo depicting speakers of various ages and socio-economic statuses impacts how listeners categorize the same set of vowels (Hay et al., 2006). Thus, in the present study we predict that top-down influences (here, images of voice-AI devices and human faces) will impact how participants respond to identical stimuli. In particular, we hypothesize that participants will display different speech patterns toward the voices labeled as "devices" compared to "humans".

Vocal alignment

In the current study, to test these questions, we examine vocal alignment, a subconscious, yet pervasive, speech behavior in human interactions, while varying top-down guise ("human" or "device"). Vocal alignment is the phenomenon whereby an individual adopts the subtle acoustic properties of their interlocutor's speech during verbal exchanges. Vocal alignment appears to be, to some extent, an automatic behavior, argued by some to stem from the human tendency to learn language in part by modeling speech patterns on those experienced from others (Delvaux & Soquet, 2007; Meltzoff & Moore, 1989). Yet, vocal alignment patterns also appear to be mediated by factors in the linguistic and social context, indicating that it is more than just an automatic behavior (Babel, 2012; Nielsen, 2011; Zellou et al., 2016). More specifically, we explore how human interlocutors' vocal behavior reveals the social role they assign to apparent voice-AI interlocutors, relative to apparent human interlocutors.

Vocal alignment has been posited to serve pro-social goals and motivations (e.g., Babel, 2012). In particular, "Communication Accommodation Theory" (CAT, Shepard et al., 2001) proposes that speech convergence is used by speakers to foster social closeness with their interlocutor, increasing rapport (Babel, 2012; Pardo, 2006).

One question is whether the social distance between humans and AI will be comparable to that between two humans. A CASA account might predict the same application of vocal alignment behavior to humans and to technological systems that engage with participants using language. Indeed, there is some support for alignment in human-computer interaction (Branigan et al., 2010, 2011; Cowan et al., 2015). For example, Branigan and colleagues (2003) found that humans display syntactic alignment toward apparent computer and human interlocutors, but with greater syntactic alignment toward the computer, than the human, guise.

There is some work showing vocal alignment toward computers/voice-AI as well. For example, Bell and colleagues (2003) found that participants vocally aligned to the speech rate produced by a computer avatar they interacted with. At the same time, multiple studies have found that individuals tend to show less vocal alignment toward modern voice-AI systems than toward human voices (Cohn et al., 2019; Raveh et al., 2019; Snyder et al., 2019). For example, Snyder et al. (2019) found that participants showed greater vowel duration alignment toward human voices, relative to Apple's Siri voices. Furthermore, these vowel-durational patterns mirror those reported in a perceptual similarity ratings task of alignment: less overall vocal alignment toward voice-AI, relative to human voices (Cohn et al., 2019). This suggests that rate/durational cues can be used as a window into the cognitive/social dynamics of vocal alignment toward human and voice-AI interlocutors.

Taken together, these findings suggest that our linguistic interactions with devices/computers are distinct from how we talk to human interlocutors — in some cases triggering less alignment toward voice-AI (vocal alignment) and in other cases more alignment toward computers (lexical, syntactic alignment). In other words, contra CASA, one possibility is that AI actors hold a social status that is distinct from that of humans, and subtle differences in our behavior toward them reflect this.

Current Study

The current study was designed to investigate a specific research question: What is the effect of apparent humanness on vocal alignment patterns toward human and voice-AI interlocutors? Manipulating apparent social information about the speaker has not, to our knowledge, been used in vocal alignment paradigms. To that end, in the current study, participants completed a word shadowing paradigm (Babel, 2012) consisting of four distinct model talkers. Across two between-subjects conditions, the 4 model talkers were either all real human voices or all TTS voices (Amazon Polly TTS voices). These TTS voices were developed to be used in Amazon Alexa Skills (e.g., interactive apps through Alexa-enabled devices). In each of these conditions, participants were told that two of the model talkers were humans and two of the talkers were devices. By crossing the actual status of the voices (Voice Type: human or TTS) with the top-down apparent guise label (Apparent Guise: human or device), we aim to probe the cognitive and social role of voice-AI in human-AI verbal interactions (relative to human-human interactions). There are several possible outcomes of this experimental design, each of which can speak to the status of voice-AI in speech interactions.

One hypothesis is that differences in patterns of vocal alignment will be driven by low-level acoustic differences across naturally produced and TTS voices, such as those used in voice-AI systems. In other words, there are inherent acoustic differences between synthetic and naturally produced speech which may lead to different alignment patterns toward the voices. While advances in TTS have created more naturalistic, human-sounding voices, TTS voices used in modern voice-AI systems (e.g., Amazon's Alexa, Apple's Siri) are still perceptibly distinct from natural voices, most likely because they contain less variability than human voices for some acoustic features. For example, some have pointed to prosodic irregularities in TTS productions as a marker of a synthetic voice (Németh et al., 2007). If bottom-up acoustic properties of synthetic speech, even for more realistic, modern voice-AI TTS, drive the way humans interact with interlocutors, regardless of their apparent status as humans or devices, then we predict that apparent guise will not affect listeners' behavior. In the design of our current study, this would lead us to observe only a main effect of Voice Type.

A second hypothesis is that we will observe asymmetries in how apparent humanness guises affect vocal alignment patterns for TTS and human voices. This would be expected if there are both bottom-up (acoustic-level) and top-down (humanness category) influences on how people interact with human and device interlocutors. In other words, if listeners are sensitive to the acoustic-phonetic differences between synthetic and naturally produced speech, and if the top-down labels of device and human speaker lead listeners to apply different expectations and social rules to the interaction, we predict asymmetrical influences of the "device" and "human" guises on synthetic and naturally produced voices. Support for this possibility comes from observations of the "uncanny valley" phenomenon: when an entity that humans know is not a natural human takes on enough similarity to human-like features, it elicits feelings of unease or discomfort (Mori et al., 2012). One interpretation of this phenomenon is that it is due to cue incongruence (Moore, 2012). Often, uncanniness is assessed experimentally using listener likeability ratings: likeability increases as the human-likeness of a device increases, but drops, creating a non-linear function, when human-likeness approaches actual "human" levels. We will explore the uncanny valley through vocal alignment. Since greater degrees of vocal alignment have been shown to correlate with higher ratings of likeability and attractiveness, alignment can also serve as a way to explore people's subtle reactions to incongruency of cues for humanness.

Our interpretation is that dips in degree of alignment for the inauthentic guise reflect a negative reaction toward the top-down and bottom-up pairing for a given condition. Thus, a finding of an interaction between Voice Type and Apparent Guise, where a device or human voice receives unequal patterns of alignment behavior in an inauthentic guise, would support the cue incongruence hypothesis.

Methods

Stimuli

Target words consisted of 30 CVC monosyllabic English low usage frequency real words, balanced by vowel category (/i/: weave, teethe, deed, cheek, peel, key; /æ/: wax, wag, vat, tap, nag, bat; /ɑ/: wad, tot, sod, sock, pod, cot; /o/: woe, soap, moat, hone, comb, coat; /u/: zoo, toot, hoop, dune, doom, boot). Target words were selected as a subset of the original 50 words used in Babel (2012) by omitting words with complex codas or open syllables. Stimuli consisted of recordings of the target words produced by 8 distinct voices. For the real human voices, target words were recorded by 4 humans (2 females, 2 males), native speakers of American English in their 20s, using a Shure WH20 XLR head-mounted microphone in a sound-attenuated booth. For the device voices, recordings of the 4 TTS voices (2 females, 2 males) were generated with standard parametric TTS in Amazon Polly (US-English): "Joanna", "Matthew", "Salli", and "Joey". All recordings were amplitude normalized to 60 dB.

Participants

Overall, 92 participants were recruited from the UC Davis undergraduate subjects pool and received course credit for their participation. The participants' mean age was 20.3 years (range = 18-42 years old). All participants were native English speakers and reported having no visual or hearing impairments. 76 participants reported experience using a digital device on at least a weekly basis (either Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and/or Google Assistant), while 16 participants reported no device usage.

Procedure

Participants were assigned to either the Human Voice Type condition (n=53) or the TTS Voice Type condition (n=39) in a between-subjects design. The procedure and instructions were identical across conditions, except that in the Human Voice Type condition participants heard only the human voices, while in the TTS Voice Type condition participants heard only the Alexa TTS voices. Within the Human Voice Type and TTS Voice Type conditions, 2 of the voices were assigned a Human Guise and 2 of the voices were assigned a Device Guise (for the Human Guise, male and female voices were assigned images and names corresponding to their apparent gender). The matching of each voice to a human or device guise was counterbalanced across 4 lists for each Voice Type condition (the 4 lists contained all possible different pairings of a voice to a guise). Participants were randomly assigned to a Voice Type condition and list.

Before beginning the study, all participants were told they would be repeating words produced by both device and human voices. Each participant was presented with four talkers: two apparent device models (a female and a male) and two apparent human models. An introductory slide presented this information and introduced the four talkers, including the names "Joanna" and "Matthew" (female and male device voices, respectively) and "Melissa" and "Carl" (female and male human talkers) (see Figure 1). Along with names, pictures of the talkers were also provided. The pictures served to reinforce the apparent humanness guise for each of the voices. The images for the voice-AI interlocutors consisted of two Amazon Echo devices, while the images for the humans were two stock photos of adult humans of corresponding genders (photos were selected from the first Google results for "male stock image" and "female stock image" at the time of the study design).

Figure 1: Voice-AI device (Amazon Echos) and human guises used in the present study.

The first part of the study was a baseline word production block. The purpose of this block was to collect the pre-exposure production of each of the 30 target words by each participant. In this block, each of the 30 target words was shown on a computer screen one at a time and participants were instructed to produce the word aloud. Participants read each word aloud twice, in a random order across 2 blocks.

Following the pre-exposure production block, participants completed the shadowing block: on each trial, they heard one of the four interlocutor voices saying one of the target words and were instructed to repeat the word. On the screen, they saw the speaker guise and a sentence showing the name of the speaker with the target word (e.g., "Joanna says 'doom'"). Each trial consisted of a randomly selected word-talker pairing. A block containing one repetition of each item per talker was repeated twice in the experiment. In total, participants shadowed the 30 words twice for each interlocutor voice (4 model talkers x 30 words x 2 repetitions = 240 shadowed word productions per participant). Each word production was recorded and digitized at a 44kHz sampling rate using a Shure WH20 XLR head-mounted microphone in a sound-attenuated booth. The entire experiment took roughly 45 minutes.
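To make the counterbalancing and trial structure concrete, the sketch below shows one way the four lists and the shadowing trial sequence could be generated. It is an illustrative reconstruction under stated assumptions, not the authors' experiment script; the voice identifiers and the three-word list are placeholders for the actual stimuli.

# Illustrative sketch of the counterbalanced design (not the authors' code).
# Voice IDs and the word list are placeholders for the actual stimuli.
import itertools
import random

female_voices = ["F1", "F2"]          # hypothetical female voice IDs in one Voice Type condition
male_voices = ["M1", "M2"]            # hypothetical male voice IDs
words = ["weave", "teethe", "deed"]   # stands in for the 30 CVC target words

# Four lists: every way of pairing one female and one male voice with the
# human guise; the remaining two voices receive the device guise.
lists = []
for f_human, m_human in itertools.product(female_voices, male_voices):
    guise = {v: ("human" if v in (f_human, m_human) else "device")
             for v in female_voices + male_voices}
    lists.append(guise)

# Shadowing block: one repetition of each word per talker in random order,
# repeated twice (4 talkers x 30 words x 2 repetitions = 240 trials).
assignment = random.choice(lists)
trials = []
for _ in range(2):
    block = [(voice, assignment[voice], word)
             for voice in assignment for word in words]
    random.shuffle(block)
    trials.extend(block)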

Analysis

Measuring alignment: Difference in distance (DID)

In this experiment, we focused on durational alignment, given the close correspondence between overall vocal alignment patterns for human/Siri voices observed across a global similarity paradigm (cf. AXB similarity in Cohn et al., 2019) and difference-in-distance (DID) vowel duration measures (Snyder et al., 2019). Following Snyder et al. (2019), we used a DID acoustic measure to assess degree of vocal alignment toward the model talkers. First, recordings were force-aligned; vowel boundaries were hand-corrected by a trained phonetician based on the presence of voicing and higher formant structure. Using a Praat script, we measured vowel duration (in milliseconds) from participants' pre-exposure and shadowing productions, as well as from the model talkers' productions. Degree of alignment was then tabulated as "difference in distance" (DID, Pardo et al., 2013): DID = |baseline duration - model duration| - |shadowed duration - model duration|. First, the absolute difference between the participant's baseline production and the model's production was calculated. Then, the absolute difference between the model's production and the participant's shadowed production was calculated. Note that we matched the first and second baseline repetitions to the first and second shadowed productions, respectively. Finally, we calculated the difference of these two differences (DID). This measure assesses overall alignment: positive DID values indicate alignment toward the model talker's production, while negative values indicate divergence from the model. Additionally, the magnitude of DID reflects degree of alignment: larger positive values indicate greater convergence, while smaller positive values indicate weaker convergence.
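As a concrete illustration of the DID measure, the short sketch below (an assumed implementation, not the authors' analysis script) computes DID from three vowel durations in milliseconds.

def did(baseline_ms, shadowed_ms, model_ms):
    # DID = |baseline - model| - |shadowed - model|; positive values mean the
    # shadowed token moved closer to the model talker's duration (convergence).
    return abs(baseline_ms - model_ms) - abs(shadowed_ms - model_ms)

# Example: a baseline vowel of 180 ms, a model talker's vowel of 120 ms, and a
# shadowed vowel of 150 ms give DID = 60 - 30 = 30 ms (convergence).
print(did(180.0, 150.0, 120.0))  # 30.0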
Statistical analysis

DID scores were modeled using a mixed effects linear regression including main fixed effect predictors of Voice Type (TTS, Human) and Humanness Guise Authenticity (Authentic, Inauthentic), as well as their interaction. The random effects structure consisted of by-Participant random intercepts and by-Participant random slopes for Humanness Guise.
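The paper does not state which software was used to fit this model. As one possible realization of the stated structure, the sketch below uses Python's statsmodels and assumes a long-format table with hypothetical column names DID, VoiceType, Guise, and Participant.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("did_scores.csv")  # hypothetical per-trial DID table

# Fixed effects: Voice Type, Guise Authenticity, and their interaction;
# by-participant random intercepts and random slopes for Guise.
model = smf.mixedlm("DID ~ VoiceType * Guise", data=df,
                    groups=df["Participant"], re_formula="~Guise")
result = model.fit()
print(result.summary())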
Results

The model did not reveal significant main effects of either Voice Type [b=-1.5, SE=1.1, p=.1] or Guise [b=.07, SE=.34, p=.8]. However, the model did reveal a significant interaction between Voice Type and Guise [b=-.85, SE=.34, p=.01]. This interaction is illustrated in Figure 2. Overall, mean DID values for each condition were above 0, meaning that shadowers did align to the vowel duration properties of both device and human model talkers. Yet, there were differences in degree of alignment across conditions. DID values were largest (indicating the greatest degree of convergence) in shadowed productions of TTS voices presented in the Authentic guise (i.e., with a picture of a device). When the TTS voices were presented in the Inauthentic guise (i.e., with a picture of a human), shadowers displayed less alignment to these same word productions. Human voices presented in the Authentic guise received the smallest DID values, indicating the least degree of convergence. When the human voices were presented in the Inauthentic guise (i.e., with a picture of a device), however, degree of convergence increased.

Figure 2: Vowel duration DID scores (means and standard errors of the mean) for text-to-speech (TTS) and human voices by Apparent Humanness Authenticity (Authentic = Humanness Guise matches Voice Type, Inauthentic = mismatch between Voice Type and Humanness Guise).

Discussion

The current study explored the top-down effect of "humanness guise", i.e., telling people that voices were produced by either a human or a device, on their subsequent responses to voices naturally produced by humans and voices generated by a common voice-AI system (Amazon Polly TTS voices). In doing so, we aimed to examine humanness (device vs. human) as a social category – how it predicts whether people interacting with devices will adopt the speech patterns produced by an apparent AI system, or not, relative to how they would with an apparent human interlocutor.

First, we hypothesized that acoustic differences between TTS and naturally produced voices might predict alignment patterns. We do not find support for this hypothesis; the model did not find an effect of Voice Type. Overall, we observe that participants align toward both voice types (naturally produced and TTS).

That we observe some vocal alignment for both voice types is broadly in line with the "Computers are Social Actors" (CASA) framework (Nass et al., 1997): participants apply human-human social behaviors to voice-AI. In this case, the behavior is vocal alignment – a robust behavior with social motivations in human-human interaction (cf. Babel, 2012).

Yet, contra a strict interpretation of CASA, we observe that patterns of vocal alignment are mediated by top-down representations of the voices — as produced by either real humans or devices. This supports our second hypothesis, that we would see an interaction between voice type and guise authenticity. Participants were more likely to vocally align to a device TTS voice when the guise was authentic, yet alignment was reduced when the synthetic voices were given the human guise (i.e., the inauthentic guise). Meanwhile, we observe the reverse pattern for the real human voices: in the authentic guise (seeing a human picture, hearing a human voice) participants show less alignment than when they are told the voice is from a device talker.

The asymmetry observed between the authentic and inauthentic guises suggests that both the low-level acoustic differences between human and device voices and the top-down effect of categorizing the interlocutor as a human or non-human entity are factors that influence our speech behavior. Interestingly, in the current study, these effects appear to be additive: participants aligned more to the TTS than the human voices in the authentic guise; yet switching the guise leads participants to align to an equal extent toward both voices, reducing alignment toward the same TTS voices and increasing alignment toward the same human voices. These findings support our proposal of a gradient personification of voice-AI based on the social cues available — one that is not categorically applied to technology that exhibits cues of "humanity" (cf. CASA; Nass et al., 1997).

This particular pattern of alignment toward the TTS voices as a result of humanness guise additionally supports the proposal by Moore (2012) that cue congruence can explain human behavioral reactions to non-human entities: the presence of two cues which provide conflicting information about the realism of an entity leads people to react in a negative way (i.e., with feelings of disgust or discomfort). In the current study, this is realized as a decreased degree of alignment toward the device voice when presented in a human guise. The cues to non-humanness are realized in the synthetic quality of the voices; thus, the false guise leads participants to align less to the device voices.

Yet, our finding of less alignment toward the human voices in the authentic guise contrasts with recent findings for human/voice-AI interaction in a similar population of university-age students (Cohn et al., 2019; Snyder et al., 2019). It is possible that there were idiosyncrasies in the particular human voices used in the current study, all of whom were undergraduates at the time of recording. The TTS voices, on the other hand, consisted of Amazon Polly voices, which differ in various ways from the quality of the Siri voices used in prior work. One area for future study is to control for and vary the style of speech across the two interlocutor types (e.g., using more formal/expressive TTS and human productions, relative to more neutral productions). It may be that the more casual-sounding productions by the humans were favored less than the more formal and/or expressive productions by the Amazon Polly voices, even if the TTS voices were synthetic.

Another possibility is that the apparent age of the talkers may have mediated degree of vocal alignment. In a follow-up post-hoc ratings study, we found that the 4 human voices were rated as younger (mean age rating = 29 years old) than the 4 TTS voices (mean age rating = 35.7 years old). The extent to which acoustic indices of speaker age interact with top-down effects (e.g., varying age guises) has, to our knowledge, not yet been explored in human-computer/voice-AI interaction, or in vocal alignment more generally.

Another possibility for the asymmetry in the present study is that our "authentic" human guise was still computer-mediated. Participants completed the experiment through a computer (via E-Prime), perhaps making it a more ecologically valid interaction for the device (TTS) voices, but a less natural way to interact with the humans. This interpretation would be in line with Branigan et al. (2003), who found greater syntactic alignment toward computers when the participants thought they were interacting (via text) with a computer versus with a real human. This suggests that matching expectations — such as an individual's expectation to engage with a computer via typing — may impact degree of alignment.

On the other hand, our observation of greater alignment toward device voices in the authentic guise parallels prior studies' observations that people display greater syntactic/lexical alignment toward apparent computer, relative to apparent human, interlocutors (e.g., Cowan et al., 2015; Branigan et al., 2010, 2011). The interpretation from these studies was that the functional difficulty of communicating with a computer, since computers are believed to be less communicatively able than humans, leads to greater alignment (Cowan et al., 2015). The same motivation could be at play in the present study, leading to greater vocal alignment toward the TTS voices in the device guise, where authenticity made the scenario of a communicatively disadvantaged interlocutor more believable. Thus, across multiple studies where top-down guise is manipulated, people display greater alignment, across multiple linguistic levels, toward technological agents than toward humans. Future work can shed light on this possibility.

Overall, this study provides a first step in examining how bottom-up and top-down influences of apparent humanness shape vocal alignment, serving as a window into participants' subconscious social behaviors toward AI and human interlocutors.

References

Athey, K., Coleman, J. E., Reitman, A. P., & Tang, J. (1960). Two experiments showing the effect of the interviewer's racial background on responses to questionnaires concerning racial issues. J. Applied Psychology, 44(4), 244.
Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation. J. Phonetics, 40(1), 177–189.
Bell, L., Gustafson, J., & Heldner, M. (2003). Prosodic adaptation in human-computer interaction. Proc. ICPhS, 3, 833–836.
Bentley, F., Luvogt, C., Silverman, M., Wirasinghe, R., White, B., & Lottridge, D. (2018). Understanding the long-term use of smart speaker assistants. Proc. ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(3), 1–24.
Branigan, H. P., Pickering, M. J., Pearson, J., & McLean, J. F. (2010). Linguistic alignment between people and computers. J. Pragmatics, 42(9), 2355–2368.
Branigan, H. P., Pickering, M. J., Pearson, J., McLean, J. F., & Brown, A. (2011). The role of beliefs in lexical alignment: Evidence from dialogs with humans and computers. Cognition, 121(1), 41–57.
Branigan, H. P., Pickering, M. J., Pearson, J., McLean, J. F., & Nass, C. (2003). Syntactic alignment between computers and people: The role of belief about mental states. Proc. Cognitive Science Society, 186–191.
Burnham, D., Joeffry, S., & Rice, L. (2010). "Does-Not-Compute": Vowel hyperarticulation in speech to an auditory-visual avatar. Auditory-Visual Speech Processing 2010.
Chakrani, B. (2015). Arabic interdialectal encounters: Investigating the influence of attitudes on language accommodation. Language & Communication, 41, 17–27.
Cohn, M., Ferenc Segedin, B., & Zellou, G. (2019). Imitating Siri: Socially-mediated alignment to device and human voices. Proc. ICPhS, 1813–1817.
Cowan, B. R., Branigan, H. P., Obregón, M., Bugis, E., & Beale, R. (2015). Voice anthropomorphism, interlocutor modelling and alignment effects on syntactic choices in human-computer dialogue. Int'l J. Human-Computer Studies, 83, 27–42.
Delvaux, V., & Soquet, A. (2007). The influence of ambient speech on adult speech productions through unintentional imitation. Phonetica, 64(2–3), 145–173.
Finkel, S. E., Guterbock, T. M., & Borg, M. J. (1991). Race-of-interviewer effects in a preelection poll: Virginia 1989. Public Opinion Quarterly, 55(3), 313–330.
Hay, J., Warren, P., & Drager, K. (2006). Factors influencing speech perception in the context of a merger-in-progress. J. Phonetics, 34(4), 458–484.
Heyselaar, E., Hagoort, P., & Segaert, K. (2017). In dialogue with an avatar, language behavior is identical to dialogue with a human partner. Behavior Research Methods, 49(1), 46–60.
Hoy, M. B. (2018). Alexa, Siri, Cortana, and more: An introduction to voice assistants. Medical Reference Services Quarterly, 37(1), 81–88.
Hutchinson, K. L., & Wegge, D. G. (1991). The effectiveness of interviewer gender upon response in telephone survey research. J. Social Behavior and Personality, 6(3), 573.
Meltzoff, A. N., & Moore, M. K. (1989). Imitation in newborn infants: Exploring the range of gestures imitated and the underlying mechanisms. Dev. Psych., 25(6), 954.
Moore, R. K. (2012). A Bayesian explanation of the 'Uncanny Valley' effect and related psychological phenomena. Scientific Reports, 2, 864.
Mori, M., MacDorman, K. F., & Kageki, N. (2012). The uncanny valley [from the field]. IEEE Robotics & Automation Magazine, 19(2), 98–100.
Nass, C., Moon, Y., & Carney, P. (1999). Are people polite to computers? Responses to computer-based interviewing systems. J. Applied Social Psych., 29(5), 1093–1109.
Nass, C., Moon, Y., Morkes, J., Kim, E.-Y., & Fogg, B. J. (1997). Computers are social actors: A review of current research. Human Values and the Design of Computer Technology, 72, 137–162.
Németh, G., Fék, M., & Csapó, T. G. (2007). Increasing prosodic variability of text-to-speech synthesizers. Proc. ISCA.
Nielsen, K. (2011). Specificity and abstractness of VOT imitation. Journal of Phonetics, 39(2), 132–142.
Pardo, J. S. (2006). On phonetic convergence during conversational interaction. JASA, 119(4), 2382–2393.
Pardo, J. S., Jordan, K., Mallari, R., Scanlon, C., & Lewandowski, E. (2013). Phonetic convergence in shadowed speech: The relation between acoustic and perceptual measures. JML, 69(3), 183–195.
Raveh, E., Siegert, I., Steiner, I., Gessinger, I., & Möbius, B. (2019). Three's a crowd? Effects of a second human on vocal accommodation with a voice assistant. Proc. Interspeech 2019, 4005–4009.
Shepard, C. A., Giles, H., & Le Poire, B. A. (2001). Communication accommodation theory. In W. P. Robinson & H. Giles (Eds.), The new handbook of language and social psychology (pp. 33–56). John Wiley & Sons, Ltd.
Snyder, C., Cohn, M., & Zellou, G. (2019). Individual variation in cognitive processing style predicts differences in phonetic imitation of device and human voices. Proc. Interspeech, 116–120.
Zellou, G., Scarborough, R., & Nielsen, K. (2016). Phonetic imitation of coarticulatory vowel nasalization. JASA, 140(5), 3560–3575.
