
Representing Emotions with Animated Text

By

Raisa Rashid, B.Comm (Ryerson University, 2005)

A thesis submitted in conformity with the requirements for the degree of Master of Information Studies, Faculty of Information Studies

University of Toronto

© Copyright by Raisa Rashid 2008

Representing Emotions with Animated Text

Raisa Rashid

Master of Information Studies

Faculty of Information Studies

University of Toronto

2008

Abstract

Closed captioning has not improved since the early 1970s, while film and television technology has changed dramatically. Closed captioning conveys only verbatim dialogue to the audience while ignoring music, sound effects and speech prosody. Thus, caption viewers receive limited and often erroneous information. My thesis research attempts to add some of the missing sounds and emotions back into captioning using animated text.

The study involved two animated caption styles and one conventional style: enhanced, extreme and closed. All styles were applied to two clips, with animations for happiness, sadness, anger, fear and disgust. Twenty-five hard of hearing and hearing participants viewed and commented on the three caption styles and also identified the character’s emotions. The study revealed that participants preferred enhanced, animated captions. Enhanced captions appeared to improve access to the emotive information in the content. Also, the animation for fear appeared to be the most easily understood by the participants.

Acknowledgements

I would like to express my gratitude to all those who made it possible for me to complete this thesis. It would have been impossible without the people who supported me and believed in me.

I want to give heartfelt thanks to my thesis supervisors Dr. Deborah Fels and Dr. Andrew Clement. I want to specially thank Deb for expertly guiding me and inspiring me at every step. I also want to thank Quoc Vy and Richard Hunt for helping me with the studies. A huge note of gratitude goes to my fellow “labbies” who were always encouraging. I want to especially thank Bertha Konstantinidis, Daniel Lee, JP Udo, Emily Price and Carmen Branje. Finally I want to thank my family. My husband has been a source of strength and support throughout the last year. He has made my work on this thesis so much easier by taking care of our lives when I was not able to. I also want to thank my loving parents, in-laws and sister for their on-going support, love and encouragement.

I also want to thank the University of Toronto, Ryerson University and NSERC for supporting my research.

Table of Contents

Introduction
Chapter 1. Literature Review
  Section 1.01 Closed Captioning
    (a) Captioning Standards
    (b) Current Captioning Practices
    (c) Captioning Practices outside North America
    (d) Problem with Captioning
  Section 1.02 Missing Elements: Music, Sound Effects & Prosody
  Section 1.03 Kinetic Typography
    (a) Elements of Typography
    (b) Kinetic Typography
  Section 1.04 A Model of Basic Emotions for Captioning
Chapter 2. Model of Emotion Used in this Thesis
  Section 2.01 Framework of emotive captions
Chapter 3. System Perspective
Chapter 4. Method
  Section 4.01 Content used in this study
  Section 4.02 Data Collection
    (a) Pre-Study Questionnaire
    (b) Data collection during viewing of the content
    (c) Post-study questionnaire
    (d) Video Data
    (e) Experimental setup and time to complete study
    (f) Data Collection and Analysis
Chapter 5. Results
  Section 5.01 Questionnaire Data
  Section 5.02 Video Data
  Section 5.03 Emotion ID
Chapter 6. Discussion
  Section 6.01 Limitations
    (a) Emotion Identification Method
    (b) Communication barriers
    (c) Small Number of Participants
    (d) Limited Genre
    (e) Short and Limited Test Clips
Chapter 7. Recommendation/Conclusion
  Section 7.01 Future Research
  Section 7.02 Conclusion
Chapter 8. References
Appendix A
Appendix B Pre-Study Questionnaire
Appendix C Post-Study Questionnaire

List of Tables

Table 1: Summary of the relevant animation properties for anger, fear, sadness, happiness
Table 2: Description of the themes
Table 3: Emotion identification categories
Table 4: χ2, degrees of freedom (df), mean and standard deviation for enhanced captioning attributes for To Air
Table 5: χ2, degrees of freedom (df), mean and standard deviation for enhanced captioning attributes for Bad Vibes
Table 6: Number of positive/negative comments by groups for the video data
Table 7: Descriptive statistics for HOH group in all categories of the video data analysis
Table 8: Descriptive statistics for hearing group in all categories of the video data
Table 9: Average number of emotions identified in each category
Table 10: Emotions versus type of identification
Table 11: Caption category versus type of identification

List of Figures

Figure 1: Example of closed captioning. The music note is one of the commonly used symbols to represent music (with permission)
Figure 2: Teletext Example
Figure 3: Shark Week Titling Sequence
Figure 4: Example of font size representing volume (Lee et al. 2002)
Figure 5: Example of high intensity anger over four frames
Figure 6: Example of low intensity fear. Initial text size is default size. Text size then expands and contracts rapidly for the entire duration that the text is on the screen
Figure 7: Conventional Captioning System
Figure 8: Captioning System, Proposed
Figure 9: Captioning Process: Conventional and proposed
Figure 10: Age distribution
Figure 11: Example of Enhanced Caption and Extreme Caption
Figure 12: Comparison of the three captioning styles by number of comments

Introduction

Deaf and hard of hearing viewers have limited access to the rich media of television. Most are unable to obtain the same quality and quantity of sound information as their hearing peers. Hearing impaired viewers can compensate for some of the missing sound information through the use of closed captioning, which relays a portion of aural sound visually through text and icons.

The size of the hard of hearing population in Canada is very difficult to estimate. The degree of hearing loss can vary from mild to profound deafness, making the classification of being hearing impaired somewhat amorphous (Gallaudet Research Institute, 2007). In addition, data on the hearing impaired population is primarily based on self-reported or informant-reported instances, and many hearing impaired individuals do not report their disabilities (Gallaudet Research Institute, 2007). The Canadian Hearing Society (CHS) roughly estimates that 23% of the Canadian population has some form of hearing loss (CHS, 2004), which translates into approximately seven million people.

However, not all caption users are hearing impaired, as many viewers turn on captions to learn English or when sound is unavailable (such as at fitness centres, bars or restaurants) (Lewis, 2000). The National Captioning Institute estimates that there are 100 million viewers in the US that benefit from using closed captioning (2003). This is neither a small nor insignificant population that can take advantage of this technology. Despite the large user group, very few industry or research resources are allocated to develop closed captioning technology, which has remained stagnant for decades.

The North American system of closed captioning is called Line 21 captioning and it was developed for analogue television in the 1970s (Canadian Association of Broadcasters, 2004). This type of captioning is limited to a small set of fonts, colours and graphics (Canadian Association of Broadcasters, 2004). Limitations in decoder technology posed legibility problems for early captions, forcing the text to be mono-spaced, upper case, and in white font colour displayed against a black background (Canadian Association of Broadcasters, 2004).

Recent advancements in decoder technologies allow for legible mixed case letters, some symbols, and a choice of fonts and colours to be used (Canadian Association of Broadcasters, 2004). However, despite the new capabilities, uppercase, white text on a black background is still the principal format for captions used today (National Captioning Institute, 2003) (see Figure 1 for an example of standard closed captioning).

Figure 1 Example of closed captioning. The music note is one of the commonly used symbols to represent music (with permission).

The lack of progress within the captioning industry is incongruent with the advancements in television technology over the last few decades, especially the emergence of digital television. Closed captioning has not only remained unchanged for three decades; current guidelines actually discourage the use of newly available features such as colours, fonts and mixed case lettering (Canadian Association of Broadcasters, 2004).

A major concern with closed captioning is that it provides only the verbatim equivalent of the spoken dialogue of a television show and ignores most non-verbal information such as music, sound effects, and tone of voice. Much of this missing sound information is used to express emotions, create ambiance, and complete the television and film viewing experience. Audience members forced to access sound information through captions alone stand to lose vital components of their television and film viewing experience.

The inspiration for my thesis is to investigate ways of infusing captions with non-verbal sound elements using a relatively new approach: animated or kinetic text. One of the primary aims of my thesis research is to determine whether animated text is a viable option for representing sound information visually. My personal motivation for pursuing this research area comes from being a member of an assistive technologies lab dedicated to investigating information challenges faced by people with disabilities. Hearing impaired caption users have tremendous unfulfilled needs that are not currently being addressed, and as an information technology professional I am committed to using new technologies to solve unmet information needs. Animated text has been used in the entertainment industry for exciting title sequences, and it appeared to be an unexplored, yet creative approach to text with the potential for visually representing sound information.

My research work in combining animated text with captioning is described in this thesis. It begins with a review of the literature in the following areas: captioning, kinetic text, music, sound effects and emotions. The literature review is followed by a description of the process used to develop a model of emotion/animation that attempts to map the properties of animation to the expression of emotions. The process description is followed by a systems look at how captions are created by third-party caption houses.

The next chapter outlines the study that was undertaken to explore and refine the emotion/animation model. The results of the data analysis obtained from the user study are reported following the methods section. These results are then interpreted and analyzed in a discussion section. The final chapter of the thesis includes recommendations based on the discussion and concluding remarks.

The goal of this thesis is to explore the potential for animation to enhance captioning and incorporate more of the non-verbal sound information, not to categorically claim that animation is the only (or even the best) way to present non-verbal sound information.

Additionally, the type of animations explored in this thesis is only one possible application of animations to represent emotions. Many other possible ways of representing emotions using animations exist that have not been investigated. The focus of this thesis is on verbatim captioning in its current state only. While literacy levels of hearing impaired viewers can impact their ability to watch television with verbatim captioning, this topic is out of scope for this thesis.


Chapter 1. Literature Review

Section 1.01 Closed Captioning

Closed captioning is a technique for translating dialogue into text. A television set with a built-in decoder is able to display the translated text on the screen to match the dialogue of a show (Abrahamian, 2003). Captioning can be provided by the broadcasters in real-time or off-line (for broadcast at a later time) format and can appear on screen as roll-up or pop-on captions (Canadian Association of Broadcasters, 2004). Off-line captions are produced by captioners from third party captioning houses or by in-house captioning departments of large broadcasters. Off-line captions are created before the program actually airs, which allows time for correcting errors, inserting symbols and organizing sentence and phrase structure so that it is easy to read. The Canadian Association of Broadcasters (2004) recommends that off-line captions be provided in the pop-on format, where the entire sentence or phrase “pops” on screen at once; however, roll-up captions are sometimes used for off-line captions because they are less costly to produce (RIT, n.d.).

Real-time or on-line captions are created by caption stenographers, who watch live broadcasts (like sporting events) and transcribe them for the viewers at home (Canadian Association of Broadcasters, 2004). Real-time captions typically appear using a roll-up format, where the captions scroll up the screen one line at a time.

(a) Captioning Standards

The standard defining the captioning format, spacing and font for North American analogue television is EIA-608. The EIA-608 standard specifies a restricted set of fonts, characters and colours for use in captioning (Consumer Electronics Association, 2005).

When captions were first introduced in the early 1970s, limitations in television encoder and decoder technology prevented the use of anything other than a mono-spaced font and white text colour. However, several options have been added since the original specification, namely the use of mixed letter cases, more font colours, and special characters. These new additions are rarely implemented in captions. Moreover, captioning guidelines strongly discourage the use of the new additions, citing viewer expectations and bandwidth limitations as obstacles.

The reasons behind the discouraging caption guidelines are valid to some extent. Caption viewers are accustomed to the white, uppercase captions and in general are hesitant to accept alterations to the old standard (NCAM, n.d.). In one NCAM study, for example, a participant commented that although a green caption colour was easy to read, she felt “funny” about a non-white caption (NCAM, n.d.). The bandwidth allowed for EIA-608 captions is 980 bits per second (bps) (Robson & Hutchins, 1998), and the quantity and quality of information that can be broadcast at this bandwidth is quite constrained, limiting the type and style of information being transmitted (Consumer Electronics Association, 2005).

Viewer preferences will not change easily, but the bandwidth limitations will soon be mitigated. The limited analogue closed captioning standard, EIA-608, is to be replaced by the emerging EIA-708 (or digital television) standard in the near future. The higher bandwidth of 9600 bps for EIA-708 means that more data per minute of video can be transmitted (Consumer Electronics Association, 2006). As a result, EIA-708 will provide for a much more enhanced character set that includes non-English letters, accented letters and an array of symbols (Consumer Electronics Association, 2006). The new standard will also allow features such as viewer-adjustable sizing of text, which will transfer more control over to the viewers by allowing them to increase or decrease the size of their caption display. EIA-708 will permit the use of different colours for the text, as well as translucency in the backgrounds. The translucent background has the potential to obstruct less of the television screen than the black, opaque background used currently (Blanchard, 2003).

The digital television standard will further allow for text styles that include edged or drop-shadowed text and a broad collection of fonts such as mono-spaced, serif, sans-serif and cursive. The 708 standard will provide for other interesting features such as the delay command (Blanchard, 2003). The delay command feature has been designed to instruct the caption decoder to halt processing of the Service Input Buffer data for a designated period of time. When a delay command is received by the decoder, incoming data is kept in the Service Input Buffer until the defined delay time has expired (Blanchard, 2003). Blanchard (2003) describes this feature of the 708 caption standard as “time-release captions”. The delay command can potentially be used to reduce the speech and caption synchronization errors that exist in captioning today.
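To put the two bandwidth figures in perspective, the short sketch below works out roughly how much caption data each standard allows per minute of video. It is a back-of-the-envelope illustration only, using the raw bit rates quoted above and ignoring any framing or protocol overhead.

    # Rough comparison of caption data per minute of video at the bit rates
    # quoted above (980 bps for EIA-608, 9600 bps for EIA-708).
    def caption_bytes_per_minute(bits_per_second: float) -> float:
        """Convert a raw caption bit rate into bytes per minute of video."""
        return bits_per_second * 60 / 8  # 60 seconds per minute, 8 bits per byte

    for name, bps in [("EIA-608", 980), ("EIA-708", 9600)]:
        print(f"{name}: about {caption_bytes_per_minute(bps):,.0f} bytes per minute")
    # EIA-608: about 7,350 bytes per minute
    # EIA-708: about 72,000 bytes per minute

The order-of-magnitude difference is what makes richer caption content, such as the animated text explored in this thesis, technically feasible under the new standard.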

(b) Current Captioning Practices

The existing captioning guidelines, referenced by most captioners, are heavily influenced by the research of Harkins, Korres, Singer & Virvan in 1995. The Harkins et al. (1995) study presented 106 deaf and 83 hard of hearing viewers with 19 television clips and asked them to complete a questionnaire to indicate their captioning preferences. The study sought to garner user preferences about what the researchers called “non-speech information” (e.g. speaker identification, music, manner of speaking). Based on the study results, Harkins et al. (1995) drafted recommendations for use by various captioning institutions.

The Harkins et al. (1995) study brought forth some interesting insights about caption user preferences and identified possible improvements for captioning. However, many of the issues uncovered in that study were insufficiently addressed and still require in-depth investigation. Harkins et al. (1995) concluded that colour is ineffective in distinguishing between various speakers compared to explicit text descriptions; however, little research has been conducted since then to uncover possible benefits of coloured text to caption viewers. Harkins et al. (1995) also concluded that animations such as flashing text are poorly received by audiences, but no research has been carried out to determine whether and how animation, graphics or symbols can be beneficial.

Harkins et al. (1995) used a survey instrument that did not compare different styles of captions but only asked people to imagine them when answering questions. The use of animation, symbols, and graphics in captioning is so uncommon that most people will never have seen it. People are generally poor at imagining features and functionality that they have never experienced before, and are unable to rate their effectiveness or desirability (Jacko & Sears, 2003). Controlled experiments with different styles and uses of sensory enhancements are warranted to determine the effect of these alternatives on people’s attitudes and levels of understanding of the content.

As mentioned before, very little research has been conducted to build on the findings of Harkins et al. (1995), and captioning guidelines have remained virtually unchanged as a result. This lack of advancement in closed captioning can be attributed to several factors. Firstly, caption lobbyists are still struggling to ensure better captioning legislation and increase the number of captioned programs (Downey, 2007). As a result, the social pressure is not on advancing caption research, but rather on having captions present at all. Secondly, captions are created by people unrelated to the production of the content (e.g., third-party caption houses) without any input from the producers/creators of a particular show. Therefore, all captioning decisions are made post-production, by outside captioners. It is relatively easy to speculate that these captioning houses have few compelling reasons to want to change their current way of producing the captions.

Finally, the broadcasters are mandated by government regulations to provide a specific number of hours of captioning a week (Downey, 2007). As the broadcasters are naturally concerned with the bottom line, they tend to opt for the easiest and least expensive captioning solution. Quality control and assurance is then undertaken by the captioning house rather than the broadcasters. However, as mentioned before, the captioning houses have little incentive for improving the state of captions.

(c) Captioning Practices outside North America

While closed captioning in North America has remained colourless and motionless, European and Australian caption viewers have been benefiting from colours, multiple fonts, and mixed case lettering for many years. The European and Australian equivalent of captioning, called teletext subtitling (NCAM, n.d.b), emerged out of an advertising model. Teletext, shown in Figure 2, was created by the British Broadcasting Corporation to provide accessibility through subtitles to deaf and hard of hearing viewers, but was subsequently adapted to provide other information, such as the weather, advertising and sports, to all viewers (NCAM, n.d.b). Teletext’s higher data transmission rate (12 kilobytes per second) allows for the display of more features (such as animation) and information (NCAM, n.d.b). However, complaints from European viewers about their subtitling services are surprisingly similar to those from North Americans: lack of synchronization, poor spelling, too fast onset, delays in update rates, insufficient amount of text and undesirable position of text on screen (Ofcom, 2005).

Figure 2 Teletext Example

(d) Problem with Captioning

It appears that much research and development is required to improve captioning to even a minimally satisfactory level. Many issues such as spelling, synchronization of text to dialogue, and positioning of text on screen have potential technical solutions. However, none of the captioning techniques described thus far have adequately managed to present information beyond verbatim dialogue. The capabilities of the captioning technologies allow for improvements in this area, but little research has been done to gauge audience reaction to alternative means of sound expression beyond static text. If the viewers of a show are only receiving the dialogue portion of television sound, then they are missing, as identified by Harkins et al. (1995) and Ofcom (2005), important non-verbal elements such as music, sound effects, and speech prosody.

Section 1.02 Missing Elements: Music, Sound Effects & Prosody

Speech prosody, music and sound effects are almost as integral to the television viewing experience as the dialogue and the visuals. Speech prosody profoundly influences the emotional context of words (Cruttenden, 1997). Music has the ability to impact the way viewers recognize, understand and remember a television series or a movie (Boltz, 2004; Bruner, 1990), and sound effects have the ability to provide information as well as emotion.

Tone of voice heavily influences the emotional context of speech by enhancing perception of words (Cook, 2002). Speech prosody can be divided into three areas: rhythm, loudness of voice and intonation (Cook, 2002). All three elements of voice can convey emotions, in particular changes in loudness and pitch (Cook, 2002). Caption users who are unable to access these important elements of voice risk losing crucial dramatic cues, especially when irony and puns are concerned.

As Boltz (2004) suggests, music can be used in parallel with a scene to increase or decrease the emotional impact of the visual elements. The combination of pitch, timing and loudness properties has the ability to create eerie suspense or heartbreaking sorrow (Boltz, 2004). Rising and falling pitch, for example, can represent growing or declining intensity within a specific emotional context. Complex melodies can convey more sadness or frustration than simple melodies (Bruner, 1990). Up-tempo music is generally associated with happy or pleasant feelings, while slow tempo suggests sentimental or solemn feelings (Gundlach, 1935; Hevner, 1937 as cited in Bruner, 1990). Increasing loudness or crescendo can express an increase in force, while decreasing loudness or diminuendo can convey a decrease in energy (Cooke, 1962; Zetti, 1973 as cited in Bruner, 1990). All of these critical musical properties are manipulated by filmmakers to present audiences with another dimension of information beyond the visuals. If viewers are missing the added music dimension because they cannot hear (or cannot hear fully), then they are missing tremendous amounts of crucial information that the filmmakers intended to deliver and that is not replicated in the visual information.

Chion (1994) claims that music can have two basic types of emotional effects on a scene: empathetic and anempathetic. Music that creates empathetic effects directly expresses the feelings of a scene, by emulating the rhythm, tone and phrasing of that scene (Chion, 1994). Music that produces anempathetic results uses indifferent music alongside an emotional scene to exaggerate its impact (Chion, 1994). Boltz (2004) suggests a similar concept, where the ironic contrast technique is used to combine dissimilar visuals and music to enhance or diminish the emotional intensity of a particular scene. An example of the ironic contrast technique would be playing cheery music together with a violent scene to make the actions dramatically more disturbing.

With empathetic music, where visuals are reinforced by music, the risk to viewers who cannot hear is that they lose some of the emotional impact of the scene. However, with anempathetic music or where the ironic contrast is used, music plays a much larger role. In this case, the risk to the viewers is much higher because losing those critical, ironic musical cues can result in misunderstanding an entire scene (or an entire film).

Sound effects, like music, can add much information and affect to a television program or a film. Sound effects supplement visual elements and dialogue to provide information and establish mood (Marshall, n.d.). According to Kerner (1989), sound effects are used to accomplish three objectives: simulate reality, create illusions and establish mood. Firstly, sound effects simulate reality by bridging the gap between a staged, artificial scene and the audience’s perception of reality. For instance, when a fake bottle is broken over a cowboy’s head in a Western drama, only the sound effect of a glass bottle crashing makes it real to the audience (Kerner, 1989). Secondly, sound effects create illusions by assisting audiences in imagining scenes that were never filmed or shown. For instance, off-screen crowd chatter convinces the audience that the actors are in a crowded arena instead of an empty studio (Kerner, 1989). Finally, sound effects establish the mood of a scene simply based on the information they provide (Kerner, 1989). For example, the sound of a door clicking shut can provide the audience with the information that a door has been shut, but the same door clicking during a burglary scene can communicate a sense of dread (Marshall, n.d.). Viewers who are unable to obtain these critical sound effect prompts stand to misinterpret (or simply not enjoy) television shows.

Since closed captioning mostly provides a translation of the dialogue elements, the richness of music, sound effects and intonation is frequently missed by hearing impaired viewers (or viewers missing access to sound). At this time, closed captioning either ignores the non-verbal sound information or provides sound/music information through italics and brackets, such as (woman crying), (door slamming) or (soft music). These descriptive phrases are, at best, poor replacements for the depth of music, sound effects and voice. Moreover, time limitations and viewer reading speeds prevent most captioned film and television from including much text description of the non-dialogue information.

Jensema (1997) found in a series of studies that the average viewer reads captions at a speed of 145 words per minute. He also found that the caption reading speed of hearing impaired viewers was somewhat higher than that of hearing viewers and that the reading speed of all caption viewers varied greatly (Jensema, 1997). He suggested that hearing impaired viewers have a higher caption reading speed because they likely have more experience using captions than hearing viewers. Harkins et al. (1995) reported in their study that 53% of the deaf and hard of hearing participants expressed an interest in having more of the background sound information presented, in addition to the dialogue.

Since, as reported by Jensema (1997), average caption speed on television is already 141 words per minute, any further addition of information beyond dialogue may render the captions too fast for people to enjoy or understand. Also, Jensema, Danturthi & Burch (2000) found that when viewers watched captioned television programs, they looked at the captions 84% of the time and the actual video only 14% of the time. These findings indicate that any additional sound information conveyed to the viewers via the television screen must be contained in the area where captions are being displayed.

In order to inject more information into closed captioning without increasing the number of words, it may be possible to supplement text information with other forms of information representation such as icons, animation, and colours. Fels, Lee, Branje & Hornburg (2005) have explored the use of colours and icons in captioning. The main objectives of this research were to communicate emotional elements missing from regular captioning and to provide access to more sound information. They experimented with using icons such as emoticons, music notes and symbols in conjunction with colour to express emotions such as happiness, anger, and surprise.

Fels et al. (2005) found that participant reactions differed dramatically between the hard of hearing and deaf groups. The hard of hearing participants found the use of colour and icons very useful and expressive. However, many of the deaf participants thought the use of colour was juvenile and preferred to rely on their own faculties to interpret the emotional content (Fels et al., 2005). All of the participants found the use of icons to provide sound information useful (Fels et al., 2005). The authors drew two main conclusions from the findings: first, that colours and icons can be used successfully to convey sound and emotional information, and second, that enhanced captioning may be perceived as more beneficial by hard of hearing viewers than deaf viewers (Fels et al., 2005).

Silverman & Fels (2002) also investigated the use of emotive captions, but in the form of comic book style graphics. The researchers used speech bubbles and colourful letters to portray emotions such as sadness, happiness, anger and fear as well as background sounds. Silverman & Fels (2002) also used multiple icons and a number of text styles to manipulate the dialogue and present non-dialogue sound information. The emotive captioned content was viewed by deaf and hearing participants. The researchers found that most of the participants (ten of eleven) thought that the emotive captions increased their understanding of the content. The primary complaint received from the viewers was that the comic book style was somewhat juvenile and ill-suited to more serious content (Silverman & Fels, 2002).

Beyond using static elements such as graphics, icons and colours, animating the caption text could be a new way of infusing text with an added dimension of information without adding more text descriptions. Animated or kinetic text has long been used in the entertainment industry to entertain, inform and evoke emotions. The emotive power of kinetic text could be applied to captions to potentially create animated captions that convey emotions, music and sound effects.

Section 1.03 Kinetic Typography

(a) Elements of Typography

Typography encompasses all the design choices embedded in selecting a type for a webpage, a printed page or the television screen (Ernst, 1984). Type can be determined by such elements as font (or typeface), size, spacing (between lines of type) and contrast. Common kinds of typefaces are categorized as Serif and Sans Serif. A serif typeface has small lines projecting from the ends of the letters, while a Sans Serif typeface does not have these lines (Ernst, 1984). Most fonts are proportionately spaced, where larger characters occupy more space than smaller ones (Ernst, 1984). With a mono-spaced font (the type used in closed captioning), each character takes up the same amount of space. The size of type is measured in vertical height or points (72 points in one inch), and the size is measured from a common baseline. The weight of type refers to the density of the letters (e.g. bold, light, and heavy) (Ernst, 1984).

According to Ambrose & Harris (2006), manipulating the type elements can ensure that a piece of text is readable and legible. Readability refers to the ease with which readers can comprehend the body of text and takes into account the overall layout of the text (Ambrose & Harris, 2006). In captioning, pop-on captions are considered more readable than roll-up captions, as viewers read in collections of words as opposed to one word at a time (RIT, 2000). Legibility, unlike readability, concerns the finer, operational detail of a typeface (e.g. size, weight, and typeface). A legible typeface, for instance, can be rendered unreadable by making it too wide (Ambrose & Harris, 2006). Kerning, or the spacing between two letters, is another factor that affects readability (Ambrose & Harris, 2006). Lack of white space between letters makes text less readable.

(b) Kinetic Typography

Kinetic typography, also known as animated text, is essentially text that moves over time. Kinetic typography has recently emerged as a powerful tool for expressing emotion, mood, personal characteristics, and tone of voice in the creative media industry (Forlizzi, Lee & Hudson, 2003). According to Woolman (2005), kinetic type has intrinsic, embedded meaning and the ability to inform, entertain and emotionally affect the audiences. One of the first examples of kinetic typography can be found in the title sequence of the Alfred Hitchcock film Psycho (Lee, Forlizzi & Hudson, 2002), where erratic lettering and movement of text on screen successfully communicated the unsettling nature of the classic horror film.

In recent years, the film and television industry has invested heavily in animated text for title or credit sequences (perhaps to emotionally prepare the audience for the impending viewing experience) (Geffner, 1997). Another widely acclaimed use of kinetic typography is found in the 1995 horror film Se7en. In the opening sequence of Se7en, trembling, high-contrast letters with a scratchy typeface are used to convey a sense of terror that emotionally prepares the audience for this disturbing film (Geffner, 1997). Other popular examples of kinetic text are found in the title sequences of television shows like 24 and Shark Week, illustrating the power of kinetic text in influencing audience emotions (Woolman, 2005). The title sequence for Shark Week (Figure 3) conveys the dangers associated with sharks with only floating, eerie letters, without showing the sharks themselves (Woolman, 2005).

Figure 3 Shark Week Titling Sequence

In contrast to the film/artistic industry, where much work is being done to exploit the potential of kinetic typography, little formal research is being conducted in the academic community to understand the impact of kinetic type on audiences/users. The few kinetic typography researchers include Wang, Predinger & Igarashi (2004), who have explored the impact of kinetic typography in expressing the affective or emotional state in online communication using a library of pre-made animations. Wang et al. (2004) used a library that included around twenty different animations, some of which were animations to represent emotions (happy and sad) and others to signify . Using galvanic skin response sensors to detect emotional arousal among the participants, Wang et al. (2004) established that the degree of emotional response increased when animated text was used, implying that animated text can communicate emotions.

Bodine & Pignol (2003) conducted a study similar to that of Wang et al. (2004), evaluating the emotional impact of kinetic typography in instant messaging communication. The researchers concluded that “kinetic typography has the capacity to dramatically add to the way people convey emotions” (Bodine & Pignol, 2003, p. 2). Forlizzi et al. (2003) agreed with the findings of Bodine & Pignol that kinetic typography has the capacity to express affective content. Similar to Wang et al. (2004), Bodine & Pignol (2003) and Forlizzi et al. (2003) did not analyze the relationship between specific animation properties and emotions. As a result, all three research groups allowed the kinetic properties of the text to be driven by the interpretation of the creators of the animations rather than by the emotions themselves.

In addition to affective content, kinetic typography can create characters and direct the attention of the audience (Forlizzi et al., 2003). Ford, Forlizzi & Ishizaki (1997) discovered that readers of kinetic typography messages attached tone of voice to the messages they observed. Ford et al. (1997) define tone of voice as “variations in pronunciation when segments of speech such as syllables, words and phrases are articulated” (1997, p. 269). The researchers also suggest that kinetic typography can take on the “personality” of the people composing the animated messages, where the message readers begin to assign distinct personas to the writers of the kinetic dialogue.

The research initiatives mentioned thus far have demonstrated that kinetic typography could be used to enhance the emotional interpretation of written words; however, only a few of the studies attempted to determine which properties of the animation excited the particular emotions. In order for captioners to effectively apply animations to captions, the animations need to convey specific emotions in a consistent and accurate way.

Lee, Forlizzi & Hudson (2002) are some of the few researchers in the academic community to look at the relationship between properties of text animation and emotion. They are also some of the few to attempt to create reusable animations. In their study using kinetic text with instant messaging, Lee et al. (2002) show that kinetic text can be used consistently with pre-defined patterns to communicate specific emotions. They argue that kinetic typography parameters correspond to prosodic features of voice that express emotions, such as rate of speech and volume of voice. They suggest that animation properties of text, such as increase in size or upward or downward movement, relate to voice characteristics such as pitch, volume, and speed of delivery. They report that sweeping upward or downward motions of text can suggest rising and falling pitch, and that increase and decrease of text size can express loudness of voice (Lee et al., 2002). They also discovered that short up and down movements are able to mimic fear and that vibration can express anger in animated text. An example of this would be to communicate loudness with a larger font size and quietness with a smaller font size. As shown in Figure 4, a) represents exuberance with large letters while b) represents lowness in feeling. The findings of Lee et al. (2002) suggest that emotions that are expressed using the prosodic elements of voice may possibly be represented using animation properties that correspond to those elements. If in anger, for example, the voice increases in volume and vibrates, then the animation representing that anger would increase in size and vibrate.

Lee et al. (2002) make some intriguing observations; however, their techniques are geared towards generating hundreds of different types of animations. These animations are contained within a behaviour library, where some animations represent affective states; others represent nothing more than the typographer’s creative state of mind.

Figure 4 Example of font size representing volume (Lee et al. 2002)

Minakuchi & Tanaka (2005) agree with Lee et al. (2002) that the motion of the kinetic text itself has meaning. They classify this into three sub-classes: physical motion, physiological motion, and body language. Kinetic type that follows physical motion copies natural patterns of movement such as bouncing or falling. Kinetic type mimicking physiological motion mirrors human reactions such as turning red when angry. Finally, animations that follow body language replicate motions such as shrugging. Minakuchi & Tanaka’s findings further support the close relationship between animation and emotions, and suggest that specific motion of text can be related to specific emotions. The discovery of these relationships means that particular animation patterns can be applied consistently to text to produce repeatable expressions of emotions. However, Minakuchi & Tanaka (2005) are focused on developing a kinetic typography engine that automatically animates a variety of words, whether emotional or otherwise. Also, the researchers have done little work in the area of validating how accurately or effectively the animations they have produced represent the meaning of the words/emotions. Therefore, the application of Minakuchi & Tanaka’s research to closed captioning is quite limited considering the broad range of caption users.

Overall, kinetic typography needs to be explored further in the context of captioning, especially to visually express the emotional aspects of sound information such as music, tone of voice and sound effects. Unless the animations correctly characterize the emotions, sound effects, or music, their application will be a detriment to users because they will add clutter, confusion and distraction to the captions.

The examples from the film industry reveal that emotional impact can be created using kinetic typography, but individual viewer reactions, or the exact nature of the viewer reaction to kinetic text, have not been formally studied. Furthermore, application of kinetic text in the captioning industry has many limitations. A film titling sequence (lasting only minutes) can have a budget that exceeds $30,000 (Geffner, 1997) and can involve an entire artistic team. Comparatively, the budget for captioning is around $1000 for a one-hour show, and captioning typically involves one captioner (RIT, n.d.). Also, a captioner lacks the necessary graphic design skills to manipulate animation properties to create effective animations. As a result, a simplified model of animation based on a limited number of emotions is required. The simplified model will allow a captioner to identify the emotions of words and phrases and easily apply the appropriate animations to them in a cost-effective way. Eventually, an information system needs to be built that uses kinetic typography to create animated captions that convey the missing sound and emotional elements.

Section 1.04 A Model of Basic Emotions for Captioning

What constitutes emotions is constantly being debated by psychologists and psychology theorists, as the human emotional system is complex and difficult to study. While very few concrete conclusions have been drawn about emotions, one of them is that emotions are crucial to interpersonal communication. In fact, Ekman (1999) found that those who are unable to convey emotions through facial expressions and speech prosody have tremendous difficulty communicating and maintaining personal relationships.

Theories about emotions can be loosely separated into two main streams of thought. One theory maintains that all emotions are the same in essence and differ only in intensity and pleasantness (Ortony & Turner, 1990). The second theory claims the existence of a set of discrete, basic emotions, which are fundamentally dissimilar to each other. Evaluating the basic emotion models proposed by a range of researchers, Ortony and Turner (1990) claim that, in most proposed models, the basic emotions can be combined to form all other emotions. Even within the group that agrees on the existence of basic emotions, few concur on the number of basic emotions and their identity/definition.

Examining the research on basic emotions, it appears that the number of basic emotions falls somewhere between two and several dozen. Mowrer (1960), for example, proposes only two basic emotional states: pleasure and pain. Frijda (1986) identifies 18 different emotions in his model, which includes interest, wonder, humility, indifference and desire, in addition to the more commonplace ones like anger and happiness. Many of these researchers have their own distinctive reasons or criteria for determining basic emotions. Mowrer (1960), for example, only includes pleasure and pain as basic emotions because they are unlearned emotional states.

Among the plethora of conflicting theories and the vast number of suggested basic emotions, Ortony and Turner (1990) observe some congruency. They claim that the most commonly identified basic emotions include anger, happiness, sadness, and fear (Ortony & Turner, 1990). Ortony and Turner (1990) also point out that often the basic emotion theorists are not proposing emotions that are different; they are merely using different words to describe the same emotion. Words like frustration, irritation, aggression, and resentment, for example, are less or more intense forms of anger (Ortony and Turner, 1990).

The most well-established psychological models of basic emotion include those proposed by Plutchik (1980) and Ekman (1999), which suggest that all emotions can be reduced to a set of five to eight primitive emotions. Based on cross-cultural facial expressions, Ekman and Friesen (1986) derived a five-emotion model that includes the four overlapping, common emotions plus surprise. Their justification suggests that basic emotions are episodic and brief in nature, felt as a direct result of an antecedent, and that each unique emotion results in a common set of facial expressions shared across cultures. Plutchik (1980), like Ekman & Friesen (1986), acknowledges the existence of a set of emotions that combine to derive other emotions. Plutchik’s (1980) emotion model includes acceptance, anger, anticipation, disgust, joy, fear, sadness, and surprise.

Due to resource constraints, such as lack of money and artistic talent, and the fast turn-around time required of captioners, the process of adding emotive animations to captions needs to be simple and easy to use. As a result, for my research, a simplified model of emotion is used to characterize the emotional elements present in television and film dialogue. A four-emotion model using the four common emotions suggested by Ekman was considered initially, but a disgust category (suggested as a primitive emotion by Plutchik, 1980) was added to the emotion model because it appeared once (but prominently) in the example video content and required a unique animation style.

Chapter 2. Model of Emotion Used in this Thesis

The literature on the relationship between specific animations and emotions is very limited, especially when applied to captioning. Thus, a design-oriented, intuitive approach was chosen to uncover the linkage between animation properties and emotion.

The intuitive approach was spearheaded by an experienced graphic designer and a design team. The design process consisted of generating a set of alternatives, selecting one concept based on group consensus, applying that concept to captioning content, refining the concept and then creating a final production. There were natural limitations to this creative process; however, it seemed to be an appropriate method at this initial stage.

The experienced graphic designer specialized in animated text and its ability to evoke emotions. His interest in this particular project was partly due to his own hearing loss. My role in the research at this initial stage was to utilize captioning, animated text, psychology and typography literature to support the creative process. I was also responsible for the more technical aspects of the initiative, such as using multimedia software to actually construct the animations and applying them to video.

Some animation movements have been found to suggest specific emotions, especially those emotions that can be easily conveyed by voice intonations. For instance, Juslin & Laukka (2003) found that rising vocal contours were associated with anger, fear and joy. Lee et al. (2002) also found that emotions that are expressed using the prosodic elements of voice can be represented using animation properties that correspond to those elements. According to the limited existing literature, animation motions that followed these voice patterns appeared to consistently suggest the same emotions. These shared properties created a good starting point for the generation of animation ideas for the graphic designer and the research team.

At first, the research team focused on generating an array of alternative example concepts. Each concept was roughly sketched to produce many simple “test” animations. Some of the “animations” were fully rendered using software, while other concepts remained in mock-up form. During the idea generation period, my role was to take the designer’s mock-ups and create the test animations. I also used literature to ensure that the animation examples were readable, legible and comprehensible as captions. The period of idea generation lasted three weeks.

Next, the example concepts were critically examined and only the strongest concepts were chosen for further development. My role at this critical period was to evaluate and reject test concepts based on the preferences of caption users reported in the literature. Several iterations later, the chosen examples were applied to video samples and the animated clips were assessed by the full research team, consisting of me and two other individuals: an experienced caption researcher and a deaf consultant from the Canadian Hearing Society. Critical input from the research team guided the selection of the final two concept styles.

Once agreement was reached on which alternatives to adopt, they were applied to the text captions of two one- to two-minute examples of video content. At this point, my job was to create the animations using software and ensure that the animations worked as captions. During the animation creation process, it appeared that some emotions did not lend themselves easily to animation. For instance, the motion representing sadness was not consistently rendered and accurately interpreted by the group. In one instance, for example, motion was used to move type “up” on the page to complement the dialogue “we’re all going to die!”, which was said with an increasing pitch in the voice. The motion was meant to convey that sense of “dying” but was too tenuous a link for understanding. To correct it, linguistic cues were relied upon, resulting in the sadness animation moving downward to mimic sadness and falling intonation. These and other ideas were then developed and applied by the designer and me, and evaluated by the research team until there was agreement on the final animations.

Section 2.01 Framework of emotive captions

Once two content pieces were captioned using the expert’s approach, a framework was created, relating animation properties to the set of basic emotions described by Ekman (1999) and Plutchik (1980). At this point I was involved in extracting the designer’s reasons behind the animations and matching them with the emotions. The emotions surprise, anticipation and acceptance did not produce unique animation properties but existed within the other five. As a result, only four emotions (anger, fear, sadness, and happiness) comprised the framework. There was one instance of disgust identified, but the animation for that instance followed the physical motion of the character. While creating the framework, disgust was left out because the animation for it appeared to be unique to a specific situation and unrepeatable.

The animation properties consist of a set of standard kinetic text properties. These are: 1) text size; 2) horizontal and vertical position on screen; 3) text opacity or how see-through the text is; 4) how fast the animation moves/appears/disappears (speed/rate); 5) how long it stays on the screen (duration); 6) a vibration effect; and 7) movement from baseline. Vibration is defined as the repeated and rapid cycling of minor variations in the size of letters and appears as “shaking” of the letters.

All of the properties are defined for three stages of onscreen appearance: 1) onset, or when the text first appears or begins to animate; 2) “the effect”, occurring between the onset and offset stages; and 3) offset, or when the animated text stops being animated or disappears from the screen. In addition, the intensity of the emotion can affect all of the properties. For example, high intensity anger has a faster appearance on the screen and grows to a larger text size in the effect stage than text that shows a lower intensity anger condition (as suggested by Lee et al., 2002). For high intensity anger, the text will also seem considerably larger than the surrounding text. Most of these properties are defined relative to the original size of the non-animated text, and they would be identified as a default value by the captioning standard that is applied.

Table 1 summarizes the property descriptions for each primitive emotion. Where an animation property is not defined, a default value would be used. For example, where onset speed is not specifically defined, this would be the default onset speed defined by the captioning software. The framework also allows for other emotions to be added and the related animation properties defined accordingly.

Table 1: Summary of the relevant animation properties for anger, fear, sadness, happiness

Fear
Relevant properties (effect): Size: repeated expansion and contraction of the animated text for the duration of the caption. Rate: expansion and contraction occur at a constant rate (e.g. 5 times per second). Vibration: constant throughout the effect.
Intensity of effect: Low: size of animated text is the same as non-animated text at onset; low level of vibration. High: size of animated text is larger than non-animated text at onset; the higher the intensity, the larger the text size, up to 150% of original size; high level of vibration.

Anger
Relevant properties (effect): Size: contract to smallest size (e.g. 10%), then expand to largest size (e.g. 150%), then expand and contract until original size is reached. Vibration: occurs at largest size in cycle. Onset: faster onset than default onset rate. Duration: pause at largest size in cycle; vibration occurs here.
Intensity of effect: Low: size of effect is smaller; slower onset. High: size of effect is larger; faster onset.

Sad
Relevant properties (effect): Size: vertical scale of text decreases to 50%. Position: downward vertical movement from baseline. Opacity: 50% decrease in opacity. Offset: slower offset than default offset rate.
Intensity of effect: Low: faster offset, faster downward vertical movement. High: slower offset, slower downward vertical movement.

Happy
Relevant properties (effect): Size: vertical scale of text increases. Position: upward vertical movement from below baseline, following a curve (fountain-like).
Intensity of effect: Low: slower onset; offset text size is smaller. High: faster onset; offset text size is larger.

Figure 5 and Figure 6 illustrate these concepts for a high intensity anger caption and a low intensity fear caption with the most salient properties identified. The animations would appear during the time that the captions were present in the video material.


Figure 5: Example of high intensity anger over four frames (1. Onset: starting size; 2. Size: vertical scale of text decreases; 3. Size: vertical scale of text increases; 4. Size: vertical scale of text decreases).

Figure 6: Example of low intensity fear (1. Size: expansion of text, constant vibration; 2. Size: contraction of text, constant vibration). Initial text size is the default size. Text size then expands and contracts rapidly for the entire duration that the text is on the screen.

Animations created using this framework could be applied to words, sentences, phrases or even single letters (grain size) in a caption to elicit the desired emotive effect.

The intended emotive effects of the animations seem to be consistently applicable for a range of different typefaces (e.g., Frutiger, Century Schoolbook).

Chapter 3. System Perspective

The system for creating captions is a manual process driven by the choices of the

captioners and broadcasters. As can be seen from Figure 7, the conventional captioning

model has visuals and audio as inputs and visuals, audio and text captions as outputs. The

audio input includes all the components discussed in the literature review that combine to

make the television viewing experience complete: speech prosody, sound effects,

dialogue and music. In order to convert inputs to outputs, the captioner transforms the

inputs by transcribing, interpreting, describing and summarizing the dialogue and some

of the other sound elements into text captions (DCMP, n.d.). The resulting output

includes the text equivalent of the dialogue and does not include most of the other sound

components. The dashed lines outline the components of audio that could be missed by

hearing impaired viewers or those viewers unable to access sound (because they are at the

gym or a bar).

My hypothesis is that those caption users unable to fully access the non-dialogue audio components of the output can compensate for the missing information with the

introduction of animated captions to the conventional system (as shown in Figure 8). In

the proposed system, the inputs are altered to include the input of the director/film-

maker/content-creator and the emotional model outlined in Chapter 2. The captioner can

then use the two added inputs to create the animations that can supplement the missing

audio elements. The emotion model can allow the captioner to create meaningful

animations, which would reduce the amount of descriptive work.


Figure 7: Conventional Captioning System

Figure 8: Proposed Captioning System


Figure 9 a) schematically represents the tasks the captioner carries out to convert

the inputs into the outputs, based on the captioning process described in (DCMP, n.d.).

The captioner first creates a working copy from the master copy of the show, then watches the program and transcribes the video. Following that, the captioner works on

the transcript to prepare the captions for broadcast by:

a) Paraphrasing or truncating the dialogue/text

b) Adding text descriptions and

c) Eliminating redundant/unnecessary text.

Once the captioner has completed the process, the captions are timed with the dialogue based on the time code and set at a pre-determined presentation rate. Once the captions and video are matched, it is also the captioner’s job to review his work, merge the video and the captions and produce a copy for broadcast.
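The timing and presentation-rate step can be illustrated with a small sketch. The example below is hypothetical, not the tooling captioners actually use: it represents a single caption cue with start and end time codes and computes its presentation rate in words per minute so that it could be compared against whatever pre-determined rate a caption house sets.

    from dataclasses import dataclass

    @dataclass
    class CaptionCue:
        start: float   # seconds on the programme time code
        end: float
        text: str

    def presentation_rate_wpm(cue):
        """Presentation rate of one caption cue in words per minute."""
        return len(cue.text.split()) / ((cue.end - cue.start) / 60.0)

    # Hypothetical cue, not taken from the study content.
    cue = CaptionCue(start=12.0, end=14.5, text="Look out! It is right behind us!")
    print(round(presentation_rate_wpm(cue)))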

Figure 9 a) indicates the steps where the captioner makes crucial, editorial decisions during the captioning process. Firstly, the captioner decides how the transcript for a show will be broken up into phrases and presented to the viewers. Then he also determines whether dialogue needs to be paraphrased or "unnecessary" words must be removed to fit into a short timeframe. The captioner decides which sound effects, music or prosodic elements are vital enough to be described and the words that should be used to describe them (DCMP, n.d.). These decisions can often be very important ones and have the power to change the meaning of a particular show. Incorrectly paraphrasing fast-paced dialogue can dramatically alter the tone of a show. Similarly, omitting a particular sound description, because it is deemed unnecessary by the caption creator, can hugely impact the story.

The directors of captioned shows and movies have failed to realize that captions are not merely tools that aid understanding of dialogue. Captions are representations of the director’s work. Those viewers accessing sound via captions interpret a large portion of the message and meaning of the show or film based on the decisions made by the caption houses. It is possible for audiences to dislike or misinterpret a movie based on substandard caption work.

Because closed captioning is currently relatively simple, the decision making required of caption houses is not extensive. If the animated caption model is implemented in the current captioning system, the responsibility of the captioner would increase considerably. In addition to all the decisions the captioner is currently making, he would have to decide on the emotions felt by the characters. The only effective and accurate way of incorporating emotive animations into captioning would be to involve the show creators in the process.

Figure 9 b) describes the proposed, or potential future, process, where the animated captions are incorporated into the output. Unlike the current process, the proposed process requires much more input from the director/creator of the show. The show creator (or the show's subject matter expert, such as the script writer) would advise the captioner on how to break up the text into phrases, provide notes on the emotions/intensity ratings of dialogue, sound effects and music, and indicate how sound effects and music are to be described. The captioners in the proposed system would use the director's notes and the emotion model to create the appropriate animations. The rest of the proposed process is similar to the current process except that at one point the director or a representative of the director will review the captions with the captioner before broadcast.

Broadcasters are legally required to provide captioning in order to renew their licenses (Downey, 2007), thus their focus is on providing captions as cheaply as possible.

If show creators can appreciate that approximately 100 million caption users (National

Captioning Institute, 2003) are accessing and interpreting their works through inferior caption work, then they might be more willing to participate in the captioning process.

Figure 9: Captioning Process: a) current (conventional) captioning process; b) proposed captioning process


Chapter 4. Method

In order to begin to understand the impact of kinetic typography in captioning and to verify and refine the framework outlined in Chapter 2, the following research questions are proposed:

1. Can viewers interpret/understand/appreciate the animated text captions?

2. Can emotions be represented using animated text in captions?

3. What properties of animation can be used to create emotive animated text?

The first question seeks to determine if typical viewers of captioning, either hearing or hearing impaired, are able to watch and understand a show captioned with animated text. Before implementing a novel form of captioning, it is imperative to know if audiences actually understand, accept and enjoy watching the new style.

The purpose of this research is to improve access to the intended emotional information contained in the non-dialogue sound of television by expressing those emotions through movement of text. As such, I attempt to measure the level of understanding of those emotions within the context of an actual narrative or show. It should be noted that measuring understanding of emotions is a difficult and complex process. The animations for a show do not exist by themselves; they are supplemented by facial expressions and residual sounds that interfere with people's understanding. Furthermore, the study attempts not to discover how the participants are feeling while watching a show, but to measure their understanding of the characters' emotions, which is a difficult task to accomplish.

It is still important to attempt to measure the participants' understanding of the characters' emotions because the main limitation of kinetic typography research thus far has been that it studies reactions to animated text out of the context of television/film. My research endeavours to put participants' understanding of emotions through animation into the context of watching television or film.

The second research question relates to whether the same properties of animation applied to different words and phrases can produce a consistent emotional effect on viewers. For instance, I tried to determine if “angry words” with the same angry animation properties are consistently identified as angry by the participants. I also tried to see if variations in intensity of those properties can produce the identification of that same emotion.

Finally, the last question examines which animation properties (and with what value)

can be used to create the desired animations. The focus of this question is on the

relationship between animation properties like size, duration, vibration rate (Lee et al.,

2002) and specific emotions. The intent of this question is to construct and refine an

emotion model that can be used to generate animations for captioning.

In order to evaluate and refine the proposed emotional model and to analyze user’s opinions of animated captions, a 2x2x3 factor empirical study involving hearing impaired and hearing participants, two different pieces of content and three different caption styles was carried out. One of the caption styles was conventional closed captioning, which was used as the control for the study.

This study applied a mix of quantitative and qualitative methods, as suggested in

Creswell (2003). The study used questionnaires, verbal protocol and probing interviews

to determine user opinions, understanding and interpretation of the emotions expressed through the captions, and attitudes and preferences regarding the animated captioning techniques. Data analysis consisted of thematic analysis of verbal protocol and interview data, and statistical analysis of the written questionnaire responses. Ethics approval was sought and received prior to commencing the study and the approval letter can be found in Appendix A. The deaf and hard of hearing (HOH) test participants were recruited via the hard of hearing association listserv and they were paid a convenience fee of $30 for their participation.

Since the study was exploratory and the population of people who are HOH is small

and relatively difficult to recruit, the number of HOH subjects was limited to 15. Ten

hearing subjects were recruited for comparison. Ten of the participants were male and 15

were female. The ages of the participants ranged from children to seniors with two under

the age of 18, three between 19 and 24, five between 25 and 34, four between 35 and 44,

seven between 45 and 54 and four participants between 55 and 64. The distribution of

ages is shown in Figure 10. Nineteen of the participants had college or university

training, two (the under 18s) completed elementary school, and four completed high

school.

Figure 10: Age distribution of participants (number of participants in each age range: under 19, 19-24, 25-34, 35-44, 45-54, 55-64).

Although five of the participants were deaf, only one of the participants was oral deaf.

Fourteen of the 15 participants were able to communicate through speaking. Since all of the participants (except the two children under the age of 18) had completed high school, their literacy level was considered to be high enough to understand verbatim captioning.

Section 4.01 Content used in this study

marblemedia (sic), a production company partner, provided a children's television program called Deaf Planet to be used for the captioning study. Deaf Planet is about an average man from Earth named Max, who crash lands on a planet inhabited by a population of deaf individuals. On this deaf planet, Max befriends a deaf female named

Kendra. Each episode of Deaf Planet shows Max and Kendra having different adventures where they discover facts about science. Since Kendra is deaf, she uses American Sign

Language (ASL) to communicate with the other characters in the show. Max’s robot, named Wilma, interprets between the hearing and deaf characters. Deaf Planet was

chosen as the test material for two primary reasons. Firstly, since marblemedia creates

shows for the hard of hearing and deaf audiences, they were more willing to incorporate the new caption model outlined in Chapter 3. They were willing to work with the research team to provide guidance about the characters' emotions and review the finished animations. Secondly, copyright laws restricted the type of content that could be used in the study. The permission of the content creator was required before showing the clips to the participants. marblemedia, as a partner in previous research endeavours, permitted the use of Deaf Planet in the study and for publications. In addition to copyright issues, I

needed a high resolution copy of a piece of content in an uncompressed format because

each time the clip was rendered with layers of text, the quality would deteriorate. The

research team had a strong enough working relationship with marblemedia to procure such high

quality video.

The sign language content of the show necessitated the exclusion of all

participants trained in ASL, since they would likely rely on the signing instead of the

captions for comprehension. As a result, only non-signing, deaf and HOH participants

were invited to take part in the study.

Two different clips of Deaf Planet were chosen for animated captioning. The clips were chosen based on having a representative range of emotional content and sound effects as determined by a committee (including the artist who subsequently created the animations). The content needed to contain at least five of the basic emotions identified by Ekman (1999) and Plutchik (1980), with a minimum of one example each. Two seasons of Deaf Planet episodes (26 in total) were viewed before two short clips from two different episodes (To Air is Human and Bad Vibrations) were selected.

In "To Air" (TA) the main characters are swallowed by a giant fish and experience varying degrees of fear, anger, sadness, happiness and disgust. In "Bad

Vibrations” (BV) two male characters compete for the attentions of a female character and experience emotions such as fear and anger. BV also contains three sound effects: whirring (the start of a space-craft engine), rumble (the earth shaking) and plop (snow falling on a character). As an informative, learning-oriented show, geared towards young children, Deaf Planet uses fairly simple and exaggerated expressions of emotion and sound effects. As a result, it was relatively simple to identify and categorize the emotions in the test clips. Before proceeding with the animations, the categorized emotions were verified by marblemedia.

The clips were captioned using two kinetic typography techniques labelled enhanced captioning and extreme captioning. Combined with conventional closed captioning, this produced a total of six videos (each of the two clips in three caption styles). Each of the test clips was short in duration, lasting between 1 minute 30 seconds and 2 minutes. Enhanced captioning and extreme captioning are discussed further later in this section.

A short, 30-second training clip was also produced using closed captioning to orient the participants to the study. The training clip introduced the characters and the premise of the show and allowed the test participants to adjust their environment

(volume, proximity to the television) according to their needs. Comments made by the participants during the training clip segment were not used during data analysis.

The caption styles chosen for testing were divided into three categories: enhanced captions (EC), extreme captions (XC) and conventional closed captions (CC).

Animations were applied to the captions to emphasize the emotional content of words or

phrases, to indicate background noise or sound effects and to denote emphatic words. The emotive animations were developed through an intuitive approach (described in Chapter

2), where an experienced graphic designer and artist created animations that he thought

(in his artistic experience) would convey the five basic emotions: anger, fear, sadness, happiness and disgust. Many of the artist’s choices corresponded with literature that connected prosodic features (e.g. pitch and volume) to animation properties (e.g. size and vibration). The animations were iteratively reviewed and revised by a committee until consensus was reached regarding their effectiveness in portraying the emotions. The steps in developing the emotive captions are outlined in detail in Section 2.01 in Chapter 2.

Figure 11 shows examples of each type of caption style used in this study. In the enhanced captioned version, the animations were intended to be unobtrusive and all words were placed at the same location on the screen at the bottom. The static placement of the captions represents the typical location of conventional captions.

The extreme captions included the same animations as the enhanced captions but the animated words and phrases were placed dynamically on the screen. In addition, the extreme caption animations were much larger than the enhanced version. The graphic designer chose this stylistic interpretation to convey his own artistic expression and emulate a new style of text display found in advertising.

Conventional closed captions were also produced for the videos as a control measure to compare the emotive styles of captioning to what is currently being used in the industry. The conventional captions were produced using the services of an online captioning company and are shown in Figure 11c.


Figure 11: Examples of each caption style: a. Enhanced Caption; b. Extreme Caption; c. Conventional Closed Captions

Section 4.02 Data Collection

(a) Pre-Study Questionnaire

Before viewing the videos, the participants were asked to complete a pre-study questionnaire. The questionnaire was divided into three basic parts. The first section contained five questions and documented demographic information: hearing level

(hearing, HOH or deaf), gender, age, educational level, and occupation. This demographic information was collected to assess whether the age, education level and occupation of the participants affected their reaction to the animated captions.

The second part of the questionnaire had four questions and sought to ascertain the participants' television viewing habits. The first question asked how much television/movies they watched in a week, with six choices ranging from less than 1 hour to more than 20 hours. Questions two and three asked whether the participants watched television by themselves or with others, using a five point Likert scale with categories of Always, Frequently, Sometimes, Seldom, and Never. Question four asked how often the participants discussed the television program they were watching with others, using the same Likert scale.

The third section of the questionnaire focused on how the participants used conventional closed captioning and asked their opinions on the current state of captioning. Section three contained six questions in total. The first question asked how often the participants watched television with captioning. The second question asked the participants to check off a list of features they liked about captioning: rate of speed of display, verbatim translation, placement on screen, use of text, size of text and colour. The third question provided the same features as question two but asked the participants to check off which ones they disliked. The fourth question provided the participants with a list of potential elements missing from captioning and asked them to check those they believed were important and wanted to see in the future. The elements for question four were a) emotion in speech, b) background music/notes, c) speaker identification, d) text displayed at the same time as the words being spoken and e) adequate information in order to understand jokes/puns correctly. The fifth question in section three asked the participants to select five characteristics of closed captioning that needed to be changed from a list of fourteen, including a) faster speed of displaying captions, b) text description for background noise, c) use of colours and d) use of different font and text sizes. The final question was open-ended and asked the participants to add any comments not captured by the previous questions.

The questionnaire was evaluated prior to the study by two people in order to eliminate ambiguity and confusion. The pre-study questionnaire can be found in

Appendix B.

(b) Data collection during viewing of the content

The study was designed as a within-subjects 3 (caption styles) x 2 (clips of Deaf Planet) factor design. The order of viewing of the clips was randomized for each test participant in order to reduce learning biases resulting from watching the same clip multiple times. Randomizing the viewing order also ensured that each clip was viewed for the first time by multiple participants.
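The sketch below illustrates one plausible way such a randomized viewing order could be generated per participant. The exact counterbalancing scheme is not reproduced here, so the clip and style labels and the shuffling strategy are illustrative assumptions rather than the study's actual procedure.

    import random

    CLIPS = ["To Air is Human", "Bad Vibrations"]
    STYLES = ["enhanced", "extreme", "closed"]

    def viewing_order(seed=None):
        """One randomized presentation order: each clip is shown in all three
        caption styles, with the clip order and the style order within each
        clip shuffled."""
        rng = random.Random(seed)
        order = []
        for clip in rng.sample(CLIPS, k=len(CLIPS)):
            for style in rng.sample(STYLES, k=len(STYLES)):
                order.append((clip, style))
        return order

    print(viewing_order())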

After the first viewing of the first version of each episode, participants were given a checklist and asked to identify which emotions were present in the clip. Next, they were asked to identify the checked emotions during a second viewing of the same version of the show. At this point, they could also add new emotions recognized in the show but not recalled during the checklist task. Since I wanted to avoid a learning effect and since people usually watch a program once, I wanted to obtain people’s initial reaction and understanding of the emotive content without much time to reflect.

Participants then watched the remaining versions of the episode and commented on the captioning style, the information presented and the show. The participants were encouraged to talk out loud their thoughts while watching the clips. All sessions were video taped as well as transcribed. After each set of three viewings for each episode, participants were asked to complete a post-viewing questionnaire to provide reactions to and opinions of the three different captioning styles.

(c) Post-study questionnaire

The post-study questionnaire contained ten questions relating to the participants' viewing experience. The first question asked participants to summarize their understanding of the test clip. Next, on a Likert scale ranging from 1 to 5, the participants were asked to rate the likability of various caption attributes for the enhanced captions

(color, size of text, speed of display, placement on screen, movement of the text, how well the animations portrayed the character’s emotions, and text descriptions) where 1 was “Liked very much” and 5 was “Disliked very much”. They were then asked to rate their affinity for the size, speed, screen placement and text descriptions of the conventional captions using the same Likert scale. Participants were asked to rate whether the enhanced captions contributed to their understanding of the show and the emotions, their level of confusion about the show, and their level of distraction. Finally, participants were asked for any concluding commentary or to expand on specific comments made during the show that they thought were particularly noteworthy. The post-study questionnaire was evaluated by two people before commencing the study in order to reduce ambiguous or confusing wording. The post-questionnaire can be found in

Appendix C Post-Study Questionnaire.

(d) Video Data

During the viewing of the various clips, the participants were encouraged to think out loud and provide commentary and opinions about the content and the captioning. After watching each clip, participants were given further opportunity to comment on the captioning styles and the content. All the studies were video taped to capture participants' comments.

(e) Experimental setup and time to complete study

The pre and post study questionnaires required five to ten minutes to complete and the total viewing time for the clips, including comments, was 15 to 20 minutes. In total, all studies were completed within 45 to 60 minutes.

The test environment simulated a home television viewing experience. As a

result, all the captions were viewed on a television set instead of a computer screen. The

participants were allowed to adjust their proximity to the television and the volume level

prior to starting the study.

(f) Data Collection and Analysis

To determine user opinion of and to compare the different caption styles, the video

data was subjected to a thematic analysis as proposed by Miles and Huberman (1994). First,

themes were identified and defined by three independent reviewers through viewing a

representative sample of the video data. These themes were then distilled into five main

themes (see Table 2) by consensus of the three reviewers. Each theme was further

divided into positive and negative sub-categories.

Two independent raters then evaluated two sample videos using the themes and sub- categories and the results were analyzed using an intraclass correlation (ICC) to determine reliability of the theme definitions. The single measures ICC between the two raters for all categories were 0.6 or higher. The remaining video data was then analyzed by a single rater.
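As an illustration of the reliability check, the sketch below computes a single-measures, one-way random effects ICC, i.e. ICC(1,1), on hypothetical rating counts for two raters. The thesis does not state which ICC model was used, so both the model choice and the sample numbers here are assumptions.

    import numpy as np

    def icc_one_way_single(ratings):
        """Single-measures, one-way random effects ICC(1,1).
        ratings: array of shape (n_items, n_raters)."""
        ratings = np.asarray(ratings, dtype=float)
        n, k = ratings.shape
        item_means = ratings.mean(axis=1)
        grand_mean = ratings.mean()
        ms_between = k * ((item_means - grand_mean) ** 2).sum() / (n - 1)
        ms_within = ((ratings - item_means[:, None]) ** 2).sum() / (n * (k - 1))
        return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    # Hypothetical counts of comments per theme assigned by the two raters
    ratings = [[3, 4], [1, 1], [5, 4], [2, 2], [0, 1], [4, 4]]
    print(round(icc_one_way_single(ratings), 2))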

Table 2: Description of the themes

Theme: Like/dislike Enhanced Captioning
  Description: Specific direct or implied comments about preferring/not preferring enhanced captions.
  Example Positive Comments: "I like this"; "this helps me understand better"
  Example Negative Comments: "I don't like this"; "this is confusing"

Theme: Like/dislike extreme captioning
  Description: Specific direct or implied comments about preferring/not preferring extreme captions.
  Example Positive Comments: "this helps me know who is speaking"
  Example Negative Comments: "this is too imposing"

Theme: Comments about closed captioning
  Description: Any comments about closed captioning.
  Example Positive Comments: "I like closed captioning better than the other two"
  Example Negative Comments: "This doesn't provide me with all the information"

Theme: Technical comments or comments unrelated to caption style
  Description: Captures all comments unrelated to the captions.
  Example Positive Comments: "This is an interesting show"
  Example Negative Comments: "Captions block off too much of the screen"

Theme: Impact of captioning
  Description: Impact of the captions as a whole.
  Example Positive Comments: "Captions are useful"
  Example Negative Comments: "Captions are distracting"

Theme: Impact on Genre
  Description: Any comments related to the appropriateness of animated captions to a specific genre.
  Example Positive Comments: "This would be great for comedies"
  Example Negative Comments: "This would never work for dramas"

Theme: Impact on people
  Description: Comments about how captions would impact other people.
  Example Positive Comments: "Kids would love this"
  Example Negative Comments: "Hearing people will find it distracting"

Theme: Sound Effects
  Description: All comments related to the sound effects.
  Example Positive Comments: "I like the rumbling noise sound effect"
  Example Negative Comments: "The word music should be replaced with notes"

The emotion identification portion of the video data was categorized separately using signal-detection theory as a basis for the design of the categories (Abdi, 2007). The categories were designed to take into account correctly detected stimuli ("hit"), undetected stimuli ("miss"), incorrect detection of absent stimuli ("false alarm") and correctly undetected absent stimuli ("correct rejection") (Abdi, 2007).

As mentioned in Chapter 2, the artist who created the animations for each of the emotions also categorized the emotions in the content with the help of a small committee.

The classified emotions were then approved by Deaf Planet’s production company, marblemedia, to verify the validity of the classification. A “hit” or correct identification was recorded when the user identified the correct emotion (as classified by the artist and approved by the production company) experienced by the characters within a two second timeframe of the occurrence of that emotion. The two-second window was selected to account for reaction time. The human information processor model proposed by Card,

Moran & Newell (1983) suggests that reaction time involves the time it takes for an individual to see (70-700 msec), perceptually process (50-100 msec), cognitively process

(25-170 msec) and respond with a physical action (30-100 msec) to a visual stimulus (a total of 175 msec - 1.1 sec). Unlike the model-human processor, where the task was to press a button when a correct stimulus appeared on a screen, the participants in this study completed a more complex series of tasks: watch a program, process the emotions and verbalize the emotions aloud. As a result, I increased the window of acceptable reaction time to two seconds.

Incorrect identification was noted when the user identified the “wrong” emotion

(compared to what was classified) but within the two second timeframe (+ or – 2

seconds). No identification or "miss" was noted when the viewer did not indicate that there was emotion when it was present in the captioning. Extra emotion identification or

“false alarm” was noted when the viewer identified an emotion outside of the classified emotion window. Correct rejection was not applicable in this case because the participants were not asked to identify instances where there was no emotion. As a result

d′ (d-prime), the index of discriminability of a signal, could not be calculated, and future studies might consider a mechanism to incorporate this element. The d′ statistic could then be used to compare hit, miss, false alarm and correct rejection rates between the enhanced caption style and the regular caption style as perceived by the users.
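A minimal sketch of how these categories could be assigned is shown below. The actual coding in the study was done by hand from the video data; the function names, the event representation and the example values are all illustrative assumptions, but the logic mirrors the hit, incorrect, miss and false-alarm-style definitions given above, using the two-second window.

    def classify_identification(reported_time, reported_emotion, events, window=2.0):
        """Classify one verbalized identification against the coded emotion
        events, where each event is an (onset_time_in_seconds, emotion) pair."""
        for onset, emotion in events:
            if abs(reported_time - onset) <= window:
                return "hit" if reported_emotion == emotion else "incorrect"
        return "outside_window"   # identified outside any classified emotion window

    def count_misses(identifications, events, window=2.0):
        """Coded events with no identification of any kind inside their window."""
        return sum(
            1 for onset, _ in events
            if not any(abs(t - onset) <= window for t, _ in identifications)
        )

    # Hypothetical coded events and one participant's verbalizations
    events = [(10.0, "fear"), (25.0, "anger")]
    reported = [(11.0, "fear"), (40.0, "happiness")]
    print([classify_identification(t, e, events) for t, e in reported])
    print(count_misses(reported, events))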

Table 3: Emotion identification categories

Category: Exact
  Definition: Correctly identified emotion within the correct time frame.

Category: Incorrectly identified
  Definition: Incorrectly identified emotion within the correct time frame.

Category: No identification/Miss
  Definition: Did not identify the emotion at all within the correct time frame.

Category: Extra emotion identification
  Definition: Identified emotion outside of the designated time frame.


Chapter 5. Results

Section 5.01 Questionnaire Data

Twenty-five participants completed one pre-study and two post-study

questionnaires during the session. The pre-questionnaire results revealed that 11 of 25

(44%) of the participants spent one to five hours watching television per week, five of 25

(20%) watched six to ten hours of television per week, six of 25 (24%) watched 11 to 15 hours of TV, one of 25 (4%) watched 16 to 20 hours of TV and two of 25 (8%) watched

more than 20 hours of TV a week.

Thirteen of 25 participants (52%) watched television alone either always or

frequently. Eight of 25 participants (32%) said they watched television alone sometimes

and four of 25 (16%) said they seldom watched television alone. Fifteen of 25 (60%) of

participants had conversations with friends and family while watching television, and 40% seldom or rarely had conversations with friends and family while watching

television. Twenty of 25 (80%) reported using closed captioning either always or

frequently and five of 25 (20%) said they never used captioning when watching

television. The participants were asked what they liked most about captioning and 40%

claimed they liked the rate of display, 40% chose verbatim translation of the dialogue,

12% chose use of text, 8% chose use of black and white colours and 4% chose the "other" category (they were allowed to choose more than one option).

When reporting what elements the participants thought were missing from closed

captioning, the most popular answer, with 44%, was the lack of emotions in speech,

followed by inadequate information for jokes and puns with 32%, speaker identification with 16%, lack of synchronization with 16% and lack of background noise/music with 12%.

Finally, the participants were asked to choose five characteristics of existing closed captions that they would like to see changed. Most participants (84%) chose the use of different fonts and text sizes, and 68% chose speaker identification. Less popular were speed of captions with 44%, floating captions and use of colours with 36% and text description for emotions with 32%.

A chi-square (χ2) analysis of response categories comparing responses with an

expected frequency of equal distribution among all answer categories for all questions in

the post-study questionnaire for both episodes, To Air is Human (TA) and Bad Vibrations

(BV), showed some interesting similarities. There was a significant difference between

response categories at the p < 0.05 level for most questions regarding the “likability”

ratings for the enhanced captions attributes for both episodes. The “likability” ratings, as

mentioned in Section 4.02, ranged from “Liked very much” (score of 1) to “Disliked very

much” (score of 5) on a 5-point Likert scale and were applied to colour, size, speed,

placement, movement and emotiveness of the captions. Table 4 and Table 5 show the

chi-square results and the means and standard deviations (SD) for each attribute ("likability" rating) that showed significance for episodes TA and BV respectively.
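The goodness-of-fit test described above can be illustrated in a few lines. The counts below are hypothetical, not the study's data; with no expected frequencies supplied, scipy's chisquare function tests the observed counts against an equal distribution across the answer categories.

    import numpy as np
    from scipy.stats import chisquare

    # Hypothetical counts for one 5-point likability question
    # (1 = liked very much ... 5 = disliked very much)
    observed = np.array([11, 8, 4, 1, 1])

    chi2, p = chisquare(observed)   # expected defaults to an equal distribution
    print(round(chi2, 1), round(p, 4))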

Table 4: χ2, degrees of freedom (df), mean and standard deviation for enhanced captioning attributes for To Air.

Attribute Χ2 df Mean SD

Colour of text 12.7 4 2 1.1

Size of text 21.6 3 1.8 0.7

Speed of caption display 14.8 3 1.8 0.8

Placement on screen 14.8 3 1.7 0.8

Movement of text on screen 15.8 3 1.9 1.0

Portrayal of emotions 17.2 4 2 1.0

For To Air, 17 of 24 (71%) rated the colour as positive or very positive, five of 24

(21%) rated the colour as neither liked nor disliked and two of 24 (8%) reported disliking the colour. Twenty-three of 25 (92%) reported the size as liked or very much liked but eight percent said they were neutral or they disliked the size. Twenty-two of 25 (88%) liked or very much liked the speed while 12% reported being neutral or negative towards the speed. Twenty-two of 25 (88%) reported placement of text as positive or very positive but 12% reported placement to be neutral or negative. Twenty-one of 25 (84%) reported the movement of the text on the screen as positive or very positive while 16% rated movement as neutral or negative. Finally, 20 of 25 (80%) reported the portrayal of the character’s emotion using movement of the text on the screen as positive or very positive, 12% remained neutral in this category while 8% said they either disliked or disliked very much the portrayal of emotions.

Table 5 χ2, degrees of freedom (df), mean and standard deviation for enhanced captioning attributes for Bad Vibes.

Attribute χ2 df Mean SD

Colour of text 10.5 2 2.0 0.6

Size of text 15.5 3 2.0 0.8

Speed of caption display 19.6 3 1.9 0.8

Placement on screen 12.2 3 1.9 0.9

Portrayal of emotions 14.8 4 2.0 1.0

Text description 16.8 3 2.2 0.8

For Bad Vibes, 19 of 23 responses (83%) rated the colour as positive or very positive and 17% rated the colour as neutral or negative. Twenty-one of 25 (84%) reported the size as positive or very positive, while 16% reported being neutral or negative towards the size. Twenty of 25 (80%) liked or very much liked the speed while

20% were neutral or negative towards the speed. Twenty-one of 25 (84%) reported placement as positive or very positive while 16% were either neutral or negative towards the placement. Nineteen of 25 (76%) reported portrayal of the character’s emotion using movement of the text on the screen as positive or very positive, while 16% reported being neutral and 8% reported being negative or very negative. Finally, 19 of 25 (76%) reported the text descriptions as positive or very positive while 24% reported being either neutral or negative towards text descriptions.

Also, the χ2 results between response categories for the question regarding

whether participants would be willing to discuss the show with hearing friends or family

showed a significant difference for both episodes [χ2(4, 25) = 13.9 for TA and χ2(4,

25) = 12.8 for BV, p < 0.05]. The mean and SD for TA were 4.0 and 1.1 respectively on the

Likert scale of 1 to 5 where 1 was “not willing to discuss show” and 5 was “very willing”. Eighteen of 25 (72%) respondents said that they were either willing or very willing to discuss the show with hearing friends. The mean and SD for BV was 3.9 and

1.1 respectively. Eighteen of 25 (72%) of respondents said that they would be willing or very willing to discuss the show with hearing friends.

For Bad Vibes, a χ2 analysis between response categories showed a significant difference for the question regarding the level of distraction caused by the enhanced captions [χ2(4, 25) = 14.8, p < 0.05] with mean = 3.0, SD = 1.3. Thirteen of 25 (52%) respondents said that they were very or somewhat distracted by the enhanced captions, while 10 of 25 (40%) said that they were not distracted or only slightly distracted by the enhanced captions.

When the descriptive statistics for the extreme captions were examined, there were some conflicting results. First, 19 of 28 (68%) responses very much liked or liked how the emotions were portrayed with the captions. However, a majority of participants, 16 of 28 (57%), disliked the movement of the captions.

A t-test analysis comparing two independent samples was carried out between the hard-of-hearing and hearing subject groups for each caption style of each different episode for the questionnaire variables related to captioning attributes. For To Air, a significant difference was found in the enhanced caption text descriptions attribute [t(23)

= -3.43, p< 0.05] where the average responses for the hard of hearing (HOH) group was

1.64 (SD = 0.63) and the hearing group was 2.45 (SD = 0.52). A significant difference was also found in the extreme caption colour attribute [t(12) = -2.31, p < 0.05], where the

average responses of the HOH group was 1.67 (SD = 0.52) and hearing group was 2.50

(SD = 0.76).

For Bad Vibes there was one significant difference, in the enhanced caption placement attribute [t(23) = -4.29, p < 0.05], between the two groups. The hard of hearing group very much liked/liked the placement of the enhanced captions (m = 1.57, SD = 0.85), while the hearing group liked/felt neutral about the enhanced captions placement (m = 2.27, SD = 0.79).

A second t-test analysis was carried out to determine differences in the caption attributes between the two episodes, "Bad Vibes" (BV) and "To Air Is Human" (TA) for all participants. The movement of the text for the enhanced captions between these episodes was found to be significant [t (48) = -2.2, p < 0.05] where mean = 1.7, SD = 1.0 for TA and mean = 2.3, SD = 1.1 for BV. The text description for conventional captioning was also found to be significantly different [t(48) = -2.2, p < 0.05] for TA

(mean = 2.3, SD = 0.9) and for BV (mean = 2.8, SD = 0.9).

A t-test analysis between both episodes for hearing subjects only was also carried out. There were no significant differences between any of the enhanced or conventional caption attributes for the two episodes.

A final t-test was carried out between both episodes for only the hard-of-hearing subjects. The movement of text attribute for the enhanced captions was significantly different between the two episodes [t (26) = -2.5, p < 0.05] (TA mean = 1.4, SD = 0.8 and

BV mean = 2.3, SD = 1.1). The text description of enhanced captioning was found to be significantly different between the episodes [t (26) = -2.3, p < 0.05] (for TA mean = 1.6,

SD = 0.6 and for BV mean = 2.3, SD = 0.8). The text description of conventional

captioning was also found to be significantly different between the episodes [t (26) = -

2.3, p < 0.05] (for TA mean = 2.0, SD = 0.88 and for BV mean = 2.9, SD = 1.1).

Crosstab analyses are applied to nominal data in order to uncover correlations between two variables. Crosstab analyses were carried out on all of the participant data to examine whether there was any correlation between the ratings of understanding (provided on a 5-point Likert scale) and the various attributes of the enhanced captions. For this analysis, data were separated by participant group (HOH and hearing) and aggregated over both episodes in order to ensure enough data points.
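The crosstab procedure can be sketched as follows on hypothetical records (not the study's data): the ratings are cross-tabulated and a Pearson chi-square test of independence is run on the resulting contingency table.

    import pandas as pd
    from scipy.stats import chi2_contingency

    # Hypothetical post-study responses for one participant group, aggregated
    # over both episodes; one row per response.
    df = pd.DataFrame({
        "size_rating":   [1, 2, 1, 3, 2, 2, 1, 2, 3, 1],   # likability of text size
        "understanding": [5, 4, 4, 3, 5, 4, 5, 3, 3, 4],   # rated understanding
    })

    table = pd.crosstab(df["size_rating"], df["understanding"])
    chi2, p, dof, expected = chi2_contingency(table)
    print(round(chi2, 1), round(p, 3), dof)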

For the hearing group, there were six significant crosstab Pearson correlations found for the enhanced caption attributes of:

1) Size of the text and the overall understanding of the show [ χ2(12, 22) = 24.8, p<0.05]

where 13 of 22 responses rated their level of confusion as low and liking the size of

the text;

2) Size of the text and level of distraction [ χ2 (12, 22) = 27.7, p<0.05] where 8 of 22

respondents rated the level of distraction of the enhanced captions as not distracting

or slightly distracting and liking or very much liking the size. However, 7 of 22

respondents reported an inverse relationship where the enhanced captions somewhat

or greatly distracted them but they still liked or liked very much the size of the

captions;

3) Movement of the text on screen and understanding of the show [ χ2(16, 22) = 34.3,

p<0.05] where 12 of 22 responses found their level of understanding increased or

greatly increased and they liked or liked very much the animated text;

4) Movement of the text on screen and distraction [χ2(16, 22) = 36.2, p<0.05] where 8 of

22 responses found they were only slightly or not distracted and they liked or liked

very much the animated text;

5) Portrayal of the emotions using the animated text and understanding of the show

[χ2(12, 22) = 30.5, p<0.05] where 14 of 22 responses found their level of

understanding increased or greatly increased and they liked or liked very much the

animated text; and

6) Portrayal of the emotions using the animated text and distraction [ χ2 (12, 22) = 27.9,

p<0.05] where 9 of 22 responses found they were slightly or not distracted and they

liked or liked very much the animated text. However, 8 of 22 respondents reported an

inverse relationship where the enhanced captions somewhat or greatly distracted them but they still liked or liked very much the emotions portrayed in

the captions.

For the HOH group, there were seven significant crosstab Pearson correlations found

for the enhanced caption attributes of:

1) Size and the overall understanding of the show [χ2(8, 28) = 18.9, p<0.05] where 15 of

28 responses rated their level of understanding of the video increased or greatly

increased and liking the size of the text;

2) Size and level of understanding of the character’s emotions [ χ2(8, 28) = 21.0, p<0.05]

where 17 of 28 respondents rated the level of understanding of the emotions as

greatly increased or increased and liking or very much liking the size of the text;

3) Speed of the captions appearing on the screen and confusion with the show [ χ2(8, 28)

= 21.6, p<0.05] where 10 of 28 responses found their level of confusion with the

show as less or much less and liking or liking very much the speed of animated text;

4) Speed of the captions appearing on the screen and understanding of the character’s

emotions [ χ2(8, 28) = 21.0, p<0.05] where 15 of 28 responses found their level of

understanding increased or greatly increased and liking or liking very much the speed

of animated text;

5) Speed of the captions appearing on the screen and distraction [ χ2(8, 28) = 20.5,

p<0.05] where 17 of 28 responses found they were not distracted or slightly distracted

and liked or liked very much the speed of animated text;

6) Placement on the screen and understanding of the show [ χ2(12, 25) = 24.9, p<0.05]

where 16 of 28 responses found their level of understanding increased or greatly

increased and they liked or liked very much the placement of the animated text on the

screen; and

7) Placement on the screen and understanding of the character’s emotions [ χ2(12, 28) =

23.0, p<0.05] where 18 of 28 responses found their level of understanding increased

or greatly increased and liked or liked very much the placement of the animated text on the screen.

Section 5.02 Video Data

As described in Chapter 4, the video data was coded using a thematic analysis approach according to the themes and sub-categories in Table 2. The comments made by participants were recorded and then analyzed using the pre-defined themes. Many participants commented negatively and positively at various times regarding the same

theme. Repeated measures ANOVA and t-tests were applied to analyze the video data and compare the mean data among the various groups. ANOVA and t-tests were appropriate for this data set because it met the homogeneity of variance and independence of cases criteria. However, as the data set was not normally distributed, non-parametric tests such as the Kruskal-Wallis and Mann-Whitney tests were also applied. Since the results were identical between the parametric and non-parametric tests, the more common parametric results were reported. Due to the number of tests, a Bonferroni adjustment was applied to the 0.05 alpha level to derive an adjusted alpha value of 0.02, reducing Type I errors (incorrectly detecting a significant difference).
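As a simplified illustration of this parametric versus non-parametric cross-check, the sketch below compares hypothetical per-participant comment counts for one theme between the two groups and applies the adjusted alpha. It reduces the repeated measures designs reported here to an independent-groups comparison purely for illustration.

    from scipy.stats import ttest_ind, mannwhitneyu

    # Hypothetical per-participant comment counts for one theme (not study data)
    hoh_counts = [2, 3, 1, 4, 2, 3, 5, 2]
    hearing_counts = [4, 5, 3, 6, 4, 5, 4, 6]

    t_stat, p_parametric = ttest_ind(hoh_counts, hearing_counts)
    u_stat, p_nonparametric = mannwhitneyu(hoh_counts, hearing_counts,
                                           alternative="two-sided")

    # Bonferroni-adjusted alpha used in place of 0.05 when several tests are run
    alpha_adjusted = 0.02
    print(p_parametric < alpha_adjusted, p_nonparametric < alpha_adjusted)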

A t-test was carried out to determine whether there was a difference between the

hearing and HOH participants for each theme. A significant difference was found

between the hearing and HOH group in the Impact on People Positive category [t (7) = -

3.0, p <0.02]. The results of video comments analyses somewhat mirrored results from

the questionnaire data, where no significant differences were found between the hearing

and the HOH in most categories.

In order to determine whether the participants had a preference regarding the style

of captioning, the number of positive and negative comments for the three captioning

styles was analyzed using Repeated Measures Analysis of Variance (ANOVA). There

was a significant difference among the three caption styles in the positive [F (2, 39) =

9.32, p< 0.02] and negative categories [F (2, 41) = 4.11, p<0.05]. A Tukey post-hoc test

revealed significance for:

1) Positive comments between enhanced captioning and extreme captioning (p =

0.002).

2) Positive comments between enhanced captioning and closed captioning (p =

0.005).

3) Negative results between extreme captions and closed captions (p = 0.018).

The Repeated Measures ANOVA revealed no significant differences for:

1) Positive comments between extreme captioning and closed captioning

2) Negative comments between enhanced captions and extreme captions.

Table 6 and Figure 12 compare the number of positive and negative comments made by the HOH and hearing groups for each category. The largest number of total comments was in the enhanced caption positive category (75 comments made by HOH and hearing participants combined). This was, however, closely followed by the number of total negative comments for the extreme captions (a total of 68 negative comments by HOH and hearing participants combined). The smallest number of total comments was for the extreme caption positive sub-category, with only eleven comments between the HOH and hearing participants. This was followed by 12 total comments for the Impact on People negative sub-category.

As shown in Table 6, Table 7 and Table 8, 12 of 15 (80%) of the HOH (m = 2.8,

SD = 1.71) and 10 out of 10 (100%) hearing participants (m = 4.2, SD = 1.93) made at least one positive comment about the enhanced captions. The total number of positive comments about the enhanced captions made by the HOH group was 33 and total number of positive comments made by the hearing group was 42.

Four of 15 (27%) HOH participants (m = 2.8, SD = 1.25) commented negatively on enhanced captions and one of 10 (10%) of hearing participants (m = 4.0, SD = 0)

commented negatively about the enhanced captions. The total number of negative comments related to enhanced captioning was 11 for HOH and four for hearing.

Seven of 15 (46.67%) HOH participants (m = 1.4, SD = 1.13) and four of 10

(40%) of the hearing participants (m = 1.8, SD = 0.96) made positive comments about closed captioning, with total positive comments for closed captioning of 10 and seven respectively. Ten of 15 or (66.67%) of the HOH participants (m = 2.2, SD = 1.14) and nine of 10 (90%) of hearing participants (m = 2.0, SD = 1.11) made negative comments about closed captioning, with a total of 22 and 18 negative comments respectively.

Extreme captioning received the fewest number of positive comments out of all three styles with only four of 15 (26.67%) HOH (m = 1.3, SD = 0.50) and five of 10

(50%) of hearing participants (m = 1.2, SD = 0.45) commenting positively. The total number of positive comments for extreme captioning was five for the HOH group and six for the hearing group. Conversely, extreme captioning had the most negative comments out of the three styles. Eleven of 15 (73%) HOH participants (m = 3.6, SD = 1.69) and nine of 10 (90%) of hearing participants (m = 3.1, SD = 1.76) commented negatively about extreme captions. In total, the HOH group made 40 negative comments and the hearing group made 28 negative comments about extreme captions.

The technical category amalgamated comments about a number of elements: the content (Deaf Planet), the use of the translucent background for the captions, comments regarding colour, size or speed of the captions and the television screen. Eight of 15 or

53% of HOH participants (m = 1.8, SD = 1.03) made positive comments about the technical category with 14 positive comments in total. Only one out of 10 (10%) hearing participant (m = 1.0, SD = 0) made one positive comment about the technical category.

The HOH group made 13 negative comments in the technical category with six of 15

(40%) of the participants commenting (m = 2.2, SD = 1.33). The hearing group made 17 negative comments in the technical category with 7 of 10 (70%) of participants commenting (m = 2.4, SD = 1.40).

Comments regarding the sound effects were divided almost equally between positive and negative. Six of 15 (40%) of HOH participants made 16 positive comments in total about the sound effects (m = 2.7, SD = 1.63). Five of 10 (50%) of hearing participants made seven positive comments about the sound effects (m = 1.4, SD = 0.54).

Eight of 15 (53%) HOH participants commented negatively about sound effects with a total of 16 negative comments about the sound effects (m = 2.0, SD = 1.07). Four of 10

(40%) of hearing participants made eight negative comments about the sound effects (m

= 2.0, SD = 1.41).

Comments in the Impact of Captioning category concerned participants' opinions about how the captions contributed to their television viewing experience. Eight of the 15 (53%) HOH participants commented favourably about the overall impact of the captioning on their entertainment (m = 2.3, SD = 2.05) and seven of 10 (70%) of the hearing participants commented positively (m = 1.6, SD = 1.13) on the impact of captioning. Conversely, nine of 15 (60%) of HOH participants commented unfavourably about the impact of captioning (m = 1.6, SD = 0.72) while five of 10 (50%) of the hearing participants commented negatively on the impact of captioning (m = 1.4, SD = 0.55).

The Impact on People category represented data on how the participants thought the animated (enhanced and extreme) captions would affect other people. Six of 15 (40%) HOH participants commented that the animated captions would positively affect other viewers (m = 1.17, SD = 0.41), with a total of seven positive comments. Three of 10 (30%) hearing participants said that the animated captioning would affect others positively (m =

2.7, SD = 1.20), with total positive comments of eight. Six of 15 (40%) HOH participants said the animated captioning would affect others negatively with 11 negative comments in total (m = 1.8, SD = 1.33). Only one of 10 (10%) of hearing participants made one comment that the animated captions would affect others negatively with 1 comment in total (m = 1.0, SD = 0).

The final category measured was Impact on Genres, designed to determine whether the participants thought that the emotive captions would be suitable for all genres. Eleven of 15 (73%) of the HOH participants suggested that the emotive captions could be applied to all genres (m = 1.6, SD = 0.92) and five of 10 (50%) of the hearing participants suggested that these captions would be appropriate for all genres (m = 1.8,

SD = 1.10). Six of 15 (40%) of the HOH participants thought emotive captions would not be appropriate across many genres (but not all) (m = 1.8, SD = 1.17) and four of 10

(40%) of the hearing participants thought emotive captions were not appropriate for all genres (m = 1.5, SD = 1.00).

Table 6: Number of positive/negative comments by groups for the video data

Columns: Theme; Hard of Hearing: Number of People, Total Comments; Hearing: Number of People, Total Comments; Total Comments

Enhanced Captions Positive 12 33 10 42 75

Enhanced Captions Negative 4 11 1 4 15

Closed Captions Positive 7 10 4 7 17

Closed Captions Negative 10 22 9 18 40

Extreme Captions Positive 4 5 5 6 11

Extreme Captions Negative 11 40 9 28 68

Technical Positive 8 14 1 1 15

Technical Negative 6 13 7 17 30

Sound Effects Positive 6 16 5 7 23

Sound Effects Negative 8 16 4 8 24

Impact of Captions Positive 8 18 7 11 29

Impact of Captions Negative 9 13 5 7 20

Impact on People Positive 6 7 3 8 15

Impact on People Negative 6 11 1 1 12

Impact on Genre Positive 11 18 5 9 27

Impact on Genre Negative 6 11 4 6 17

Table 7: Descriptive statistics for HOH group in all categories of the video data analysis

N Mean SD

Enhanced Captions Positive 12 2.75 1.71

Enhanced Captions Negative 4 2.75 1.26

Closed Captions Positive 7 1.43 1.13

Closed Captions Negative 10 2.20 1.14

Extreme Captions Positive 4 1.25 0.50

Extreme Captions Negative 11 3.64 1.69

Technical Positive 8 1.75 1.04

Technical Negative 6 2.17 1.33

Sound Effects Positive 6 2.67 1.63

Sound Effects Negative 8 2.00 1.07

Impact of Captions Positive 8 2.25 2.05

Impact of Captions Negative 9 1.44 0.73

Impact on People Positive 6 1.17 0.41

Impact on People Negative 6 1.83 1.33

Impact on Genre Positive 11 1.64 0.92

Impact on Genre Negative 6 1.83 1.17

Table 8: Descriptive statistics for hearing group in all categories of the video data

N Mean Standard Deviation

Enhanced Captions Positive 10 4.20 1.93

Enhanced Captions Negative 1 4.00 .

Extreme Captions Positive 5 1.20 0.45

Extreme Captions Negative 9 3.11 1.76

Closed Captions Positive 4 1.75 0.96

Closed Captions Negative 9 2.00 1.12

Technical Positive 1 1.00 .

Technical Negative 7 2.43 1.40

Sound Effects Positive 5 1.40 0.55

Sound Effects Negative 4 2.00 1.41

Impact of Captions Positive 7 1.57 1.13

Impact of Captions Negative 5 1.40 0.55

Impact on People Positive 3 2.67 1.15

Impact on People Negative 1 1.00 .

Impact on Genre Positive 5 1.80 1.10

Impact on Genre Negative 4 1.50 1.00

Figure 12: Comparison of the three captioning styles (enhanced, extreme, closed) by number of positive and negative comments.

Section 5.03 Emotion ID

Participants were asked to identify the emotions present in the first clip of the six that they watched. These identifications were then analyzed according to five different categories (four for the conventional captions): 1) correct identification within a two

second window from the time the emotion occurred (to account for response time processing); 2) identification of a different/incorrect emotion within the two second window; 3) missed identification; 4) emotion identified outside of the two second timeframe of when the animation occurred, and 5) emotions identified for non-emotion animation (e.g., for sound effects).

A t-test was used to assess differences in emotion ID between the hearing and HOH groups. A significant difference was found between hearing and HOH participants for the emotion ID of fear (t (23) = -2.11, p<0.05) where the HOH made an average of 1.7 identifications of fear (SD = .82) and the hearing identified a mean of 2.9 identifications of fear (SD = 1.7). No other significant differences were detected.

As shown in Table 9, the HOH group identified an emotion correctly 55% of the time and the hearing group 50% of the time. The HOH group incorrectly identified an emotion 6% of the time, the same as the hearing group. The HOH group missed identification 13% of the time, while the hearing group missed identification 23% of the time. The HOH group identified an emotion outside the animation window 12% of the time, while the hearing group did so 13% of the time. Finally, the HOH group identified an emotion during other animations 13% of the time, while the hearing group did so 7% of the time.

Table 9: Average number of emotions identified in each category (percentages of each group's total identifications shown in parentheses)

Group              Total #      Correct emotion/   Incorrect emotion/   Missed emotion   Identified outside   Identified during
                   identified   correct time (%)   correct time         completely       animation window     other animations
Hard of Hearing        67           37 (55)             4 (6)               9 (13)             8 (12)               9 (13)
Hearing               125           63 (50)             8 (6)              29 (23)            16 (13)               9 (7)
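The percentages in Table 9 are simply each category's count divided by the group's total number of identifications; a minimal sketch of that arithmetic, using the hard of hearing counts from the table, is shown below.

    # Percentages in Table 9: category count divided by the group's total identifications.
    hoh_counts = {
        "correct emotion/correct time": 37,
        "incorrect emotion/correct time": 4,
        "missed completely": 9,
        "outside animation window": 8,
        "during other animations": 9,
    }
    total = 67

    for category, count in hoh_counts.items():
        print(f"{category}: {count / total:.0%}")   # e.g. 37/67 -> 55%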

A Repeated Measures ANOVA was applied to detect any differences between the type of emotion and its identification. A Bonferroni adjustment was applied to the alpha level, giving an adjusted alpha of 0.02. Significance was detected in the correct emotion identification category [F(4, 58) = 12.23, p < 0.02]. A post-hoc Tukey analysis revealed the following (a minimal analysis sketch follows this list):

1) Correct identification of fear versus anger (p = 0.005).

2) Correct identification of fear versus sadness (p < 0.001).

3) Correct identification of fear versus happiness (p < 0.001).

4) Correct identification of fear versus disgust (p < 0.001).
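A minimal sketch of this kind of analysis is shown below, assuming the identification counts have been arranged in long format (one row per participant and emotion); the file and column names are illustrative assumptions, not those used in the thesis.

    # A sketch of a repeated-measures ANOVA with a post-hoc comparison, as described above.
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Expected long format: one row per participant x emotion, with the number of
    # correct identifications in "correct_ids" (hypothetical file and column names).
    df = pd.read_csv("emotion_identifications.csv")

    # Repeated-measures ANOVA across the five emotions.
    anova = AnovaRM(df, depvar="correct_ids", subject="participant", within=["emotion"]).fit()
    print(anova)

    # Tukey post-hoc comparisons, evaluated at the Bonferroni-adjusted alpha of 0.02
    # reported in the text.
    tukey = pairwise_tukeyhsd(df["correct_ids"], df["emotion"], alpha=0.02)
    print(tukey)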

As can be seen from Table 10, anger was correctly identified 50% of the time, mistakenly identified 4% of the time, missed completely 13% of the time, identified outside of the animation window from non-caption related cues (such as facial expressions) 23% of the time, and identified during other animations (such as sound effects) 10% of the time.

Fear was correctly identified 67% of the time, mistakenly identified 7% of the time, completely missed 15% of the time, identified from non-caption related cues 7% of the time and identified during other animations 5% of the time. Sadness was detected correctly 46% of the time, mistaken for another emotion 14% of the time, missed 32% of the time, perceived based on non-caption cues 4% of the time and identified during other animations 4% of the time. Happiness was correctly identified 37% of the time, mistakenly identified 7% of the time, missed entirely 41% of the time, never identified outside of the animation window, and identified during other animations 15% of the time. Finally, disgust was identified 42% of the time, never mistaken for another emotion, missed 8% of the time, detected outside of the animation window 29% of the time and identified during other animations 21% of the time.

Of all of the emotions, fear appeared to be correctly identified more often than the other emotions (67% of the time fear was correctly identified when it occurred), while happiness was missed more often than the other emotions (41% of the time it occurred it was missed). Disgust appears to be identified based on non-caption related cues (e.g., facial expressions, gestures or voice) more often than any of the other emotions (it was identified based on non-caption cues 29% of the time).

Table 10: Emotions versus type of identification (percentages of each emotion's total identifications shown in parentheses)

Emotion      Total #      Correct emotion/   Incorrect emotion/   Missed emotion   Identified outside   Identified during
             identified   correct time (%)   correct time         completely       animation window     other animations
Anger            52           26 (50)             2 (4)               7 (13)            12 (23)               5 (10)
Fear             61           41 (67)             4 (7)               9 (15)             4 (7)                3 (5)
Sadness          28           13 (46)             4 (14)              9 (32)             1 (4)                1 (4)
Happiness        27           10 (37)             2 (7)              11 (41)             0 (0)                4 (15)
Disgust          24           10 (42)             0 (0)               2 (8)              7 (29)               5 (21)

A Repeated Measures ANOVA was applied to determine possible differences in emotion understanding between the caption styles. No significant differences were found. The conventional captioning clips did not have animations; as such, the fifth identification category (emotions identified during other animations) shows no counts. Table 11 shows the total percentage for each type of emotion identification listed by captioning category (percentages shown in parentheses). Emotions were correctly identified most often for the conventional captions (57%) but also most often incorrectly identified (9% of the time). Emotion identification was missed most often for the extreme captions (26% of the time).

As shown in Table 11, emotions in the conventional captions were correctly identified 57% of the time, incorrectly identified 9% of the time, not identified 16% of the time, identified outside of the animation window 20% of the time, and never identified during other animations (since conventional captioning did not contain any animations). For enhanced captions, emotions were correctly identified 50% of the time, incorrectly identified 2% of the time, not identified 19% of the time, identified outside of the animation window 11% of the time and identified during other animations 19% of the time. Finally, for extreme captions, emotions were correctly identified 47% of the time, incorrectly identified 7% of the time, not identified 26% of the time, identified as a result of non-animation related cues 5% of the time and identified during other animations 14% of the time.

Table 11: Caption category versus type of identification (percentages of each style's total identifications shown in parentheses)

Caption style            Total #      Correct emotion/   Incorrect emotion/   Missed emotion   Identified outside   Identified during
                         identified   correct time (%)   correct time         completely       animation window     other animations
Conventional Captions        81           46 (57)             7 (9)              13 (16)            15 (20)               0
Enhanced Captions            54           27 (50)             1 (2)              10 (19)             6 (11)              10 (19)
Extreme Captions             57           27 (47)             4 (7)              15 (26)             3 (5)                8 (14)

Chapter 6. Discussion

Animated captions were completely novel to all the participants, except in the form of animated titling in film and television. As a result, the findings of this study were unexpected and generally surprising. In addition, since the method used to generate the captions was artistic, audience reactions could not be predicted in advance. One desired outcome of the study was to establish future research directions for how non-dialogue sound information can be better represented to users of closed captioning.

The audiences’ ability to understand the intention of the animated captions and their level of acceptance of this approach were critical factors in achieving this outcome.

The first objective of the user study was to gain an understanding of the level of acceptance of animated captions by the hearing and hearing impaired audience members. The videotaped comments and questionnaire data were examined closely to gain insight into this objective. The hard of hearing (HOH) and hearing viewers seemed to prefer the enhanced captions to the conventional or extreme captions for both episodes. As seen in Table 6, 80% (22 of 25) of participants made at least two positive comments about the enhanced captions, with a combined total of 75 positive comments. Comparatively, only 44% (11 of 25) commented positively about conventional captioning and 36% (9 of 25) commented positively about extreme captioning, with totals of 17 and 11 comments respectively. The repeated measures ANOVA of the number of positive comments made about each of the captioning styles was statistically significant. A Tukey post-hoc showed that there was a significant difference in user preference for enhanced captions over conventional captioning (p = 0.005) and extreme captioning (p = 0.002). In the interviews following completion of the questionnaire, participants made comments such as enhanced captions "added a new dimension" and "I want to see more captioning like this…someone should market it", which reinforced the results from the video and questionnaire data. Participants liked that enhanced captions showed the inflection of voice and emotions and that it "wasn't flat" because it "added excitement."

In contrast to the enhanced captions, the extreme captions, and to a lesser extent the conventional captions, were criticised far more. Nineteen of 25 participants commented negatively about the conventional captions and 21 of 25 participants commented negatively about the extreme captions, generating a total of 40 and 68 negative comments respectively. Comparatively, only five people made negative comments about enhanced captions, with a total of 15 negative comments. The Repeated Measures ANOVA uncovered significance in the number of negative comments among the three captioning style categories. A Tukey post-hoc revealed that there was a significant difference between the extreme captions and the conventional captions (p = 0.018). Participants commented in the open discussion session that the extreme captions "does not add value" and exclaimed "what was the point of that" while watching. They went as far as to make comments such as "Deep 6 aka bury this one." Comments about the conventional captions were more moderate but showed a noticeable preference for enhanced captions through complaints like "it won't be easy going back to the regular [conventional] captions". The extreme captions were described as overwhelming, distracting and too bouncy by a majority of participants. The conventional captions were found to be less exciting and less informative than the enhanced captions.

77 For the HOH group, the ratio of positive to negative comments for the enhanced captions was 3 to 1 and for the hearing group, it was 10.5 to 1, providing additional evidence of the preference trend. For the conventional and, particularly, the extreme captions, the ratio of positive to negative was far less than 1, meaning many more negative than positive comments were made. Given that 73% of the HOH participants reported using conventional captions on a regular basis, it is surprising to find such strong support for the enhanced captioning style.

The questionnaire results support the findings from the video data and tell a similar story. Enhanced captions received a very positive response on all of the "caption" attributes (Colour, Size of Text, Speed of Display, Placement on Screen, Text Description, Movement of Text, How Moving Text Portrayed Emotion and Text Descriptions), with ratings of positive or very positive from more than 70% of all participants. The questionnaire data also revealed that many of the participants in the HOH group reported being less confused (54%), less distracted (61%) and more perceptive of the characters' emotions (64%) in the show while watching enhanced captions. The open-ended comments and discussion after each show also support these findings. Participants made comments such as "liked emphasis on words and bouncing to match actions" and "dramatic improvement over traditional captioning, specifically when trying to convey emotions in a show".

Sixty-seven percent of the people reported liking the extreme captioning style overall in the questionnaires; however, 57% of people reported disliking the movement of the extreme captioning text. The extreme captioning was bold, colourful and blended in well with the children's program. It is possible that the participants enjoyed the aesthetics of the captions as an artistic piece but found the shaking letters difficult to read.

On the whole, the video data, questionnaire data and free-form comments indicate that enhanced captions were well understood and accepted by the participants. Enhanced captions were also identified by the participants as an improvement over the traditional captions (at least in short examples). Additionally, people described themselves as willing or very willing to discuss the show with hearing friends. The willingness of the HOH participants indicates enough confidence in their level of understanding of the show's content to discuss it with hearing friends who can access the sound through the audio channel. Communication difficulties can often exist between caption user and non-caption user groups regarding show content. Hearing impaired caption users often believe that they do not necessarily have sufficient information about the show, and thus they are reluctant to discuss it with hearing viewers (Fels et al., 2005). Captions that seem to increase users' comfort level with the content are an important improvement towards more equal access to television and film content.

Another objective of my study was to uncover possible differences between the HOH and hearing groups in their reactions to the animated captions. A large portion of caption users are not hearing impaired. Also, many hearing impaired viewers use captions in the presence of hearing friends and family. Thus, analyzing and comparing the reactions of the two groups was important to consider. Surprisingly, only the category of "Impact on People" showed a significant difference between the two groups.

The data revealed that six of the 15 HOH participants thought that hearing people considered all captions, not just animated captions, to be disruptive to their television viewing experience. These six HOH users had particular reservations about the acceptance of the animated (enhanced and extreme) captions by their hearing peers. One HOH participant said, for example, that hearing people would be greatly bothered and distracted by the animated captions. Contrary to what the HOH participants believed, 100% of the hearing users reacted very positively to the enhanced captions (but not to the extreme captions). The questionnaire data supported the video data and showed that 61% of people in the hearing group found that their level of understanding of the show increased with the addition of the animated text. In other words, the moving text did not impede the hearing participants' understanding of the show but rather enhanced it.

These types of assumptions can colour people's ability to empathize with others, in this case hearing television viewers with hard of hearing viewers and vice versa. This discrepancy in perception may be worth exploring in future research.

The questionnaire data showed another significant difference between the hearing and HOH groups in the placement of the text on the screen for the enhanced captions. The placement of the enhanced captions was near the bottom of the television screen, which closely resembled that of closed captions. The HOH group seemed to like it very much (mean = 1.5, SD = 0.75, where 1 is liked very much and 5 is disliked very much), while the hearing group did not like it as much, although the mean still fell close to the liked category (mean = 2.1, SD = 0.83). This difference in preference can be explained by examining the caption viewing habits of the two groups. None of the hearing participants reported always watching television with captions, only "occasionally" (60%) or "never" (40%); however, 73% of the HOH group reported always watching television with captions. Since the HOH participants are more accustomed to this position than the hearing participants, their preference for the familiar placement would naturally be higher.

All the questionnaire responses were analyzed separately using crosstab analyses to understand possible correlations between participant ratings of caption attributes and comprehension attributes. These crosstabs were aggregated over two episodes and grouped by hearing status to ensure enough data for each category and to detect any differences between the two groups.

In the HOH group, those who liked the speed of the enhanced caption presentation also reported being less confused (36%), less distracted (61%) and having a better understanding of the characters' emotions in the show (54%). The standard captioning speed is recommended at 145 words per minute (Jensema, 1997), and in this study this standard was used as the rate at which enhanced captions appeared. It appears that this reading speed was optimal for this participant demographic. For the hearing group, the crosstab analysis showed that 63% of hearing participants liked the portrayal of emotions using enhanced captions and that this portrayal also increased their level of understanding of the show. This positive correlation indicates that hearing viewers appear to derive value from viewing television shows and films with enhanced captions.
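A minimal sketch of one such crosstab is shown below, assuming the questionnaire responses are stored one row per participant; the file and column names are illustrative assumptions, not those of the actual study data.

    # A sketch of a crosstab of a caption attribute rating against a comprehension rating.
    import pandas as pd

    responses = pd.read_csv("post_study_questionnaire.csv")       # hypothetical file
    hoh = responses[responses["hearing_status"] == "HOH"]

    # Cross-tabulate the rating of enhanced-caption speed against reported confusion,
    # expressed as row proportions.
    table = pd.crosstab(hoh["enhanced_speed_rating"], hoh["confusion_rating"], normalize="index")
    print(table)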

Having captions at the bottom of the screen is a conventional display style, and it is expected that regular users of captions would show a preference for this style. The crosstab results showed that the placement of captions at the bottom of the screen was correlated with the participants' understanding of the show and of the characters' emotions. Participant comments showed that the familiar placement prevented the viewers from hunting around the screen for the caption. Participants made strong suggestions for placement by saying "leave the captions at the bottom". In addition, animated text captions that were placed in different locations on the screen were found to be generally undesirable. Participants commented that "a little word here and there is distracting" and that the unusual placement of words caused "competition for attention." Fels et al. (2005) reported similar preferences expressed by participants in a study with graphic captions: deaf and HOH users strongly disliked having captions located near the speaker because captions located close to the face of a speaking person were interpreted by viewers as an instruction to lip read that person.

The more likely cause of this dislike of captions in different screen locations is that more extensive eye movement, or work, is required to process and read the dynamic captions. In my study, not only were extreme captions placed in different locations on the screen, but only some of the words from the captions were placed in those different locations. In order to read the whole caption, the viewer had to focus on the full text at the bottom of the screen and then move their eyes to the moving text placed in the different screen locations. Because the text was animated, it easily attracted and distracted the viewer's attention, a common reaction to moving objects (Peterson & Dugas, 1972). Having only some words animated at different locations on the screen forced the user's attention to be diverted from reading the caption sentences. Consequently, the user's eyes moved back and forth between the moving words and the static captions, increasing the work that the eyes and the cognitive system must do. Therefore, I would recommend that moving text in captions be contained within the full caption text and be located at the bottom of the screen.

The crosstab analysis also showed a positive relationship between the size of the captions and participants' level of understanding of the show. In their study on people's preferences for captions, the National Centre for Accessible Media found that people preferred "medium" sized captions, where medium was defined as "roughly equivalent to the 26 lines of the TV picture", or 26 pixels (NCAM, n.d.c, section Test Element: Caption Size, 1). The size of the enhanced captions in this study began at 25 pixels and generally increased with the various emotions, closely following the recommended font sizes. For emotions such as anger, fear and happiness the animation became larger; only sadness became smaller in vertical size. It is not surprising, then, that this size was positively correlated with participants' levels of understanding. No comments were made on the impact of changing the caption sizes. Further research is required, with more examples of animated captions, to determine what size attributes are causing the positive relationship between understanding emotions and caption size.

A final objective of my research was to measure the participants' understanding of the emotions experienced by the characters of the show as a result of the animated text. In attempting to attain this objective, I realized that reliably measuring people's levels of understanding of a piece of content is tremendously difficult, particularly when one group has a communication disability. No standard measures exist for establishing equivalency between HOH and hearing groups because most tests of understanding involve questions developed by hearing people, which are often inappropriate for comparison. Thus, the design for measuring the understanding of emotions had drawbacks, which are discussed in the limitations section of this thesis.

The emotion analysis portion of the study showed that hearing participants accurately identified more emotions than the HOH participants. However, the only significant difference between the groups was for fear, where the HOH group made on average 2.0 correct identifications while the hearing group made 2.7 correct identifications.

The fear emotion occurred more often than any of the other emotions, and fear was identified by more people than the other emotions. Thirty-two percent of all emotion identifications were for fear, compared with 27% for anger, 15% for sadness, 14% for happiness and 13% for disgust. Fear was also identified correctly more often than all other emotions (67%), and had a low percentage of incorrect (7%) and missed identifications (15%). Out of all the emotions, fear was identified significantly more accurately [F(4, 58) = 12.23, p < 0.02] than the other emotions for both groups. The results thus indicate that the animation for fear represented the emotion to the participants with a high degree of clarity.

Anger also showed a similar trend to that of fear, although the results were not significant. The anger animation was correctly identified 50% of the time, incorrectly identified only 4% of the time, and missed completely 13% of the time. The high number of correct identifications (and low number of incorrect identifications) indicates that the animation for anger could correctly represent the emotion. However, further research will need to be carried out in order to validate the anger animation, especially using content that contains a range of anger emotions.

As seen in Table 10, happiness, disgust and sadness seem less understandable and more difficult to interpret, although sadness had a 46% correct identification rate. Possible explanations for this result are: 1) the animation mapped somewhat unsuccessfully to the emotion; or 2) too few instances of these three emotions were present compared to anger and fear in the content. Instances of anger and fear occurred three times each in the clips while happiness, disgust and sadness occurred only once.

The mixed results in the case of happy, sad and disgust indicate that more research is definitely required in order to create more reliable animations. However, it is important to note that the main purpose of this exploratory study was to establish if animated captions are a worthwhile avenue to pursue, not to establish categorically consistent and accurate emotion animations.

One surprising result of the study was that emotions were correctly identified most often with the conventional captions (57%), more often than with the enhanced (50%) or extreme captions (47%), although these differences were not significant. Also, emotions were missed 16% of the time for conventional captions, 19% for enhanced and 26% for extreme. Emotions were incorrectly identified for the conventional captions more than for the other types (9% compared with 2% for enhanced captions and 7% for extreme captions). When compared with how well enhanced captions were rated relative to extreme and conventional captions, the identification data seem difficult to rationalize. It is possible that enhanced and extreme captions increase the intrusive nature of captioning, thus making it difficult for the participants to understand what is going on. However, this seems unlikely since 64% of participants found that their understanding of the characters' emotions was increased by enhanced captioning. Thus, the more plausible explanation for the contradictory results can be attributed to the methodology and the difficult task of measuring understanding of television content. Many events were occurring simultaneously when participants were asked to view a new piece of content with a new style of captioning and then verbalize the emotions of the characters. These events included watching the content, understanding and interpreting the emotions, looking at the checklist to keep track of the available options and speaking the appropriate emotion out loud. Since participants were already familiar with closed captioning, the task of identifying and verbalizing the emotions may have been less difficult there than with enhanced captioning. Animated text is inherently more distracting than static text; as a result, enhanced captioning, with its added level of distraction, may have hindered the viewers' ability to correctly verbalize, rather than correctly understand, the emotion.

Other categories of interest measured in the study were the technical category and the sound effects category, although neither showed any statistical significance between the HOH and hearing groups or between positive and negative comments. The technical category amalgamated several factors related to captioning, such as the translucent bar against which the animated captions were displayed, and the content (Deaf Planet).

Viewer opinions about the translucent bar were somewhat divided. Three viewers appreciated that the show was more visible because of the translucency, which indicates that the translucency feature of the EIA-708 standard may be useful in enhancing the viewing experience. The negative comments about the bar (by one participant) related to the fact that it stayed on even when the captions were not being displayed. For this particular study, it was too time consuming to synchronize the translucent bar with the appearance of captions. In the future this problem should be rectified so that the translucent bar disappears when there are no captions. The comments about the content, Deaf Planet, were also divided (six participants liked it, four disliked it). Six participants appreciated Deaf Planet's inclusive theme and enjoyed watching a program geared towards sign language users. One of the hearing viewers was not accustomed to signing and was distracted by signing characters that did not speak. Other participants (hearing and HOH) disliked qualities intrinsic to children's programming such as extreme colours, over-expressive acting, and slower, simpler plot lines. It is imperative that animated captions be evaluated in further research using different genres and different lengths of content, in order to eliminate the biases (positive and negative) that are inherent in a sign language based children's show.

There were almost an equal number of positive (23) and negative (24) comments made by the users in the category of sound effects. Eleven of 25 users found that animating the sound effects was beneficial in providing more information not readily available from the dialogue and visual cues. However, the animation of the sound effects, particularly in the Bad Vibration segment, was thought to be too overpowering and distracting. In particular, the seemingly innocuous use of dancing, colourful letters to depict music was criticised by most HOH viewers as being unnecessary and offensive.

These users suggested that the word “Music” be replaced by musical notes and mentioned that the word “music” unwittingly draws attention to their disability.

The questionnaire data mirrored what was found in the video data regarding the animated sound effects. Seven people reported not liking the word “Music” animated to represent the emotion in the music and suggested animating music notes instead of letters. People also reported not liking the animation for the word “rumble” because it was unreadable due to rapid motion and a faded appearance.

When the questionnaire data was grouped by episode and analyzed, a significant difference was found between the two episodes (To Air and Bad Vibes) for the text description (or sound effects) category. A closer look revealed that the significance was being caused by the HOH group. For To Air, the HOH group reported liking the text descriptions very much (mean = 1.6, SD = 0.63) compared with Bad Vibes, where they reported only liking the text descriptions (mean = 2.3, SD = 0.83). As Bad Vibes had more sound effects than To Air, more text descriptions were used in it (e.g., "rumble", "whirling" and "music"). This data seems to indicate that overusing words to represent sound effects may not be a suitable direction to pursue.

The voiced opinions about the sound effects were decidedly split. The sound effects were reported to be helpful but also distracting. It appeared that participants preferred the sound effects when dialogue text was not present on screen, as opposed to having dialogue text and description text displayed simultaneously. Participants complained that having dialogue and sound description at the same time is "hard to watch," because there is "too much going on". Also, the participants wanted the effects to stop showing as soon as they had a chance to read them, instead of having the description continue as long as the sound was present. For instance, in one scene in the test clips, a rumbling noise can be heard continuously for about 20 seconds. When the sound effect "Rumble" was displayed throughout those twenty seconds, the participants commented that they wanted the word to disappear as soon as they had a chance to read it. One participant commented, for example, that "… the word rumble went on for too long".

This is very surprising considering that one of the most desirable but missing elements of captioning is synchronization. When it comes to sound effects, however, users do not seem to want the description to be synchronized with the sound, that is, to have the sound effect description displayed for as long as the sound is present. Since not all participants commented on the sound effects, it is difficult to generalize a trend. However, it seems that viewers have difficulty associating the descriptive text with the effect itself.

Furthermore, animating descriptive words to mimic a particular sound effect by manipulating the speed and amplitude of the word may render it unreadable and ineffective. As a result, animating descriptive text to exactly match the sound effect (as was done in this case) may not be the most appropriate technique. Animated graphics may be one possible alternative to descriptive text, as suggested by some of the participants, especially in the case of music. However, effective graphics can be difficult and costly to create, and the number of possible sound effects can be overwhelming. It would be impractical to create a library of graphics for a captioner to use containing all possible sound effects. Caption users could instead be engaged in defining specific sound categories and creating meaningful graphics through a participatory design approach (Clement & Van den Besselaar, 2004). While the number of possible sound effects is limitless, some commonality can be established using group consensus. Caption users can then create sound effect animations that are most understandable to them collectively.

Another solution may be to use an animated, graphical bar situated at an inconspicuous part of the screen. This bar would move according to the sound effect (similar to the equalizer bars found on media players or the sound display bars on a stereo). In the future, tactile elements could also be introduced in conjunction with visuals to present sound effects to users. Hot and cold sensations could be used to emulate angry or tension-building sounds, while vibrations could mimic things like rumbling effects.
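As a rough illustration of the graphical-bar idea, the following sketch maps the loudness of short audio frames to a bar height that could be drawn in a corner of the screen; the frame length and maximum height are illustrative assumptions, not values from the thesis.

    # A sketch of an "equalizer bar": convert frame loudness (RMS) to a bar height in pixels.
    import numpy as np

    def bar_heights(samples: np.ndarray, frame_len: int = 1024, max_height: int = 50) -> np.ndarray:
        """Return one bar height per audio frame, proportional to that frame's RMS loudness."""
        n_frames = len(samples) // frame_len
        frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len).astype(float)
        rms = np.sqrt((frames ** 2).mean(axis=1))
        peak = max(rms.max(), 1e-9)                 # avoid dividing by zero on silence
        return np.round(rms / peak * max_height).astype(int)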

Essentially, much more research is needed in this area before any concrete conclusions can be drawn.

Overall, animated text captions appeared to be accepted and understood by the participating viewers. The fact that enhanced captions were particularly well-liked by the participants indicates that viewers are able to interpret meaning from these animated captions. It also appears that, at least based on the results of this study, the fear and anger emotions are accurately represented by animated text. The animations for the happy, sad and disgust emotions still need to be revised. It appears that animation properties such as variation in size, direction of movement, change in opacity and manipulation of fonts and colours have the potential to be mapped to specific emotions and applied to captioning.

Section 6.01 Limitations

(a) Emotion Identification Method

The most difficult part of the study was designing an accurate way of gauging the relationship between the various animations and the emotions as understood by viewers.

Watching a completely new piece of content with animated captions (a style never seen before) is difficult for the participants in the first place. The HOH viewers were accustomed to watching conventional captioning, and the results of the study may have shown a novelty effect. Conversely, as the hearing viewers were not regular users of any captioning, having to look at captioning (especially in a study environment) could also have been somewhat disconcerting. None of the participants had seen the show "Deaf Planet" prior to the study, so they were completely unfamiliar with the characters, the setting, and the plot. The test clips were very short (approximately one and a half minutes long) and neither one started at the beginning of the episode, causing further confusion. In future studies, captioning an entire episode of a show that is familiar to the participants is recommended to reduce the confusion. The captioned episode should also be viewed by the participant from the beginning to ensure maximum comprehension. However, watching a half-hour show multiple times in a study environment would be difficult to arrange with test participants, as it would likely last for several hours, and data capture and analysis for such a long study would also be difficult to coordinate. I would suggest, then, that study participants be allowed to view longer episodes at home, possibly over several hours.

The emotion identification process followed by the participants was a complex one: watching a show, identifying the emotions the characters were feeling on a checklist, and then verbalizing those emotions using words from the checklist during a second viewing. It should be noted here that only one of the hard of hearing participants was oral deaf; as a result, the verbal protocol method was appropriate for this study. All of the participants had the option of writing down their responses but none (except the one who was oral deaf) chose to do so. The addition of new and distracting factors (a new show and a new style of captioning) to the identification process only made an already difficult task harder. In addition, it was obvious from the video data that participants sometimes seemed to want to speak but did not because the scene changed. This occurred particularly during enhanced and extreme captioning. The normal television viewing experience does not require the viewer to decompose content into parts such as emotions, and asking people to do so may have been too complex a task to achieve successfully.

A physiological measure such as galvanic skin response may be more appropriate, since it would eliminate the need for the participants to articulate the characters' emotions. However, these techniques measure arousal of emotion in the participant rather than their understanding of it (Vetrugno et al., 2003). Sliders or dials could also have been used to identify emotions, but the number of emotions was quite high, and asking the participants to remember which dial or slider corresponded to which emotion would have had its own limitations.

To accurately measure understanding of emotion in the context of watching television, the participants need to watch more than just a few minutes of a show. They also need to watch the show from the beginning to fully understand what is going on. In addition, verbalizing emotions while watching content and concentrating on it is very difficult and highly unnatural. One interesting and possibly more accurate way to measure understanding would be to allow the participants to communicate the story of the show after watching it, through a focus group. In this way, a facilitator could encourage all participants to make story contributions to develop a rich and thorough discussion of the content.

(b) Communication barriers

In addition to the difficulty of designing the study, relaying the instructions to the HOH users was somewhat difficult as well. Many of the participants had severe hearing loss and were not expert lip readers, even though they themselves were able to communicate articulately. Communicating the complex instructions for identifying the emotions to some of the HOH participants while they watched the clips was at times very difficult. The hearing participants were much more communicative during the emotion identification than the HOH participants. In fact, several HOH emotion ID portions had to be excluded because the instructions were not properly conveyed. One HOH participant did not understand that he was supposed to verbalize his identifications while watching the videos; instead he continued to mark them down on the emotion checklist during the second viewing. Another HOH participant, who was oral deaf, was only able to communicate through writing, and it was difficult for her to start and stop the video each time she wanted to identify an emotion. Three of the HOH users were also visually impaired, adding further communication obstacles.

These barriers to communication are very difficult to mitigate in a study setting. For future studies involving HOH users, an online chat session through which participants and the facilitator can communicate by typing may be helpful. Running through a short, but complete, practice study for training would also be helpful, especially when complex directions are involved. Future studies should also be set up in a usability lab specifically designed to accommodate HOH users. The lab should have minimal background noise and high lighting levels so that lip readers can see the study facilitator clearly. If any of the HOH users in future studies are sign language users, then interpreters should be used to ease the communication difficulties.

(c) Small Number of Participants

The animated caption study was conducted with a relatively small number of participants: only fifteen hard of hearing and ten hearing individuals. While the smaller participant groups are sufficient to gauge the soundness and legitimacy of animated text captions, larger control and experimental groups are needed before categorical conclusions can be drawn.

(d) Limited Genre

Deaf Planet is not only a children's show, it is also an atypical one. The premise of the show is based on science fiction, and it is thus a mixture of animation and live action. Furthermore, all but two characters in the show are deaf and communicate using sign language, meaning that all communication between the main characters is conducted through an interpreter (who is female but imitates male voices when necessary). Finally, the actions, colours and acting of Deaf Planet are highly exaggerated to appeal to a younger audience. While animated captions worked well with this show, its unusual nature makes it very difficult to predict the extent to which animated captions would be applicable to other genres such as drama, horror or comedy. It should be noted that all content has built-in biases that can influence the effectiveness of a study of this nature. Thus, it is imperative that animated captions are tested with as many genres and as many different shows as possible.

(e) Short and Limited Test Clips

The test clips were too short and did not provide the participants enough time to fully understand the story or the characters. Furthermore, since Deaf Planet is a children's show, the range of emotions was very narrow, with only a few instances of exaggerated anger and fear and even fewer instances of happiness, sadness and disgust. Since I am trying to establish an emotion model based on a few basic emotions combined with an intensity rating, a show with a much broader range of emotions (e.g., from sarcasm to rage) is necessary for future research.

Chapter 7. Recommendation/Conclusion

Closed captioning has remained essentially unchanged since its inception and little has been done to improve or alter it over the past few decades. One of the issues that caption users have identified as problematic is the lack of access to non-dialogue sound information such as music and speech prosody. The purpose of this thesis work has been to investigate one alternate method of presenting that non-dialogue sound information, in the form of animated text. A framework was constructed that mapped a model of emotions, consisting of a set of basic emotions, onto properties of animation that represent those emotions. Example video content that contained implementations of this framework with text captions was then evaluated with hard of hearing and hearing users.

The first research question of my thesis asked whether participants are able to understand, appreciate and interpret animated text captions; that is, whether caption viewers can watch and understand a show with animated text. Based on participant reactions, I conclude that animated text shows promise as an effective avenue for representing non-dialogue sound information, because most of the participants liked and understood this new method. Enhanced captioning was very well received by all of the participants and was thought to be an improvement over conventional captioning.

However, extreme captioning, where animated text was located in different spots around the screen, was shown to be distracting to the participants. Therefore, my recommendation is that the extreme captioning style (with dynamic placement of text) be abandoned and that only enhanced captioning be used for continued investigation.

The second research question asked whether the emotions can be accurately and consistently represented by the animated text. Of all the emotive animations, the animation used to express fear was more obvious to the participants than any of the other animations. Thus, I recommend that the animation for fear be used consistently to represent that emotion. Next, the animation for anger, although its results were not statistically significant, was understandable to most people (based on the descriptive statistics); I therefore suggest that the animation used to represent anger is also acceptable. The animations for the remaining three emotions (happy, sad, and disgust) were identified correctly on fewer occasions than fear or anger. However, the two test clips had many more instances of anger and fear than happy, sad or disgust. As a result, I recommend that the animation styles for happiness, sadness and disgust be re-evaluated with content containing more of these emotions.

The final research question asked which properties of animation relate to which properties of emotion. To answer that question, I refer to the animation/emotion model outlined in Chapter 2. As the model shows, the fear emotion can be represented with a constant expansion and contraction of size, combined with a vibration effect for the duration of the animation. The anger emotion can be conveyed with an abrupt, one-cycle expansion and contraction of the text, with the vibration occurring at the largest size. Sadness is represented with text decreasing vertically, moving in a downward motion and decreasing in opacity. The happy animation is almost the opposite of the sad one, with a vertical increase of the text and upward movement from the baseline.
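To make this mapping concrete, the following sketch expresses the model as a simple lookup table; only the qualitative behaviours come from the model described above, and the field names themselves are illustrative.

    # A sketch of the emotion-to-animation mapping as a lookup table (illustrative field names).
    ANIMATION_MODEL = {
        "fear": {
            "size": "repeated expansion and contraction for the duration of the caption",
            "vibration": "throughout",
        },
        "anger": {
            "size": "single abrupt expansion and contraction",
            "vibration": "at the largest size",
        },
        "sadness": {
            "size": "vertical shrink",
            "motion": "downward",
            "opacity": "decreasing",
        },
        "happiness": {
            "size": "vertical growth",
            "motion": "upward from the baseline",
        },
    }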

Animating descriptions of sound effects is definitely an area to be investigated further; however, text descriptions may need to be replaced with meaningful symbols or graphics. As long as animated text descriptions are tested for readability and understanding, they may still be used sparingly to represent sound effects effectively. Based on this study, it is apparent that all text suggesting musical elements should be replaced with musical notes.

Section 7.01 Future Research

The participants in my study were only able to watch two short clips (less than two minutes each) containing animated captions for a specialty children's program. In order to better establish the effectiveness of this new style of captioning, full-length (30 to 60 minute) shows need to be captioned using this method and tested with a larger number of participants. Furthermore, the subject matter of Deaf Planet is unusual and directed towards young children. A more adult-oriented, conventional show needs to be captioned in order to gauge the reactions of a wider audience, including those who are sign language users. A typical comment by participants was that animated captions appear more appropriate for comedies and dramas. Further research is needed to examine the appropriateness and acceptability of animated text for full-length television and film shows in different genres and to determine the limitations of text captions within those genres.

The animations for sad, happy and disgust must be evaluated and reapplied to a show that contains more of these emotions. It is possible that one or all of these animations must be either abandoned or fine-tuned for accuracy.

The hearing participants in this study appeared to be overwhelmed by the presence of both sound and captions. Future studies need to recreate a typical environment for hearing viewers, where sound is muffled or reduced, such as in an exercise setting. An improved method for measuring audience understanding of emotion that does not rely on participants verbalizing what emotions they think the characters are feeling is also required for future evaluation.

If the same approach is used for the emotion identification portion of the study, where participants verbalize the emotions present on screen, a complete signal detection methodology can be applied. Using the signal detection approach, participant responses can be categorized into hits, misses, false alarms and correct rejections. The data can then be analyzed using the d′ (d-prime) statistic to examine the relationship among the four categories for each captioning type.
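A minimal sketch of the d′ calculation from hit and false-alarm rates (in the spirit of Stanislaw & Todorov, 1999) is shown below; the counts are illustrative, not study data.

    # A sketch of computing d' from hit and false-alarm rates via the inverse normal CDF.
    from scipy.stats import norm

    hits, misses = 41, 20                     # signal trials: emotion present (hypothetical)
    false_alarms, correct_rejections = 4, 56  # noise trials: no emotion present (hypothetical)

    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)
    print(f"d' = {d_prime:.2f}")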

The primary purpose of this explorative study was to establish the viability of animated text as captions. Since it seems that animated captions are acceptable, expressive and effective, more research time and effort can be committed to this method.

An iterative, participatory design approach (Clement & Van den Besselaar, 2004) may be used to either refine or create the emotive animations. End users of captioning can work collectively to arrive at animations that best convey emotions from their perspective. The participatory design approach could be most helpful in creating meaningful animations for sound effects.

Section 7.02 Conclusion

The results of this first study are encouraging and suggest that animating captions is one way of capturing more of the sound information contained in television and film. People who are hard of hearing indicate that they want more sound information, and it appears that animated text may provide progress toward that goal. The emergence of digital television should be leveraged to improve the quality of captions, and it is apparent from this research that most users are ready and willing to accept new forms of captioning.

Chapter 8. References

Abdi, H. (2007). Signal detection theory. In Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage.

Abrahamian, S. (2003). EIA-608 and EIA-708 captioning. Retrieved September 19, 2006, from http://www.evertz.com/resources/eia_608_708_cc.pdf

Ambrose, G., & Harris, P. (2006). The fundamentals of typography. New York: Watson-Guptill Publications.

Barnett, S. (2001). Clinical and cultural issues in caring for deaf people. Retrieved April 3, 2007, from http://www.chs.ca/info/vibes/2001/spring/clinicaldeaf.html

Blanchard, R. N. (2003). EIA-708-B closed captioning implementation. Paper presented at the IEEE International Conference on Consumer Electronics, 80-1.

Bodine, K., & Pignol, M. (2003). Kinetic typography-based instant messaging [Electronic version]. CHI '03 Extended Abstracts on Human Factors in Computing Systems, 914-1. Retrieved September 7, 2006.

Boltz, M. G. (2004). The cognitive processing of film and musical soundtracks [Electronic version]. Memory & Cognition, 32(7), 1194-15.

Bruner, G. C. (1990). Music, mood and marketing [Electronic version]. Journal of Marketing, 54(4), 94-11.

Canadian Association of Broadcasters. (2004). Closed captioning standards and protocol for Canadian English language broadcasters. Retrieved February 6, 2007, from http://www.cfv.org/caai/nadh20.pdf

Canadian Hearing Society. (2004). Status report on deaf, deafened and hard of hearing Ontario students. Retrieved January 19, 2007, from http://www.chs.ca/pdf/statusreport.pdf

Canadian Radio-television and Telecommunications Commission. (2000). Decision CRTC 2000-60. Retrieved April 7, 2007, from http://www.crtc.gc.ca/archive/eng/Decisions/2000/DB2000-60.htm

Card, S. K., Moran, T., & Newell, A. (1983). The psychology of human-computer interaction. Hillsdale, NJ: Lawrence Erlbaum Associates.

Chion, M. (1994). Audio-vision: Sound on screen. New York: Columbia University Press.

Clement, A., & Van den Besselaar, P. (2004). Proceedings of the Eighth Conference on Participatory Design: Artful integration: Interweaving media, materials and practices (PDC 2004), Toronto, Ontario, Canada, July 27-31.

Consumer Electronics Association. (2005). CEA-608-C: Line 21 data services. Retrieved September 20, 2006, from http://www.ce.org/Standards/StandardDetails.aspx?Id=1506&number=CEA-608-C

Consumer Electronics Association. (2006). CEA-708-C: Digital television (DTV) closed captioning. Retrieved January 13, 2007, from http://www.ce.org/Standards/StandardDetails.aspx?Id=1782&number=CEA-708-C

Cook, N. D. (2002). Tone of voice and mind: The connections between intonation, emotion, cognition and consciousness. John Benjamins Publishing Company.

Creswell, J. (2003). Qualitative, quantitative, and mixed methods approaches. Thousand Oaks, CA: Sage.

Cruttenden, A. (1997). Intonation. Cambridge: Cambridge University Press.

DCMP. (n.d.). Steps in the captioning process. Retrieved March 17, 2007, from http://www.dcmp.org/caai/nadh29.pdf

Downey, G. (2007). Constructing closed-captioning in the public interest: From minority media accessibility to mainstream educational technology [Electronic version]. Info, 9(2/3), 69-13.

Ekman, P., & Friesen, W. V. (1986). A new pan-cultural facial expression of emotion [Electronic version]. Motivation & Emotion, 10, 159-19.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. J. Power (Eds.), Handbook of cognition & emotion (pp. 301-19). New York: John Wiley.

Ernst, S. B. (1984). ABC's of typography (Revised ed.). Art Direction Book Co.

Fels, D. I., Lee, D. G., Branje, C., & Hornburg, M. (2005). Emotive captioning and access to television. AMCIS 2005, Omaha.

Fels, D. I., & Silverman, C. (2002). Emotive captioning in a digital world. ICCHP 2002, 284(7).

Ford, S., Forlizzi, J., & Ishizaki, S. (1997). Kinetic typography: Issues in time-based presentation of text. Paper presented at CHI '97 Extended Abstracts on Human Factors in Computing Systems: Looking to the Future, Atlanta, Georgia, 269-1.

Forlizzi, J., Lee, J. C., & Hudson, S. E. (2003). The kinedit system: Affective messages using dynamic texts. Paper presented at the Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Ft. Lauderdale, Florida, 377-7. Retrieved from the Conference on Human Factors in Computing Systems database.

Frijda, N. (1986). The emotions. New York: Cambridge University Press.

Gallaudet Research Institute. (2007). How many deaf people are in the United States. Retrieved June 19, 2007, from http://gri.gallaudet.edu/Demographics/deaf-US.php

Geffner, D. (1997). First things first. Retrieved March 28, 2006, from http://www.filmmakermagazine.com/fall1997/firstthingsfirst.php

Harkins, J., Korres, E., Singer, M. S., & Virvan, B. M. (1995). Non-speech information in captioned video: A consumer opinion study with guidelines for the captioning industry. Retrieved March 11, 2006, from http://www.cfv.org/caai/nadh126.pdf

Jacko, J., & Sears, A. (2007). The human computer interaction handbook: Fundamentals, evolving technologies and emerging applications. CRC Press.

Jensema, C. J. (1997). Final report for presentation rate and readability of closed captioned television. Retrieved June 15, 2007, from http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/15/75/be.pdf

Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? [Electronic version]. Psychological Bulletin, 129(5), 770-14.

Lee, J. C., Forlizzi, J., & Hudson, S. E. (2002). The kinetic typography engine: An extensible system for animating expressive text. Paper presented at the 15th Annual ACM Symposium on User Interface Software and Technology, Paris, France, 81-9.

Lewis, M. S. J. (2000). Television captioning: A vehicle for accessibility and literacy. Paper presented at CSUN, Los Angeles, California.

Marshal, J. K. (n.d.). An introduction to film sound. Retrieved June 20, 2007, from http://www.filmsound.org/marshall/index.htm

Miles, M. B., & Huberman, M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). Newbury Park, CA: Sage.

Minakuchi, M., & Tanaka, K. (2005). Automatic kinetic typography composer. ACM International Conference Proceeding Series, Vol. 265, 221(3).

Mowrer, O. H. (1960). Learning theory and behavior. New York: Wiley.

National Captioning Institute. (2003). New analytical study of closed captioning finds audiences think it's important but improvements are needed. Retrieved March 12, 2006, from http://www.ncicap.org/AnnenbergStudy.asp

NCAM. (n.d.a). International captioning project description of line-21 closed captions. Retrieved March 15, 2006, from http://ncam.wgbh.org/resources/icr/line21desc.html

NCAM. (n.d.b). International captioning project description of subtitles. Retrieved March 15, 2006, from http://ncam.wgbh.org/resources/icr/subtitledesc.html

NCAM. (n.d.c). ATV closed captioning project. Retrieved June 20, 2007, from http://ncam.wgbh.org/projects/atv/atvccpart2size.html

Ortony, A., & Turner, T. J. (1990). What's basic about basic emotions? [Electronic version]. Psychological Review, 97, 315-16.

Peterson, H., & Dugas, D. (1972). The relative importance of contrast and motion in visual perception. Human Factors, 14, 207(9).

Plutchik, R. (1980). Emotion, a psychoevolutionary synthesis. New York: Harper & Row.

RIT. (2000). Captioning. Retrieved August 13, 2007, from http://www.netac.rit.edu/publication/tipsheet/captioning.html

Stanislaw, H., & Todorov, N. (1999). Calculation of signal detection theory measures. Behavior Research Methods, Instruments, and Computers, 31, 137-149.

Vetrugno, R., Liguori, R., Cortelli, P., & Montagna, P. (2003). Sympathetic skin response: Basic mechanisms and clinical applications. Clinical Autonomic Research, 13, 256(4).

Woolman, M. (2005). Type in motion 2. New York: Thames & Hudson.

Appendix A

To: Deborah Fels
Re: REB 2004-092-1: Burnt Toast
Date: July 20, 2006

Dear Deborah Fels,

The review of your protocol REB File REB 2004-092-1 is now complete. This is a renewal for REB File REB 2004-092. The project has been approved for a one year period. Please note that before proceeding with your project, compliance with other required University approvals/certifications, institutional requirements, or governmental authorizations may be required. This approval may be extended after one year upon request. Please be advised that if the project is not renewed, approval will expire and no more research involving humans may take place. If this is a funded project, access to research funds may also be affected.

Please note that REB approval policies require that you adhere strictly to the protocol as last reviewed by the REB and that any modifications must be approved by the Board before they can be implemented. Adverse or unexpected events must be reported to the REB as soon as possible with an indication from the Principal Investigator as to how, in the view of the Principal Investigator, these events affect the continuation of the protocol. Finally, if research subjects are in the care of a health facility, at a school, or other institution or community organization, it is the responsibility of the Principal Investigator to ensure that the ethical guidelines and approvals of those facilities or institutions are obtained and filed with the REB prior to the initiation of any research.

Please quote your REB file number (REB 2004-092-1) on future correspondence. Congratulations and best of luck in conducting your research.

Nancy Walton, Ph.D.

Chair, Research Ethics Board

Appendix B Pre-Study Questionnaire

Pre-study Questionnaire

The purpose of this questionnaire is to gather information about participants' TV viewing habits and preferences. The information gathered here will be used to analyze the results of this study, to assess the effectiveness of enhanced closed captions for interactive television, and to improve the effectiveness of emotional enhancements to closed captions.

This questionnaire will take approximately 5 minutes to complete. Please read the questions carefully. To record your answer, please check the box or write your answer in the space provided. Thank you for participating in the Burnt Toast study.

Part I – Demographics

1. Do you identify yourself as (check one): ☐ Hearing ☐ Hard of hearing ☐ Deaf

2. Are you: ☐ Male ☐ Female

3. Please indicate your age range: ☐ Under 18 ☐ 19–24 ☐ 25–34 ☐ 35–44 ☐ 45–54 ☐ 55–64 ☐ Over 65

4. What is your highest level of education completed? ☐ No formal education ☐ Elementary school ☐ High school ☐ Technical/College ☐ University ☐ Graduate school

5. What is your occupation? ______


Part II – Current Television Patterns

6. How many hours of television do you watch in an average week? ☐ Less than 1 hour ☐ 1 to 5 hours ☐ 6 to 10 hours ☐ 11 to 15 hours ☐ 15 to 20 hours ☐ More than 20 hours

7. Do you watch TV alone? Please circle your response.

Always Frequently Sometimes Seldom Never

8. Do you watch TV with friends or family?

Always Frequently Sometimes Seldom Never

9. If you watch TV with friends or family, do you engage in conversation about the program: ☐ While you are watching it ☐ After the program is finished ☐ Not at all ☐ Do not watch TV with friends or family

Part III – Closed Captioning

10. Do you use closed captions when watching television? ☐ Always ☐ Occasionally ☐ Never

11. What do you like about the closed captions on television (check all that apply)

☐ rate or speed of display ☐ verbatim translation ☐ placement on screen ☐ use of text ☐ size of text ☐ colour (black and white) ☐ other, please specify ______


12. What do you NOT like about the closed captions on television (check all that apply)

☐ rate or speed of display ☐ verbatim translation ☐ placement on screen ☐ use of text ☐ size of text ☐ colour (black and white) ☐ other, please specify ______

13. What elements of closed captions do you believe are not presently available for television (check all that apply): ☐ Emotions in speech ☐ Background music/noises ☐ Speaker identification ☐ Text at the same speed as the words being spoken ☐ Adequate information in order to time jokes/puns correctly ☐ Other, please specify ______

14. The following is a list of characteristics that could be added to, or changed about, closed captions. Please check the five characteristics that are most important to you.

____ Faster speed of displaying captions
____ Text descriptions of background noise
____ Text descriptions for background music
____ Use of graphics to represent emotion
____ Use of overlay/floating captions
____ Use of graphics to represent music
____ Use of colour in captions for emphasis, emotion or tone
____ Use of graphics to represent background noise
____ Use of text descriptions for emotional information
____ Use of different fonts or text size
____ Flashing captions for emphasis
____ Use of moving text to represent emotion
____ Use of graphics or symbols to denote background elements such as applause or musical inserts
____ Use graphics or text to identify speaker

15. Is there anything else, not included in the above list, that you would like to see included in closed captions but that is not presently available?

______


______

Appendix C Post-Study Questionnaire

Post-study Questionnaire

The purpose of this questionnaire is to gather information about your impressions of, and likes and dislikes about, the six short videos that you viewed. The information gathered here will be used to analyze the effectiveness of enhanced closed captions for interactive television, and to gather suggestions for improvement.

This questionnaire will take approximately 5 minutes to complete. Please read the questions carefully. To record your answer, please check the box or write your answer in the space provided. Thank you for participating in the enhanced captioning study.

16. In a few sentences (or less), please describe what To Air is Human was about.

______

______

______

______

______

17. Please rate each of the following aspects of the enhanced captioning version of “To Air is Human”. Check the box that best fits your rating.

Rating scale (check one per aspect): Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much

Aspects to rate:
Colour
Size of text
Speed of display
Placement on screen
Movement of the text on the screen
How the moving text portrayed the character’s emotions
Text descriptions


18. Please rate each of the following aspects of the extreme captioning version of “Bad Vibes”. Check the box that best fits your rating.

Rating scale (check one per aspect): Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much

Aspects to rate:
Colour
Size of text
Speed of display
Placement on screen
Movement of the text on the screen
How the moving text portrayed the character’s emotions
Text descriptions at bottom

19. Please rate each of the following aspects of the conventional captioning version of “To Air is Human”. Check the box that best fits your rating.

Rating scale (check one per aspect): Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much

Aspects to rate:
Size of text
Speed of display
Placement on screen
Text descriptions

20. Compared with the conventional closed captioning of “To Air is Human”, rate your level of confusion with the enhanced captioning for “To Air is Human”. Please circle your rating.

Much more confusing / Somewhat more confusing / Not different / Less confusing / Much less confusing


21. How much do you think the enhanced captions increased your overall understanding of the video?

Did not increase my level of understanding at all / Slightly increased my level of understanding / No difference / Somewhat increased my level of understanding / Greatly increased my level of understanding

22. How much do you think the enhanced captions increased your overall understanding of the emotions the characters were feeling in the video?

Did not increase my level of understanding at all / Slightly increased my level of understanding / No difference / Somewhat increased my level of understanding / Greatly increased my level of understanding

23. How much do you think the enhanced captions distracted you from the video?

Did not distract me / Slightly distracted me / Did not notice / Somewhat distracted me / Greatly distracted me

24. If you had just watched To Air is Human with friends who were hearing, rate your willingness to engage in conversation with them about To Air is Human.

Not willing to discuss To Air is Human at all / Not really willing to discuss To Air is Human / Don’t care / Somewhat willing to discuss To Air is Human / Very willing to discuss To Air is Human

Please indicate why you selected your particular rating.

______

25. What suggestions do you have for further improvements to captioning?

______

______

26. Please add any additional comments.

______
