Collecting a Motion-Capture Corpus of American Sign Language for Data-Driven Generation Research

Pengfei Lu
Department of Computer Science
Graduate Center
City University of New York (CUNY)
365 Fifth Ave, New York, NY 10016
[email protected]

Matt Huenerfauth
Department of Computer Science
Queens College and Graduate Center
City University of New York (CUNY)
65-30 Kissena Blvd, Flushing, NY 11367
[email protected]

Abstract

American Sign Language (ASL) generation software can improve the accessibility of information and services for deaf individuals with low English literacy. The understandability of current ASL systems is limited; they have been constructed without the benefit of annotated ASL corpora that encode detailed human movement. We discuss how linguistic challenges in ASL generation can be addressed in a data-driven manner, and we describe our current work on collecting a motion-capture corpus. To evaluate the quality of our motion-capture configuration, calibration, and recording protocol, we conducted an evaluation study with native ASL signers.

1 Introduction

American Sign Language (ASL) is the primary means of communication for about one-half million deaf people in the U.S. (Mitchell et al., 2006). ASL has a distinct word order, syntax, and lexicon from English; it is not a representation of English using the hands. Although reading is part of the curriculum for deaf students, lack of auditory exposure to English during the language-acquisition years of childhood leads to lower literacy for many adults. In fact, the majority of deaf high school graduates in the U.S. have only a fourth-grade (age 10) English reading level (Traxler, 2000).

1.1 Applications of ASL Generation Research

Most technology used by the deaf does not address this literacy issue; many deaf people find it difficult to read the English text on a computer screen or on a television with closed-captioning. Software that presents information in the form of animations of ASL could make information and services more accessible to deaf users, by displaying an animated character performing ASL rather than English text. While writing systems for ASL have been proposed (Newkirk, 1987; Sutton, 1998), none is widely used in the Deaf community. Thus, an ASL generation system cannot produce text output; the system must produce an animation of a human character performing sign language. Coordinating the simultaneous 3D movements of the parts of an animated character's body is challenging, and few researchers have attempted to build such systems.

Prior work can be divided into two areas: scripting and generation/translation. Scripting systems allow someone who knows sign language to "word process" an animation by assembling a sequence of signs from a lexicon and adding facial expressions (one possible script representation is sketched below). The eSIGN project created tools for content developers to build sign databases and assemble scripts of signing for web pages (Kennaway et al., 2007). Sign Smith Studio (Vcom3D, 2010) is a commercial tool for scripting ASL (discussed in section 4). Others study generation or machine translation (MT) of sign language (Chiu et al., 2007; Elliot & Glauert, 2008; Fotinea et al., 2008; Huenerfauth, 2006; Karpouzis et al., 2007; Marshall & Safar, 2005; Shionome et al., 2005; Sumihiro et al., 2000; van Zijl & Barker, 2003).
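To make the scripting approach concrete, the following minimal sketch shows what a scripted utterance might look like as a data structure: a sequence of sign glosses drawn from a prebuilt lexicon, plus facial expressions aligned to spans of signs. The class names, fields, and glosses are invented for exposition; they are not the representation used by eSIGN or Sign Smith Studio.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FacialExpression:
    """A facial expression held over a span of manual signs."""
    name: str        # e.g., "brow-raise" marking a yes/no question
    start_sign: int  # index of the first sign it co-occurs with
    end_sign: int    # index of the last sign it co-occurs with

@dataclass
class ScriptedUtterance:
    """A 'word-processed' ASL utterance: lexicon signs plus expressions."""
    glosses: List[str]  # sign IDs drawn from a prebuilt lexicon
    expressions: List[FacialExpression] = field(default_factory=list)

# A scripted yes/no question: the brow raise spans all three signs.
utterance = ScriptedUtterance(
    glosses=["JOHN", "LIKE", "PIZZA"],
    expressions=[FacialExpression("brow-raise", start_sign=0, end_sign=2)],
)
```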
Experimental evaluations of the understandability of state-of-the-art ASL animation systems have shown that native signers often find the animations difficult to understand (as measured by comprehension questions) or unnatural (as measured by subjective evaluation questions) (Huenerfauth et al., 2008). Errors include a lack of smooth inter-sign transitions, a lack of grammatically required facial expressions, and inaccurate sign performances related to the morphological inflection of signs.

While current ASL animation systems have limitations, there are several advantages to presenting sign language content in the form of animated virtual human characters, rather than videos:

• Generation or MT software planning ASL sentences cannot simply concatenate videos of ASL. Using video clips, it is difficult to produce smooth transitions between signs, subtle motion variations in sign performances, or proper combinations of facial expressions with signs.
• If content must be frequently modified or updated, a video performance would need to be largely re-recorded for each modification, whereas an animation (scripted by a human author) could be further edited or modified.
• Because the face is used to indicate important information in ASL, a human must reveal his or her identity when producing an ASL video. Instead, a virtual human character could perform sentences scripted by a human author.
• For wiki-style applications in which multiple authors collaborate on information content, ASL videos would be distracting: the person performing each sentence may differ. A virtual human would be more uniform.
• Animations can be appealing to children for use in educational applications.
• Animations allow ASL to be viewed at different angles, at different speeds, or performed by different virtual humans, depending on the preferences of the user. This can enable educational applications in which students learning ASL practice their ASL comprehension skills.
1.2 ASL is Challenging for NLP Research

Natural Language Processing (NLP) researchers often apply techniques originally designed for one language to another, but such research is not commonly ported to sign languages. One reason is that, without a written form for ASL, NLP researchers must produce animation and thus address several issues:

• Timing: An ASL performance's speed consists of the speed of individual sign performances, the transitional time between signs, and the insertion of pauses during signing, all of which are based on linguistic factors such as syntactic boundaries, repetition of signs in a discourse, and the part-of-speech of signs (Grosjean et al., 1979). ASL animations whose speed and pausing are incorrect are significantly less understandable to ASL signers (Huenerfauth, 2009). (A toy timing schedule is sketched after this list.)
• Spatial Reference: Signers arrange invisible placeholders in the space around their body to represent objects or persons under discussion (Meier, 1990). To perform personal, possessive, or reflexive pronouns that refer to these entities, signers later point to these locations. Signers may not repeat the identity of these entities again, so their conversational partner must remember where they have been placed. An ASL generator must select which entities should be assigned 3D locations (and where).
• Inflection: Many verbs change their motion paths to indicate the 3D location where a spatial reference point has been established for their subject, object, or both (Padden, 1988). Generally, the motion paths of these inflecting verbs change so that their direction goes from the subject to the object (Figure 1); however, their paths are more complex than this: each verb has a standard motion path that is affected by the subject's and the object's 3D locations. When a verb is inflected in this way, the signer does not need to overtly state the subject/object of a sentence. An ASL generator must produce appropriately inflected verb paths based on the layout of the spatial reference points. (A deliberately simplified path computation is sketched after this list.)
• Coarticulation: As in speech production, the surrounding signs in a sentence affect finger, hand, and body movements. ASL generators that use overly simple interpolation rules to produce these coarticulation effects yield unnatural and non-fluent ASL animation output. (An example of such an overly simple rule is sketched after this list.)
• Non-Manuals: Head tilt and eye gaze indicate the 3D location of a verb's subject and object (or other information); facial expressions also indicate negation, questions, topicalization, and other essential syntactic phenomena not conveyed by the hands (Neidle et al., 2000). Animations without proper facial expressions (and proper timing relative to the manual signs) cannot convey the proper meaning of ASL sentences in a fluent and understandable manner.
• Evaluation: With no standard written form for ASL, string-based metrics cannot be used to evaluate ASL generation output automatically. User-based experiments are necessary, but it is difficult to accurately: screen for native signers …

[Figure 1: An ASL inflecting verb "BLAME": (a) (person on left) blames (person on right); (b) (person on right) blames (person on left).]
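As a concrete illustration of the Timing issue, the minimal sketch below assigns a start time to each sign in a sequence by accumulating per-sign durations, inter-sign transition times, and linguistically motivated pauses. The data structure, function, glosses, and all timing values are invented for exposition; they are not drawn from Grosjean et al.'s or Huenerfauth's timing models.

```python
from dataclasses import dataclass

@dataclass
class TimedSign:
    gloss: str          # sign identifier
    duration: float     # seconds spent performing the sign itself
    transition: float   # transitional movement into the next sign
    pause_after: float  # pause inserted, e.g., at a syntactic boundary

def schedule(signs):
    """Assign each sign a start time by accumulating the three components."""
    start, timeline = 0.0, []
    for sign in signs:
        timeline.append((sign.gloss, round(start, 2)))
        start += sign.duration + sign.transition + sign.pause_after
    return timeline

# Toy sentence; a pause follows "ARRIVE" at a clause boundary.
sentence = [
    TimedSign("YESTERDAY", 0.50, 0.20, 0.00),
    TimedSign("JOHN",      0.40, 0.20, 0.00),
    TimedSign("ARRIVE",    0.50, 0.20, 0.30),
    TimedSign("HAPPY",     0.60, 0.00, 0.00),
]
print(schedule(sentence))
# [('YESTERDAY', 0.0), ('JOHN', 0.7), ('ARRIVE', 1.3), ('HAPPY', 2.3)]
```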
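The Spatial Reference and Inflection issues interact: once entities are assigned 3D locations, an inflecting verb's path depends on those locations. The sketch below computes only a straight-line hand path from the subject's point to the object's point; as noted above, a real inflecting verb deforms a verb-specific standard path, so this is a toy baseline under invented entity names and coordinates, not the method of this paper.

```python
import numpy as np

# Hypothetical spatial-reference map: discourse entities assigned 3D points
# in the signing space (units and axes are arbitrary for this illustration).
reference_points = {
    "MARY": np.array([-0.3, 1.2, 0.4]),  # placeholder on the signer's left
    "JOHN": np.array([ 0.3, 1.2, 0.4]),  # placeholder on the signer's right
}

def naive_inflected_path(subject, obj, num_frames=30):
    """Linearly interpolate a hand path from the subject's point to the object's.

    A real inflecting verb (e.g., BLAME in Figure 1) deforms a verb-specific
    standard path, so straight-line motion is only a first approximation.
    """
    start, end = reference_points[subject], reference_points[obj]
    t = np.linspace(0.0, 1.0, num_frames)[:, None]  # column of blend weights
    return (1 - t) * start + t * end                # (num_frames, 3) positions

path = naive_inflected_path("MARY", "JOHN")  # "MARY blames JOHN"
print(path[0], path[-1])  # begins at MARY's point, ends at JOHN's
```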
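Finally, for Coarticulation, the sketch below shows the kind of overly simple interpolation rule criticized above: it linearly blends joint angles between the final pose of one sign and the initial pose of the next, ignoring how neighboring signs reshape finger, hand, and body movement. The pose vectors are invented; this is the naive baseline that data recorded from real signers is intended to improve upon.

```python
import numpy as np

def naive_transition(prev_final_pose, next_initial_pose, num_frames=10):
    """Blend joint angles linearly across an inter-sign transition.

    This overly simple rule is why rule-based transitions can look
    unnatural: real inter-sign movement is shaped by the surrounding signs.
    """
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1 - t) * prev_final_pose + t * next_initial_pose

# Hypothetical 5-joint finger poses (radians) at the boundary of two signs.
pose_a = np.array([0.1, 0.8, 0.9, 0.9, 0.2])  # final pose of the first sign
pose_b = np.array([1.2, 0.1, 0.1, 0.2, 1.0])  # initial pose of the next sign
frames = naive_transition(pose_a, pose_b)     # (10, 5) array of blended poses
```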
• … sign recordings do not enable researchers to examine the Timing, Coarticulation, Spatial Reference, Non-Manuals, or Inflection phenomena (section 1.2), which operate over multiple signs or sentences in an ASL discourse.
• Other researchers have examined how statistical MT techniques could be used to translate from a written language to a sign language. Morrissey and Way (2005) discuss an example-based MT architecture for Irish Sign Language, and Stein et al. (2006) apply simple statistical MT approaches to German Sign Language. Unfortunately, the sign language "corpora" used in these studies consist of transcriptions of the sequences of signs performed, not recordings of actual human performances. A transcription does not capture subtleties in the 3D movements of the hands, the facial movements, or the speed of an ASL performance. Such information is needed in order to address the Spatial …