A Survey on Mouth Modeling and Analysis for Sign Language Recognition
Epameinondas Antonakos*†, Anastasios Roussos* and Stefanos Zafeiriou*†
Department of Computing, Imperial College London, U.K.

* The authors contributed equally and have joint first authorship. Their names appear in alphabetical order.
† The work of E. Antonakos and S. Zafeiriou was partially funded by the EPSRC projects EP/J017787/1 (4DFAB) and EP/L026813/1 (ADAManT).

Abstract— Around 70 million Deaf people worldwide use Sign Languages (SLs) as their native languages. At the same time, they have limited reading/writing skills in the spoken language. This puts them at a severe disadvantage in many contexts, including education, work, and the usage of computers and the Internet. Automatic Sign Language Recognition (ASLR) can support the Deaf in many ways, e.g. by enabling the development of systems for Human-Computer Interaction in SL and translation between sign and spoken language. Research in ASLR usually revolves around the automatic understanding of manual signs. Recently, the ASLR research community has started to appreciate the importance of non-manuals, since they are related to the lexical meaning of a sign, the syntax and the prosody. Non-manuals include body and head pose, movement of the eyebrows and the eyes, as well as blinks and squints. Arguably, the mouth is one of the most involved parts of the face in non-manuals. Mouth actions related to ASLR can be either mouthings, i.e. visual syllables articulated with the mouth while signing, or non-verbal mouth gestures. Both are very important in ASLR. In this paper, we present the first survey on mouth non-manuals in ASLR. We start by showing why mouth motion is important in SL and which relevant techniques exist within ASLR. Since limited research has been conducted regarding the automatic analysis of mouth motion in the context of ASLR, we proceed by surveying relevant techniques from the areas of automatic mouth expression and visual speech recognition which can be applied to the task. Finally, we conclude by presenting the challenges and potentials of the automatic analysis of mouth motion in the context of ASLR.

Fig. 1: In American SL, the signs "late" (a) and "not yet" (b) are performed with the same manual gesture. Their lexical distinction is based only on the mouth action of the tongue touching the lower lip. (Images copyright ASL University)

I. INTRODUCTION

Sign languages (SLs) commonly serve as an alternative or complementary mode of human communication. Recent statistics suggest that around 5% of the worldwide population suffers from hearing loss to some degree [44]. Furthermore, it is estimated that about 1% of the worldwide population (around 70 million Deaf people) use SLs as their native languages, even though it is hard to measure the exact number. Note that SL is not only used by the Deaf community, but also by people who cannot physically speak.
The limited usage and popularity of SLs among people who use spoken languages has led to the dominance of several misconceptions about them. For example, many people believe that there is a unique international SL or that SL is simply pantomime. However, each country has its own SL (in some cases more than one) and there are hundreds of different SLs worldwide.

Contrary to another common misconception, SLs are as rich and grammatically complex as spoken languages, and even though they usually have a different grammatical structure, they exhibit very similar properties [32], [57]. For example, similar to spoken languages, SLs consist of basic semantic components, referred to as phonemes [57]. The phonemes are mainly expressed through combinations of manual features, such as the shape, posture (orientation), location and motion of the hands. However, SLs convey much of their prosody through non-manual signs [45], [67], [47], [68], [56], [41], such as the pose of the head and torso, facial expressions (combinations of the eyes, eyebrows, cheeks and lips) and mouth movements. These non-manual articulators play an important role in lexical distinction, grammatical structure and adjectival or adverbial content. For example, yes/no questions in American SL (ASL) are associated with raised eyebrows, the head tilted forward and widely opened eyes, and wh-questions with furrowed eyebrows and the head forward [36], [56]. Topics are marked by raised eyebrows and the head slightly back, and negations are expressed with a head shake [36], [56]. Head pose and eye gaze indicate turn taking during story narration [6].

In this paper we are interested in the mouth actions within SL. As explained in [8], [47], the mouth lexical articulators can be separated into two categories: mouth gestures and mouthings. Mouth gestures (also referred to as oral components), which include deformations of the mouth's shape, movements of the tongue and visibility of the teeth, are not related to the spoken language [47]. The term mouthing refers to silent articulations of the mouth that correspond to a pronounced word or part of it. If only part of the word is articulated then, in most SLs, it is its first syllable. Note that there is a debate in the literature regarding the importance and role of the mouth during signing. Some researchers argue that mouthings are not really part of the lexical scope of a SL and are not linguistically significant [57], [30]. However, recent research has shown that mouthings and mouth gestures contribute significantly to the semantic analysis of various SLs [37], [39], [9]. The frequency of mouthings during signing differs for each SL [47], [16] and depends on both the context and the grammatical category of the manual sign they occur with [39], [9]. Mouth actions can contribute to the signing in various ways [56], [50]. Most mouthings have a prosodic interpretation, while others have lexical meaning. An example of this is shown in Fig. 1: in ASL, the signs "not yet" and "late" have exactly the same manual gesture, and the only lexical distinction between them is that, in order to articulate "not yet" (Fig. 1b), the signer needs to touch the lower lip with the tongue and make a slight rotation of the head from side to side that declares negation [37]. Another example is shown in Fig. 2 for the words "brother" and "sister" in German SL. Finally, there are cases in which the mouth may articulate physical events, emotions or sensations, such as types of sounds, noise, disturbances, heaviness, types of textures, etc. These are usually referred to as non-linguistic mouth gestures [55].

Fig. 2: In German SL, the signs BRUDER (translated as "brother" in English; a) and SCHWESTER (translated as "sister" in English; b) are performed with the same manual gesture and differ only in the lip patterns. (The figure is used with permission from [66].)
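From a computing perspective, the mouth articulators described above (lip opening and spreading, visibility of the tongue and teeth) must first be quantified from video before any recognition can take place. The following is a minimal illustrative sketch, not a method drawn from the surveyed literature: it computes simple geometric mouth descriptors per frame, assuming the widely used 68-point facial landmark markup (iBUG scheme) produced by off-the-shelf landmark trackers; the function names and the choice of descriptors are illustrative only.

# A minimal sketch (not from the surveyed papers): quantifying per-frame
# mouth shape from 68-point facial landmarks (iBUG markup), e.g. as
# produced by an off-the-shelf landmark tracker.
import numpy as np

def mouth_features(landmarks: np.ndarray) -> dict:
    """Simple geometric mouth descriptors from a (68, 2) landmark array
    for one video frame."""
    assert landmarks.shape == (68, 2)
    # Normalise by inter-ocular distance (outer eye corners 36 and 45)
    # so the descriptors are scale-invariant.
    iod = np.linalg.norm(landmarks[45] - landmarks[36])
    # Inner-lip opening (points 62/66) captures mouth openness.
    openness = np.linalg.norm(landmarks[66] - landmarks[62]) / iod
    # Mouth-corner distance (points 48/54) captures lip spreading.
    width = np.linalg.norm(landmarks[54] - landmarks[48]) / iod
    # Outer-lip height (points 51/57) is a rough rounding/protrusion proxy.
    height = np.linalg.norm(landmarks[57] - landmarks[51]) / iod
    return {"openness": openness, "width": width, "height": height}

def mouth_trajectory(sequence: np.ndarray) -> list:
    """Per-frame descriptors for a (T, 68, 2) landmark sequence."""
    return [mouth_features(frame) for frame in sequence]

Descriptors of this kind can be stacked over time into trajectories and fed to standard sequence models; appearance-based alternatives (e.g. image patches around the mouth) are equally common in practice.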
There are many difficulties that Deaf people encounter in their everyday life. As noted above, most Deaf people have limited reading/writing skills in the spoken language of their country, which is, for them, a foreign language with a fundamentally different grammatical structure. At the same time, the vast majority of the rest of the population does not understand and is unable to use SL. Additionally, the current technological advances within the domain of Human-Computer Interaction (HCI), ranging from text-based interaction to speech recognition and synthesis, are almost entirely oriented towards hearing people. The aforementioned issues put the Deaf community in a disadvantaged position within many contexts, including education, work, and the usage of computers and the Internet. Automatic SL Recognition (ASLR) can support the Deaf community in overcoming these disadvantages by enabling the development of reliable systems for HCI in SL and translation between sign and spoken language. In contrast to speech recognition, which is now ubiquitous in real-world HCI and other applications, ASLR is still far from being a mature technology. Nevertheless, during the last decades there have been some significant developments in ASLR (please see [45], [67], [15] for some general surveys). However, most of the research effort is oriented towards SL recognition using hand-based features; the research efforts that employ non-manual features, including mouth actions, are very limited.

In this paper we present an overview of the literature on the employment and interpretation of mouth actions in ASLR systems. Since limited research has been conducted regarding the automatic analysis of mouth motion in the context of ASLR, we proceed by surveying relevant techniques from the areas of automatic mouth expression and visual speech recognition which can be applied to the task. As mentioned above, there are many open questions regarding the role of the mouth in SL, on both a linguistic and a computing level. Herein, we aim to provide a clear and comprehensive overview of the ongoing research on this problem in order to highlight the achievements that have already been accomplished but, most importantly, to emphasize the challenges that are still open and need to be addressed.

II. NON-MANUAL FEATURES IN ASLR

The vast majority of ASLR methods use solely hand features. However, research in ASLR has recently started appreciating the importance of non-manual parameters. This relatively new research direction is especially promising and is yet to be explored in depth. Non-manual parameters play an essential role in SL communication because they are related to the meaning of a sign, the syntax or the prosody [54], [10], [69], [70]. There are methods for the direct recognition of non-manual linguistic markers [43], [42], [38], as applied to negations, conditional clauses, syntactic boundaries, topic/focus, and wh- and yes/no questions. Moreover, there are methods for the detection of important facial events, such as head gestures [40], [24], [42] and eyebrow movements and eye blinks/squints [42], along with facial expression recognition [40], [64], [65] within the context of SL. For instance, the authors in [42] employ a 2-layer Conditional Random Field.
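While the details of the 2-layer model of [42] are beyond the scope of this section, the underlying formulation it exemplifies, namely casting non-manual marker detection as frame-level sequence labelling, can be illustrated with a standard single-layer linear-chain CRF. The sketch below uses the third-party sklearn-crfsuite package; the feature names, label set and numeric values are invented for the illustration and are not taken from [42].

# A minimal, hypothetical sketch of non-manual marker detection framed
# as frame-level sequence labelling with a linear-chain CRF. This is
# NOT the 2-layer model of [42]; it only illustrates the general idea.
# Requires the third-party package: pip install sklearn-crfsuite
import sklearn_crfsuite

# Each frame is described by a dict of (hypothetical) facial features,
# e.g. the mouth descriptors sketched earlier plus a head-pose angle.
train_X = [
    [  # one signing sequence = one list of per-frame feature dicts
        {"openness": 0.12, "width": 0.48, "head_yaw": -2.0},
        {"openness": 0.30, "width": 0.45, "head_yaw": -15.0},
        {"openness": 0.28, "width": 0.44, "head_yaw": 14.0},
    ],
]
# Per-frame labels: which non-manual marker (if any) is active;
# here the side-to-side head rotation marks a negation.
train_y = [["none", "negation", "negation"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(train_X, train_y)
print(crf.predict(train_X))  # per-frame marker labels

In a realistic system the per-frame feature dictionaries would be populated from tracked facial features over large annotated corpora, and the label set would cover all grammatical markers of interest (negation, question types, syntactic boundaries, etc.).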