Arxiv:2003.13830V1 [Cs.CV] 30 Mar 2020
Total Page:16
File Type:pdf, Size:1020Kb
Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation Necati Cihan Camgoz¨ Ƅ, Oscar Koller q, Simon Hadfield Ƅ and Richard Bowden Ƅ ƄCVSSP, University of Surrey, Guildford, UK, qMicrosoft, Munich, Germany fn.camgoz, s.hadfield, [email protected], [email protected] SIGN LANGUAGE GLOSSES Abstract Connectionist Temporal Classification Spoken Language Sentence Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively Transformer Encoder Transformer Decoder recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the- Transformer Encoder Transformer Decoder art in translation requires gloss level tokenization in order SLRT SLTT to work. We introduce a novel transformer based architec- Spatial Embedding Word Embedding ture that jointly learns Continuous Sign Language Recogni- tion and Translation while being trainable in an end-to-end Shifted Spoken Language Words manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and trans- Figure 1: An overview of our end-to-end Sign Language lation problems into a single unified architecture. This joint Recognition and Translation approach using transformers. approach does not require any ground-truth timing informa- tion, simultaneously solving two co-dependant sequence-to- has focused on recognising the sequence of sign glosses2 sequence learning problems and leads to significant perfor- (Continuous Sign Language Recognition (CSLR)) rather mance gains. than the full translation to a spoken language equivalent We evaluate the recognition and translation perfor- (Sign Language Translation (SLT)). This distinction is im- mances of our approaches on the challenging RWTH- portant as the grammar of sign and spoken languages are PHOENIX-Weather-2014T (PHOENIX14T) dataset. We re- very different. These differences include (to name a few): port state-of-the-art sign language recognition and trans- different word ordering, multiple channels used to convey lation results achieved by our Sign Language Transform- concurrent information and the use of direction and space ers. Our translation networks outperform both sign video to to convey the relationships between objects. Put simply, the spoken language and gloss to spoken language translation mapping between speech and sign is complex and there is models, in some cases more than doubling the performance no simple word-to-sign mapping. (9.58 vs. 21.80 BLEU-4 Score). We also share new baseline Generating spoken language sentences given sign lan- translation results using transformer networks for several guage videos is therefore a spatio-temporal machine trans- other text-to-text sign language translation tasks. lation task [9]. Such a translation system requires us to ac- complish several sub-tasks, which are currently unsolved: 1. Introduction Sign Segmentation: Firstly, the system needs to detect sign sentences, which are commonly formed using topic- arXiv:2003.13830v1 [cs.CV] 30 Mar 2020 Sign Languages are the native languages of the Deaf and comment structures [62], from continuous sign language their main medium of communication. As visual languages, videos. This is trivial to achieve for text based machine 1 they utilize multiple complementary channels to convey in- translation tasks [48], where the models can use punctua- formation [62]. This includes manual features, such as hand tion marks to separate sentences. Speech-based recogni- shape, movement and pose as well as non-manuals features, tion and translation systems, on the other hand, look for such as facial expression, mouth and movement of the head, pauses, e.g. silent regions, between phonemes to segment shoulders and torso [5]. spoken language utterances [69, 76]. There have been stud- The goal of sign language translation is to either convert ies in the literature addressing automatic sign segmentation written language into a video of sign (production) [59, 60] [36, 52, 55, 4, 13]. However to the best of the authors’ or to extract an equivalent spoken language sentence from knowledge, there is no study which utilizes sign segmenta- a video of someone performing continuous sign [9]. How- tion for realizing continuous sign language translation. ever, in the field of computer vision, much of this latter work 2Sign glosses are spoken language words that match the meaning of 1Linguists refer to these channels as articulators. signs and, linguistically, manifest as minimal lexical items. Sign Language Recognition and Understanding: Fol- they represent. By using gloss representations instead of lowing successful segmentation, the system needs to under- the spatial embeddings extracted from the video frames, stand what information is being conveyed within a sign sen- Sign2Gloss2Text avoids the long-term dependency issues, tence. Current approaches tackle this by recognizing sign which Sign2Text suffers from. glosses and other linguistic components. Such methods can We think the second and more critical reason is the be grouped under the banner of CSLR [40, 8]. From a com- lack of direct guidance for understanding sign sentences puter vision perspective, this is the most challenging task. in Sign2Text training. Given the aforementioned complex- Considering the input of the system is high dimensional ity of the task, it might be too difficult for current Neu- spatio-temporal data, i.e. sign videos, models are required ral Sign Language Translation architectures to comprehend that understand what a signer looks like and how they in- sign without any explicit intermediate supervision. In this teract and move within their 3D signing space. Moreover, paper we propose a novel Sign Language Transformer ap- the model needs to comprehend what these aspects mean in proach, which addresses this issue while avoiding the need combination. This complex modelling problem is exacer- for a two-step pipeline, where translation is solely depen- bated by the asynchronous multi-articulatory nature of sign dent on recognition accuracy. This is achieved by jointly languages [51, 58]. Although there have been promising re- learning sign language recognition and translation from sults towards CSLR, the state-of-the-art [39] can only rec- spatial-representations of sign language videos in an end-to- ognize sign glosses and operate within a limited domain of end manner. Exploiting the encoder-decoder based archi- discourse, namely weather forecasts [26]. tecture of transformer networks [70], we propose a multi- Sign Language Translation: Once the information em- task formalization of the joint continuous sign language bedded in the sign sentences is understood by the system, recognition and translation problem. the final step is to generate spoken language sentences. As To help our translation networks with sign language un- with any other natural language, sign languages have their derstanding and to achieve CSLR, we introduce a Sign Lan- own unique linguistic and grammatical structures, which guage Recognition Transformer (SLRT), an encoder trans- often do not have a one-to-one mapping to their spoken former model trained using a CTC loss [2], to predict sign language counterparts. As such, this problem truly repre- gloss sequences. SLRT takes spatial embeddings extracted sents a machine translation task. Initial studies conducted from sign videos and learns spatio-temporal representa- by computational linguists have used text-to-text statistical tions. These representations are then fed to the Sign Lan- machine translation models to learn the mapping between guage Translation Transformer (SLTT), an autoregressive sign glosses and their spoken language translations [45]. transformer decoder model, which is trained to predict one However, glosses are simplified representations of sign lan- word at a time to generate the corresponding spoken lan- guages and linguists are yet to come to a consensus on how guage sentence. An overview of the approach can be seen sign languages should be annotated. in Figure 1. There have been few contributions towards video based The contributions of this paper can be summarized as: continuous SLT, mainly due to the lack of suitable • A novel multi-task formalization of CSLR and SLT datasets to train such models. More recently, Camgoz which exploits the supervision power of glosses, with- et al. [9] released the first publicly available sign lan- out limiting the translation to spoken language. guage video to spoken language translation dataset, namely • The first successful application of transformers for PHOENIX14T. In their work, the authors proposed ap- CSLR and SLT which achieves state-of-the-art results proaching SLT as a Neural Machine Translation (NMT) in both recognition and translation accuracy, vastly problem. Using attention-based NMT models [44, 3], they outperforming all comparable previous approaches. define several SLT tasks and realized the first end-to-end • A broad range of new baseline results to guide future sign language video to spoken language sentence transla- research in this field. tion model, namely Sign2Text. The rest of this paper is organized as follows: In Sec- One of the main findings of [9] was that using gloss tion 2, we survey the previous studies on SLT and the state- based mid-level representations improved the SLT per- of-the-art in the field of NMT. In Section 3, we introduce formance drastically when compared to an end-to-end Sign Language Transformers,