arXiv:2105.02877v1 [cs.CV] 6 May 2021
Aligning Subtitles in Sign Language Videos

Hannah Bull¹*, Triantafyllos Afouras²*, Gül Varol²,³, Samuel Albanie², Liliane Momeni², Andrew Zisserman²

¹ LISN, Univ Paris-Saclay, CNRS, France
² Visual Geometry Group, University of Oxford, UK
³ LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
* Equal contribution

[email protected]; fafourast,gul,albanie,liliane,[email protected]
https://www.robots.ox.ac.uk/~vgg/research/bslalign/

[Figure 1 image: a timeline from 14:12 to 14:30 showing three example subtitles (“Overlooked by a small hill known as Leopard Rock.”, “To keep her cubs alive in this dangerous neighbourhood,”, “the mother must make sure they stay hidden.”) at their audio-aligned positions, S_audio (top), and at their ground-truth signing-aligned positions, S_gt (bottom).]

Figure 1: Subtitle alignment: We study the task of aligning subtitles to continuous signing in sign language interpreted TV broadcast data. The subtitles in such settings usually correspond to and are aligned with the audio content (top: audio subtitles, S_audio) but are unaligned with the accompanying signing (bottom: ground-truth annotation of the signing corresponding to the subtitle, S_gt). This is a very challenging task as (i) the order of subtitles varies between spoken and sign languages, (ii) the duration of a subtitle differs considerably between signing and speech, and (iii) the signing corresponds to a translation of the speech as opposed to a transcription.

Abstract

The goal of this work is to temporally align asynchronous subtitles in sign language videos. In particular, we focus on sign-language interpreted TV broadcast data comprising (i) a video of continuous signing, and (ii) subtitles corresponding to the audio content. Previous work exploiting such weakly-aligned data only considered finding keyword-sign correspondences, whereas we aim to localise a complete subtitle text in continuous signing. We propose a Transformer architecture tailored for this task, which we train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video. We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals, which interact through a series of attention layers. Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not. Through extensive evaluations, we show substantial improvements over existing alignment baselines that do not make use of subtitle text embeddings for learning. Our automatic alignment model opens up possibilities for advancing machine translation of sign languages by providing continuously synchronized video-text data.

1. Introduction

Sign languages constitute a key form of communication for Deaf communities [53]. Our goal in this paper is to temporally localise subtitles in continuous signing video. Automatic alignment of subtitle text to signing content has great potential for a wide range of applications including assistive tools for education and translation, indexing of sign language video corpora, efficient subtitling technology for signing vloggers¹, and automatic construction of large-scale sign language datasets that support computer vision and linguistic research.

¹ Unlike spoken vlogs that benefit from automatic closed captioning on sites such as YouTube, signing vlog creators who wish to provide written subtitles must both translate and align their subtitles manually.

Despite recent advances in computer vision, machine translation between continuous signing and written language remains largely unsolved [5]. Recent works [10, 11] have shown promising translation results, but to date these have been achieved only in constrained settings where continuous signing is manually pre-segmented into clips, with each clip associated with a written sentence from a limited vocabulary. Two key bottlenecks for scaling up translation to continuous signing depicting unconstrained vocabularies are (i) the segmentation of signing into sentence-like units, and (ii) the availability of large-scale sign language training data.

Manual alignment of subtitles to sign language video is tedious: an expert fluent in sign language takes approximately 10-15 hours to align subtitles to 1 hour of continuous sign language video. In this work, we focus on the task of aligning a particular known subtitle within a given temporal signing window. We explore this task in the context of sign language interpreted TV broadcast footage, a readily available and large-scale source of data, where the subtitles are synchronised with the audio, but the corresponding sign language translations are largely unaligned due to differences between spoken and sign languages as well as lags from the live interpretation.

Subtitle alignment to continuous signing remains a very challenging task. First, sign languages have grammatical structures that vary considerably from those of spoken languages [53], and as a result the ordering of words within a subtitle, as well as of the subtitles themselves, is often not maintained in the signing (see Fig. 1). Second, the duration of a subtitle varies considerably between signing and speech due to differences in speed and grammar. Third, the signing corresponds to a translation of the speech that appears in the subtitles as opposed to a transcription: there is no direct one-to-one mapping between subtitle words and signs produced by interpreters, and entire subtitles may not be signed.

Previous work exploiting such weakly-aligned data has mainly focused on finding sparse correspondences between keywords in the subtitle and individual signs [2, 42, 56], as opposed to localising the start and end times of a complete subtitle text in continuous signing. As we show, however, localising isolated signs identified by keyword spotting nevertheless forms a useful pretraining task for full subtitle alignment. Most closely related to our work, Bull et al. [8] consider the task of segmenting a continuous signing video into subtitle units purely based on body keypoints. In fact, similarly to speech, which can be segmented based on prosodic cues such as pauses, sign sentence boundaries can to an extent be detected through visual cues such as lowering the hands, head movement, pauses, and facial expressions [24]. However, as shown in our evaluations in Sec. 4, such approaches based on prosody alone perform poorly in our setting, where subtitles do not necessarily correspond to complete sign sentences with clear visual boundaries.

In this paper, we instead propose to use the subtitle text as an additional signal for better alignment. We make the following three contributions: (1) we show that encoding the subtitle text as input to the alignment model significantly improves the temporal localisation quality as opposed to only relying on visual cues to segment continuous sign language videos into subtitle units; (2) we design a novel formulation for the subtitle alignment task based on Transformers; and (3) we present a comprehensive study ablating our design choices and provide promising results for this new task when evaluating on unseen signers and content.

2. Related Work

For a recent comprehensive survey about sign language recognition and translation, see [33]. Here, we review relevant works on temporal localisation at the levels of individual signs and sequences, in addition to more general temporal alignment methods from the literature.

Temporal localisation of individual signs. A rich body of work has considered the task of localising sparse sign instances in continuous signing, often referred to as “sign spotting”. Early efforts using signing gloves [38] were followed by methods employing hand-crafted visual features to represent the hands, face and motion that were integrated with CRFs [61, 62], HMMs [49] and HSP Trees [45]. Several studies have sought to employ subtitles as weak supervision for learning to localise and classify signs, using apriori mining [17] and multiple-instance learning [6, 7, 46]. More recent work has leveraged cues such as mouthings [2] and visual dictionaries [42], and has made use of deep neural network features with sliding window classifiers [37] and attention learned via a proxy translation task [56]. In contrast to these works, our objective is to localise complete subtitle units, rather than individual signs.

Temporal localisation of sign sequences. The alignment of subtitles to continuous signing was considered in creative early work by combining cues from multiple sparse correspondences [23], but under the assumption that the ordering of words in subtitles is preserved in the signing (which does not hold in our problem setting). Other sequence-level sign language temporal localisation tasks that have received attention in the literature include category-agnostic sign segmentation [22, 47], active signer detection [4, 16, 43, 52] and diarisation [1, 26, 27]; each considers a temporal granularity that differs from subtitle units. Most closely related to our work, Bull et al. [8] employ a keypoint-based model to segment continuous signing into sentence-like units without knowledge of the written subtitles during inference. Our

[Figure 2 diagram. Labelled components: subtitle text (example: “The souffle is just a little bit of work but it's really worth it.”), BERT, Linear, Transformer Encoder, continuous signing video (T frames), I3D, Linear, PE, S_prior, Transformer Decoder, Linear, Sigmoid, S_pred, S_gt. Legend: frame in subtitle / frame not in subtitle.]

Figure 2: SAT model overview: We input to our model (i) token embeddings of the subtitle text we wish to align, (ii) a sequence of video features extracted from a continuous sign language video segment and (iii) the shifted temporal boundaries of the audio-aligned subtitle, S_prior. Using these inputs, the model outputs a vector of values between 0 and 1 of length T.
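
The frame-level interface described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `prior_vector` and `decode_interval`, the 25 fps frame rate, the 0.5 threshold and the longest-run decoding rule are all assumptions made for illustration. The first function turns subtitle boundaries (in seconds) into a binary vector of length T, as one plausible encoding of S_prior; the second turns a length-T vector of values between 0 and 1 back into a start and end time.

```python
from typing import List, Optional, Sequence, Tuple

FPS = 25.0  # assumed frame rate; this section does not state one


def prior_vector(start_s: float, end_s: float, window_start_s: float,
                 num_frames: int, fps: float = FPS) -> List[int]:
    """Binary vector of length T: 1 where the frame time lies in [start_s, end_s)."""
    vec = []
    for t in range(num_frames):
        time_s = window_start_s + t / fps
        vec.append(1 if start_s <= time_s < end_s else 0)
    return vec


def decode_interval(probs: Sequence[float], window_start_s: float,
                    fps: float = FPS,
                    threshold: float = 0.5) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds of the longest run of frames whose
    value is >= threshold, or None if no frame passes the threshold.
    The longest-run rule is an illustrative assumption, not the paper's."""
    best_start, best_len = -1, 0
    run_start, run_len = -1, 0
    for t, p in enumerate(probs):
        if p >= threshold:
            if run_len == 0:
                run_start = t  # a new run begins at this frame
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # run broken by a frame outside the subtitle
    if best_len == 0:
        return None
    start = window_start_s + best_start / fps
    end = window_start_s + (best_start + best_len) / fps
    return (start, end)
```

For example, with a hypothetical window starting at 10 s and one frame per second, `decode_interval([0.1, 0.9, 0.2, 0.8, 0.7, 0.6, 0.3], 10.0, fps=1.0)` selects frames 3 to 5 and returns the interval from 13 s to 16 s. Taking the longest run is only one way to obtain a single contiguous segment from per-frame scores; other rules would fit the same interface.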