The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

MULE: Multimodal Universal Language Embedding

Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan A. Plummer Boston University {donhk, keisaito, saenko, sclaroff, bplum}@bu.edu

Abstract

Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available¹.

Figure 1: Most prior work on vision-language tasks supports up to two languages, where each language is projected into a shared space with the visual features using its own language-specific model parameters (top). Instead, we propose MULE, a language embedding that is visually-semantically aligned across multiple languages (bottom). This enables us to share a single multimodal model, significantly decreasing the number of model parameters, while also performing better than prior work using separate language branches or multilingual embeddings which were aligned using only language data.

Introduction

Vision-language understanding has been an active area of research addressing many tasks such as image captioning (Fang et al. 2015; Gu et al. 2018), visual question answering (Antol et al. 2015; Goyal et al. 2017), image-sentence retrieval (Wang et al. 2019; Nam, Ha, and Kim 2017), and phrase grounding (Plummer et al. 2015; Hu et al. 2016). Recently there has been some attention paid to expanding beyond developing monolingual (typically English-only) methods by also supporting a second language in the same model (e.g., (Gella et al. 2017; Hitschler, Schamoni, and Riezler 2016; Rajendran et al. 2015; Calixto, Liu, and Campbell 2017; Li et al. 2019; Lan, Li, and Dong 2017)). However, these methods often learn completely separate language representations to relate to visual data, resulting in many language-specific model parameters that grow linearly with the number of supported languages.

In this paper, we propose a Multimodal Universal Language Embedding (MULE), an embedding that has been visually-semantically aligned across many languages. Since each language is embedded into a shared space, we can use a single task-specific multimodal model, enabling our approach to scale to support many languages.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹ http://cs-people.bu.edu/donhk/research/MULE.html


Figure 2: An overview of the architecture used to train our multimodal universal language embedding (MULE). Training MULE consists of three components: neighborhood constraints, which semantically align sentences across languages; an adversarial language classifier, which encourages features from different languages to have similar distributions; and a multimodal model, which helps MULE learn the visual-semantic meaning of words across languages by performing image-sentence matching.

Dataset    Language   # images   # descriptions
Multi30K   English    29K        145K
           German     29K        145K
           Czech      29K        29K
           French     29K        29K
MSCOCO     English    121K       606K
           Japanese   24K        122K
           Chinese    18K        20K

Table 1: Available data for each language during training.

Most prior works use a vision-language model that supports at most two languages with separate language branches (e.g. (Gella et al. 2017)), significantly increasing the number of parameters compared to our work (see Fig. 1 for a visualization). A significant challenge of multilingual embedding learning is the considerable disparity in the availability of annotations between different languages. For English, there are many large-scale vision-language datasets to train a model, such as MSCOCO (Lin et al. 2014) and Flickr30K (Young et al. 2014), but there are few datasets available in other languages, and some contain limited annotations (see Table 1 for a comparison of the multilingual datasets used to train MULE). One could simply use Neural Machine Translation (e.g. (Bahdanau, Cho, and Bengio 2014; Sutskever, Vinyals, and Le 2014)) to convert a sentence from its original language into a language with a trained model, but this has two significant limitations. First, machine translations are not perfect and introduce some noise, making vision-language reasoning more difficult. Second, even with a perfect translation, some information is lost going between languages. For example, 她们 is used to refer to a group of women in Chinese. However, it is translated to "they" in English, losing gender information that could be helpful in a downstream task. Instead of fully relying on translations, we introduce a scalable approach that supports queries from many languages in a single model.

An overview of the architecture we use to train MULE is provided in Fig. 2. For each language we use a single fully-connected layer on top of each word embedding to project it into an embedding space shared between all languages, i.e., our MULE features. Training our embedding consists of three components. First, we use an adversarial language classifier in order to align feature distributions between languages. Second, motivated by the sentence-level supervision used to train language embeddings (Devlin et al. 2018; Kiela et al. 2018; Lu et al. 2019), we incorporate visual-semantic information by learning how to match image-sentence pairs using a multimodal network similar to (Wang et al. 2019). Third, we ensure semantically similar sentences are embedded close to each other (referred to as neighborhood constraints in Fig. 2). Since MULE does not require changes to the architecture of the multimodal model like prior work (e.g., (Gella et al. 2017)), our approach can easily be incorporated into other multimodal models.
Despite being trained to align languages using additional large text corpora across each supported language, our experiments will show that recent multilingual embeddings like MUSE (Conneau et al. 2018) perform significantly worse than our approach on tasks like multilingual image-sentence matching. In addition, sharing all the parameters of the multimodal component of our network enables languages with fewer annotations to take advantage of the stronger representation learned using more data.

Thus, as our experiments will show, MULE obtains its largest performance gains on languages with less training data. This gain is boosted further by using Neural Machine Translation as a data augmentation technique to increase the available vision-language training data.

We summarize our contributions as follows:
• We propose MULE, a multilingual text representation for vision-language tasks that can transfer and learn textual representations for low-resourced languages from label-rich languages, such as English.
• We demonstrate MULE's effectiveness on a multilingual image-sentence retrieval task, where we outperform extensions of prior work by up to 20.2% on a single language while also using fewer model parameters.
• We show that using Machine Translation is a beneficial data augmentation technique for training multilingual embeddings for vision-language tasks.

Related Work

Language Representation Learning. Word embeddings, such as word2vec (Mikolov, Yih, and Zweig 2013) and FastText (Bojanowski et al. 2017), play an important role in vision-language tasks. These word embeddings provide a mapping function from a word to an n-dimensional vector where semantically similar words are embedded close to each other, and they are typically trained using language-only data. However, recent work has demonstrated a disconnect between how these embeddings are evaluated and the needs of vision-language tasks (Burns et al. 2019). Thus, several recent methods have obtained significant performance gains across many tasks over language-only trained counterparts by learning the visual-semantic meaning of words specifically for use in vision-language problems (Kottur et al. 2016; Kiela et al. 2018; Burns et al. 2019; Lu et al. 2019; Gupta, Schwing, and Hoiem 2019; Nguyen and Okatani 2019; Tan and Bansal 2019). All these methods have addressed embedding learning only in the monolingual (English-only) setting, however, and none of the methods that align representations across many languages were designed specifically for vision-language tasks (e.g. (Conneau et al. 2018; Rajendran et al. 2015; Calixto, Liu, and Campbell 2017)). Thus, just as in the monolingual setting, and verified in our experiments, these multilingual, language-only trained embeddings do not generalize as well to vision-language tasks as the visually-semantically aligned multilingual embeddings in our approach.
Image-Sentence Retrieval. The goal of this task is to retrieve relevant images given a sentence query and vice versa. Although there has been considerable attention given to this task, nearly all work has focused on supporting queries in a single language, which is nearly always English (e.g. (Nam, Ha, and Kim 2017; Wang et al. 2019)). These models tend to either learn an embedding between image and text features (e.g. (Plummer et al. 2015; Wang et al. 2019; Lee et al. 2018; Huang, Wu, and Wang 2018)) or sometimes directly learn a similarity function (e.g. (Wang et al. 2019)). Most relevant to our work is (Gella et al. 2017), who propose a cross-lingual model which uses an image as a pivot and enforces the sentence representations from English and German to be similar to the pivot image representation, similar to the structure-preserving constraints of (Wang et al. 2019). However, in (Gella et al. 2017) each language is modeled with a completely separate language model. While this may be acceptable for modeling one or two languages, it would not scale well for representing many languages as the number of parameters would grow too large. (Wehrmann et al. 2019) proposes a character-level encoding for a cross-lingual model, which effectively reduces the size of the word embeddings for each language. However, this approach shows a significant drop in performance when training for just two languages.

In this work we explore multiple languages, including underrepresented and low-resourced languages (up to four languages). We learn a shared representation between all languages, enabling us to scale to many languages with few additional parameters. This enables feature sharing with low-resourced languages, resulting in significantly improved performance, even when learning many languages.

Neural Machine Translation. In Neural Machine Translation (NMT) the goal is to translate text from one language to another language using parallel text corpora (Bahdanau, Cho, and Bengio 2014; Sutskever, Vinyals, and Le 2014; Johnson et al. 2017). (Johnson et al. 2017) proposed a multilingual NMT model, which uses a single model with an encoder-decoder architecture. They observed that translation quality on low-resourced languages can be improved when trained with label-rich languages. As discussed in the Introduction, and verified in our experiments, directly using NMT for vision-language tasks has some limitations in its usefulness, but it can provide additional benefits when combined with our method.

Visual-Semantic Multilingual Alignment

In this section we describe how we train MULE, a lightweight multilingual embedding which is visually-semantically aligned across many languages and can easily be incorporated into many vision-language tasks and models. Each word in some language input is encoded using a continuous vector representation, which is then projected to the shared language embedding (MULE) using a language-specific fully connected layer. In our experiments, we initialize our word embeddings from 300-dimensional monolingual FastText embeddings (Bojanowski et al. 2017). The word embeddings and these fully connected layers are the only language-specific parameters in our network. Due to their compact size, they can easily scale to a large vocabulary encompassing many languages.
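Since the paper does not commit to a particular framework, the following PyTorch-style sketch is only our illustration of the language-specific pieces just described: one word-embedding table per language (initialized from 300-dimensional FastText vectors) and one fully connected layer per language projecting into the shared MULE space. The MULE dimensionality and all identifiers here are our own choices, not values from the paper.

```python
import torch.nn as nn

class MuleProjection(nn.Module):
    """Language-specific parameters only: per-language word embeddings plus a
    per-language fully connected layer mapping words into the shared MULE space."""

    def __init__(self, vocab_sizes, embed_dim=300, mule_dim=512):
        super().__init__()
        # one embedding table per language, initialized from monolingual FastText
        self.word_embeddings = nn.ModuleDict({
            lang: nn.Embedding(size, embed_dim) for lang, size in vocab_sizes.items()
        })
        # one fully connected projection per language into the shared MULE space
        self.to_mule = nn.ModuleDict({
            lang: nn.Linear(embed_dim, mule_dim) for lang in vocab_sizes
        })

    def forward(self, token_ids, lang):
        # token_ids: (batch, seq_len) word indices for sentences in language `lang`
        words = self.word_embeddings[lang](token_ids)   # (batch, seq_len, 300)
        mule_words = self.to_mule[lang](words)          # (batch, seq_len, mule_dim)
        # sentence-level MULE feature u_i: average over words (padding ignored
        # here for brevity), as described in the next subsection
        return mule_words.mean(dim=1)
```

Everything downstream of u_i — the LSTM producing the multimodal sentence feature s_i, the fully connected layers on the image side, and the language classifier — is shared across languages, which is what keeps the per-language parameter cost small.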

To train MULE, we use paired and unpaired sentences between the languages from annotated vision-language datasets. We find that we get the best performance by first pretraining MULE with paired sentences before fine-tuning with the multimodal layers using the multi-layer neighborhood constraints described in Eq. (1) and the adversarial language classifier described below. While our experiments focus solely on utilizing multimodal data, one could also try to integrate large text corpora with annotated language pairs (e.g. (Conneau et al. 2018)). However, as our experiments will show, only using generic language pairs for this alignment (i.e., not sentences related to images) results in some loss of information that is important for vision-language reasoning. We will now discuss the three major components of the loss used to train our embedding, as shown in Fig. 2.

Multi-Layer Neighborhood Constraints

During training we assume we have paired sentences obtained from the vision-language annotations, i.e., sentences that describe the same image. These sentences are typically independently generated, so they may not refer to the same entities in the image, and when they do describe the same object they may reference it in different ways (e.g., a black dog vs. a Rottweiler). However, we assume they convey the same general sentiment since they describe the same image. Thus, the multi-layer neighborhood constraints try to encourage sentences from the same image to embed near each other. These constraints are analogous to those proposed in related work on image-sentence matching (Gella et al. 2017; Wang et al. 2019), except that we apply the constraints at multiple layers of our network. Namely, we use the neighborhood constraints on the MULE layer as well as the multimodal embedding layer as done in prior work.

To obtain sentence representations in the MULE space, we simply average the features of each word, which we found to perform better than using an LSTM while increasing model efficiency (an observation also made by (Burns et al. 2019; Wang et al. 2019)). For the multimodal embedding space, we use the same multimodal sentence representation that is used to relate to the image features. We denote the averaged representations in the MULE space (i.e., MULE sentence embeddings) as u_i and the multimodal sentence embeddings as s_i, as shown in Fig. 2.

The neighborhood constraints are enforced using a triplet loss function. For some specific sentence embedding s_i, let s_{i+} and s_{i-} denote a positive and negative pair for s_i, respectively. We use the same notation for positive and negative pairs u_{i+} and u_{i-}. Positive and negative pairs may be from any language. So, for example, German and Czech sentences describing the same image are all positive pairs, while any pair of sentences from different images we assume are negatives (analogous assumptions were made in (Gella et al. 2017; Wang et al. 2019)). Given a cosine distance function d and a margin m, the margin-based triplet loss to minimize is:

L_LM = max(0, d(s_i, s_{i+}) − d(s_i, s_{i-}) + m) + max(0, d(u_i, u_{i+}) − d(u_i, u_{i-}) + m)    (1)

Following (Wang et al. 2019), we enumerate all positive and negative pairs in a minibatch and use the top K most violated constraints, where K = 10 in our experiments.

Language Domain Alignment

Inspired by the domain adaptation approach of (Ganin and Lempitsky 2014; Tzeng et al. 2014), we use an adversarial language classifier (LC) to align the feature distributions of the different languages supported by our model. The goal is to project each language domain into a single shared domain, so that the model transfers knowledge between languages. This classifier does not require paired language data. We use a single fully connected layer for the LC, denoted by W_lc. Given a MULE sentence representation u_i presented in the l-th language, we first minimize the objective function with respect to the language classifier W_lc:

L_LC(W_lc, u_i, l) = CrossEntropy(W_lc u_i, l)    (2)

Then, in order to align the language domains, we learn the language-specific parameters to maximize this loss function.

Image-Language Matching

To directly learn the visual meaning of words we also use a multimodal model to relate sentences to images, which is trained along with our MULE embedding. To accomplish this, we use a two-branch network similar to that of (Wang et al. 2019), except we use the last hidden state of an LSTM to obtain a final multimodal sentence representation (s_i in Fig. 2). Although (Burns et al. 2019; Wang et al. 2019) found mean-pooled features followed by a pair of fully connected layers often perform better, we found using an LSTM to be more stable in our experiments. We also kept the image representation fixed, and only the two fully connected layers after the CNN in Fig. 2 were trained.

Let f_i denote the image representation and s_i denote the sentence representation in the multimodal embedding space for the i-th image x_i. We construct a minibatch that contains positive image-sentence pairs from different images. In the batch, we get (f_i, s_i) from the image-sentence pair (x_i, y_i). It should be noted that sentences can be presented in multiple languages. We sample triplets to obtain positive and negative pairs for the image and sentence representations. To be specific, given f_i, we sample a corresponding positive sentence representation s_{i+} and a negative sentence representation s_{i-} represented in the same language. Equivalently, given y_i, we sample a positive image representation f_{i+} and a negative image representation f_{i-}. Then, our margin-based matching objective to minimize, with margin m and cosine distance function d, is:

L_triplet = max(0, d(f_{i+}, s_{i+}) − d(f_{i+}, s_{i-}) + m) + max(0, d(s_{i+}, f_{i+}) − d(s_{i+}, f_{i-}) + m)    (3)

As with the neighborhood constraints, the loss is computed over the K = 10 most violated constraints. Finally, our overall objective is to find:

θ̂ = argmin_θ λ1 L_LM − λ2 L_LC + λ3 L_triplet,    Ŵ_lc = argmin_{W_lc} λ2 L_LC    (4)

where θ includes all parameters in our network except for the language classifier, W_lc contains the parameters of the language classifier, and the λ weights determine the contribution of each loss.
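The paper gives Eqs. (1)–(4) but no pseudocode, so the sketch below is our own PyTorch-style reading of the objective, under a few explicit assumptions: the cosine distance is d(x, y) = 1 − cos(x, y); the top-K selection (K = 10) is shown over an already-sampled batch of triplets rather than by enumerating every pair in the minibatch; and the min–max in Eq. (4) is realized by updating the language classifier and the rest of the network separately. Margins and λ weights are placeholders.

```python
import torch
import torch.nn.functional as F

def cos_dist(a, b):
    """d(x, y) = 1 - cosine similarity, computed row-wise over a batch."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def topk_hinge(anchor, pos, neg, margin, k=10):
    """max(0, d(anchor, pos) - d(anchor, neg) + m), kept over the K most
    violated constraints in the batch (K = 10 in the paper), then averaged."""
    violations = torch.clamp(cos_dist(anchor, pos) - cos_dist(anchor, neg) + margin, min=0.0)
    return violations.topk(min(k, violations.numel())).values.mean()

def neighborhood_loss(s, s_pos, s_neg, u, u_pos, u_neg, m):
    """Eq. (1): constraints applied both at the multimodal layer (s) and the MULE layer (u)."""
    return topk_hinge(s, s_pos, s_neg, m) + topk_hinge(u, u_pos, u_neg, m)

def matching_loss(f_pos, f_neg, s_pos, s_neg, m):
    """Eq. (3): bidirectional image-sentence triplet loss."""
    return topk_hinge(f_pos, s_pos, s_neg, m) + topk_hinge(s_pos, f_pos, f_neg, m)

def language_classifier_loss(classifier, u, lang_labels):
    """Eq. (2): cross-entropy of the single-FC language classifier on u_i."""
    return F.cross_entropy(classifier(u), lang_labels)

# Eq. (4), written as two alternating updates:
#   theta <- argmin  lam1 * L_LM - lam2 * L_LC + lam3 * L_triplet   (all params except the classifier)
#   W_lc  <- argmin  lam2 * L_LC                                    (classifier only)
# so the classifier learns to predict the language of u_i while the rest of the
# network learns to make the languages indistinguishable.
```

Equivalently, the adversarial term could be implemented with a gradient-reversal layer in the style of (Ganin and Lempitsky 2014); the alternating form above is just one concrete way to write the min–max of Eq. (4).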
Experiments

Datasets

Multi30K (Elliott et al. 2016; 2017; Barrault et al. 2018). The Multi30K dataset augments Flickr30K (Young et al. 2014) with image descriptions in German, French, and Czech. Flickr30K contains 31,783 images where each image is paired with five English descriptions. There are also five sentences provided per image in German, but only one sentence per image is provided for French and Czech. French and Czech sentences are translations of their English counterparts, but German sentences were independently generated. We use the dataset's provided splits, which use 29K/1K/1K images for training/test/validation.

MSCOCO (Lin et al. 2014). MSCOCO is a large-scale dataset which contains 123,287 images, each paired with 5 English sentences. Although this accounts for a much larger English training set compared with Multi30K, there are fewer annotated sentences in other languages. (Miyazaki and Shimizu 2016) released the YJ Captions 26K dataset, which contains about 26K images from MSCOCO where each image is paired with 5 independently generated Japanese descriptions. (Li et al. 2019) provides 22,218 independent Chinese image descriptions for 20,341 images in MSCOCO. There are only about 4K image descriptions which are shared across the three languages. Thus, in this dataset, an additional challenge is the need to use unpaired language data. We randomly selected 1K images for the testing and validation sets from the images which contain descriptions across all three languages, for a total of 2K images, and used the rest for training. Since we use a different data split, it is not possible to compare directly with prior monolingual methods. We provide a fair comparison with our baseline and prior monolingual methods in the supplementary.

Machine Translations. As shown in Table 1, there is considerable disparity in the availability of annotations for training in different languages. As a way of augmenting these datasets, we use Google's online translator to generate sentences in other languages. Since the sentences in other languages are independently generated, their translations can provide additional variation in the training data. This also enables us to evaluate the effectiveness of NMT. In addition, we use these translated sentences to benchmark the performance of translating queries from an unsupported language into one of the languages for which we have a trained model (e.g., translate a sentence from Chinese into English and perform the query using an English-trained model).
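As a reading aid, here is a minimal sketch of this augmentation step. The translate function is a hypothetical stand-in for the online translation service (we are not assuming any specific API), and the (image_id, caption, lang) record format is our own simplification of the dataset annotations.

```python
def translate(sentence, source_lang, target_lang):
    """Hypothetical stand-in for an online translation service; not a real API."""
    raise NotImplementedError

def augment_with_translations(records, languages, pivot="en"):
    """Add machine-translated captions: English captions are translated into each
    other supported language and non-English captions into English, mirroring the
    En <-> other-language translations used for the augmented training sets."""
    augmented = list(records)  # records: iterable of (image_id, caption, lang)
    for image_id, caption, lang in records:
        targets = [l for l in languages if l != pivot] if lang == pivot else [pivot]
        for tgt in targets:
            augmented.append((image_id, translate(caption, lang, tgt), tgt))
    return augmented
```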
Image-Sentence Matching Results

Metrics. Performance on the image-sentence matching task is typically reported as Recall@K, K = [1, 5, 10], for both image-to-sentence and sentence-to-image retrieval (e.g., as done in (Gella et al. 2017; Nam, Ha, and Kim 2017; Wang et al. 2019)), resulting in performance reported over six values per language. Results reporting all six values for each language can be found in the supplementary. In this paper, we average them to obtain an overall score (mR) for each compared method/language.

Model Architecture. We compare the following models:
• EmbN (Wang et al. 2019). As shown in (Burns et al. 2019), EmbN is the state-of-the-art image-sentence model when using image-level ResNet features and good language features. This model is the multimodal network in Fig. 2.
• PARALLEL-EmbN. This model borrows ideas from (Gella et al. 2017) to modify EmbN. Specifically, only a single image representation is trained, but it contains separate language branches.

Multi30K Discussion. We report performance on the Multi30K dataset in Table 2. The first line of Table 2(a) reports performance when training completely separate models (i.e., no shared parameters) for each language in the dataset. The significant discrepancy between the performance of English and German compared to Czech and French can be attributed to the differences in the number of sentences available for each language (Czech and French have 1/5th the sentences, as seen in Table 1). Performance improves across all languages using the PARALLEL model in Table 2(a), demonstrating that the representation learned for the languages with more available annotations can still be leveraged to the benefit of other languages.

Table 2(b) and Table 2(c) show the results of using the multilingual embeddings ML BERT (Devlin et al. 2018) and MUSE (Conneau et al. 2018), which learns a shared FastText-style embedding space for all supported languages. This enables us to compare against aligning languages using language-only data vs. our approach, which performs a visual-semantic language alignment. Note that a single EmbN model is trained across all languages when using MUSE rather than training separate models, since the embeddings are already aligned across languages. Comparing the numbers of Table 2(a) and Table 2(b), we observe that ML BERT, which is a state-of-the-art method in NLP, performs much worse than the monolingual FastText. In addition, we see in Table 2(c) that MUSE improves performance on low-resourced languages (i.e., French and Czech), but actually hurts performance on the language with more available annotations (i.e., English). These results indicate that some important visual-semantic knowledge is lost when relying solely on language-only data to align language embeddings, and that the NLP method does not generalize well to the language-vision task.

Table 2(d) compares the effect that different components of MULE have on performance. Going from the last line of Table 2(a) to the first line of Table 2(d) demonstrates that using a single shared language branch can significantly improve lower-resource language performance (i.e., French and Czech), with only a minor impact on performance for languages with more annotations. Comparing the last line of Table 2(c), which reports performance of our full model using MUSE embeddings, to the last line of Table 2(d), we see that using MUSE embeddings still hurts performance, which helps verify our earlier hypothesis that some important visual-semantic information is lost when aligning languages with only language data. This is also reminiscent of an observation in (Burns et al. 2019), i.e., it is important to consider the visual-semantic meaning of words when learning a language embedding for vision-language tasks.

Model                     Single    Mean Recall
                          Model     En     De     Fr     Cs
(a) FastText (Baseline)
EmbN                      N         71.1   57.9   43.4   33.4
PARALLEL-EmbN             Y         69.6   61.6   52.0   43.2
(b) ML BERT
EmbN                      Y         45.5   37.9   36.4   19.2
PARALLEL-EmbN             Y         60.4   51.1   42.0   29.8
(c) MUSE
EmbN                      Y         68.6   58.2   54.0   41.8
PARALLEL-EmbN             Y         69.5   59.0   51.6   40.7
EmbN+NC+LC+LP             Y         69.0   59.7   53.6   41.0
(d) MULE (Ours)
EmbN                      Y         67.5   59.2   52.5   43.9
EmbN+NC                   Y         67.2   59.9   54.0   46.4
EmbN+NC+LC                Y         67.1   60.9   55.5   49.5
EmbN+NC+LC+LP (Full)      Y         68.0   61.4   56.4   50.3

Table 2: Performance comparison of different language embeddings on the image-sentence retrieval task on Multi30K. MUSE and MULE are multilingual FastText embeddings without and with visual-semantic alignment, respectively. We denote NC: multi-layer neighborhood constraints, LC: language classifier, and LP: pretraining MULE.

Model                     Single    Mean Recall
                          Model     En     Cn     Ja
(a) FastText (Baseline)
EmbN                      N         75.6   55.7   69.4
PARALLEL-EmbN             Y         70.0   52.9   68.9
(b) ML BERT
EmbN                      Y         59.4   44.7   47.2
PARALLEL-EmbN             Y         57.6   57.5   62.1
(c) MULE (Ours)
EmbN                      Y         69.4   54.2   69.0
EmbN+NC                   Y         69.8   56.6   69.5
EmbN+NC+LC                Y         71.3   57.9   70.3
EmbN+NC+LC+LP (Full)      Y         72.0   58.8   70.5

Table 3: Performance comparison of different language embeddings on the image-sentence retrieval task on MSCOCO.

Breaking down the components of our model in the last three lines of Table 2(d), we show that including the multi-layer neighborhood constraints (NC), language classifier (LC), and pretraining MULE (LP) all provide significant performance improvements (a full ablation study can be found in the supplementary). In fact, they can make up for much of the lost performance on the high-resource languages when sharing a single language branch in the multimodal model, with German actually outperforming its separate language-branch counterpart. French and Czech perform even better, however, with a total improvement of 4.4% and 7.1% mean recall over our reproductions of prior work, respectively. Clearly, training multiple languages together in a single model, especially those with fewer annotations, can result in dramatic improvements to performance without having to sacrifice the performance of a single language, as long as some care is taken to ensure the model learns a comparable representation between languages. Our method achieves the best performance on German, French, and Czech, while still being comparable for English.

MSCOCO Discussion. Table 3 reports results on MSCOCO. Here, the lower resource language is Chinese, while English and Japanese both have considerably more annotations (although, unlike German on Multi30K, English has considerably more annotations than Japanese on this dataset). For the most part we see similar behavior on the MSCOCO dataset to what we saw on Multi30K: the lower resource language (Chinese) performs worse overall compared to the higher resource languages, but most of the performance gap is reduced when using our full model. Overall, our formulation obtains a 5.9% improvement in mean recall over our baselines for Chinese, and also improves performance by 1.6% mean recall for Japanese. However, for English, we obtain a slight decrease in performance compared with the English-only model reported on the first line in Table 3(a).

The drop in performance on English could be due to the significant imbalance in the training data on this dataset, where more than 3/4 of the data contains only English captions. In our experiments we separated the data into three groups: English only, English-Japanese, and English-Chinese. We ensured each group was equally represented in the minibatch, which means some images containing Japanese or Chinese captions were sampled far more often than many of the English-only images. This shift in the distribution of the training data may account for some of the loss of performance. We believe more sophisticated sampling strategies may help rectify these issues and regain the lost performance. That said, our model has significantly fewer parameters from learning a single language branch for all languages, while also outperforming the PARALLEL model from prior work, which learns separate language branches.

Leveraging Machine Translations

As mentioned in the introduction, an alternative for training a model to support every language would be to use Neural Machine Translation to convert a query sentence from an unsupported language into a language for which there is a trained model available. We test this approach using an English-trained EmbN model whose performance is reported on the first lines of Table 2(a) and Table 3(a). For each non-English language, we use Machine Translation to convert the sentence from the source language into English, then use an English EmbN model to retrieve the images in the test set.
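For concreteness, this baseline can be sketched as follows; both the translate helper and the english_model.retrieve interface are hypothetical stand-ins rather than code from the paper.

```python
def translate(sentence, source_lang, target_lang):
    """Hypothetical translation helper, as in the earlier augmentation sketch."""
    raise NotImplementedError

def translated_query_retrieval(query, query_lang, english_model, images):
    """Translate-then-query baseline: convert a non-English query into English and
    retrieve with the English-only EmbN model (assumed interface)."""
    return english_model.retrieve(translate(query, query_lang, "en"), images)
```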
The first row of Table 4(b) reports the results of translating non-English queries into English and using the English-only model. On the Multi30K test set we see this performs worse on each non-English language than our MULE approach, but it does outperform some of the baselines trained on human-generated captions. Similar behavior is seen on the MSCOCO data, with Chinese-translated sentences actually performing nearly as well as human-generated English sentences. In short, using translations performs better on the low-resourced languages (French, Czech, and Chinese) than the baselines. These results suggest that these translated sentences are able to capture enough information from the original language to still provide a representation that is "good enough" to be useful.

                                                                            Single    Multi30K (Mean Recall)       MSCOCO (Mean Recall)
Model                            Training Data Source                       Model     En     De     Fr     Cs      En     Cn     Ja
(a)
PARALLEL-EmbN                    Human Generated Only (Tables 2&3)          Y         69.6   61.6   52.0   43.2    70.0   52.9   68.9
MULE EmbN - Full                 Human Generated Only (Tables 2&3)          Y         68.0   61.4   56.4   50.3    72.0   58.8   70.5
(b)
EmbN & Machine Translated Query  Human Generated English Only               Y         71.1   48.5   46.7   46.9    75.6   72.2   66.1
EmbN                             Human Generated + Machine Translations     N         72.0   60.3   54.8   46.3    76.8   71.4   73.2
PARALLEL-EmbN                    Human Generated + Machine Translations     Y         69.0   62.6   60.6   54.1    72.5   72.3   73.3
MULE EmbN - Full                 En → Others, Machine Translations Only     Y         69.3   62.1   61.5   55.5    70.1   71.6   73.7
MULE EmbN - Full                 Human Generated + Machine Translations     Y         70.3   64.1   62.3   57.7    73.5   73.1   76.5

Table 4: Image-sentence matching results with Machine Translation data. We translate sentences between English and the other languages (e.g., En ↔ Ja and En ↔ Cn for MSCOCO) and augment our training set with these translations.

Since translations provide a good representation for performing the retrieval task, they should also be useful in training a new model. This is especially true for any sentences that were independently generated, as they might provide a novel sentence after being translated into other languages. We report the performance of using these translated sentences to augment our training set for both datasets in Table 4(b), where our model obtains the best overall performance. We observe that the models with the augmentation (e.g., the last line of Table 4(b)) always outperform the corresponding models without the augmentation (e.g., the last line of Table 4(a)) on all languages. On the second line of Table 4(b) we see that these translations are useful in providing more training examples even for a monolingual EmbN model. Comparing the fourth and last lines of Table 4(b), we see the difference between training the non-English languages using translated sentences alone and training with both human-generated and translated sentences. Even though the human-generated Chinese captions account for less than 5% of the total Chinese training data, we still see a significant performance improvement using them, with similar results on all other languages. This suggests that human-generated captions still provide better training data than machine translations. We also see, comparing our full model to the PARALLEL-EmbN model and when using MUSE embeddings, that using MULE provides performance benefits even when data is more plentiful.

Parameter Comparison

The language branch in our experiments contained 6.8M parameters. This results in 6.8M × 4 = 27.2M parameters for the PARALLEL-EmbN model proposed by (Gella et al. 2017) on Multi30K (a branch for each language). MULE uses a FC layer containing 1.7M parameters to project word features into the universal embedding, so an EmbN model for Multi30K that uses MULE would have 6.8M + 1.7M × 4 = 13.6M parameters, half the number used by (Gella et al. 2017). MULE also scales better with more languages than (Gella et al. 2017). ML BERT is much larger than MULE, consisting of 12 layers with ≈ 110M parameters.
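To make the scaling explicit (using only the two counts reported above), the language-dependent parameter cost for L supported languages is roughly

    PARALLEL-EmbN: 6.8M × L        MULE-based EmbN: 6.8M + 1.7M × L

which gives 27.2M versus 13.6M at L = 4, and the gap grows as more languages are added.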
Figure 3: Examples of image-sentence matching results. Given an image, we pick the closest sentences on Multi30K.

Qualitative Results

Fig. 3 shows qualitative results from our full model. We pick two samples and retrieve the closest sentence given the image for each language on Multi30K. For the other languages, we provide English translations using Google Translate. The top example shows a perfect match between the languages. The bottom example shows that the model overestimates contextual information from the image in the English sentence. It captures not only the correct event (car racing) but also objects not present in the image (audience and fence). This sentence came from similar images with minor differences in the test set. However, these minor differences can be important for matching between similar images. Learning how to accurately capture the details of an image may improve performance in future work. More results can be found in the supplementary.

Conclusion

We investigated bidirectional image-sentence retrieval in a multilingual setting. We proposed MULE, which can handle multiple language queries with negligible language-specific parameters, unlike prior work which learned completely distinct representations for each language. In addition to being more scalable, our method enables the model to transfer knowledge between languages, resulting in especially good performance on lower-resource languages.

In addition, in order to overcome limited annotations, we show that leveraging Neural Machine Translation to augment a training dataset can significantly increase performance when training both a multilingual network as well as a monolingual model. Although our work primarily focused on image-sentence retrieval, our approach is modular and can be easily incorporated into many other vision-language models and tasks.

Acknowledgements

This work is supported in part by Honda and by DARPA and NSF awards IIS-1724237, CNS-1629700, CCF-1723379.

References

Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. 2015. VQA: Visual Question Answering. In ICCV.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
Barrault, L.; Bougares, F.; Specia, L.; Lala, C.; Elliott, D.; and Frank, S. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers.
Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. TACL 5:135–146.
Burns, A.; Tan, R.; Saenko, K.; Sclaroff, S.; and Plummer, B. A. 2019. Language features matter: Effective language representations for vision-language tasks. In ICCV.
Calixto, I.; Liu, Q.; and Campbell, N. 2017. Multilingual multi-modal embeddings for natural language processing. arXiv:1702.01101.
Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2018. Word translation without parallel data. In ICLR.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805v1.
Elliott, D.; Frank, S.; Sima'an, K.; and Specia, L. 2016. Multi30K: Multilingual English-German image descriptions. arXiv:1605.00459.
Elliott, D.; Frank, S.; Barrault, L.; Bougares, F.; and Specia, L. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. arXiv:1710.07177.
Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR.
Ganin, Y., and Lempitsky, V. 2014. Unsupervised domain adaptation by backpropagation. arXiv:1409.7495.
Gella, S.; Sennrich, R.; Keller, F.; and Lapata, M. 2017. Image pivoting for learning multilingual multimodal representations. In EMNLP.
Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR.
Gu, J.; Joty, S.; Cai, J.; and Wang, G. 2018. Unpaired image captioning by language pivoting. In ECCV.
Gupta, T.; Schwing, A.; and Hoiem, D. 2019. ViCo: Word embeddings from visual co-occurrences. In ICCV.
Hitschler, J.; Schamoni, S.; and Riezler, S. 2016. Multimodal pivots for image caption translation. In ACL.
Hu, R.; Xu, H.; Rohrbach, M.; Feng, J.; Saenko, K.; and Darrell, T. 2016. Natural language object retrieval. In CVPR.
Huang, Y.; Wu, Q.; and Wang, L. 2018. Learning semantic concepts and order for image and sentence matching. In CVPR.
Johnson, M.; Schuster, M.; Le, Q. V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. TACL 5:339–351.
Kiela, D.; Conneau, A.; Jabri, A.; and Nickel, M. 2018. Learning visually grounded sentence representations. In NAACL-HLT.
Kottur, S.; Vedantam, R.; Moura, J. M. F.; and Parikh, D. 2016. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In CVPR.
Lan, W.; Li, X.; and Dong, J. 2017. Fluency-guided cross-lingual image captioning. In ACM-MM.
Lee, K.-H.; Chen, X.; Hua, G.; Hu, H.; and He, X. 2018. Stacked cross attention for image-text matching. In ECCV.
Li, X.; Xu, C.; Wang, X.; Lan, W.; Jia, Z.; Yang, G.; and Xu, J. 2019. COCO-CN for cross-lingual image tagging, captioning and retrieval. Transactions on Multimedia.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265.
Mikolov, T.; Yih, W.-t.; and Zweig, G. 2013. Linguistic regularities in continuous space word representations. In NAACL-HLT.
Miyazaki, T., and Shimizu, N. 2016. Cross-lingual image caption generation. In ACL.
Nam, H.; Ha, J.-W.; and Kim, J. 2017. Dual attention networks for multimodal reasoning and matching. In CVPR.
Nguyen, D.-K., and Okatani, T. 2019. Multi-task learning of hierarchical vision-language representation. In CVPR.
Plummer, B. A.; Wang, L.; Cervantes, C. M.; Caicedo, J. C.; Hockenmaier, J.; and Lazebnik, S. 2015. Flickr30K Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV.
Rajendran, J.; Khapra, M. M.; Chandar, S.; and Ravindran, B. 2015. Bridge correlational neural networks for multilingual multimodal representation learning. arXiv:1510.03519.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NeurIPS.
Tan, H., and Bansal, M. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP.
Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474.
Wang, L.; Li, Y.; Huang, J.; and Lazebnik, S. 2019. Learning two-branch neural networks for image-text matching tasks. TPAMI 41(2):394–407.
Wehrmann, J.; Souza, D. M.; Lopes, M. A.; and Barros, R. C. 2019. Language-agnostic visual-semantic embeddings. In ICCV.
Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2:67–78.
