GEM: A General Evaluation Benchmark for Multimodal Tasks

Lin Su1, Nan Duan2, Edward Cui1, Lei Ji2, Chenfei Wu2, Huaishao Luo3, Yongfei Liu4, Ming Zhong1, Taroon Bharti1, Arun Sacheti5

1 Bing Multimedia Team, Microsoft, China
2 Natural Language Computing, Microsoft Research Asia, China
3 Southwest Jiaotong University, China
4 ShanghaiTech University, China
5 Bing Multimedia Team, Microsoft, United States

{lins, nanduan, edwac, leiji, chewu, minzhon, tbharti, [email protected]
[email protected],
[email protected]

Abstract

In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Different from existing datasets such as GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020) that mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark, which consists of GEM-I for image-language tasks and GEM-V for video-language tasks. Compared with existing multimodal datasets such as MSCOCO (Chen et al., 2015) and Flicker30K

... to vision-language scenarios (Lu et al., 2019; Chen et al., 2019; Li et al., 2020a,b; Ni et al., 2021; Sun et al., 2019b,a; Luo et al., 2020) to handle multimodal tasks such as image (or video)-text retrieval and image (or video) captioning. However, there is still no comprehensive benchmark dataset for evaluating such multimodal pre-trained models. Besides, most existing vision-language datasets are labeled in English only, which cannot be used to evaluate the quality of such models on other languages. Motivated by this, we present GEM, a General Evaluation benchmark for Multimodal tasks.