Enhancing Multi-lingual Information Extraction via Cross-Media Inference and Fusion
Adam Lee, Marissa Passantino, Heng Ji
Computer Science Department
Queens College and Graduate Center, City University of New York
[email protected]

Guojun Qi, Thomas Huang
Department of Electrical and Computer Engineering & Beckman Institute
University of Illinois at Urbana-Champaign
[email protected]

Abstract

We describe a new information fusion approach to integrate facts extracted from cross-media objects (videos and texts) into a coherent common representation including multi-level knowledge (concepts, relations and events). Beyond standard information fusion, we exploited video extraction results and significantly improved text Information Extraction. We further extended our methods to a multi-lingual environment (English, Arabic and Chinese) by presenting a case study on cross-lingual comparable corpora acquisition based on video comparison.

1 Introduction

An enormous amount of information is widely available in various data modalities (e.g. speech, text, image and video). For example, a Web news page about "Health Care Reform in America" is composed of texts describing some events (e.g., the final Senate vote for the reform plans, Obama signing the reform agreement), images (e.g., images about various government involvements over decades) and videos/speech (e.g. Obama's speech video about the decisions) containing additional information regarding the real extent of the events or providing evidence corroborating the text part. These cross-media objects exist in redundant and complementary structures, and it is therefore beneficial to fuse information from the various data modalities. The goal of our paper is to investigate this task from both mono-lingual and cross-lingual perspectives.

The processing methods of texts and images/videos are typically organized into two separate pipelines. Each pipeline has been studied separately and quite intensively over the past decade. It is critical to move away from single-media processing, and instead toward methods that make multiple decisions jointly using cross-media inference. For example, video analysis allows us to find both entities and events in videos, but it is very challenging to specify fine-grained semantic types such as proper names (e.g. "Obama Barack") and relations among concepts, while the speech embedded in and the texts surrounding these videos can significantly enrich such analysis. On the other hand, image/video features can enhance text extraction. For example, entity gender detection from speech recognition output is challenging because of entity mention recognition errors, whereas gender detection from the corresponding images and videos can achieve above 90% accuracy (Baluja and Rowley, 2006). In this paper, we present a case study on gender detection to demonstrate how text and video extractions can boost each other.

We can further extend the benefit of cross-media inference to cross-lingual information extraction (CLIE). Hakkani-Tur et al. (2007) found that CLIE performed notably worse than monolingual IE, and indicated that a major cause was the low quality of machine translation (MT). Current statistical MT methods require large, manually aligned parallel corpora as input for each language pair of interest. Some recent work (e.g. Munteanu and Marcu, 2005; Ji, 2009) found that MT can benefit from multi-lingual comparable corpora (Cheung and Fung, 2004), but it is time-consuming to identify pairs of comparable texts, especially when there is a lack of parallel information such as news release dates and topics.
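To make the notion of cross-media inference concrete, the following minimal Python sketch shows one way text-based and video-based gender evidence could be combined; the function name, confidence scores and margin are illustrative assumptions, not components of the system described in this paper.

    # A minimal sketch of cross-media inference for entity gender detection.
    # The function, scores and margin are illustrative assumptions only.

    def fuse_gender(text_guess, text_conf, video_guess, video_conf, margin=0.1):
        """Pick the gender hypothesis from the more reliable modality.

        text_guess / video_guess: "male", "female" or None
        text_conf / video_conf:   confidences in [0, 1] from each pipeline
        """
        if video_guess is None:
            return text_guess
        if text_guess is None:
            return video_guess
        if text_guess == video_guess:
            return text_guess          # redundant evidence; keep it
        # contradictory evidence: trust the clearly more confident modality
        return video_guess if video_conf >= text_conf + margin else text_guess

    # ASR errors make the text-based guess unreliable; the video evidence wins.
    print(fuse_gender("male", 0.55, "female", 0.92))   # -> female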
However, the images/videos embedded in the same documents can provide additional clues for similarity computation because they are "language-independent". We will show how a video-based comparison approach can reliably build large comparable text corpora for three languages: English, Chinese and Arabic.

2 Baseline Systems

We apply the following state-of-the-art text and video information extraction systems as our baselines. Each system can produce reliable confidence values based on statistical models.

2.1 Video Concept Extraction

The video concept extraction system was developed by IBM for the TREC Video Retrieval Evaluation (TRECVID-2005) (Naphade et al., 2005). This system can extract 2617 concepts defined by TRECVID, such as "Hospital", "Airplane" and "Female-Person". It uses support vector machines to learn the mapping between low-level features extracted from the visual modality as well as from transcripts and production-related meta-features. It also exploits a Correlative Multi-label Learner (Qi et al., 2007), a Multi-Layer Multi-Instance Kernel (Gu et al., 2007) and Label Propagation through Linear Neighborhoods (Wang et al., 2006) to extract all other high-level features. For each classifier, different models are trained on a set of different modalities (e.g., color moments, wavelet textures and edge histograms), and the predictions made by these classifiers are combined with a hierarchical, linearly weighted fusion strategy across the different modalities and classifiers.

2.2 Text Information Extraction

We use a state-of-the-art IE system (Ji and Grishman, 2008) developed for the Automatic Content Extraction (ACE) program (http://www.nist.gov/speech/tests/ace/) to process texts and automatic speech recognition output. The pipeline includes name tagging, nominal mention tagging, coreference resolution, time expression extraction and normalization, relation extraction and event extraction. Entities include coreferred persons, geo-political entities (GPE), locations, organizations, facilities, vehicles and weapons; relations include 18 types (e.g. "a town some 50 miles south of Salzburg" indicates a located relation); events include the 33 distinct event types defined in ACE 2005 (e.g. "Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment." indicates a "personnel-start" event). Names are identified and classified using an HMM-based name tagger. Nominals are identified using a maximum entropy-based chunker and then semantically classified using statistics from the ACE training corpora. Relation extraction and event extraction are also based on maximum entropy models, incorporating diverse lexical, syntactic, semantic and ontological knowledge.

3 Mono-lingual Information Fusion and Inference

3.1 Mono-lingual System Overview

[Figure 1. Mono-lingual Cross-Media Information Fusion and Inference Pipeline: a multi-media document is partitioned into the ASR speech transcript (texts) and the videos/images; text information extraction and video concept extraction produce entities, concepts, relations and events, which pass through multi-level concept fusion and global inference to yield enhanced concepts, entities, relations and events.]

Figure 1 depicts the general procedure of our mono-lingual information fusion and inference approach. After we apply the two baseline systems to the multi-media documents, we use a novel multi-level concept fusion approach to extract a common knowledge representation across texts and videos (section 3.2), and then apply a global inference approach to enhance the fusion results (section 3.3). Sequential keyframes and the background speech are used to align concepts.
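As a structural illustration of this procedure, the following Python skeleton mirrors the flow of Figure 1; the class, field and parameter names are hypothetical placeholders rather than the actual interfaces of the baseline systems in section 2.

    # A structural sketch of the Figure 1 pipeline. All names are hypothetical
    # stand-ins, not the actual interfaces of the IBM TRECVID or ACE systems.

    from dataclasses import dataclass, field

    @dataclass
    class Fact:
        concept: str                 # e.g. "PER.Individual" or "Female-Person"
        arguments: tuple = ()        # linked concepts, for relations and events
        confidence: float = 1.0
        source: str = "text"         # "text" or "video"

    @dataclass
    class MultimediaDocument:
        transcript: str                                  # ASR output of the speech
        keyframes: list = field(default_factory=list)    # partitioned video/images

    def process(doc, text_ie, video_ie, fuse, global_inference):
        """Run both pipelines, fuse their facts, then refine them jointly."""
        text_facts = text_ie(doc.transcript)     # entities, relations, events
        video_facts = video_ie(doc.keyframes)    # TRECVID-style concepts
        fused = fuse(text_facts, video_facts)    # multi-level concept fusion
        return global_inference(fused)           # cross-media global inference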
3.2 Cross-media Information Fusion

• Concept Mapping

For each input video, we apply automatic speech recognition to obtain the background texts. Then we use the baseline IE systems described in section 2 to extract concepts from the texts and videos. We construct mappings between the overlapping facts across TRECVID and ACE. For example, "LOC.Water-Body" in ACE is mapped to "Beach, Lakes, Oceans, River, River_Bank" in TRECVID.

Due to the different characteristics of video clips and texts, these two tasks have quite different granularities and focuses. For example, "PER.Individual" in ACE is an open set including arbitrary names, while TRECVID only covers some famous proper names such as "Hu_Jintao" and "John_Edwards". Geo-political entities appear very rarely in TRECVID because they are more explicitly presented in the background texts. On the other hand, TRECVID defines much more fine-grained nominals than ACE; for example, "FAC.Building-Grounds" in ACE can be divided into 52 possible concept types such as "Conference_Buildings" and "Golf_Course" because they can be more easily detected based on video features. We also notice that ...

During this fusion process, we compare the normalized confidence values produced by the two pipelines to resolve the following three types of cases:

• Contradiction – A video fact contradicts a text fact; we only keep the fact with the higher confidence.
• Redundancy – A video fact conveys the same content as (or entails, or is entailed by) a text fact; we only keep the unique parts of the facts.
• Complementary – A video fact and a text fact are complementary; we merge the two to form more complete fact sets.

• A Common Representation

In order to effectively extract compact information from large amounts of heterogeneous data, we design an integrated XML format to represent the facts extracted from the above multi-level fusion. We can view this representation as a set of directed "information graphs" G = {G_i(V_i, E_i)}, where V_i is the collection of concepts from both texts and videos, and E_i is the collection of edges linking one concept to another, labeled with relation or event attributes. An example is presented in Figure 2. This common representation is applied in both the mono-lingual and the multi-lingual information fusion tasks described in the next sections.

[Figure 2: an example information graph in which "Mahmoud Abbas" is linked to "Amina Abbas" by a Spouse edge, to "Yasser Abbas" by a Child edge, and to "PLO" by a Leader edge.]
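For illustration, a minimal Python sketch of the confidence-based resolution of the three cases above might look as follows; the fact fields and the contradiction and entailment tests are simplifying assumptions, not the actual fusion procedure or XML schema used here.

    # An illustrative resolution of the three fusion cases (contradiction,
    # redundancy, complementary). The Fact fields and the contradiction and
    # entailment tests are simplifying assumptions, not this paper's method.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Fact:
        subject: str        # e.g. "Mahmoud Abbas"
        attribute: str      # e.g. "Leader"
        value: str          # e.g. "PLO"
        confidence: float   # normalized confidence from its pipeline
        source: str         # "text" or "video"

    def contradicts(a, b):
        # same subject and attribute, but incompatible values
        return (a.subject, a.attribute) == (b.subject, b.attribute) and a.value != b.value

    def entails(a, b):
        # crude stand-in for entailment: identical fact content
        return (a.subject, a.attribute, a.value) == (b.subject, b.attribute, b.value)

    def fuse(text_facts, video_facts):
        """Merge video facts into text facts following the three cases."""
        fused = list(text_facts)
        for vf in video_facts:
            conflict = next((tf for tf in fused if contradicts(vf, tf)), None)
            if conflict is not None:                     # contradiction
                if vf.confidence > conflict.confidence:  # keep the higher-confidence fact
                    fused.remove(conflict)
                    fused.append(vf)
            elif any(entails(vf, tf) for tf in fused):   # redundancy
                continue                                 # keep only the unique parts
            else:                                        # complementary
                fused.append(vf)
        return fused

Each fact retained by such a step would then correspond to a labeled edge in one of the information graphs G_i described above.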