Enhancing Multi-lingual Information Extraction via Cross-Media Inference and Fusion

Adam Lee, Marissa Passantino, Heng Ji
Computer Science Department
Queens College and Graduate Center, City University of New York
[email protected]

Guojun Qi, Thomas Huang
Department of Electrical and Computer Engineering & Beckman Institute
University of Illinois at Urbana-Champaign
[email protected]

Abstract

We describe a new information fusion approach to integrate facts extracted from cross-media objects (videos and texts) into a coherent common representation including multi-level knowledge (concepts, relations and events). Beyond standard information fusion, we exploited video extraction results and significantly improved text Information Extraction. We further extended our methods to a multi-lingual environment (English, Arabic and Chinese) by presenting a case study on cross-lingual comparable corpora acquisition based on video comparison.

1 Introduction

An enormous amount of information is widely available in various data modalities (e.g. speech, text, image and video). For example, a Web news page about "Health Care Reform in America" is composed of texts describing some events (e.g., the final Senate vote for the reform plans, Obama signing the reform agreement), images (e.g., images about various government involvements over decades) and videos/speech (e.g. Obama's speech video about the decisions) containing additional information regarding the real extent of the events or providing evidence corroborating the text part. These cross-media objects exist in redundant and complementary structures, and therefore it is beneficial to fuse information from the various data modalities. The goal of our paper is to investigate this task from both mono-lingual and cross-lingual perspectives.

The processing methods for texts and images/videos are typically organized into two separate pipelines. Each pipeline has been studied separately and quite intensively over the past decade. It is critical to move away from single-media processing and instead toward methods that make multiple decisions jointly using cross-media inference. For example, video analysis allows us to find both entities and events in videos, but it is very challenging to specify some fine-grained semantic types such as proper names (e.g. "Barack Obama") and relations among concepts, while the speech embedded in and the texts surrounding these videos can significantly enrich such analysis. On the other hand, image/video features can enhance text extraction. For example, entity gender detection from speech recognition output is challenging because of entity mention recognition errors, whereas gender detection from corresponding images and videos can achieve above 90% accuracy (Baluja and Rowley, 2006). In this paper, we present a case study on gender detection to demonstrate how text and video extraction can boost each other.

We can further extend the benefit of cross-media inference to cross-lingual information extraction (CLIE). Hakkani-Tur et al. (2007) found that CLIE performed notably worse than monolingual IE, and indicated that a major cause was the low quality of machine translation (MT). Current statistical MT methods require large and manually aligned parallel corpora as input for each language pair of interest. Some recent work (e.g. Munteanu and Marcu, 2005; Ji, 2009) found that MT can benefit from multi-lingual comparable corpora (Cheung and Fung, 2004), but it is time-consuming to identify pairs of comparable texts, especially when parallel information such as news release dates and topics is lacking. However, the images/videos embedded in the same documents can provide additional clues for similarity computation because they are 'language-independent'. We will show how a video-based comparison approach can reliably build large comparable text corpora for three languages: English, Chinese and Arabic.

2 Baseline Systems

We apply the following state-of-the-art text and video information extraction systems as our baselines. Each system can produce reliable confidence values based on statistical models.

2.1 Video Concept Extraction

The video concept extraction system was developed by IBM for the TREC Video Retrieval Evaluation (TRECVID-2005) (Naphade et al., 2005). This system can extract 2617 concepts defined by TRECVID, such as "Hospital", "Airplane" and "Female-Person". It uses support vector machines to learn the mapping from low-level features extracted from the visual modality as well as from transcripts and production-related meta-features. It also exploits a Correlative Multi-label Learner (Qi et al., 2007), a Multi-Layer Multi-Instance Kernel (Gu et al., 2007) and Label Propagation through Linear Neighborhoods (Wang et al., 2006) to extract all other high-level features. For each classifier, different models are trained on a set of different modalities (e.g., the color moments, textures and edge histograms), and the predictions made by these classifiers are combined with a hierarchical linearly-weighted fusion strategy across the different modalities and classifiers.
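To make the last step concrete, the following minimal Python sketch illustrates one way a hierarchical linearly-weighted fusion of per-modality concept classifiers can be organized; the modality names, scores and weights below are hypothetical placeholders, not the values or architecture of the IBM system.

# Illustrative sketch (not the IBM TRECVID implementation): classifier scores
# are first averaged within each modality, then modality-level scores are
# combined with a linear weighting. All names and numbers are hypothetical.
def fuse_concept_scores(scores_by_modality, modality_weights):
    """scores_by_modality: {modality: {concept: [classifier scores in 0..1]}}."""
    fused = {}
    for modality, concept_scores in scores_by_modality.items():
        w = modality_weights.get(modality, 0.0)
        for concept, scores in concept_scores.items():
            modality_score = sum(scores) / len(scores)   # within-modality fusion
            fused[concept] = fused.get(concept, 0.0) + w * modality_score
    return fused

scores = {
    "color_moments":  {"Female-Person": [0.62, 0.58], "Hospital": [0.10, 0.05]},
    "texture":        {"Female-Person": [0.71],       "Hospital": [0.20]},
    "edge_histogram": {"Female-Person": [0.55],       "Hospital": [0.15]},
}
weights = {"color_moments": 0.4, "texture": 0.4, "edge_histogram": 0.2}
print(fuse_concept_scores(scores, weights))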

2.2 Text Information Extraction

We use a state-of-the-art IE system (Ji and Grishman, 2008) developed for the Automatic Content Extraction (ACE) program (http://www.nist.gov/speech/tests/ace/) to process texts and automatic speech recognition output. The pipeline includes name tagging, nominal mention tagging, coreference resolution, time expression extraction and normalization, relation extraction and event extraction. Entities include coreferred persons, geo-political entities (GPE), locations, organizations, facilities, vehicles and weapons; relations include 18 types (e.g. "a town some 50 miles south of Salzburg" indicates a Located relation); events include the 33 distinct event types defined in ACE 2005 (e.g. "Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment." indicates a "personnel-start" event). Names are identified and classified using an HMM-based name tagger. Nominals are identified using a maximum entropy-based chunker and then semantically classified using statistics from the ACE training corpora. Relation extraction and event extraction are also based on maximum entropy models, incorporating diverse lexical, syntactic, semantic and ontological knowledge.

3 Mono-lingual Information Fusion and Inference

3.1 Mono-lingual System Overview

[Figure 1: a multi-media document is partitioned into ASR speech/texts and videos/images; text information extraction and video concept extraction produce entities/relations/events and concepts, which pass through multi-level concept fusion and global inference to yield enhanced concepts/entities/relations/events.]
Figure 1. Mono-lingual Cross-Media Information Fusion and Inference Pipeline

Figure 1 depicts the general procedure of our mono-lingual information fusion and inference approach. After we apply the two baseline systems to the multi-media documents, we use a novel multi-level concept fusion approach to extract a common knowledge representation across texts and videos (Section 3.2), and then apply a global inference approach to enhance the fusion results (Section 3.3).

3.2 Cross-media Information Fusion

• Concept Mapping

For each input video, we apply automatic speech recognition to obtain background texts. Then we use the baseline IE systems described in Section 2 to extract concepts from texts and videos. We construct mappings on the overlapping facts across TRECVID and ACE. For example, "LOC.Water-Body" in ACE is mapped to "Beach, Lakes, Oceans, River, River_Bank" in TRECVID.

Due to the different characteristics of video clips and texts, these two tasks have quite different granularities and focus. For example, "PER.Individual" in ACE is an open set including arbitrary names, while TRECVID only covers some famous proper names such as "Hu_Jintao" and "John_Edwards". Geopolitical entities appear very rarely in TRECVID because they are more explicitly presented in background texts. On the other hand, TRECVID defines much more fine-grained nominals than ACE; for example, "FAC.Building-Grounds" in ACE can be divided into 52 possible concept types such as "Conference_Buildings" and "Golf_Course" because they can be more easily detected based on video features. We also notice that TRECVID concepts can include multiple levels of ACE facts; for example, the "WEA_Shooting" concept can be separated into "weapon" entities and "attack" events in ACE. These different definitions bring challenges to cross-media fusion but also opportunities to exploit complementary facts to refine both pipelines. We manually resolved these issues and obtained 20 fused concept sets.

• Time-stamp based Multi-level Projection

After extracting facts from videos and texts, we conduct information fusion at all possible levels: name, nominal, coreference link, relation or event mention. We rely on the timestamp information associated with video keyframes or shots (sequential keyframes) and background speech to align concepts. During this fusion process, we compare the normalized confidence values produced by the two pipelines to resolve the following three types of cases:
• Contradiction – A video fact contradicts a text fact; we only keep the fact with higher confidence.
• Redundancy – A video fact conveys the same content as (or entails, or is entailed by) a text fact; we only keep the unique parts of the facts.
• Complementary – A video fact and a text fact are complementary; we merge the two to form more complete fact sets.

• A Common Representation

In order to effectively extract compact information from large amounts of heterogeneous data, we design an integrated XML format to represent the facts extracted from the above multi-level fusion. We can view this representation as a set of directed "information graphs" G = {Gi(Vi, Ei)}, where Vi is the collection of concepts from both texts and videos, and Ei is the collection of edges linking one concept to another, labeled by relation or event attributes. An example is presented in Figure 2. This common representation is applied in both the mono-lingual and multi-lingual information fusion tasks described in the next sections.

[Figure 2: an information graph centered on Mahmoud Abbas, with edges Spouse (Amina Abbas), Child (Yasser Abbas), Leader (PLO), Elected (2008-11-23), Birth-Place (Safed) and Located (Safed, British Mandate of Palestine).]
Figure 2. An example for cross-media common fact representation

3.3 Cross-media Information Inference

• Uncertainty Problem in Cross-Media Fusion

However, such a simple merging approach usually leads to unsatisfying results due to uncertainty. Uncertainty in multimedia is induced by noise in the data acquisition procedure (e.g., noise in automatic speech recognition results and low-quality camera surveillance videos) as well as human errors and subjectivity. Unstructured texts, especially those translated from foreign languages, are difficult to interpret. In addition, automatic IE systems for both videos and texts tend to produce errors.

• Case Study on Mention Gender Detection

We employ cross-media inference methods to reduce uncertainty. We demonstrate this approach on a case study of gender detection for persons. Automatic gender detection is crucial to many natural language processing tasks such as pronoun reference resolution (Bergsma, 2005). Gender detection for last names has proved challenging, and gender for nominals can be highly ambiguous in various contexts. Unfortunately, most state-of-the-art approaches discover gender information without considering the specific contexts in the document: the results are stored either as a knowledge base with probabilities (e.g. Ji and Lin, 2009) or as a static gazetteer (e.g. census data). Furthermore, speech recognition normally performs poorly on names, which brings more challenges to gender detection for mis-spelled names.

We consider two approaches as our baselines. The first baseline is to discover gender knowledge from Google N-grams using specific lexical patterns (e.g. "[mention] and his/her/its/their") (Ji and Lin, 2009). The other baseline is a gazetteer matching approach based on census data including person names and gender information, as used in typical text IE systems.

We introduce a third method based on male/female concept extraction from the associated background videos. These concepts are detected from context-dependent features (e.g. face recognition). If there are multiple persons in one snippet associated with one shot, we propagate the gender information to all instances.

We then linearly combine these three methods based on their confidence values. For example, the confidence of predicting a name mention n as a male (M) can be computed by combining the probabilities P(n, M, method):

confidence(n, male) = λ1*P(n, M, ngram) + λ2*P(n, M, census) + λ3*P(n, M, video)

In this paper we used λ1 = 0.1, λ2 = 0.1 and λ3 = 0.8, which were optimized on a development set.

4 Cross-lingual Comparable Corpora Acquisition

In this section we extend the information fusion approach to the task of discovering comparable corpora.

4.1 Comparable Documents

Figure 3 presents an example of cross-lingual comparable documents. They are both about the rescue activities for the Haiti earthquake.

Figure 3. An example for cross-lingual multi-media comparable documents

Traditional text-translation based methods tend to miss such pairs due to the poor translation quality of informative words (Ji et al., 2009). However, the background videos and images are language-independent and thus can be exploited to identify such comparable documents. This provides a cross-media approach to break the language barrier.

4.2 Cross-lingual System Overview

Figure 4 presents the general pipeline for discovering cross-lingual comparable documents based on background video comparison. The detailed video similarity computation method is presented in the next section.

[Figure 4: multi-media documents in languages i and j are split into texts (Ti, Tj) and videos (Vi, Vj); concept extraction over the videos yields Facts-Vi and Facts-Vj, whose similarity is computed and compared against a threshold δ to select comparable documents.]
Figure 4. Cross-lingual Comparable Text Corpora Acquisition based on Video Similarity Computation
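A minimal sketch of this pipeline is given below, assuming a video_similarity function of the kind defined in the next section; the document structure, field names and default threshold are illustrative choices, not the authors' implementation.

# Illustrative sketch of the Figure 4 pipeline. Each document is assumed to
# carry the concept facts extracted from its background video ("video_facts");
# pairs whose video similarity exceeds the threshold delta are kept.
def find_comparable_documents(docs_lang_i, docs_lang_j, video_similarity, delta=0.6):
    pairs = []
    for doc_i in docs_lang_i:
        for doc_j in docs_lang_j:
            sim = video_similarity(doc_i["video_facts"], doc_j["video_facts"])
            if sim > delta:
                pairs.append((doc_i["id"], doc_j["id"], sim))
    # return candidate pairs ranked by similarity
    return sorted(pairs, key=lambda p: p[2], reverse=True)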

4.3 Video Concept Similarity Computation

Most document clustering systems use representations built out of lexical and syntactic attributes. These attributes may involve string matching, agreement, syntactic distance and document release dates. Although gains have been made with such methods, there are clearly cases where shallow information will not be sufficient to resolve clustering correctly. We should therefore expect a successful document comparison approach to exploit world knowledge, inference and other forms of semantic information in order to resolve hard cases. For example, if two documents include concepts referring to male-people, an earthquake event, rescue activities and facility-grounds with similar frequency information, we can determine that they are likely to be comparable. In this paper we represent each video as a vector of semantic concepts extracted from the videos and then use the standard vector space model to compute similarity.

Let A = (a_1, ..., a_|Σ|) and B = (b_1, ..., b_|Σ|) be such vectors for a pair of videos; we use cosine similarity:

cos(A, B) = ( Σ_{i=1..|Σ|} a_i * b_i ) / ( sqrt(Σ_{i=1..|Σ|} a_i^2) * sqrt(Σ_{i=1..|Σ|} b_i^2) ),

where |Σ| contains all possible concepts. We use traditional TF-IDF (Term Frequency-Inverse Document Frequency) weights for the vector elements a_i and b_i. Let C be a unique concept and V a video consisting of a series of k shots V = {S_1, ..., S_k}; then:

tf(C, V) = (1/k) * Σ_{i=1..k} tf(C, S_i)

Let p(C, S_i) denote the probability that C is extracted from S_i. We define two different ways to compute the term frequency tf(C, S_i):

(1) tf(C, S_i) = confidence(C, S_i)
(2) tf(C, S_i) = α^confidence(C, S_i)

where confidence(C, S_i) denotes the probability of detecting a concept C in a shot S_i: confidence(C, S_i) = p(C, S_i) if p(C, S_i) > δ, otherwise 0.

Let df(C, S_i) = 1 if p(C, S_i) > δ, otherwise 0. Assuming there are j shots in the entire corpus, we calculate idf as follows:

idf(C, V) = log( j / Σ_{i=1..j} df(C, S_i) )

5 Experimental Results

This section presents experimental results for all three tasks described above.

5.1 Data

We used 244 videos from the TRECVID 2005 data set as our test set. This data set includes 133,918 keyframes, with corresponding automatic speech recognition and translation results (for foreign languages) provided by LDC.

5.2 Information Fusion Results

Table 1 shows the information fusion results for English, Arabic and Chinese on multiple levels. It indicates that the video and text extraction pipelines are complementary – almost all of the video concepts are about nominals and events, while the text extraction output contains a large amount of names and relations. Therefore the results after information fusion provide much richer knowledge.

Annotation Levels    English   Chinese   Arabic
# of videos          104       84        56
Video: Concept       250880    221898    197233
Text: Name           17350     22154     20057
Text: Nominal        31528     21852     16253
Text: Relation       9645      20880     16584
Text: Event          31132     10348     7148

Table 1. Information Fusion Results

It is also worth noting that the number of concepts extracted from videos is similar across languages, while much fewer events are extracted from Chinese or Arabic because of speech recognition and machine translation errors. We took out 1% of the results to measure accuracy against the ground truth in the TRECVID and ACE training data respectively; the mean average precision for video concept extraction is about 33.6%. On English ASR output the text IE system achieved about 82.7% F-measure on labeling names, 80.5% F-measure on nominals (regardless of ASR errors), 66% on relations and 64% on events.

5.3 Information Inference Results

From the test set, we chose 650 persons (492 males and 158 females) to evaluate gender discovery. For the baselines, we used the Google n-gram (n=5) corpus Version II, including 1.2 billion 5-grams extracted from about 9.7 billion sentences (Lin et al., 2010), and census data including 5,014 person names with gender information. Since we only have gold-standard gender information at the shot level (corresponding to a snippet in the ASR output), we asked a human annotator to associate the ground truth with individual persons. Table 2 presents the overall precision (P), recall (R) and F-measure (F).

Methods            P       R       F
Google N-gram      89.1%   70.2%   78.5%
Census             96.2%   19.4%   32.4%
Video Extraction   88.9%   73.8%   80.6%
Combined           89.3%   80.4%   84.6%

Table 2. Gender Discovery Performance

Table 2 shows that the video extraction based approach achieves the highest recall among the three individual methods, and the combined approach achieved a statistically significant further improvement in recall. Table 3 presents some examples ("F" for female and "M" for male). We found that most speech name recognition errors are propagated to gender detection in the baseline methods; for example, "Sala Zhang" is mis-spelled in the speech recognition output (the correct spelling should be "Sarah Chang") and thus the Google N-gram approach mistakenly predicted it as a male. Many rare names such as "Wu Ficzek" and "Karami" cannot be predicted by the baselines. Error analysis of the video extraction based approach showed that most errors occur on shots including multiple people (males and females). In addition, since the data set is from the news domain, there were many shots including reporters and target persons at the same time. For example, "Jiang Zemin" was mistakenly associated with a "female" gender because the reporter is female in the corresponding shot.

5.4 Comparable Corpora Acquisition Results

For comparable corpora acquisition, we measured accuracy for the top 50 document pairs. Due to the lack of answer keys, we asked a bi-lingual human annotator to judge the results manually. The evaluation guideline generally followed the definitions in (Cheung and Fung, 2004): a pair of documents is judged as comparable if they share a certain amount of information (e.g. entities, events and topics).

Without using IDF, for different parameters α and δ in the similarity metrics, the results are summarized in Figure 5. For comparison we present the results for mono-lingual and cross-lingual pairs separately. Figure 5 indicates that as the threshold and normalization values increase, the accuracy generally improves. It is not surprising that the mono-lingual results are better than the cross-lingual results, because generally more videos with comparable topics are in the same language.
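For concreteness, the Section 4.3 similarity metric whose α and δ parameters are swept in this evaluation can be sketched as follows; the shot representation (a dictionary of concept probabilities per shot), default values and helper names are our own illustrative choices, not the authors' code.

# Illustrative sketch of the concept-vector similarity of Section 4.3.
import math

def shot_confidence(prob, delta):
    # confidence(C, S_i) = p(C, S_i) if p(C, S_i) > delta, otherwise 0
    return prob if prob > delta else 0.0

def tf(concept, shots, delta=0.6, alpha=None):
    # tf(C, V) = (1/k) * sum_i tf(C, S_i); variant (1) uses the thresholded
    # confidence directly, variant (2), selected by passing alpha, uses
    # alpha ** confidence(C, S_i).
    total = 0.0
    for shot in shots:
        conf = shot_confidence(shot.get(concept, 0.0), delta)
        total += conf if alpha is None else alpha ** conf
    return total / len(shots)

def idf(concept, corpus_shots, delta=0.6):
    # idf(C) = log(j / sum_i df(C, S_i)) over all j shots in the corpus
    df = sum(1 for shot in corpus_shots if shot.get(concept, 0.0) > delta)
    return math.log(len(corpus_shots) / df) if df else 0.0

def concept_vector(shots, corpus_shots, delta=0.6, alpha=None, use_idf=True):
    concepts = {c for shot in shots for c in shot}
    return {c: tf(c, shots, delta, alpha) * (idf(c, corpus_shots, delta) if use_idf else 1.0)
            for c in concepts}

def cosine(a, b):
    dot = sum(a[c] * b.get(c, 0.0) for c in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0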

Mention       Census   Google N-gram        Video Extraction      Correct Answer   Context Sentence
Zhang Sala    -        M: 1, F: 0           F: 0.699, M: 0.301    F                World famous meaning violin soloist Zhang Sala recently again to Toronto symphony orchestra...
Peter         M: 1     M: 0.979, F: 0.021   M: 0.699, F: 0.301    M                Iraq, there are in Lebanon Paris pass Peter after 10 five Dar exile without peace...
Wu Ficzek     -        -                    M: 0.699, F: 0.301    M                If you want to do a good job indeed Wu Ficzek
President     -        M: 0.953, F: 0.047   M: 0.704, F: 0.296    M                Labor union of Arab heritage publishers president to call for the opening of the Arab Book Exhibition.
Jiang Zemin   -        M: 1, F: 0           F: 0.787, M: 0.213    M                It has never stopped the including the former CPC General Secretary Jiang Zemin...
Karami        -        M: 1, F: 0           M: 0.694, F: 0.306    M                all the Gamal Ismail introduced the needs of the Akkar region, referring to the desire on the issue of the President Karami to give priority disadvantaged areas

Table 3. Examples for Mention Gender Detection
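For illustration, the linear combination behind the "Combined" row of Table 2 can be sketched as below; the probability tables are made-up stand-ins for the N-gram, census and video outputs, with example values mirroring the "Zhang Sala" row of Table 3.

# Illustrative sketch of the gender confidence combination of Section 3.3.
LAMBDA_NGRAM, LAMBDA_CENSUS, LAMBDA_VIDEO = 0.1, 0.1, 0.8   # weights reported in the paper

def combined_gender_confidence(mention, gender, p_ngram, p_census, p_video):
    # confidence(n, g) = l1*P(n,g,ngram) + l2*P(n,g,census) + l3*P(n,g,video);
    # sources with no entry for the mention contribute 0.
    return (LAMBDA_NGRAM  * p_ngram.get(mention, {}).get(gender, 0.0)
          + LAMBDA_CENSUS * p_census.get(mention, {}).get(gender, 0.0)
          + LAMBDA_VIDEO  * p_video.get(mention, {}).get(gender, 0.0))

p_ngram  = {"Zhang Sala": {"M": 1.0, "F": 0.0}}
p_census = {}
p_video  = {"Zhang Sala": {"M": 0.301, "F": 0.699}}
male   = combined_gender_confidence("Zhang Sala", "M", p_ngram, p_census, p_video)
female = combined_gender_confidence("Zhang Sala", "F", p_ngram, p_census, p_video)
print("M" if male > female else "F")   # the video evidence outweighs the N-gram error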

Figure 5. Comparable Corpora Acquisition without IDF
Figure 6. Comparable Corpora Acquisition with IDF (δ=0.6)

We then added IDF to the optimized threshold and obtained the results in Figure 6. The accuracy for both languages was further enhanced. We can see that under any conditions our approach can discover comparable documents reliably. In order to measure the impact of concept extraction errors, we also evaluated the results using ground-truth concepts, as shown in Figure 6. Surprisingly, this did not provide much higher accuracy than automatic concept extraction, mainly because the similarity can be captured by some dominant video concepts.

6 Related Work

A large body of prior work has focused on multi-media information retrieval and document classification (e.g. Iria and Magalhaes, 2009). State-of-the-art information fusion approaches can be divided into two groups: formal "top-down" methods from the generic knowledge fusion community and quantitative "bottom-up" techniques from the Semantic Web community (Appriou et al., 2001; Gregoire, 2006). However, very limited research has explored fusing automatically extracted facts from texts and videos/images. Our idea of conducting information fusion on multiple semantic levels is similar to the kernel method described in (Gu et al., 2007).

Most previous work on cross-media information extraction focused on a single domain (e.g. e-Government (Amato et al., 2010); soccer games (Pazouki and Rahmati, 2009)) and structured/semi-structured texts (e.g. product catalogues (Labsky et al., 2005)). Saggion et al. (2004) described a multimedia extraction approach to create a composite index from multiple and multi-lingual sources. We expand the task to the more general news domain, including unstructured texts, and use cross-media inference to enhance extraction performance.

Some recent work has exploited the analysis of associated texts to improve image annotation (e.g. Deschacht and Moens, 2007; Feng and Lapata, 2008). Other recent research demonstrated that cross-modal integration can provide significant gains in improving the richness of information. For example, Oviatt et al. (1997) showed that speech and pen-based gestures can provide complementary capabilities because basic subject, verb and object constituents almost always are spoken, whereas those describing locative information invariably are written or gestured. However, not much work has demonstrated an effective method of using video/image annotation to improve text extraction. Our experiments provide some case studies in this new direction. Our work can also be considered as an extension of global background inference (e.g. Ji and Grishman, 2008) to the cross-media paradigm.

Extensive research has been done on video clustering. For example, Cheung and Zakhor (2000) used meta-data extracted from textual and hyperlink information to detect similar videos on the web; Magalhaes et al. (2008) described a semantic similarity metric based on key word vectors for multi-media fusion. We extend such video similarity computing approaches to a multi-lingual environment.

7 Conclusion and Future Work

Traditional Information Extraction (IE) approaches focused on single media (e.g. texts), with very limited use of knowledge from other data modalities in the background. In this paper we propose a new approach to integrate information extracted from videos and texts into a coherent common representation including multi-level knowledge (concepts, relations and events). Beyond standard information fusion, we attempted global inference methods to incorporate video extraction and significantly enhanced the performance of text extraction. Finally, we extended our methods to a multi-lingual environment (English, Arabic and Chinese) by presenting a case study on cross-lingual comparable corpora acquisition.

We used a dataset which includes videos and associated speech recognition output (texts), but our approach is applicable to any case in which texts and videos appear together (from associated texts, captions, etc.). The proposed common representation will provide a framework for many byproducts. For example, the monolingual fused information graphs can be used to generate abstractive summaries. Given the fused information, we can also visualize the facts from background texts effectively. We are also interested in using video information to discover novel relations and events which are missed in the text IE task.

Acknowledgement

This work was supported by the U.S. Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053, the U.S. NSF CAREER Award under Grant IIS-0953149, Google, Inc., the DARPA GALE Program, the CUNY Research Enhancement Program, the PSC-CUNY Research Program, the Faculty Publication Program and the GRTI Program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

Amato, F., Mazzeo, A., Moscato, V. and Picariello, A. 2010. Information Extraction from Multimedia Documents for e-Government Applications. Information Systems: People, Organizations, Institutions, and Technologies. pp. 101-108.

Appriou, A., Ayoun, A., Benferhat, S., Besnard, P., Cholvy, L., Cooke, R., Cuppens, F., Dubois, D., Fargier, H., Grabisch, M., Kruse, R., Lang, J., Moral, S., Prade, H., Saffiotti, A., Smets, P. and Sossai, C. 2001. Fusion: General concepts and characteristics. International Journal of Intelligent Systems 16(10).

Baluja, S. and Rowley, H. 2006. Boosting Sex Identification Performance. International Journal of Computer Vision.

Bergsma, S. 2005. Automatic Acquisition of Gender Information for Anaphora Resolution. Proc. Canadian AI 2005.

Cheung, P. and Fung, P. 2004. Sentence Alignment in Parallel, Comparable, and Quasi-comparable Corpora. Proc. LREC 2004.

Cheung, S.-C. and Zakhor, A. 2000. Efficient video similarity measurement and search. Proc. IEEE International Conference on Image Processing.

Deschacht, K. and Moens, M. 2007. Text Analysis for Automatic Image Annotation. Proc. ACL 2007.

Feng, Y. and Lapata, M. 2008. Automatic Image Annotation Using Auxiliary Text Information. Proc. ACL 2008.

Gregoire, E. 2006. An unbiased approach to iterated fusion by weakening. Information Fusion 7(1).

Gu, Z., Mei, T., Hua, X., Tang, J. and Wu, X. 2007. Multi-Layer Multi-Instance Kernel for Video Concept Detection. Proc. ACM Multimedia 2007.

Hakkani-Tur, D., Ji, H. and Grishman, R. 2007. Using Information Extraction to Improve Cross-lingual Document Retrieval. Proc. RANLP 2007 Workshop on Multi-Source Multi-lingual Information Extraction and Summarization.

Iria, J. and Magalhaes, J. 2009. Exploiting Cross-Media Correlations in the Categorization of Multimedia Web Documents. Proc. CIAM 2009.

Ji, H. and Grishman, R. 2008. Refining Event Extraction Through Cross-document Inference. Proc. ACL 2008.

Ji, H. 2009. Mining Name Translations from Comparable Corpora by Creating Bilingual Information Networks. Proc. ACL-IJCNLP 2009 Workshop on Building and Using Comparable Corpora (BUCC 2009): from parallel to non-parallel corpora.

Ji, H., Grishman, R., Freitag, D., Blume, M., Wang, J., Khadivi, S., Zens, R. and Ney, H. 2009. Name Translation for Distillation. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. Springer.

Ji, H. and Lin, D. 2009. Gender and Animacy Knowledge Discovery from Web-Scale N-Grams for Unsupervised Person Mention Detection. Proc. PACLIC 2009.

Labsky, M., Praks, P., Svatek, V. and Svab, O. 2005. Multimedia Information Extraction from HTML Product Catalogues. Proc. 2005 IEEE/WIC/ACM International Conference on Web Intelligence. pp. 401-404.

Lin, D., Church, K., Ji, H., Sekine, S., Yarowsky, D., Bergsma, S., Patil, K., Pitler, E., Lathbury, R., Rao, V., Dalwani, K. and Narsale, S. 2010. New Data, Tags and Tools for Web-Scale N-grams. Proc. LREC 2010.

Magalhaes, J., Ciravegna, F. and Ruger, S. 2008. Exploring Multimedia in a Keyword Space. Proc. ACM Multimedia 2008.

Munteanu, D. S. and Marcu, D. 2005. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, Volume 31, Issue 4. pp. 477-504.

Naphade, M. R., Kennedy, L., Kender, J. R., Chang, S.-F., Smith, J. R., Over, P. and Hauptmann, A. 2005. A light scale concept ontology for multimedia understanding for TRECVID 2005. Technical report, IBM.

Oviatt, S. L., DeAngeli, A. and Kuhn, K. 1997. Integration and synchronization of input modes during multimodal human-computer interaction. Proceedings of the Conference on Human Factors in Computing Systems (CHI'97), 415-422. New York: ACM Press.

Pazouki, E. and Rahmati, M. 2009. A novel multimedia data mining framework for information extraction of a soccer video stream. Intelligent Data Analysis. pp. 833-857.

Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., Mei, T. and Zhang, H.-J. 2007. Correlative Multi-label Video Annotation. Proc. ACM Multimedia 2007.

Saggion, H., Cunningham, H., Bontcheva, K., Maynard, D., Hamza, O. and Wilks, Y. 2004. Multimedia indexing through multi-source and multi-language information extraction: the MUMIS project. Data Knowledge Engineering, 48, 2, pp. 247-264.

Wang, F. and Zhang, C. 2006. Label propagation through linear neighborhoods. Proc. ICML 2006.