
Information Access in a Multilingual World

Proceedings of the SIGIR 2009 Workshop July 23, 2009 Boston, Massachusetts USA

Organized by:

Fredric Gey, University of California, Berkeley, USA
Noriko Kando, National Institute of Informatics, Tokyo, JAPAN
Jussi Karlgren, Swedish Institute of Computer Science, Stockholm, SWEDEN

In cooperation with the InfoPlosion Project of Japan: http://www.infoplosion.nii.ac.jp/info-plosion/ctr.php/m/IndexEng/a/Index/

LEGAL NOTICE:

Copyright for individual papers in these proceedings resides with the individual authors of each paper. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SIGIR Workshop on Information Access in a Multilingual World, July 23, 2009, Boston, Massachusetts USA

Program Committee:

Martin Braschler, Zürich University of Applied Sciences, Switzerland
Aitao Chen, Yahoo! Research, USA
Kuang-hua Chen, National Taiwan University, Taiwan ROC
Ruihua Chen, Microsoft Research
Nicola Ferro, University of Padua, Italy
Atsushi Fujii, Tsukuba University, Japan
Julio Gonzalo, Universidad Nacional de Educación a Distancia, Spain
Gareth Jones, Dublin City University, Ireland
Kazuaki Kishida, Keio University, Japan
Sadao Kurohashi, Kyoto University, Japan
Kazuko Kuriyama, Shirayuri College, Japan
Gina-Anne Levow, University of Chicago, USA
Chin-Yew Lin, Microsoft Research, China
Thomas Mandl, University of Hildesheim, Germany
James Mayfield, Johns Hopkins University, USA
Mandar Mitra, Indian Statistical Institute, India
Tatsunori Mori, Yokohama National University, Japan
Isabelle Moulinier, Thomson Reuters, USA
Paul McNamee, Johns Hopkins University, USA
Douglas W. Oard, University of Maryland, USA
Maarten de Rijke, University of Amsterdam, The Netherlands
Miguel Ruiz, University of North Texas, USA
Yohei Seki, Toyohashi University of Technology, Japan
Benjamin Tsou, City University of Hong Kong, China
Takehito Utsuro, Tsukuba University, Japan
Christa Womser-Hacker, University of Hildesheim, Germany
Kam-Fai Wong, Chinese University of Hong Kong, China
Masaharu Yoshioka, Hokkaido University, Japan

Introduction and Overview

This workshop is the third on the topic of multilingual information access held at a SIGIR conference this decade. The first was at SIGIR 2002 on the topic of “Cross Language Information Retrieval: A Research Roadmap”. The second was at SIGIR 2006 on the topic of “New Directions in Multilingual Information Access”. Over the past decade the field has matured and real-world applications have appeared. Our goal in this 2009 workshop was therefore to collate experiences and plans for the real-world application of multilingual technology to information access. Our aim is to identify the remaining barriers to practical multilingual information access, both technological and from the point of view of user interaction. We were fortunate to have as invited keynote speaker Dr. Ralf Steinberger of the European Commission's Joint Research Centre, who presented the Centre's multilingual media monitoring and analysis applications, including NewsExplorer. Dr. Steinberger has written an overview paper about this family of applications, which is the first paper in these proceedings.

In our call for papers we specified two types of papers, research papers and position papers. Of the 15 papers initially submitted, two were withdrawn and two were rejected. We accepted 3 research papers and 8 position papers, covering topics ranging from evaluation (of image indexing and of cross-language information retrieval in general) to trust, news site characterization, multilinguality in digital libraries, multilingual user interface design, access to less commonly taught languages (e.g. Indian subcontinent languages), implementation, and application to health care. We believe these papers represent a cross-section of the work remaining to be done in moving toward full information access in a multilingual world.

We would like to thank our program committee, who read papers and wrote reviews under major time constraints (in some cases two days).

Fredric Gey, Noriko Kando and Jussi Karlgren

TABLE OF CONTENTS

Paper title Page Authors (Affiliation)

An introduction to the Europe Media Monitor family of applications ……… 1 Ralf Steinberger, Bruno Pouliquen, Erik van der Goot (European Commission Joint Research Centre)

The Assignment of Tags to Images in Internet: Language Skill Evaluation for Tag Recommendation …………………………………………………………… 9 David Nettleton, Mari-Carmen Marcos (Pompeu Fabra University), Bartolomé Mesa-Lao (Autonomous University of Barcelona)

Multilingual Wikipedia, Summarization, and Information Trustworthiness ... 19 Elena Filatova (Fordham University)

NSContrast: An Exploratory News Article Analysis System that Characterizes the Differences between News Sites ……………………………...………………….… 25 Masaharu Yoshioka (Hokkaido University)

What’s wrong with Cross-Lingual IR? …………………………………………… 31 John I. Tait (Information Retrieval Facility)

A Virtual Evaluation Track for Cross Language Link Discovery ………………. 33 Wei Che (Darren) Huang (Queensland University of Technology), Andrew Trotman (University of Otago), Shlomo Geva (Queensland University of Technology)

Towards an Integrative Approach to Cross-Language Information Access for Digital Libraries ……………………………………………………………………. 41 Jiangping Chen, Miguel E Ruiz (University of North Texas)

Ubiquity: Designing a Multilingual Natural Language Interface ……………… 45 Michael Yoshitaka Erlewine (Mozilla Labs)

Romanization – An Untapped Resource for Out-of-Vocabulary MT for CLIR .. 49 Fredric Gey (University of California, Berkeley)

Urdu is not Hindi for Information Access ………………………………………… 53 Kashif Riaz (University of Minnesota)

A Patient Support System based on Crosslingual IR and Semi-supervised Learning ……………………………………………………………………………… 59 Hideki Isozaki, Tsutomu Hirao, Katsuhito Sudoh, Jun Suzuki, Akinori Fujino, Hajime Tsukada, Masaaki Nagata (NTT Corporation)

An introduction to the Europe Media Monitor family of applications

Ralf Steinberger, Bruno Pouliquen & Erik van der Goot
European Commission Joint Research Centre (JRC)
21027 Ispra (VA), Italy
Tel: (+39) 0332 78 5648
[email protected]

ABSTRACT

Most large organizations have dedicated departments that monitor the media to keep up-to-date with relevant developments and to keep an eye on how they are represented in the news. Part of this media monitoring work can be automated. In the European Union with its 23 official languages, it is particularly important to cover media reports in many languages in order to capture the complementary news content published in the different countries. It is also important to be able to access the news content across languages and to merge the extracted information. We present here the four publicly accessible systems of the Europe Media Monitor (EMM) family of applications, which cover between 19 and 50 languages (see http://press.jrc.it/overview.html). We give an overview of their functionality and discuss some of the implications of the fact that they cover quite so many languages. We discuss design issues necessary to be able to achieve this high multilinguality, as well as the benefits of this multilinguality.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]: Linguistic Processing, Thesauruses; H.3.3 [Information Search and Retrieval]: Clustering, Information Filtering; H.3.5 [Online Information Services]: Web-based Services

General Terms

Languages, Algorithms, Design, Theory.

Keywords

Media monitoring; multilinguality; cross-lingual information access; CLIA; information extraction; Europe Media Monitor.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR Workshop on Information Access in a Multilingual World. July 23, 2009, Boston, Massachusetts, USA.

1. INTRODUCTION

Most large organizations have dedicated departments that monitor the media to keep up-to-date with relevant developments and to keep an eye on how they are represented in the news. Specialist organizations, such as those that monitor threats to Public Health, monitor the multilingual media continuously for early warning and information gathering purposes. In Europe and large parts of the world, these organizations look at the news in various languages because the content found across different languages is complementary. In the context of the European Union, which has 23 official languages, multilinguality is a practical and a diplomatic necessity, and so is the related ability to access information across languages.

Current approaches to link and access related textual content across languages use Machine Translation, bilingual dictionaries, or bilingual vector space representations. These approaches are limited to covering a small number of languages. We present an alternative way of tackling this challenge. It consists of aiming at a language-neutral representation of the multilingual document contents.

The European Commission's Joint Research Centre (JRC) has developed a number of multilingual news monitoring and analysis applications, four of which are publicly accessible to the wider public. All four systems can be accessed through the common entrance portal http://press.jrc.it/overview.html.

Users of the systems are the EU Institutions, national organizations in the EU Member States (and in some non-EU countries), international organizations, as well as the public. The freely accessible websites receive between one and two million hits per day from thirty to fifty thousand distinct users.

The next section provides information on the news data that is being analyzed every day. Section 3 gives an overview of the four publicly accessible media monitoring systems and their specific functionalities. Section 4 aims to explain the specific approach which enabled the EMM family of applications to cover a high number of languages and to fuse information extracted from texts written in different languages. It also describes some of the multilingual and cross-lingual functionalities that we could only develop because we adopted this approach. The last section points to future work. Related work for each of the functionalities described will be discussed in the related sections.

2. The Europe Media Monitor news data

The Europe Media Monitor (EMM, see [1]) is the basic engine that gathers a daily average of 80 to 100,000 news articles in approximately 50 languages (status June 2009), from about 2,200 hand-selected web news sources, from a couple of hundred specialist and government websites, as well as from about twenty commercial news providers.


EMM visits the news web sites up to every five minutes to search for the latest articles. When news sites offer RSS feeds, EMM makes use of these; otherwise it extracts the news text from the often complex HTML pages. All news items are converted to Unicode. They are processed in a pipeline structure, where each module adds additional information. Whenever files are written, the system uses UTF-8-encoded RSS format.
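The following Python sketch (not the actual EMM implementation) illustrates the kind of gathering loop described above: poll a hand-picked list of sources, prefer RSS where it is offered, decode every item to Unicode, and pass it through a chain of modules that each add information. The source URL and the language-detection module are illustrative placeholders.

# A minimal sketch of an RSS-based gathering pipeline, assuming hypothetical
# source URLs and module names; EMM's real feed list and modules differ.
import urllib.request
import xml.etree.ElementTree as ET

SOURCES = ["http://example.org/news/rss"]          # hypothetical feed list

def fetch_rss_items(url):
    """Download one RSS feed and yield items as plain Unicode dicts."""
    with urllib.request.urlopen(url, timeout=10) as response:
        tree = ET.parse(response)                   # ElementTree yields str (Unicode)
    for item in tree.iter("item"):
        yield {"source": url,
               "title": (item.findtext("title") or "").strip(),
               "text": (item.findtext("description") or "").strip()}

def run_pipeline(item, modules):
    """Each module enriches the item dict; results would later be written as UTF-8 RSS."""
    for module in modules:
        item = module(item)
    return item

if __name__ == "__main__":
    detect_language = lambda item: {**item, "lang": "en"}   # placeholder module
    for url in SOURCES:
        for item in fetch_rss_items(url):
            print(run_pipeline(item, [detect_language]))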

3. The Europe Media Monitor applications

The EMM news gathering engine feeds its articles into the four fully-automatic public news analysis systems, and to their non-public sister applications. The major concern of NewsBrief and MedISys (see [18]) is breaking news and short-term trend detection, early alerting and up-to-date category-specific news display (Sub-section 3.1). NewsExplorer (3.2) focuses on daily overviews, long-term trends, in-depth analysis and extraction of information about people and organizations. EMM-Labs (3.3) is a collection of more recent developments and includes various tools to visualize the extracted news data.

For NewsBrief and MedISys, there are different access levels, distinguishing the entirely public web sites from an EC-internal website. The public websites do not contain commercial sources and may have slightly reduced functionality.

Figure 1. Map showing the major locations mentioned in news clusters. Different colors indicate different languages.

3.1 Live news monitoring and breaking news detection in NewsBrief and MedISys

All EMM news items get fed into NewsBrief and into the Medical Information System MedISys as soon as they come in. While NewsBrief is a wide-coverage monitoring tool covering the interests of all users, MedISys sieves out those news reports that talk about potential Public Health threats, including those of chemical, biological, radiological and nuclear origin (CBRN).

Every ten minutes and in each of the languages, both applications cluster the latest news items (four-hour window or more, depending on the number of recent articles) and present the largest cluster as the current top-ranking media theme (Top Stories). The title of the cluster's medoid (the article closest to the cluster centroid) is selected as the most representative title and thus as the title for the cluster. All current clusters are compared to all the clusters produced in the previous round. If at least 10% of the articles overlap between a new cluster and any of the previous ones, the clusters get linked and those articles that have fallen out of the current 4-hour window get attached to the current cluster. The public web pages are updated every ten minutes so that users always see the latest news of the fastest news providers.

Larger new clusters (without overlap to previous clusters) and clusters of a rapidly rising size get automatically classified as breaking news so that subscribed users will be notified by email. The statistical breaking news detection algorithm makes use of information on the number of articles and on the number of different news sources, comparing the news of the last 30 minutes with longer periods of time.

Each article is geo-tagged, i.e. potential place names are identified and ambiguities are resolved (for a description of such ambiguities, see Section 4.2). An algorithm that considers the place hierarchy (a city is part of a region, which is part of a country) and the frequency of mentions determines the major location in a cluster. This is used to visualize the location of the current news items on a geographical map (see Figure 1).

All news items are additionally categorized into hundreds of categories. Categories include geographic regions such as each country of the world, organizations, themes such as natural disasters or security, and more specific classes such as earthquake, terrorism or tuberculosis. Articles fall into a given category if they satisfy the category definition, which consists of Boolean operators with optional vicinity operators and wild cards. Alternatively, cumulative positive or negative weights and a threshold can be used. Uppercase letters in the category definition only match uppercase words, while lowercase words in the definition match both uppercase and lowercase words. Many categories are defined with input from the institutional users themselves.

The system keeps statistics on the 14-day average number of articles falling into any given country-category combination (e.g. Poland-tuberculosis). If the number of articles for this combination found in the last 24 hours (normalized by weekday fluctuations) is significantly higher than this average, a country-category-specific alert is triggered and users are notified using alert ranking graphs (see Figure 2).

As the categories are defined in many different languages (depending on the user interests, some are defined in all languages, others in only a few), the statistics are fed by all languages and are thus not language-dependent. This means that the sensitive alerting tool will detect a sudden rise in any of the languages, meaning that users may see an alert even before the event is reported in their own language. For humanitarian and public health institutions whose main concern is early warning and rapid reaction, this is a highly appreciated functionality.
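As an illustration of the alerting statistics just described, the sketch below compares the weekday-normalized article count of the last 24 hours for one country-category combination against its 14-day average. The two-standard-deviation threshold and the weekday factor are assumptions for the example, not the parameters used by EMM.

# Hedged sketch of country-category alert detection; thresholds are illustrative.
from statistics import mean, stdev

def is_alert(daily_counts_14d, count_last_24h, weekday_factor=1.0, z_threshold=2.0):
    """daily_counts_14d: article counts for one country-category pair over 14 days."""
    normalized = count_last_24h / weekday_factor
    avg = mean(daily_counts_14d)
    sd = stdev(daily_counts_14d) or 1.0
    return (normalized - avg) / sd > z_threshold

# Example: a pair averaging ~2 articles/day suddenly receives 9 articles.
print(is_alert([2, 1, 3, 2, 2, 1, 2, 3, 2, 1, 2, 2, 3, 2], 9))   # -> True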


Figure 2. Graph showing the highest alert levels in the latest news, from 'red' (left) to 'yellow' and 'blue' (right). Each bar stands for a different country-category combination.

3.2 Trend monitoring and information extraction in NewsExplorer

NewsExplorer takes as input the non-commercial EMM articles in 19 languages collected within a calendar day, clusters them separately for each of its languages and displays the major news per day, ordered by cluster size. Languages covered include 14 EU languages, and additionally Arabic, Farsi, Russian, Turkish and Norwegian. NewsExplorer applies a number of text mining tools to detect names of persons, organizations and locations and displays those names found in the course of each day. For each cluster, a dedicated web page gets created, where the same type of information is displayed specifically for this cluster. It furthermore detects reported speech quotations by people (X said “…”) and about people (Y said “…X…”), as well as titles of persons used in the media (e.g. former foreign minister, playboy, 58-year-old).

Many persons are referred to using variant spellings, not only across writing systems (e.g. Greek Γκέρχαρντ Σρέντερ, Russian Герхард Шредер, and German Gerhard Schröder all refer to the former German Chancellor) and across languages (e.g. English Vladimir Ustinov and German Wladimir Ustinow), but also within the same language. For instance, the following variants were found within English language news: Schröder, Schroeder, Schröeder. NewsExplorer aims to detect these variants as belonging to the same person. It additionally retrieves more name variants (e.g. Chinese, Japanese and Thai) from Wikipedia, resulting in up to 170 name variants for the same person. Figure 3 shows some name variants for the current US President. Users searching the system for any of these will also find all articles using any of the alternative spellings. As of June 2009, the name database contains 900,000 names plus about 170,000 variants. Names from the name database are exported daily into a finite-state automaton so that these known names will also be found in EMM's live news monitoring systems described in Section 3.1.

For each person or organization name (referred to as entity) detected in at least two different news articles in the same cluster, NewsExplorer displays all the extracted meta-information on individual news pages, one per entity. NewsExplorer also computes which entities get frequently mentioned together (displayed as related names) and – more interestingly – which entities get mentioned frequently with the first person but do not have much media attention for themselves, using a TF.IDF-like formula (see [15]). This latter group of persons, referred to as associated names, typically consists of close contacts such as relatives, personal spokespersons or secretaries, etc. An interactive visualization tool allows users to display entities, associated names, and names jointly associated to various entities (see Figure 4).

Figure 4. Visualizing relations between associated people.
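The exact TF.IDF-like weighting behind the associated names is given in [15]; the sketch below only illustrates the general idea under stated assumptions: favor entities that co-occur often with the target person but appear in few clusters overall. Function names and the scoring details are illustrative, not the NewsExplorer formula.

# Hedged sketch of an associated-names score (an approximation, not the formula from [15]).
import math

def associated_names(target, cluster_entities, top_n=10):
    """cluster_entities: list of sets of entity ids, one set per news cluster."""
    n_clusters = len(cluster_entities)
    cluster_freq = {}                     # in how many clusters each entity appears
    cooc = {}                             # co-occurrence counts with the target
    for entities in cluster_entities:
        for e in entities:
            cluster_freq[e] = cluster_freq.get(e, 0) + 1
        if target in entities:
            for e in entities - {target}:
                cooc[e] = cooc.get(e, 0) + 1
    scores = {e: c * math.log(n_clusters / cluster_freq[e]) for e, c in cooc.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]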

Figure 3. Some of the name variants found for Barack Obama.

For all the daily clusters, NewsExplorer establishes and displays links to related clusters found in the previous days. Many events or themes capture the media's attention for a longer period, such as the Israel-Palestine or the Iran-nuclear conflicts. NewsExplorer links all these historically related news clusters into stories. For each story, an interactive graph visually presents media reporting activity over time (Figure 5) and displays story-specific meta-information, such as names, countries and keywords extracted from all the – sometimes thousands of – related news articles.

Together with the name variant mapping tool (see also 4.3.3), probably the most interesting feature of NewsExplorer is the cross-lingual cluster linking functionality: for each cluster, users can – with a simple click of the mouse – jump straight to the equivalent news reports in the other languages.


Figure 5. Interactive graph showing a 'story' timeline on the 2008 Iran elections. Numbers of daily articles increased drastically when the post-election clashes broke out in June.

NewsExplorer establishes links between clusters in different languages if the multilingual clusters have a minimum cross-lingual similarity based on subject domains, person and organization names, locations and cognates (words that are the same across languages). This application is described in Section 4.3.1.

3.3 Further visualization and analysis tools in EMM-Labs

EMM-Labs is a loose collection of further media-based text mining and interactive visualization tools. It includes generic or topic-related geographical news maps (e.g. on swine flu), various types of statistics on EMM categories over time and across languages, and social network browsers showing generic co-occurrence frequency relations between persons or specific extracted relations such as support, criticism and family relations.

The news cluster navigation tool allows an interesting alternative view of the news: it graphically shows an interactive network of the latest news clusters, the most active categories at the moment, and the countries most mentioned in the news right now, with stronger and weaker links between the nodes belonging to any of the three groups (see Figure 6). The interactive interface of the tool lets users decide on various thresholds (numbers of clusters displayed, percentage of articles of a cluster that need to be classified according to country or category for the link to be displayed, etc.).

Figure 6. Cluster browser visually presenting current news clusters and related country categories.

The event extraction tool detects violent events, accidents, natural and humanitarian disasters and more in the live news clusters (fed by NewsBrief) in the languages English, Russian, French, Italian and Spanish, as well as in Arabic (after an analysis of the English text machine-translated from Arabic). The tool detects and displays information on the event type, on the location of the event, as well as on the number and status of the victims (see Figure 7 and Figure 8). The displayed facts are a best-guess combination of the information found in all the articles of the news cluster. For details on this application, see [2].

Figure 7. Map showing events involving victims, extracted from the news in 6 languages.

Figure 8. Event types recognized by the system.

4. Approach to multilingual and cross-lingual information processing

Since their conception, the prerequisite for EMM applications was that they should be highly multilingual, aiming at all EU languages and possibly more. This challenging objective was to be achieved with a relatively small-sized developer team, so it was clear that any components and processes needed to be simple and, ideally, language-independent. While – to some extent – the simplicity was a hindrance as it did not allow us to consider many language-specific features, the achieved multilinguality brought many advantages not usually achieved by systems covering one or only a few languages. In this section, we want to highlight some of the methods used and describe some of the benefits of multilinguality in EMM.


4.1 Country and category classification across languages

Category definitions in NewsBrief and in MedISys are multilingual. News items from different languages thus get categorized into exactly the same classes. While we use the English name as the class label, the classes themselves are thus being filled with news articles from many different languages. This has the advantage that trend graphs showing an increase in articles per category may alert users about increased activity in a specific field even before news articles in their own language are published (see Figure 2 and the explanations in Section 3.1). The categorization of texts according to the countries mentioned is treated in exactly the same way, i.e. there is one EMM class for each country of the world. This makes it possible to display multilingual country-category information (the intersection of any country category with any subject domain category) on maps (see Figure 9). Note that the geo-tagging application described in Section 4.2 is performed independently of this country categorization.

Figure 9. World map showing countries mentioned in the context of Malaria.

The method to categorize the articles, described in Section 3.1, is rather simple and user-friendly, and it lends itself to dealing with many languages. EMM uses only one definition per category for all languages combined, meaning that English, French, Russian, etc. definition words are all mixed inside the same definition and are thus all searched in all languages. This has the obvious disadvantage that false friends (same words meaning different things across languages) may sometimes lead to false positives. For instance, users searching for the disease symptom 'pain' may erroneously receive French articles about bread, as the French word 'pain' means bread. However, these multilingual definition lists also have advantages: (a) they are very easy to maintain – which is not a negligible criterion when dealing with hundreds of categories in about 50 languages; (b) they automatically benefit from cognate words that are the same across many languages (e.g. tsunami), or that are almost the same, so that one simple search word with wildcards will often allow to match the words in various languages (e.g. tuber_ul% will match English tuberculosis, German Tuberkulose, French tuberculose and tuberculeux, etc.). The obvious alternative to producing hand-crafted category definitions would theoretically be the usage of machine learning methods for categorization (e.g. using Support Vector Machines or Naïve Bayes), but most EMM users would not be willing to provide training examples in all the languages. For many categories, such as diseases, organizations or subjects (such as genetically modified organisms, GMO), it is much easier to gather multilingual search terms from Wikipedia or to ask colleagues who speak the languages of interest for help. Users regularly request the monitoring of new news categories, and due to the simplicity of the approach, first versions of the category definitions can often be produced within a day.

4.2 Multilingual information extraction using mostly language-independent rules

Developing text mining components such as tools to extract structured information from unstructured free text typically requires writing rules and using dictionaries. Developing such tools for many languages is rather labor-intensive. Development time can be saved by re-using the linguistic resources of one language to develop the resources for the other languages (see, e.g. [4], [10], [9]). However, even adapting existing resources to new languages is still rather time-consuming. In NewsExplorer, which carries out Information Extraction (IE) for 19 languages, we thus aimed at using the same, language-independent rules for all languages, and at storing any language-dependent information in language-specific parameter files. From the conception phase of any text analysis tool development in NewsExplorer, the multilinguality requirement is an unmovable premise. For details, and for examples of how these principles were realized in eight different text analysis applications, see [20].

A simplified example of such a language-independent rule is the quotation recognition pattern below, which extracts reported speech and the person issuing the quotation, by making use of the slots name, reporting-verb and quote-mark. The set of possible quote marks (such as <<, >>, “, ”, etc.) is mostly the same across languages. The English language-specific parameter file includes lists of reporting verbs such as said, reported, added, etc. and their morphological variants (e.g. says, was saying). Person names are identified and marked up using a separate tool, which is based on the same principles. Further slots (not shown here) may include modifier and auxiliary-verb for such a rule. Elements within square brackets are optional.

name [, up to 60 chars ,] reporting-verb [:|that] quote-mark QUOTE quote-mark

e.g. John Smith, supporting AFG, said: "They are the best!".
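The rule above can be approximated with a single regular expression once the language-specific parameter file (reporting verbs, quote marks) is plugged in. The sketch below is such an approximation for English; the crude name pattern stands in for EMM's separate name recognition tool, and the verb and quote-mark lists are merely examples of what a parameter file might contain.

# A runnable approximation of the quotation rule, under the assumptions stated above.
import re

REPORTING_VERBS = ["said", "says", "was saying", "reported", "added"]   # toy English parameter file
QUOTE_MARKS = ['"', '\u201c', '\u201d', '\u00ab', '\u00bb', '<<', '>>']

PATTERN = re.compile(
    r"(?P<name>[A-Z]\w+(?:\s[A-Z]\w+)+)"          # crude person-name placeholder
    r"(?:,.{0,60},)?\s*"                          # optional modifier between commas
    r"(?P<verb>" + "|".join(REPORTING_VERBS) + r")"
    r"\s*(?::|that)?\s*"
    r"(?:" + "|".join(map(re.escape, QUOTE_MARKS)) + r")"
    r"(?P<quote>.+?)"
    r"(?:" + "|".join(map(re.escape, QUOTE_MARKS)) + r")")

text = 'John Smith, supporting AFG, said: "They are the best!"'
m = PATTERN.search(text)
if m:
    print(m.group("name"), "->", m.group("quote"))   # John Smith -> They are the best!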


Another example of language-independent rules is the geo-tagging application, which includes geo-parsing (identification of potential geographic references in text) and disambiguation. Geo-tagging requires a gazetteer (containing lists of locations and their geo-coordinates), which is thus an unavoidable linguistic resource. Disambiguation is necessary due to various types of homography, i.e. between locations and persons (e.g. Paris vs. Paris Hilton), between locations and generic words (e.g. Split is a city in Croatia) and between locations which share the same name (e.g. there are 15 locations world-wide called Paris and there are 32 places called Washington). The geo-disambiguation rules in NewsExplorer are language-independent, as they make use of features such as the information whether an entity has previously been tagged as being a person or organization name, gazetteer-provided information on the size class of the locations (capital, major city, city, town, …), information on the country of origin of the news source, information on other potential geographic references in the same text, as well as the kilometric distance between the ambiguous place and other, non-ambiguous places in the same text. See [13] for details. Apart from the gazetteer, the only language-specific information is a geo-stop word list, which can be created within a few hours of work. Such a geo-stop word list is used to exclude false positives for locations like And and By (locations in Iran and in Sweden, respectively), which are homographic with high-frequency words of a language. By keeping language-specific information out of the disambiguation rules, any new language can be plugged into the system if a gazetteer for this language is available.

4.3 Cross-lingual information fusion across over one hundred language pairs

Examples of cross-lingual information access applications are cross-lingual information retrieval (CLIR), cross-lingual information access (CLIA, e.g. cross-lingual glossing, or the linking of related documents across languages), cross-lingual name variant matching (identifying that Ali Chamenei, Alí Jamenei and Али Хаменеи are the same name), cross-lingual plagiarism detection, multilingual summarization and machine translation (MT). Known approaches to these applications make use of MT (e.g. [8]) or of different kinds of bilingual dictionaries or thesauri (manually produced or machine-generated, e.g. [24]). An alternative approach consists of producing bilingual word associations by feeding pieces of parallel text to systems applying Latent Semantic Analysis (LSA, [6]) or Kernel Canonical Correlation Analysis (KCCA, [23]). All of these approaches have in common that they are basically bilingual ([3] being a notable exception), meaning that multilingual applications are actually multi-bilingual. In a highly multilingual environment with N languages, there is a vast number of language pairs, namely N*(N-1)/2. In NewsExplorer, which covers 19 languages, there are thus 171 language pair combinations, and 342 language pair directions. A multi-bilingual approach is clearly impractical.

4.3.1 Components used in NewsExplorer

For NewsExplorer's components that link related documents across languages or that fuse information found in different languages, the only possible solution was thus to work with an interlingua type of approach, i.e. representing contents in a language-neutral way and mapping contents in each of the languages (news articles, news clusters, or extracted pieces of information) to this language-independent representation. The following representations are used in NewsExplorer (for details, see [14]): (a) a vector of multilingual subject domain representations using the Eurovoc thesaurus (see [16] and Section 4.3.2); (b) a frequency list of locations per document, represented by location identifiers and their latitude-longitude information (see [13] and Section 4.3.3); (c) a frequency list of person and organization name identifiers (also described in Section 4.3.3); (d) monolingual weighted lists of keywords. Identifiers (b) and (c) each represent a whole range of multilingual variants. Sections 4.3.2 and 4.3.3 give details on these four ingredients for cross-lingual cluster linking.

Further possible alternatives for a language-neutral representation (not currently exploited) are normalized expressions for dates or measurement units (speed, acceleration, time, etc.), subject-specific multilingual nomenclatures such as the Medical Subject Headings MeSH or the Customs Tariff Code TARIC, and more (see [19] for further ideas and more detail).

4.3.2 Multilingual subject domain representation

The idea of ingredient (a) in 4.3.1 was thus to represent the contents of documents by a ranked list of subject domains (also referred to as topic maps), using a multilingual classification of subject domains that are the same across languages. The only resource we are aware of that is available to be used for such a purpose is Eurovoc (see http://europa.eu/eurovoc/). Eurovoc is a wide-coverage thesaurus with over 6,000 classes that was developed for human subject domain classification in the European Commission and in parliaments in the European Union and beyond. As Eurovoc has been used for many years, tens of thousands of documents exist that have been labeled manually by librarians. The documents in 22 languages which are part of the freely available multilingual parallel corpus JRC-Acquis ([21]) are accompanied by their respective Eurovoc codes. The classifier that assigns Eurovoc codes to news clusters in NewsExplorer was developed by applying machine learning techniques, using these manually classified documents. The major challenges of this task were that this is a multi-label task (each manually classified document belongs on average to 5.6 classes) involving a large number of categories (more than 3,000 of the 6,000 Eurovoc classes are actively used) with a highly imbalanced category distribution and heterogeneous text types. The adopted approach, described in [16], produces a ranked and weighted list of (numerical) Eurovoc subject domain codes. In NewsExplorer, the cosine similarity is applied to compare this language-independent subject domain representation for the daily news clusters in one language with those in other languages. This is the major ingredient for the cross-lingual cluster similarity measure in NewsExplorer.
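A minimal sketch of this comparison step, under stated assumptions: each cluster is reduced to a weighted vector of numerical Eurovoc codes, and clusters written in different languages are compared with the cosine similarity. The example codes and weights are invented for illustration.

# Cosine similarity between sparse Eurovoc-code vectors (illustrative data).
import math

def cosine(u, v):
    """u, v: dicts mapping Eurovoc code -> weight (a sparse vector)."""
    dot = sum(w * v.get(code, 0.0) for code, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

english_cluster = {1309: 0.8, 2771: 0.5, 4470: 0.2}   # hypothetical ranked Eurovoc codes
spanish_cluster = {1309: 0.7, 2771: 0.4, 1026: 0.3}
print(round(cosine(english_cluster, spanish_cluster), 3))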

4.3.3 Language-neutral entity vectors

The second most important ingredient for cross-lingual cluster linking in NewsExplorer ((b) in 4.3.1) is the similarity based on geo-references. We described in Section 4.2 the principles behind our geo-tagging application, which recognizes potential location names in free text and resolves any ambiguities using language-independent rules. This tool produces, for each document in each of the languages, a frequency list of locations including the country they belong to. At cluster level, NewsExplorer aggregates this information by counting the references to each country (e.g. one mention of the Turkish city of Izmir will add one count to the country count of Turkey, which is itself represented by the language-neutral country ISO code) and produces a frequency count of (direct or indirect) country mentions. The cosine similarity is applied to the country vector representation of each pair of clusters in the different languages to calculate the value of the second ingredient.

The third ingredient vector ((c) in 4.3.1) consists of a frequency list of normalized person or organization names, where the multilingual name variants for the same entity are all represented by one language-neutral numerical code (see Figure 3). Named-entity recognition rules are applied to all the news in all the NewsExplorer languages.


The rules were developed according to the principles laid out in Section 3.2. These rules identify name mentions in each of the languages. If the name variants found are already known, the names will be represented by the language-neutral entity identifier. If the name mention found is a previously unknown name, a name variant matching procedure will be launched. This procedure first transliterates non-Latin script names into the Latin character set, using standard hand-crafted transliteration rules. It then normalizes all names by applying about 30 different rules, which try to map language-specific spelling differences onto one canonical form. Examples of such empirically found differences are, for instance, that diacritics of foreign names are often omitted (e.g. outside Poland, the name Lech Wałęsa is usually spelt as Walesa) and that consonants are frequently doubled or singled (e.g. Mohammed vs. Mohamed and Barack vs. Barrak, see Figure 3). Russian names ending in -ов (-ov) may furthermore be spelled as -ev, -ow, -ew, etc., depending on the target language of the transliteration. After applying these normalization steps, which aim at mapping frequent spelling variations to one canonical spelling, all vowels are removed because vowels often differ, especially for transcriptions of Arabic names. The result of the name normalization is the canonical form of this name (e.g. mhmd hmdnjd for Mahmoud Ahmadinejad, and kndlz rc for Condoleezza Rice).

The canonical form of each newly found name is then compared to the canonical form of all known names and their variants. If the two canonical forms are the same, the Levenshtein edit distance is applied to two different representations of the name: (1) the found surface string of a name or the transliteration result, and (2) the normalized form with vowels. If the combined similarity between two names is above an empirically established threshold, the name variants get automatically merged and the new name becomes a known name variant of an existing name. In the course of five years of daily analysis, NewsExplorer has altogether identified over one million names and name variants, with as many as 170 different name variants for the same person. For a detailed description of this name variant matching approach, see [12]. Unlike other known methods to map names across scripts ([7], [5]) or to compute string similarity measures ([11]), the presented algorithm is not multi-bilingual, but uses a single representation for all languages. New languages can either be added without any additional steps, or one or more new normalization rules can be added to the set of normalization rules. All normalization rules are applied to all languages. For the purposes of creating the third ingredient for the cross-lingual cluster similarity calculation, each news cluster is represented by a frequency list of all known or new names, with all name variants being represented by the same numerical entity identifier. Each daily news cluster will thus again be compared to all the clusters in the other languages, using the cosine similarity measure.

The fourth ingredient for cross-lingual cluster similarity is simply a log-likelihood-weighted monolingual list of words found in each cluster. The cosine similarity is again applied to compare the vector representations in the different languages. While most of the words are of course different across languages, this fourth similarity still adds to the other three similarities as it considers cognates, names, name parts, numbers and acronyms.

The four ingredients are combined with a relative weight of 0.4, 0.3, 0.2 and 0.1. These weights were set intuitively and after comparing various examples manually, but they have not yet been confirmed empirically. See [14] and [20] for more detail.
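The sketch below illustrates, under stated assumptions, the two mechanisms just described: a toy canonicalization with only two of the roughly 30 normalization rules (consonant de-doubling and vowel removal), and the weighted combination of the four ingredient similarities. Function names and the reduced rule set are illustrative only.

# Hedged sketch of name canonicalization and ingredient weighting (toy rules, weights from the text).
import re

def canonical_form(name):
    """Lowercase, collapse doubled consonants, strip vowels (toy rule set)."""
    s = name.lower()
    s = re.sub(r"([b-df-hj-np-tv-z])\1", r"\1", s)   # Mohammed -> mohamed
    s = re.sub(r"[aeiou]", "", s)                    # mohamed  -> mhmd
    return s

def combined_cluster_similarity(sims):
    """sims: per-ingredient cosine similarities between two clusters."""
    weights = {"eurovoc": 0.4, "countries": 0.3, "entities": 0.2, "words": 0.1}
    return sum(weights[k] * sims.get(k, 0.0) for k in weights)

print(canonical_form("Mohammed"), canonical_form("Mohamed"))   # both 'mhmd'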
4.3.4 Fusion of information about entities found in documents written in different languages

The EMM system's information extraction components extract different types of information about persons and organizations, including: (a) name variants, (b) name attributes such as titles (e.g. president, stuntman, 58-year-old), (c) news clusters in which the entity was mentioned, (d) stories in which the person was mentioned, (e) quotations by and about the person, (f) co-occurrence information (pure frequency and a weighted value) for entities being mentioned in the same clusters, (g) labeled relations between persons (English only), namely relations of support, criticism, contact and family relation. EMM applications produce daily and long-term social networks of different types ([17], [22]), based on weighted co-occurrence, based on labeled relations, and based on who mentions whom in reported speech.

As each entity (independently of the name spelling) is represented by one numerical identifier, the information extracted from any of the input languages can be displayed together. To see examples of information gathered across many languages about the same entity, go to NewsExplorer and click on any of the entity names. An additional benefit of having captured the multilingual name variants is that the generated social networks are fed by news articles in many different languages, so that they are independent of the reporting bias of one country or reporting language. Instead, they represent a rather rounded multilingual and multi-national picture of the relations between persons.

5. Summary and Future Work

We have given an overview of the functionality of the EMM family of media monitoring applications (Section 3) and we have tried to highlight two issues regarding the fact that EMM covers many different languages (Section 4). First, working on many languages while being a small team automatically forced us to use simple means. We tried to show how such simple means could be used to develop relatively complex and powerful applications. Second, we tried to give some examples of a number of applications benefitting from this multilingual news processing.

Currently ongoing work focuses on (a) ensuring a tighter integration of the various applications and tools, (b) adding opinion mining functionality to the existing information extraction components, (c) producing multi-document summaries of the various news clusters in order to notify users more efficiently of breaking news events, and (d) adding blogs as a new text type to the current news monitoring systems. As for the existing applications, the challenge is to find methods that can be applied to many different languages.

6. ACKNOWLEDGMENTS

The EMM applications have been developed over many years, by many people. We want to thank all contributors for their dedicated work. The following persons should be mentioned explicitly: Flavio Fuart helped to make the EMM output available on robust web pages; Martin Atkinson produced several of the graphical tools; and Jenya Belyaeva contributed with her knowledge of many languages and her thorough quality control.


7. REFERENCES

[1] Atkinson M. & Van der Goot E. (2009). Near Real Time Information Mining in Multilingual News. Proceedings of the 18th International World Wide Web Conference (WWW'2009), pp. 1153-1154. Madrid, 20-24 April 2009.

[2] Atkinson M., Piskorski J., Pouliquen B., Steinberger R., Tanev H. & Zavarella V. (2008). Online-monitoring of security-related events. In Proceedings of the 22nd International Conference on Computational Linguistics (CoLing'2008). Manchester, UK, 18-22 August 2008.

[3] Chew P. & Abdelali A. (2007). Benefits of the 'Massively Parallel Rosetta Stone': Cross-Language Information Retrieval with over 30 Languages. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Prague, Czech Republic, June 2007.

[4] Gamon M., Lozano C., Pinkham J. & Reutter T. (1997). Practical Experience with Grammar Sharing in Multilingual NLP. In Proceedings of ACL/EACL, Madrid, Spain.

[5] Jong-Hoon O. & Key-Sun C. (2002). An English-Korean transliteration model using pronunciation and contextual rules. Proceedings of ACL.

[6] Landauer T. & Littman M. (1991). A Statistical Method for Language-Independent Representation of the Topical Content of Text Segments. Proceedings of the 11th International Conference 'Expert Systems and Their Applications', vol. 8, pp. 77-85. Avignon, France.

[7] Lee C.-J., Chang J. S. & Jang J.-S. R. (2006). Extraction of transliteration pairs from parallel corpora using a statistical transliteration model. Information Sciences 17, 6, pp. 67-90.

[8] Leek T., Jin H., Sista S. & Schwartz R. (1999). The BBN Crosslingual Topic Detection and Tracking System. In 1999 TDT Evaluation System Summary Papers.

[9] Manny R. & Bouillon P. (1996). Adapting the Core Language Engine to French and Spanish. Proceedings of the International Conference NLP+IA, pp. 224-232, Moncton.

[10] Pastra K., Maynard D., Hamza O., Cunningham H. & Wilks Y. (2002). How feasible is the reuse of grammars for Named Entity Recognition? Proceedings of LREC, Las Palmas, Spain.

[11] Piskorski J., Wieloch K. & Sydow M. (2009). On Knowledge-poor Methods for Person Name Matching and Lemmatization for Highly Inflectional Languages. In: Lazarinis F., Vilares J., Tait J. & Efthimiadis E. N. (eds.): Journal of Information Retrieval, Special Issue on non-English Web Retrieval. Springer, The Netherlands.

[12] Pouliquen B. & Steinberger R. (2009). Automatic Construction of Multilingual Name Dictionaries. In: Goutte C., Cancedda N., Dymetman M. & Foster G. (eds.): Learning Machine Translation, pp. 59-78. MIT Press - Advances in Neural Information Processing Systems Series (NIPS).

[13] Pouliquen B., Kimler M., Steinberger R., Ignat C., Oellinger T., Blackler K., Fuart F., Zaghouani W., Widiger A., Forslund A-C. & Best C. (2006). Geocoding multilingual texts: Recognition, Disambiguation and Visualisation. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), pp. 53-58. Genoa, Italy, 24-26 May 2006.

[14] Pouliquen B., Steinberger R. & Ignat C. (2004). Automatic Linking of Similar Texts Across Languages. In: Nicolov N., Bontcheva K., Angelova G. & Mitkov R. (eds.): Current Issues in Linguistic Theory 260 - Recent Advances in Natural Language Processing III, Selected Papers from RANLP'2003. John Benjamins Publishers, Amsterdam.

[15] Pouliquen B., Steinberger R., Ignat C. & Oellinger T. (2006). Building and displaying name relations using automatic unsupervised analysis of newspaper articles. Proceedings of the 8th International Conference on the Statistical Analysis of Textual Data (JADT'2006). Besançon, 19-21 April 2006.

[16] Pouliquen B., Steinberger R. & Ignat C. (2003). Automatic Annotation of Multilingual Text Collections with a Conceptual Thesaurus. In: Proceedings of the Workshop Ontologies and Information Extraction at the Summer School The Semantic Web and Language Technology - Its Potential and Practicalities (EUROLAN'2003). Bucharest, Romania, 28 July - 8 August 2003.

[17] Pouliquen B., Tanev H. & Atkinson M. (2008). Extracting and Learning Social Networks out of Multilingual News. Proceedings of the Social Networks and Application Tools Workshop (SocNet-08), pp. 13-16. Skalica, Slovakia, 19-21 September 2008.

[18] Steinberger R., Fuart F., van der Goot E., Best C., von Etter P. & Yangarber R. (2008). Text Mining from the Web for Medical Intelligence. In: Fogelman-Soulié F., Perrotta D., Piskorski J. & Steinberger R. (eds.): Mining Massive Data Sets for Security, pp. 217-240. IOS Press, Amsterdam, The Netherlands.

[19] Steinberger R., Pouliquen B. & Ignat C. (2004). Providing cross-lingual information access with knowledge-poor methods. In: Brodnik A., Gams M. & Munro I. (eds.): Informatica, An International Journal of Computing and Informatics, Vol. 28-4, pp. 415-423. Special Issue 'Information Society in 2004'.

[20] Steinberger R., Pouliquen B. & Ignat C. (2008). Using language-independent rules to achieve high multilinguality in Text Mining. In: Fogelman-Soulié F., Perrotta D., Piskorski J. & Steinberger R. (eds.): Mining Massive Data Sets for Security, pp. 217-240. IOS Press, Amsterdam, The Netherlands.

[21] Steinberger R., Pouliquen B., Widiger A., Ignat C., Erjavec T., Tufiş D. & Varga D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), pp. 2142-2147. Genoa, Italy, 24-26 May 2006.

[22] Tanev H. (2007). Unsupervised Learning of Social Networks from a Multiple-Source News Corpus. Proceedings of the Workshop Multi-source Multilingual Information Extraction and Summarization (MMIES'2007) held at RANLP'2007, pp. 33-40. Borovets, Bulgaria, 26 September 2007.

[23] Vinokourov A., Shawe-Taylor J. & Cristianini N. (2002). Inferring a semantic representation of text via cross-language correlation analysis. Proceedings of Advances in Neural Information Processing Systems 15.

[24] Wactlar H.D. (1999). New Directions in Video Information Extraction and Summarization. In Proceedings of the 10th DELOS Workshop, Santorini, Greece.


The Assignment of Tags to Images in Internet: Language Skill Evaluation for Tag Recommendation

David Nettleton
Department of Information Technology and Communications, Pompeu Fabra University
Tanger, 122-140, 08018 Barcelona, Spain
+34 93 542 25 00
[email protected]

Mari-Carmen Marcos
Department of Journalism and Audiovisual Communication, Pompeu Fabra University
Roc Boronat, 138, 08018 Barcelona, Spain
+34 93 542 13 10
[email protected]

Bartolomé Mesa-Lao
Department of Translation and Interpreting, Autonomous University of Barcelona
Edifici K – Campus UAB, 08193 Barcelona, Spain
+34 93 581 1876
[email protected]

ABSTRACT

Users who tag images on the Internet using the English language can be native in that language or non-native. They can also have different levels of tagging skill, in semantic terms (richness of vocabulary) and syntactic terms (errors incurred while defining the tags). If we can identify the 'best' taggers, we can use their work to help taggers whose skills are not so good. In this paper we present a study carried out for native and non-native English language taggers, with the objective of providing user support depending on the detected language skills and characteristics of the user. In order to do this, we study the different types of syntactic errors that taggers commit, and analyze different semantic factors related to objective and subjective tagging, given that our hypothesis is that the latter is in general more difficult. We find that the syntactic and semantic factors together allow us to profile users in terms of their skill level. This would allow us to keep the tag sessions of the best users and provide their tags to users who have a lower skill level.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]: Indexing methods.

General Terms

Measurement, Experimentation, Human Factors.

Keywords

Image tagging, tag recommendation, user support, statistical analysis, user study.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR Workshop on Information Access in a Multilingual World, July 23, 2009, Boston, Massachusetts, USA.

1. INTRODUCTION

One of the characteristics of the Web 2.0 is that different web users are responsible for the authorship of the content online and for describing it. This Web, called the 'social web', allows any user to find what other users have uploaded, but for this to be possible it is necessary that both (the uploader and the searcher) have in mind similar concepts and words to name (i.e., to tag) each object (be it a picture, a video, a web site, a blog, etc.). The uploader describes content and the searcher formulates a query. In this study we assume that tags are influenced by the culture in which each person lives and, of course, by the language they use.

The current general solution to the problem of subjectivity in tags in a free indexing environment is so-called 'social indexing', in which many people index the same object. This results in a set of descriptors which form a 'folksonomy', which is intersubjective and therefore, one assumes, a very close approximation to the described object.

English is widely used on the Internet, although for many of the people who use English it is not their native language. In an image tagging context, when non-native English taggers propose tags for an image, due to their limited knowledge of the language, they may define incorrect tags or tags for which there exists a better word. In this paper, we will consider the difficulties for non-native English taggers, how to evaluate their language skill level, and how to offer them appropriate help, such as tag word recommendations and correction.

To this end, we analyze both syntactic and semantic errors in tags in order to detect the degree of skill of a tagger. We consider as syntactic errors: (i) descriptions too long, (ii) spelling errors or typos, (iii) incorrect format (illegal characters), and (iv) incorrect language. As positive semantic factors, we consider: richness (average number of distinct tags used); similarity (frequency of the tags used); and spontaneity (tags used as the first choice). This would classify users according to their syntactic and semantic competence when tagging content. Tags proposed by users with the best competence can be proposed to other users, both native and non-native.

Once our hypothesis was confirmed (i.e., that native speakers have better tagging competence than non-natives), we proceeded to a second phase to determine which type of content is more difficult to tag.


In this second phase, we also analyze the differences, by deriving factors, between how users tag objectively (using what we call 'see' type tags) and subjectively (using what we call 'evoke' type tags). The hypothesis is that 'evoke' (subjective) tags require more skill and knowledge of vocabulary than 'see' (objective) tags. Therefore, the tagger, and especially the non-native tagger, will require additional help for this type of tags. Image tagging normally does not explicitly differentiate between objective and subjective descriptions. It is assumed that the tagger classifies an image using his/her available vocabulary. But implicitly, the user will probably mix objective and subjective descriptions when tagging an image.

2. RELATED WORK

The so-called Web 2.0 or Social Web is one of the most recent phenomena of the Internet and has been studied in recent years [13][4][12][2][11]. In this new web, the users themselves can publish contents produced by themselves and compare them with the rest of the community. Therefore, the Social Web permits any person to search and recover what other users have published on the Internet. Also, in order to make this possible, we need both users (the user who publishes and the user who retrieves) to have in mind similar concepts to name the objects on the Internet (web pages, images, videos, etc.) which they wish to share. As a starting point, we assume that the words which are used to represent objects (the labels or tags) will be different depending on the culture of the users and the language they use to describe/recover the contents.

In the literature, studies exist of how a user expresses emotion when interacting with a computer interface [3][7]. In recent years tag recommendation has become an active area of applied research. Different approaches have been made to tag recommendation: based on collective knowledge [14], based on analysis of the images themselves [1][10], collaborative approaches [17][8], analysis of folksonomies [9], and systems based on personalization [5]. Although there are not many studies in the literature related to non-native users, we can cite [16] as an example. Finally, one example of a statistical modeling approach can be found in [15].

3. METHODOLOGY – DESIGN OF EXPERIMENTS FOR SKILL EVALUATION

3.1 Selection of the Images

For the present study, we have selected 10 photographs from Flickr. Each image has been selected for what seems to be its ability to require different tags for an 'objective' classification, with respect to a 'subjective' classification. Image 1 is of a person with his hands to his face; Image 2 is of a man and a woman caressing; Image 3 is of a small spider in the middle of a web; Image 4 is of a group of people dancing in a circle with a sunset in the background; Image 5 is of a lady holding a baby in her arms; Image 6 is of a boy holding a gun; Image 7 is of an old tree in the desert, bent over by the wind; Image 8 is of a hand holding a knife; Image 9 is a photo taken from above of a large cage with a person lying on its floor; finally, Image 10 is of a small bench on a horizon.

3.2 Creation of the Tag Session Web Site

A web site has been prepared which includes a questionnaire which allows the user to introduce his/her demographic data, their tags for the photographs (tag session) and some questions which the user answers after completing the session. The capture of tag sessions was carried out for native and non-native English, and the website can be found at: http://www.tradumatica.net/bmesa/interact2007/index_en.htm.

3.3 Tag Session Capture

A tag session requires the definition of between 4 and 10 tags which define the objects which are visible in the image. (This number of tags was assigned after evaluating tagging sessions carried out by users in Flickr.) The user then defines a new set of tags for the same image, but based on what sensations/emotions the image evokes (for that user). In Figure 1 we can see two columns: in the first column the user defines tags which express what is seen in the image; in the second column the user defines tags which describe what the image evokes. A total of 162 user tag sessions have been captured in the English language, from 2 different countries. For approximately half of the users, English is their native language, and for the other half (from Spain) it is a second language. The data collection was carried out between May 2007 and April 2008.

3.4 Data and Factors Derived for Analysis

For each user, we asked them (as part of the questionnaire) for some demographic information: age; gender; nationality; country (where they are tagging); native language. Using this information, together with the tags themselves of the images for each user tagging session, we derived new factors to differentiate the users and tag types.

3.4.1 Summarized and Detailed Descriptive Data about the Tag Session

For each tag session, we collect the following information: language in which the tag session is conducted; easiest image to tag (user is asked); most difficult image to tag (user is asked); the tags themselves assigned for each image, for "See" and "Evoke" separately, and the order in which the tag is assigned. We also record the type of language (whether the current tagging language is native or not for the user).

3.4.2 Derived Factors

We define three factors, 'richness', 'similarity' and 'spontaneity', which are calculated and statistically averaged and grouped by user and image:

- Richness: this represents the average number of tags used in "see" and "evoke". This factor embodies two aspects: (i) images evaluated as easier to tag should have more tags; (ii) users with a wider vocabulary in English should define more tags.


Figure 1. Examples of objective (left) and subjective (right) tags entered by a user for Image 4.

- Similarity: represents the frequency of the tags used for "see" and for "evoke", relative to other users. We first identify the highest frequency tags for each image (overall for all users), then we compare these tags with those defined by each user to see if the user has used similar tags to everyone else, or if the user has defined more "original" tags. We would assume that a user with a greater vocabulary would be more "original".
- Spontaneity: this factor compares the first tag a user defines for tag types "see" and "evoke", for each image, with the most popular tag (for all users), for each image and tag type.

4. SKILL LEVEL ASSESSMENT
In order to establish the types of support a user needs, we can evaluate the skill level of the user and the types of errors, if any, that the user commits. We can do this by asking the user to do some tagging examples to calibrate the system, or we can do it 'on the fly', while the user is tagging. To identify the user as native or non-native in the tagging language we can: (i) ask him/her at the start of the session; or (ii) detect it automatically by evaluating the errors made, especially 'false friends' (words adapted from the native language which do not exist in English). This factor is used as the classifier label, and for the present study it is established by method (i) above. We evaluate the skill of the user based on different demonstrated characteristics, which we will now describe. Two levels (or aspects) of the skill level of the tagger are considered: the syntactic level and the semantic level.

Syntactic level (SYN): evaluated by identifying the following errors: (i) descriptions too long; (ii) spelling errors; (iii) incorrect format (illegal characters); (iv) incorrect language. Depending on the type of error, we can advise the user with appropriate messages. The percentage of each error type with respect to the total number of tags gives an approximate syntactic skill level for each user. This skill level should improve as a consequence of the error/guidance messages given to the user.

Semantic level: if we evaluate richness of vocabulary (by the number of different tags users assign for 'see' and 'evoke'), this may correlate with native and non-native. It may also correlate with native taggers who have an inferior vocabulary in their own language.

We also ask the user, during the calibration session, which of the images s/he found the easiest and which were the most difficult to tag. The correlation with the images which are really most difficult/easy (by consensus of the replies of the majority of users) has been incorporated as an additional indication of skill (or coherence), which we designate as image evaluation coherence, or 'IEC'. This factor (IEC) is considered separately from the syntactic and semantic factors. Therefore, we can define a 'skill level' formula based on our analysis of the syntactic level, semantic level, and image evaluation coherence (IEC):

Skill_level = Semantic_level + IEC – Syntactic_level    (1)

where Syntactic_level (SYN) = "average normalized number of tags with greater than 4 terms" + "average normalized number of spelling errors" + "average normalized number of format errors" + "average normalized number of tags in incorrect language". Note that we take the average values for the user's calibration tag session and then we normalize against the worst/highest values found for any user.
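To make formula (1) concrete, the following is a minimal sketch in Python of the skill-level computation, assuming per-user error counts and average tag counts have already been extracted from the calibration session. It also uses the Semantic_level and IEC components and the multipliers α and β that are introduced in the remainder of this section; all names here are illustrative, not part of the original system.

```python
# Hedged sketch of the skill-level formula (1); names are illustrative.
from dataclasses import dataclass

@dataclass
class UserSession:
    long_tags: int        # tags with more than 4 terms (error e0)
    spelling_errors: int  # e1
    format_errors: int    # e2
    wrong_language: int   # e3
    see_tags: float       # average number of 'see' tags per image
    evoke_tags: float     # average number of 'evoke' tags per image
    found_easiest: bool   # user identified the consensus easiest image
    found_hardest: bool   # user identified the consensus most difficult image

def normalize(value, worst):
    """Normalize against the worst/highest value found for any user."""
    return value / worst if worst else 0.0

def skill_level(user, worst, alpha=2.0, beta=2.0):
    syntactic = (normalize(user.long_tags, worst.long_tags)
                 + normalize(user.spelling_errors, worst.spelling_errors)
                 + normalize(user.format_errors, worst.format_errors)
                 + normalize(user.wrong_language, worst.wrong_language))
    semantic = (normalize(user.see_tags, worst.see_tags)
                + normalize(user.evoke_tags, worst.evoke_tags))
    iec = 0.5 * user.found_easiest + 0.5 * user.found_hardest
    # Formula (1) with the multipliers discussed below:
    # Skill_level = alpha * Semantic_level + beta * IEC - Syntactic_level
    return alpha * semantic + beta * iec - syntactic
```

With the assumed multipliers of 2, this reproduces the ranges discussed below: a maximum of 4 + 2 - 0 = 6 and a minimum of 0 + 0 - 4 = -4.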


Semantic_level = average normalized number of tags used for 'see' + average normalized number of tags used for 'evoke'. This is the same as the 'richness' derived factor which we described in Section 3.4.2. We then apply a multiplier factor (α=2) to the semantic level, which gives a bias to the part of the skill factor range which is greater than zero.

IEC (Image Evaluation Coherence) = correctly identified the easiest image (0.5) + correctly identified the most difficult image (0.5). Similarly to the semantic level, we then apply a multiplier factor (β=2) to IEC, which gives a bias to the part of the skill factor range which is greater than zero.

Therefore Semantic_level will have a maximum value of 2×2=4, IEC will have a maximum value of 1×2=2, and Syntactic_level will have a maximum value of 4. We observe that the relative weighting of the factors, {Semantic_level, 4}, {IEC, 2}, {Syntactic_level, 4}, and the multiplier factors α,β=2, has been calibrated and tested with the real user data. We multiply the 'positive' values by α,β=2 to give a subtotal equal to 6, which gives a slight positive bias to the range of skill level when we subtract the maximum possible negative value of -4. Therefore, an 'optimum' user would have a skill factor of 4 + 2 – 0 = 6; the 'worst' user would have a skill value of 0 + 0 – 4 = –4; and an 'average' user would have a skill value of 2 + 1 – 2 = 1. Thus, we have derived a factor for 'skill level' which can be used to indicate the overall skill level of the user.

5. DATA DEFINITION
In this section we explain how the data captured from the user tag sessions in the Internet questionnaire website is structured and preprocessed into a format which is adequate for its analysis and modeling. From the questionnaire website, the tagging sessions are loaded into an Excel spreadsheet, and from there they are imported into Microsoft Access. Via a series of preprocesses defined in Visual Basic, the data is reformatted and loaded into the Data Mart. Then, as the final step, the data is processed and loaded into tables which contain the factors and data which are used for the analysis.

5.1 User/Tag Session Datasets
With reference to Tables 1 and 2, the derived 'See' and 'Evoke' factor session data is held in the table "SessionD", and the "error" statistics and "skill" factors are held in the table "SessionK". All the data has been aggregated by user and image. In both datasets, the attribute "typeLanguage" is the "point of reference" for the data analysis and modeling. If the user's native language is not English, then 'typeLanguage' = 2, whereas if the user's native language is English, then 'typeLanguage' = 1. This indicator is used as the output, or labeling class.

5.1.1 Derived Factors (table SessionD)
Dataset 'SessionD' (Table 1) consists of a set of descriptive attributes about the tag session done by a user. The first two attributes have some information about the session they have just tagged: the image they found the easiest to tag and the image they found the most difficult to tag. The remaining factors represent the three concepts which we described previously in Section 3.4.2 of the paper: "Richness", "Similarity" and "Spontaneity". We recall that all these values are summarized by image and user, and that a tag consists of one or more terms (individual words).
"Richness" is represented by the four factors indicated by an 'R' in column 2 of Table 1. A description is given for each factor in column 3 of Table 1. The factors provide information related to the assignment of tags and terms, for 'see' and 'evoke' type tags.

Table 1. SessionD: Derived dataset of semantic factors.
Factor              Type*   Description
easiest                     The easiest image to tag
mostDifficult               The most difficult image to tag
NumTags_See         R       Number of see type tags defined by a user, averaged for all images
NumTags_Evoke       R       Number of evoke type tags defined by a user, averaged for all images
NumTerms_See1       R       The sum of the number of terms defined by a user for each see type tag, averaged for all images
NumTerms_Evoke1     R       The sum of the number of terms defined by a user for each evoke type tag, averaged for all images
Similarity_See1     SI      The similarity of a user's see type tags, with respect to all other users, averaged for all images
Similarity_Evoke1   SI      The similarity of a user's evoke type tags, with respect to all other users, averaged for all images
Similarity_See2     SI      Similarity_See1 divided by the number of users
Similarity_Evoke2   SI      Similarity_Evoke1 divided by the number of users
Spontaneity_See     SP      The spontaneity of a user's see type tags, with respect to the three most popular see type tags for each image, averaged for all images
Spontaneity_Evoke   SP      The spontaneity of a user's evoke type tags, with respect to the three most popular evoke type tags for each image, averaged for all images
typeLanguage                Native or non-native
*R = richness, SI = similarity, SP = spontaneity
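As an illustration of how the factors in Table 1 could be derived from raw tag sessions, the sketch below computes per-user richness, a simple tag-overlap similarity against the most frequent tags, and spontaneity against the most popular tags. It is only an approximation of the procedure described in Section 3.4.2; the data layout, names and the exact overlap measure are assumptions, not the implementation used for the paper.

```python
# Hedged sketch of the derived factors in Table 1; the exact measures used
# in the paper may differ. sessions: {user: {image: {"see": [...], "evoke": [...]}}}
from collections import Counter

def richness(session, kind):
    """Average number of tags of the given kind ('see' or 'evoke') over all images."""
    return sum(len(tags[kind]) for tags in session.values()) / len(session)

def top_tags(sessions, image, kind, n=3):
    counts = Counter(tag for s in sessions.values() for tag in s[image][kind])
    return [t for t, _ in counts.most_common(n)]

def similarity(sessions, user, kind):
    """Overlap between a user's tags and the highest-frequency tags, averaged over images."""
    scores = []
    for image, tags in sessions[user].items():
        popular = set(top_tags(sessions, image, kind))
        scores.append(len(popular & set(tags[kind])) / max(len(popular), 1))
    return sum(scores) / len(scores)

def spontaneity(sessions, user, kind):
    """How often the user's first tag matches one of the three most popular tags."""
    hits = []
    for image, tags in sessions[user].items():
        first = tags[kind][0] if tags[kind] else None
        hits.append(first in top_tags(sessions, image, kind))
    return sum(hits) / len(hits)
```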


Table 2. SessionK: Derived dataset of syntactic factors.
Dataset SessionK: Factors   Description*
e0                Too many terms (more than four)
e1                Spelling mistakes
e2                Format wrong
e3                Wrong language
e4                False friends
etot              Total errors
IEC               Image evaluation coherence (correct identification of easiest and most difficult images)
Syntactic_level   Syntactic level factor based on errors
Semantic_level    Semantic level factor based on richness of vocabulary
Skill_level       Calculated from formula (1)
typeLanguage      Native or non-native
*also refer to the descriptions in Sections 4 and 5.1.2

"Similarity" is represented by the four factors indicated by an 'SI' in column 2 of Table 1. A description is given for each factor in column 3 of Table 1. The factors provide information related to the assignment of tags (averages and totals), for 'see' and 'evoke' type tags.
"Spontaneity" is represented by the two factors indicated by an 'SP' in column 2 of Table 1. A description is given for each factor in column 3 of Table 1.

5.1.2 Tagging Error Statistics: Errors Made by Users in Tagging Sessions
In dataset 'SessionK' (Table 2) we record five types of "error" made by the users during the tagging sessions. The error types have already been described in Section 4.
The first type of error, "description tag too long", is designated as attribute 'e0' in table 'SessionK'. The second type of error (orthographic) is designated as attribute 'e1'. The third type of error (illegal characters) is designated as attribute 'e2'. The fourth type of error occurs when the user writes the tag in a language different from the one defined for the tag session; for example, the user writes "arbol" (Spanish) for "tree". This is designated as attribute 'e3'. This type of error was committed only by the non-native taggers. A fifth type of error, which is a subtype of error 'e3', occurs when the user incorrectly tries to adapt a word which s/he knows from his/her own language, thinking that it is similar to, or the same as, the corresponding word in English. For example, the user writes "incomunication" (adaptation of a Spanish word) when the correct English word is "isolated". This is designated as attribute 'e4'. This type of error was also committed only by the non-native taggers. Error types 'e3' and 'e4' are, of course, strong factors for discriminating between native and non-native taggers. The attribute 'etot' represents the total number of errors for a user, which is simply the sum of e0, e1, e2, e3 and e4.

Table 3. Percentage of the total number of users who commit the corresponding error type at least once, by type of language (dataset 'SessionK').
Error type               Native   Non-native
e0: too many terms       19%      4%
e1: spelling mistakes    73%      96%
e2: format wrong         38%      49%
e3: wrong language       0        17%
e4: false friends        0        13%
etot: total errors       78%      97%

With respect to the remaining attributes (Table 2, dataset 'SessionK'), Syntactic_level is a derived factor based on the error frequencies, and IEC and Semantic_level are derived from the tagging session data in dataset 'SessionD'; these have been described previously in Section 4. Finally, Skill_level is derived from Syntactic_level, IEC and Semantic_level, and has also been described in Section 4.
Table 3 summarizes the percentage of users who committed the different types of errors. For example, we observe that 19% of all native users had one or more errors of type e0 (too many terms in the tag description). On the other hand, only 4% of the non-native users displayed the same type of error. We observe that errors of type e3 and e4 were only committed by non-native users. In total, 78% of all native users committed at least one error (of any type), and 97% of non-native users. With respect to error types which discriminate between native and non-native users, we would highlight e1, e3 and e4, although their frequencies of occurrence tend to be 'sparse'.

6. EVALUATION OF DESCRIPTIVE STATISTICS
Now we show results of the comparative data analysis based on the two derived datasets 'SessionK' and 'SessionD' described in Section 4 of the paper. For the task, we have used the IM4Data (IBM Intelligent Miner for Data V6.1.1) Data Mining tool [6]. The bivariate statistics function of IM4Data has been used for the data analysis; it generates summarized graphical representations, partitioned by the classifier category, which in the present study is 'typeLanguage' (native or non-native).
With reference to Figures 2, 3 and 4, we have used the 'Bivariate statistics' option of IM4Data, defining all variables as univariate with the exception of 'typeLanguage', which has been defined as bivariate (to calculate the Chi-Squared statistics relative to the other variables). In Figures 2, 3 and 4, the resulting graphics for each attribute are in descending order of the Chi-Squared statistic. Therefore, in Figure 3, attribute 'e4' has the highest Chi-Squared statistic in relation to the class label 'typeLanguage', followed by attribute 'e3', then 'e1', and so on. Numeric attributes are represented by a histogram of their distribution, and categorical attributes are represented by a pie chart. In all the histograms, the grey filled rectangles indicate the frequency for the whole dataset, and the other rectangles (which terminate above or below the grey rectangles) represent the frequency for the given partition.
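The bivariate ranking described above can be approximated outside IM4Data. The sketch below ranks attributes by a chi-squared statistic computed against the typeLanguage class label, assuming the session data is available as a pandas DataFrame with one row per user; the library calls and the binning of numeric attributes are assumptions, not a description of the IM4Data tool itself.

```python
# Hedged approximation of the 'bivariate statistics' ranking: chi-squared of each
# attribute against the class label 'typeLanguage' (1 = native, 2 = non-native).
import pandas as pd
from scipy.stats import chi2_contingency

def rank_by_chi_squared(df: pd.DataFrame, label: str = "typeLanguage"):
    ranking = []
    for col in df.columns:
        if col == label:
            continue
        values = df[col]
        # Bin numeric attributes so a contingency table can be built.
        if pd.api.types.is_numeric_dtype(values) and values.nunique() > 10:
            values = pd.cut(values, bins=5)
        table = pd.crosstab(values, df[label])
        chi2, _, _, _ = chi2_contingency(table)
        ranking.append((col, chi2))
    return sorted(ranking, key=lambda x: x[1], reverse=True)
```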


6.1 Evaluation of Semantic Factors
Figure 2 has been produced from the 'SessionD' dataset for native English taggers (left column of variables) and non-native taggers (right column of variables). This dataset contains attributes which represent the 'richness', 'similarity' and 'spontaneity' factors for the user tag sessions. In Figure 2, we show the first four variables ranked by the Chi-Squared statistic, which relates them to the type of user (native or non-native). Two of the attributes most related to the native/non-native label (as indicated by Chi-Squared) are variables related to the similarity of the evoke type tags: 'Similarity_Evoke2' and 'Similarity_Evoke1'. This agrees with the hypothesis that non-native users have more difficulty thinking of vocabulary which defines emotions.
The distributions of 'Similarity_Evoke2' and 'Similarity_Evoke1' in Figure 2 show that the non-natives (histogram on the right) have a greater frequency in the higher (rightmost) part of the distribution, which means that there is more coincidence between the non-native tags, and therefore less diversity. The 'mostDifficult' image attribute shows a significant difference between which images the native and non-native taggers found the most difficult. Another significant tendency is found for attribute 'NumTerms_Evoke1' (refer to the descriptions of the semantic factors in Table 1 of Section 5.1.1). If we compare the right/left distributions for this attribute, we see on the right (non-native) a significantly greater frequency for more terms per tag. This means that non-natives are using more terms for evoke type tags to express a concept. On the other hand, natives are finding more exact terms to describe the images.

Figure 2. Distributions of selected semantic factors, for native (left) and non-native (right) taggers.

6.2 Evaluation of Syntactic Factors
Figures 3 and 4 have been produced from the 'SessionK' dataset for native English taggers (left column of attributes) and non-native taggers (right column of attributes). We recall that this dataset contains attributes which represent the error rates for different types of errors, and the derived 'skill level' attributes and factors. We can visually compare the distributions of the different variables, for native (left column) and non-native (right column) users, identifying any trends. Attributes 'e4' and 'e3' appear first in the Chi-square ordered attributes, and are therefore the most discriminatory between native and non-native. This is because 'e4' represents the 'false friend' error count and 'e3' represents the wrong language error count, and these errors were only made by non-natives. We note that the error frequencies are 'sparse', that is, many users have no errors of a given type (although a majority of users have at least one error of any type, see Table 3). For attribute 'e4' for native users (Figure 3, left column), 100% of the users (left bar of the histogram) had zero errors of type e4, and 0% of the users had one error of type e4 (right bar of the histogram). The grey area indicates that the overall percentage of users with zero errors of type e4 is 85%, and the overall percentage of users with one error of type e4 is 15%. If we look at the distribution of 'e4' for non-native users (Figure 3, right column), we observe that 70% (left bar of the histogram) had zero errors of this type, and 30% had one error of this type. The grey filled rectangles represent the overall frequencies for the whole dataset.
In the case of attribute 'e1' of Figure 3 (natives, left histogram), if we compare with the same attribute for non-natives (right histogram), we observe a significantly greater frequency for the leftmost bar of the histogram. This indicates that a greater number of native taggers make a smaller number of spelling mistakes. For attribute 'etot', the total number of errors of all types committed, we see a general tendency where native taggers have a higher incidence of 0 and 1 errors (left histogram), whereas non-native taggers have a higher incidence of 2, 3 and 4 errors. We also observe a small exceptional group of native taggers with 7 or more errors (left histogram).

Figure 3. Distributions of selected syntactic factors for native English taggers (left), and non-native taggers (right).

6.3 Evaluation of Skill Factors
With reference to Figure 4, we see the histogram distribution for attribute 'skill_level' and the three skill factors which are used to calculate the skill level. With reference to the first attribute, 'skill_level', in the histogram on the left (native taggers), we observe that there are higher frequencies in the rightmost part of the distribution (skill value > 4.5), when compared to the distribution of 'skill_level' in the histogram on the right (non-native taggers). This confirms that a slightly higher general skill level is shown by the natives with respect to the non-natives. Also, in the distribution of 'skill_level' in Figure 4 (right), we observe higher frequencies in the leftmost part of the distribution (skill value < 3.5), when compared to the distribution of 'skill_level' in Figure 4 (left).
With respect to the other skill factors, ordered by Chi-Square statistic, we observe that native taggers (left histogram) have a lower syntactic level (higher frequencies in the low end of the range), which, following our definition, means they commit fewer errors. In the case of the semantic level, the native taggers have higher frequencies in the high end of the range (which for this factor means that they have defined more distinct tags and therefore have a richer vocabulary). Non-native users, on the other hand, have greater frequencies in the middle/lower range of semantic level, which indicates that they tend to use fewer tags. Finally, we observe that the attribute IEC does not discriminate between native and non-native, although we can observe that natives have a smaller percentage of low IEC values (65% with respect to 70%), which means that they make slightly fewer mistakes in correctly identifying the easiest and most difficult images to tag.

Figure 4. Distributions of skill factors for native English taggers (left), and non-native taggers (right).

7. TAG RECOMMENDATION BASED ON USER TYPE AND ERROR TYPE
Now we can summarize the different responses and recommendations we would give for each of the user and error types. For error types 1, 2 and 3, the user can be given a simple message indicating the error, with possible automatic correction (see Section 5.1.2).
Error type 4: tag in incorrect language. The user writes "arbol" (Spanish) for "tree". The system does not find the word in the English dictionary. One option would be to have established the native language of the tagger during the calibration session. Then the system would look in the appropriate dictionary for the translation to English.
Error type 5: incorrect adaptation of a foreign word. The user writes "exit" as the tag for an image, thinking it means 'success' because the Spanish word for 'success' is 'éxito'. The procedure to detect the error cannot be the same as for error type 4, because the word is a valid English term and therefore the system will find it in the English dictionary. One option would be to have an extensive list of words which are susceptible to this type of error by taggers in English whose native language is Spanish. Then the system would detect the word 'exit' as one of these.
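As a rough illustration of how the error types of Section 5.1.2 and the responses above could be detected automatically, the sketch below classifies a single tag using an English word list, a word list for the tagger's native language, and a hand-built false-friend list. All resources, names and thresholds here are assumptions rather than the system actually used in the study.

```python
# Hedged sketch of per-tag error detection for error types e0-e4.
# english_words / native_words are sets of known word forms; false_friends maps
# an English form the user is likely to misuse to a suggested correction.
import re

def classify_tag_errors(tag, english_words, native_words, false_friends,
                        max_terms=4):
    errors = []
    terms = tag.strip().split()
    if len(terms) > max_terms:
        errors.append("e0: too many terms")
    if re.search(r"[^A-Za-z\s'-]", tag):
        errors.append("e2: illegal characters")
    for term in terms:
        word = term.lower()
        if word in false_friends:
            errors.append(f"e4: possible false friend, did you mean '{false_friends[word]}'?")
        elif word in native_words and word not in english_words:
            errors.append("e3: wrong language")
        elif word not in english_words:
            errors.append(f"e1: possible spelling error in '{term}'")
    return errors
```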


Recommendation of 'evoke' tags based on 'see' tags: if the user has already defined the 'see' tags, then the system can recommend the 'evoke' tags based on the 'see' tags. For example, with reference to the list of most frequent 'see' and 'evoke' tags for Image 10, which we saw in Table 4, the non-native user defines the following 'see' tags: 'sky', 'grass' and 'bench'. The system would then consult a dictionary of 'see' tags and corresponding 'evoke' tags which have been defined previously by other (native or more highly skilled) users. In the case of 'sky', we see that the system would retrieve 'nature', which is the native users' corresponding tag to 'sky'. Similarly, the system would retrieve 'peace' (evoke) for 'bench' (see) and 'open' (evoke) for 'grass'. It can be debated whether the resulting tags are linguistically better, more expressive or richer than the evoke tags the user may have defined him or herself. Also, it may happen that the 'see' tag the user has defined does not exist in the system's dictionary.
User support based on the derived factor 'skill_level': with reference to the distribution of the attribute 'skill_level', we observe that the leftmost part of the distribution (< 3.5) has a low skill level. We can offer specific additional support to these users. The middle range of the distribution (between 3.5 and 4.5) represents the users with average skill level, who could have ad-hoc support and advice depending on the type of errors made, or 'see'/'evoke' suggestions. Lastly, the rightmost part (> 4.5), users with the highest calculated skill level, can be used as a 'model' for other users, for example, by saving their 'see' and 'evoke' tags for future reference.

8. CONCLUSIONS AND FUTURE WORK
In this study we have analyzed two types of factors which influence the tagging process: (i) semantic factors such as "richness", "similarity" and "spontaneity", which we have captured for "objective" tagging and "subjective" tagging; (ii) syntactic factors defined as types of errors that users commit during tagging. We have related these factors for native and non-native taggers in the English language.
Based on the analysis of the descriptive statistics in Section 6, we can reasonably conclude that: (a) natives have better tagging skills than non-natives; (b) subjective tagging is more difficult than objective tagging and therefore requires more support for the user. Therefore we have confirmed our two initial hypotheses. Thus, native and non-native taggers have distinctive characteristics in terms of the tag type (especially 'evoke') and in terms of some, but not all, of the error types. In terms of error types, there is an overlap for natives and non-natives, with the exception of 'wrong language' and 'false friend' type errors, although the non-natives generally tend to have higher error rates (Table 3).
As a next step in our work, we propose testing natives of other languages (such as German and ) as non-native taggers in English. These languages, Teutonic and Oriental based, respectively, would be an interesting contrast to the linguistic peculiarities of the Spanish (Latin based) non-native taggers.

9. REFERENCES
[1] Anderson, A., Raghunathan, K., Vogel, A., 2008. TagEz: Flickr Tag Recommendation. Association for the Advancement of Artificial Intelligence (www.aaai.org). http://cs.stanford.edu/people/acvogel/tagez/
[2] Baeza-Yates, R., Marcos, MC., 2007. Las distintas caras de la Web. A: Schultz, Margarita (ed.). El factor humano en la cibercultura. Buenos Aires: Alfagrama, 2007, 65-114.
[3] Boehner, K., DePaula, R., Dourish, P., Sengers, P., 2007. How emotion is made and measured. International Journal of Human-Computer Studies, 65:4, 275-291.
[4] Cobo Romaní, C., Pardo Kuklinski, H., 2007. Planeta Web 2.0. Inteligencia colectiva o medios fast food. Barcelona / México: Grup de Recerca d'Interaccions Digitals, Universitat de Vic. Flacso México. DF. E-book de acceso gratuito. Versión 0.1 / Septiembre de 2007. ISBN 978-84-934995-8-7.
[5] Garg, N., Weber, I., 2008. Personalized, interactive tag recommendation for flickr. Proceedings of the 2008 ACM Conference on Recommender Systems, Lausanne, Switzerland, pp. 67-74. ISBN 978-1-60558-093-7.
[6] Im4Data, 2002. Using the Intelligent Miner for Data V8 Rel. 1. IBM Redbooks, SH12-6394-00.
[7] Isbister, K., Hook, K., 2007. Evaluating affective interactions. International Journal of Human-Computer Studies, 65:4, 273-274.
[8] Lee, S. A., 2007. Web 2.0 Tag Recommendation Algorithm Using Hybrid ANN Semantic Structures. International Journal of Computers, Issue 1, Volume 1, 2007, pp. 49-58. ISSN 1998-4308.
[9] Lipczak, M., Angelova, R., Milios, E., 2008. Tag Recommendation for Folksonomies Oriented towards Individual Users. ECML PKDD Discovery Challenge 2008, Proceedings of WWW 2008.
[10] Lux, M., Marques, O., Pitman, A., 2008. Using Visual Features to Improve Tag Suggestions in Image Sharing Sites. Proceedings of the International Workshop on "Knowledge Acquisition from the Social Web" KASW'08. http://kmi.tugraz.at/workshop/KASW08/proceedings
[11] Marcos, MC., 2008. La cara social de la Web: la Web 2.0. A: Codina, Lluís; Rovira, Cristòfol; Marcos, Mari-Carmen (Eds.). Web semántica. Gijón: Trea, 2008.
[12] Nafría, I., 2007. Web 2.0. El usuario, el nuevo rey de Internet. Barcelona: Gestión2000. (Introducció disponible online a: http://ismaelnafria.files.wordpress.com/2007/11/introduccion-web-20.pdf)
[13] O'Reilly, T., 2005. What is Web 2.0. O'Reilly Net. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-Web-20.html
[14] Sigurbjörnsson, B., van Zwol, R., 2008. Flickr Tag Recommendation based on Collective Knowledge. WWW 2008, , China. ACM 978-1-60558-085-2/08/04.
[15] Song, Y., 2008. Real-time Automatic Tag Recommendation. SIGIR'08, July 20–24, 2008, . ACM 978-1-60558-164-4.


[16] Sood, S.C., Hammond, K., Owsley, S.H., Birnbaum, L., 2007. TagAssist: Automatic Tag Suggestion for Blog Posts. International Conference on Weblogs and Social Media (ICWSM), 2007, Boulder, Colorado, USA.
[17] Zhao, S., Du, N., Nauerz, A., Zhang, X., Yuan, Q., Fu, R., 2008. Improved recommendation based on collaborative tagging behaviors. Proceedings of the 13th International Conference on Intelligent User Interfaces, Gran Canaria, Spain, 2008, pp. 413-416. ISBN 978-1-59593-987-6. Publisher: ACM New York, NY, USA.



Multilingual Wikipedia, Summarization, and Information Trustworthiness

Elena Filatova, Fordham University, Department of Computer and Information Sciences, fi[email protected]

ABSTRACT
Wikipedia is used as a corpus for a variety of text processing applications. It is especially popular for information selection tasks, such as summarization feature identification, answer generation/verification, etc. Many Wikipedia entries (about people, events, locations, etc.) have descriptions in several languages. Often Wikipedia entry descriptions created in different languages exhibit differences in length and content. In this paper we show that the pattern of information overlap across the descriptions written in different languages for the same Wikipedia entry fits well the pyramid summary framework, i.e., some information facts are covered in the Wikipedia entry descriptions in many languages, while others are covered in a handful of descriptions. This phenomenon leads to a natural summarization algorithm which we present in this paper. According to our evaluation, the generated summaries have a high level of user satisfaction. Moreover, the discovered pyramid structure of Wikipedia entry descriptions can be used for Wikipedia information trustworthiness verification.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Measurement, Experimentation, Human Factors

Keywords
Wikipedia, summarization, multilinguality

1. INTRODUCTION
Wikipedia1,2 provides descriptions of people, events, locations, etc. in many languages. Despite the recent discussion of the trustworthiness of Wikipedia descriptions, or lack thereof [9], Wikipedia is widely used in information retrieval (IR) and natural language processing (NLP). The question arises: what can be done to increase the trustworthiness of the information extracted from Wikipedia? We believe Wikipedia itself has resources to increase its trustworthiness.
Most Wikipedia entries have descriptions in different languages. These descriptions are not translations of a Wikipedia entry description from one language into other languages. Rather, Wikipedia entry descriptions in different languages are independently created by different users. The length of the entry descriptions about the same Wikipedia entry varies greatly from language to language. Obviously, texts of different length cannot contain the same amount of information about an entry.
In this paper we compare descriptions of Wikipedia entries written in different languages and investigate the information overlap pattern in multilingual Wikipedia. We show that information overlap in entry descriptions written in different languages corresponds well to the pyramid summarization model [16, 12]. This result helps the understanding of the combined value of the multilingual Wikipedia entry descriptions. On the one hand, multilingual Wikipedia provides a natural summarization mechanism. On the other hand, to get a complete picture of a Wikipedia entry, descriptions in all languages should be combined. Finally, this pyramid structure can be used for information trustworthiness verification.
The rest of the paper is structured as follows. In Section 2 we describe related work. In Section 3 we provide a motivating example for our research. In Section 4 we describe our corpus and the summarization-based experiments we ran to analyze multilingual Wikipedia information overlap. In Section 5 we discuss the results of our experiments. In Section 6 we outline the avenues for future research.

1 http://en.wikipedia.org/wiki/Wikipedia
2 Wikipedia is changing constantly. All the quotes and examples from Wikipedia presented and analyzed in this paper were collected on February 10, 2009, between 14:00 and 21:00 PST.

2. RELATED WORK
The multilingual aspect of Wikipedia is used for a variety of text processing tasks. Adafre et al. [8] analyze the possibility of constructing an English-Dutch parallel corpus by suggesting two ways of looking for similar sentences in Wikipedia pages (using matching translations and hyperlinks). Richman et al. [13] utilize multilingual characteristics of Wikipedia to annotate a large corpus of text with Named Entity tags. Multilingual Wikipedia is used to facilitate cross-language IR [14] and to perform cross-lingual QA [6].
The described applications do not question the trustworthiness of the information presented in Wikipedia. In a separate line of research, approaches are developed to rate the trustworthiness of Wikipedia information. These approaches, however, deal with text in only one language. Wikipedia content trustworthiness can be estimated using a combination of the amount of content revision and the reputation of the author performing this revision [2]. Another way to use edit history to estimate information trustworthiness is to use a dynamic Bayesian network trust model that utilizes rich revision information in Wikipedia [17]. Wikipedia trustworthiness can also be estimated using an additional tab (Trust tab) [11].


The research closest to ours was recently described in Adar et al. [1], where the main goal is to use self-supervised learning to align or/and create new Wikipedia infoboxes across four languages (English, Spanish, French, German). Wikipedia infoboxes contain a small number of facts about Wikipedia entries in a semi-structured format. In our work, we deal with plain text and disregard any structured data such as infoboxes, tables, etc. It must be noted that the conclusions that are reached in parallel for structured Wikipedia information by Adar et al. and for unstructured Wikipedia information by us are very similar. These conclusions stress the fact that the most trusted information is repeated in the Wikipedia entry descriptions in different languages. At the same time, no single entry description can be considered the complete source of information about a Wikipedia entry.

3. INFORMATION OVERLAP
Currently, Wikipedia has entry descriptions in more than 200 languages. The language with the largest number of entry descriptions is English [8, 5] but the size of non-English Wikipedia is growing fast and represents a rich corpus.3 Most existing NLP applications that use Wikipedia as the training corpus or information source assume that Wikipedia entry descriptions in all languages are a reliable source of information. However, according to our observations, Wikipedia descriptions about the same entry (person, location, event, etc.) in different languages frequently cover different sets of facts. According to the Wikipedia analysis [7], there are two major sources of differences in the descriptions of the same Wikipedia entry written in different languages:
• the amount of information covered by a Wikipedia entry description;4
• the choice of information covered by a Wikipedia entry description.
In this paper we analyze the information overlap in Wikipedia entry descriptions written in several languages.
For example, baseball is popular in the USA, Latin America, and Japan but not in Europe or Africa. Wikipedia has descriptions of Babe Ruth in 18 languages: the longest and most detailed descriptions are in English, Spanish and Japanese. The description of Babe Ruth in Finnish has five sentences, and in Swedish four. These short entry descriptions list several general biographical facts: dates of birth and death, and the fact that he was a baseball player. It is likely that the facts from the Swedish and Finnish entry descriptions about Babe Ruth will be listed in a summary of the English language Wikipedia entry description of him.
Currently, information overlap is successfully used for summarization evaluation. The state-of-the-art (automatic and manual) summarization evaluation approaches compare the target summary against several model summaries. Such models are typically created manually. The more model summaries contain a specific piece of information, the greater value it gets in the target summary [10, 16, 12].

3 http://meta.wikimedia.org/wiki/List_of_Wikipedias
4 In this work, the length of a Wikipedia entry description is measured in sentences used in the text description of a Wikipedia entry.

4. CORPUS ANALYSIS EXPERIMENT
In this paper, we investigate how the information overlap in multilingual Wikipedia can be used to create summaries of entry descriptions.

4.1 Data Set
For our experiment, we used the list of people created for Task 5 of DUC 2004: the biography generation task (48 people).5 This set is small enough to be analyzed manually in detail; at the same time, it is widely used for summarization experiments. In the future, we plan to compare the performance of our summarization approach against another summarization system developed for biography generation that utilizes Wikipedia document structure [4].
We downloaded from Wikipedia all the entry descriptions in all the languages corresponding to each person from the DUC 2004 list. We used Wikitext, the text that is used by Wikipedia authors and editors. Wikitext can be obtained through Wikipedia dumps.6 We removed from the wikitext all the markup tags and tabular information (e.g., infoboxes and tables) and kept only plain text. There is no commonly accepted standard wikitext language, thus our final text had a certain amount of noise which, however, as discussed in Section 5, did not affect our experimental results.
For each Wikipedia entry (i.e., DUC 2004 person) we downloaded corresponding entry descriptions in all the languages, including Esperanto, Latin, etc. We used the name of a person to find the description of this person in English and then we followed the links from the left side panel of the Wikipedia page template to get the entry descriptions in other languages. To facilitate the comparison of entry descriptions written in different languages we used the Google machine translation tool7 to translate the downloaded entry descriptions into English. The number of languages covered currently by the Google translation system (41) is less than the number of languages used in Wikipedia (265). However, the language distribution in the collected corpus corresponds well to the language distribution in Wikipedia and the collected Wikipedia subset can be considered a representative sample [7].
Five people from the DUC 2004 set had only English Wikipedia entry descriptions: Paul Coverdell, Susan McDougal, Henry Lyons, Jerri Nielsen, Willie Brown. Thus, they were excluded from the analysis. The person whose Wikipedia entry had descriptions in most languages (86) was Kofi Annan. On average, a Wikipedia entry for a DUC 2004 person had descriptions in 25.35 languages. The description in English was not always the longest description: in 17 cases the longest description of a Wikipedia entry for a DUC 2004 person was in a language other than English.

5 http://duc.nist.gov/duc2004/tasks.html/
6 http://download.wikimedia.org/
7 http://translate.google.com/

4.2 Data Processing Tools
After the Wikipedia entry descriptions for the DUC 2004 set were collected and translated, we divided these descriptions into sentences using the LingPipe sentence chunker [3]. For each DUC 2004 person we compared a description of this person in English against the descriptions of this person in other languages that were handled by the Google translation system. We counted in how many languages the descriptions had sentences corresponding to the sentences in the English description. To identify matching sentences we used the LingPipe string matching tool based on tf/idf distance, which "is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. [...] two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents" [3].


Algorithm
1 Submit the person's name to Wikipedia
2 Get Wikipedia entry descriptions for this person in all possible languages
3 Remove non-plain text information from the descriptions
4 For all the languages handled by the Google MT, translate entry descriptions into English
5 Break English texts into sentences
6 Use a similarity measure to identify which English sentences have counterparts in entry descriptions in other languages
7 Rank all the sentences from the English document according to the number of languages that have similar sentences
8 If several sentences are placed on the same level, list these sentences in the order they appear in the Wikipedia entry description in English
9 Use the top three levels from the above ranking
Table 1: Algorithm outline.

For our experiment, each sentence was treated as a separate document. The idf value was computed based on the two entity descriptions under consideration (one in English, the other one a translation into English). We used three similarity thresholds: 0.5, 0.35, 0.2.

4.3 What was Measured
To evaluate how much information is repeated in the descriptions of the same person in different languages we measured similarity of the person's description in English and in other languages. As each sentence was treated as a separate document, the number of tokens (words) for comparison was rather small. Thus, for the 0.5 similarity threshold, the sentences marked as similar were almost identical. Using the 0.35 and 0.2 thresholds allowed us to search for non-identical sentences that still had a substantial word overlap.
Our hypothesis is that those facts (sentences) that are mentioned in the descriptions of a person in different languages fit well the pyramid summarization model. For example, if we are to summarize a description of a person from the English Wikipedia: first, we should add to the summary those sentences that have their counterparts in the most descriptions of this person in languages other than English. Sentences added on this step correspond to the top level of the pyramid. If the length of the summary is not exhausted then, on the next step, we add to the summary those sentences that appear in the next largest number of languages, and so on. Thus, we can place sentences on different levels of the pyramid, with the top level being populated by the sentences that appear in the most languages and the bottom level having sentences that appear in the least number of languages. For our experiments we used sentences from the top three levels of this pyramid. All the sentences added to the summary should appear in at least two languages other than English. Table 1 has a schematic outline of the described algorithm.

4.4 Example and Experiment Discussion
Table 2 presents three-level summaries for the English Wikipedia description of Gene Autry. Wikipedia has descriptions of Gene Autry in 11 languages: in English and in ten other languages, each of which can be translated into English by the Google Translation system.
Using a similarity of 0.5 we get one sentence from the English description that has counterparts in at least two other languages (here, in three other languages). This sentence's ID is 0: it is the entry description introductory sentence.
Using similarity 0.35 we get a summary consisting of six sentences. The sentence on the top level is the same sentence that was listed in the previous summary. However, with a more permissive similarity threshold, this sentence was mapped to similar sentences in 7 languages, rather than in 3. The next level consists of those sentences from the English description that were mapped to sentences in three other languages. Sentences on the third level were mapped to sentences from two other languages. It is interesting to notice that the sentences included in the summary come from different parts of the document, which has 88 sentences.
The summary created using the 0.2 threshold contains the introductory sentence as well as sentences not included in the summaries for the 0.5 and 0.35 similarity thresholds.
Despite the fact that for our experiment we chose the set of people used for the DUC 2004 biography generation task, we could not use the DUC 2004 model summaries for our evaluation. These models were created using the DUC 2004 corpus, while in our experiments we used a subset of multilingual Wikipedia. Moreover, Wikipedia entry descriptions about the DUC 2004 people had dramatic updates since 2004. For example, Jörg Haider died of injuries from a car crash on October 11, 2008, and this information is included in our three-level summaries.
Due to space constraints, in this paper we report only the results obtained using a similarity threshold of 0.35. Also, in the experiment described in this paper we analyze only those sentences from the English text that appear in at least two other languages, with the exception of Louis Freeh, for whom only one language was handled by the Google Translation system. Thus, the summary for the English entry description about Louis Freeh has only one level, which has all the sentences from the English entry description that have their counterparts in the only available translation. Thus, using the DUC 2004 set we created:
• one-level summaries for 5 people;
• two-level summaries for 3 people;
• three-level summaries for 35 people.
The length of the created summaries is measured in sentences. Table 3 presents information about the average and maximal length of summaries for all three levels combined and for each level separately. The summaries that do not have Level 2 and/or Level 3 are excluded from the corresponding average and maximum value computation. According to the presented data, on average, the output three-level summaries are rather short; however, some summaries can be quite long. We believe that such a difference between the average and the maximal length is due to:
1. the length variation of the English Wikipedia entry descriptions;
2. the number variation of descriptions (languages) for each person and the lengths of these descriptions.
To evaluate the output three-level summaries we used Amazon Mechanical Turk as a source of human subjects who can reliably evaluate certain NLP tasks [15].
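The ranking step of the algorithm in Table 1 above (steps 7-9) can be sketched as follows, assuming the sentence-matching step has already produced, for each English sentence, the number of other-language descriptions containing a similar sentence; the function groups sentences into levels by that count and keeps the top three levels, preserving document order within a level. The names are illustrative only.

```python
# Hedged sketch of steps 7-9 of the algorithm outline in Table 1.
from collections import defaultdict

def top_three_levels(language_counts, min_languages=2):
    """language_counts: {sentence_id: number of other languages with a similar sentence}."""
    by_count = defaultdict(list)
    for sent_id, count in language_counts.items():
        if count >= min_languages:
            by_count[count].append(sent_id)
    levels = []
    for count in sorted(by_count, reverse=True)[:3]:
        # Within a level, keep the order of appearance in the English description.
        levels.append((count, sorted(by_count[count])))
    return levels  # list of (language count, [sentence ids]) for the top three levels

# Example: the Gene Autry summary at threshold 0.35 in Table 2 corresponds to
# levels supported by 7, 3 and 2 languages.
```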


Similarity 0.5
Level 1 (3 languages, sentence 0): Orvon Gene Autry (September 29, 1907 – October 2, 1998) was an American performer, who gained fame as The Singing Cowboy on the radio, in movies and on television.

Similarity 0.35
Level 1 (7 languages, sentence 0): Orvon Gene Autry (September 29, 1907 – October 2, 1998) was an American performer, who gained fame as "The Singing Cowboy" on the radio, in movies and on television.
Level 2 (3 languages, sentence 1): Autry, the grandson of a Methodist preacher, was born near Tioga, Texas.
Level 2 (3 languages, sentence 13): His first hit was in 1932 with "That Silver-Haired Daddy of Mine," a duet with fellow railroad man, Jimmy Long.
Level 3 (2 languages, sentence 3): After leaving high school in 1925, Autry worked as a telegrapher for the St. Louis-San Francisco Railway.
Level 3 (2 languages, sentence 14): Autry also sang the classic Ray Whitley hit "Back in the Saddle Again," as well as many Christmas songs including "Santa Claus Is Coming to Town," his own composition "Here Comes Santa Claus," "Frosty the Snowman," and arguably his biggest hit "Rudolph the Red-Nosed Reindeer."
Level 3 (2 languages, sentence 72): Gene Autry died of lymphoma at age 91 at his home in Studio City, California and is interred in the Forest Lawn, Hollywood Hills Cemetery in Los Angeles, California.

Similarity 0.2
Level 1 (7 languages, sentence 0): Orvon Gene Autry (September 29, 1907 – October 2, 1998) was an American performer, who gained fame as "The Singing Cowboy" on the radio, in movies and on television.
Level 2 (6 languages, sentence 1): Autry, the grandson of a Methodist preacher, was born near Tioga, Texas.
Level 2 (6 languages, sentence 73): His death on October 2, 1998 came nearly three months after the death of another celebrated cowboy of the silver screen, radio, and TV, Roy Rogers.
Level 3 (5 languages, sentence 21): From 1940 to 1956, Autry had a huge hit with a weekly radio show on CBS, "Gene Autry's Melody Ranch." His horse, Champion, also had a radio-TV series "The Adventures of Champion."
Level 3 (5 languages, sentence 72): Gene Autry died of lymphoma at age 91 at his home in Studio City, California and is interred in the Forest Lawn, Hollywood Hills Cemetery in Los Angeles, California.

Table 2: Three-level summaries for Gene Autry (#: the summary level; Lang.: number of languages that contain a similar sentence; Sent. ID: the position of the sentence in the English description; Text: the sentence itself).

        Three-level summary   Level one   Level two   Level three
Avg     3.74                  1.02        1.58        1.63
Max     9                     2           6           7
Table 3: Summaries length: average and maximal.

For each of the 43 outputs (for 43 people from the DUC 2004 set) we recruited five human annotators. The annotators were provided with: the name of the person; a link to the Wikipedia entry description about this person in English; and the three-level summary of this Wikipedia entry description. We asked our human annotators to answer the following questions:
• Do you agree that the sentences listed on Level 1 are a good summary of the Wikipedia entry description about Person (assume the number of sentences in the summary cannot exceed the number of sentences listed on Level 1)?
• Assume that the summary of the Wikipedia entry description about Person can have as many sentences as listed on Level 1 and Level 2 combined. Do you agree that the sentences listed on Level 1 and Level 2 are a good summary?
• Assume that the summary of the Wikipedia entry description about Person can have as many sentences as listed on Level 1, Level 2, and Level 3 combined. Do you agree that the sentences listed on Level 1, Level 2, and Level 3 are a good summary?
If the summary did not have Level 2 and/or Level 3 sentences, the annotator was asked to skip answering the corresponding questions.

5. RESULTS
Table 4 summarizes the results of the three-level summaries evaluation. The Goodness measure shows how many (out of five) annotators agreed that the summary for a particular level, given the length constraint, was good. The numbers in the table show the number of summaries that were considered good for each level according to a particular level of goodness. As mentioned in Section 4.4, not all summaries have Levels 2 and 3 filled in; the Number of summaries column in Table 4 has this information.
According to Table 4, no summary on Level 1 was uniformly considered bad. One summary was considered bad by four out of five annotators. This was the summary for Paul Wellstone, with Level 1 consisting of only one sentence. We analyzed this sentence and discovered that it was incorrectly truncated due to our sentence chunker error.
[Paul David Wellstone (July 21, 1944 - October 25, 2002) was a two-term U.S.] [Senator from the U.S. state of Minnesota and member of the Democratic-Farmer-Labor Party, which is affiliated with the national Democratic Party.]
This sentence was broken into two sentences (identified above by the square brackets), and only the first portion of the sentence was added to the Level 1 summary. Despite the fact that this portion contains important biographical information, it cannot be used as a stand-alone sentence. According to our analysis, three out of seven summaries that were judged as bad by three out of five annotators had exactly the same problem of incorrect sentence segmentation, forcing only portions of sentences to be added to the summaries.
In addition to asking annotators to judge the quality of the created summaries, we welcomed our annotators to leave comments about the summaries they read. These comments can be divided into two groups. Several annotators noticed text preprocessing errors (e.g., leftovers from the Wikitext XML tagging); however, this did not affect their judgement of the summary quality: all the summaries containing XML tags were marked as good.


Levels    Goodness: 5    4    3    2    1    0    Number of summaries
1                   28   4    3    7    1    0    43
1,2                 12   3   12    4    5    2    38
1,2,3                5   6   14    8    1    1    35

Table 4: Evaluation results (using Mechanical Turk).

Figure 1: Combined results.

concerned the type of facts included in the summaries. For example, it was pointed out that the sentences from the summary about Abdullah Öcalan did not have enough information about his political activities and thus, the created summaries were judged as bad. Annotators suggested that information about the professional life of politicians would be more appropriate than information about their marriages. However, sentences containing information about private life were mostly considered relevant and judged as good additions to summaries.

Figure 1 shows the combined numbers for Table 4. For each level we grouped all the numbers into two categories: those where the majority of the annotators agreed that the summary was good and those where the majority of the annotators decided that the summary was bad. As not all the summaries had sentences from all three levels, Figure 1 has ratios rather than the absolute numbers listed in Table 4. This figure shows that overall the quality of the created summaries was quite high. In more than 80% of cases our annotators were happy with the summaries consisting of the Level 1 sentences, and in more than 70% of cases our annotators were happy with the summaries consisting of the sentences combined from Levels 1 and 2 and Levels 1, 2, and 3.

To conclude this section, we showed that information overlap in multilingual Wikipedia can be used for placing information facts into a pyramid structure.

6. FUTURE WORK
While the main focus of the current paper is information overlap, in the future, we are interested in studying information asymmetries in multilingual Wikipedia. We are interested in investigating how Wikipedia multilinguality can be used for opinion, contradiction and new information detection. An important observation concerning the example presented in Section 3 is that irrespective of the length of the descriptions of a person in different languages, none of these descriptions have any facts that contradict the facts in the descriptions of this Wikipedia entry in other languages. Rather, the discussed entry descriptions in different languages contain a subset of facts that appear in many languages plus, maybe, additional information. This allowed us to formulate and test the hypothesis that a set of Wikipedia entry descriptions about the same entry fits well the pyramid summarization model. However, there are Wikipedia entries toward which the attitude differs in communities speaking different languages. In these cases, we believe, it will be more interesting to draw the readers' attention not to the facts that are repeated in the entry descriptions in many languages but rather to highlight differences among these descriptions.

7. REFERENCES
[1] E. Adar, M. Skinner, and D. Weld. Information arbitrage in multi-lingual Wikipedia. In WSDM, 2009.
[2] B. T. Adler, K. Chatterjee, L. de Alfaro, M. Faella, I. Pye, and V. Raman. Assigning trust to Wikipedia content. In WikiSym, 2008.
[3] Alias-i. LingPipe 3.7.0 (accessed January 19, 2009), 2009. http://alias-i.com/lingpipe.
[4] F. Biadsy, J. Hirschberg, and E. Filatova. An unsupervised approach to biography production using Wikipedia. In ACL, 2008.
[5] L. Denoyer and P. Gallinari. The Wikipedia XML Corpus. SIGIR Forum, 2006.
[6] S. Ferrández, A. Toral, O. Ferrández, A. Ferrández, and R. Munoz. Applying Wikipedia's multilingual knowledge to cross-lingual Question Answering. LNCS: Natural Language Processing and Information Systems, 4592:352–363, 2007.
[7] E. Filatova. Directions for exploiting asymmetries in multilingual Wikipedia. In CLIAW3, 2009.
[8] S. Fissaha Adafre and M. de Rijke. Finding similar sentences across multiple languages in Wikipedia. In Workshop on New Text – Wikis and blogs and other dynamic text sources, 2006.
[9] A. Keen. The Cult of the Amateur: How Today's Internet is Killing Our Culture. Bantam Books, 2007.
[10] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using N-gram co-occurrence statistics. In HLT-NAACL, 2003.
[11] D. McGuinness, H. Zeng, P. Pinheiro da Silva, L. Ding, D. Narayanan, and M. Bhaowal. Investigations into trust for collaborative information repositories: A Wikipedia case study. In MTW, 2006.
[12] A. Nenkova, R. Passonneau, and K. McKeown. The Pyramid method: Incorporating human content selection variation in summarization evaluation. TSLP, 4(2), 2007.
[13] A. Richman and P. Schone. Mining Wiki resources for multilingual named entity recognition. In ACL, 2008.
[14] P. Schönhofen, A. Benczúr, I. Bíró, and K. Csalogány. Performing cross-language retrieval with Wikipedia. In CLEF, 2007.
[15] V. Sheng, F. Provost, and P. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD, 2008.
[16] S. Teufel and H. V. Halteren. Evaluating information content by factoid analysis: Human annotation and stability. In ACL, 2004.
[17] H. Zeng, M. Alhossaini, L. Ding, R. Fikes, and D. McGuinness. Computing trust from revision history. In PST, 2006.
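A short sketch may make the grouping behind Figure 1 concrete. It assumes, since the excerpt above does not state it explicitly, that the Goodness score in Table 4 is the number of the five Mechanical Turk annotators who rated a summary as good, so a summary counts as good overall when at least three annotators said so; the counts are taken directly from Table 4.

```python
# Table 4 counts: number of summaries receiving each Goodness score,
# per set of pyramid levels used to build the summary.
table4 = {
    "1":     {5: 28, 4: 4, 3: 3,  2: 7, 1: 1, 0: 0},  # 43 summaries
    "1,2":   {5: 12, 4: 3, 3: 12, 2: 4, 1: 5, 0: 2},  # 38 summaries
    "1,2,3": {5: 5,  4: 6, 3: 14, 2: 8, 1: 1, 0: 1},  # 35 summaries
}

for levels, counts in table4.items():
    total = sum(counts.values())
    # Assumed majority rule: Goodness >= 3 means most annotators judged it good.
    good = sum(n for score, n in counts.items() if score >= 3)
    print(f"Levels {levels}: {good}/{total} good ({100.0 * good / total:.0f}%)")
```

Under this reading the ratios come out at roughly 81%, 71% and 71%, which is consistent with the "more than 80%" and "more than 70%" figures quoted above.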



NSContrast: An Exploratory News Article Analysis System that Characterizes the Differences between News Sites

Masaharu Yoshioka
Graduate School of Information Science and Technology, Hokkaido University
National Institute of Informatics
N14 W9, Kita-ku, Sapporo-shi, Hokkaido, Japan
[email protected]

ABSTRACT
The News Site Contrast (NSContrast) system analyzes multiple news sites using the concept of contrast set mining and can extract the terms that characterize the differences in topics of interest for each country. However, because of the poor quality of some machine translation, NSContrast results include some meaningless terms generated by this mistranslation. To address this problem, Wikipedia is used as a bilingual dictionary and as a source for synonym identification. We give some experimental results for this New NSContrast system.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Information Retrieval, Text Mining

Keywords
News search, IR interface, Contrast set mining

1. INTRODUCTION
It has recently become possible to access a wide variety of news sites from across the world via the Internet. Because each news site has its own culture and interpretation of events, we can obtain a greater diversity of information using multiple news sites than ever before. Because each country has different opinions and interests, when we use news sites from different countries, we will obtain different points of view for a topic. For example, considering diplomatic issues to do with North Korea; Asian, European and American news sites have some common interests and their own characteristic interests. Therefore, to analyze events reported from multiple sites, it is important to clarify the characteristics of each news site.

There are several experimental systems that integrate news articles about a particular event from multiple news sites. For example, EMM News Explorer (http://press.jrc.it/NewsExplorer/) and Newsblaster [5] are integrated news aggregation systems built over distributed news archives. These systems are effective for understanding a particular event via multiple news sites, but they ignore the characteristics of each news site. For example, Japanese news sites tend to report Japanese-related topics more frequently than others. To better understand articles from different news sites, this bias should be taken into account. To identify news site characteristics, NSContrast [10] has been proposed. This system analyzes multiple news sites using the concept of contrast set mining and aims to extract the characteristic information about each news site by performing term co-occurrence analysis. The system has potential for extracting characteristic terms that reflect topic divergence between different countries. However, because of the poor quality of some machine translation, NSContrast results include some meaningless terms generated by this mistranslation [9].

In this paper, we discuss the issues related to handling articles from multiple news sites via machine translation, and we propose a method for using Wikipedia as a resource for constructing a bilingual dictionary for an NSContrast system. Using this dictionary, a new database of news articles is constructed, which is then tested via user experimentation with an NSContrast system.

2. NSCONTRAST
2.1 Term Collocation Analysis by Contrast Set Mining
Term collocation analysis is a well-known text mining method for extracting characteristic information from texts [7]. However, conventional collocation analysis mostly focuses on the characteristic information that is dominant in the text database. In many cases, most of the information is well known and is therefore not particularly interesting. To solve this problem, we introduce the concept of contrast set mining [8] for the analysis. This framework compares a global data set and a conditioned local data set to find characteristic item information that is significantly different from the global characteristic information. Even though this information may not be dominant in either the global or the local data set, it can be used to understand the characteristics of the local database.

We use Discovery of Correlation (DC) pair mining [8] for term collocation analysis. In DC pair mining, the "difference in correlations observed by conditioning a local database" is of particular interest. To quantify this difference, we introduce a new measure, change(X, Y; C), defined by

  change(X, Y; C) = correl_C(X, Y) / correl(X, Y),

where X and Y represent the item sets and C represents the condition for creating the local database. correl(X, Y) and correl_C(X, Y) correspond to the correlations between X and Y in the global database and a C-conditioned local database, respectively.

To utilize this technique for term collocation analysis of multiple text databases (news text databases from different countries), we modify this analysis method as follows.

• Because the size of each text database is not much smaller than that of the global database, a characteristic collocation occurring only in one database is also a characteristic collocation in the global database. Therefore, the contrast between the conditioned database and the rest of the databases is used instead of the original definition.

• Because the computational cost of DC pair mining is substantial, the target term for analysis (X in the formula) is given by the user.

By using this technique, we can extract characteristic minor topics that are of interest (higher change: Xm in Figure 1) or are neglected (lower change: Xn in Figure 1) in one database compared with others.

Figure 1: Collocation Analysis based on DC Pair Mining

2.2 NSContrast: A News Site Analysis System
The NSContrast system is a method for accessing news articles from multiple news sites using the concept of DC pair mining [10, 9]. This system has the following analytic components.

• Term collocation analysis based on DC pair mining. The system generates a list of characteristic terms by comparing news article databases from different countries. This term list is represented as a term collocation graph to aid understanding of the relationships among characteristic terms.

• A burst analysis function [4] for finding an appropriate time sequence window. To find good characteristic terms using contrast set mining, it is preferable to select a large number of articles for a particular topic. Because burst analysis is a method that finds a period during which a given term is of more interest than usual, it is an effective technique for finding this information.

• A news article retrieval system. To understand the meaning of term collocation analysis and burst analysis, a news article retrieval system is used for this purpose.

3. A NEW NSCONTRAST USING WIKIPEDIA INFORMATION
We constructed a news article database by collecting news articles from news sites in Japan, China, Korea, and the USA. We confirmed that NSContrast has capabilities that extract characteristic terms for understanding the differences between news articles on a particular topic from multiple countries. However, during experiments, a user claimed that the system tends to select the following two types of meaningless characteristic terms.

1. Terms generated by mistranslation. For example, "奧巴馬" means Obama (President of the United States) in Chinese, but the machine translation system translates this term as "オーストラリア巴馬" ("奧" is also used for representing "オーストラリア (Australia)"). Because of this mistranslation, NSContrast tends to select "オーストラリア (Australia)" as a characteristic term for China when the system is analyzing topics related to the President of the United States.

2. Terms with many synonyms. For example, "北朝鮮 (North Korea)", "朝鮮民主主義人民共和国 (Democratic People's Republic of Korea)", and "DPRK" are equivalent terms for the same country in Asia. When one of the news sites uses a different representation from others, this term tends to be selected as a characteristic term.

From an analysis of these problems, we found that they occur mainly when handling named entities (e.g., names of people, countries, and companies) in a Chinese–Japanese translation system. This is because the dictionary for the system is updated only infrequently and therefore lacks entries for recently named entities. Because Wikipedia has many entries related to named entities, and these entries are associated via language links (equivalent relationships among different languages) and redirection links (reference relationships in the same language), it is a good resource for constructing both bilingual dictionaries and synonym dictionaries.
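Returning to the change measure defined in Section 2.1, a minimal sketch may make it concrete. The paper does not spell out correl() in this excerpt, so the sketch assumes a simple co-occurrence-based correlation, P(x, y) / (P(x) P(y)); the function names and data layout are illustrative only, not the authors' implementation.

```python
from collections import Counter

def correl(cooc: Counter, occ: Counter, n_articles: int, x: str, y: str) -> float:
    # Illustrative correlation proxy over articles: P(x, y) / (P(x) * P(y)).
    pxy = cooc[(x, y)] / n_articles
    px, py = occ[x] / n_articles, occ[y] / n_articles
    return pxy / (px * py) if px and py else 0.0

def change(global_db, local_db, x: str, y: str) -> float:
    # change(X, Y; C) = correl_C(X, Y) / correl(X, Y), where local_db holds the
    # statistics of the C-conditioned database (e.g., one country's news site).
    # Each *_db is a (cooc, occ, n_articles) tuple.
    g = correl(*global_db, x, y)
    c = correl(*local_db, x, y)
    return c / g if g else 0.0
```

A term Y whose change for country C is well above 1 is a candidate minor topic of interest (Xm in Figure 1); a value well below 1 marks a comparatively neglected topic (Xn).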


In this research, we constructed a Chinese–Japanese bilingual dictionary from Wikipedia, based on the method proposed by [2]. Because it is easier for the machine translation system to translate Japanese named entities that are described in terms of kanji than katakana ones (katakana contains phonograms that are used mainly to represent terms imported from other countries), we constructed a bilingual dictionary for Japanese katakana terms, as follows.

1. Selection of Japanese katakana entries with language links. From a data dump, we selected Japanese katakana entries that have a language link to Chinese entries. For example, "バラク・オバマ (Barack Obama)" is selected because it is a katakana entry and has a link to "巴拉克・奧巴馬".

2. Separation of first name and family name. For most people, in both katakana and Chinese, the first name and the family name are separated by "・". The correspondence between these names is specified by using "・". For example, from a language link "バラク・オバマ (Barack Obama)" to "巴拉克・奧巴馬", two dictionary entries ("巴拉克" to "バラク (Barack)" and "奧巴馬" to "オバマ (Obama)") are generated.

3. Use of redirection links in the Chinese Wikipedia. In the Chinese language, two or more different Chinese expressions for the same person often exist. In such cases, redirection links in the Chinese Wikipedia are used to represent the relationship among these entries. By using these links, bilingual dictionary entries for handling these terms were generated. For example, from a Chinese redirection link "歐巴馬" to "奧巴馬" and a language link "奧巴馬" to "オバマ (Obama)", a new dictionary entry "歐巴馬" to "オバマ (Obama)" is generated.

By using this method, 13,255 bilingual dictionary entries (Chinese to Japanese) were extracted from Japanese and Chinese Wikipedia data dumps. We used this Chinese–Japanese dictionary as an add-on external bilingual dictionary for a machine translation system by using the Language Grid service [1, 6], which supports the combination of a dictionary lookup service with a machine translation system.

When we used this machine translation system with the Wikipedia dictionary, we found several entries that were inappropriate translations. For example, from the Wikipedia entry "フレンズ (Friends: TV drama from the US)", the dictionary entry "朋友" to "フレンズ (Friends)" was generated. However, because "朋友" is also used as a common term (i.e., an unnamed entity term), it is preferable to use "友人 (Friends)" instead of "フレンズ (Friends)" in such cases. We manually checked the frequently occurring dictionary entries in a Chinese newspaper database to remove such inappropriate entries.

We also used Japanese redirection links to construct a synonym dictionary. All entries that had redirection links were normalized with respect to their destination entries. For example, using the redirection link "北朝鮮 (North Korea)" to "朝鮮民主主義人民共和国 (Democratic People's Republic of Korea)", "北朝鮮 (North Korea)" was normalized to "朝鮮民主主義人民共和国 (Democratic People's Republic of Korea)" in the database.

In addition, because Wikipedia has a broad coverage of entries, from general topic terms to technical terms and from traditional terms to newly introduced terms, these entries may be good candidates for characteristic terms. Based on this understanding, we developed a New NSContrast system by modifying the NSContrast database to include an index of Wikipedia entries for each news article. We also restricted the candidates for term collocation analysis to Wikipedia entries.

The news article database was populated by following news sites from January 1, 2008 to April 19, 2009. Table 1 shows detailed information about the database.

Table 1: News Site Information
Name of the site (country), URL                                            Number of articles (Total/Daily)
Asahi newspaper (Japan), http://www.asahi.com/                             58344/122
Yomiuri newspaper (Japan), http://www.yomiuri.co.jp/                       48675/102
Nihon keizai newspaper (Japan), http://www.nikkei.co.jp/                   69638/146
CNN (USA), http://www.cnn.co.jp/                                           9542/20
Chosun newspaper (Korea), http://www.chosunonline.com/                     24389/51
Joins newspaper (Korea), http://japanese.joins.com/                        18842/39
People newspaper (China), http://j.peopledaily.com.cn/                     18775/38
Chosun newspaper (Korea), machine translation, http://www.chosun.com/      102009/214
Xinhua newspaper (China), machine translation, http://www.xinhuanet.com/   367459/773

Table 2 and Figure 2 show an example of New NSContrast output when analyzing the term "Pakistan" during the first half of December, 2008. All terms were originally in Japanese and are translated into English.

In Table 2, we can see that almost all countries share the same major topic keywords, such as "Mumbai" and "India". However, the keyword lists with Higher Change (minor topics of interest in each country) differ from country to country. In order to understand the relationships among these keywords, the characteristic term collocation graph (Figure 2) is used. There are three types of node in this term collocation graph: terms from the keyword list with higher correlation (white rectangle nodes), terms from the keyword list with higher change (gray rectangle nodes), and the names of the countries (white octagon nodes). The locations of nodes and links are calculated in terms of the spring model [3] by using the GraphViz library (http://www.graphviz.org/) with the fdp command.

For example, "Pacific Ocean", "material" and "Obama" are Korean characteristic terms, and those keywords have a strong collocation relationship with the keyword "nuclear weapon". When the user clicks these keywords, the system shows the news articles that contain them. From these articles, the user can understand that the Korean newspapers pay more attention to Obama's policy related to the nuclear weapon issue concerning Pakistan and North Korea.
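The dictionary construction in steps 1–3 above is easy to prototype. The sketch below is only an outline under simplifying assumptions: it starts from already-parsed language links and Chinese redirects (extracting these from the Wikipedia dumps is not shown), uses a crude katakana-only test, and applies the middle-dot name split described in the text; all function and variable names are hypothetical.

```python
import re

# Katakana block U+30A0-U+30FF also covers the middle dot and prolonged sound mark.
KATAKANA = re.compile(r"^[\u30A0-\u30FF]+$")

def build_bilingual_dict(ja_lang_links, zh_redirects):
    """ja_lang_links: {japanese_title: chinese_title} from language links.
    zh_redirects: {chinese_alias: chinese_canonical_title} from redirect pages.
    Returns a Chinese -> Japanese dictionary following steps 1-3 in the text."""
    zh2ja = {}
    for ja, zh in ja_lang_links.items():
        if not KATAKANA.match(ja):                  # step 1: katakana entries only
            continue
        zh2ja[zh] = ja
        ja_parts, zh_parts = ja.split("・"), zh.split("・")
        if len(ja_parts) == len(zh_parts) > 1:      # step 2: split given/family names
            for ja_p, zh_p in zip(ja_parts, zh_parts):
                zh2ja[zh_p] = ja_p
    for alias, canonical in zh_redirects.items():   # step 3: Chinese redirect aliases
        if canonical in zh2ja:
            zh2ja.setdefault(alias, zh2ja[canonical])
    return zh2ja

# The example from the paper: バラク・オバマ <-> 巴拉克・奧巴馬, plus the redirect 歐巴馬 -> 奧巴馬.
d = build_bilingual_dict({"バラク・オバマ": "巴拉克・奧巴馬"}, {"歐巴馬": "奧巴馬"})
assert d["奧巴馬"] == "オバマ" and d["歐巴馬"] == "オバマ" and d["巴拉克"] == "バラク"
```

The same redirect-following idea, applied to Japanese redirects, gives the synonym normalization described above (e.g., mapping 北朝鮮 to its canonical title).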


Table 2: List of Characteristic Keywords for analyzing Pakistan (columns: All, Japan, Korea, China, USA)
Keyword list with Higher Correlation (major topic in each country or all countries):
Mumbai, Mumbai, India, India, India, India, India, requiem (indou), attack, Mumbai, Mistranslation of India, attack, western, Mumbai, armament, western, Kashmir, Security, nuclear weapon, terrorist, Al-Qaeda, terrorist, restriction, commander, terror, evidence
Keyword list with Higher Change (minor topic with interest in each country):
law, China, security, target, restriction, Pacific Ocean, Palestine, improvement, integration, material, boundary, commerce, group, Obama, base, usage, period, America, soldier, morning
Keyword list with Lower Change (comparatively neglected topic in each country):
purpose, relation, western, broadcast, dispatch, organization, director, president, anxiety, occurrence, security, unit, leader, terrorist, operation, border, border, basement, China, dispatch

In other countries, there are small numbers of articles related to this issue. In this graph, keywords that represent a common major topic are located in the middle of the graph. In this case, the "terrorist" "attack" at "Mumbai", "India" is the major topic in this period.

Comparing these results with those for the previous NSContrast system, the number of articles that have "Pakistan" as an index term has increased because of the synonym assimilation. In addition, the readability of the term collocation graph has improved.

4. CONCLUSION
In this paper, we have introduced the NSContrast system, which can extract characteristic information about news sites for given topic terms by using contrast set mining techniques. In addition, we have also proposed using Wikipedia to handle different expressions of the same named entity. By using Wikipedia in a New NSContrast system, we confirmed an improved recall of the news article retrieval results for a particular topic. This is important for an unbiased analysis of the differences between various countries.

5. REFERENCES
[1] A. Bramantoro, M. Tanaka, Y. Murakami, U. Schäfer, and T. Ishida. A hybrid integrated architecture for language service composition. In ICWS '08: Proceedings of the 2008 IEEE International Conference on Web Services, pages 345–352, Washington, DC, USA, 2008. IEEE Computer Society.
[2] M. Erdmann, K. Nakayama, T. Hara, and S. Nishio. An approach for extracting bilingual terminology from Wikipedia. In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), pages 686–689. Springer-Verlag GmbH, 2008. LNCS 4947.
[3] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Software – Practice and Experience, 21(11):1129–1164, 1991.
[4] J. Kleinberg. Bursty and hierarchical structure in streams. In Proceedings of the 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 91–101, New York, NY, USA, 2002. ACM Press.
[5] K. R. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, J. L. Klavans, A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman. Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of the Human Language Technology Conference, 2002.
[6] NICT Language Grid Project. How to use Language Service, 2009. http://langrid.nict.go.jp/langrid-developers-wiki-files/language-service-use-manual-20090219-en.pdf.
[7] F. Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177, 1993.
[8] T. Taniguchi and M. Haraguchi. Discovery of hidden correlations in a local transaction database based on differences of correlations. Engineering Applications of Artificial Intelligence, 19(4):419–428, 2006.
[9] M. Yoshioka. Analyzing multiple news sites by contrasting articles. In Proceedings of the Fourth Intl. Conf. on Signal-Image Technology & Internet-Based Systems, pages 45–51, 2008.
[10] M. Yoshioka. IR interface for contrasting multiple news sites. In H. Lee, T. Liu, W.-Y. Ma, T. Sakai, and K.-F. Wong, editors, Information Retrieval Technology, 4th Asia Information Retrieval Symposium, AIRS 2008, Harbin, China, January 15–18, 2008, Revised Selected Papers, pages 508–513. Springer-Verlag GmbH, 2008. LNCS 4993.


Figure 2: Term Collocation Graph for the Term “Pakistan”
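Because node placement is delegated to GraphViz's fdp spring-model layout, producing a graph like Figure 2 mostly amounts to emitting a DOT file with the three node styles described above. The sketch below is illustrative only: the keyword lists and collocation edges are hard-coded stand-ins for whatever the NSContrast database actually supplies.

```python
def write_collocation_dot(path, country, high_correl, high_change, edges):
    """Write a GraphViz DOT file: white rectangles for higher-correlation terms,
    gray rectangles for higher-change terms, a white octagon for the country."""
    def node_id(term):
        return '"%s"' % term.replace('"', r'\"')
    with open(path, "w", encoding="utf-8") as f:
        f.write("graph collocations {\n")
        f.write("  %s [shape=octagon, style=filled, fillcolor=white];\n" % node_id(country))
        for t in high_correl:
            f.write("  %s [shape=box, style=filled, fillcolor=white];\n" % node_id(t))
        for t in high_change:
            f.write("  %s [shape=box, style=filled, fillcolor=gray];\n" % node_id(t))
        for a, b in edges:
            f.write("  %s -- %s;\n" % (node_id(a), node_id(b)))
        f.write("}\n")

# Render with, e.g.:  fdp -Tpng pakistan.dot -o pakistan.png
write_collocation_dot("pakistan.dot", "Korea",
                      ["Mumbai", "India", "nuclear weapon"],
                      ["Pacific Ocean", "material", "Obama"],
                      [("Obama", "nuclear weapon"), ("Pacific Ocean", "nuclear weapon")])
```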



What’s wrong with Cross-Lingual IR?
John I. Tait
Information Retrieval Facility
Operngasse 20b, 1040 Vienna, Austria
+43 1 236 94 74 6053
[email protected]

ABSTRACT
To date much cross-language information retrieval research has focused on evaluation paradigms which were developed for monolingual web search. The paper argues that rather different scenarios are required for situations where cross-lingual search is a real requirement. In particular cross-lingual search is usually a collaborative as opposed to individual activity, and this needs to be taken into account in the evaluation of cross-lingual retrieval, especially when considering the notion of relevance.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]

General Terms
Documentation, Experimentation, Human Factors

Keywords
Patent Search; Intellectual Property Search; Information Retrieval; Cross-lingual Retrieval.

1. INTRODUCTION
It seems to me that for non-professional searchers there is very little requirement for cross lingual searching. Most non-professional searchers formulate queries in their native language and require results in that language. Even with much better machine translation than has ever been available before one would rarely find an automatic translation that one could include in one's school homework!

On the other hand professional searchers in fields like Intellectual Property, Competitor Analysis, opinion mining and some international aspects of legal search really do need Cross Lingual Information Retrieval (CLIR). Other forms of professional search where CLIR is especially relevant include military intelligence analysis; employees of multi-national corporations who are not fluent in the major language of the corporation; and journalism.

Of course there are consumer searches where CLIR is needed: immigrants may require personal access to legal documents available only in the language of the country in which they are living.

This paper outlines one such professional information retrieval setting (patent search) and points out some problems with evaluation in that setting (especially the need for a sophisticated notion of relevance).

2. RELEVANCE IN CLIR
Experience with patent search has made it clear that while professional patent searchers need to access information in all languages in which patents can be filed, they require output in comparatively few languages: possibly only English and Chinese. This has implications for the design of cross lingual information systems, but also the evaluation, including the ways in which relevance is judged.

This brief paper is not the place to present a detailed description of professional patent search in practice, but see Hunt, Nguyen and Rodgers [1] for example for more information including taxonomies of patent search. See also Bailley's presentation [2] at the Information Retrieval Facility Symposium in 2008.

Generally patent searchers will be instructed by a patent attorney acting on behalf of an inventor or their employer. More complex searches might be done at the behest of strategic business managers requesting patent landscape searching to determine, for example, whether research and development investment in a particular area is likely to yield patentable results.

The patent searcher will then formulate a series of queries (the search strategy) which will be addressed to often several different search systems. In practice most searching is on English abstracts, but really thorough searching for patentability, for example, requires searching of many different languages. This is a high recall task, in which it is important not to miss relevant documents.

Now there are several steps in the judgement of relevance in this context. First, the searcher needs to make initial judgements of the relevance of patents (and indeed scientific articles and other material which may show the patent is not original, or obvious for instance). Then the patent attorney will review the results of the search; and in some cases other people: for example technical specialists (chemical engineers, molecular geneticists, search engine engineers etc.), language specialists, other lawyers, business managers and so on.

Now each of these groups, and the group collectively for an individual search, will bring different judgements of relevance to the retrieved document set. This needs to be taken into account and modelled explicitly in the evaluation.

Consider potential confounding factors in the experiment: what we are attempting to judge is the ability of the searcher to use the system to locate and determine the relevance (as assessed by the whole group). Quality of result translation may, for example, cause incorrect determination of relevance (or irrelevance) and we really need evaluation frameworks which take this into account.

It might be thought that evaluation of relevance after each stage would suffice: but the problem is that we wish to reuse the results of user centered experiments as a gold standard for subsequent system centered experiments. Subsequent group review of searchers' initial relevance judgements is likely to cause changes in their view of what constitutes a relevant document, compounding the usual problems of having judged relevance sets which are consistent across time and judge.

Now I'm not claiming to say much new here: see Saracevic [3] for a much more sophisticated approach; but those ideas do need to be more rigorously and consistently applied to CLIR evaluation.

3. OTHER ASPECTS OF EVALUATION
The consideration of confounding factors in our evaluation experiments leads onto some more general requirements of evaluations of CLIR for professional search. It is not appropriate to give an exhaustive list here, but factors to be taken into account include:
1. The place of the computer systems in the human system;
2. The need for component based evaluation;
3. The need to assess the impact of frozen and limited collections on the ecological validity (see http://en.wikipedia.org/wiki/Ecological_validity for an introduction) of the experiment (for example where the collection excludes relevant documents searchers expect to see from their prior subject knowledge).

All this needs more careful thinking through than has been done to date. Petrelli and others [4, 5] have attempted to do so in the context of news but this work seems to have received comparatively little attention in subsequent literature.


Incidentally, they do not fully report the language competence of their users, and their sample is small, but the papers seem to indicate a predominant need (in CLIR) to retrieve non-English documents for (manual or automatic) translation into English.

What is really required, as a first step, is the ability to observe the use of patent search in situ with sufficient access to all those involved in patent searching to record and explore the reasons why they made the relevance judgements they did, and to at least assess the likelihood those judgements will be consistent and reproducible. There are considerable barriers to doing this: many users of the patent system are extremely secretive because their patent searches can reveal information about their commercial and research strategies which is of immense value to their competitors. However if we as scientists can achieve the necessary level of trust among these communities it will both yield valuable data for us and, at the very least, better assessments for them of their own working practices.

4. CONCLUSION
Conventional CLIR evaluations have relied very much on the Cranfield experimental model pioneered by Cyril Cleverdon, Karen Spärck Jones and others [6]. This paper is really a plea to move to more sophisticated models of evaluation for professional search, the context in which cross lingual retrieval is really valuable.

5. ACKNOWLEDGMENTS
I would like to thank my colleagues on the evaluation working group at the recent Interactive Information Retrieval Dagstuhl who developed my thinking on this topic; my colleagues in Matrixware and the IRF, especially those who have worked on the CLEF IP and TREC CHEM tracks; the many IP professionals who have taken the time to educate me about patent search – especially Henk Tomas; my three anonymous reviewers; and the workshop organizers for some valuable criticisms and suggestions.

6. REFERENCES
[1] Hunt, D., Nguyen, L., and Rodgers, M. Patent Search: Tools and Techniques. Wiley, 2007.
[2] Bailley, S. (2008) Challenges of the Asian IP information for non native speakers. Presentation at the Information Retrieval Facility Symposium 2008. Available from http://www.ir-facility.org/symposium/irf-symposium-2008/videos-and-presentations.
[3] Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58(3), 1915-1933.
[4] Petrelli, D., Hansen, P., Beaulieu, M., Sanderson, M., Demetriou, G. and Herring, P. (2004) Observing Users – Designing Clarity: A case study on the user-centred design of a cross-language retrieval system. Journal of the American Society for Information Science and Technology (JASIST), special topic on "Document Search Interface Design for Large-scale Collections", 55(10), 2004, 923-934.
[5] Petrelli, D., Levin, S., Beaulieu, M., Sanderson, M. Which User Interaction for Cross-Language Information Retrieval? Design Issues and Reflections. Journal of the American Society for Information Science and Technology (JASIST), special issue on "Multilingual Information Systems", 57(5), 709-722, 2006.
[6] Spärck Jones, K. 1981. Information Retrieval Experiment. Butterworth-Heinemann.


A Virtual Evaluation Track for Cross Language Link Discovery

Wei Che (Darren) Huang
Faculty of Science and Technology
Queensland University of Technology
Brisbane, Australia
[email protected]

Andrew Trotman
Department of Computer Science
University of Otago
Dunedin, New Zealand
[email protected]

Shlomo Geva
Faculty of Science and Technology
Queensland University of Technology
Brisbane, Australia
[email protected]

ABSTRACT
The Wikipedia has become the most popular online source of encyclopedic information. The English Wikipedia collection, as well as some other language collections, is extensively linked. However, as a multilingual collection the Wikipedia is only very weakly linked. There are few cross-language links or cross-dialect links (see, for example, Chinese dialects). In order to link the multilingual Wikipedia as a single collection, automated cross language link discovery systems are needed – systems that identify anchor-texts in one language and targets in another. The evaluation of Link Discovery approaches within the English version of the Wikipedia has been examined in the INEX Link-the-Wiki track since 2007, whilst both CLEF and NTCIR emphasized the investigation and the evaluation of cross-language information retrieval. In this position paper we propose a new virtual evaluation track: Cross Language Link Discovery (CLLD). The track will initially examine cross language linking of Wikipedia articles. This virtual track will not be tied to any one forum; instead we hope it can be connected to each of (at least) CLEF, NTCIR, and INEX, as it will cover ground currently studied by each. The aim is to establish a virtual evaluation environment supporting continuous assessment and evaluation, and a forum for the exchange of research ideas. It will be free from the difficulties of scheduling and synchronizing groups of collaborating researchers and alleviate the necessity to travel across the globe in order to share knowledge. We aim to electronically publish peer-reviewed publications arising from CLLD in a similar fashion: online, with open access, and without fixed submission deadlines.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Search Process.

General Terms: Measurement, Performance, Experimentation

Keywords: Cross Language, Link Discovery, Information Retrieval, Evaluation

1. INTRODUCTION
1.1 Background
Collaborative hypertext knowledge management systems, such as the Wikipedia, offer an efficient means for creating, maintaining, and sharing information. Extensive linking between documents in these systems is essential for user navigation and assists readers to varying degrees. A reader with extensive background knowledge of a topic may be less likely to follow a link, while a less knowledgeable reader may choose to follow many links in order to expand their knowledge.

Links in the Wikipedia originate from two primary sources: the page authors' knowledge of the document collection; and automated link discovery systems such as Can We Link It [6] and Wikify [9]. The Link-the-Wiki track at INEX [22] was established as an independent evaluation forum for measuring the performance of these kinds of link discovery systems.

In 2007 the track explored document-to-document link discovery in the English Wikipedia; in 2008 the track also looked at anchor-text identification within the source document and the placement of the target point (the best-entry-point, or BEP) within the target document. This second kind of link discovery is known as anchor-to-BEP link discovery, or focused link discovery. The track also developed a standard methodology and metrics for the evaluation of link discovery systems.

The track's results show that excellent link discovery systems have been developed – that is, there are now published algorithms that can almost perfectly predict the links in a Wikipedia page. However, manual assessment revealed the highly unexpected result that many existing Wikipedia links are not relevant (at least not to the INEX assessors)! INEX now recommends manual assessment as the preferred procedure for the evaluation of link discovery systems. We note that an INEX assessor manually assessing links from the pool perfectly models a user who (after adding a new article to the Wikipedia) is navigating a list of links recommended by a link discovery system – accepting or rejecting as they go. This process lends itself to the interactive study of link discovery systems.

With the growth of the multilingual Wikipedia (and the multilingual web) there is a growing need for cross-language information retrieval, including cross language interlinking of multilingual documents. Most Wikipedia pages are written in English and we, unsurprisingly but anecdotally, observe users whose first language is not English searching the English Wikipedia. Their need is two-fold: the Wikipedia documents to be translated into their first language; and links between documents to reflect their language choices.


Translation is already happening and some cross-language links already exist; however, these problems are our research motivation. We are trying to:

• Identify Wikipedia documents that are all on the same topic, irrespective of language, and
• Identify hypertext links between documents, irrespective of language, so that a user can choose a target document based on a language preference.

1.2 Motivation
Many Internet users are multi-lingual. To satisfy their information needs the search engine should return documents in the different languages they read. Doing so is more thorough than returning results in just one language. As examples, the Early history of the United States of America can be found in the Chinese Wikipedia but the English Wikipedia has a much richer document on the topic; information about Chinese Dynasties may be found in several documents in the Chinese language Wikipedia, and indeed in several distinct Chinese language versions of the Wikipedia. In both examples, a link between these different language versions will help the multi-lingual reader. In both examples focused cross-lingual anchor-to-BEP links would result in a more comprehensive interlinked knowledge base, especially if the links the multilingual reader sees are based on a personal language profile. Envisage a document being interlinked to any number of languages, but users only seeing links to languages that are defined in their personal profile.

Anchor-to-BEP linking is a feature of HTML that is rarely exploited in links – despite its existence since the beginnings of the web. Very few links in the Wikipedia actually take the user from the point of reference (the anchor) to the target location within another document (the BEP). Such interlinking is common within a single document and is used in navigation, but is rarely utilised when linking between documents. Such focused interlinking is particularly desirable when documents are large or when browsing on small mobile devices. For instance, in the article South Eastern Main Line, an orphaned anchor, Folkestone Harbour, is colored in red. It is a place-holder for a link to an article that does not yet exist. However, the article Folkestone does have a section titled Folkestone Harbour. This prospective anchor could be linked to this section until an article on Folkestone Harbour is created.

Extending the INEX Link-the-Wiki track to cross language linking will help turn the Wikipedia into a multi-lingual knowledge network. The section 地理 (English: Geography) in the article 英国 (English: England) has two anchor texts, 多佛港 (English: Dover Harbour) and 英吉利海峡隧道 (English: Channel Tunnel; French: Le tunnel sous la Manche). There is no link for 多佛港 (English: Dover Harbour) in the Chinese Wikipedia, but an article on the Port of Dover is linked from the redirect of the Dover Harbour page in the English Wikipedia. Information (images and geography) about Dover Harbour can also be found in the English article on Dover. The article 英吉利海峡隧道 does not express much information about the channel tunnel, certainly not as much as the English Channel Tunnel page. These two examples show the need for cross-language links within the Wikipedia.

To the best of our knowledge, current link discovery systems such as Wikify [9] focus on the monolingual Wikipedia and have not been extended to support multilingual link discovery. Cross-language tracks conducted in NTCIR and CLEF explore Information Retrieval and Question-Answering but not link discovery. Link Discovery is different from Information Retrieval although it does rely on similar technology: for link discovery a match of semantic context between the point of reference (the context of the anchor) and the target text (the BEP context) is essential. Unlike query based information retrieval, in link discovery the context of the anchor is always explicit since the anchor is always embedded in surrounding text, and similarly the context of the target location is highly focused and specific. In information retrieval the query is known but the context unknown; in link discovery it is necessary to identify both the query (the anchor-text) and the results list (the target document and BEP), and embedding contexts are available. Link discovery provides a rich context in which NLP based approaches may well prove much more useful than they have been in query based information retrieval. Furthermore, cross-language link discovery involves a set of technologies, including IR, NLP, semantic and similarity matching techniques, character encoding technologies, machine readable corpora and dictionaries, machine translation, focused and passage retrieval, and multiple links per anchor discovery. Cross Language Link Discovery (CLLD) demands the tight integration of techniques currently under examination at INEX, CLEF and NTCIR.

Herein we formally propose the CLLD track. This track will be run as a single collaborative web-based forum. Participants will be drawn from the existing forums, but be part of none (or all); it will be an online virtual evaluation forum. All collections, topics, submission and result analysis will be maintained via a remote repository. By using only public domain data (such as the Wikipedia and open source software) we can simplify participation and the sharing of resources. The community of participants will provide software tools and assessments, as well as a peer-reviewed online publication for approaches and results. The forum will not be tied to any particular timeline or venue but will be run as a continuous evaluation track – without a dedicated annual event (although there is no reason not to hold such meetings, perhaps as surrogate to larger events). This proposal represents a dramatic philosophical change from the traditional TREC paradigm.

2. RELATED WORK
As suggested by Wilkinson & Smeaton [1], navigation between linked documents is a great deal more than simply navigating multiple results of a single search query; linking between digital resources is becoming an ever more important way to find information. Through hypertext navigation, users can easily understand context and realize the relationships in related information. However, since digital resources are distributed it has become difficult for users to maintain the quality and the consistency of links. Automatic techniques to detect the semantic structure (e.g. hierarchy) of the document collection, and the relatedness and relationships of digital objects, have been studied and developed [2]. Early works, in the 1990s, determined whether and when to insert links between documents by computing document similarity. Approaches such as term repetition, lexical chains, keyword weighting and so on were used to calculate the similarity between documents [3, 4, 5].


These approaches were focused on the document-to-document linking scenario, rather than identifying which parts of which documents were related.

Jenkins [6] developed a link suggestion tool, Can We Link It. This tool extracts a number of anchors which have not been discovered in the current article and that might be linked to other Wikipedia documents. The user can accept, reject, or click "don't know" to leave a link as undecided. Using this tool the user can add new anchors and corresponding links back to a Wikipedia article.

A collaborative knowledge management system, called PlanetMath, based on the Noosphere system, has been developed for mathematics [7]. It is encyclopedic (like the Wikipedia), but mainly used for the sharing of mathematical knowledge. Since the content is considered to be a semantic network, entries should be cross-referenced (linked). An automatic linking system provided by Noosphere employs the concept of conceptual dependency to identify each entry for linking. Based on the Noosphere system, NNexus (Noosphere Networked Entry eXtension and Unification System) was developed to automate the automatic linking procedure [8]. This was the first automatic linking system which eliminates the linking efforts required by page authors.

The Wikify [9] system, which integrates technologies of automatic keyword extraction and word sense disambiguation, can identify the important concepts in a document and link these concepts to corresponding documents in the Wikipedia. Mihalcea and Csomai stated that many applications, such as the annotation of semantic webs and academic systems, can benefit from this kind of system.

Since the inception of TREC in 1992, interest in IR evaluation has increased rapidly and today there are numerous active and popular evaluation forums. It is now possible to evaluate a diverse range of information retrieval methods including: ad-hoc retrieval, passage retrieval, XML retrieval, multimedia retrieval, question answering, cross language retrieval, link discovery, learning to rank, and so on.

The CLIR (Cross-Language Information Retrieval) track was first introduced to TREC in 2000. It offered document collections in English, French and German and queries in English, French, German, Spanish and Dutch. Three fundamental resources, machine translation, machine readable dictionaries and corpus-based resources, have been used. There are three common approaches to match queries with the resource documents [10]: machine translation technology using dictionaries and statistical information or example-based translation; machine readable bilingual dictionaries; and relying on corpus resources to train the retrieval mechanism by using Latent Semantic Indexing (LSI), the Generalized Vector Space Model or Similarity Thesauri for translating queries. As a performance baseline, corresponding language queries were also submitted against the same language collections.

The cross-language track investigated the retrieval of relevant documents that pertain to a particular topic or query regardless of the language in which both the topic and documents were written. The last TREC CLIR track was run in 2002; however, ongoing effort can be found in both NTCIR and CLEF [11].

NTCIR started in late 1997 and is focused on Japanese and other East Asian languages [12]. The workshop runs on an 18-month cycle. The aim is to build an infrastructure for large-scale experimental evaluation of Information Access (IA) research. IA in the workshop has been indicated as the process of searching, browsing and looking for relevant information, and utilizing the information. Technologies like Information Retrieval (IR), Cross-Language Information Retrieval (CLIR), Question-Answering (QA), text summarization and text mining are considered part of the IA family.

The goal is to develop a module-based infrastructure for evaluation integrating IR and QA technologies to propose answers in a suitable format for given questions in any language. It intends to model significant work from every participant and build a set of APIs (or modules) to facilitate the development of cross-language (QA) systems. A platform, called EPAN (Evaluation Platform for ACLIA and NTCIR), was adopted by NTCIR to perform the collaborative evaluation [13]. Through module-by-module evaluation it is possible to identify problems in parts of participants' otherwise complicated CLIR-QA systems – something not possible in end-to-end evaluation. For example, many CLIR-QA systems failed to retrieve relevant documents when named entities in queries did not appear in ordinary translation dictionaries. The module-based approach also makes it possible for participants to collaborate by working on different modules.

The annual CLEF forum aims to create a research forum in the field of multilingual system development [14]. The experiments range from monolingual text retrieval to multilingual multimedia retrieval. The collection and available languages vary depending on the task. They include 3 million news articles in 13 languages, a social science database in English and German, the Cambridge Sociological Abstracts, the Russian ISISS collection, 3.5 million web pages in multiple languages, and a photograph database with captions in different languages [15]. Various sets of topics in different languages are available for the respective tasks.

The DIRECT system used by CLEF manages data (collection, topics, and metrics), builds statistics (analysis, plots and results) and provides different entries for various roles [16, 17]. Of particular note is the dynamic user interface through which participants can interact with their own and others' experimental data and results.

3. MULTILINGUAL
The Wikipedia is a multilingual online encyclopedia that offers a free and flexible platform for developing a collaborative knowledge repository [18]. Currently, it has entries written in more than 200 different languages. Overell [19] shows that the geographic coverage of the Wikipedia very much depends on the language version – places in the UK are best covered by the English language version of the Wikipedia while places in Spain are best covered by the Spanish language version. There are more than 2,874,919 articles (May 2009) in the English Wikipedia, which is the largest language version in the collection. By the end of 2008, no fewer than 13 languages had more than 200,000 articles. As can be seen in Table 1, both European and Asian language versions have reached a substantial size – the collection is already useful for multilingual research [20].


Table 1: Language subsets of the Wikipedia at end 2008

Language     Articles (K)   Database (GB)   Bytes per Article   Over 2KB
English*     1100           3.2             3092                363 (33%)
German       858            3.5             3489                420 (49%)
French       746            2.7             2995                269 (36%)
Polish       567            1.5             1988                130 (23%)
Japanese     555            2.6             1890                128 (23%)
Italian      533            1.9             2914                192 (36%)
Dutch        507            1.3             2021                147 (29%)
Portuguese   448            1.1             1866                90 (20%)
Spanish      430            1.8             3492                189 (44%)
Russian      347            1.9             2822                125 (36%)
Swedish      301            0.609           1535                54 (18%)
Chinese      209            0.717           1451                36 (17%)
Norwegian    203            0.475           1781                43 (21%)
* WikiMedia state the English version as at Oct. 2006!

Most documents have a rich set of same-language links but a much lower number of cross-language links. Figure 1 shows that by October 2006 (WikiMedia have not provided stats for English since that date) only 1.4 million English anchor texts have been linked to entries in other languages while the amount of same-language links in the English Wikipedia had reached 24.2 million. Around a third of anchor texts in the Chinese Wikipedia were linked to other language collections in the Wikipedia. There are several different Chinese dialects available as different collections and many English terms are linked to the English Wikipedia. Other versions shown in Figure 1 have less than 20 percent of their links pointing to other languages. For Article-to-Article links, Sorg and Cimiano state that around 50% of Wikipedia articles in the German version are linked to English Wikipedia articles whilst only 14% of English articles have links to the German version [21]. Linking the Wikipedia entries across different languages is still very limited. Efficient and accurate cross language link discovery is yet to be demonstrated and evaluated.

Figure 1: Different linking types in different language versions of the Wikipedia at the end of 2008 (bars show Internal Links (M), Links to Other Wikipedias (M), and External Links (M); the statistics of the English version are dated Oct. 2006).

In order to achieve a comprehensive cross-language knowledge network, cross language link discovery is essential. The document collection is large (there are many millions of documents) and broad (there are many languages covered). Although only one link per anchor is typically displayed by existing HTML based web browsers, there is no inherent restriction for this limit in the HTML standard. Anchor text can be linked not only to multiple targets, but also in different languages, and the extension of browsers to support this functionality is long overdue.

4. INEX LINK-the-WIKI TRACK
The INEX Link-the-Wiki track offers a standard forum for the evaluation of link discovery in both document-to-document and anchor-to-BEP linking. The task is to discover a set of anchor texts, based on the context of the topic, which can semantically be linked to best entry points in other documents. Besides outgoing links, candidate incoming links from other articles into the topic document are also required.

The INEX 2008 English Wikipedia collection consisting of 659,388 documents was used as the corpus for the experiments since the Wikipedia is composed of millions of interlinked articles in numerous languages and has been proved a valuable collection for IR research. 50 documents nominated by participants were used in the anchor-to-BEP task. For document-level link discovery, 6,600 documents were randomly selected but pre-filtered (for suitability) by size and the number of links. For 2009 this collection has been updated and now consists of over 2.6 million articles, spanning over 60GB of text (without images).

The documents were converted into topics by removing all links in the documents and removing all links from the remainder of the collection to those documents; they were said to have been orphaned from the collection. The original documents (with the links) were said to be pre-orphans.

Links within the collection to and from the pre-orphans were extracted and used as the ground truth to which runs were compared. This is the automatic evaluation method. Using standard methodology, the topics were sent to participating groups. Participants' link discovery systems were required to return a ranked list of at most 50 outgoing text anchors (each of which targeted up to 5 documents). The results were exhaustively pooled and the ground truth of existing links in the Wikipedia was added to the pools. Pools were manually assessed to completion, by the groups that nominated the topics. This formed the manual evaluation set. Assessment against the ground-truth result set generated a score for the performance of a submission relative to the Wikipedia. Additional evaluation was performed against the manually assessed links (including the manually assessed existing Wikipedia links). The outcome showed that the Wikipedia ground-truth does not agree with user expectation. The manual assessment process is necessary in order to produce a reliable test base for evaluation [22].

5. PROPOSED CLLD METHODOLOGY
The cross language link discovery track will work in a similar manner to the INEX track. Participating groups will be asked to submit a list of languages they can read (from those covered by the Wikipedia). This also indicates the languages these participants can submit runs for and can assess in. A subset of topics from the different language Wikipedia collections will be chosen and distributed to participants. To support the creation of topics, a selection tool will be developed to help choose and to orphan documents in the indicated languages.
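The orphaning step described above is mechanical and easy to sketch. The code below is only an illustration, assuming articles are held as wikitext with [[target|anchor]] markup (the INEX corpus is actually distributed as XML): it strips a topic's outgoing links while recording the pre-orphan targets that serve as automatic-evaluation ground truth. Removing the incoming links from the rest of the collection would be a second pass over every other document and is not shown.

```python
import re

# Matches [[Target]], [[Target|anchor]] and [[Target#Section|anchor]] style links.
LINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")

def orphan(wikitext):
    """Return (orphaned_text, ground_truth_targets) for one topic document.
    Each link is replaced by its anchor text; the targets become the ground truth."""
    targets = []
    def strip(m):
        target, anchor = m.group(1).strip(), m.group(2)
        targets.append(target)
        return anchor if anchor is not None else target
    return LINK.sub(strip, wikitext), targets

text, truth = orphan("The [[Port of Dover]] lies below the [[White Cliffs of Dover|cliffs]].")
# text  -> "The Port of Dover lies below the cliffs."
# truth -> ["Port of Dover", "White Cliffs of Dover"]
```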


Participants will be able to choose any combination of languages, for example German documents linking to English or Dutch documents, or Chinese documents linking to Japanese documents.

Programs will be provided that allow participants to view their runs. These programs will show proposed linked documents with the anchors and their respective target document best-entry-points, as would be seen by an assessor (and ultimately end-users of their system).

The challenge for the organizers is to obtain a critical mass of participants and assessors to facilitate robust and reliable manual evaluation in multiple languages. The track must, therefore, be a close and extensive collaboration between NTCIR, CLEF, INEX, and other evaluation forums.

5.1 Task Specification
Initially two linking tasks will be formalized:

• MULTILINGUAL topical linking: This is a form of document clustering – the aim is to identify (regardless of language) all the documents in all languages that are on the same topic. The Wikipedia currently shows these links in a box on the left hand side of the page.

• BILINGUAL anchor linking: This is exemplified by the Chinese article 诺森伯兰郡 having a link from the anchor 国会选区 to the English article List of United Kingdom Parliament constituencies. The link discovery system must identify the anchor text in one language version of the Wikipedia and the destination article within any other language version of the Wikipedia.

In the case of MULTILINGUAL topical linking, the participants are encouraged to discover as many documents as they can.

In the case of BILINGUAL anchor linking, at most 50 anchors may be identified in an orphaned document and up to 5 BEPs may be linked per target language (e.g. English to German). Initially only outgoing links will be examined, since incoming links from a single language may not make sense.

5.2 Test Collections and Submission
The set of multilingual Wikipedia collections will be used as the corpus for cross-language link discovery. The size and the number of documents are listed in Table 1. Nominated topics will be collected and the ground truth extracted from the collection. Participants will be encouraged to share in the development of appropriate procedures for topic selection, multilingual topic discovery, ground-truth link extraction, and the assessment method.

The submission format may be derived from the format currently used by INEX. The existing INEX tools will be ported to support CLLD.

5.3 Evaluation
It is essential to define a standard methodology and procedure to assess link quality and to quantitatively evaluate different approaches.

5.3.1 Static Evaluation
When Trotman & Geva [24] introduced the Link-the-Wiki track at INEX they also noted that the evaluation required no human assessment. The same is true with cross-language link discovery. Topics in the INEX Link-the-Wiki track are chosen directly from the document collection. All links in those documents are removed (the document is orphaned). The task is to identify links for the orphan (both from and to the collection). Performance is measured relative to the pre-orphan (the document before the links were removed).

For MULTILINGUAL linking, the links on the left hand side of the Wikipedia page could be used as the ground truth. The performance could be measured relative to the alternate language versions of the page already known to exist.

BILINGUAL anchor linking from one document to another could also be automatically evaluated. Links from the pre-orphan to a destination page in an alternate language would be used as the ground truth – but there are unlikely to be many such links.

A same-language link from a pre-orphan to a target provides circumstantial evidence that, should the target exist in multiple languages, the alternate language versions are relevant. This is essentially a triangulation: where A, B, and C are documents, and t designates a topical link, l a cross-language link, and tl a topical cross-language link, a topical link from A to B combined with a cross-language link from B to C implies a topical cross-language link from A to C. By extension, the same inference can be drawn where two documents are not linked but have the same title.

Static assessment requires no human interaction. A web site will be built and provided containing orphan sets chosen using some criteria (perhaps randomly), the assessment sets (extracted from the pre-orphans), and a facility to evaluate a run. Such an evaluation methodology raises the possibility of running the track continually and without any deadlines.

Huang et al. [22] question automatic evaluation. Their investigation suggests that many of the links in the Wikipedia are not topical, but are trivial (such as dates), and that users do not find them useful. Manual assessment is, consequently, necessary. Although the automatic evaluation method in the INEX Link-the-Wiki track is less accurate, it is still very practical and provides a reliable way to evaluate different link discovery methods against the ground truth. Submission runs that have better performance in automatic evaluation tend to have better results in manual evaluation.

5.3.2 Continual Evaluation
Manual assessment raises new challenges for cross-language link discovery because finding assessors fluent in multiple languages is difficult – especially for a track with a relatively small number of participants but a large number of languages (the Wikipedia has over 200 languages).

We propose a novel form of evaluation called continual evaluation, in which participants can download topics and submit runs at any time, and in which contribution to manual assessment is an on-going concern. The document collection will, initially, be static. Topics will either be chosen at random from the collection, or nominated by participants. For any given run a participant will download a selection of topics and submit a run. The evaluation will be based on metrics that consider the un-assessed document problem (such as a variant on rank-biased precision [23]), and comparative analysis will be relative to an incomplete, but growing, assessment set.
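The variant metric itself is not specified here, so the following is only a minimal sketch of standard rank-biased precision [23], with an explicit residual for links that have not yet been assessed; names and the persistence parameter are illustrative.

```python
def rank_biased_precision(ranked_links, judgments, p=0.8):
    """Rank-biased precision over a ranked list of proposed links.

    `judgments` maps a link to True/False; links missing from it are treated
    as unassessed, and their potential contribution is returned as a residual
    that shrinks as the (growing) assessment set covers more of the run.
    """
    score, residual, weight = 0.0, 0.0, 1.0 - p
    for link in ranked_links:
        if link not in judgments:
            residual += weight          # contribution still unknown
        elif judgments[link]:
            score += weight             # assessed relevant
        weight *= p                     # geometric discount down the ranking
    return score, residual
```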


To collect assessments, two methods are proposed: first, in order to submit a run the participant will be required to assess some anchor–target pairs in languages familiar to them; second, we will run an assessment Game With A Purpose (GWAP). Kazai et al. used a GWAP for the INEX Book track; von Ahn & Dabbish [25] discuss GWAPs in other contexts (including the Google Image Labeler). Regardless of the method of assessment collection, we are trying to validate the minimum number of links necessary to disambiguate the relative rank order of the runs (within some known error).

6. PUBLICATION
Both automatic and manual assessment of cross-language link discovery can be done on a continual rolling basis; there is no need for topic submission deadlines, run deadlines, assessment deadlines, or paper publication deadlines.

At INEX the time difference between run submission and the workshop paper submission date is long (6 July – 23 Nov). With automatic assessment it is possible to achieve a result, write, and then publish a paper with a short turnaround. As part of the virtual track we propose an open-access virtual workshop workbook to which registered participants can immediately submit their papers for peer review and publication.

7. CONCLUSIONS AND OUTLOOK
As far as we are aware, the cross-language link discovery track is the first to offer extensive reusable independent evaluation resources. In this paper we introduce this new evaluation task.

A fully automated procedure for anchor-to-document link analysis, using the existing Wikipedia linking network, is described. The procedure was used at INEX 2007 and allowed us to create a fast evaluation procedure with a turnaround time of days rather than months because it required no manual assessment. The procedure allows a very large number of documents to be used in experiments. This overcomes the assessment bottleneck encountered in most other tasks in collaborative evaluation forums such as INEX and TREC. We further proposed to extend the task to Cross-Language Link Discovery, and propose the concept of automatic evaluation. We describe the requirements for evaluating such a task.

These activities need not be held in a fixed place but can be carried out by gathering participants from INEX, CLEF and NTCIR through a virtual web-based system. The CLLD track will be dedicated to supporting efficient methods and tools for CLLD evaluation. The collections, submissions and result data will be well managed for further analysis and experiments. Participants from different nations are expected to work collaboratively to achieve the development of multilingual link discovery systems.

Baseline automatic evaluation methods seen at INEX do not require human intervention, as the assessments are extracted directly from the collection and performance is measured relative to these. The new track can, therefore, bootstrap and run online with continuous evaluation, free from the problems of scheduling groups of collaborating researchers. Over time, manual assessments will be collected and will improve the available resources. We also propose to publish the results of the track in a similar fashion to the CLLD track itself – online, with open access, and with quality control.

8. REFERENCES
[1] Wilkinson, R. and Smeaton, A. F., Automatic Link Generation, ACM Computing Surveys, 31(4), December 1999.
[2] Green, S. J., Building Hypertext Links By Computing Semantic Similarity, IEEE Transactions on Knowledge and Data Engineering, 11(5), September/October 1999, pp. 713-730.
[3] Allan, J., Building Hypertext using Information Retrieval, Information Processing and Management, 33(2), pp. 145-159.
[4] Green, S. J. (1998) Automated Link Generation: Can We Do Better than Term Repetition?, In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, pp. 75-84.
[5] Zeng, J. and Bloniarz, O. A. (2004) From Keywords to Links: an Automatic Approach, In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), 5-7 April 2004, pp. 283-286.
[6] Jenkins, N., Can We Link It, http://en.wikipedia.org/wiki/User:Nickj/Can_We_Link_It.
[7] Krowne, A., An Architecture for Collaborative Math and Science Digital Libraries, Master of Science thesis, Virginia Polytechnic Institute and State University, 19 July 2003.
[8] Gardner, J., Krowne, A. and Xiong, L., NNexus: Towards an Automatic Linker for a Massively-Distributed Collaborative Corpus, In Proceedings of the International Conference on Collaborative Computing, 17-20 November 2006, pp. 1-3.
[9] Mihalcea, R. and Csomai, A., Wikify!: linking documents to encyclopedic knowledge, CIKM 2007, pp. 233-242.
[10] Schäuble, P. and Sheridan, P., Cross-Language Information Retrieval (CLIR) Track Overview, In Proceedings of the Sixth Text Retrieval Conference, pp. 31-44.
[11] Agosti, M., Di Nunzio, G. M., Ferro, N., Harman, D. and Peters, C., The Future of Large-Scale Evaluation Campaigns for Information Retrieval in Europe, LNCS 4675, 2007, pp. 509-512.
[12] Kando, N., Overview of the Seventh NTCIR Workshop, In Proceedings of the NTCIR-7 Workshop Meeting, 16-19 December 2008, Tokyo, Japan, pp. 1-9.
[13] Mitamura, T., Nyberg, E., Shima, H., Kato, T., Mori, T., Lin, C.Y., Song, R., Lin, C. J., Sakai, T., Ji, D. and Kando, N., Overview of the NTCIR-7 ACLIA Tasks: Advanced Cross-Language Information Access, In Proceedings of the NTCIR-7 Workshop Meeting, 16-19 December 2008, Tokyo, Japan, pp. 11-25.
[14] Ferro, N. and Peters, C., From CLEF to TrebleCLEF: the Evolution of the Cross-Language Evaluation Forum, In Proceedings of the NTCIR-7 Workshop Meeting, 16-19 December 2008, Tokyo, Japan, pp. 577-593.
[15] Di Nunzio, G. M., Ferro, N., Mandl, T. and Peters, C., CLEF 2007: Ad Hoc Track Overview, CLEF 2007, LNCS 5152, pp. 13-32.
[16] Di Nunzio, G. M. and Ferro, N., DIRECT: A System for Evaluating Information Access Components of Digital Libraries, ECDL 2005, LNCS 3652, pp. 483-484.


[17] Dussin, M. and Ferro, N., Design of the User Interface of a Scientific Digital Library System for Larger-Scale Evaluation Campaigns, Post-proceedings of the Fourth Italian Research Conference on Digital Library Systems (IRCDL 2008), pp. 105-113.
[18] Wikipedia, the free encyclopedia, 2009. http://wikipedia.org/.
[19] Overell, S. E., Geographic Information Retrieval: Classification, Disambiguation and Modelling, Department of Computing, Imperial College London, 2009, 175 pp.
[20] Wikipedia Statistics, Wikipedia, the free encyclopedia, 2009. http://stats.wikimedia.org/EN/Sitemap.htm.
[21] Sorg, P. and Cimiano, P., Enriching the Cross-lingual Link Structure of Wikipedia – A Classification-Based Approach, In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence, June 2008.
[22] Huang, W. C., Trotman, A. and Geva, S., The Importance of Manual Assessment in Link Discovery, in SIGIR 2009, ACM Press: Boston, USA, 2009.
[23] Moffat, A. and Zobel, J., Rank-biased precision for measurement of retrieval effectiveness, ACM Trans. Inf. Syst., 27(1):1-27, 2008.
[24] Trotman, A. and Geva, S., Passage Retrieval and other XML-Retrieval Tasks, In SIGIR 2006 Workshop on XML Element Retrieval Methodology, Seattle, Washington, USA, 2006, pp. 43-50.
[25] von Ahn, L. and Dabbish, L., Designing games with a purpose, Commun. ACM, 51(8):58-67, 2008.



Towards an Integrative Approach to Cross-Language Information Access for Digital Libraries

Jiangping Chen, Department of Library and Information Sciences, College of Information, University of North Texas, 1155 Union Circle 311068, Denton, Texas 76203-1068, USA, [email protected]
Miguel E. Ruiz, Department of Library and Information Sciences, College of Information, University of North Texas, 1155 Union Circle 311068, Denton, Texas 76203-1068, USA, [email protected]

ABSTRACT
Digital libraries are becoming more ubiquitous and their contents can be of interest not only to the communities that create them but also to individuals and organizations around the world. The content of most digital libraries is not visible to web search engines, and the internal search engines of many digital libraries do not support cross-language access. In this paper, we present some ideas for a general framework to support cross-language information access for digital libraries. The Integrative Cross-Language Information Access (iCLIA) framework is designed to facilitate cross-language access to digital libraries by capitalizing on the integration of multiple sources of translational knowledge, the integration of computational power and human intelligence, and the integration of several methods for word sense disambiguation. Collaboration among digital library software developers, CLIR researchers and users is extremely important for the application of CLIR to existing and future digital library projects.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.7 [Information Storage and Retrieval]: Digital Libraries.

General Terms
Information Retrieval, Digital Libraries, Standardization

Keywords
Cross-Language Information Retrieval, Digital Libraries, Cross-Language Information Access.

1. INTRODUCTION
Digital libraries contain human intellectual creations that are valuable not only to designated user communities, but also to interested organizations and individuals all over the world. The content of most digital libraries is not visible to Web search engines; therefore sophisticated digital library (DL) services such as searching and browsing have been developed and explored to facilitate information access. However, most existing digital libraries can only be accessed in a single language. Cross-Language Information Access (CLIA), an extension of the field of Cross-Language Information Retrieval (CLIR) [5], enables DL users to search for and locate digital objects written in languages other than the language of their search terms. Although CLIA technologies including CLIR, Cross-Language Question Answering (CLQA), and Cross-Language Information Extraction (CLIE) have been actively explored by researchers since the mid-1990s, none of these technologies has been widely applied to existing digital libraries to enable digital library users to access information across languages [4, 7].

Because machine translation usually produces hard-to-understand translations, many organizations and information systems still rely on human translators for translating documents or files from one language to another. As for digital libraries, very few have realized multilingual information access [2]. An analysis of 150 digital libraries in the United States found that only five of them can be accessed using more than one language. Table 1 lists these five digital libraries.

Table 1. Digital Libraries with Multilingual Information Access in the United States.

Library Name                               URL                                                      Languages
Meeting of Frontiers                       http://frontiers.loc.gov/intldl/mtfhtml/mfsplash.html    English/Russian
France in America                          http://international.loc.gov/intldl/fiahtml/fiahome.html English/French
Parallel Histories                         http://international.loc.gov/intldl/eshtml/              English/Spanish
International Children's Digital Library  http://www.icdlbooks.org/                                Digital objects in 11 languages; users can do keyword search in 51 languages.
The Perseus Digital Library                http://www.perseus.tufts.edu                             Greek, English, Latin

The digital libraries listed in Table 1 have been funded by various funding agencies, especially the federal government.


They are each the product of collaboration. DL developers in different countries cooperate to produce the bilingual or multilingual collections. These digital libraries serve broad or global user communities in which users speak different languages. However, none of them employs cross-language search, cross-language information retrieval techniques or machine translation. This situation may be the result of concerns involving technical, financial, and user-related issues. However, a lack of exploration of a CLIA framework for digital libraries, and of practical assessment of such a framework, may prevent digital library communities from implementing CLIA services.

The European Union has been a strong supporter of the creation of digital libraries that support multilingual access. The European Library (TEL) is one of the projects that has shown interest in supporting cross-lingual access to the collections of the national libraries of EU members [3], but so far it only allows monolingual retrieval in many of the different languages and multilingual access through matching of the controlled vocabulary assigned as part of the metadata. According to Cousins [3], TEL conducted a survey and found that users have high expectations of the integrated portal, expecting it to be able to support cross-language retrieval access. Another recent example of an attempt to create a library with CLIA support is the Europeana project (http://www.europeana.eu/portal/), which supports multilingual access in 26 European languages. Other examples include Cross-Language Access to Catalogues and On-Line Libraries (CACAO, http://www.cacaoproject.eu/), Improving Access to Text (IMPACT, http://www.impact-project.eu/), and the Common Language Resources and Technology Infrastructure (CLARIN, http://www.clarin.eu/).

Cross-language information retrieval for web search engines has been available since 2005, when Yahoo launched a CLIR search interface option for the German and French sites [9]. Yahoo currently supports CLIR for 40 languages. In May 2007, Google launched its "Translated Search" in its Google Language Tools (http://www.google.com/language_tools) in addition to other language support services and tools. The launch of cross-language search by Yahoo and Google was a breakthrough event because it signified the transition of CLIR from research to real application. It is probably the first time that CLIR and machine translation (MT) have been integrated to provide a real application on the Internet. We believe it is time to make CLIA services part of the development of digital libraries.

2. INTEGRATIVE CLIA FRAMEWORK
We are in a position to develop and evaluate an integrative CLIA (iCLIA) framework for digital libraries. The framework will extend the Lexical Knowledge Base (LKB) model for English-Chinese Cross-Language Information Retrieval [1] to facilitate fast development of cross-language information access (CLIA) services for digital libraries. It will also include procedures and design principles for sustaining the CLIA services.

This framework will:

(1) Integrate translational knowledge from multiple sources.

Translational knowledge is crucial for MT systems and query translation processes. Incomplete or faulty translational knowledge is the major cause of the out-of-vocabulary problem. In the iCLIA framework, translational knowledge should be obtained from multiple sources, such as commercial or open-source translation dictionaries, the DL collection, the Web, and the DL users. Efforts will be made to integrate translational knowledge from different resources to develop a lexical knowledge base (LKB) capable of translating queries and to augment the dictionary of MT systems for translating documents. Figure 1 is an illustration of the possible translational knowledge collection strategy: the translational knowledge will be collected from multiple sources including the DL collection itself through automatic information extraction, possible commercial translational resources, the Web, other translational resources such as open-source machine translation systems, and the users. Each source will be evaluated and assigned a label to specify its quality. The translational knowledge will be identified, organized, and stored in a lexical knowledge base for use during the retrieval stage.

[Figure 1: The LKB Construction Process – translational knowledge is collected from the DL collection (via keyword extraction), the Web, commercial translation resources, other translational resources, and user-contributed translations, and assembled through LKB construction into the LKB for CLIA.]

[Figure 2: The Structure of the LKB – three relational tables: Lexicon in A (TermID-A, Term, Category, Association), A–B Mapping (TermID-A, TermID-B, Probability, ImpMeasure, Source), and Lexicon in B (TermID-B, Term, Category, Association).]

The simplified structure of the LKB is shown in Figure 2. Translational knowledge can be stored in three relational tables.
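As a minimal relational sketch of the three tables in Figure 2 (column names follow the figure; the SQLite types and keys are assumptions for illustration only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon_a (             -- terms in the collection language A
    term_id_a   INTEGER PRIMARY KEY,
    term        TEXT,
    category    TEXT,
    association TEXT);
CREATE TABLE lexicon_b (             -- terms in the query language B
    term_id_b   INTEGER PRIMARY KEY,
    term        TEXT,
    category    TEXT,
    association TEXT);
CREATE TABLE ab_mapping (             -- translation pairs with provenance
    term_id_a   INTEGER REFERENCES lexicon_a(term_id_a),
    term_id_b   INTEGER REFERENCES lexicon_b(term_id_b),
    probability REAL,                 -- translation probability
    imp_measure REAL,                 -- importance measure
    source      TEXT);                -- which resource supplied the pair
""")
```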


Important terms (words or phrases) in the metadata or full text of the digital objects will first be extracted and stored in "Lexicon in A", where "A" denotes the original language of the digital collection. Then the translational knowledge collection process depicted in Figure 1 will be applied to generate "Lexicon in B" and the "A–B Mapping" tables. Here "B" denotes the query language of the users. The three tables will store important information such as the category of the terms and any associations among the terms. The translation probability will be determined by a formula considering the quality of the various sources that provide the translation.

(2) Integrate computational power and human intelligence.

CLIA users are usually absent in the CLIA literature. Our iCLIA framework emphasizes interaction with DL users in order to improve the CLIA service for digital libraries. The CLIA service will be introduced to the users so that the whole user community of the digital library can participate in the development of the service. The CLIA process will then include users who provide feedback on the translation and retrieval results. As illustrated in Figure 3 (and sketched in code at the end of this section), the CLIA process will be iterative and interactive: DL users will be able to ask queries in their languages, provide feedback on the translations returned by the system if they desire, and examine search results and their translation. The feedback from the users will be input into the system to update the LKB.

[Figure 3: CLIA for DL using LKB – An Iterative & Interactive Process: query translation using the LKB, result translation using MT augmented by the LKB, user feedback on translation, and LKB update.]

(3) Integrate various translational disambiguation approaches for best performance.

Translational disambiguation has been extensively explored in the CLIR literature. Algorithms have been developed for CLIR experiments at IR evaluation forums such as TREC, CLEF, and NTCIR. However, agreement on a single, best approach has yet to be reached. The IR approach of combining results from multiple solutions may be interesting to explore for CLIA for DLs. Also, a translational disambiguation strategy may need to consider the language pairs to be processed.

(4) Integrate different measures for effectiveness and efficiency.

The CLIA services for digital libraries cannot simply copy IR performance measures such as precision and recall. More factors and criteria need to be considered in order to develop a dynamic, practical, and sustainable CLIA service. Performance and efficiency measures from system-oriented and user-oriented IR evaluation, library and digital library service evaluation, and usability considerations should all be taken into account in order to develop a comprehensive method for implementing CLIA services.

3. STRATEGIES
Collaboration is crucial in order to develop the iCLIA framework. The following strategies should be considered to develop CLIA services based on the iCLIA framework for digital libraries with limited funding:

• Collaborate with digital library (DL) developers to work on real digital libraries. Such collaboration facilitates the exploration of solutions that are appropriate for the specific digital objects while seeking funding to support CLIA as a value-added service. As digital objects are more organized than Web pages crawled by search engines, it is possible that better machine translation performance could be achieved through the construction of the LKB for machine translation software.

• Collaborate with researchers and DL developers in other countries to increase the languages with which users can access documents. Many digital libraries manage precious digital assets that are often attractive to people on the other side of the earth. Collaboration with colleagues in other countries would make information resources available to a much larger population.

• Collaborate with the users. In the current digital age, even monolingual digital libraries are accessed by people using differing languages [8]. Social computing has been widely used on the Internet, and it can play a big role in involving users in multilingual information access services: users may even volunteer to translate digital objects into another language. They may also help correct errors produced by machine translation systems. They will be more likely to donate money to help the DL offer the new service if they know the significance of the service and the information needs from the other side of the earth.

Taking a step-by-step approach, the DL system can first be provided with a cross-language interface, then CLIA to the metadata, and then to the full texts of the digital collections.
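As a compact illustration of the iterative, interactive process of Figure 3, the following sketch shows only the control flow; every callable name here is a hypothetical placeholder, not part of any existing DL system.

```python
def clia_search(query, translate_query, retrieve, translate_results, get_feedback, update_lkb):
    """One pass through the iterative CLIA loop sketched in Figure 3 (illustrative only)."""
    translated_query = translate_query(query)          # query translation using the LKB
    results = retrieve(translated_query)                # retrieval in the collection language
    shown = translate_results(results)                  # result translation: MT augmented by the LKB
    feedback = get_feedback(translated_query, shown)    # user comments on translations and results
    if feedback:
        update_lkb(feedback)                            # corrections flow back into the LKB
    return shown
```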


4. SUMMARY AND CONCLUSION
Information systems such as digital libraries would better serve their users if language support services were integrated as part of the systems. We propose to explore an integrative CLIA framework for digital libraries. CLIA can be a value-added service for many DLs if we explore the methods, the user needs, and the evaluation of such services.

Our future research will focus on developing and testing the iCLIA framework to understand the application of CLIA for DLs. Within the framework, we aim to understand more about the needs and information behavior of bilingual users, because bilingual users have been identified as the most probable users of CLIR systems. Also, we will collaborate with existing digital libraries to investigate effective and efficient solutions to providing multilingual information access for digital library users.

5. REFERENCES
[1] Chen, J. 2006. A lexical knowledge base approach for English-Chinese cross-language information retrieval. Journal of the American Society for Information Science and Technology, 57(2), 233-243.
[2] Chen, J. 2007. Services provided by digital libraries: current status and future strategies. Library and Information Service, 51(12), 26-32.
[3] Cousins, J. 2006. The European Library – pushing the boundaries of usability. The Electronic Library, 24(4), pp. 434-444.
[4] Gey, F. C., Kando, N., and Peters, C. 2005. Cross-language information retrieval: the way ahead. Information Processing and Management, 41, 415-431.
[5] Jones, G., Collier, N., Sakai, T., Sumita, K., and Hirakawa, H. 2001. A Framework for Cross-language Information Access: Application to English and Japanese. Computers and the Humanities, 35, 371-388.
[6] Mills, E. 2005. Parlez vous Deutsch? CNET News, July 14, 2005. http://news.cnet.com/Parlez-Vous-Deutsche/2100-1038_3-5787824.html
[7] Ruiz, M. E., Chen, J., Druin, A., Kando, N., Oard, D., and Peters, C. 2008. Multilingual Access to Digital Libraries. In Proceedings of the ASIS&T 2008 Annual Meeting, Columbus, OH, October 2008.
[8] Sorid, D. 2008. Writing the Web's Future in Numerous Languages. December 31, 2008.
[9] Sterling, G. 2007. Google Launches 'Cross-Language Information Retrieval (CLIR)'. Search Engine Land, May 24, 2007. http://searchengineland.com/google-launches-cross-language-information-retrieval-clir-11296


Ubiquity: Designing a Multilingual Natural Language Interface

Michael Yoshitaka Erlewine, Mozilla Labs, 650 Castro Street, Mountain View, CA 94041, [email protected]

ABSTRACT
This paper describes the design and implementation of Ubiquity, a multilingual textual interface for the Firefox browser developed at Mozilla Labs. The Ubiquity interface facilitates rapid information retrieval and task execution in the browser, leveraging existing open web APIs. The importance of offering equivalent user experiences for speakers of different languages is reflected in the design of Ubiquity's new natural language parser, described here. This paper also aims to advocate the further development of equipotent multilingual interfaces for information access.

Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems—human information processing; H.5.2 [Information Interfaces and Presentation]: User Interfaces—natural language; I.2.7 [Artificial Intelligence]: Natural Language Processing—language parsing and understanding

General Terms
Design, Experimentation, Human factors, Languages

1. INTRODUCTION
Language continues to be one of the greatest barriers to open information access on the internet. The participation of ever more diverse linguistic communities on the web has not only created great linguistic divides in web content, but has also naturally resulted in a multitude of disparate tools created within each community, leaving such projects less able to benefit from each others' innovations. While much effort and increased attention have been devoted to the development of multilingual corpora and resources, less attention has been given to guaranteeing that users with different linguistic backgrounds can use the same quality tools to access that information. As part of Mozilla's goal to make the internet experience better for all users [8], Ubiquity aims to bring a new form of interactivity into the browser which treats user input in different languages equally. Ubiquity offers a platform for rapid information access, with no languages treated as second-class citizens.

The desire of users to access the internet using an interface in the language most natural to them is reflected in Mozilla's latest Firefox browser release, which shipped in over 70 languages, each localized by a team of volunteers. The goal of fulfilling this desire is particularly pertinent—and challenging—in the case of a natural language interface.

Ubiquity was born of the Humanized Enso product (http://www.humanized.com/enso/), but is now an open-source community project, with dozens of contributors and active testers. It is available for download at http://ubiquity.mozilla.com and can be installed on the Firefox browser. Similar popular text-based command interfaces which are overlaid on a GUI include Quicksilver (http://www.blacktree.com) and GNOME Do (http://do.davesbd.com/), but neither of them attempts a natural language syntax, nor do they support localization of their parser and keywords.

2. TOWARDS A NATURAL INTERFACE
2.1 Features of a Natural Syntax
The lead of Ubiquity development, Aza Raskin, argues in his 2008 ACM interactions paper that text-based interfaces can be more humane than overextended graphical interfaces [10].¹ Graphical interfaces are easy to learn and apply for concrete tasks but do not scale well with additional functionality and lack the precision required to communicate abstract instruction. While numerous text-based computer interfaces exist, they have been deemed too difficult for lay users. Raskin argues that textual interaction does not entail these difficulties per se; rather, they are products of their oft-times stilted grammars. In reconsidering the text-based interface, ease and familiarity built into the interface are key. A subset of natural language is thus a clear winner.

Many programming and scripting languages—themselves interfaces to instruct the computer—make use of keywords inspired by natural languages (most often English). Many simple expressions neatly mirror a natural language (1a) but more complex instructions will quickly deviate (1b).

(1) a. print "Hello World" (Python)
    b. print map(lambda x: x*2, [1,2,3])

¹ The term "humane" is used in this paper to describe human-computer interfaces which are "responsive to human needs and considerate of human frailties" [12] (see also [11]).


One valiant effort to facilitate near-natural language instruction has been AppleScript, which enables complex English-like syntax (as in 2) and originally was planned to support similar Japanese and French "dialects."

(2) print pages 1 thru 5 of document 2 (AppleScript)

As a full-featured scripting language, however, more complex expressions push beyond the natural language metaphor and introduce their own idiosyncrasies. Bill Cook, one of the original developers of AppleScript, notes "in hindsight, it is not clear whether it is easier for novice users to work with a scripting language that resembles natural language, with all its special cases and idiosyncrasies" [5]. Raskin notes that this is precisely what must be addressed in designing a humane text-based interface: "if commands were memorable, and their syntax forgiving, perhaps we wouldn't be so scared to reconsider these interface paradigms" [10].

In designing an internationalizable natural language interface, we can conclude that it is not enough to use natural language keywords and mimic its syntax. The grammar must never conflict with a user's natural intuitions about their own language's syntax—a goal I call natural syntax. While a user can't expect such an interface to understand every natural language command, a good rule of thumb is that multiple natural alternatives for a given intent are interpreted in the same way. For example, consider the examples in (3) in Japanese, a language with scrambling.²

(3) a. 太郎に ボールを 投げろ
       Taro-ni ball-o nagero
       Taro-dat ball-acc throw-imper
    b. ボールを 太郎に 投げろ
       ball-o Taro-ni nagero
       ball-acc Taro-dat throw-imper

² Note that the Japanese examples are given with spaces between words to facilitate the glosses. Japanese does not normally place spaces between words.

Both sentences are valid expressions for the command "throw a ball to Taro." An interface with a natural syntax must understand either both of these inputs or, if for example the interface does not understand the verb nagero, neither of them. To understand one but not the other goes against the tenet of natural syntax.

2.2 Commands in Ubiquity
Ubiquity actions are requests for actions or information, corresponding functionally to the formal clause type of "imperative" [9], although they may manifest in forms traditionally characterized as "imperative," "infinitive," or "subjunctive," depending on the language [7]. No vocative is entered as the addressee is always the computer, nor do we handle negation,³ leaving Ubiquity input to simply be composed of a single verb and its arguments (if any). Some example English Ubiquity actions include:

(4) a. translate hello to Spanish—previews the text "hola." On execution, inserts the text "hola" in the active text field.
    b. email hello to John—on execution, composes a new email to contact John with message body "hello."
    c. map Tokyo—previews a map of Tokyo using the Google Maps API. The image can then be inserted into the page.

³ When negative imperative meanings are desired, verbs which lexicalize the negative meaning are chosen, e.g. prevent, turn off, etc.

[Figure 1: Schematic diagram of user interaction with Ubiquity – the lexicon of active verbs feeds the Ubiquity parser, which produces candidate parses such as { verb: translate, args: {object: "hello", goal: "Spanish"} } and { verb: translate, args: {object: "hello to spanish"} }; candidate selection then leads either to a preview/information display via verb.preview() or to action execution via verb.execute().]

Verbs are written in JavaScript. Each verb may specify a preview() method, which displays some information to the user or gives a preview of the action to be executed, and an execute() method, which carries out the intended action. In order to avoid ambiguity, a list of possible parses is presented to the user for confirmation before execution. Suggestions give a visual indication of the parsing. A scoring mechanism is used to bring more likely candidates to the top, taking user input and browser use habits into consideration.

3. ADDRESSING THE NEEDS OF MULTILINGUAL ACCESS
With the requirements and goals of the project as laid out in section 2, certain architectural choices were made in designing the parser in order to support multiple languages equipotentially. In this section I will review the unique features of our parser and platform which enable equal information access and rapid localization.

3.1 Identifying Arguments by Semantic Role
Ubiquity commands' ease of creation is a great strength for the platform, with many contributors around the world creating and sharing their own verbs as well as writing new verbs for personal use.
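Actual Ubiquity verbs are written in JavaScript; the following is only a language-neutral sketch, in Python, of the verb structure and candidate-parse pairing described above, with purely illustrative names and fields.

```python
class Verb:
    """Illustrative stand-in for a Ubiquity verb: a name, a semantic-role frame,
    and preview/execute behaviours (real verbs are JavaScript objects)."""
    def __init__(self, name, frame, preview, execute):
        self.name = name          # e.g. "translate"
        self.frame = frame        # semantic roles it accepts, e.g. {"object", "goal"}
        self.preview = preview    # callable: show what would happen
        self.execute = execute    # callable: carry out the action

translate = Verb(
    name="translate",
    frame={"object", "goal", "source"},
    preview=lambda args: print(f"would translate {args.get('object')!r}"),
    execute=lambda args: print(f"translated {args.get('object')!r}"),
)

# A candidate parse pairs a verb with its arguments, as in Figure 1.
candidate = {"verb": translate, "args": {"object": "hello", "goal": "Spanish"}}
candidate["verb"].preview(candidate["args"])   # preview before execution
```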


[Figure 2: Equivalent Ubiquity queries in three languages: English, French, and Japanese. Note that the two suggestions returned in each case are semantically different, reflecting the ambiguity between translating "hello to span" into an as yet unspecified language and translating "hello" into the Spanish language.]

In order to let users of different languages benefit equally from the platform, however, there is a need to internationalize the verbs themselves. Verbs include some strings which must be translated, such as the verb's name, but they also include a specification of the type of arguments it accepts, known as the syntactic frame of the verb. For example, in English an email verb may take a direct object and a recipient introduced by the preposition "to," while a translate verb may take an arbitrary direct object, a goal language marked by "to," and a source language marked by "from."

In order to facilitate this localization, we chose to let verbs specify their syntactic frames using abstract semantic roles such as object, goal, instrument, position, etc., which are morphosyntactically coded in most languages.⁴ For example, suppose an English-speaking contributor wrote a verb called move, whose action was to move an object from one location to another. Its syntactic frame could be specified as follows, where physical_object and location are noun types which specify a class of arguments and their associated semantic forms.

{ object: physical_object,
  source: location,
  goal: location }

The command author could then use this command in English, entering input such as (5). The parser recognizes the English prepositions "to" and "from" as corresponding to the goal and source roles (underlined below), and recognizes the unmarked argument as an object.

(5) move truck from Paris to Beijing
(6) トラックをパリから北京へmove

However, given a set of localized noun types, the exact same command code could be used with the Japanese parser by entering the input (6). Here, the parser recognizes that the postpositions "を," "へ," and "から" mark the object, goal, and source arguments, respectively. The only manual localization required for the move command, then, is the translation of the verb name "move" itself. As shown by this example, the specification of arguments using abstract semantic roles supports the rapid and, indeed, semi-automatic localization of commands, ensuring that users of all languages benefit from individual improvements to the Ubiquity platform's functionality.

⁴ In our use, "semantic role" is equivalent to the related notions of "grammatical function" and "thematic relation." An inventory [2] was chosen based on [6] and subsequent cross-linguistic work.

Table 1: Argument-first Suggestions

Sample argument parses              suggested verbs
{ object:…, goal:… }                email, send
{ object:…, instrument:… }          search, look up
{ object:…, source:…, goal:… }      move, translate

3.2 Argument-first Suggestions
In parsing Ubiquity input, a key task is the identification of the verb, if any. In many languages the verb naturally comes at the beginning of the sentence (see English examples in 4). In this case, as the verb can be identified early in the user input, we can then annotate the candidate parses with information on the missing arguments to guide the user in entering the rest of their input (see Figure 2). However, not all languages put the verb first in commands. Some languages are strictly verb-final (e.g. Japanese), while in some other languages (e.g. German, Dutch, Modern Greek) it is equally valid to express commands using the imperative or subjunctive verb form at the beginning of the sentence, or using the infinitive at the end of the sentence.

Rather than being discouraged by this conundrum, thought was given to how we can leverage the unique qualities of verb-final (or argument-first) input to make a more humane and supportive interface. As different verbs in our lexicon specify different syntactic frames, by parsing arguments and identifying semantic roles in the input before the verb is known, we can then suggest verbs to the user which match that particular argument structure (see examples of some such suggestions in Table 1).
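A minimal sketch of this argument-first matching, using the frames of Table 1; the data structures and function are illustrative assumptions, not the Ubiquity parser itself.

```python
# Toy verb lexicon: each verb declares the semantic roles its frame accepts.
FRAMES = {
    "email":     {"object", "goal"},
    "send":      {"object", "goal"},
    "search":    {"object", "instrument"},
    "look up":   {"object", "instrument"},
    "move":      {"object", "source", "goal"},
    "translate": {"object", "source", "goal"},
}

def suggest_verbs(parsed_roles):
    """Return verbs whose frames cover the roles identified so far (argument-first)."""
    found = set(parsed_roles)
    return [verb for verb, frame in FRAMES.items() if found <= frame]

print(suggest_verbs({"object": "hello", "goal": "Spanish"}))
# -> ['email', 'send', 'move', 'translate']; the list narrows as more roles appear
```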

This smart argument-first suggestion aids in command discoverability by suggesting verbs for a given target which the user may not have known existed. This approach crucially takes advantage of the argument-first input and offers unique value and increased usability to users of verb-final languages.

Note also that the suggestion of verbs based on argument-only input may also be useful for regularly verb-initial languages such as English. Studies of general interactive systems concur that noun-verb (or object-action) paradigms result in error reduction, increased speed, and better reversibility during input [11]. For these reasons, argument-first suggestions are supported in Ubiquity for all languages equally.

3.3 Minimal Language Descriptions
The Ubiquity parser attempts to make as much of its parser algorithm universal as is practical, taking a page from the Principles and Parameters framework in generative linguistics.⁵ A single universal parser was designed, with settings for different languages built on top of that base [1]. The settings for each language are written in JavaScript and range from ten to thirty lines of code. Various hooks exist in the code for language-specific processing when necessary, but the majority of the language settings are simply lists of special lexical items such as the prepositions or postpositions in a language. In this way, for the limited range of data which constitute Ubiquity input, the very difficult problem of writing a language-specific parser is reduced to little more than some native speaker consultation and string translation.

⁵ It is worth noting that this architectural choice also complemented the object-oriented architecture and Don't Repeat Yourself design goals of the project.

4. EVALUATION METRICS
As an open-source community project, the success of Ubiquity must be evaluated in terms of developer involvement as well as user adoption. The design choices laid out here are intended to lower the barrier of contributing to Ubiquity's localization, and the level of localizer engagement is a direct reflection of the facility or difficulty of contributing to individual parser language setting files and localizations. An initial survey of this metric can be interpreted as optimistic, with language settings having been written for ten languages and Ubiquity's built-in commands having been completely localized into three of those languages as of this writing, even before widespread public release of the new parser.

Approximate user adoption rates can be calculated based on download and update counts, though this does not currently directly reflect active usage of Ubiquity nor give us much insight into its different use cases. Current user adoption is expected to be limited by the fact that previous versions of Ubiquity were almost exclusively suited for English use, and we expect a slow uptick in usage and interest by users in other languages which we are beginning to support. Future collaboration with the Mozilla Labs Test Pilot project [4] to collect anonymous user behavior data in Ubiquity is also being planned [3]. This data will help yield more accurate usage statistics, including usage breakdowns by language, as well as yield valuable information on parser accuracy and user interaction patterns.

5. CONCLUSIONS
Further globalization of the web without serious consideration of multilingual information access could spur the further fragmentation of information and ideas. Equal access to information will require more than just cross-language search and retrieval systems, but also universal interfaces which are designed for rapid localization and treat all languages equally.

In this paper I outlined some of the design features of Ubiquity's interface and natural language parser which bring the system closer to this goal. Formal approaches to the study of language were applied in order to design a system which can be extended to a wide range of languages. As of this writing, settings for ten languages have been written for Ubiquity, while the community process of setting technical standards for verb and noun type localization is in progress.

Ubiquity is quickly becoming a compelling text-based interface for both advanced and casual users. The forgiving "natural syntax" philosophy and the smart suggestion of verbs and arguments to the user help make Ubiquity a humane interface which cooperates with users rather than confounds them. These qualities make Ubiquity a natural choice of interface platform for multilingual and cross-language information access applications.

6. ACKNOWLEDGMENTS
Thank you to comments from Aza Raskin and Jonathan DiCarlo at Mozilla and audiences at BarCamp Tokyo; Chuo, Waseda, and Keio Universities; Tokyo 2.0; Tokyo Institute of Technology; as well as comments on related material on my blog (http://mitcho.com/blog/).

7. REFERENCES
[1] Parser 2 - MozillaWiki. https://wiki.mozilla.org/Labs/Ubiquity/Parser_2.
[2] Semantic Roles in Parser 2 - MozillaWiki. https://wiki.mozilla.org/Labs/Ubiquity/Parser_2/Semantic_Roles.
[3] Ubiquity Roadmap - MozillaWiki. https://wiki.mozilla.org/Labs/Ubiquity/Roadmap.
[4] C. Beard. Introducing Test Pilot. https://labs.mozilla.com/2008/03/introducing-test-pilot/.
[5] W. R. Cook. AppleScript. In The Third Conference on the History of Programming Languages, 2007.
[6] C. J. Fillmore. Types of lexical information. Reidel, Dordrecht, 1969.
[7] S. Iatridou. De modo imperativo. Lecture notes, ENS, Paris, September 2008.
[8] Mozilla Foundation. The Mozilla manifesto, v0.9. http://www.mozilla.org/about/manifesto.en.html.
[9] P. Portner. The semantics of imperatives within a theory of clause types. In Proceedings of Semantics and Linguistic Theory, volume 14, 2005.
[10] A. Raskin. The linguistic command line. interactions, 15(1):19-22, 2008.
[11] J. Raskin. The Humane Interface: New Directions for Designing Interactive Systems. Addison-Wesley, 2004.
[12] A. Varma. Humanized > Why 'humane' is a better word than 'usable'. http://humanized.com/weblog/2006/06/01/why_humane_is_a_better_word_than_usable/.


Romanization – An Untapped Resource for Out-of-Vocabulary Machine Translation for CLIR

Fredric Gey, University of California, Berkeley, UC Data Archive & Technical Assistance, Berkeley, CA 94720-5100, 510-643-1298, [email protected]

ABSTRACT
In Cross-Language Information Retrieval (CLIR), the most persistent problem in query translation is the occurrence of out-of-vocabulary (OOV) terms which are not found in the resources available for machine translation (MT), e.g. dictionaries. This usually occurs when new named entities appear in news or other articles which have not been entered into the resource. Often these named entities have been phonetically rendered into the target language, usually from English. Phonetic back-transliteration can be achieved in a number of ways. One of these, which has been under-utilized for MT, is Romanization, or rule-based transliteration of foreign typescript into the Latin alphabet. We argue that Romanization, coupled with approximate string matching, can become a new resource for approaching the OOV problem.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – abstracting methods, linguistic processing

General Terms
Experimentation

Keywords
Machine Translation, Romanization, Cross-Language Information Retrieval

1. INTRODUCTION
Successful cross-language information retrieval requires, at a minimum, that the query (or document) in one language be translated correctly into the other language. This may be done using formal bilingual dictionaries or bilingual lexicons created statistically from aligned parallel corpora. But sometimes these resources have limited coverage with respect to current events, especially when named entities such as new people or obscure places appear in news stories and their translations have yet to emerge within parallel corpora or enter into formal dictionaries. In addition, a plethora of name variants also confuses the issue of named entity recognition. Steinberger and Pouliquen (2007) discuss these issues in detail when dealing with multilingual news summarization. For non-Latin scripts, this becomes particularly problematic because the user of western-scripted languages (such as in the USA, England, and most of Europe) cannot guess phonetically what the name might be in his/her native language, even if the word or phrase was borrowed from English in the first place. In many cases, borrowed words enter the language as a phonetic rendering, or transliteration, of the original language word – for example, the Japanese word コンピュータ (computer). Knight and Graehl (1997) jump-started transliteration research, particularly for Japanese-English, by developing a finite state machine for phonetic recognition between the two languages. The phonetic transliteration of the above Japanese is 'konpyuutaa'.

There is, however, an alternative to phonetic transliteration, and that is Romanization, or rule-based rendering of a foreign script into the Latin alphabet. Romanization has been around for a long time. For Japanese, the Hepburn Romanization system was first presented in 1887. The Hepburn Romanization for the Japanese 'computer' above is 'kompyuta'. The Hepburn system is widely enough known that a PERL module for Hepburn is available from the CPAN archive.

In addition to Hepburn, there has been a long practice by the USA Library of Congress of Romanizing foreign scripts when cataloging the titles of books written in foreign languages. Figure 1 presents a list of about 55 languages for which the Library of Congress has published Romanization tables. Note that the major Indian subcontinent languages of Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu and Urdu are included. For example, the Cyrillic Клинтон or the Greek Κλίντον can easily be Romanized to Klinton. For Russian and Greek, the transformation is usually reversible. For the major Indian language, Hindi, it is easily possible to find the translation for Clinton, but for the south Indian language of Tamil, translations are less easily found. Yet Tamil is a rather regular phonetic language and foreign names are often transliterated when news stories are written in Tamil (although one reviewer has remarked that Tamil has phonetic ambiguities not found in other Indian languages). Figure 2 is a translated news story in Tamil, where the main names (Presidents Clinton and Yeltsin) are Romanized.
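Rule-based Romanization of this kind amounts to a character-level mapping table. A minimal sketch, assuming a hand-built (and deliberately incomplete) Cyrillic-to-Latin table in the spirit of the Library of Congress tables:

```python
# Illustrative only: a tiny Cyrillic-to-Latin mapping, not a full LC table.
CYRILLIC_TO_LATIN = {
    "К": "K", "к": "k", "л": "l", "и": "i", "н": "n", "т": "t", "о": "o",
    "Н": "N", "е": "e", "с": "s", "р": "r", "а": "a", "в": "v",
}

def romanize(word, table=CYRILLIC_TO_LATIN):
    """Replace each character by its Latin mapping, leaving unknown characters as-is."""
    return "".join(table.get(ch, ch) for ch in word)

print(romanize("Клинтон"))   # -> Klinton
```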


Figure 1: Library of Congress Romanization Language List

2. TRANSLITERATION/ROMANIZATION
In the sweep of methods for recognition of out-of-vocabulary terms between languages and for automatic phonetic recognition of borrowed terms, Romanization has become a much-neglected stepchild. However, phonetic transliteration (and back-transliteration from the target language to the source language) requires large training sets for machine learning to take place. For less-commonly taught languages, such as, for example, some Indian subcontinent languages, such training sets may not be available. Romanization, on the other hand, requires only that rules for alphabet mapping already be in place, developed by experts in both target and source languages. However, once the target-language word has been rendered into its Latin-alphabet equivalent, we still have the problem of matching it to its translation in the source language. So we ask: Is there a place for Romanization in CLIR? And how can it be exploited? The key is the examination of approximate string matching methods to find the correspondences between words of the target and source languages.

3. APPROXIMATE STRING MATCHING
Once one has Romanized a section of non-English text containing OOV terms, the task remains to find their English word equivalents. The natural way to do this is using approximate string matching techniques. The most well-known technique is edit distance, the number of insertions, deletions and interchanges necessary to transform one string into its matching string. For example, the edit distance between computer and kompyuta (コンピュータ) is 5. Easier to comprehend is the case between English and German, where the edit distance between fish (E) and fisch (DE) is 1. However, the edit distance between fish (E) and frisch (DE) is 2, whereas the edit distance between the correct translations fresh (E) and frisch (DE) is also 2. Thus Martin Braschler of the University of Zurich has remarked, "Edit distance is a terrible cross-lingual matching method." Approximate string matching has a lengthy history both for fast file search techniques and for finding matches of minor word translation variants across languages. Q-grams, as proposed by Ukkonen (1992), count the number of substrings of size 'q' in common between the strings being matched. A variant of q-grams is targeted s-grams, where q is of size 2 and skips are allowed to omit letters from the match. Pirkola and others (2003) used this technique for cross-language search between Finnish, Swedish and German. Using s-gram skips solves the fish - fisch differential above. An alternative approach, which has been around for some time, is the Phonix method of Gadd (1988), which applies a series of transformations to letters (for example, c → k in many cases, e.g. Clinton → Klinton) and shrinks out the vowels (Clinton → Klntn). If we apply this transformation to the English and Japanese words above, we have computer → kmptr and kompyuta → kmpt. The original version of Phonix only kept the leading four resulting characters, which would result in an exact match. Zobel and Dart (1995) did an extensive examination of approximate matching methods for digital libraries, and their second paper (1996) proposed an improved Phonix method they titled Phonix-plus, which did not truncate to 4 characters but instead rewarded matches at the beginning. They combined this with edit distance for the Zobel-Dart matching algorithm.
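To make the measures above concrete, here is a minimal Python sketch (not from the paper) of edit distance and a simplified q-gram similarity. Note that Ukkonen's original q-gram measure counts gram occurrences; the set-based Dice score used here is only an approximation of it.

```python
# Minimal sketch of the matching measures discussed above (illustrative only).

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def qgrams(s: str, q: int = 2) -> set:
    """Set of contiguous substrings of length q, padded at the word ends."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over q-gram sets (1.0 = identical gram profiles)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

if __name__ == "__main__":
    for pair in [("fish", "fisch"), ("fish", "frisch"), ("fresh", "frisch")]:
        print(pair, "edit =", edit_distance(*pair),
              "q-gram sim = %.2f" % qgram_similarity(*pair))
```

Running the sketch reproduces the English-German edit distances cited above (1, 2 and 2), which is exactly the ambiguity that motivates skip-gram and phonetic refinements.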


4. SUMMARY AND POSITION
The current fashion for utilizing statistical machine learning as the solution to all problems in machine translation has led to the neglect of rule-based methods which, this paper argues, are both well-developed and could complement statistical approaches. Romanization would work especially well for non-Latin scripted languages for which training corpora are limited. The approach has two steps: 1) Romanization of the script using well-documented methods, followed by 2) approximate string matching between Romanized words in the target language and possible translation candidates in the source language.
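As a rough illustration of this two-step position (this is not the author's implementation, and the Cyrillic mapping below is a tiny hand-picked excerpt, not a full Library of Congress table), the following sketch Romanizes an out-of-vocabulary Cyrillic name by rule and then ranks English candidates with a simple bigram similarity:

```python
# Sketch of the two-step approach argued for above (illustrative only):
#  1) rule-based Romanization via an alphabet-mapping table,
#  2) approximate string matching against source-language candidates.

CYRILLIC_TO_LATIN = {
    "к": "k", "л": "l", "и": "i", "н": "n", "т": "t", "о": "o",
    "е": "e", "с": "s", "ц": "ts", "й": "i", "ь": "",
}

def romanize(word: str, table=CYRILLIC_TO_LATIN) -> str:
    """Step 1: map each character through the Romanization table."""
    return "".join(table.get(ch, ch) for ch in word.lower())

def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    """Step 2: simple q-gram (q=2) similarity for ranking candidates."""
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga and gb else 0.0

if __name__ == "__main__":
    oov = "Клинтон"                      # OOV name in a Cyrillic news story
    candidates = ["clinton", "yeltsin", "client", "carlton"]
    romanized = romanize(oov)            # -> "klinton"
    ranked = sorted(candidates, key=lambda c: dice(romanized, c), reverse=True)
    print(romanized, ranked)
```

In this toy run the Romanized form "klinton" matches "clinton" most strongly, which is the behavior the two-step position relies on.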

5. ACKNOWLEDGMENTS
Much of this work was originally done while the author was a visiting researcher at the National Institute of Informatics (NII) in Tokyo in the summer of 2007, supported by a grant from NII.

6. REFERENCES
[1] Knight, K. and Graehl, J. (1997). Machine Transliteration. Association for Computational Linguistics, 1997.
[2] T. Gadd (1988). Fisching fore Werds: Phonetic Retrieval of Written Text in Information Systems. Program, 22(3):222-237, 1988.
[3] R. Steinberger and B. Pouliquen (2007). Cross-lingual named entity recognition. Special issue of Lingvisticae Investigationes, 30:135-162, 2007.
[4] J. Zobel and P. Dart (1995). Finding approximate matches in large lexicons. Softw. Pract. Exper., 25(3):331-345, 1995.
[5] J. Zobel and P. Dart (1996). Phonetic string matching: lessons from information retrieval. In SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 166-172, New York, NY, USA, 1996. ACM Press.
[6] A. Pirkola, J. Toivonen, H. Keskustalo, K. Visala, and K. Jarvelin (2003). Fuzzy translation of cross-lingual spelling variants. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 345-352, New York, NY, USA, 2003. ACM Press.
[7] E. Ukkonen (1992). Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92:191-210, 1992.

Figure 2: News story in the Tamil language of Clinton-Yeltsin Meeting, showing name Romanization (phonetic transliteration according to software from the University of Cologne, Germany)



Urdu is not Hindi for Information Access

Kashif Riaz
University of Minnesota
4-192 EE/CS Building, 200 Union Street SE
Minneapolis, MN 55455
[email protected]

ABSTRACT
Urdu and Hindi have a complex relationship. Urdu is written in the Arabic script, and Hindi is written in the Devanagri script. The relationship between these two languages is deep-rooted, based on geo-political and historical events. Urdu and Hindi are considered very similar languages, and sometimes linguists refer to them as Hindi-Urdu. Urdu is considered a "scarce resource" language, but Hindi has a vibrant toolset for doing research in Information Retrieval. In this position paper I contend that language resources and enabling technologies for Information Access cannot be used interchangeably between these two languages. More specifically, Hindi cannot be used as a bridge language to do research in Information Retrieval in Urdu. I argue this assertion using deep analysis of the languages through socio-linguistics and quantitative analysis using Zipf's Law. The contrast and comparison are done using script and vocabulary differences between the two languages.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic Processing, Dictionaries, Indexing methods.

General Terms
Algorithms, Standardization, Languages

Keywords
Urdu, Hindi, Script, Orthography, Language Resources

1. INTRODUCTION
Urdu and Hindi share a complex relationship. Together they boast one of the largest populations of speakers, who call one or the other of them their national language. They are the languages of South Asia: Urdu is the national language of Pakistan, and Hindi is the official language of India. India does not have a national language because of the number of regional, language-sensitive provinces. Urdu is one of the official languages of India. Urdu is written in the Arabic script, and Hindi is written in the Devanagri script.

While doing research on Urdu named entity recognition, I wanted to use some Hindi language resources like gazetteers and online dictionaries, since they are not available for Urdu. I realized in the early research stages that Hindi and Urdu were behaving as separate languages. Moreover, I needed to learn the Devanagri script to proceed further. In this position paper I argue that Hindi and Urdu, although grammatically very similar, cannot be treated as the same language for doing research in computational linguistics and information retrieval, at least with the current set of tools available for both languages. Hindi has quite a vibrant set of enabling technologies for Information Access, whereas research in Urdu is still in its infancy. Some examples of these enabling technologies which are available for Hindi are online dictionaries, Wordnet, stemmers, stop word lists [14], gazetteers for named entity recognition systems [14], part of speech taggers [14], and a baseline for evaluation like FIRE [15], to name a few. Some of the corresponding tools for Urdu are available through CRULP [4] and [1, 3, 9, 11, 12] in varying stages of development. Section 2 gives a brief overview of Urdu. Section 3 analyzes the differences between Urdu and Hindi and motivates why a position needs to be taken for doing Information Retrieval. Section 4 gives three examples of quantitative analysis between Hindi and Urdu to reinforce the position. Section 5 concludes the position paper with remarks.

2. Urdu
This section briefly introduces some right-to-left languages and a few characteristics of Urdu. Urdu is the national language of Pakistan and one of the major languages of India. It is estimated that there are about 300 million speakers of Urdu. Most Urdu speakers live in Pakistan, India, the UAE, the UK, and the USA. Urdu is considered the lingua franca of business in Pakistan and amongst the South Asian community in the UK [3]. Urdu has a property of accepting lexical features and vocabulary from other languages, most notably English.

Recently there has been quite a bit of interest in right-to-left language processing in the IR community, specifically in the intelligence community. Most of the interest has been focused toward Arabic, Urdu, Persian (Farsi), and Dari. Arabic is a Semitic language, and the other languages belong to the Proto-Indo-Iranian branch of languages. Arabic and these other languages only share script and some vocabulary. Therefore, the language-specific work done for Arabic is not applicable to these languages; for example, stemming algorithms generated for Arabic will not work for a language like Urdu.

Urdu, among all the languages mentioned above, is a unique case in that it shares its grammar with Hindi. The differences are in some vocabulary and in writing style. Hindi is written in the Devanagri script. Because of the grammatical similarity, Hindi and Urdu are considered one language with two different scripts by many linguists. I will argue in later sections that there is a growing number of dissenters among South Asian language researchers. Urdu is quite a complex language because Urdu's grammar and morphology are a combination of many languages: Sanskrit, Arabic, Farsi, English, and Turkish, to name a few. This aspect of Urdu becomes quite challenging while building Information Access systems [9]. Because of its rich morphology and word-borrowing characteristics, Urdu is widely considered the language of the poets.


From the Mughal courts of the 1700s to the freedom writings of the 1900s, it was a prerequisite to learn Urdu in order to be considered a reputable poet or intellectual.

3. Analysis of Hindi and Urdu

3.1 Background
While doing research on Urdu Information Access, a claim can be made that Urdu and Hindi are the same language in two different scripts. According to this theory, any computational model or algorithm that works for Hindi should also work for Urdu. Some examples are baselines for evaluation methodologies that are designed for Hindi [15], gazetteers for named entity recognition, and Wordnet for Information Access. The following sections describe in detail why the one-language-two-scripts theory for Urdu and Hindi is invalid in all circumstances, and specifically for computational processing.

Although a lot of research has been done about the origins of Urdu and Hindi, no research study exists that compares and contrasts Urdu and Hindi in a scholarly fashion [13]. For quite some time, Urdu and Hindi were treated as the same language and, indeed, they are very similar. For geo-political reasons, they can be classified as two languages. The geo-political reasons are of no concern to this study, but they play an important role in why these two languages are currently diverging. Only linguistic and pragmatic reasons should be considered while studying the nature of Hindi and Urdu and their impact on computational processing. I will also exclude the socio-linguistic aspects in terms of religion (i.e., Hindi for Hindus and Urdu for Muslims in India) because this leads the discussion elsewhere.

With my linguistic training, I used to think that Urdu and Hindi were the same language and differed in vocabulary only. Hindi vocabulary emerges from Sanskrit, and Urdu vocabulary borrows from many other languages, but the majority of the borrowing comes from Persian and Arabic. Some rudimentary experiments in computationally recognizing names show that Hindi and Urdu behave as two different languages. For example, the lexical cues for recognizing locations are different: Dar-al-Khilafah (Urdu) and Rajdhani (Hindi) are both used for the capital of a city or a country. Therefore, more research is warranted to understand the relationship between these two languages and whether computational models based on one language can be used in some capacity for the other. For this study, my reasoning is based on the analysis of some of the revered scholars of the Urdu and Hindi languages, like Ralph Russell, David Matthews, Robert King, and Intizar Hussain.

3.2 Hindustani is not the Predecessor of Hindi and Urdu
Some researchers claim that Urdu and Hindi emerged as the offspring of an earlier language called Hindustani. Although there is some reference to a language called Hindustani in the later 1800s and early 20th century, a deeper analysis shows that no such language existed, and if it did, it was Urdu. Urdu was chosen as the official language of India by the British; that changed only after the Mutiny of 1857 [6]. The use of the word "Hindustani" emerged from the leaders of the Congress Party, primarily Gandhi, and then Nehru, who wanted a common language for the united India. From the early 1900s, Gandhi relentlessly pursued the theory that the people of northern India spoke neither Persian-based Urdu nor Sanskrit-based Hindi; instead they spoke Hindustani. The reason for this argument by the party leaders was that there needed to be one language for the united India. Gandhi, and later Nehru, never talked about which script would be chosen because of the intense emotions attached to the issue [6]. This political desire did not culminate as planned because most of the followers of the Congress Party wanted the Devanagri script to be the official script for Hindustani, an idea which was rejected by the mostly Urdu-speaking populace in northern India and Hyderabad. After the partition of India and Pakistan, the term Hindustani was used to describe the political tension around language choice during the British Raj.

3.3 Divergent Trend
Hindi and Urdu have two very clear differences: script and vocabulary. I will discuss the vocabulary differences in this section. Recent research in Urdu and Hindi studies consistently argues that Hindi and Urdu are two different languages and that they continue to diverge as time goes on. The most notable example of this continual divergence is the ultra-sanskritizing of Hindi (Sanskritization is defined by anthropologists as the spread of Vedantic and Brahmanical culture), so much so that an Urdu speaker does not understand a Hindi news broadcast. This does not mean that Urdu and Hindi speakers don't understand each other; they do at an everyday level. Also, both Hindi and Urdu speakers who live together in Uttar Pradesh, Andhra Pradesh, and other large cities understand both languages effortlessly. But a person from an outside region who travels to these areas and knows only one language does not understand the other language [6][8][13]. Matthews [8] and Martynyuk [7] make a similar comparison with the Russian, Ukrainian, and Serbo-Croatian languages in terms of their similarities in common vocabulary while still being different languages. King [6] claims that the relationship between Urdu and Hindi is much more complex than that between Cyrillic orthography-based Serbian and romanized Croatian, collectively known as Serbo-Croatian. Given the discussion above, the common-vocabulary claim cannot be used to argue that two languages are the same.

The relationship between Hindi and Urdu is very complex. When analyzing the differences at a high level, they can be treated as the same language, and they play a pivotal role in establishing a link between South Asian communities around the world. A glowing example of this phenomenon is the Indian cinema, where the line between Hindi and Urdu gets diluted. Although Hindi movies are popular in Urdu-speaking Pakistan and Pakistani TV shows are popular in India, there is a steady and noticeable shift in Indian movies towards the sanskritizing of Indian cinema. Remarkably, the Indian movies produced from the 1950s to the 1980s are undoubtedly Urdu (e.g. Pakeezah, made in 1972, versus the sanskritized Swades, made in 2004). At a detailed level, Urdu and Hindi are separate languages and deserve to be studied and treated as separate languages. This is most apparent in the official documents produced by the Indian government in Hindi and in news broadcasts [8]. Noticing the growing trend of the usage of Sanskrit words in Hindi, researchers of both Urdu and Hindi have started to describe them as separate languages. The commonality of the two languages is described by Matthews [8] as an unfortunate oversimplification of two vibrant languages. Russell [13], in his critique of Christopher King's book "One Language, Two Scripts", cites a number of examples where he shows that Hindi and Urdu are similar but different languages, and that sometimes the vocabulary, usage, and pronunciation can make a huge impact on the understanding of the language.


Russell compared language teaching books of Hindi and noted that a number of the words can easily be treated as Urdu, like akela (alone) and akelapan (loneliness), but soon the differences start to appear (e.g., adhik (lots) in contrast to zyada (lots) in Urdu, and akash (sky) in contrast to asman (sky) in Urdu).

A few examples like these should not be used to make a statement about two different languages. The translation of "How far is your house from here?" will be understood both by Hindi and Urdu speakers, but the divergence trend between these two languages continues with time (e.g., bicycle in Hindi is referred to as do chakr ghamia, which was heard by the author in a Hindi news telecast, whereas in Urdu it continues to be called cycle). The official usage is mostly present in the speech transcriptions of the Indian Parliament and in official government documents. The following example is borrowed from Russell to explain the growing divergence. Consider the sentence in English "The eighteenth century was the period of the social, economic and political decline". The Urdu translation of the sentence is "Atharvin sadi samaji, iqtisadi aur siyasi zaval ka daur tha", while the Hindi equivalent is "Atharvin sadi samajik, arthik aur rajnitik girav ki sadi thi". Moreover, in Hindi "sadi" could be replaced by "satabdi" and "aur" with "tatha". Russell points out that this example alone shows that Urdu speakers cannot understand the meaning of the Hindi equivalent and vice versa. Therefore, these two languages should not be treated as the same language in all circumstances.

3.3.1 Highbrow, Middlebrow, and Lowbrow
Besides script, the most notable differences between Hindi and Urdu are found in the formalized vocabulary, grammar, and writing style. King [8], quoting Ashok Kelkar, a proponent of Hindi, describes those differences in detail as an excellent example of a socio-linguistic situation. Hindi and Urdu have a full range of styles. He categorized those styles as stated below:

• Formalized highbrow is used in academia, religious sermons, official texts, and poetry. Most language engineering resources and enabling technologies for system development are based on this style. Highbrow Hindi draws its base from Sanskrit, and highbrow Urdu, throughout time, has been based on Persian and Arabic words.
• Formalized middlebrow is used in songs, movies, pamphlets, popular printed literature, and mass propaganda.
• Casual middlebrow is most widely used for daily conversations among the educated upper and middle class who are regionally based, as in northern India and Hyderabad. It is used for private communication like phone conversations and letter writing. It is used by newspapers so they can be read by a large audience. This style is most receptive to borrowed words, most of them from English.
• Casual lowbrow is associated with what Kelkar calls the "lower class" and uneducated people. He calls it bazaar Hindustani. This is definitely a substandard form of style. This style is found in the slums of urban centers of large Indian cities.

The polarization of Urdu and Hindi reaches its maximum at formalized highbrow. Hindi draws from Sanskrit for vocabulary and promotes Vedantic and Brahmanical culture. Urdu draws from Turkish and classical Persian literature and uses Islamic events as references.

King [6] mentions that standard Hindi (highbrow) and standard Urdu (highbrow) have diverged more since the partition of India and Pakistan in 1947. A careful analysis of King's theory shows that it is certainly true that standard Hindi is getting more and more sanskritized, but new Urdu literature is leaning towards formalized middlebrow. Sanskritized Hindi is increasingly used by the elite in India. This movement of sanskritizing Hindi in India is illustrated by King [6] while quoting Das Gupta and Gumperz. The illustration is done by analyzing signboards: label a is the official text of the signboard, label b is the English translation, and label c is the casual middlebrow in Hindi. I have added label d as the highbrow in Urdu and label e as the middlebrow in Urdu.

Signboard 1
  a. dhumrpan varjint hai
  b. smoking is prohibited
  c. cigarette pina mana hai
  d. tambakoo noshi mana hai
  e. cigarette pina mana hai

Signboard 2
  a. bina agya pravesh nishedh
  b. entrance prohibited without permission
  c. bina agya andar jana mana hai
  d. baghair poochey andar aana mana hai
  e. baghair poochey andar aana mana hai

Note that for signboard 2 the middlebrow and highbrow forms are the same for Urdu, as suggested earlier.

3.3.2 Cultural differences
Although languages don't belong to a religious group, it is an undeniable fact that Urdu is the first language of the large Muslim population in India and is known to most Muslim Pakistanis as their national language. The same is true for the Hindu majority in India, where most Hindus in North India prefer Hindi. There are, of course, Hindu scholars of Urdu; Prem Chand and Gopi Chand Narang are notable examples.

The cultural preferences of speakers translate into their respective languages. The date and year reference for Muslims for major events in South Asia is the Hijri calendar (a reference to Prophet Muhammad's migration from Makah to Medina). The year 2000 C.E. is 1421 Hijri. This is evident in how different people reference the completion of the Taj Mahal, for example. A Muslim cleric in Aligarh will refer to its completion in 1076 Hijri, but a secular Hindu will say the date is 1666 A.D. The epitaph inside the Taj Mahal refers to the Hijri date, not the Gregorian calendar. Besides dates, the architectural terminology is quite different. A secular Hindu's description of the Jama Masjid (Central Mosque) in Delhi will be quite different from a local Muslim's. For example, a secular Hindu might not be able to explain the alignment of the mimbar (a platform for the prayer leader) in the mosque or know what it symbolizes. Religious symbols can also be confusing, e.g. a Muslim in Pakistan will not know the importance of the river Ganga for religious Hindus. Mimbar and Ganga are used considerably, with their implied meanings, in Urdu and Hindi literature.


3.4 Script Differences
In this section I will explain a few of the script differences between Hindi and Urdu in terms of phonemes, the spoken units of a language, and graphemes, the written units of a language. Hindi and Urdu have most of the phonological features of the languages of the subcontinent, like retroflexion and voiceless and voiced, aspirated and unaspirated stops. The majority of the differences between Urdu and Hindi regarding the script are based on the vocabulary. Besides supporting the features of the languages of the subcontinent, the Urdu script supports the phonemes of Persian and Arabic. For example, in contrast to Hindi, Urdu has an unaspirated uvular stop /q/, labial fricative /f/, voiceless retroflex /ṣ/, velar fricative /x/, voiced dental fricative /z/, palato-alveolar voiced fricative /ʒ/, and voiced velar fricative /ɣ/. These sounds are supported by the nastaliq and naksh styles of the Urdu script. Hindi has a system of making these sounds native by changing the articulation at different levels for each foreign sound. The Urdu script has distinct graphemic features for retroflexion and aspiration: Urdu uses diacritic marks for retroflexion, and aspiration in Urdu is shown by h, whereas the Devanagri script of Hindi does not treat retroflexion and aspiration as distinctive features. There are a number of other examples, but the few examples above illustrate the difficulty of using Hindi resources for Urdu computational processing. One of the easier tasks for language engineering is transliteration from one language to another by using a map from one symbol to another; yet the work of Jawaid and Ahmed [5] shows that there are many open issues when transliterating Hindi to Urdu or vice versa. The above differences also show that Hindi stemmers cannot be used for Urdu [9]. The word fiqh (jurisprudence) in Urdu has a labial fricative, a uvular stop, and aspiration; a Hindi morphological analyzer is most probably going to fail while parsing the token fiqh.

4. Quantitative Analysis of Hindi and Urdu
In this section a few quantitative examples are shown where Hindi resources cannot be used for Urdu Information Access.

4.1 Named Entity Recognition
Named entity recognition (NER) is one of the important tasks in the field of Information Access. It consists of automatically recognizing proper nouns like people names, location names, and organization names in unstructured text. There has been significant work done in English and European languages, but this task is not well studied for the languages of South Asia.

By far the most comprehensive attempt made to study NER for South Asian and Southeast Asian languages was the NER workshop at the International Joint Conference on Natural Language Processing in 2008 [14]. The workshop attempted to do named entity recognition in Hindi, Bengali, Telugu, Oriya, and Urdu. Among all these languages, Urdu is the only one that uses the Arabic script. There are fifteen papers in the final proceedings of the NER workshop at IJCNLP 2008. A number of those papers tried to address all South Asian languages in general, but resorted to Hindi, where the most resources were available. There was not a single paper that focused on Urdu named entity recognition. In the papers that tried to address all languages, the computational models showed the lowest performance on Urdu. None of the researchers were able to use for Urdu the online dictionaries and gazetteers that were available for Hindi.

4.2 Zipf's Law on Hindi and Urdu
One of the most interesting, and maybe the only, works available that compares Urdu and Hindi quantitatively is by Martynyuk [7]. In this study Martynyuk establishes his argument based on a comparative analysis of the most frequent words used in various European languages. He then extends his work to compare Hindi and Urdu based on Zipf's Law. Zipf's law is one of the most fundamental laws used by researchers when analyzing text in an automated way. At a high level, Zipf's law states that, given a corpus of words, the frequency of a word is inversely proportional to its rank. Although not stated clearly, the hypothesis of the experiment is that if the most frequent words show a similar rank order in corpora of different languages like Hindi and Urdu, then the theory of "same language, two scripts" has a good chance of being accepted in the larger academic community. Hindi and Urdu news text tokens were categorized using manual lemmatization. The experiment was conducted on approximately 440,000 Romanized words of Urdu and Hindi. After calculating the frequencies of the most frequent words, it was found that the top three words in Urdu and Hindi were the same words (stop words). The rank order between the words starts to diverge after rank three. The top 24 ranked words are the same between the two languages, but have a different rank order from each other. After rank 24, the words start to differ in rank order and in their alignment with the corresponding word from the other language. For example, the Hindi word for election, chunaav, is at rank 25; the Urdu equivalent for election, intikhabaat, is found at rank 45. It is important to note that the Hindi corpus did not contain the English word election, but in the Urdu corpus it was used almost as many times as intikhabaat (intikhabaat is used 672 times and election is used 642 times). The next example may drive home the point that the rank order of words with the same meaning keeps drifting apart: the word for terrorist in Hindi, atankvadee, is at rank 42, in contrast to the equivalent word in Urdu, dehshatgard, which is at rank 182. This experiment clearly shows that Information Access features like term frequency, inverse document frequency, and stop word analysis for Hindi and Urdu will show different results because of the difference in top-ranked words [12].
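To make the rank-order comparison of Section 4.2 concrete, the sketch below (not Martynyuk's code; the token lists are tiny invented stand-ins for the lemmatized, Romanized news corpora) ranks words by frequency in two corpora and reports the ranks of translation-equivalent pairs:

```python
# Illustrative sketch of the Zipf-style rank comparison described above.
# The "corpora" are tiny invented token lists; only the method is shown.
from collections import Counter

def rank_table(tokens):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {word: rank for rank, word in enumerate(ordered, start=1)}

# Hypothetical lemmatized, Romanized tokens.
hindi_tokens = "ka ki ne chunaav ka ne atankvadee chunaav ka".split()
urdu_tokens = "ka ki ne intikhabaat ka ne ki dehshatgard ka".split()

# Translation-equivalent pairs whose ranks we want to compare.
equivalents = [("chunaav", "intikhabaat"),      # election
               ("atankvadee", "dehshatgard")]   # terrorist

hindi_ranks, urdu_ranks = rank_table(hindi_tokens), rank_table(urdu_tokens)
for hi, ur in equivalents:
    print(f"{hi}: rank {hindi_ranks[hi]} (Hindi)  vs  "
          f"{ur}: rank {urdu_ranks[ur]} (Urdu)")
```

On real corpora, diverging ranks for equivalent pairs are exactly what signals that term-frequency-based features cannot be transferred between the two languages.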
4.3 Hindi Wordnet
This research is part of the larger research area of doing concept searching in Urdu [10]. Wordnet is an important enabling technology for concept understanding and word sense disambiguation tasks. Hindi Wordnet [2] is an excellent source for Hindi language processing but cannot be used for Urdu. Most of the analysis and categorization of words in Hindi Wordnet is done using highbrow Hindi. For example, the terminology used to describe parts of speech (POS) in Hindi Wordnet is completely foreign to Urdu speakers. The POS names are Sanskrit-based, whereas Urdu POS words are Persian-based and Arabic-based. In Hindi the word for noun is sangya, and in Urdu it is ism. The proper noun in Hindi is called vyakti vachak sangya. No Urdu speaker would know this unless they have studied Hindi grammar. In order to work through these differences, one has to be familiar with both languages at almost expert levels.


5. Conclusion and Remarks
In this position paper I contend that Hindi and Urdu are two different languages, at least for computational processing. The reasoning in this paper does not suggest that Urdu and Hindi are two different languages in all aspects, but rather asserts that the current tools available for the Hindi language cannot be used for Urdu language processing. In other words, in order to use Hindi resources to do Urdu computational processing, one has to know Hindi at a detailed linguistic level. The examples of quantitative analysis confirm the emerging research from Urdu and Hindi language researchers that there is a trend of divergence between the two languages. There is a great opportunity to use casual middlebrow to bridge these languages and develop tools for multilingual Information Access. In the meantime, new resources need to be created to do research in Urdu Information Access.

6. REFERENCES
[1] Becker, D. and Riaz, K., 2002. A Study in Urdu Corpus Construction. Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization at the 19th International Conference on Computational Linguistics.
[2] Jha, S., Narayan, D., Pande, P., and Bhattacharyya, P., 2001. A Wordnet for Hindi. International Workshop on Lexical Resources in Natural Language Processing, Hyderabad, India.
[3] Baker, P., Hardie, A., McEnery, T., and Jayaram, B.D., 2003. Corpus Data for South Asian Language Processing. Proceedings of the 10th Annual Workshop for South Asian Language Processing, EACL 2003.
[4] CRULP, 2009. http://crulp.org/software
[5] Jawaid, B. and Ahmed, T., 2009. Hindi and Urdu Conversion beyond Simple Transliteration. Conference of Language and Technology.
[6] King, R., 2001. The Poisonous Potency of Script: Hindi and Urdu. International Journal of the Sociology of Language.
[7] Martynyuk, S., 2003. Statistical Approach to the Debate on Hindi and Urdu. Annual of Urdu Studies, vol. 18.
[8] Matthews, D., 2002. Urdu in India. Annual of Urdu Studies, vol. 17.
[9] Riaz, K., 2007. Challenges in Urdu Stemming. Future Directions in Information Access.
[10] Riaz, K., 2008. Urdu Concept Searching. PIKM 08.
[11] Riaz, K. Baseline for Urdu IR Evaluation. 2nd ACM Workshop on Improving Non-English Web Searching (iNEWS08), CIKM 08.
[12] Riaz, K., 2007. Stop Word Identification in Urdu. Conference of Language and Technology.
[13] Russell, R., 1996. Some Notes on Hindi and Urdu. Annual of Urdu Studies, vol. 11.
[14] Workshop on NER for South Asian and South East Asian Languages at IJCNLP. http://ltrc.iiit.ac.in/ner-ssea-08/ (2009).
[15] FIRE: Forum for Information Retrieval Evaluation. http://www.isical.ac.in/~fire/talip.html (2009).



A Patient Support System based on Crosslingual IR and Semi-supervised Learning

Hideki Isozaki, Tsutomu Hirao, Katsuhito Sudoh, Jun Suzuki, Akinori Fujino, Hajime Tsukada, Masaaki Nagata
NTT Communication Science Laboratories, NTT Corporation
2-4 Seikacho, Sorakugun, Kyoto, Japan, 619-0237
{isozaki,hirao,sudoh,jun,a.fujino,tsukada,nagata}@cslab.kecl.ntt.co.jp

ABSTRACT
Even though patients are now using the Web to get useful information, the latest medical information is not available in most languages except English. Even if patients want to learn about current treatments, they do not want to read English documents filled with technical terms. To mitigate this situation, we are building a patient support system that combines crosslingual information retrieval, machine translation, and technical term extraction to provide up-to-date medical information.

Categories and Subject Descriptors
I.2.7 [ARTIFICIAL INTELLIGENCE]: Natural Language Processing - Machine Translation; J.3 [Computer Applications]: LIFE AND MEDICAL SCIENCES - Medical Information Systems

General Terms
CLIR

Keywords
medical information system, crosslingual information retrieval

1. INTRODUCTION
Patients are using the Web more and more to get useful information (www.nytimes.com/2008/09/30/health/30online.html). English-speaking patients can obtain useful medical documents from such different sources as:

• Patient community sites (e.g., PatientsLikeMe) and patient blogs.
• Reports from public medical organizations (e.g., WHO, NIH, NCI, and FDA).
• Academic documents (PubMed = abstracts of biomedical papers from the US National Library of Medicine).

However, the latest medical information is generally not available in languages other than English. To support patients who want to learn about up-to-date treatments but who do not want to read English documents filled with technical terms, we are building a patient support system.

Similar patient support services are already available on the Web. Nikkei BP's Cancer Navi (cancernavi.nikkeibp.co.jp/) provides cancer news in Japanese. Cancer Information Japan (cancerinfo.tri-kobe.org/) translates the National Cancer Institute (NCI)'s Physician Data Query (PDQ) into Japanese. This site provides two versions of a translation: one for experts and the other for patients. The Japan Association of Medical Translation for Cancer (www.cancerit.jp/xoops/) translates PubMed abstracts. However, these services are labor intensive and expensive. Our motivation is to automate them as much as possible.

Data mining and natural language processing (NLP) of biomedical literature are crucial research topics (compbio.uchsc.edu/BioNLP2009/index.shtml), and useful BioNLP tools are available from the National Centre for Text Mining in the UK (NaCTeM, www.nactem.ac.uk/), the Tsujii Laboratory of The University of Tokyo (www-tsujii.is.s.u-tokyo.ac.jp/), and so on.

Our project's goal is to analyze not only academic documents (PubMed abstracts and full papers), which are relatively uniform and written for experts, but also patient sites and reports from public/governmental organizations. Even though patient sites are not always completely reliable, they are much easier to understand than papers written by doctors and sometimes provide useful information.

In the future, our system will provide a stereoscopic view: the views of medical experts and of patients. The system will also integrate worldwide information (e.g., a certain medicine is available in the USA but not in Japan) and local information (e.g., hospital H's doctor D is good at a certain surgical operation).

Information credibility is essential in such a system. The above services explicitly disavow endorsement and liability, and we will follow a similar policy. However, information credibility on the web is a critical research issue (www.dl.kuis.kyoto-u.ac.jp/wicow2/), and we will introduce some credibility handling functions to our system.

2. CURRENT STATUS
The current system consists of the following modules.

• Dictionary-based crosslingual information retrieval module


When a patient enters a Japanese term, it is translated into corresponding English terms. We use the Life Science Dictionary (LSD) (lsd.pharm.kyoto-u.ac.jp/en/) for Japanese-to-English term translation and Indri (www.lemurproject.org/indri/) for document retrieval.

• Technical term recognizers
This module detects and classifies medical terms in PubMed entries. We use two technical term recognizers: a dictionary-based, left-to-right longest pattern matcher and a semi-supervised CRF tagger [5].

• Rhetorical parser
Some PubMed abstracts are tagged with OBJECTIVE, METHOD, RESULTS, and CONCLUSION. Patients are usually interested in OBJECTIVE and CONCLUSION, but not in the METHOD part that describes the details of experiments. These tags are useful in document retrieval and crosslingual multidocument summarization. For instance, we can skip the translation of the METHOD part for patients who aren't interested in the details of experiments but are eager to find effective treatments. To cover other abstracts without these rhetorical tags, we implemented a semi-supervised rhetorical parser.

• Hierarchical phrase-based statistical machine translator [6]
This module, which translates English documents into Japanese, is trained with a 20,000-sentence medical bilingual corpus from the Web and 2,150,000 dictionary entries. We are intensively working on improving this module and will report on it elsewhere.

2.1 Crosslingual IR
When a Japanese patient enters "Tamiflu" in Japanese, the dictionary gives the English expression "Tamiflu." However, "Tamiflu" is rarely used in PubMed because it is a product name; instead, most medical articles use its substance name, "oseltamivir phosphate" or "oseltamivir". (Readers can check this fact with NCBI's search engine (www.ncbi.nlm.nih.gov/). If "Tamiflu" [All Fields] NOT "oseltamivir" [All Fields] is entered, the query yields only 16 items. On the other hand, "oseltamivir" [All Fields] NOT "Tamiflu" [All Fields] yields 836 items. If the simple query "Tamiflu" is entered, NCBI's search engine expands it to the complex query "oseltamivir" [MeSH Terms] OR "oseltamivir" [All Fields] OR "tamiflu" [All Fields], yielding 963 items.)

We extracted correspondences between product and substance names from LSD, and the system adds formal substance names for retrieval using Indri's synonym operator. The system informs users about these terms; otherwise, they would be confused by such unknown terms. Thus, the Japanese query "Tamiflu" is expanded into the following Indri query:

#syn(tamiflu oseltamivir #1(oseltamivir phosphate))

Our system resembles OReFiL [7], even though OReFiL does not provide a crosslingual retrieval function yet.
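The expansion just described can be pictured with the following sketch (not the authors' implementation; the two small dictionaries are invented stand-ins for the LSD correspondences), which reproduces the #syn query above for the Tamiflu example:

```python
# Sketch of the dictionary-based query expansion described in Section 2.1
# (illustrative only; the tiny dictionaries stand in for LSD entries).

# Japanese term -> English translations (product names included).
JA_TO_EN = {"タミフル": ["Tamiflu"]}

# Product name -> formal substance names, as extracted from a drug dictionary.
PRODUCT_TO_SUBSTANCE = {"Tamiflu": ["oseltamivir", "oseltamivir phosphate"]}

def indri_term(term: str) -> str:
    """Multi-word terms become ordered-window (#1) expressions in Indri syntax."""
    words = term.lower().split()
    return words[0] if len(words) == 1 else "#1(" + " ".join(words) + ")"

def expand_query(japanese_term: str) -> str:
    """Translate, add substance-name synonyms, and wrap everything in #syn(...)."""
    variants = []
    for en in JA_TO_EN.get(japanese_term, []):
        variants.append(en)
        variants.extend(PRODUCT_TO_SUBSTANCE.get(en, []))
    return "#syn(" + " ".join(indri_term(v) for v in variants) + ")"

if __name__ == "__main__":
    print(expand_query("タミフル"))
    # -> #syn(tamiflu oseltamivir #1(oseltamivir phosphate))
```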
Patients don't want to read poorly translated technical papers. They only want conclusions: best treatments, best hospitals, best doctors, etc. From this point of view, translation has a limited role in this system.

Figure 1: Retrieval result page (italic: Japanese). The page lists the top retrieved PubMed titles with their Japanese translations, the number of top documents per year, and frequent technical terms in the titles (e.g., gemcitabine, 5-fluorouracil, cisplatin) for the query suigan (pancreatic cancer).

We also believe that patients want to know the trends of treatments. Therefore, the CLIR module analyzes the search engine's output. First, the top documents are counted for each year. The module also counts the technical terms in the titles of the retrieved documents and shows their frequencies and counts by year.

In Figure 1, the system found that gemcitabine appears 262 times in the titles of the top 300 documents when a patient entered "suigan" (pancreatic cancer) into the system. That is, almost all papers on pancreatic cancer have this medicine in their titles. When a patient enters "shinkei koushu" (glioma), the system shows that temozolomide appeared 92 times in the titles of the top 300 documents, 48 of them from 2006 or later. That is, the popularity of treatment with this medicine is increasing.

In this way, document retrieval by disease name and the extraction of medicine names from titles reveal the trends of treatments. Frequent technical terms are shown with Japanese translations given by LSD, and they are linked to "Goo," a Web search engine. Since the retrieved documents are written in Japanese, patients can easily learn what gemcitabine is. In the case of gemcitabine, Goo returns an entry of the Japanese Wikipedia as the best document.

Patients can obtain useful information without reading poorly translated articles. This method can also be regarded as a redundancy-based factoid question answering system [2, 4, 1] that answers the question, "which medicine is used for this disease?" Since we can prepare a list of disease names, we can analyze the trend of treatments beforehand for each disease.

By clicking on one of the retrieved titles, we can see the entire text of the abstract. Fig. 2 shows the structure of this page. The left part of the page shows the original English abstract separated by the rhetorical parser described below. The right part shows the same abstract in Japanese. Medical terms detected by the technical term recognizers in the left part are colored. When the cursor is moved over them, their descriptions from different information sources pop up.

We need the English abstract because the Japanese translation is not very reliable yet. Japanese is completely different from English, and even state-of-the-art machine translators cannot generate satisfactory translations.
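The year-by-year counting of documents and of technical terms in titles described earlier in this subsection can be sketched as follows (illustrative only; the document list and term list are invented stand-ins for the top PubMed hits and the LSD-derived technical terms):

```python
# Sketch of the trend analysis over retrieved titles (illustrative only).
from collections import Counter, defaultdict

TECH_TERMS = ["gemcitabine", "cisplatin", "temozolomide"]

# (publication year, title) pairs for the top retrieved documents.
top_docs = [
    (2008, "Phase 2 trial of gemcitabine plus oxaliplatin"),
    (2007, "Gemcitabine combined with gefitinib in pancreatic cancer"),
    (2006, "Cisplatin versus gemcitabine: a randomized study"),
]

docs_per_year = Counter(year for year, _ in top_docs)
term_counts = Counter()
term_by_year = defaultdict(Counter)
for year, title in top_docs:
    lowered = title.lower()
    for term in TECH_TERMS:
        if term in lowered:
            term_counts[term] += 1
            term_by_year[term][year] += 1

print("documents per year:", dict(docs_per_year))
for term, total in term_counts.most_common():
    print(term, total, dict(term_by_year[term]))
```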


Figure 2: Entire abstract page. The layout pairs the English title, journal and author information, and the OBJECTIVE, METHOD, RESULTS, and CONCLUSION sections on the left with the corresponding Japanese translations on the right.

Table 1: Accuracies of rhetorical classification (trained with 10,000 abstracts)

                 Hirohata et al.   Our system
                 supervised        supervised   semi-supervised
  per sentence   94.3%             93.9%        95.1%
  per abstract   62.9%             61.7%        67.7%

Our goal is not word-to-word translation but a comprehensible survey in Japanese. Our system will analyze English documents with information extraction/data mining technologies. Then, the system will generate a Japanese survey based on the analysis.

2.2 Semi-supervised learning
We applied semi-supervised learning to rhetorical parsing and technical term extraction.

2.2.1 Rhetorical parsing
Hirohata et al.'s rhetorical parsing [3] classifies each sentence in PubMed abstracts into four sections: OBJECTIVE, METHOD, RESULTS, and CONCLUSION. In addition, each section is further divided into two classes: the beginning of the section (e.g., B-METHOD) and the inside of the section (e.g., I-METHOD). As features of each sentence, they used its relative location in the abstract and word unigrams and word bigrams selected with χ2 criteria, etc. According to their experiments, Conditional Random Fields (CRFs) outperformed Support Vector Machines (SVMs).

Following their experimental setting, we used 10,000 abstracts of their 'pure' corpus data for training. Table 1 shows that our semi-supervised learning method outperformed Hirohata et al.'s supervised CRF method, even though our implementation of the supervised CRF system has not achieved their performance level yet. Here, we used 100,000 abstracts as unlabeled data for semi-supervised learning.

This rhetorical structure is now used for selecting fields in document retrieval. For instance, a complex query can be composed so that the OBJECTIVE field contains "pancreatic cancer", the RESULTS field contains "statistically significant", and so on. We are also planning to use such classification results for multidocument summarization.

2.2.2 Medical Term Recognition
Medical term recognition can be regarded as an extension of named entity recognition, which is popular in the Natural Language Processing community. Conventional named entity recognizers also use supervised learning approaches such as CRFs and SVMs. In such approaches, we do not have to write down complicated rules to detect named entities, but we have to prepare a large amount of training data. Since their performance is almost saturated, we employed a semi-supervised approach [5] that is robust against new terms.

We trained a semi-supervised CRF using the Penn BioIE Corpus (CYP450, bioie.ldc.upenn.edu/publications/latest_release/). We have just started a preliminary experiment with this corpus. When we used 73,108 sentences for training and 8,137 sentences for testing, supervised CRFs gave F-measure = 0.897, which the semi-supervised CRFs slightly improved to 0.905 when we used the above 100,000 PubMed abstracts as unlabeled data. We do not know of any experimental reports on this dataset from other researchers.

Since CYP450 covers only articles on cytochrome P450 enzymes, this tagger is not general enough. Therefore, we also built a simple left-to-right longest match tagger based on LSD.
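A dictionary-based, left-to-right longest-match tagger of the kind mentioned above can be sketched as follows (illustrative only; the mini-dictionary and its labels are invented stand-ins for LSD entries):

```python
# Sketch of a left-to-right longest-match medical term tagger (illustrative only).

MEDICAL_TERMS = {
    ("pancreatic", "cancer"): "DISEASE",
    ("gemcitabine",): "DRUG",
    ("oseltamivir", "phosphate"): "DRUG",
}
MAX_LEN = max(len(k) for k in MEDICAL_TERMS)

def tag(tokens):
    """Scan left to right, always preferring the longest dictionary match."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MEDICAL_TERMS:
                spans.append((i, i + n, MEDICAL_TERMS[key]))
                i += n
                break
        else:
            i += 1          # no dictionary entry starts here; move on
    return spans

if __name__ == "__main__":
    sentence = "Gemcitabine improved survival in pancreatic cancer patients".split()
    print(tag(sentence))    # -> [(0, 1, 'DRUG'), (4, 6, 'DISEASE')]
```

Such a tagger needs no training data, which is why it complements the semi-supervised CRF when the labeled corpus is narrow.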
3. CONCLUSION AND FUTURE WORK
We are building a CLIR system for patients who want to learn about up-to-date treatments but who do not want to read English documents filled with technical terms. We showed that a simple combination of CLIR and medical term recognizers reveals the trends of treatments. The system currently covers only PubMed entries, which are uniform, reliable, and relatively easy to analyze. Future work includes the analysis of patient and public organization sites. The introduction of patient sites will require judgments of credibility.

4. REFERENCES
[1] E. Brill, J. Lin, M. Banko, S. Dumais, and A. Ng. Data-intensive question answering. In TREC, 2001.
[2] C. L. A. Clarke, G. V. Cormack, and T. R. Lynam. Exploiting redundancy in question answering. In SIGIR, pages 358-365, 2001.
[3] K. Hirohata, N. Okazaki, S. Ananiadou, and M. Ishizuka. Identifying sections in scientific abstracts using conditional random fields. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), pages 381-388, 2008.
[4] C. C. Kwok, O. Etzioni, and D. S. Weld. Scaling question answering to the web. In WWW, pages 150-161, 2001.
[5] J. Suzuki and H. Isozaki. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 665-673, 2008.
[6] T. Watanabe, H. Tsukada, and H. Isozaki. Left-to-right target generation for hierarchical phrase-based translation. In Proceedings of COLING-ACL 2006, pages 777-784, 2006.
[7] Y. Yamamoto and T. Takagi. OReFiL: an online resource finder for life sciences. BMC Bioinformatics, 8:287, 2007.
