Workshop Proceedings Here
Total Page:16
File Type:pdf, Size:1020Kb
1st International Workshop on Entity-Oriented Search (EOS) Balog, K., de Vries, A.P., Serdyukov, P., Wen, J-R. Proceedings of the ACM SIGIR Workshop “Entity-Oriented Search (EOS)” held in conjunction with the 34th Annual International ACM SIGIR Conference 28 July 2011, Beijing, China ISBN: 978-94-6186-000-2 TU Delft, The Netherlands contact: [email protected], [email protected] workshop url: http://research.microsoft.com/sigir-entitysearch/ Preface Both commercial systems and the research community are displaying an increased interest in returning “objects,” “entities,” or their properties in response to a users query. While major search engines are capable of recognizing specific types of objects (e.g., locations, events, celebrities), true entity search still has a long way to go. Entity retrieval is challenging as “objects,” unlike documents, are not directly represented and need to be identified and recognized in the mixed space of structured and unstructured Web data. While standard document retrieval methods applied to textual representations of entities do seem to provide reasonable performance, a big open question remains how much influence the entity type should have on the ranking algorithms developed. This ACM SIGIR Workshop on Entity-Oriented Search (EOS) seeks to uncover the next research frontiers in entity-oriented search. It aims to provide a forum to discuss entity- oriented search, without restriction to any particular data collection, entity type, or user task. The program committee accepted 11 papers (6 selected for full and 5 selected for short oral presentation) on a wide range of topics, such as learning-to-rank based entity finding and summarization, entity relations mining, spatio-temporal entity search, entity-search in structured web data, and evaluation of entity-oriented search. In addition, the EOS program includes two invited talks from leading search engineers: Paul Ogilvie (LinkedIn) and Andrey Plakhov (Yandex). The workshop also features a panel discussion. Finally, we would like to thank the ACM and SIGIR for hosting the workshop; Microsoft Research Asia for the computing resources to host the workshop web site; all the authors who submitted papers for the hard work that went into their submissions; the members of our program committee for the thorough reviews in such a short period of time; Paul Ogilvie and Andrey Plakhov for agreeing on giving an invited talk; Yandex for sponsoring the best presentation award. Krisztian Balog, Arjen P. de Vries, Pavel Serdyukov, and Ji-Rong Wen Program Committee Co-Chairs i Table of Contents EOS Organization........................................................................iv Session 1: High Performance Clustering for Web Person Name Disambiguation Using Topic • Capturing...............................................................................1 Zhengzhong Liu, Qin Lu, and Jian Xu (The Hong Kong Polytechnic University) Extracting Dish Names from Chinese Blog Reviews Using Suffix Arrays and a Multi- • Modal CRF Model ......................................................................7 Richard Tzong-Han Tsai (Yuan Ze University, Taiwan) LADS: Rapid Development of a Learning-To-Rank Based Related Entity Finding Sy- • stem using Open Advancement ......................................................14 Bo Lin, Kevin Dela Rosa, Rushin Shah, and Nitin Agarwal (Carnegie Mellon University) Finding Support Documents with a Logistic Regression Approach . 20 • Qi Li and Daqing He (University of Pittsburgh) The Sindice-2011 Dataset for Entity-Oriented Search in the Web of Data . 26 • St«ephane Campinas (National University of Ireland) Diego Ceccarelli (University of Pisa) Thomas E. Perry (National University of Ireland) Renaud Delbru (National University of Ireland) Krisztian Balog (Norwegian University of Science and Technology) Giovanni Tummarello (National University of Ireland) Session 2: Cross-Domain Bootstrapping for Named Entity Recognition ........................33 • Ang Sun and Ralph Grishman (New York University) Semi-supervised Statistical Inference for Business Entities Extraction and Business • Relations Discovery ..................................................................41 Raymond Y.K. Lau and Wenping Zhang (City University of Hong Kong) Unsupervised Related Entity Finding ................................................47 • Olga Vechtomova (University of Waterloo) iii Session 3: Learning to Rank Homepages For Researcher-Name Queries.......................53 • Sujatha Das, Prasenjit Mitra, and C. Lee Giles (The Pennsylvania State University) An Evaluation Framework for Aggregated Temporal Information Extraction. .59 • Enrique Amigo« (UNED University) Javier Artiles (City University of New York) Heng Hi (City University of New York) Qi Li (City University of New York) Entity Search Evaluation over Structured Web Data .................................65 • Roi Blanco (Yahoo! Research) Harry Halpin (University of Edinburgh) Daniel M. Herzig (Karlsruhe Institute of Technology) Peter Mika (Yahoo! Research) Jeffrey Pound (University of Waterloo) Henry S. Thompson (University of Edinburgh) Thanh Tran Duc (Karlsruhe Institute of Technology) Author Index .............................................................................72 iv EOS Organization Program Co-Chair: Krisztian Balog, (NTNU, Norway) Arjen P. de Vries, (CWI/TU Delft, The Netherlands) Pavel Serdyukov, (Yandex, Russia) Ji-Rong Wen, (Microsoft Research Asia, China) Program Committee: Ioannis (Yanni) Antonellis, (Nextag, USA) Wojciech M. Barczynski, (SAP Research, Germany) Indrajit Bhattacharya, (Indian Institute of Science, Bangalore, India) Roi Blanco, (Yahoo! Research Barcelona, Spain) Paul Buitelaar, (DERI - National University of Ireland, Galway, Ireland) Wray Buntine, (NICTA Canberra, Australia) Jamie Callan, (Carnegie Mellon University, USA) Nick Craswell, (Microsoft, USA) Gianluca Demartini, (L3S, Germany) Lise Getoor, (University Maryland, USA) Harry Halpin, (University of Edinburgh, UK) Michiel Hildebrand, (VU University Amsterdam, NL) Prateek Jain, (Wright State University, USA) Arnd Christian K¨onig, (Microsoft, USA) Nick Koudas, (University of Toronto, Canada) David Lewis, (Independent, USA) Jisheng Liang, (Microsoft Bing, USA) Peter Mika, (Yahoo! Research Barcelona, Spain) Marie Francine Moens, (Katholieke Universiteit Leuven, Belgium) Iadh Ounis, (Univsersity of Glasgow, UK) Ralf Schenkel, (Max-Planck Insitute for Computer Science, Germany) Shuming Shi, (Microsoft Research Asia, China) Ian Soboro↵, (NIST, USA) Fabian Suchanek, (INRIA, France) Duc Thanh Tran, (Karlsruhe Institute of Technology, Germany) Wouter Weerkamp, (University of Amsterdam, NL) Jianhan Zhu, (Open University, UK) v High Performance Clustering for Web Person Name Disambiguation Using Topic Capturing Zhengzhong Liu Qin Lu Jian Xu Department of Computing The Hong Kong Polytechnic University Hung Hom, Kowloon, Hong Kong +852-2766-7247 [email protected] [email protected] [email protected] rd ABSTRACT 43 President of U.S. and (2)George W.H. Bush st Searching for named entities is a common task on the web. (The 41 President of U.S.)that are contained with Among different named entities, person names are among the different ranks. However, the mentions of George Bush most frequently searched terms. However, many people can share (Professor of Hebrew University, U.S) would be the same name and the current search engines are not designed to hard to find as they are being returned as very low ranked files identify a specific entity, or a namesake. One possible solution is that would normally be missed. to identify a namesake through clustering webpages for different Identifying a specific entity using the current search engine is time namesakes. In this paper, we propose a clustering algorithm which consuming because the scale of the returned documents must be makes use of topic related information to improve clustering. The large enough to ensure coverage of the different namesakes system is trained on the WePS2 dataset and tested on both the especially to contain all the documents with namesakes of much WePS1 and WePS2 dataset. Experimental results show that our less popular persons. system outperforms all the other systems in the WePS workshops using B-Cubed and Purity based measures. And the performance The task of identifying different namesakes through web search is is also consistent and robust on different datasets. Most important called web persons disambiguation. Recent works on web persons of all, the algorithm is very efficient for web persons disambiguation have been reported in the WePS workshops [1,2]. disambiguation because it only needs to use data already indexed The reported systems used different clustering techniques to in search engines. In other words, only local data is used as organize the searched results according to either closed feature data to avoid query induced web search. information or open information to disambiguate different namesakes. Some systems have achieved competitive Categories and Subject Descriptors performance by incorporating large amount of open information H.3.3 [Information Storage and Retrieval]: Information Search [7] or by conducting expensive training process [14]. In this and Retrieval – Clustering; I.2.7 [Artificial Intelligence]: Natural paper, we present a novel algorithm based on Hierarchical Language Processing - Text analysis. Agglomerative Clustering(HAC) [16] which makes use of topic information using a so called hit list to make