LSDS-IR'09: 7th Workshop on Large-Scale Distributed Systems for Information Retrieval

Claudio Lucchese, Gleb Skobeltsyn, Wai Gen Yee (Eds.)

LSDS-IR’09
7th Workshop on Large-Scale Distributed Systems for Information Retrieval

Workshop co-located with ACM SIGIR 2009
Boston, MA, USA, July 23, 2009

Proceedings

Copyright © 2009 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners.

Editors’ addresses:

Claudio Lucchese
Istituto di Scienza e Tecnologie dell’Informazione
Consiglio Nazionale delle Ricerche
Area della Ricerca di Pisa
Via G. Moruzzi, 1
56124 Pisa (PI)
Italy
[email protected]

Gleb Skobeltsyn
Google Inc.
Brandschenkestrasse 110
8002 Zurich
Switzerland
[email protected]

Wai Gen Yee
Department of Computer Science
Illinois Institute of Technology
10 W. 31st Street
Chicago, IL 60616
USA
[email protected]

Preface

The Web is continuously growing. Currently, it contains more than 20 billion pages (some sources suggest more than 100 billion), compared with fewer than 1 billion in 1998. Traditionally, Web-scale search engines employ large and highly replicated systems, operating on computer clusters in one or a few data centers. Coping with the increasing number of user requests and indexable pages requires adding more resources. However, data centers cannot grow indefinitely. Scalability problems in Information Retrieval (IR) have to be addressed in the near future, and new distributed applications are likely to drive the way in which people use the Web. Distributed IR is the point at which these two directions converge.

The 7th Large-Scale Distributed Systems Workshop (LSDS-IR’09), co-located with the 2009 ACM SIGIR Conference, provides a space for researchers to discuss these problems and to define new research directions in the area.
It brings together researchers from the domains of IR and Databases, working on distributed and peer-to-peer (P2P) information systems, to foster closer collaboration that could have a large impact on the future of distributed and P2P IR.

The LSDS-IR’09 Workshop continues the efforts from previous workshops held in conjunction with leading conferences:

• CIKM’08: Workshop on Large-Scale Distributed Systems for Information Retrieval - LSDS-IR’08
• SIGIR’07: Workshop on Large-Scale Distributed Systems for Information Retrieval - LSDS-IR’07
• CIKM’06: Workshop on Information Retrieval in Peer-to-Peer Networks - P2PIR’06
• CIKM’05: Workshop on Information Retrieval in Peer-to-Peer Networks - P2PIR’05
• SIGIR’05: Workshop on Heterogeneous and Distributed Information Retrieval - HDIR’05
• SIGIR’04: Workshop on Information Retrieval in Peer-to-Peer Networks - P2PIR’04

This year’s program features two keynotes, six full papers, and three short papers covering a wide range of topics related to Large Scale Distributed Systems. The first keynote speaker, Leonidas Kontothanassis (Google Inc.), discusses access patterns and trends in video information systems pertaining to YouTube. The second keynote speaker, Dennis Fetterly (Microsoft Research), talks about experiences with using the DryadLINQ system, a programming environment for writing high-performance parallel applications on PC clusters, for Information Retrieval experiments.

As in past years, the workshop features several papers on P2P-IR. Bockting et al. present an approach for collection selection based on indexing of popular term combinations. Esuli suggests using permutation prefixes for similarity search. Ke et al. study the clustering paradox for decentralized search. Finally, Dazzi et al. discuss the clustering of users in a P2P network for sharing browsing interests.

Another cluster of papers deals with the efficiency of large-scale infrastructures for information retrieval.
Nguyen proposes a new static index pruning approach. Capannini et al. propose a sorting algorithm that uses CUDA. McCreadie et al. discuss how suitable the MapReduce paradigm is for efficient indexing. Lin addresses the load-balancing issue with MapReduce. Finally, on the search quality side, Yee et al. study the benefits of using user comments for improving search in social Web sites.

We thank the authors for their submissions and the program committee for their hard work.

July 2009
Claudio Lucchese, Gleb Skobeltsyn, Wai Gen Yee

Workshop Chairs

Claudio Lucchese, National Research Council - ISTI, Italy
Gleb Skobeltsyn, EPFL & Google, Switzerland
Wai Gen Yee, Illinois Institute of Technology, Chicago, USA

Steering Committee

Flavio Junqueira, Yahoo! Research Barcelona, Spain
Fabrizio Silvestri, ISTI-CNR, Italy
Ivana Podnar Žarko, University of Zagreb, Croatia

Program Committee

Karl Aberer, EPFL, Switzerland
Ricardo Baeza-Yates, Yahoo! Research Barcelona, Spain
Gregory Buehrer, Microsoft Live Labs, USA
Roi Blanco, University of A Coruna, Spain
Fabrizio Falchi, ISTI-CNR, Italy
Ophir Frieder, Illinois Institute of Technology, Chicago, USA
Flavio Junqueira, Yahoo! Research Barcelona, Spain
Claudio Lucchese, ISTI-CNR, Italy
Sebastian Michel, EPFL, Switzerland
Wolfgang Nejdl, University of Hannover, Germany
Kjetil Norvag, Norwegian University of Science and Technology, Norway
Salvatore Orlando, University of Venice, Italy
Josiane Xavier Parreira, Max-Planck-Institut Informatik, Germany
Raffaele Perego, ISTI-CNR, Italy
Diego Puppin, Google, USA
Martin Rajman, EPFL, Switzerland
Fabrizio Silvestri, ISTI-CNR, Italy
Gleb Skobeltsyn, EPFL & Google, Switzerland
Torsten Suel, Polytechnic University, USA
Christos Tryfonopoulos, Max-Planck-Institut Informatik, Germany
Wai Gen Yee, Illinois Institute of Technology, USA
Ivana Podnar Žarko, University of Zagreb, Croatia
Pavel Zezula, Masaryk University of Brno, Czech Republic

Contents

Keynotes:

The YouTube Video Delivery System
Leonidas Kontothanassis (Google Inc., Boston, USA)

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
Dennis Fetterly (Microsoft Research, Silicon Valley, USA)

Regular Papers:

Collection Selection with Highly Discriminative Keys
Sander Bockting (Avanade), Djoerd Hiemstra (University of Twente)

PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search
Andrea Esuli (ISTI-CNR, Italy)

Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach
Linh Thai Nguyen (Illinois Institute of Technology, USA)

Sorting using BItonic netwoRk wIth CUDA
Gabriele Capannini, Fabrizio Silvestri, Ranieri Baraglia, Franco Maria Nardini (ISTI-CNR, Italy)

Comparing Distributed Indexing: To MapReduce or Not?
Richard McCreadie, Craig Macdonald, Iadh Ounis (University of Glasgow)

Strong Ties vs. Weak Ties: Studying the Clustering Paradox for Decentralized Search
Weimao Ke, Javed Mostafa (University of North Carolina)

Short Papers:

The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce
Jimmy Lin (University of Maryland)

Are Web User Comments Useful for Search?
Wai Gen Yee, Andrew Yates, Shizhu Liu, Ophir Frieder (Illinois Institute of Technology, USA)

Peer-to-Peer clustering of Web-browsing users
Patrizio Dazzi, Matteo Mordacchini, Raffaele Perego (ISTI-CNR, Italy), Pascal Felber, Lorenzo Leonini (University of Neuchatel), Le Bao Anh, Martin Rajman (Ecole Polytechnique Federale de Lausanne, Switzerland), Etienne Riviere (NTNU, Norway)

The YouTube Video Delivery System

Leonidas Kontothanassis
Google Inc.

This talk will cover the YouTube Video Delivery System. It will discuss access patterns and trends for both video uploads and downloads. It will describe the storage and delivery mechanisms for popular and unpopular content and the impact YouTube has on the network storage infrastructure for Google. We will also discuss the networking impact for ISPs around the world.

Leonidas Kontothanassis joined Google in 2006, immediately started working on networking and video delivery issues, and has been doing so ever since. He currently acts as the manager of the teams working in these areas. Previously he worked in such areas as computer architecture, parallel programming, and content delivery with multiple companies in the Kendall/MIT area, including DEC/HP/Intel Labs and Akamai. He received a PhD in computer architecture in 1996 and has served as committee member or organizer for academic conferences and research funding organizations like NSF.
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Dennis Fetterly
Microsoft Research

The goal of DryadLINQ is to make distributed computing on large compute clusters simple. DryadLINQ combines two important pieces of technology: the Dryad distributed execution engine and the .NET Language INtegrated Query (LINQ). Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio. DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.
