Pretrained Transformers for Text Ranking: BERT and Beyond


Jimmy Lin,¹ Rodrigo Nogueira,¹ and Andrew Yates²,³

¹ David R. Cheriton School of Computer Science, University of Waterloo
² University of Amsterdam
³ Max Planck Institute for Informatics

Version 0.99 — August 20, 2021

Abstract

The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query for a particular task. Although the most common formulation of text ranking is search, instances of the task can also be found in many text processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. For text ranking, transformer-based models produce high quality results across many domains, tasks, and settings.

This survey provides a synthesis of existing work as a single point of entry for practitioners who wish to deploy transformers for text ranking and researchers who wish to pursue work in this area. We cover a wide range of techniques, grouped into two categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. Examples in the first category include approaches based on relevance classification, evidence aggregation from multiple segments of text, and document and query expansion. The second category involves using transformers to learn dense representations of texts, where ranking is formulated as comparisons between query and document representations that take advantage of nearest neighbor search.
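The second category above can be made concrete with a minimal sketch. The `embed` function below is a toy hashed bag-of-words stand-in for a learned transformer encoder (the function names and hashing scheme are our own illustrative assumptions, not an API from any system in this survey); ranking is then brute-force nearest neighbor search over vector similarities:

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy stand-in for a learned encoder: hash each token into a
    fixed-size unit vector. A real dense retriever would produce this
    vector with a pretrained transformer bi-encoder instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def rank(query, corpus):
    """Dense retrieval as nearest neighbor search: order every document
    by the similarity of its vector to the query vector."""
    q = embed(query)
    scores = [(sum(a * b for a, b in zip(q, embed(doc))), doc)
              for doc in corpus]
    return [doc for _, doc in sorted(scores, reverse=True)]

corpus = [
    "neural networks for text ranking",
    "cooking pasta at home",
    "transformers and self-supervised pretraining",
]
print(rank("ranking text with neural networks", corpus))
```

In practice, `embed` is replaced by a trained bi-encoder and the exhaustive scoring loop by an approximate nearest neighbor index; the formulation, however, is the same.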
At a high level, there are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Much effort has been devoted to developing ranking models that address the mismatch between document lengths and the length limitations of existing transformers. The computational costs of inference with transformers have led to alternatives and variants that aim for different tradeoffs, both within multi-stage architectures as well as with dense learned representations.

Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.

Contents

1 Introduction
  1.1 Text Ranking Problems
  1.2 A Brief History
    1.2.1 The Beginnings of Text Ranking
    1.2.2 The Challenges of Exact Match
    1.2.3 The Rise of Learning to Rank
    1.2.4 The Advent of Deep Learning
    1.2.5 The Arrival of BERT
  1.3 Roadmap, Assumptions, and Omissions
2 Setting the Stage
  2.1 Texts
  2.2 Information Needs
  2.3 Relevance
  2.4 Relevance Judgments
  2.5 Ranking Metrics
  2.6 Community Evaluations and Reusable Test Collections
  2.7 Descriptions of Common Test Collections
  2.8 Keyword Search
  2.9 Notes on Parlance
3 Multi-Stage Architectures for Reranking
  3.1 A High-Level Overview of BERT
  3.2 Simple Relevance Classification: monoBERT
    3.2.1 Basic Design of monoBERT
    3.2.2 Exploring monoBERT
    3.2.3 Investigating How BERT Works
    3.2.4 Nuances of Training BERT
  3.3 From Passage to Document Ranking
    3.3.1 Document Ranking with Sentences: Birch
    3.3.2 Passage Score Aggregation: BERT–MaxP and Variants
    3.3.3 Leveraging Contextual Embeddings: CEDR
    3.3.4 Passage Representation Aggregation: PARADE
    3.3.5 Alternatives for Tackling Long Texts
  3.4 From Single-Stage to Multi-Stage Rerankers
    3.4.1 Reranking Pairs of Texts
    3.4.2 Reranking Lists of Texts
    3.4.3 Efficient Multi-Stage Rerankers: Cascade Transformers
  3.5 Beyond BERT
    3.5.1 Knowledge Distillation
    3.5.2 Ranking with Transformers: TK, TKL, CK
    3.5.3 Ranking with Sequence-to-Sequence Models: monoT5
    3.5.4 Ranking with Sequence-to-Sequence Models: Query Likelihood
  3.6 Concluding Thoughts
4 Refining Query and Document Representations
  4.1 Query and Document Expansion: General Remarks
  4.2 Pseudo-Relevance Feedback with Contextualized Embeddings: CEQE
  4.3 Document Expansion via Query Prediction: doc2query
  4.4 Term Reweighting as Regression: DeepCT
  4.5 Term Reweighting with Weak Supervision: HDCT
  4.6 Combining Term Expansion with Term Weighting: DeepImpact
  4.7 Expansion of Query and Document Representations
  4.8 Concluding Thoughts
5 Learned Dense Representations for Ranking
  5.1 Task Formulation
  5.2 Nearest Neighbor Search
  5.3 Pre-BERT Text Representations for Ranking
  5.4 Simple Transformer Bi-encoders for Ranking
    5.4.1 Basic Bi-encoder Design: Sentence-BERT
    5.4.2 Bi-encoders for Dense Retrieval: DPR and ANCE
    5.4.3 Bi-encoders for Dense Retrieval: Additional Variations
  5.5 Enhanced Transformer Bi-encoders for Ranking
    5.5.1 Multiple Text Representations: Poly-encoders and ME-BERT
    5.5.2 Per-Token Representations and Late Interactions: ColBERT
  5.6 Knowledge Distillation for Transformer Bi-encoders
  5.7 Concluding Thoughts
6 Future Directions and Conclusions
  6.1 Notable Content Omissions
  6.2 Open Research Questions
  6.3 Final Thoughts
Acknowledgements
Version History
References

1 Introduction

The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query for a particular task. The most common formulation of text ranking is search, where the search engine (also called the retrieval system) produces a ranked list of texts (web pages, scientific papers, news articles, tweets, etc.) ordered by estimated relevance with respect to the user’s query. In this context, relevant texts are those that are “about” the topic of the user’s request and address the user’s information need. Information retrieval (IR) researchers call this the ad hoc retrieval problem.1

With keyword search, also called keyword querying (for example, on the web), the user typically types a few query terms into a search box (for example, in a browser) and gets back results containing representations of the ranked texts. These results are called ranked lists, hit lists, hits, “ten blue links”,2 or search engine results pages (SERPs). The representations of the ranked texts typically comprise the title, associated metadata, “snippets” extracted from the texts themselves (for example, an extractive keyword-in-context summary where the user’s query terms are highlighted), as well as links to the original sources. While there are plenty of examples of text ranking problems (see Section 1.1), this particular scenario is ubiquitous and undoubtedly familiar to all readers.

This survey provides an overview of text ranking with a family of neural network models known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) [Devlin et al., 2019], an invention of Google, is the best-known example.
These models have been responsible for a paradigm shift in the fields of natural language processing (NLP) and information retrieval (IR), and more broadly, human language technologies (HLT), a catch-all term that includes technologies to process, analyze, and otherwise manipulate (human) language data. There are few endeavors involving the automatic processing of natural language that remain untouched by BERT.3 In the context of text ranking, BERT provides results that are undoubtedly superior in quality to what came before. This is a robust and widely replicated empirical result, across many text ranking tasks, domains, and problem formulations. A casual skim through paper titles in recent proceedings from NLP and IR conferences will leave the reader without a doubt as to the extent of the “BERT craze” and how much it has come to dominate the current research landscape.

However, the impact of BERT, and more generally, transformers, has not been limited to academic research. In October 2019, a Google blog post4 confirmed that the company had improved search “by applying BERT models to both ranking and featured snippets”. Ranking refers to “ten blue links” and corresponds to most users’ understanding of web search; “featured snippets” represent examples of question answering5 (see additional discussion in Section 1.1). Not to be outdone, in November 2019, a Microsoft blog post6 reported that “starting from April of this year, we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year”. As a specific instance of transformer architectures, BERT has no doubt improved how users find relevant information.

Beyond search, other transformer models have left their mark as well. For example, transformers dominate approaches to machine translation, which is the automatic translation of natural language text7 from one human language to another, for example, from English to French.
1 There are many footnotes in this survey. Since nobody reads footnotes, we wanted to take one opportunity to inform the reader here that we’ve hidden lots of interesting details in the footnotes. But this message is likely to be ignored anyway.
2 Here’s the first interesting tidbit: The phrase “ten blue links” is sometimes used to refer to web search and has a fascinating history.