<<

Neural models for information retrieval : towards asymmetry sensitive approaches based on attention models Thiziri Belkacem

To cite this version:

Thiziri Belkacem. Neural models for information retrieval : towards asymmetry sensitive approaches based on attention models. Information Retrieval [cs.IR]. Université Paul Sabatier - Toulouse III, 2019. English. NNT : 2019TOU30167. tel-02499432

HAL Id: tel-02499432 https://tel.archives-ouvertes.fr/tel-02499432 Submitted on 5 Mar 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

THÈSE

En vue de l'obtention du DOCTORAT DE L'UNIVERSITÉ DE TOULOUSE

Délivré par : l'Université Toulouse 3 Paul Sabatier (UT3 Paul Sabatier)

Présentée et soutenue le 28/11/2019 par : Thiziri BELKACEM

Neural Models for Information Retrieval: Towards Asymmetry Sensitive Approaches Based on Attention Models

JURY
Dr. HDR. Anne-Laure Ligozat, Maîtresse de conférences HDR, Université Paris-Saclay : Rapporteure
Pr. Eric Gaussier, Professeur d'Université, Grenoble Alps : Rapporteur
Pr. Gaël Dias, Professeur d'Université de Caen Normandie : Examinateur
Pr. Mohand Boughanem, Professeur d'Université Paul Sabatier : Directeur de Thèse
Dr. Taoufiq Dkaki, Professeur Associé d'Université Paul Sabatier : co-Directeur de Thèse
Dr. Jose G. Moreno, Professeur Associé d'Université Paul Sabatier : Encadrant

École doctorale et spécialité : MITT : Domaine STIC : Réseaux, Télécoms, Systèmes et Architecture
Unité de Recherche : Institut de Recherche en Informatique de Toulouse (UMR 5505)
Directeur(s) de Thèse : Pr. Mohand Boughanem, Dr. Taoufiq Dkaki
Rapporteurs : Pr. Eric Gaussier et Dr. HDR. Anne-Laure Ligozat

ACKNOWLEDGEMENT

Firstly, I would like to express my sincere gratitude to my thesis supervisors Pr. Mohand Boughanem, Dr. Jose G. Moreno, and Dr. Taoufiq Dkaki, for the continuous support they have given me since my Master's degree as well as during the last three years preparing this thesis. I would like to thank each one of them for their continued assistance with my Ph.D. study and related research, for their patience, for all the advice and motivation, and for the immense knowledge that they shared with me. Their guidance helped me throughout the research and the writing of this thesis. I could not have imagined having better advisers and mentors for my Ph.D. study.

Besides my advisers, I would like to thank the rest of my thesis committee: I thank the referees Dr. HDR. Anne-Laure Ligozat and Pr. Eric Gaussier for agreeing to review our research work, and the examiner Pr. Gaël Dias for agreeing to evaluate our work. I thank them all for their insightful comments and encouragement, but also for the hard questions which incited me to widen my research from various perspectives.

My sincere thanks also go to Dr. Gilles Hubert, the IRIS team leader, and to all the other permanent members: Pr. Lynda Tamine-Lechani, Dr. HDR. Karen Pinel-Sauvagnat, Dr. HDR. Guillaume Cabanac, and Dr. Yoan Pitarch, who welcomed me as an intern during the preparation of my Master's project, who gave me access to the laboratory and research facilities, and who continued to support me during the preparation of this thesis. I thank them all for the good relationships, team spirit and warm exchanges that we have had during all these four years of work within the same team.

I thank my fellow lab-mates for the stimulating discussions, for the restless periods we were working together before deadlines, and for all the fun we have had while working together over the last four years.

Last but not least, I would like to thank my family: my dear parents, without whom I would never have arrived here, those who gave me life but who also made sure to make it beautiful and impressive; my father, my source of life knowledge and experience, of positive energy and insight; and my mother, my source of tenderness, patience, smiles and joy. I also thank all my brothers and sisters, who are my dear and precious friends, for supporting me spiritually throughout the adventure that preparing this thesis has been, and throughout my life in general. And I will not fail to thank someone who encouraged me a lot and supported me too: thank you very much Smail for all the good moments shared together despite my busy schedule and my thesis stress.

KB: Ulama berriket tzitwit, tamment-is zidet.
FR: "Même si l'abeille est brune (noire), son miel est bon (sucré)"
EN: "Even if the bee is brown (black), its honey is good (sweet)"

KB: Ulac win izegren asif ur yellixs.
FR: "Personne ne peut traverser une rivière sans se mouiller"
EN: "No one can cross a river without getting wet"

(Kabyle sayings)

CONTENTS

I Preface 9

II Background 16

1 Basic Concepts in Information Retrieval 17
1.1 Introduction 17
1.2 Definitions 17
1.2.1 Sequence 17
1.2.2 Document 19
1.2.3 Query 19
1.2.4 Relevance 19
1.3 Text Representation 19
1.3.1 Bag-of-Words (BoW) Representations 20
1.3.2 Semantic-based Representations 20
1.4 Text Matching Process 21
1.5 Evaluation in IR 21
1.5.1 Evaluation Measures 22
1.5.2 Benchmarks and Campaigns 24
1.6 Text Matching Issues 24
1.7 Conclusion 25

2 Basic Concepts in Neural Networks and Deep Learning 26
2.1 Introduction 26
2.2 Main Concepts and Definitions 27
2.2.1 Notations 27
2.2.2 Artificial Neurons 27
2.2.3 The Activation Function 29
2.2.4 Artificial Neural Networks 30
2.3 Some NN Architectures 31
2.3.1 Convolution Neural Networks (CNN) 31
2.3.2 Recurrent Neural Networks (RNN) 32
2.3.3 Transformers 34
2.4 Neural Models Training 34
2.4.1 Supervised Training 35
2.4.2 Weakly-Supervised and Unsupervised Training 35
2.4.3 Unsupervised Training 36
2.5 Training Algorithms 36
2.5.1 Backpropagation 36
2.5.2 Gradient Descent 36
2.6 Over-fitting and Regularization 37
2.7 Conclusion 37

III State of The Art Overview 38

3 Text representation models 39
3.1 Introduction 39
3.2 Distributed representations of words 40
3.2.1 Matrix Factorization Methods 40
3.2.2 Local Context Window Methods 41
3.3 Distributed Representations of Sentences 43
3.3.1 Aggregated Representations 43
3.3.2 Non-Aggregated Representations 45
3.4 Text Matching Using Distributed Representations 46
3.4.1 Direct Matching 46
3.4.2 Query Expansion 49
3.5 Issues Related to Distributed Representations 50
3.6 Discussion 51
3.7 Conclusion 52

4 Deep Learning in Text Matching Applications 53
4.1 Introduction 53
4.2 Machine Learning for Information Retrieval 54
4.2.1 LTR Algorithms 55
4.2.2 Related Issues 55
4.3 Deep Learning for Text Matching 57
4.3.1 Unified Model Formulation 57
4.3.2 Representation-focused vs Interaction-focused 58
4.3.3 Attention-based vs Position-based 63
4.4 Discussion 66
4.5 Conclusion 66

IV Contributions 67

5 Experimental setup 68
5.1 Introduction 68
5.2 Datasets 68
5.2.1 WikiQA 68
5.2.2 QuoraQP 69
5.2.3 Ad-hoc Document Ranking Datasets 70
5.3 Evaluation metrics 71
5.4 Baseline models 71
5.4.1 Classical models 71
5.4.2 Classical models with word embeddings 72
5.4.3 Neural models 72
5.5 Tools and frameworks 72
5.6 Conclusion 73

6 Query words impact in document ranking using word embeddings 74
6.1 Introduction 74
6.2 Motivation 75
6.3 Classical Query-Document Matching 76
6.4 Matching Strategies Using Semantic Word Similarities 76
6.4.1 Presence/Absence Split 76
6.4.2 Exact/Semantic Matching Split 77
6.4.3 Relations Between the Different Matching Strategies 77
6.5 Experiments 78
6.5.1 Evaluation methodology 78
6.5.2 Parameter Setting and Impact Analysis 78
6.5.3 Results and discussion 79
6.6 Conclusion 85

7 Neural models for short text matching using attention-based models 86
7.1 Introduction 86
7.2 An asymmetry sensitive approach for neural texts matching 87
7.2.1 The Asymmetry Aspect 87
7.2.2 The Asymmetry Sensitive Matching Architecture 88
7.3 Models training 91
7.3.1 Rank Hinge Loss 91
7.3.2 Categorical Cross Entropy 91
7.4 Experiments 92
7.5 Results and Analysis 93
7.5.1 The Asymmetry Sensitive Approach Analysis 93
7.5.2 Position VS Attention 103
7.6 Conclusion 104

8 Attention-based Multi-level Relevance Assessment 106
8.1 Introduction 106
8.2 A Multi-level Attention-based Architecture 108
8.2.1 Passage-based Document Representation 108
8.2.2 The Multi-level Attention 108
8.3 AM3: An Attention-based Multi-level Matching Model 109
8.3.1 Word Level 110
8.3.2 Passage level 110
8.3.3 Document level 110
8.4 Experiments 111
8.4.1 Evaluation Setup 111
8.4.2 Models Configuration 112
8.4.3 Results and Discussion 113
8.5 Conclusion 116

V Overall Conclusion 117

LIST OF FIGURES

1.1 Description of the basic process of information retrieval systems. 18

2.1 Functionality correspondence between a biological neuron and an artificial neuron. 28
2.2 A closer look at the operating process of an artificial neuron. 28
2.3 The curves describing the different behaviours corresponding to three different activation functions used in neural networks. 29
2.4 Architecture of a multi-layer perceptron MLP with one hidden layer. 30
2.5 Example of a bidirectional recurrent neural network (bi-RNN) [1]. 32
2.6 Closer view of LSTM cells used in a bidirectional RNN network. 33

3.1 A general architecture of the Word2Vec model [2] to learn word embeddings. 42

4.1 General framework describing a unified neural model for text matching. 58
4.2 A general architecture showing the main differences between the interaction-focused models and the representation-focused models. 59

6.1 Highlighting query words occurrences in a relevant document D1+ and an irrelevant document D2-. 75
6.2 Analysis of the equation 6.3 sensitivity to parameters α and λ in the AP 88-89 collection, in case of using BM25 as Mtfidf. 79
6.3 Comparison of the performances, in terms of MAP and P@5, of the different matching strategies, w.r.t. the different values of the parameter α in the collection AP 88-89. 80
6.4 Evolution of the performances of the different matching equations with respect to the value of ε, in the Robust04 collection. 81
6.5 Contributions of different query terms, according to their presence/absence in relevant documents, in the relevance score computation for relevant documents in the AP 88-89 dataset. 83
6.6 Contribution of query terms according to their presence/absence in the relevant documents of the GOV2 and Robust04 datasets, to the total relevance score values. 84

7.1 Solution 2 fills completely the missing part described in Q. 87
7.2 A generalized neural matching framework for extending state-of-the-art models with an attention gating layer. 89
7.3 State-of-the-art models extension using the proposed attention-based framework. 90
7.4 Distribution of the different question types in the test dataset of WikiQA. 96
7.5 Ranking of the different candidate answers corresponding to the question example, by some of the evaluated models, in their different versions. 98
7.6 Comparison of importance weights computed by the attention layer of the asymmetric architectures of three different models. 99
7.7 Performance, in terms of accuracy in the QuoraQP dataset, of different extended neural models using different architectures. 100
7.8 Proportion (%) of the importance weights of the keywords of three different questions, from the QuoraQP dataset, computed by the attention layer of the extended models. 102
7.9 Comparison of the valid and test accuracy values during training on the QuoraQP dataset, of the MV-LSTM [3] model with/without the attention layers. 105

8.1 Example of a matching signals distribution over a relevant long document w.r.t. a query of three words. 107
8.2 Extension of the unified neural matching model of figure 4.1, using attention layers applied at several levels, in order to focus on the most important matching signals at each level. 109
8.3 Architecture of our AM3 model. The document has three passages, and all the passages and the query have three words. A bi-LSTM layer is used at the document-level. 109

LIST OF TABLES

1.1 Set of notations. 18

2.1 Set of notations. 27
2.2 List of widely used loss functions per model objective. 35

3.1 Set of notations. 40

4.1 Set of notations. 54

5.1 Statistics of the original WikiQA dataset of Yang et al. [4], compared to the one processed in MatchZoo [5]. 69
5.2 Sample from the WikiQA dataset of questions and their corresponding answers and labels. 69
5.3 Description of the experimental QuoraQP dataset. 69
5.4 Sample of some question pairs from the QuoraQP dataset. 70
5.5 Statistics of the TREC datasets used for ad-hoc document ranking. 70

6.1 Experimental results using the BM25 in Mtfidf(., .) of the different matching strategies. 81
6.2 Experimental results using the LM in Mtfidf(., .) of the different matching strategies. 82
6.3 Comparison of the different matching strategies with some state-of-the-art models in two different datasets. 85

7.1 Descriptions and settings of the hyper-parameters of the MatchZoo [5] tool, in both WikiQA and QuoraQP datasets. 92
7.2 Comparison of the performance of the different evaluated models in the WikiQA dataset, using the different architectures. 94
7.3 Comparison of the performances, in terms of MAP, of the Symmetric and Asymmetric matching of the different neural models compared to classical models, in the WikiQA collection. 95
7.4 Comparison of W/L/T variations of the performance, in terms of MAP, of the extended models compared to their original counterparts, w.r.t. every question type in the WikiQA dataset. 97
7.5 Results of the oracle version of every evaluated model, compared to the corresponding original ones. 98
7.6 The similarity scores assigned to a pair of similar questions (q1, q2) and a pair of non-similar questions (q1, q3) by the different neural models within the symmetric and asymmetric architectures. 101
7.7 Comparison of the results of our approach, using Internal models at MatchZoo, against External models. 103
7.8 Ranking of the 3 first answers retrieved by the MV-LSTM model, among 21 possible answers corresponding to the question "Where do crocodiles live?" in the WikiQA dataset, compared to the extended versions using attention features. 104

8.1 Performance evaluation of several baselines, in the Robust04 dataset, using the document content compared to the passage-based content. 113
8.2 Performance evaluation of several state-of-the-art models extended with attention layers at different levels. 114
8.3 Performance evaluation of the AM3 model using different configurations, in the Robust04 dataset. 116

Part I

Preface


INTRODUCTION

Thesis Context

In this thesis, we are interested in problems related to text representation and matching, and in several related applications in information retrieval, such as question answering and document retrieval.

Information retrieval (IR) is a field of computer science that allows an IR system (IRS) to select, from a collection of documents, those that likely correspond to a user's information need [6]. More specifically, IR is mainly the activity of obtaining information resources, such as documents, passages and answers, that are supposed to match an information need expressed as a query or a question. IR models are proposed and implemented to cope with the main purpose of IR: retrieving relevant information w.r.t. a query. Most IR models are based on the same assumption: documents and queries are represented as weighted bags of words (BoW), and relevance is interpreted as lexical matching between query words and document words. These models estimate the relevance score based on the exact matching between words of the input text sequences. Hence, some documents returned by the IRS do not always address the search intent of the user's query. This limitation is due to the BoW representations, which consider every text sequence as a set of independent words. The main problem resulting from these approaches is the vocabulary mismatch: a different vocabulary is used in the query (or question) and in its corresponding relevant results (documents or answers).

To overcome the BoW and exact matching limitations, semantic-based approaches using external resources such as WordNet have been proposed [7]. These approaches are resource consuming and lack genericity. Recently, new approaches based on machine learning algorithms have been used in order to learn vector representations, namely embedding models [2, 8, 9], which leverage the semantic features of the different words in a text and allow semantic links between text sequences to be expressed via vector operations. In addition, advances in image processing, through neural networks and deep learning [10], have inspired research in the different fields of text matching [11] to explore several deep models, taking advantage of the computational ability and the non-linearity of these models in order to capture different latent characteristics related to text representation and matching.

Research Issues

In this thesis, we address three main issues related to different text matching tasks: (1) the impact of using embedded representations of words in IR models; (2) the impact of the neural model's architecture w.r.t. different matching tasks; (3) dealing with the document length while using neural models for ad-hoc document ranking. In the following, we describe each of these main issues in more detail.

1. Embedded Representations in Text Matching
Recent text matching models largely use the distributed representations of words [2, 9] and sequences [8, 12]. These embedded representations enable the matching models to consider the semantic relatedness between two different texts, since similarities between words are computed based on their vector representations. In the context of document retrieval, the proposed models [13, 14, 15] represent documents and queries by their component word vectors, and their similarities are computed based on vector operations that handle all the word vectors of the query and the document in the same way. This may solve problems related to exact matching. However, several questions arise on how to combine the query word vectors and the document word vectors when computing their similarity. Indeed, we believe that there is a key difference between the query words and the document words. In fact, query words that do not appear in the document may have a different impact, and contribute differently to the matching with the document content, compared to those that are present in this document. In this thesis, we first analyze the different ways in which the distributed representations of words and sequences have been used in text matching. Then, we propose different matching strategies, combining semantic-based word matching with exact matching, in order to analyze the contribution and impact of query terms in the matching function.

2. Neural Models and Text Matching Tasks
Deep learning is gaining a large interest in recent NLP and IR research. Several models based on different architectures have been proposed to handle a variety of tasks related to text matching. We distinguish mainly the simple feed-forward models [16], the convolutional models [17] and the recurrent models [18]. In most of these models, the input sequences undergo the same transformations to construct the corresponding representations, independently of the task being addressed. The nature of the task can be defined w.r.t. the relationship between the input text sequences and their types. Specifically, we distinguish two types of matching: (1) symmetric matching refers to matching tasks where inputs are of the same nature, such as paraphrase identification and document classification; (2) asymmetric matching refers to tasks where inputs are of different natures, such as ad-hoc document ranking and answer sequence selection. Most neural text matching models in the state-of-the-art [3, 19, 20, 21, 22] have adopted a Siamese architecture [23] due to its simplicity. In these models, both input representations are constructed in the same way, from the vectors of their component words. These models do not consider the differences between text matching tasks in general. The complementary relationship between some text sequences, such as the query and the document or the question and the answer, is not taken into account. In this thesis, we first describe the asymmetry aspect related to some text matching applications, and then propose an approach that takes into account the type of the matching task, (1) or (2), to better process the inputs and perform the matching. More recently, position-based models [3, 24] and attention-based models [25, 26] have been proposed, taking into account the position of a word in a text, in the former, and the importance of a word in the text sequences, in the latter. The position information of a word in a text is one of the important features that help text representation learning. However, the importance of a given piece of information among others in a text is not considered in position features, as is done in attention models. In the latter, the attention weights help to identify the core information to be considered in a given text sequence, and can thus be used together with position features in order to help the model focus on the most important information while processing the different positions of a sequence. However, none of the proposed neural models for text matching have combined the position information with the attention weights. In this thesis, we combine the position-based representation learning approach with the attention-based models. We believe that when the model is aware of both word position and importance, the learned representations will capture more relevant features for the matching process.

3. Dealing with Document Lengths in Neural Models
Most neural IR models struggle to deal with the variation of document lengths. Long documents may cover several topics, which overcomplicates the matching task and decreases the performance of current neural IR models. To overcome this problem, the common solution [21] is to cut off the excess text in order to make the document content shorter and hence the model more effective. Previous analyses [21, 27] have shown that long documents usually contain relevant parts far from their beginning, and suggest assessing the document relevance at different information granularities, in particular the word level and the passage level. Several passage retrieval models [28, 29, 30] have been proposed in order to deal with the document length problem in IR, but these models are not able to capture the different semantic links along the document, such as the relatedness between different passages of a document and their relevance features. These models are based on simple aggregation functions to combine the passage-level relevance scores. In our work, we propose a matching approach that assesses the document relevance at different levels of information granularity in a document, and addresses the document length problem.

Main Contributions

The main contribution of this thesis is the definition of text matching models using neural networks and distributed representations of texts. We proceed in three main stages: (1) analyzing the use of distributed word representations in classical IR models, and pointing out how these representations contribute to the performance of different models; (2) studying the matching of short texts using neural networks, and how the different matching tasks are handled by existing models; (3) proposing a neural model that better handles document length variation. At each of these stages, a solution is proposed to solve the problems encountered by existing methods. More precisely, our contributions are as follows:

1. Word embeddings in classical IR models. First, we analyze the importance of query terms that are not present in the relevant documents while performing the document-query matching, using the distributed word representations. To do this, we use different document-query matching strategies. The objective is to treat the terms of the query that are absent from a document differently from those that are present in it. We combine the exact matching of classical IR models with the semantic matching of distributed word representations. The matching strategies are evaluated and compared to state-of-the-art models, using different TREC datasets. We specifically use news datasets, such as Robust and AP, as well as the GOV2 web dataset. Results support our hypothesis about the differences between query words.

2. Neural models for text matching. In a second step, we analyze the state-of-the-art in neural text matching and identify the main differences between text matching applications. Using attention-based models [31], we made two main contributions:

• Asymmetry Sensitive Architecture for Neural Text Matching. We first describe the asymmetry aspect of several text matching tasks. More specifically, asymmetric tasks include document-query matching as well as question-answer matching, and involve a kind of complementary relationship between the input sequences that the model can leverage. Symmetric tasks include document classification and paraphrase identification, and aim at identifying whether both input text sequences mean the same thing or refer to the same information. To handle the differences between these matching tasks, we propose an asymmetry sensitive architecture that extends different neural text matching models using attention layers. We further extend several well-known models, and show the interest of taking into account, in the matching model, the nature of the task being addressed.
• Attention-based matching using Multiple positional Vector representations. In a second step, we analyzed the impact, on the matching tasks, of combining position features and attention weights of the different words in text sequences. Extending the position-based model MV-LSTM [3] with attention layers, within the asymmetry sensitive architecture described above, we made a comparative analysis of the MV-LSTM model with and without the attention features.

Experimental results, in question-answer matching and question pair identification tasks, show that our asymmetry sensitive architecture enhances the performance of the different models extended with attention layers. Besides, when combining the position features with the attention weights, the model performs better at ranking relevant information.

3. Multi-level Relevance Assessment of Long Documents. Finally, in order to cope with the problem of document length in neural text matching models, we focus on two main aspects, and proceed accordingly: (a) identifying the most important elements in a sequence using attention models; (b) the fact that a long document can deal with several topics and thus can contain different relevant passages. We propose two main approaches:

• A Multi-level Attention-based Architecture. We consider passage-based document representations, where every document is viewed as the set of its passages that are likely to be relevant, as in [21, 27]. We propose an attention-based architecture to extend several neural text matching state-of-the-art models, in order to perform a multi-level relevance assessment. To do this, we use attention layers at the word level as well as the passage level. The attention weights enable us to identify the most important words of every input sequence (query or passage) at the word level, and the passages to be considered in the matching layer at the passage level.
• An Attention-based Multi-level Matching Model (AM3). Then, we propose AM3, a multi-level neural matching model that exploits three different levels of information granularity: the word level, the passage level and the document level. Considering a long document as a set of passages, our AM3 model focuses on the most important words of the query and the passage, using attention weights. At the word level, we compute attention weights for the different words in the query and the document. At the passage level, we first compute a matching matrix between all words of the query and the different document passages. Then, we represent every passage by an embedded vector computed using a set of filters applied to the matching matrix, and we compute attention weights for the different passages to highlight the most relevant ones. Finally, at the document level, we use an interaction layer in order to compute the final matching score of the whole document.

We evaluate these contributions using a news-wire TREC dataset (Robust), and the experimental results show the advantage of using the passage-based document representation, compared to using the document content cut to a maximum size. Furthermore, the models extended within the proposed architecture outperform their original counterparts.

Thesis Overview and Organization

This document is made of three main parts and a conclusion. The Background part II contains two chapters describing the basic concepts in IR and neural networks; the State of The Art Overview part III contains two other chapters describing the text representation models and the deep models in text matching; the Contributions part IV contains a first chapter describing the experimental framework used in this work, and three other chapters describing our contributions and analysis. Finally, the Conclusion part V gives a general assessment of the work carried out during the preparation of this thesis, and of how the proposed models contribute to the resolution of the various issues that we addressed. In the following, we describe each of the chapters separately:

• Chapter 1 describes several basic concepts in information retrieval. First, we provide a set of basic definitions (section 1.2), then describe the main text representation methods (section 1.3) and the matching process (section 1.4). Section 1.5 describes the main objective of evaluating an IR system, and a set of evaluation measures and campaigns. Different issues related to text matching are discussed in section 1.6.

• Chapter 2 describes a set of basic concepts about artificial neural networks (NN) and deep learning. We give a set of definitions related to these models in section 2.2. In section 2.3, we describe a set of the most used neural architectures, and we focus mainly on those used in this thesis. In section 2.4, we describe the NN training process and methods, and give a set of widely used algorithms in section 2.5. In section 2.6, we explain the over-fitting problem of neural models and describe some methods used to solve this problem.

• Chapter 3 concerns text representation approaches. It presents a set of models that are used to learn distributed representations of text, at the word level (section 3.2) and the sequence level (section 3.3). In section 3.4, we discuss how these representations have been used in different text matching applications. We provide a taxonomy describing several state-of-the-art models that use these representations. Issues related to these representations are listed in section 3.5, and we end with a discussion in section 3.6.

• Chapter 4 first gives a description of the learning to rank (LTR) framework in section 4.2. In section 4.3, we give an overview of the deep learning applications in text, in particular for representation learning and matching, and define a unified deep model for text matching that supports different state-of-the-art models. In section 4.3, we also give a taxonomy of deep models for text matching based on two main criteria, the architecture and the representation features. In section 4.4, we discuss several issues related to deep models in text matching.

• Chapter 5 describes the experimental setup and frameworks used in the different experiments. In section 5.2, we describe the different datasets. In section 5.3, we give the list of evaluation metrics that we use to evaluate the performance of the different models. In section 5.4, we give an overall description of the different baseline models, specifically the classical models, the classical models using word embeddings, and the neural models. Finally, in section 5.5, we describe the list of frameworks and technical tools.

• In Chapter 6, we perform an empirical analysis of the impact of word embeddings in classical IR models. We base our analysis on the different contributions of query words that are present in relevant documents compared to those that are not. We define three different matching strategies using the semantic similarities between embedded word vectors, in section 6.4. Finally, in section 6.5, we describe the experimental results and analysis, where we evaluate the impact of the different parameters used in each of the matching strategies we proposed.
• Chapter 7 explains our main contributions about neural models for short text matching. First, in section 7.2, we propose a matching approach that enables several neural matching models to handle the differences between text matching tasks. We define the asymmetry aspect of several text matching tasks in section 7.2.1. In section 7.2.2, we describe our approach to better address the different matching tasks, and extend several neural matching models. We describe the training process of the different models in section 7.3. In section 7.4, we analyze the results of the different models separately, and perform an empirical analysis of an attention-based model with multiple positional representations, for matching sequences.
• Chapter 8 describes a matching model for ad-hoc ranking of long documents. In section 8.2, we describe our multi-level matching architecture enabling us to extend several text matching models with attention layers, in order to perform a multi-level relevance assessment. Furthermore, in section 8.3, we propose AM3, an attention-based matching model where word-level, passage-level, and document-level signals contribute to the final matching score. Finally, in section 8.4, we describe the experimental part and give a detailed analysis of the obtained results.
• Conclusion and future work are listed in part V.

Part II

Background


CHAPTER 1

BASIC CONCEPTS IN INFORMATION RETRIEVAL

1.1 Introduction

Information retrieval (IR) is defined as finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections of documents (usually stored on computers) [32]. The first approaches in IR were proposed for text search [33]. The main objective of such a task is to find relevant results that satisfy an information need expressed in a query. In order to perform the retrieval task, an IR system (IRS) performs these main steps: once the dataset is indexed, query and document representations are constructed; then the query-document matching is performed, and finally the candidate documents are ranked. This process is highlighted in figure 1.1. In this figure, the indexing step creates a token-to-documents list. This index helps to find the documents that contain a given term. Representations are built first, for both the query and the documents. These representations are then matched and a score is assigned to every document-query pair. These scores are supposed to reveal the relevance of a document w.r.t. the query. Finally, the documents are ranked according to the relevance score and the top k results are returned to the user. If the user is not satisfied with the results, the query may be reformulated [34], based on the user's feedback [35] or on external resources [36]. Then a new search iteration is performed in order to return new results that could better satisfy the user. In the following, we consider the set of notations defined in table 1.1. We will first present several basic concepts related to the IR process, mainly the relevance concept, IR matching models and evaluation; then we end with some issues that an IRS needs to handle.
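To make the indexing and ranking steps of figure 1.1 concrete, here is a minimal sketch in Python; the two toy documents and the simple term-overlap score are illustrative assumptions, not a model used in this thesis.

```python
from collections import defaultdict, Counter

def build_inverted_index(docs):
    """Map every token to the identifiers of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def retrieve(query, index, k=10):
    """Rank documents by the number of distinct query tokens they contain."""
    scores = Counter()
    for token in set(query.lower().split()):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    return scores.most_common(k)  # top-k (doc_id, score) pairs

# Hypothetical toy collection, for illustration only.
docs = {"d1": "neural models for information retrieval",
        "d2": "attention models for text matching"}
index = build_inverted_index(docs)
print(retrieve("neural information retrieval", index, k=2))  # only d1 matches
```

Real IR models replace the naive overlap score with the matching functions described in sections 1.3 and 1.4.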

1.2 Definitions

1.2.1 Sequence
We use the term sequence to refer to the different granularities of text representation, including sentences, passages, paragraphs and documents. Thus, we consider as a sequence any list of words that follow each other in a precise order, in order to express a given piece of information.



Figure 1.1: Description of the basic process of information retrieval systems.

Notation            Description
|S|                 cardinal of the set S
w_i                 the i-th word of a sequence
s = <w_1 ... w_|s|> sequence of words
{w_1, ..., w_n}     set of words
Q                   query
D                   document
q_i                 a query word
d_j                 a document word
V                   set of all vocabulary words
RSV                 relevance status value

Table 1.1: Set of notations.


1.2.2 Document

A document is defined in Wikipedia1 as "everything that may be represented or memorialized in order to serve as evidence". Several literal definitions have been provided [37, 38]. Documents can be digital or paper. In [39], Levy has provided the main characteristics of a digital document compared to a paper one. In IR, documents are the central elements of any IRS. In this work, we consider digital documents, where every document is a sequence of words D = <d_1 ... d_|D|>.

1.2.3 Query

A query is a sequence of words used by the user in order to express his/her information need. Some studies [40, 41] have shown that the most common length of a query is about two words [40], and that multiple queries could be used in order to express the same information sought [41]. As for the document, the user query is also a sequence of words Q = <q_1 ... q_|Q|>, where |Q| is usually much smaller than |D|.

1.2.4 Relevance

The concept of relevance is central to IR. It is defined as a "correspondence in context between an information requirement statement and an item or element, i.e. the extent to which the item covers material that is appropriate to the requirement statement" [42]. Cooper [43] has given several definitions of relevance from different resources. We can distinguish two main definitions of relevance [44]: 1) the system relevance (relevance as seen by the IRS), corresponding to the scoring value computed for a given document w.r.t. a given query; 2) the user relevance (relevance according to the user), referring to the satisfaction of his/her information need expressed in the query. The analysis of Mao et al. [45] highlights the differences between the system relevance of a result and the user satisfaction.

1.3 Text Representation

In order to perform the query-document matching, the query and document contents (texts in our case) are first mapped to internal representations. Several text representation models have been used in IR models. In this section, we present two main classes among others: the first one is the classical Bag-of-Words representation, used in several exact matching-based models [46]; the second class covers distributed representations for semantic-based matching [47].

1. https://en.wikipedia.org/wiki/Document#Abstract_definitions


1.3.1 Bag-of-Words (BoW) Representations
BoW is the common approach used for representing text in most classical text matching tasks in general, and in information retrieval in particular. A BoW is a structurally simple representation, constructed without linguistic or domain knowledge. It consists in representing every input sequence s using a set of independent words {w_1, ..., w_n}, where w_i is a given word of the sequence s, as defined in table 1.1. Hence, every word sequence can be represented based on one-hot word vectors whose dimension is equal to the size of the dataset vocabulary. The only non-null value of every vector is either 1 or a real value representing the weight or frequency of the corresponding word in the dataset. This kind of representation has been used in several text matching models, including text classification [48], document ranking [46, 49] and question answering [50]. While using BoW representations, the similarity between texts is based on the exact matching of the words of the input sequences, which amounts to leveraging only lexical similarities. This matching approach does not take into account the semantic relatedness of words and their meanings. This leads to the well-known vocabulary mismatch problem. In IR, this problem happens when the words used in the query are different from those used in relevant documents, where synonyms of the query words can be used instead. Besides, some words of the query can be polysemous and hence be used in documents that are not relevant to the query. The exact matching does not take these aspects into account, and a document having many or all query words is expected to be relevant. Nevertheless, some documents could contain query words without necessarily being relevant for that query, while other documents could contain no query word and be relevant. To cope with these issues, semantic-based matching approaches have been proposed, taking into account synonyms and words of the same context.
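As an illustration of the BoW scheme and of its exact-matching limitation, the following sketch (with a made-up toy vocabulary, not an excerpt from this thesis) builds term-frequency vectors over a fixed vocabulary and scores two sequences by lexical overlap; synonyms such as "car" and "automobile" do not match.

```python
from collections import Counter

def bow_vector(sequence, vocabulary):
    """Term-frequency vector over a fixed vocabulary (one dimension per word)."""
    counts = Counter(sequence.lower().split())
    return [counts[w] for w in vocabulary]

vocabulary = ["car", "automobile", "engine", "repair"]   # toy vocabulary
q = bow_vector("car repair", vocabulary)                  # [1, 0, 0, 1]
d = bow_vector("automobile engine repair", vocabulary)    # [0, 1, 1, 1]

# Exact lexical matching: only shared vocabulary dimensions contribute,
# so "car" and "automobile" do not match despite being synonyms.
overlap = sum(min(qi, di) for qi, di in zip(q, d))
print(overlap)  # 1 (only "repair" matches)
```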

1.3.2 Semantic-based Representations
Semantic-based similarities between words can be used to cope with the vocabulary mismatch problem. Several models have been proposed to construct word representations that are able to better handle the semantic similarities between words and sequences. The commonly called word embeddings [51], or distributed word representations, are vector representations of words used to capture fine-grained semantic and syntactic regularities. These models construct a semantic-based vector space where every word is represented by a real-valued vector. There are two main approaches for building distributed word representations: 1) the global matrix factorization approach, such as Latent Semantic Analysis (LSA) [47] and GloVe [9]; 2) the local context window approach, such as Word2Vec [52]. LSA [53] word vector representations are based on a singular value decomposition (SVD) method applied to a document-word matrix. The LSA-based approaches assume that dimensionality reduction is highly valuable: the main idea behind this assumption is that reducing the dimensionality of the observed data from the number of initial contexts to a much smaller number will produce better approximations of semantic similarities [54]. GloVe [9] is a global log-bilinear regression model that combines different aspects of the global matrix factorization methods and the local context window methods. It leverages statistical information by training the model only on the nonzero elements

in the global word-word co-occurrence matrix, rather than on the entire sparse matrix. Differently from matrix factorization-based methods, models based on the local context window method, such as the Word2Vec model [55] and the Log-BiLinear model [56], represent words according to their contexts (neighbors), and learn word vector representations through a neural network architecture. These models have demonstrated the capacity to learn linguistic patterns as linear relationships between the word vectors [52].
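A minimal sketch of how such distributed representations are exploited: given pre-trained word vectors (the 4-dimensional vectors below are invented for illustration, not real Word2Vec or GloVe output), semantic relatedness reduces to a cosine similarity between vectors, so lexically different but related words obtain a high score.

```python
import numpy as np

# Hypothetical pre-trained embeddings; real ones come from Word2Vec, GloVe, etc.
embeddings = {
    "car":        np.array([0.8, 0.1, 0.3, 0.0]),
    "automobile": np.array([0.7, 0.2, 0.4, 0.1]),
    "banana":     np.array([0.0, 0.9, 0.1, 0.6]),
}

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["car"], embeddings["automobile"]))  # high: related words
print(cosine(embeddings["car"], embeddings["banana"]))      # low: unrelated words
```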

1.4 Text Matching Process

Text matching aims to compare two text sequences based on several criteria. It is the core of many tasks. A large set of text matching models has been proposed in the literature. We present a generic model to explain the text matching process. Given two text sequences $s_1$ and $s_2$, a representation function $\varphi$ and a vocabulary space $V$, each sequence is represented by a vector $\varphi(s_i) \in \mathbb{R}^{|V|}$ if a BoW representation scheme is used, or by a matrix $\varphi(s_i) \in \mathbb{R}^{|V| \times dim}$ if a distributed representation scheme is used. The similarity between the two sequences, named relevance status value (RSV), is then computed as described in equation 1.1. This assessment takes into account the importance weights of every term of $s_1$ in $s_2$.

$$RSV(s_1, s_2) = M(\varphi(s_1), \varphi(s_2)) \qquad (1.1)$$

In the case of a document retrieval task, the documents of the dataset are ranked based on their relevance scores RSV(Q, D), and the top k ones are returned to the user by the IRS. Several models have been proposed in the text matching literature to compute the RSV value for the matched sequences. The classical matching models [57, 49] are referred to as tf.idf models, since they rely on word statistics such as the term frequency (tf) and the inverse document frequency (idf). The idf weighting, first introduced by Sparck Jones in 1972 [46], is based on empirical observations of the global term frequency tf, whereby terms that are highly frequent in a collection are more common and less discriminative, and should therefore be given less weight than less frequent terms. The work that followed in IR [57, 49, 58] has widely used this insight. Some of the well-known tf.idf models are the vector space model [59], the probabilistic BM25 model [57], and the language modeling approach LM [60]. Recently, several models based on machine learning techniques have been proposed and used in several text matching applications [61, 62, 63]. We are interested in classical matching models using distributed representations, as well as in neural models. A detailed description of these models is provided in chapter 4.
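As a concrete instance of equation 1.1 where M is a tf.idf-style matching function, the sketch below implements a standard BM25 scorer (the usual Robertson formulation; the k1 and b values are common defaults and are not the settings used in this thesis).

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, doc_freq, n_docs, avg_dl, k1=1.2, b=0.75):
    """BM25 relevance score RSV(Q, D) for one query-document pair.

    doc_freq[t] is the number of collection documents containing term t,
    n_docs the collection size, avg_dl the average document length.
    """
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        df = doc_freq.get(t, 0)
        if df == 0:
            continue  # exact matching: terms absent from the collection add nothing
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm_tf = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avg_dl))
        score += idf * norm_tf
    return score

# Example with hypothetical collection statistics:
# bm25_score(["neural", "retrieval"], "neural models for retrieval".split(),
#            doc_freq={"neural": 12, "retrieval": 30}, n_docs=1000, avg_dl=150)
```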

1.5 Evaluation in IR

The quality of the document ranking performed by a matching model is essential. Indeed, the user usually considers only the first page of documents ranked by a web search engine (the top 10 or 20) [64]. If relevant documents are not present on the first page, the user will not be satisfied by the results returned

by the system. Several measures and practical benchmarks have been proposed in order to evaluate how effective an IR approach is.

1.5.1 Evaluation Measures
Several evaluation measures are used in order to assess the effectiveness of an IRS. In this section, we present some of the widely used measures in IR. We group them according to their use, results-based or ranking-based.

Results-based
Results-based evaluation measures evaluate the overall set of returned documents per query. The objective is to measure how capable an IRS is of finding all relevant documents and rejecting all irrelevant documents.

• Recall & Precision. [65] Given ρ a set of retrieved documents by a given system, precision P is the fraction of relevant documents that have been retrieved, ρ+, over the total amount of returned documents, while the recall R corresponds to the fraction of relevant documents that have been retrieved among all relevant documents.

$$P = \frac{\rho^+}{\rho^+ + \rho^-} \qquad (1.2) \qquad\qquad R = \frac{\rho^+}{\rho^+ + \bar{\rho}^+} \qquad (1.3)$$

where $\rho = \rho^+ + \rho^-$ is the total number of retrieved documents. $\rho^-$ and $\bar{\rho}^+$ refer, respectively, to the number of irrelevant retrieved documents and the number of relevant documents that were not retrieved by the given system.

• Acc. The accuracy is one of the metrics used to evaluate binary classifiers. While in ranking tasks the objective is to evaluate the ordering of the relevant elements in a list of results, in classification tasks the evaluation objective is to assess the system's ability to correctly categorize a set of instances. For example, in a binary classification application of text matching, we would like the system to predict the correct label (e.g. 0 or 1) that reflects an element's relevance. Hence, given a dataset having S positive elements and N negative elements, the accuracy (Acc) of a model M can be defined as in equation 1.4.

$$Acc(M) = \frac{TS + TN}{S + N} \qquad (1.4)$$
where TS and TN are, respectively, the number of elements that are correctly classified as positive and the number of elements that are correctly classified as negative. Hence, for a total population (evaluation dataset), the closer the model's accuracy is to 1, the better the model.
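The set-based measures of equations 1.2-1.4 can be computed as in the following sketch; the retrieved lists and labels are toy inputs used only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Equations 1.2 and 1.3: retrieved is a ranked list of doc ids, relevant a set."""
    rho_plus = sum(1 for d in retrieved if d in relevant)   # relevant retrieved
    p = rho_plus / len(retrieved) if retrieved else 0.0
    r = rho_plus / len(relevant) if relevant else 0.0
    return p, r

def accuracy(predictions, labels):
    """Equation 1.4: fraction of correctly predicted binary labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

print(precision_recall(["d1", "d2", "d3"], {"d1", "d4"}))  # (0.333..., 0.5)
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))                # 0.75
```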

The recall, precision and Acc assume that the relevance of each document can be judged in isolation, independently from other documents [66]. However, in [67], Goffman recognized that the relevance of a given document must be determined with respect to the documents that are returned before that document. In another work [32], Manning et al. argued that the user cares more about good results that are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. Hence, the effectiveness must be assessed at different ranks of the returned results list.

Ranking-based
The ranking-based evaluation measures highlight the ranks of the results and promote models that return relevant documents at the top of the results list.
• P@k. The objective is to compute, at a given rank position k, the corresponding precision value P. It is computed as described in equation 1.2, where only the top k results need to be examined to determine whether they are relevant or not [32].
• MAP. The mean average precision (MAP) is one of the evaluation measures in widespread use. The MAP computed for a set of n queries corresponds to the mean of the average precision scores for each query in this set.

$$MAP = \frac{\sum_{i=1}^{n} AP(Q_i)}{n} \qquad (1.5) \qquad\qquad AP(Q_i) = \frac{\sum_{k=1}^{n_i} P@k(Q_i)\cdot\varepsilon_k}{\rho} \qquad (1.6)$$

where $n_i$ refers to the number of retrieved results for the query $Q_i$, $P@k(Q_i)$ refers to the precision at cut-off k in the corresponding results list, $\varepsilon_k$ is an indicator function which is equal to 1 if the item at rank k is a relevant document and 0 otherwise [68], and $\rho$ refers to the total number of relevant documents.
• MRR. The Reciprocal Rank (RR) computes the reciprocal of the rank at which the first relevant document is retrieved for a given query $Q_i$. RR is equal to 1 if a relevant document is retrieved at the 1st rank, 1/2 if a relevant document is retrieved at the 2nd rank, and so on. When averaged across all evaluation queries, the measure is called the Mean Reciprocal Rank (MRR).
$$MRR = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{rank_i} \qquad (1.7)$$

where $rank_i$ refers to the rank position of the first relevant document for the query $Q_i$.
• DCG. The Discounted Cumulative Gain (DCG) aims to reward relevant documents ranked highly in the result list, and to penalize highly relevant documents appearing lower in a search result list, by means of a graded relevance value which is reduced logarithmically, proportionally to the position k of the result [69].

$$DCG@k = \sum_{i=1}^{k}\frac{\varepsilon_i}{\log_2(i+1)} \qquad (1.8)$$


where $\varepsilon_i$ refers to the relevance value of the result at position i. Since search result lists vary in length depending on the query, the normalized DCG (nDCG) computes the cumulative gain that the user obtains by examining the retrieval results up to a given rank position k [69].
$$nDCG@k = \frac{DCG_k}{IDCG_k} \qquad (1.9)$$

where $IDCG@k = \sum_{i=1}^{|REL_k|}\frac{2^{\varepsilon_i}-1}{\log_2(i+1)}$ is the ideal discounted cumulative gain, and $REL_k$ represents the list of relevant documents in the corpus up to position k, ranked in descending order of their relevance.
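The ranking-based measures of equations 1.5-1.9 can be computed per query as in the sketch below (averaging AP and RR over queries gives MAP and MRR). This is an illustrative implementation, not the official trec_eval tool; note that the ideal ranking in nDCG here reuses the linear gain of equation 1.8, whereas the IDCG definition above uses an exponential gain.

```python
import math

def precision_at_k(rels, k):
    """P@k over a ranked list of binary relevance labels."""
    return sum(rels[:k]) / k

def average_precision(rels, n_relevant):
    """Equation 1.6: P@k accumulated at the ranks of relevant results."""
    hits = [precision_at_k(rels, k) for k in range(1, len(rels) + 1) if rels[k - 1]]
    return sum(hits) / n_relevant if n_relevant else 0.0

def reciprocal_rank(rels):
    """1/rank of the first relevant result (0 if none is retrieved)."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0

def dcg_at_k(gains, k):
    """Equation 1.8: graded relevance discounted by log2 of the position."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """Equation 1.9: DCG normalized by the DCG of the ideal ordering."""
    idcg = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

rels = [1, 0, 1, 0]                            # toy ranked list of binary judgments
print(average_precision(rels, n_relevant=2))   # (1/1 + 2/3) / 2 ≈ 0.83
print(reciprocal_rank(rels))                   # 1.0
print(ndcg_at_k([3, 2, 3, 0], k=4))            # graded relevance example
```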

1.5.2 Benchmarks and Campaigns
The Text Retrieval Conference (TREC)2 provides a complete evaluation benchmark to compare the different proposed IR methods. The TREC datasets result from a project initiated by the Defense Advanced Research Projects Agency (DARPA)3, co-organized by the National Institute of Standards and Technology (NIST)4. The NII Testbeds and Community for Information access Research (NTCIR)5 is a series of evaluation workshops designed to enhance research in information technology, including mainly information retrieval and question answering. A grid of several evaluation campaigns launched for text processing research can be found in [70].

1.6 Text Matching Issues

We can distinguish several limitations related to classical text matching methods. First, the widely used BoW representation approach considers the words of a text as independent elements. However, the words of a given text are somehow related to each other. We believe that the characteristics of the different words in a sequence, such as their sequencing, positions and co-occurrences, are some of the most important aspects that could help to distinguish the relevant documents from other documents that contain the query words without necessarily being relevant. Besides, the BoW-based matching models [57, 50, 48] are statistical methods where the frequency of query words in a document is the basic asset used to compute its relevance. These models are based on the exact matching and consider documents with no query word as irrelevant. However, several documents could be relevant without having any lexical match with the query; for instance, synonyms of the query words can be used instead of these words themselves. Hence, a semantic-based matching scheme is needed in order to cope with these limitations. To overcome the aforementioned problems, semantic-based methods, such as latent semantics LSI [71] and LSA [53], concept-based methods [7, 72], and word-sense matching [73], can provide a way to represent a text as a set of concepts, where word representations are computed based on a word co-occurrence matrix factorization or on an external resource such as a thesaurus. These models

2. https://trec.nist.gov/
3. https://www.darpa.mil/
4. https://www.nist.gov/
5. http://research.nii.ac.jp/ntcir/index-en.html

efficiently leverage statistical information in a text corpus, but they perform relatively poorly in word analogy tasks [9]. Recently, context-based representation models [52, 56] have been proposed. These models learn semantic-based word representations that allow the semantic relatedness between lexically different words to be computed effectively, and reveal contextual similarities and links. Hence, these representation models can be used to overcome the limitations of the classical BoW and latent semantic models. Beyond the word-based similarity, IR approaches are confronted with a variety of issues related to the text matching process. First, the gap between the query length and the document length makes it difficult to retrieve the appropriate document for the user query. More specifically, queries have been shown [74] to be typically short and made of a small number of keywords, which sometimes leads to ambiguity problems. In contrast, the document is much longer and can span different topics. Hence, documents may be entirely or partially relevant: they may have passages that are more relevant to the query than other passages of the same document. Besides, in the same document, or word sequence in general, words are not of the same importance, and a matching model needs to handle them accordingly. Finally, most text matching models handle the sequences being matched in the same way, regardless of the nature of the matching task. In chapter 7, we highlight the differences between several matching tasks and the related issues. To handle the aforementioned aspects, text representation and matching models need to process input sequences more deeply than with a simple word-level analysis and lexical similarity assessment. To do so, recent text matching models tend to use machine learning and deep learning techniques to build models for text representation learning and for the relevance assessment of the different results (documents, answers). In chapter 4, we give an overall description of these models and discuss several related issues.

1.7 Conclusion

In this chapter, we have reviewed the basic concepts of IR, from input representation to result retrieval. We have seen the main limitations of the classical BoW representation approach and of the exact matching process. Indeed, the simplicity of these classical methods does not allow matching models to capture semantic relatedness or the contextual information that is important and could influence the retrieval process. To overcome these limitations, approaches based on word co-occurrences represent a document as a set of concepts. The objective of these approaches is to exploit the semantic relatedness between words while performing the text matching process.
Recent IR research has focused on using machine learning and deep learning techniques to solve several issues related to classical text matching models. Several deep models have been proposed, exploiting both semantic representations and neural networks. The objective of neural-based models is to capture higher-level semantic features and relevance signals, in order to better perform the matching function. In the next chapter, we present a set of basic concepts about neural models and deep learning.

CHAPTER 2

BASIC CONCEPTS IN NEURAL NETWORKS AND DEEP LEARNING

2.1 Introduction

Advances in machine learning have given rise to a set of non-linear methods known as deep structured learning, or more commonly deep learning [75]. These techniques are based on different structures of artificial neural networks, composed of multiple processing layers, that learn representations of data with multiple levels of abstraction [10]. Deep learning methods have affected several artificial intelligence fields such as computer vision [76], speech recognition [77] and machine translation [78]. Artificial neural networks were introduced in 1943 by McCulloch and Pitts [79]. They form a bio-inspired1 computational approach, including several models for automatic information processing [80, 81, 51]. Besides, the invention of Graphics Processing Units (GPUs) [82] is one of the most important factors that fostered developments in the deep learning area. Indeed, GPUs rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Their highly parallel structure makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel, such as DNNs. In this chapter, we aim to present a set of basic concepts of neural models and deep learning. We introduce key definitions and computational aspects. Following this, we present the most widely used neural model architectures. Finally, we present the training process of neural-based methods and some related issues.

1Biologically Inspired Computing is a field related to artificial intelligence; it concerns a study that loosely knits together subfields related to the topics of connectionism, social behaviour and emergence (Wikipedia).


Notation   Description
x          input vector
x_i        one element (coordinate) of x
y          output vector
y_j        one output element in y
f          transformation function
ψ          combination function
δ          activation function
W          weight matrix connecting two different layers of a network
b          bias vector corresponding to the nodes of one layer in the network
h          a hidden state of a given hidden layer in the network
£          loss function (objective function)
θ          all the free parameters of a model

Table 2.1: Set of notations.

2.2 Main Concepts and Definitions

2.2.1 Notations
Throughout this chapter, we adopt a set of notations, most of which are described in table 2.1.

2.2.2 Artificial Neurons
The artificial neuron, or formal neuron [83], is a computational unit for non-linear data processing. Figure 2.1 shows an abstraction of the functionalities of a biological neuron in a formal neuron. The dendrites of the biological neuron are structures that receive electrical messages through the synaptic connections. These messages correspond to the set of input data values received by the artificial neuron via every input connection x_i. The cell body of the biological neuron is the element that processes the received electrical messages. This process corresponds to the combination and activation steps, represented by the function f in the artificial neuron of figure 2.1. Finally, the axon terminal of the biological neuron corresponds to the output y of the formal neuron, which is either fed to the neurons of another layer or returned as a final output value. The neuron model is a mathematical model that receives different input information as a set of digital signals. These inputs are then combined, using the function f, with a set of free and trainable parameters, including the connection weights w_i and the input bias b, to produce a single output. This aspect is highlighted in figure 2.2, where we identify several elements that are involved in transforming the input signal vector x = [x_1, ..., x_n] into the single output y.

The Free Parameters
Every neuron is associated with a set of trainable parameters, called free parameters, including the connection weight vector w and the bias b, which is a connection weight corresponding to an additional (imaginary) input of value 1.


Figure 2.1: Functionality correspondence between a biological neuron and an artificial neuron.

Figure 2.2: A closer look at the operating process of an artificial neuron.


The free parameters are usually referred to as model parameters, and denoted as θ = {b, w} ∈ R × R^n.

The Transformation Function

Every artificial neuron includes a function f used to transform the input elements of x into the output y. The function f is in fact composed of two main functions: the aggregation function ψ and the activation function δ, as shown in figure 2.2. The objective of these functions is to combine the different inputs using a linear function, in the former, and to transform this combination using a non-linear function, in the latter. Hence, f can be defined as in equation 2.1.

y = f(x), \quad f(x) = \delta\left(\psi(x_1, \ldots, x_n)\right) = \delta\left(\sum_{i=1}^{n} w_i x_i + b\right) \qquad (2.1)

where w_i is the connection weight corresponding to the input x_i.
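To make equation 2.1 concrete, the following minimal Python sketch (using NumPy, with arbitrary toy values for x, w and b) computes the output of a single artificial neuron; tanh is chosen here only as one possible activation function δ.

import numpy as np

def neuron_output(x, w, b, delta=np.tanh):
    """y = delta(sum_i w_i * x_i + b), cf. equation 2.1."""
    return delta(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # input vector x = [x_1, ..., x_n]
w = np.array([0.1, 0.4, -0.3])   # connection weights w_i (free parameters)
b = 0.2                          # bias b
print(neuron_output(x, w, b))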

2.2.3 The Activation Function

The activation function, also called transfer function, is used to control the information propagation through a multi-layer network, based on a defined threshold value. Depending on this threshold and the output value y_j, the neuron can be in several states:

• y_j below the threshold: the neuron is inhibited or inactive;

• y_j close to the threshold: transition phase;

• y_j above the threshold: the neuron is active.

In practice, several activation functions are used depending on the application's objective. Every activation function takes a single number and performs a mathematical operation on it to map it into a corresponding interval. Some of the most widely used activation functions are Tanh, ReLU, and Sigmoid. Figure 2.3 shows their corresponding curves.

(a) Sigmoid (b) Tanh (c) ReLU

Figure 2.3: The curves describing the different behaviours corresponding to three different activation functions used in neural networks.
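As an illustration of figure 2.3, the following minimal NumPy sketch implements the three activation functions mentioned above; the evaluation points are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps z to (0, 1)

def tanh(z):
    return np.tanh(z)                 # maps z to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negative inputs, identity otherwise

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")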

The work in [84] presents an overview of activation functions and compares them through an empirical analysis using several deep learning applications.


2.2.4 Artificial Neural Networks
Although a single neuron is capable of learning and solving some classification problems, it cannot model complex relationships between data or address complex tasks. Neural networks are highly connected graphs where nodes correspond to artificial neurons working in parallel and edges correspond to the network connections. Each neuron is an elementary processor that computes an output value based on the information it receives as input, as described in figure 2.2. The connections between the different neurons are weighted and make up the network topology, which can vary from one model to another. Figure 2.4 shows an example of a simple multi-layer perceptron (MLP) with three layers. An MLP [85], also called a simple feed-forward network [86], consists of neurons that are ordered into layers. Between the input layer and the output layer, there can be several hidden layers, all connected in one direction only, from the current layer to the next one toward the output. There are also recurrent architectures in which the inputs of a neuron can be its own outputs. These NN architectures are presented in section 2.3 among other widely used models.


Figure 2.4: Architecture of a multi-layer perceptron MLP with one hidden layer.

In figure 2.4, the input layer of the NN corresponds to the input data x, and the hidden layer contains a set of computed hidden (intermediate) states h_j. The parameters θ_ij = {w_ij, b_i} include a connection weight w_ij and a bias value b_i, corresponding to the node i connected to the node j. The output layer gives the final value(s) computed by the network. Note that the size of the last layer depends on the application's objective. For example, if the NN is used for a classification objective [87], the output layer may have several nodes, each node referring to a corresponding class label, while in regression applications [88] the output contains only one node that corresponds to a single score. An NN composed of a succession of layers is referred to as an n-layer neural network, where n corresponds to the number of hidden layers, and an NN is considered to be deep or shallow w.r.t. this number n: the NN is considered deep if n > 1 [75]. Hence the definition of deep learning: "A class of machine learning techniques that exploit many layers of non-linear information processing for supervised or unsupervised feature extraction and transformation, and for pattern analysis and classification." [75].

The objective of every layer in the NN (except the input layer) is to approximate a function f*, based on the transformation function f (equation 2.1) applied at every node of that layer. The training process corrects the parameters θ over a set of learning iterations, for a better approximation of the predictions. Hence, the approximated transformation function of a given layer of the MLP is defined in equation 2.2.

f^{*}(x) = \delta(Wx + b) \qquad (2.2)

where W and b refer, respectively, to the connection weight matrix and the bias vector corresponding to one layer of the MLP.
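A minimal sketch of equation 2.2, assuming a toy two-layer MLP with randomly initialised (untrained) parameters; the dimensions are arbitrary and the output layer is kept linear, as in a regression setting.

import numpy as np

def dense_layer(x, W, b, delta=np.tanh):
    """One MLP layer: f*(x) = delta(W x + b), cf. equation 2.2."""
    return delta(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # input layer (4 features)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)    # input -> hidden (3 units)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)    # hidden -> output (1 score)

h = dense_layer(x, W1, b1)                       # hidden states h_j
y = dense_layer(h, W2, b2, delta=lambda z: z)    # linear output for regression
print(y)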

2.3 Some NN Architectures

Several NN architectures have been proposed to address different tasks. In this section, we mainly present those used in this thesis.

2.3.1 Convolutional Neural Networks (CNN)
Since simple MLP networks are not well suited to processing data with several dimensions, CNN models are used to handle data with more than two dimensions. In this section, we describe the basic characteristics of CNN networks. Models using these networks for text matching applications are described in chapter 4. The convolutional network was first used in 1989 by Denker et al. [89] and LeCun et al. [90] for handwritten character recognition. Follow-up works in image processing [91, 92, 76, 93] use a large range of CNN architectures. The trend towards CNN models is due to their hierarchical architecture allowing multi-level data processing, within two main types of layers: convolutional layers and subsampling (more commonly, pooling) layers. The main characteristics of a convolutional network are as follows:

• 3D layering of neurons. The layers can have 3 dimensions: width, height and depth. The neurons of a layer are connected to only a few nodes of the previous layer;

• Local connectivity. The CNN first creates representations of small parts of the input data sample, then assembles these representations to handle larger areas. This is performed by enforcing a local connectivity pattern, using different filters for spatially local pattern recognition;

• Sharing weights. Since the CNN uses several filters to extract different local patterns from the input image, every filter is slid along the image using the same free parameters. Hence, the number of trainable parameters is reduced, and every filter recognizes a specific pattern with its corresponding weights.

• Convolution and Sub-sampling. The convolutional layer shares many weights through the different filter windows, and the pooling layer sub-samples the


output of the convolutional layer to reduce the data rate received from the previous layer. A subsampling layer can be a max-pooling function or another aggregation function (avg or sum) that maps a window of the input matrix to a single value. Every time a convolution layer is applied, the image size shrinks. Besides, the corner pixels of the image are used only a few times during convolution compared to the central pixels, which leads to information loss. To overcome this issue, CNN models use a padding function, which consists of adding some values, usually zeros, to the image borders so that the output fits the original input shape. A minimal sketch of these operations is given below.
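The following is a minimal NumPy sketch of the operations described above (one convolution filter with shared weights, zero padding, and 2x2 max-pooling); the image, filter and sizes are arbitrary toy values, not taken from any model discussed in this thesis.

import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (no padding): slides one filter over the image."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)  # shared weights
    return out

def max_pool(fmap, size=2):
    """Sub-sampling: map each size x size window to its maximum value."""
    H, W = fmap.shape
    return np.array([[fmap[i:i+size, j:j+size].max()
                      for j in range(0, W - size + 1, size)]
                     for i in range(0, H - size + 1, size)])

image = np.arange(36, dtype=float).reshape(6, 6)
padded = np.pad(image, 1)                 # zero padding preserves the spatial size
kernel = np.array([[1., 0., -1.]] * 3)    # one 3x3 filter (vertical edge detector)
fmap = conv2d(padded, kernel)             # 6x6 feature map thanks to padding
print(max_pool(fmap).shape)               # (3, 3) after 2x2 pooling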

2.3.2 Recurrent Neural Networks (RNN)
Recurrent neural networks are based on David Rumelhart's work in 1986 [94]. An RNN is a network with backward (recurrent) and forward connections between the different nodes. These connections form a directed graph along a temporal sequence. The recurrent connections bring information from the output nodes back to the input nodes. Figure 2.5 shows an example of an RNN with bidirectional connections (bi-RNN) [1].


Figure 2.5: Example of a bidirectional recurrent neural network (bi-RNN) [1].

An RNN is used for modeling sequence data in different applications, such as speech recognition [95, 96, 1] and text processing [97, 98]. In an RNN, given an input sequence x = <x_1 ... x_n>, the hidden layers of the network compute a sequence of hidden state vectors h = [h_1, ..., h_n], where every hidden state h_t is then used to compute a corresponding output y_t, as shown in equation 2.3 [98].

h_t = \delta(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
y_t = W_{hy} h_t + b_y \qquad (2.3)

where W_{xh}, W_{hh} and W_{hy} are the connection weights corresponding to the input-hidden, hidden-hidden and hidden-output connections, respectively, b_h and b_y are the biases corresponding respectively to the hidden and the output layers, and δ refers to the activation function.
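A minimal NumPy sketch of the recurrence of equation 2.3, with randomly initialised (untrained) weights and arbitrary toy dimensions.

import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, bh, by, delta=np.tanh):
    """Unidirectional RNN, cf. equation 2.3: h_t = delta(Wxh x_t + Whh h_{t-1} + bh)."""
    h = np.zeros(Whh.shape[0])
    hs, ys = [], []
    for x_t in xs:                           # iterate over the input sequence
        h = delta(Wxh @ x_t + Whh @ h + bh)
        hs.append(h)
        ys.append(Why @ h + by)              # y_t = Why h_t + by
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 5, 8, 2, 4             # toy dimensions and sequence length
xs = rng.normal(size=(T, d_in))
Wxh, Whh, Why = (rng.normal(size=s) * 0.1 for s in [(d_h, d_in), (d_h, d_h), (d_out, d_h)])
hs, ys = rnn_forward(xs, Wxh, Whh, Why, np.zeros(d_h), np.zeros(d_out))
print(hs.shape, ys.shape)                    # (4, 8) (4, 2)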


Figure 2.6: Closer view of LSTM cells used in a bidirectional RNN network.

In the case of a bidirectional RNN structure, as shown in figure 2.5, the hidden state h_t is computed in both the forward and the backward directions [99]. Hence, equation 2.3 becomes equation 2.4, where h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t] is the concatenation of the forward and backward states computed by the hidden layers.

\overrightarrow{h}_t = \delta(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})
\overleftarrow{h}_t = \delta(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}) \qquad (2.4)
y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y

RNN models can use their recurrent connections to exploit the sequential information of an input sequence. However, error signals flowing backward in time tend to blow up or vanish, resulting in a kind of information oblivion by the RNN. To overcome this problem, Long Short-Term Memory (LSTM) cells have been introduced [100]. An LSTM cell is a complex node of an RNN with memory. These cells are composed of several gates associated with the activation vector of the neural node. The different gates are regulators that correct and modify the signal being processed by every node, according to the previous information. Hence, every node in the LSTM network has an input gate, an output gate, and a forget gate, which allows a constant error to flow through the network. A description of the LSTM cells is provided in figure 2.6, where the hidden state h_t is computed following the pipeline of functions in equation 2.5.

i_t = \delta(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = \delta(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \qquad (2.5)
o_t = \delta(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t \tanh(c_t)

where δ is the activation function of the hidden layer, and i, f, o, and c are respectively the input gate, the forget gate, the output gate and the cell activation vector, all corresponding to the current input x_t.
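A minimal sketch of one LSTM step following equation 2.5; the peephole terms W_ci, W_cf and W_co are treated here as element-wise (diagonal) weights, and all parameters are random toy values rather than trained ones.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following equation 2.5 (peephole connections included)."""
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
p = {k: rng.normal(size=(d_h, d_in)) * 0.1 for k in ["Wxi", "Wxf", "Wxc", "Wxo"]}
p.update({k: rng.normal(size=(d_h, d_h)) * 0.1 for k in ["Whi", "Whf", "Whc", "Who"]})
p.update({k: rng.normal(size=d_h) * 0.1 for k in ["Wci", "Wcf", "Wco"]})
p.update({k: np.zeros(d_h) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), p)
print(h.shape)   # (6,)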


Some variations of the LSTM unit do not have one or more of these gates, or may have other gates. For example, gated recurrent units (GRUs) [18] do not have an output gate. Furthermore, a bi-RNN architecture using LSTM cells results in a bidirectional LSTM (bi-LSTM) [101]. The latter can access long-range context in both input directions, thanks to the LSTM memory cells and the bidirectional processing of the bi-RNN.

2.3.3 Transformers
The Transformer is a deep learning model introduced by Vaswani et al. [102], used for processing natural language data such as text. Like an RNN, a Transformer is designed to handle ordered sequences of text, but it relies solely on attention mechanisms [31] and simple feed-forward layers. The main difference with RNNs is that a Transformer does not need to process the sequence elements (words or sub-sequences) one after the other, allowing for much more parallelization than RNNs during training [102]. Since their introduction, Transformers have become the basic building block of most state-of-the-art architectures in natural language processing [103]. Since the Transformer architecture facilitates more parallelization during training, it has enabled training on much more data than was possible before it was introduced. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) [104].
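As an illustration, the following minimal sketch implements scaled dot-product attention, the core operation of the Transformer [102]; it deliberately omits the multi-head projections, positional information and feed-forward sub-layers of the full architecture, and operates on random toy vectors.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V                                     # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))         # 5 token vectors of dimension 16
print(scaled_dot_product_attention(X, X, X).shape)         # self-attention: (5, 16)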

2.4 Neural Models Training

In order to train a neural model, an objective function, sometimes referred to as a cost function or a loss function, is defined to measure the learning error w.r.t. the desired data. This error represents the difference between the desired output (true labels) and the predicted values. Based on this error, the network is trained by updating its free parameters with the objective of minimizing it. To do so, weight updates are made to continually reduce the error until either a good enough model (with a minimum error rate) is found, or a specific number of training iterations is reached. Depending on the application's objective, the model can be supervised, when true labels are given to adjust the model's parameters; weakly supervised, when a few, imprecise or noisy labels are provided to train the model in a supervised manner; or unsupervised, when no ground-truth information is provided for training. The general training process, in the case of supervised applications, can be summarized as follows: first, select a set of training samples, then propagate the inputs through the whole network, computing several intermediate results until reaching the final output layer. The latter provides a value that is then used to compute the error rate compared to the true label of every input sample. Finally, the model parameters are updated w.r.t. the error value. In order to train deep learning models, we need massive sets of training data. Many traditional lines of research in machine learning (ML) are motivated by this training data issue. Hand-labeled training datasets are expensive and time-consuming to create, often requiring subject matter experts (SMEs) to obtain accurate true labels.


Function         Equation                          Type            Description
Square error     (y − ŷ)^2                         Regression      Quadratic difference between the prediction and the true label; it assesses the quality of the estimator.
Absolute error   |y − ŷ|                           Regression      A measure of the difference between two continuous variables: prediction and label.
Square loss      (1 − yŷ)^2                        Classification  A real-valued error focused on classification, with a strict "yes"/"no" interpretation; it indicates whether predictions were correct or not.
Hinge loss       max{1 − yŷ, 0}                    Classification  Considers the prediction correct when the prediction and the label have the same sign; used for "maximum-margin" classification, such as in SVMs.
Logistic error   (1/ln 2) ln(1 + e^(−yŷ))          Classification  Converts log-odds to probabilities of correspondence between the prediction and the labels.
Cross entropy    −y ln(ŷ) − (1 − y) ln(1 − ŷ)      Classification  Measures the average number of bits needed to identify an event using a function optimized for an estimated probability distribution, rather than the true distribution.

Table 2.2: List of widely used loss functions per model objective.

In the following, we describe the different types of training driven by data availability. The supervised, weakly-supervised and unsupervised training processes are described in sections 2.4.1, 2.4.2 and 2.4.3, respectively.

2.4.1 Supervised Training
Supervised learning techniques construct predictive models by learning from a large number of training examples, where each training example has a corresponding true output label, also called the ground truth. In supervised learning, neural models are trained to minimize an error value. This value is measured by the difference between the values predicted by the trained model and the corresponding true label values. Different cost functions compute different errors for the same prediction, and therefore have a significant effect on the model performance, depending on the task being addressed, i.e. regression or classification. In this section, we present some widely used error functions. Consider an input data vector X, a vector Y of true labels corresponding to X, and a prediction vector ŷ = g(x; θ), where g refers to the NN model architecture and θ represents all the model parameters. The model predicts an output value ŷ for each input x in the training sample. An error associated with ŷ is then computed by comparing the prediction to the corresponding true label y. This error is computed using a loss function £(y, ŷ; θ), which varies from one model to another. Table 2.2 gives a list of widely used loss functions. In practice, loss values are estimated for the entire set of samples in the training/evaluation dataset, and the value to be considered is called the mean error. This value is computed by averaging the loss values observed over all the samples (examples) of the dataset, computed with the same loss function.
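A minimal sketch of some of the loss functions of table 2.2, evaluated on arbitrary toy predictions; the average over the samples corresponds to the mean error mentioned above.

import numpy as np

def square_error(y, y_hat):    return (y - y_hat) ** 2
def absolute_error(y, y_hat):  return np.abs(y - y_hat)
def hinge_loss(y, y_hat):      return np.maximum(1.0 - y * y_hat, 0.0)              # labels y in {-1, +1}
def cross_entropy(y, y_hat):   return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)  # labels y in {0, 1}

y, y_hat = np.array([1.0, 0.0]), np.array([0.9, 0.2])
print(cross_entropy(y, y_hat).mean())    # mean error over the two toy samples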

2.4.2 Weakly-Supervised Training
To overcome the unavailability of truly labeled training data, several solutions have been used to generate reliable training data. Many methods have found that unlabeled data, when used in conjunction with a small amount of labeled data [105], or weak labels [106], which may be noisy or of lower quality but larger-scale, can produce a considerable improvement in the model's accuracy.


The recent trend in ML research and deep learning applications is to use weak labels to train models, for classification [107] and for regression [108] objectives. The weak labels are automatically generated and computed either by a pre-trained model [109] or by a model assumed to be effective [110]. In [111], Zhou gives an overview of the different types of weak supervision.

2.4.3 Unsupervised Training
Differently from supervised and weakly-supervised learning methods, unsupervised learning methods aim to train a model to predict accurate results when no true labels are provided. The objective is to directly infer the properties of the data density without providing correct answers or a degree of error for each observation [112]. Compared to supervised methods, in unsupervised learning the error is generally computed w.r.t. the input compared to its predicted representation, in a representation learning process, or by minimizing the distances between elements of the same group, in a clustering objective. Note that in this thesis our work is not concerned with weakly-supervised and unsupervised approaches; we propose only supervised models.

2.5 Training Algorithms

Several algorithms can be used to train NN models. In the following, we give a broad description of the most widely used ones.

2.5.1 Backpropagation
Backpropagation is an algorithm used to train neural networks. It can be seen as an application of the chain rule of derivation using backward differentiation [10]. Rumelhart et al. [94] have shown that this algorithm can effectively train multi-layer neural networks, which was a major catalyst for subsequent research in training neural networks with backpropagation [113, 114, 115]. It is based on the gradient (derivative) of an objective function £ with respect to the connection weights W of a multi-layer NN. The backpropagation process is applied repeatedly to propagate gradients through all nodes of the NN, starting from the output all the way back to the input (backward pass) [10]. Formally, let g be the NN model to be trained. Given an input data sample x, a corresponding true label value y, and the predicted output value ŷ = g(x), the backpropagation method aims to learn a mapping function g : x → y, based on a differentiable objective function £(ŷ, y) that computes the error value at each iteration, in order to better approximate the true association of every element of the input space X to the corresponding element of the output space Y.

2.5.2 Gradient Descent
Gradient descent, also called gradient backpropagation, is one of the most popular algorithms for optimizing neural networks. Gradient descent is a way to minimize the cost function £(θ) by updating the parameters in the opposite

direction of the gradient ∇£ w.r.t. the free parameters θ. A learning rate α is then used to determine the size of the steps taken to reach a local or global minimum of the error value. Hence, the parameters θ are updated at each training iteration t as shown in equation 2.6.

\theta_t = \theta_{t-1} - \alpha \nabla £(\theta) \qquad (2.6)

There are three variants of the gradient descent algorithm, which differ in the number of input examples (samples) used to compute the gradient of the cost function £. Depending on the number of examples used, there is usually a trade-off between the quality of the parameter update and the time required to perform an update.

• Batch Gradient Descent (GD). The gradient is computed on all input examples and their corresponding labels, and a single update is performed at every iteration. The GD algorithm can be very slow, and even impossible to use for datasets that do not fit in memory.

• Stochastic Gradient Descent (SGD). Differently from GD, it uses the cost gradient of only one example at each iteration. At each iteration, the model parameters are updated using a training example drawn at random, which makes the algorithm faster.

• Mini-batch Gradient Descent (MGD). Combining GD and SGD, MGD updates the model's parameters for each mini-batch (training subset) of the input data. Hence, MGD reduces the variance of parameter updates (relative to SGD), which can lead to more stable convergence. A toy sketch of these variants is given below.
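A toy sketch of the update rule of equation 2.6 on a small least-squares problem; setting batch_size to the dataset size gives GD, setting it to 1 gives SGD, and intermediate values give MGD. The linear model, data and hyper-parameters are arbitrary illustrations, not part of the models studied in this thesis.

import numpy as np

def gradient(theta, X, y):
    """Gradient of the mean squared error of a linear model y_hat = X theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def minibatch_gd(X, y, alpha=0.1, batch_size=8, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):                    # batch_size=len(y): GD, 1: SGD
            batch = idx[start:start + batch_size]
            theta -= alpha * gradient(theta, X[batch], y[batch])      # theta_t = theta_{t-1} - alpha * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
print(minibatch_gd(X, y))     # should approach [1.0, -2.0, 0.5]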

2.6 Over-fitting and Regularization

Over-fitting happens when the model contains more parameters than can be trained with the dataset that is used (training set), i.e. the model's complexity is not suited to the training data [116]. Hence, the resulting model struggles to generalize and to produce accurate results on an unseen dataset (validation set). During training, the weights grow in size in order to handle the specifics of the examples seen in the training data. Large weights make the network unstable, resulting in large differences in the outputs for minor variations or statistical noise in the expected inputs [117]. In order to avoid over-fitting and poor performance when making predictions, we need to update the learning algorithm to encourage the network to keep the weights small and to penalize large weights. This is called weight regularization, and it can be used as a general technique to reduce over-fitting of the training dataset and improve the generalization of the model. Traditionally, regularization is conducted by including an additional term in the cost function of the learning algorithm. In [118], several illustrative examples of weight regularization are compared. Several regularization techniques have been suggested in the literature, such as early stopping, dropout, weight decay and curvature-driven smoothing, as described in [80]. In the following, we describe three regularization methods that are widely used in different deep learning applications.

• Early stopping. [119, 120] When using this method, the available data samples are divided into three subsets: training, validation, and testing

sets. The error on the validation set is monitored during the training process, and when the validation error increases for a specified number of iterations, the training is stopped and the weights at the minimum of the validation error are returned (a minimal sketch of this procedure is given after this list).

• Dropout. [121, 122] This method consists of randomly omitting, with a defined probability value, a part of the feature detectors (nodes) on each training example. It aims to prevent complex co-adaptations of the different units on the training data. In [122], Hinton et al. performed an empirical study evaluating several dropout rates in different layers of an NN for image classification, and showed that dropout values of 0.2 and 0.5 greatly reduce the classification errors relative to several existing methods.

• Batch Normalization. [123] This method makes data normalization a part of the model's architecture. It performs the normalization for each training mini-batch. The objective of this method is to improve training and to reduce the impact of changes in the distribution of network activations due to changes in the network parameters, and hence make the network training converge faster.
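A minimal sketch of the early stopping procedure described above, assuming generic update_fn and val_error_fn callbacks (hypothetical names introduced only for this illustration).

def train_with_early_stopping(update_fn, val_error_fn, patience=5, max_iters=1000):
    """Stop training when the validation error has not improved for `patience` iterations."""
    best_error, best_weights, waited = float("inf"), None, 0
    for _ in range(max_iters):
        weights = update_fn()                 # one training step on the training set
        error = val_error_fn(weights)         # error monitored on the validation set
        if error < best_error:
            best_error, best_weights, waited = error, weights, 0
        else:
            waited += 1
            if waited >= patience:
                break                         # validation error kept increasing: stop
    return best_weights                       # weights at the minimum validation error

# Toy usage: the "validation error" first decreases, then increases after step 10.
state = {"t": 0}
def update_fn():
    state["t"] += 1
    return state["t"]
def val_error_fn(t):
    return abs(t - 10) + 1
print(train_with_early_stopping(update_fn, val_error_fn))   # returns 10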

2.7 Conclusion

In this chapter, we presented some basic concepts of neural networks and deep learning. Data availability and accessibility have encouraged researchers to study real-world data in different areas. The resulting research works have come up with several training tips and strategies (sections 2.4 and 2.5) to enhance and develop deep learning applications. In this thesis, we are interested in the application of these new technologies to different text matching tasks. We describe and discuss several models for representation learning and matching in Part III of this document.

Part III

State of The Art Overview


CHAPTER 3

TEXT REPRESENTATION MODELS

3.1 Introduction

Multiple issues and weaknesses have been reported regarding the use of classical BoW representations in different text matching tasks [124, 125, 126], such as the large feature dimension, highly sparse representations, the lack of semantics and the vocabulary mismatch. Another important issue is related to short queries, which can be ambiguous [127]. These are the major limitations that have encouraged NLP research to focus on dense representations capable of capturing semantic and contextual information. Vector space models have already been used in distributional semantics [128]. Since then, multiple models have been developed for estimating continuous representations of words in order to overcome the limitations of BoW representations. Referred to as word embeddings (the most common term), distributional semantics or distributed representations, these representations are based on language model theory [129, 130] and learn to predict words given their contexts. Some of them [51] use neural networks and are hence called neural network language models (NNLM). The NNLMs, as a variety of distributional semantics models, have shown [131] their effectiveness in word analogy and semantic relation tasks, compared to Latent Semantic Analysis (LSA) [47, 132]. Bengio et al. [51] were the first to propose a neural language model, introducing the idea of simultaneously learning a language model that predicts words based on their contexts. This idea has since been adopted in many studies. Some of the most widely used models for distributed word representations in text matching tasks, including NLP and IR applications, are Word2Vec [2] and GloVe [9]. The success of distributed word representations has also led to work on learning distributed representations for larger text units, including paragraphs and documents [8]. Other works [133, 134, 135] make use of external semantic resources, in conjunction with the distributed word vectors, in order to learn accurate representations for words and sequences of words. These methods are out of the scope of this thesis; more details about these models can be found in the work of Nguyen et al. [136].


Notation              Description
e                     embedding function
E                     embedding space
dim                   dimension of the embedding space
~w_i                  embedded vector of the word w_i
s = <w_1 ... w_|s|>   text sequence
c                     context window of several words
w_t                   word at position t of a word sequence
~Q                    query embedded vector
~D                    document embedded vector

Table 3.1: Set of notations.

The main motivation for using distributed representations of words as well as of sequences in text matching applications is that simple vector operations reflect the semantic relatedness between words and sequences. For instance, distances between word vectors are semantically significant, since embedded word vectors are learned so that words belonging to the same context have similar vectors [52]. The same applies to sequence vectors [2, 12], where sentences that share semantic and syntactic properties are mapped to similar vector representations [12]. In the following sections, we use the notations defined in Table 3.1. We assume that words are the smallest units of a text, and do not consider character-based representation models [137, 138]. We detail the main works related to distributed text representations and their different levels of granularity. Specifically, word representation models and sequence representation models are presented in sections 3.2 and 3.3, respectively. Several text matching models that are based on distributed representations of words and sequences are presented in section 3.4.1. Finally, in section 3.5, we discuss several issues related to learning and using distributed representations of words and sequences.

3.2 Distributed representations of words

In this section, we present some models for learning distributed word representations. These models are grouped into two classes [9]: methods based on matrix factorization and methods based on local context windows.

3.2.1 Matrix Factorization Methods
The matrix factorization (MF) approach is used in factor analysis procedures [139, 140]. One of the most used and simplest factorization methods for text representation is the singular value decomposition (SVD) [141]. In text mining, this method is used to handle the semantic aspects of large text collections and to construct dense representations, as in the LSI [47], LSA [142] and PLSI [143] latent models. These models use large term-document or term-context occurrence matrices to capture statistical information about the different words of a corpus. These matrices are decomposed using the SVD method, in order to capture the most important latent concepts for indexing huge amounts of text data, and hence construct semantic-based dense representations for words and sequences.


Differently from the aforementioned latent space-based methods, other methods such as HAL [144], COALS [145] and HPCA [146] use word-word co-occurrence matrices, where both rows and columns correspond to words and the entries correspond to the number of times a given word occurs in the context (window) of another given word. Recently, Pennington et al. [9] have proposed the global vectors for word representation model, namely GloVe. This model is a word representation learning method that combines the MF technique, using a word-word matrix, with the local context window of every word. This method exploits a global log-bilinear regression model [147], a statistical technique used to analyze the relationship between more than two categorical variables. GloVe leverages statistical information about a text dataset by training the model only on the non-zero elements of the co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows of a large corpus. It combines the global and the local contexts during training to learn the embedded representations of words. This process is highlighted in the objective function of equation 3.1 [9].

£ = \sum_{i,j=1}^{V} f(x_{ij}) \left( e(w_i) \cdot e(w_j) + b_i + b_j - \log x_{ij} \right)^2 \qquad (3.1)

where x_{ij}, e(w_i) and e(w_j) are the co-occurrence count and the embedded representations of the words w_i and w_j, respectively. b_i and b_j are bias values associated with e(w_i) and e(w_j), respectively. V is the dataset vocabulary, and f(x) is a weighting function defined as follows:

f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} \qquad (3.2)

where α is a defined parameter, such that rare and highly frequent co-occurrences are not overweighted.
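A minimal sketch of the GloVe weighting function of equation 3.2 and of one term of the objective of equation 3.1; x_max and α are set to the values reported in [9], and the vectors and co-occurrence count are arbitrary toy values.

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) of equation 3.2."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_pair_loss(x_ij, e_wi, e_wj, b_i, b_j):
    """One term of the GloVe objective (equation 3.1) for a non-zero co-occurrence x_ij."""
    return glove_weight(x_ij) * (np.dot(e_wi, e_wj) + b_i + b_j - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
print(glove_pair_loss(42.0, rng.normal(size=50), rng.normal(size=50), 0.0, 0.0))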

3.2.2 Local Context Window Methods
Differently from the matrix factorization methods, the local context-based approach aims to learn word representations based on an occurrence window called the local word context. The neural network language model (NNLM) [51] is the origin of context-based methods for learning word representations; it is a probabilistic language model where word probabilities are computed using a neural network architecture. The NNLM simultaneously learns distributed representations (embeddings) for the input words and the probability function for the corresponding context windows using these word vectors. The NNLM uses a language model such that, given a sequence s = <w_1 ... w_|s|>, where w_t ∈ V is the word at position t, the objective is to learn a function f that computes the probability of the words of a given context window c = <w_{t-n+1} ··· w_t> of length n in s, using an NN architecture. The NNLM computes the probability of a sequence of words given the current word w_t, using an NN that takes as input the sparse vector of every vocabulary word. The NNLM is trained to optimize a loss function £ for all the words of the sequence1 s and their corresponding context windows c, as defined in equation 3.3.

1In practice, all the context windows of the whole training dataset C are considered.



Figure 3.1: A general architecture of the Word2Vec model [2] to learn word embeddings.

£(\theta) = \sum_{(w_t, c) \in s} \log P(w_t \mid c; \theta) \qquad (3.3)

where θ refers to all the parameters of the NN used to compute the probability P. The results obtained with the NNLM model have led Collobert et al. [148] to use the full symmetric context window of a word in order to learn its representation vector, rather than just predicting its preceding context words. In their model [148], Collobert et al. use a symmetric context window c = <w_{t-n+1} ··· w_t ··· w_{t+n-1}>. Hence, the context of a word w_t refers to the n words before and the n words after it. Following the success of previous NNLM methods, Mikolov et al. [55, 2] proposed an effective embedding method, called Word2Vec, for computing distributed representations of words. Figure 3.1 shows the general architecture of the Word2Vec model, where x corresponds to the NN input vector, h is the hidden layer of dimension dim, and y is the output vector. Two different configurations of this architecture have been adopted in [2]: the CBOW and the Skip-Gram architectures, both based on the architecture of the NNLM. Mikolov et al. adapted several techniques to improve the learning effectiveness and the quality of the resulting representations. The CBOW model [2] is similar to the model of Collobert and Weston [148], and is trained to predict a word w_t based on the words of its symmetric context c. First, the words of the context c are aggregated (sum or average) and fed to the model. After training, the vector representation is provided by the hidden layer h of the NN. Differently from the CBOW architecture, the Skip-Gram model [2] is trained to predict the symmetric context given the word w_t in the center. In both configurations of Word2Vec (CBOW and Skip-Gram), for each word-context pair (w_t, c), the ith element of the output layer of the NN (figure 3.1) is computed as shown in equation 3.4.

y_i = e_{IN}(c) \cdot e_{OUT}(w_t) = E_{IN} \, \vec{c} \cdot E_{OUT} \, \vec{w}_t \qquad (3.4)

where E_{IN}, E_{OUT} ∈ R^{|V|×dim} are the NN connection weight matrices related to the


hidden layer h of size dim, and constitute the embedding spaces. \vec{c} and \vec{w}_t are respectively the context and the current word vectors, with {\vec{c}, \vec{w}_t} ⊂ R^{|V|}. In the CBOW configuration, w_t is the central word to be predicted, and c is its corresponding context containing the n words before and the n words after it. The vectors of these words are aggregated into a single input vector x = \sum_{w \in c} \vec{w} and fed to the NN (figure 3.1). In the Skip-Gram model, the objective is to predict a word w_j from the context window of the central word w_t. Hence, for each central word w_t, the network iterates 2 × n times with the same input x = w_t in order to predict all the words w_j ∈ {w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k}} of the context window [149]. Each predicted w_j is expected to belong to the same semantic context as the previously predicted words. Following several results and analyses of context-based distributed word representations, some authors have been interested in adapting the Word2Vec model [2] to the task being addressed. Examples include cross-lingual embedding models [150, 151, 152], the DESM [153] composed of the two different embedding sub-spaces IN and OUT (figure 3.1), and the relevance-based embedding model [154] for query expansion. In the scope of this thesis, we make extensive use of the previously proposed embedded word representations, specifically Word2Vec [2] and GloVe [9], in order to exploit the semantic aspects of words and sequences in a text.
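A minimal sketch of the score of equation 3.4 for a CBOW-style word-context pair, assuming random (untrained) embedding matrices E_IN and E_OUT, mean aggregation of the context vectors, and ignoring the softmax and negative-sampling machinery used for actual training.

import numpy as np

rng = np.random.default_rng(0)
V, dim = 1000, 50                          # toy vocabulary size and embedding dimension
E_IN  = rng.normal(size=(V, dim)) * 0.1    # input embedding space (E_IN)
E_OUT = rng.normal(size=(V, dim)) * 0.1    # output embedding space (E_OUT)

def cbow_score(context_ids, target_id):
    """Unnormalised score of equation 3.4: e_IN(c) . e_OUT(w_t)."""
    e_context = E_IN[context_ids].mean(axis=0)   # aggregate the context word vectors
    return e_context @ E_OUT[target_id]

print(cbow_score([3, 17, 42, 99], 7))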

3.3 Distributed Representations of Sentences

The distributed word representations have been the basis of several works on text matching applications, such as question answering [155, 156, 157] and document retrieval [158, 159, 160], by exploiting the word-level semantic relatedness between words within different contexts. Several models have been proposed to construct embedded representations for sentences and documents. We can divide the different models into two main categories [136], according to the approach used to generate the embedding of the input text: 1) aggregated [161, 162, 163, 164], where the proposed models construct the embedded vector of a text based on its component word representations and apply aggregation functions to build the sentence representation; 2) non-aggregated [8, 12, 165, 166, 167], where models directly obtain a representation vector of a text sequence without using a summation-based combination of the component word vectors, and a non-linear method is used instead.

3.3.1 Aggregated Representations

Models of this class use a linear function that combines the word vectors of an input sequence to construct its corresponding representation vector. This class includes models that are based on the simple Averaged Word Embedding (AWE) approach [168, 150, 158, 13]. The AWE method consists of using the sum or average of the embedded vector representations of the component words of a given sequence to produce a global vector representation, such that, for a given sequence of words s = {w_1 ... w_m}, the corresponding embedded representation

e(s) is computed as in equation 3.5.

e(s) = \frac{1}{\sum_i \beta_i} \sum_{w_i \in s} \beta_i \, e(w_i) \qquad (3.5)

where β_i is the weight of the word w_i, and e(w_i) its embedded vector. Note that when a simple average function is used, β_i = 1, ∀ w_i ∈ s, and hence \sum_i \beta_i = m.

Several models are based on equation 3.5 to estimate embedded representations for word sequences. For instance, Kenter et al. [161] proposed the Siamese Continuous Bag of Words (Siamese CBOW) model. This method aims at learning embedded word vectors and aggregating them using an average function, in order to construct embedded representations for input sentences. A cosine similarity function is then used to match the different sentence vectors. During training, the model optimizes the word embeddings to better compute the whole sentence representations. The Siamese CBOW [161] is based on a supervised setup, and learns to predict sentences occurring next to each other in the training dataset. Differently from the supervised Siamese CBOW model [161], Hill et al. [162] proposed the Sequential Denoising Auto-Encoder2 (SDAE), built with an LSTM-based encoder-decoder architecture, which is a method for learning sentence embeddings from an unlabelled dataset. This model is trained to optimize the embedded word vectors. The final sentence representation is computed by aggregating the component word vectors using a linear combination. The SDAE model first introduces some noise by swapping some bi-gram word vectors of an input sequence, and is then trained to reconstruct the original sentence. Without using a neural architecture, Arora et al. [163] proposed a strong new baseline for sentence embedding learning. This method consists of a simple weighted average of the word vectors of a given sequence, as described in equation 3.5. The averaged vector representation is then modified using principal component analysis (PCA) [170] or the singular value decomposition (SVD) [141]. The authors first construct all the sentence vectors by averaging their word embeddings, then compute a singular vector for every sentence, in order to optimize the different representations and reduce the sentence vocabulary space. A similar approach was proposed by De Boom et al. [164], also based on averaging weighted word vectors of a given sequence. Given the pair of sentences to be matched, the model first computes initial averaged embedding vectors for both of them using the same word weights. After that, the weights are updated using a gradient descent process, which is expected to minimize the error corresponding to similar sentences, in order to learn better coefficients for the average function. The simple averaged embedding treats all words with the same importance in the combination. Although some works [150, 158] have considered term weighting (equation 3.5), Zamani et al. [165] have argued that aggregating the word vectors of a long sequence could result in an imprecise representation of the corresponding semantic content. Links such as the dependencies and similarities between different words, sequences, and paragraphs of a text are not taken into account.
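Before moving to non-aggregated models, the following minimal sketch illustrates the weighted average of equation 3.5 on arbitrary toy vectors; the IDF-style weights are purely illustrative.

import numpy as np

def awe(word_vectors, weights=None):
    """Averaged Word Embedding (equation 3.5): weighted mean of the component word vectors."""
    word_vectors = np.asarray(word_vectors)
    if weights is None:                            # simple average: beta_i = 1 for every word
        weights = np.ones(len(word_vectors))
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * word_vectors).sum(axis=0) / weights.sum()

rng = np.random.default_rng(0)
sentence_vectors = rng.normal(size=(6, 50))        # 6 word vectors of dimension 50
idf_like_weights = [1.2, 0.3, 2.1, 0.7, 1.0, 1.5]  # illustrative term weights (beta_i)
print(awe(sentence_vectors, idf_like_weights).shape)   # (50,)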

2An autoencoder [169] is a neural network that learns a data representation by reducing the dimension of the inputs. It is mainly made of two parts: an encoder that learns to map the input to a lower-dimensional representation, and a decoder that tries to reconstruct the input data by decoding the representation learned by the encoder.

To cope with this limitation, non-aggregated representation models have been proposed.

3.3.2 Non-Aggregated Representations
Non-aggregated representation learning approaches do not use AWE functions. These methods construct a latent vector for every sentence using a function that learns to map the initial word representations (pre-trained embeddings or sparse one-hot vectors) to a latent representation of the whole sentence. One of the most popular models for computing sequence embeddings is the ParagraphVector of Le [8], which is derived from the previous Word2Vec work [55] on learning embedded word representations. The ParagraphVector model considers a paragraph3 as an atomic unit instead of a combination of its component words. Hence, it extends the Word2Vec architecture with an additional input token dedicated to learning paragraph embeddings. It is based on maximizing the average log-probability of getting an in-context word, given the other words of its context and the sequence containing them. Differently from this model, the Skip-Thought model [12] preserves the semantic compositionality principle4 while learning distributed representations of input sentences. Skip-Thought consists of an encoder-decoder neural network [18, 31]. Its encoder is a recurrent network (RNN) that produces a vector representation of the source sentence, and the decoder is another RNN that sequentially predicts the words of the adjacent sentences. The underlying assumption [12] is that, in the content of a sentence, anything that leads to a better reconstruction of the neighbouring sentences is also essential to the representation of the current sentence. In [172], Ai et al. propose the enhanced paragraph vector (EPV-DBOW) by adapting the original ParagraphVector of [8] to IR tasks. It includes three modifications of the original ParagraphVector: (1) the word frequency in the collection, used in the ParagraphVector, is replaced in EPV-DBOW by the document frequency; (2) to avoid over-fitting short documents, an L2 regularization parameter is used in EPV-DBOW, which takes into account the document length; (3) EPV-DBOW first uses the document to predict the target word, then the target word to predict its context in that document. Another model [167], proposed by Logeswaran and Lee, is called Quick Thought (QT). QT learns sentence representations while learning to predict the context in which a sentence appears, among other contrastive sentences. QT uses an encoder-decoder architecture, where the encoder computes an embedded representation of the input sentence, then the decoder attempts to find the embedded vectors corresponding to its component words. Based on a classical language model, Zamani and Croft [165] have proposed a theoretical framework for estimating query embedded representations based on the individual embedded vectors of all vocabulary words. Specifically, the model provides a way to learn query embeddings using a softmax or sigmoid transformation of the vocabulary word vectors, to compute a maximum likelihood probability estimating the query representation vector.

3The paragraph corresponds to any sequence of words, such as a sentence, a passage or a whole document.
4The Principle of Semantic Compositionality [171] "is the principle that the meaning of a whole is a function only of the meanings of its parts together with the manner in which these parts were combined".

Wu et al. [173] have proposed the word mover's embedding (WME) model for constructing distributed representations for sequences and documents of different lengths. Extending the word mover's distance (WMD) [174], the WME model considers the distance between two documents as the minimum transportation cost between them. This model learns document representation vectors by generating random documents from which the distance is calculated using the WMD. This process is repeated dim times for every document in the dataset, resulting in a vector representation of dimension dim. Recently, Park et al. [175] have proposed a supervised paragraph vector (SPV) for learning distributed representations for words, documents and class labels. The original ParagraphVector model [8] generates a single representation for a given sequence, which is then used for all tasks. Hence, Park et al. [175] assume that different tasks may require different representations. The objective is to propose a task-specific representation learning algorithm which extends the paragraph vector model to a supervised method. The SPV model uses class labels along with words and documents, and obtains the corresponding representations with respect to the particular classification task, enabling it to obtain representations of words, documents, and class labels. SPV [175] uses a simple MLP network of three layers. Each neuron in the input layer refers to one word. The main difference between SPV and the ParagraphVector [8] is that the document class label information is used as input. The learning process of the ParagraphVector assumes that a document vector can be derived by predicting sequences of its constituent words, while the training of the SPV model assumes that a class vector can be derived by predicting sequences of its content.

3.4 Text Matching Using Distributed Representations

The distributed representations of words and sequences are used in order to overcome the weaknesses of the classical BoW representations and of exact-matching models. They enable a non-exact matching where distances represent semantic links between the matched words or sequences. We identify two main ways of using distributed representations of words and sequences: (1) models for direct matching, including classical and neural models that leverage the distributed representations of words and sequences in order to better match texts; (2) models for query expansion, which leverage the embedded word representations in order to find better candidate words for expanding the user query. In this work, we focus more on models of class (1), and describe several models of this class in more detail.

3.4.1 Direct Matching
We refer with "direct matching" to all models using distributed representations for the purpose of text matching, except expansion-based methods. Direct matching methods include: (a) classical text matching models that use either BoW representations [57, 129], word embedding representations [176, 174], or sequence embedding representations [172, 177]; (b) neural text matching models


[16, 178, 179, 180, 21], where most of the proposed models use the distributed representations in the same way: first, distributed representations are constructed for the input texts being matched, based on the vectors of their component words, then an NN with a succession of different layers extracts interaction characteristics and computes the similarity matching score. The neural text matching models are detailed in chapter 4. In this chapter, we consider classical models (a) and we focus on models involving distributed representations of words and sequences. The objective is to analyze how these representations have been used. Embedded representations of words can be incorporated in different ways in classical text matching models. One way is to represent the input texts (e.g. queries and documents, questions and answers) as sets of their word vectors (BoV), then integrate the word representations into existing matching models. For example, the neural translation language model (NTLM) [181] leverages neural embedded word representations to enhance a classical translation model [182] for query-document matching. In the NTLM model, the probability P(q_i|D) that a query term q_i appears in a document D is considered as a translation process where the probability P(q_i|d_j) enables to obtain the term q_i via a term d_j of the document, i.e. every term of the input query is translated to a given term of the document, as described in equation 3.6.

P(q_i \mid D) = \sum_{d_j \in D} P(q_i \mid d_j) \times P(d_j \mid D) \qquad (3.6)

where P(d_j|D) is the probability that the word d_j appears in the document D. The probability P(q_i|d_j) denotes the translation process and is computed using the corresponding word vectors as follows:

P(q_i \mid d_j) = \frac{\cos(\vec{q}_i, \vec{d}_j)}{\sum_{w \in V} \cos(\vec{w}, \vec{d}_j)} \qquad (3.7)

where w is a word from the vocabulary V, and \vec{q}_i and \vec{d}_j are respectively a query word vector and a document word vector.
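A minimal sketch of equations 3.6 and 3.7, assuming toy random embeddings and a uniform P(d_j|D); in practice P(d_j|D) would be estimated from the term statistics of the document.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def p_translate(q_vec, d_vec, vocab_vecs):
    """P(q_i | d_j) of equation 3.7: cosine(q_i, d_j) normalised over the vocabulary."""
    return cosine(q_vec, d_vec) / sum(cosine(w, d_vec) for w in vocab_vecs)

def p_query_term(q_vec, doc_vecs, doc_term_probs, vocab_vecs):
    """P(q_i | D) of equation 3.6: sum_j P(q_i | d_j) P(d_j | D)."""
    return sum(p_translate(q_vec, d_vec, vocab_vecs) * p_dj
               for d_vec, p_dj in zip(doc_vecs, doc_term_probs))

rng = np.random.default_rng(0)
vocab_vecs = rng.normal(size=(20, 50))           # toy vocabulary embeddings
doc_vecs = vocab_vecs[:5]                        # document terms d_j
doc_term_probs = np.full(5, 1 / 5)               # uniform P(d_j | D) for the sketch
print(p_query_term(vocab_vecs[3], doc_vecs, doc_term_probs, vocab_vecs))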

Another way to use the embedded word representations in text matching is to assign importance weights to the word vectors of the input text sequences. For instance, Zheng and Callan [183] have proposed a model for weighting query terms based on their distributed representations. For a given query Q, the authors first compute a feature vector \vec{t}_i associated with every word q_i ∈ Q, such that \vec{t}_i is the difference between the word vector and the mean vector of all the query words, as shown in equation 3.8.

\[ \vec{t}_i = \vec{q}_i - \vec{Q} \tag{3.8} \]
\[ \vec{Q} = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \vec{q}_j \tag{3.9} \]

where \vec{Q} and \vec{q}_i are the averaged query vector and the vector of a query word, respectively. The model then uses an l1-norm regularization method in order to learn an importance weight for the feature vector \vec{t}_i of every query word. First, a true weight r_i is estimated, representing a binary score for every query word based on relevance judgments. Then, a new weight is


computed based on the true weight r_i. The weighted keywords of the query are then used in the Indri5 query language model to rank the documents of the collection. Ganguly et al. [159] propose a generalized language model for text matching based on embedded word vectors. They assume that, in a query-document matching task, the query words that are not used in relevant documents pass through a noisy channel that gives them a new lexical and semantic form. This transformation is expressed using semantic similarities between word vectors. The relevance score is computed with a linear combination of the different word transformations, using words from the document and the collection. Guo et al. [14] proposed a document-query matching model based on cosine similarities between all the query words and all the document words. The matching process is considered as a non-linear transport problem from all document terms to all query terms; hence, the relevance score is computed based on the transport flows from the document to the query.

Word representation models often have poor performance when the matching is performed across all the documents of the collection [184]. When it comes to classical matching models using word embeddings, the analysis made by Mitra et al. [184] has shown that good results in re-ranking tasks do not automatically imply good performance on large document collections. In order to overcome this limitation, more complex processing is needed to represent the information contained in a text sequence. One possible strategy to construct an embedded representation for a whole text sequence is to derive a dense vector from the embedded vectors of its component words. In IR, the query and the document can be compared based on their embedded word vectors using a variety of similarity measures, such as the cosine similarity function [16, 185]. Nalisnick et al. [153] and Mitra et al. [184] have used an AWE method to construct query and document vectors, such that individual words are represented with the Word2Vec [2] embedding model. They analyzed the characteristics of the Word2Vec representations for estimating the relevance of a document according to a query. They highlighted that it is more appropriate to calculate the IN-OUT similarity between words of the query Q and words of the document D. In other words, the query terms are represented in the E_{IN} space and the document terms in the E_{OUT} space (equation 3.4). The relevance of the document w.r.t. the query is then computed using the averaged cosine function of equation 3.10.

\[ DESM(Q, D) = \frac{1}{|Q|} \sum_{q_i \in Q} \frac{\vec{q}_{i,IN}^{\,T}\, \vec{D}_{OUT}}{\|\vec{q}_{i,IN}\|\, \|\vec{D}_{OUT}\|} \tag{3.10} \]
\[ \vec{D}_{OUT} = \frac{1}{|D|} \sum_{d_j \in D} \frac{\vec{d}_{j,OUT}}{\|\vec{d}_{j,OUT}\|} \tag{3.11} \]

where \vec{q}_{i,IN} and \vec{d}_{j,OUT} are respectively the vector representation of the query word q_i in the embedding space E_{IN} and the vector representation of the document word d_j in the embedding space E_{OUT}.
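As an illustration of equations 3.10 and 3.11, the following numpy sketch (ours, not the implementation of [184]) computes the DESM score from two matrices holding the IN vectors of the query words and the OUT vectors of the document words.

```python
import numpy as np

def desm_score(query_in_vectors, doc_out_vectors):
    """Dual Embedding Space Model score (equations 3.10 and 3.11).

    query_in_vectors : (|Q|, dim) array of query word vectors from the IN space.
    doc_out_vectors  : (|D|, dim) array of document word vectors from the OUT space.
    """
    # Equation 3.11: centroid of the normalized OUT vectors of the document.
    doc_norms = np.linalg.norm(doc_out_vectors, axis=1, keepdims=True)
    d_centroid = (doc_out_vectors / doc_norms).mean(axis=0)

    # Equation 3.10: average cosine between each IN query vector and the centroid.
    q_norms = np.linalg.norm(query_in_vectors, axis=1)
    cosines = query_in_vectors @ d_centroid / (q_norms * np.linalg.norm(d_centroid))
    return cosines.mean()
```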

Ai et al. [177] have analyzed the usefulness of the ParagraphVector model [8] in classical text matching models. Unlike in [172], they showed that integrating

5https://sourceforge.net/p/lemur/wiki/The%20Indri%20Query%20Language/

the ParagraphVector model with a traditional LM approach produces unstable performance and limited improvements. Another model that uses a LM is proposed by Peters et al. [186]. In their model, the authors used a bidirectional language model (bi-LM) [187] that combines the forward and backward language models to jointly maximize the likelihood in both directions. Given a sequence s, the bi-LM of s is computed by combining the likelihood of the forward and the backward contexts of every word in s, as shown in equation 3.12.

\[ p(s) = \sum_{i=1}^{N} \left( \log p(w_i \mid w_1, \ldots, w_{i-1}; \overrightarrow{\theta}) + \log p(w_i \mid w_{i+1}, \ldots, w_{|s|}; \overleftarrow{\theta}) \right) \tag{3.12} \]

where \overrightarrow{\theta} and \overleftarrow{\theta} are the LM parameters in the forward and backward directions, respectively. Experimental results have shown that the bi-LM combined with pre-trained word embeddings outperforms the original bi-LM [187], which is based on character-level embeddings.

Recently, Al-Sabahi et al. [188] proposed a modified BM25 model for document summarization using word embeddings. Named MBM25EMB, this model extends the BM25 model [57] with word embeddings to build sentence vectors from their word vectors. Documents are then represented as sets of their sentence vectors, which are computed by averaging their weighted word embeddings. The parameters tf and idf of the original BM25 model [57] are computed differently in the embedding space. Specifically, the frequency (tf) of a given term in a sentence is measured by the maximum cosine similarity value between this term and all other terms in that sentence. The sentences are then weighted according to the MBM25EMB model in order to construct the summary of a long document. The experimental results show that the MBM25EMB model performs better than several state-of-the-art text summarization methods.
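The embedding-based term frequency of MBM25EMB described above can be sketched as follows (a minimal illustration with our own variable names, not the code of [188]); the vectors of the other terms of the sentence are passed as a matrix.

```python
import numpy as np

def embedding_tf(term_vector, sentence_vectors):
    """Embedding-based term frequency: the maximum cosine similarity
    between a term and the other terms of its sentence (as in MBM25EMB)."""
    term_vector = term_vector / np.linalg.norm(term_vector)
    sent = sentence_vectors / np.linalg.norm(sentence_vectors, axis=1, keepdims=True)
    return float(np.max(sent @ term_vector))
```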

3.4.2 Query Expansion

The query expansion process [189] is a reformulation technique that aims to enrich the query content with additional words, usually taken from the vocabulary. The added words, called expansion candidates, are supposed to help the matching model find relevant documents and resolve the ambiguity of short queries. When the embedded word representations are used for query expansion, instead of matching the query and the document directly in the latent space, word vectors are used to find good candidates for expanding the user query. The selection of candidate words is generally based on a global vocabulary, where similarities between word vectors are used to select the right terms; the extended query is then used to retrieve documents with a classical IR model. Several methods [160, 190, 191, 165] have been proposed for estimating the relevance of candidate terms to the query. Most of these methods share a common procedure: first, every candidate term is compared individually to each term of the query using their vector representations, then the scores are aggregated for each candidate. The top k candidate terms are then added to the final query. Some models [192] use additional filtering methods, such as the co-occurrence matrix of the selected candidates with the query words.
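The common selection procedure described above can be sketched as follows (our own minimal illustration; the variable names and the aggregation by summation are assumptions, since the cited methods differ in how they combine the similarities).

```python
import numpy as np

def expansion_candidates(query_vectors, vocab_vectors, vocab_words, k=10):
    """Rank vocabulary words by their aggregated cosine similarity
    to the query terms and return the top-k expansion candidates."""
    q = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
    v = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
    scores = (v @ q.T).sum(axis=1)          # aggregate over the query terms
    top = np.argsort(-scores)[:k]
    return [vocab_words[i] for i in top]
```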


In the context of this thesis, we do not propose expansion-based models. The objective of this paragraph is to highlight the query expansion approach as a particular use-case of the embedded word representations in text matching applications.

3.5 Issues Related to Distributed Representations

In this chapter, we described several models for distributed text representation learning, both at the word granularity level [193, 55, 146, 9] and at the sequence level, where a sequence can be a sentence [8, 12, 165] or an entire document [8, 133, 173]. Using distributed representations in text matching models raises several questions.

First of all, the training of these representations is an important step. The main question is whether to use pre-trained representations or to train the embedding models on our own corpus. Several works [13, 183, 165] are satisfied with publicly available pre-trained embedding models, such as Word2Vec [55] and GloVe [9]. The common argument is that the pre-trained representations are learned on sufficiently large corpora and that the vocabulary is sufficiently rich. Besides, training a distributed representation method is in itself a complicated task given the number of parameters to be tuned, depending on the embedding model that is used, such as the word context window size, the embedding space dimension, and the algorithm to be used (Skip-gram, CBOW, GloVe, ...). In general, a context window size between 5 and 10 is taken, as suggested by Mikolov et al. [55, 52], and the embedding dimension varies from 200 to 400, where the most used dimension is 300. Other authors prefer to train the representation models on the target corpus [159, 14, 191], on a general large text collection [176, 194] or, more specifically, on a domain-specific corpus [195, 15]. This option leads to learning new embedding spaces, and one has to deal with the issues related to embedding training, more specifically the dataset pre-processing and the architecture of the NN that is used to learn these representations. In [196], it was found that data pre-processing, such as word lemmatization, before training word embedding models is effective, and helps capture semantic similarities, especially for rare words. Frej et al. [186] presented a detailed empirical study of how the choice of the neural architecture (e.g. LSTM, CNN, or self-attention) influences both the end task accuracy and the qualitative properties of the learned representations.

Second, another important issue is related to rare words in the corpus. Indeed, to train the word embedding models [55, 9], a minimum word frequency is used. Hence, rare words in the dataset are not present in the learned space. These words, called out-of-vocabulary (OOV) words, cause a problem when they appear in the text sequences being matched, such as a user query or a question, and/or in relevant results, such as documents and answers. To solve the OOV problem, some methods [13, 153] simply ignore these words, while other methods [176, 14] consider them as important matching signals and define a specific token for every OOV word. These tokens are represented with a random embedding vector and used in the matching process.
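As an illustration of this second strategy for handling OOV words, the following sketch (ours; the scale of the random vectors is an arbitrary choice) assigns a dedicated random vector to every unknown token of a sequence.

```python
import numpy as np

def lookup_with_oov(words, embeddings, dim=300, seed=42):
    """Return one vector per word; OOV words get a dedicated random vector
    that is reused whenever the same unknown token reappears in the sequence."""
    rng = np.random.default_rng(seed)
    oov_cache = {}
    vectors = []
    for w in words:
        if w in embeddings:
            vectors.append(embeddings[w])
        else:
            if w not in oov_cache:
                oov_cache[w] = rng.normal(scale=0.1, size=dim)
            vectors.append(oov_cache[w])
    return np.stack(vectors)
```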


3.6 Discussion

The distributed representations of words and sequences have impacted several works in text matching applications. These representations have improved retrieval results when used in some classical models [159, 14, 191, 13, 183, 165], but the performances remain limited. Several studies [197, 198] have evaluated the impact of using distributed representations of words in different text matching tasks. These studies have argued that word embeddings do not solve lexical ambiguity problems, such as polysemy, since all the meanings of the same word are represented by the same vector, regardless of its different contexts. Besides, context criteria alone are not sufficient to compute the matching characteristics of two different texts in text matching applications. The similarities computed with these representations may lead to an inappropriate exploitation of certain semantic relatedness. For example, Mrkšić et al. [199] have shown that, with GloVe [9] vectors, the word "cheaper" is among the closest words to the word "expensive"; in this case, this similarity is not appropriate for finding words to expand a user query about, for example, "cheap apartments for rent".

Recent models [200, 201] consider contextual word representations, such as ELMo [187] and BERT [104], but these representations rely on task-specific architectures that include the pre-trained representations as additional features [187], or require fine-tuning [104] on the target corpus to perform well on the task being addressed. Analyses made by Tenney et al. [202] and Qiao et al. [203] have shown several limitations related to the use of these representations. For instance, in [202], the empirical analysis has shown that contextualized embeddings outperform their non-contextualized counterparts more on syntactic tasks than on semantic tasks, suggesting that these embeddings encode syntax more than higher-level semantics. In [203], Qiao et al. have evaluated the performance of the contextualized word representations learned with the BERT model [104], and showed the effectiveness of these representations in a question-answering focused passage ranking task and in interaction-based sequence matching models. However, this analysis [203] has also revealed that on TREC ad-hoc ranking, the evaluated ranking models using BERT representations, even further pre-trained on a large ranking dataset, perform worse than classical learning-to-rank models and several neural text matching models that are pre-trained on user clicks. Furthermore, in [200] the contextual embeddings yield large ranking performance improvements, but they come with a considerable cost at inference time when incorporated in a document retrieval model; classical word embeddings, in particular GloVe [9], are shown to be more efficient in terms of time than BERT [104] and ELMo [187].

Distributed representations of sequences [8, 133, 173, 12, 165] are also limited in terms of performance. The exploitation of these representations with classical models [172, 153, 184] sometimes helped to improve the results. However, the latent features of the sentence vectors are not explicit and do not refer to concrete information about the dataset, unlike the classical tf-idf signals where every feature refers to a significant statistical aspect of words and documents of the collection, such as word frequencies, discrimination power, importance weights, and so on.
In [177], Ai et al. have analyzed the usefulness of the ParagraphVector model [8] in classical IR models. They have shown that this sequence embedding model is not suitable for representing long documents, since it tends to over-fit short documents, which leads to privileging short documents when the ParagraphVector representations are used in a retrieval model. Besides, AWE-based sequence representations are not suitable for representing the semantic information inside long sequences (paragraphs and documents), and can produce an insignificant vector representation due to aggregating a large number of semantic word representations [165]. In order to cope with the limitations of distributed representations of words and sequences used in classical IR models, it would be preferable if the matching model could automatically extract matching signals and then combine them, in a way that structural and semantic information could be leveraged.

3.7 Conclusion

In this chapter, we have discussed several text representation models that are used to represent single words and sequences. Throughout this chapter, we have seen how text representation approaches have evolved in order to better represent different texts. In particular, embedded word vectors have been used in order to leverage non-exact matching information. Different models for embedding sequences of words and whole documents have been proposed. The aim of these models is to better represent semantic information in a word sequence. However, these representations have shown limited performance when used with classical matching models, and more powerful methods are needed in order to build text representations and better perform the matching function.

CHAPTER 4

DEEP LEARNING IN TEXT MATCHING APPLICATIONS

4.1 Introduction

Neural networks are known for their ability to build, in a latent space, distributed representations that are able to capture the semantics of different types of information (e.g. images, sound, text). In recent years, deep neural networks have led to remarkable progress in several research areas such as speech recognition, computer vision, and text mining tasks, including several IR tasks [11]. In text processing, neural models are used to construct distributed (embedded) representations of words and sequences, as well as to perform the matching process.

We have seen, in chapter 3, that the distributed representations of words are not capable of solving several linguistic issues, such as word polysemy [197, 198], and that performances in document ranking are limited due to several factors related to the training of these representations [204]. Although distributed representations of sequences and documents [8, 133, 173] have been proposed, these methods struggle to handle the semantic information when comparing different texts [177]. The text representations are generally learned independently from the matching task. Recent work in text matching [31, 168, 179, 166] tends to use neural approaches to learn end-to-end models that better perform the task being addressed. In these models, features are automatically learned by a NN structure and little or no feature engineering is needed. This differs from the models described in the previous chapter 3, which do not rely on relevance characteristics for learning the matching function. Deep neural networks (DNN) are used in matching applications in order to overcome some issues related to the classical learning to rank (LTR) framework [61, 205], where an important human effort is needed in order to prepare a set of matching features. Hence, the basic objective of using DNNs is to enhance those methods and reduce human intervention, such as feature engineering, in different ranking systems.

In this chapter, we will first discuss a set of basic aspects of machine learning and their use in text matching applications, mainly the LTR method and some related issues (section 4.2). We will then discuss the use of DNN models in different text processing tasks. Specifically, we will present (section 4.3) a set


Notation   Description
X          input space of a model
x_i        one input example
Y          output space (true labels)
y_i        an output label corresponding to an input x_i
∆          labeled dataset containing pairs (x_i, y_j)
Φ          a matching model (function)
F          features space (set of characteristic functions)
£          loss (objective) function
ϕ          representation function

Table 4.1: Set of notations.

of deep models proposed for text matching applications. Furthermore (section 4.4), we will discuss the main issues related to the use of DNN architectures in text IR and different text matching applications in general. To do so, let us consider the notation previously defined in table 3.1 and extend it with some additional elements that are needed in this chapter.

4.2 Machine Learning for Information Retrieval

Machine learning (ML) has been used in IR applications since the 1980's [206]. In this section, we give a brief overview of the way machine learning methods have been used in IR for ranking. In general, ML algorithms are used in two main ways in IR: (1) for text representation learning [55, 12], where the input representations are optimized during training; (2) for matching [207, 185], where a ranking function is learned from handcrafted features. Most ML models for IR are focused on learning ranking functions [208, 209, 210, 211, 205, 212], and are referred to as Learning To Rank (LTR) models.

Let F be a feature space that contains a set of predefined feature functions (tf, BM25, document's section...)1. In the LTR framework, the input space corresponds to X ⊂ R^{|F|}, where every element x_i ∈ X is a vector of weighted features. This vector gives weights, computed by the different feature functions of F, for the text sequences being matched (e.g. query-document or question-answer). Let y_i ∈ Y refer to the true relevance label (e.g. relevance judgment). The LTR model is trained to optimize the function Φ, called the hypothesis, in order to learn the parameters θ that best combine the features of F w.r.t. every input x_i. In IR, the input space X contains features computed for different text pairs (e.g. document-query), and the true label y_i corresponds to the relevance judgment.

1Some features used in the LTR dataset LETOR [213]


4.2.1 LTR Algorithms

Broadly, all existing LTR methods can be grouped [207] into three approaches: the pointwise approach, the pairwise approach, and the listwise approach. In the following, we consider the query-document matching task and describe the different LTR algorithms accordingly. To do so, consider a query Q and a set of candidate documents {D1, ..., Dm}. Let Φ be the LTR model to be trained, and Y the output space.

• Pointwise. In the pointwise approach [214, 215, 216], the model's input is a feature vector x_i corresponding to a pair (D_i, Q), and y_i ∈ Y is the relevance label that the model Φ is trained to approximate, such that Φ(x_i) = y_i. Note that the pointwise approach does not consider the interdependency among the returned documents [61], and thus the position of a document in the final ranked list is invisible to its loss function.

• Pairwise. In the pairwise approach [217, 210, 218], Φ takes a pair of feature vectors, x_i and x_j, corresponding to (D_i, Q) and (D_j, Q), respectively, such that D_i is more relevant to Q than D_j. Hence, the model Φ is trained to learn a preference function f(x_i) ≻ f(x_j), where ≻ refers to the preference and f computes the relevance score of a document based on its feature vector. In this case, the loss function measures the inconsistency between the true preference and the model's preference, and the objective is to rank the relevant document above the irrelevant one. However, this approach processes document pairs related to the same query independently: for a set of feature vectors {x_i, x_j, x_t, x_z} corresponding to a set of documents {D_i, D_j, D_t, D_z} related to the same query Q, where D_i is preferred over D_j and D_t is preferred over D_z, the pairs (x_i, x_j) and (x_t, x_z) are handled independently and we cannot figure out which is the preferred document in (D_i, D_t).

• Listwise. The listwise models [209, 219, 220] consider the entire set of documents {D1, ..., Dm} related to Q. Hence, the model's input corresponds to a set of feature vectors {x1, ..., xm}. The model Φ is then trained to approximate the true labels {y1, ..., ym}, each corresponding to a relevance score associated with a document D_i w.r.t. the query Q.

The aforementioned LTR algorithms differ in their learning objectives, and every objective function takes into account the application’s purpose.
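To make the difference between these objectives concrete, the following minimal numpy sketch (ours; a linear scoring function x·w is assumed purely for illustration) contrasts a pointwise loss on a single feature vector with a pairwise hinge loss on a preference pair.

```python
import numpy as np

def pointwise_loss(w, x, y):
    """Squared error between the predicted score of one (query, document)
    feature vector x and its relevance label y."""
    return (x @ w - y) ** 2

def pairwise_hinge_loss(w, x_pos, x_neg, margin=1.0):
    """Hinge loss asking the relevant document to be scored at least
    `margin` above the irrelevant one for the same query."""
    return max(0.0, margin - (x_pos @ w - x_neg @ w))
```

Listwise objectives, by contrast, are defined over the whole set of candidate documents of a query rather than over isolated points or pairs.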

4.2.2 Related Issues

The LTR models have already achieved great success in many IR applications [214, 215, 219, 210, 208, 211, 212], mainly in modern Web search engines like Google2 or Bing3. In the following, we discuss two main issues to cope with when designing an LTR model:

2http://google.com 3http://bing.com


• Features engineering. A first aspect is feature selection and filtering. Well-chosen features can help improve the efficiency of training the LTR models [221]. However, their preparation is a difficult process, since they are hand-crafted. Indeed, the major obstacle in LTR is the availability of high quality relevance features. Most existing LTR models rely on hand-crafted features, whose construction is time consuming and involves an important human effort. Specifically, one first needs to select a set of important features, then compute their weights (manually or automatically) for every example in the dataset. This process is often data-specific, and could require domain experts in order to perform the feature selection in some cases (e.g. describing scientific documents). Although several studies [222, 221] have proposed different automated feature selection methods, feature engineering is still an expensive step and requires complicated procedures.

• Training objective. In [61], Liu et al. have discussed several issues about the objective functions used in different LTR algorithms. The LTR methods for IR use ML techniques in order to perform the ranking of relevant documents. However, the loss functions used by these models, mainly in the pairwise and listwise approaches, take into account only part of the ranking objective in IR. To deal with this limitation, Liu et al. [61] suggested adopting loss functions based on the true IR evaluation measures, such that this true loss for ranking is defined at the query level and considers the position of relevant results. The notion of statistical consistency further discusses whether the minimization of the expected risk defined with the surrogate loss function can lead to the minimization of the expected risk defined with the true loss function.

To cope with these issues, particularly feature engineering, it would be of great value if the LTR ranking model could, on the one hand, automatically learn useful relevance features and, on the other hand, learn a relevance function capable of accurately identifying the relevant items within a diverse and noisy dataset. Deep learning models [77, 76, 223] provide a set of powerful computational methods that offer more potential for learning several complicated text matching tasks than traditional shallow models. Essentially, these models enable learning efficient abstract representations from raw data [224]: input data samples are represented by distributed structures, whose real values correspond to weights of latent features that are automatically captured by the NN. Hence, these models can be used as alternatives in order to avoid the feature engineering problems of the traditional LTR framework. Due to their potential benefits, and along with the expectation that similar successes with deep learning could be achieved in IR [225], DNNs are used to develop several deep models that handle different tasks in text representation learning and matching. In the following section, we will discuss how deep models are used to handle text matching applications, and present a set of deep learning models for text matching applications.


4.3 Deep Learning for Text Matching

DNN models have been used in diverse IR and NLP tasks. This interest is motivated mainly by the accurate learning of distributed representations, such as the distributed vectors of words [55, 9] and their structured arrangement [8, 12] for natural language expressions such as sentences and documents, and by the effective use of these representations in matching tasks [184, 185, 178, 226]. Several NN models have been proposed to build end-to-end text processing models. These models have been discussed in multiple tutorials and surveys [227, 228, 11], where the neural models use neural networks in order to automatically extract different matching signals and better combine them. These models are used in several text matching applications, for instance:

• Ad-hoc Search, such as document retrieval [180, 110, 179] and passage retrieval [229].
• Question Answering applications, such as answer sentence selection [230, 231, 232] and community question answering [233, 234].
• Classification tasks, such as document classification [235, 25] and paraphrase identification [168, 236].

The different neural models for text matching applications are divided into two main groups [178], specifically representation-focused models and interaction-focused models.

4.3.1 Unified Model Formulation

Before describing models of the different groups, let us first define a unified model formulation, adapted from [11], that we further use to describe the different neural matching models. Consider the framework described in figure 4.1, where a matching model is composed of two main parts: the representation part builds distributed representations of the input sequences, and can use several models described in section 3.3; the matching part compares the representations built in the representation part.

We first define the input space of a neural model for text matching as X = S^{(Q)} ∪ S^{(D)}, where S^{(Q)} is the generalized query space, containing the first sequence of the pair of sequences to be matched (e.g. query or question), and S^{(D)} is the generalized document space, containing the second sequence of the pair (e.g. document or answer). In the input space X, for every generalized query sequence s_i^{(Q)} ∈ S^{(Q)}, we define T_i = {s_{i1}^{(D)}, ..., s_{in_i}^{(D)}} ⊆ S^{(D)}, a set of n_i sequences. Let y_i = {y_{i1}, ..., y_{in_i}} be the set of true labels corresponding to the input example s_i^{(Q)} and the set T_i, such that y_{ij} is the relevance label corresponding to the input s_{ij}^{(D)} w.r.t. s_i^{(Q)}. The objective of a neural model for text matching is to learn the optimal model Φ* by minimizing a set of prediction errors. These errors are computed using a defined loss function £, comparing every true label y_{ij} to the predicted value ŷ_{ij}, as described in equation 4.1.

\[ \Phi^* = \arg\min_{\Phi} \sum_i \sum_j £\left(\Phi;\ s_i^{(Q)}, s_{ij}^{(D)}, \hat{y}_{ij}, y_{ij}\right) \tag{4.1} \]


Figure 4.1: General framework describing a unified neural model for text matching.

where the estimated output label ŷ_{ij} is computed by the matching model Φ, which can be abstracted as in equation 4.2.

\[ \Phi \stackrel{\text{def}}{=} g\left(\varphi(s_i^{(Q)}), \varphi(s_{ij}^{(D)})\right) = \hat{y}_{ij} \tag{4.2} \]

where ϕ is a representation function that extracts features from a raw text input, and whose inner function ψ combines the embedded word representations e(w_t) of an input sequence s = <w_1 ... w_{|s|}>:

\[ \varphi(s) = \psi\left(e(w_1), \ldots, e(w_{|s|})\right) \tag{4.3} \]

where e : V → E is an embedding function that maps every word of a vocabulary V to a real-valued structure in the embedding space E4. Hence, ψ is a representation function of input sequences, which can be aggregation-based (section 3.3.1) or not (section 3.3.2).
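As an illustration of the abstraction in equations 4.2 and 4.3, the following PyTorch-style sketch (ours; the embedding, GRU and cosine layers are placeholder choices, not prescribed by the formulation) wires an embedding function e, a representation function ϕ and a matching function g together.

```python
import torch
import torch.nn as nn

class UnifiedMatcher(nn.Module):
    """Schematic instance of Phi(s_Q, s_D) = g(phi(s_Q), phi(s_D))."""
    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.e = nn.Embedding(vocab_size, dim)          # embedding function e
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # representation function (psi/phi)
        self.g = nn.CosineSimilarity(dim=-1)            # matching function g

    def represent(self, token_ids):
        # phi(s) = psi(e(w_1), ..., e(w_|s|)); here the last GRU state is used.
        _, h = self.rnn(self.e(token_ids))
        return h[-1]

    def forward(self, query_ids, doc_ids):
        return self.g(self.represent(query_ids), self.represent(doc_ids))
```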

4.3.2 Representation-focused vs Interaction-focused

In this part, we describe and compare a set of representation-focused models [16, 237, 238, 239, 240, 241] and interaction-focused models [178, 242, 19, 179, 17]. To describe the main differences between representation-focused and interaction-focused models, we consider two general architectures describing the models of each class, as shown in figure 4.2. In the following, we highlight the different components of each framework.

Representation-focused Models

Figure 4.2a illustrates a general framework for representation-focused models. Models of this class attempt to learn latent patterns that best represent the

4 In case of embedded word vector representations, E ⊂ R^{dim}, where dim is the dimension of the embedding space.


(a) The representation-focused framework
(b) The interaction-focused framework

Figure 4.2: General architectures showing the main differences between the interaction-focused models and the representation-focused models.

input sequences. In the representation part, the function ϕ computes distributed representations ϕ(s_i^{(Q)}) and ϕ(s_{ij}^{(D)}) corresponding to the input sequences s_i^{(Q)} (e.g. query or question) and s_{ij}^{(D)} (candidate result, e.g. document or answer), respectively, as shown in equation 4.4:

\[ \varphi(s) = \phi\left(\psi\left(e(w_1), \ldots, e(w_{|s|})\right)\right) \tag{4.4} \]

where e(w_t) is an embedded word vector of the sequence s. The representations φ(s_i^{(Q)}) and φ(s_{ij}^{(D)}) are then fed to the matching function g, which can be a NN (e.g. CNN or RNN) or a simple matching function (e.g. cosine). φ combines the lower-level features computed by the function ψ, which computes lower-level (e.g. letter-level or word-level) features.

Several existing neural models for text matching can be put in this category. Huang et al. [16] propose a deep structured semantic model (DSSM) for ad-hoc web search. The DSSM network consists of two symmetric deep branches whose parameters are shared for both input sequences, specifically a query Q and a document D. All hidden layers of the DSSM model are stacked MLPs, used to compute intermediate semantic representations. Based on the general representation-focused model of figure 4.2a, in the DSSM model the embedding space E contains a set of representation vectors learned at the letter level, called word hashing [16]. This hashing aims to reduce the dimensionality of the initial BoW term vectors that use all the dataset vocabulary. The intermediate representation function ψ corresponds to a concatenation of the different letter n-gram vectors of an input sequence (document or query). The representation function φ corresponds to the set of stacked MLP layers used to compute the semantic representations of input sequences [16], while the matching layer g corresponds to the cosine similarity function.

The work of Shen et al. [237] extends the DSSM model by introducing a convolutional neural network (CNN) into the DSSM architecture. In this model, called Convolutional Latent Semantic Model (CLSM) or convolutional DSSM (C-DSSM), instead of having a simple MLP layer in the representation function φ, a convolution layer is used to map the sparse initial representations computed by ψ into semantic-based representations. The convolution layer transforms the tri-grams of the different words in the input sequences into a latent vector, then a max-pooling layer is used to extract the most significant local characteristics and construct fixed-length global vectors for the input sequences.

In other models, the representation function φ (figure 4.2a) can be a CNN with multiple convolution and pooling layers, as in the convolutional LTR model of Severyn and Moschitti [238] and the ARC-I model architecture of Hu et al. [17]. For instance, in the convolutional LTR model [238], the function ψ first concatenates the different word vectors, then the function φ computes the embedded representation corresponding to every input sequence using a CNN layer. In the matching part, a first similarity signal, called x_sim, is computed using a learnable matching matrix M. This signal is aggregated, in the same feature vector X = [ϕ(s_i^{(Q)}) || x_sim || ϕ(s_{ij}^{(D)}) || x_tfidf], with the computed representations, where x_tfidf are tf-idf features relating the input sequences and || refers to a concatenation. The resulting vector X is then fed to the matching function g, which corresponds to an MLP layer, to build an LTR-like classifier.
In other cases, the representation function φ (figure 4.2a) is made of a recurrent NN. For instance, Sutskever et al. [78] proposed an end-to-end model that learns to predict a target sentence, supposed to be similar to a given input sentence represented by its component word embeddings, using an LSTM encoder-decoder model. The MV-LSTM model of Wan et al. [3] uses bi-LSTM layers in the function ϕ to learn position-based representations for every input sequence. The Skip-thought model [12] uses GRU5 layers to learn the sequence-to-vector mapping function ϕ, by predicting the forward and backward sentences of a given current sentence. Some recent recurrent-based models [243, 241] combine representation signals computed by the hidden states of the recurrent networks of function ϕ with external signals to strengthen the final representations. For example, in the RNN-Similarity model [243], Kamath et al. proposed to integrate sequence-level features, such as expected answer types (e.g. location, human, entity or number), in the field of answer sentence selection. These features are based on the semantic information of the input question, and are used in order to learn the matching as well as the representation functions of the final model. Differently from the RNN-Similarity model [243], in the DRCN model proposed by Kim et al. [241], each layer of the representation learning branches (figure 4.2a) provides a concatenation of attention-based features as well as hidden features of all the preceding recurrent layers, in order to preserve the original and the attention-based information about the input sequences. All the information, from the bottommost word embedding layer to the uppermost recurrent layer, is concatenated in the same feature vector and fed to a fully connected layer to compute the final matching score. Furthermore, recurrent and convolutional layers can be combined [240, 244] in the same representation function φ, capturing context dependencies with the convolutional layers while taking into account the sequencing of information with the recurrent layers.
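For the representation-focused framework of figure 4.2a, the following sketch (ours, not the implementation of [16]; the layer sizes and the trigram dimension are illustrative) shows the typical structure: a shared tower maps each input to a dense semantic vector and the cosine similarity acts as the matching function g.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSSMLikeTower(nn.Module):
    """DSSM-flavoured sketch: one shared MLP tower maps a sparse
    letter-trigram count vector to a low-dimensional semantic vector."""
    def __init__(self, trigram_dim=30000, hidden=300, out=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(trigram_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, out), nn.Tanh(),
        )

    def forward(self, x):
        return self.mlp(x)

def dssm_score(tower, query_trigrams, doc_trigrams):
    # The same tower (shared parameters) encodes both inputs; cosine is g.
    q, d = tower(query_trigrams), tower(doc_trigrams)
    return F.cosine_similarity(q, d, dim=-1)
```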

Interaction-focused Models

In the interaction-focused framework, the model attempts to learn different matching features given the initial distributed representations of the input sequences being matched. The objective in this framework is to make the inputs interact earlier in the model [178], in order to extract matching signals at a lower level compared to the representation-focused models. The general framework that illustrates the architecture of such models is presented in figure 4.2b, where the representation part computes initial representations of the input sequences by combining their component word vectors in the function ψ. In the matching part, the layer ⊗ refers to the function that extracts interaction features at the word level, and the function ξ refers to the subsequent layers that compute relevance features of higher levels. Hence, the matching function g can be redefined as in equation 4.5.

\[ g\left(\varphi(s_i^{(Q)}), \varphi(s_{ij}^{(D)})\right) = \xi(M) \tag{4.5} \]

where M is a word-level matching matrix, computed as in equation 4.6, which varies depending on the different models of the state-of-the-art, and ξ is an interaction function applied to the word-level signals, as highlighted in figure

5GRU is a recurrent architecture that, differently from the basic LSTM, uses a gating mechanism to track the state of a given sequence without using separate memory cells [31].


4.2b:

\[ M = \otimes\left(\varphi(s_i^{(Q)}), \varphi(s_{ij}^{(D)})\right) \tag{4.6} \]

where ⊗ is a matching function (e.g. cosine).

Three different types of NN architectures can be used to learn the interaction function ξ of figure 4.2b. In case a CNN model is used [17, 19], the objective is to learn contextual features. Hence, a sliding window c^{Q×D} is defined and moved over the matrix M, corresponding to a combination of two sliding windows, c^{(Q)} and c^{(D)}, considered simultaneously in the input sequences s_i^{(Q)} and s_{ij}^{(D)}, respectively. In case an MLP architecture is used [178, 245], the objective is to learn different interactions between the word-level matching features. Finally, in case a recurrent architecture (RNN) is used [27, 246], the objective is to learn sequence- and structure-based features; the matrix M is then considered as an ordered sequence of information. Regardless of the function ξ, the input texts can be represented in different ways, including classical one-hot word vectors [179], pre-trained embedded representations [17], or representations that are updated (learned) during the model's training [19].

One of the models that use a CNN architecture to learn the function ξ is the ARC-II model proposed by Hu et al. [17]. This model uses an interaction matrix M computed using a product scheme between the input query and document matrix representations. These representations are constructed by accumulating the component sequence word vectors. The intuition is to capture the interactions between words in a latent space rather than the word-word co-occurrences in the whole corpus. This intuition is supported by the effective performance of ARC-II [17] on different text matching tasks, such as sentence completion, question answering and paraphrase identification. In [179], Mitra et al. emphasize the importance of lexical matching in deep neural models for IR. They showed that representation-focused models tend to perform worse when faced with rare terms. They also argued that web search requires both exact and non-exact (semantic-based) matching. Indeed, this type of search involves several rare terms in some queries [179]; these terms are not necessarily present in the training collections of the latent representation spaces, and exact matching will be more accurate to find the appropriate documents. Based on this motivation, the authors proposed a neural IR model, called the DUET architecture, that incorporates both lexical and semantic matches. The DUET model consists of two parallel parts: the local model for exact matching, which seeks to learn interaction signals between two texts; and the distributed model for semantic matching, based on the latent representation of texts. Final matching scores are computed by aggregating the scores given by both the local and the distributed parts of the DUET model [179].

Some other models define the interaction function ξ using an MLP network. For example, the DeepMatch model proposed by Lu and Li [242] applies a deep MLP to the interaction matrix M corresponding to a word-word co-occurrence matrix. This model considers hierarchical decisions for matching at different levels of abstraction. Local decisions, which capture the interaction between semantically related words, are combined through the different hierarchical layers of the NN to learn the global matching decision. The DeepMatch model is trained [242] using triplets of sequences (s_i^{(Q)}, s_{ij}^{(D)+}, s_{ij}^{(D)-}),


where s_{ij}^{(D)+}, s_{ij}^{(D)-} ∈ T_i, and s_i^{(Q)} is more similar to s_{ij}^{(D)+} than to s_{ij}^{(D)-}. The objective is to maximize the similarity values computed by g (equation 4.5) for the positive pairs versus the negative ones, as described in the hinge-loss function of equation 4.7.

\[ £\left(s_i^{(Q)}, s_{ij}^{(D)+}, s_{ij}^{(D)-}\right) = \max\left(0,\ \epsilon + g\left(\varphi(s_i^{(Q)}), \varphi(s_{ij}^{(D)-})\right) - g\left(\varphi(s_i^{(Q)}), \varphi(s_{ij}^{(D)+})\right)\right) \tag{4.7} \]

where ε = 0.1 is a parameter used to control the training margin.

In [178], Guo et al. proposed a deep relevance matching model (DRMM). They show that the deep learning methods built for semantic matching are not well suited to ad-hoc retrieval, which basically concerns relevance matching rather than semantic matching. Based on this main difference, the DRMM model computes the word-word interactions between the input sequences, using the matching function ⊗ (equation 4.6) corresponding to a cosine similarity, then the word-level matrix M is transformed into histograms. Every matching histogram groups local interactions according to their signal strength levels rather than their positions.

In other cases, the matching function g (equation 4.5) mainly uses a recurrent NN. For example, in the work of Wan et al. [246], the authors propose the Match-SRNN model for position-based matching between input sequences. In the Match-SRNN model, given the input sequences s_i^{(Q)} and s_{ij}^{(D)}, a word-level matching tensor M is computed using the ⊗ function defined in equation 4.6, where every element m_{tk} in M corresponds to the interaction at position t of s_i^{(Q)} and position k of s_{ij}^{(D)}. This interaction is computed based on a combination of the prefix and the current word of every sequence. The matching matrix M is then fed to the interaction function g, which corresponds to a spatial RNN followed by an MLP layer to compute a final matching score.

Another RNN-based architecture is proposed by Fan et al. [27]. The authors highlighted that the main limitation of document-wide models is that they introduce a competition between long and short documents, while passage-based models leverage simplified aggregation strategies of local signals but cannot well capture complex relevance patterns. To overcome these limitations, the authors proposed a Hierarchical Neural Matching model (HiNT), which consists of two stacked components [27]: the local matching layer, which uses an RNN architecture to compute passage-level matching patterns; and the global decision layer, which uses a second RNN architecture that performs interactions between the different passage-level signals to compute the document-level relevance features. The objective of the HiNT model is to assess, automatically, the relevance of long documents at the right information granularity: the word level, the passage level or the whole document. The local layer applies a spatial GRU [246] to query-passage matching matrices, then the global decision layer uses a hybrid network architecture to select signals from both the passage and the document level and compute the final matching score.
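As an illustration of equation 4.6 with a cosine matching function ⊗, and of the DRMM-style histograms described above, the following numpy sketch (ours; the number and layout of the bins are illustrative, not those of [178]) builds the word-level interaction matrix and groups each query word's similarities by strength.

```python
import numpy as np

def interaction_matrix(query_vectors, doc_vectors):
    """Equation 4.6 with a cosine matching function: M[t, k] is the cosine
    between query word t and document word k."""
    q = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return q @ d.T

def matching_histograms(M, n_bins=30):
    """DRMM-style histograms: for each query word, count the document-word
    cosines falling into fixed similarity bins over [-1, 1]."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    return np.stack([np.histogram(row, bins=edges)[0] for row in M])
```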

4.3.3 Attention-based vs Position-based

In some particular cases, specific assumptions about the input elements can be considered in the representation learning functions: specifically, the attention

aspect, which assumes that some elements are likely to be more important than others in a given sequence; and the position of these elements in that sequence. Hence, we distinguish the attention-based models and the position-based models, among others that we refer to as general-features-based models.

Attention-based Models

The "attention" concept, originating from machine translation [31], has brought a significant gain in several NLP applications, including sentiment classification [25, 26] and paraphrase identification [236]. Attention-based models identify the core information to be considered in a given sequence and allow focusing on some discriminative elements. The main idea is as follows: given an input sequence s = <w_1 ... w_{|s|}>, the attention model is expected to learn a coefficient vector α that determines how much attention should be given to each element w_t; the elements of s are then weighted accordingly. Hence, the attention vector α = [α_1, ..., α_{|s|}] is used in order to perform an optimal weighting process according to the task to be performed, where every weight value α_t represents the amount of attention to be given to the word at position t of the sentence. These values are computed [157] as described in equation 4.8:

\[ \alpha_t = \frac{\exp(V^T w_t)}{\sum_{t'=1}^{|s|} \exp(V^T w_{t'})} \tag{4.8} \]

where V is a model parameter representing the attention coefficient vector for the input sentence, and V^T its transpose.

Several authors have proposed attention-based models for text matching. For instance, Zichao Yang et al. [25] propose a hierarchical attention network for document classification. The model combines a word-level and a sentence-level attention in a recurrent model architecture. The input text is first represented using word embeddings, then a first bidirectional GRU (bi-GRU) layer is applied to compute a representation vector for the input text. This vector is then passed through an MLP layer in order to construct an intermediate representation vector that is used to learn the attention coefficients of the different words. A second bi-GRU layer is used to compute the sentence-level representation and learn the corresponding attention coefficients; the same process as at the word level is repeated at the sentence level. The final score is then computed based on these representations. Liu Yang et al. [157] propose an attention-based neural matching model (aNMM) for question-answer matching. The input question and answer sequences are represented by their embedded word vectors, then an interaction matrix is computed using cosine similarities. The aNMM uses a neural architecture with a value-shared weighting scheme and a gating function, where the question words are used to compute attention scores for the final matching signals. The shared weighting scheme [157], called bin-sum, aims at using the same connection weights for the nodes of the fully connected layer whose signals fall in the same interval of matching signal strength. Recently, Peng and Liu [22] proposed an attention-based convolutional neural model for short text matching. The model is made of two modules that work in parallel: the word level and the phrase level. The model combines the word overlap features of the input text sequences, represented as word vectors, with the features learned by the NN, to compute the relevance probability of the

answer. First, matching matrices are computed for both the word level and the phrase level using the cosine function, then a pooling layer is applied to both matching matrices in order to select the most important matching signals from the word level and the phrase level in parallel. Then, attention weights are computed at the word and phrase levels, using the ReLU activation function on the input text representation. Finally, the word-level and phrase-level score vectors are concatenated and fed to a hidden fully connected layer to compute the final output probability.
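Equation 4.8 translates directly into a few lines of numpy; the sketch below is ours and assumes that the word vectors and the attention parameter V share the same dimension.

```python
import numpy as np

def attention_weights(word_vectors, V):
    """Equation 4.8: softmax of V^T . w_t over the positions of the sentence."""
    logits = word_vectors @ V          # one score per word position
    logits -= logits.max()             # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# The attended sentence representation is then the weighted sum of word vectors:
# sentence_vec = attention_weights(W, V) @ W
```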

Position-based Models

Differently from the attention-based approach, in the position-based models [3, 247, 24] the matching function is provided with the positional information corresponding to the different words in the input sequences being compared. The word position is already considered, implicitly, in convolutional and recurrent models for text matching. In convolutional models [185, 17], the convolution layers process selected windows or n-grams of the input text in order to extract contextual features. In recurrent models [12, 78, 239], the word ordering of a given input sentence is processed in the different gates of an RNN node (section 2.3.2); however, only the hidden activation vector corresponding to the last input word is considered as the embedding vector of the whole sentence.

Position-based models [247, 3] consider the position of every individual word in a given text as an important factor to determine its weight. In the model proposed by Hui et al. [247], a semantic matching matrix is first computed from the embedded word vectors of the document and the query. This matrix is then distilled, by selecting the k most significant matching signals along the document dimension; the objective is to localize the relevance matching over all the matrix entries. This matrix is then fed as input to the DRMM model [178] architecture to compute the final matching score. Wan et al. [3] proposed the MV-LSTM model for matching sentences using representations computed at different positions of every sentence. The model uses bi-LSTM networks (section 2.3.2), where every hidden state computes a different representation of the input sentence; hence, each input sentence acquires a different representation at every position. For example, consider the input sentence s = <w_0 w_1 ... w_{|s|}>, where w_t is the word at position t. The hidden states \overrightarrow{h}_t and \overleftarrow{h}_t, computed as defined in equation 2.5, are concatenated to construct the bidirectional representation vector s_t = [\overrightarrow{h}_t, \overleftarrow{h}_t] at position t of the sentence s. Hence, for two input sentences s_i^{(Q)} and s_{ij}^{(D)}, the MV-LSTM [3] computes interaction matrices based on the bidirectional representations of each sentence. These matrices are then fed to a pooling layer that extracts the k strongest interaction signals. Called k-max pooling, this process is used in several neural text matching models [248, 185, 238, 21]; it aims at removing the noisy features from the interaction tensor. Recently, Song et al. [24] proposed the positional convolutional neural matching model (P-CNN). This model considers positional influence and interactions at multiple levels for text matching. The P-CNN first encodes the signals of positional information at the word level, the phrase level, and the sentence level. Then, a position-to-similarity mapping layer is defined to transform word-level positional information into local matching signals. At the phrase level, a CNN

layer with position-sensitive filters is used. A final matching function is then used to aggregate positional information from the word and phrase levels and compute a final matching score.
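The k-max pooling step used by MV-LSTM and several other models mentioned above can be sketched as follows (a minimal illustration; k is a hyperparameter).

```python
import numpy as np

def k_max_pooling(interaction_matrix, k=5):
    """Keep the k largest values of the interaction matrix (in descending
    order), discarding the weaker, noisier matching signals."""
    flat = interaction_matrix.ravel()
    return np.sort(flat)[-k:][::-1]
```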

4.4 Discussion

Several issues are related to the use of deep models in text matching applications. We identify two main factors, the data and the models, and discuss some issues related to each of them.

The type of data that is required in text matching applications, such as question answering and ad-hoc retrieval, consists mainly of three elements: (1) a set of generalized queries, (2) a set of generalized documents, (3) relevance judgments (explicit human decisions or implicit ones such as click data). These three elements are required in order to design deep models for text matching applications. It is also useful to distinguish deep neural models that focus on ranking long documents from those designed for short text matching (e.g., for the question answering task, or for document ranking where the document representation is based on a short text field like the title). The challenges of the two settings are different, and DNN models need to be designed accordingly [249]. On the one hand, when computing similarity between pairs of short texts, the vocabulary mismatch problem is more likely to happen than when retrieving long documents, which may contain thousands of words so that query words are more likely to occur [250]. On the other hand, long documents may contain a mixture of many topics and the query matches may be spread over the whole document; the neural ranking model must effectively aggregate the relevant matches from different parts of a long document [251, 27].

Another issue is related to the adaptability of neural models to the different text matching applications. Indeed, the structures of deep models designed for text matching are often tailored to perform on specific datasets [249]. In addition, IR tasks deal with texts of variable length, from short sequences (in the case of answer selection) to long documents. This makes it inappropriate to directly adapt a deep model designed for a short text matching application to ad-hoc document retrieval.

Finally, text matching applications differ according to the task being addressed. In recent studies [11, 252], a new classification of the neural text matching state-of-the-art is provided, where the different neural models for text matching are divided according to the way they handle the different inputs to perform the task being addressed. In a previous work [252], we have introduced the asymmetry aspect regarding the different text matching applications in IR, and how the input sequences could be processed in order to perform the target task. This aspect will be discussed in more detail later in chapter 7.

4.5 Conclusion

In this chapter, we reviewed several neural text matching models, and provided a road-map of the use of deep neural networks in different text matching applications. The trend to use deep models in text matching is motivated by several limits

of the classical LTR methods, mainly the feature engineering, which is costly and time consuming. Hence, the deep models designed for text matching leverage the computational ability of NNs to automatically compute and combine relevance signals. We defined a unified neural matching model that we used to describe different state-of-the-art models. All the different neural matching models described in this chapter handle different matching tasks independently of the nature of the data and the task itself. However, the analysis made in [249] shows that models designed for short text matching are inappropriate for ranking long documents. Finally, we described the models based on positional features and those putting a specific attention on particular words in a text sequence. We noticed that positional features enable enhancing state-of-the-art results, as attention models do. However, to the best of our knowledge, no model has considered both positional features and attention weights in the same model.

This chapter closes the state-of-the-art part of this document. In the following part, we will detail our contributions made during the preparation of this thesis, to solve several problems and issues discussed in this part, in particular concerning the exploitation of distributed representations in IR, as well as the use of deep models in text matching applications.

Part IV

Contributions


CHAPTER 5

EXPERIMENTAL SETUP

5.1 Introduction

In this chapter, we describe the experimental setup used to evaluate the different contributions and baselines. This setup includes the datasets, the evaluation metrics, and the frameworks and technical tools used to perform the different experiments. We also describe the different baseline models that we considered in order to compare the performance of our models to state-of-the-art results.

5.2 Datasets

In this work, we focus on two main tasks, short text matching and ad-hoc document ranking. Therefore, we used different datasets to run our experiments.

5.2.1 WikiQA

WikiQA is a set of question and sentence pairs provided by Microsoft Research [4]. The questions were collected from Bing query logs and have an average length of more than 7 words. Each question is associated with a Wikipedia page that potentially contains the answer. The candidate answers have an average length of more than 25 words. They correspond to sentences selected from the summary section of the corresponding Wikipedia page, which provides the most important information about every question's topic. Table 5.1 gives some statistics about the WikiQA dataset. The original corpus contains 3047 questions and a total of 29258 sentences. It includes several questions that have neither correct nor wrong answers. These questions have been filtered out in the version we use in our experiments, which is provided by MatchZoo1. Hence, the new WikiQA version that we used includes 2477 questions and a total of 24590 sentences, such that for every question there is at least one correct answer and at least one wrong answer. A sample of the WikiQA dataset is given in table 5.2.

1https://github.com/NTMC-Community/MatchZoo/tree/1.0/data/WikiQA


Dataset            Original                               MatchZoo
Characteristic     Train    Valid   Test    Total         Train    Valid   Test    Total
#num questions     2118     296     633     3047          2118     122     237     2477
#num sentences     20360    2733    6165    29258         20939    1115    2536    24590

Table 5.1: Statistics of the original WikiQA dataset of Yang et al. [4], compared to the one processed in MatchZoo [5]

QuestionID  Question                               DocumentID  DocumentTitle  SentenceID  Sentence                                                  Label
Q16         how much is 1 tablespoon of water?     D16         Tablespoon     D16-0       This tablespoon has a capacity of about 15 mL.            1
Q16         how much is 1 tablespoon of water?     D16         Tablespoon     D16-6       It is abbreviated as T, tb, tbs, tbsp, tblsp, or tblspn.  0
Q23         how old is zsa zsa gabor's daughter?   D23         Zsa Zsa Gabor  D23-1       Gabor was also a socialite.                               0
Q23         how old is zsa zsa gabor's daughter?   D23         Zsa Zsa Gabor  D23-5       She later acted in We're Not Married!                     0

Table 5.2: Sample from the WikiQA dataset of questions and their corresponding answers and labels.

5.2.2 QuoraQP

The QuoraQP2 dataset consists of over 404K question pairs which are either similar or not. Statistics of this dataset are provided in table 5.3. A data sample of the QuoraQP dataset is listed in table 5.4, where each question pair is given a different id and is composed of two different sequences (questions) with their corresponding identifiers qid1 and qid2. The last column indicates whether the question pair is similar or not (1 or 0, respectively).

Question pairs    Positive (duplicates)    Negative (non-duplicates)
404351            149306                   255045

Table 5.3: Description of the experimental QuoraQP dataset.


2https://data.quora.com/First-Quora-Dataset-Release-QuestionPairs

id  qid1  qid2  question1                                                                               question2                                                                                   is duplicate
0   1     2     What is the step by step guide to invest in share market in india?                     What is the step by step guide to invest in share market?                                  0
1   3     4     What is the story of Kohinoor (Koh-i-Noor) Diamond?                                     What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?   0
5   11    12    Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?  I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?  1
7   15    16    How can I be a good geologist?                                                          What should I do to be a great geologist?                                                   1

Table 5.4: Sample of some question pairs from the QuoraQP dataset.
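A minimal sketch of how this question-pair file can be loaded is shown below. The file name is a placeholder, and the column names (id, qid1, qid2, question1, question2, is_duplicate) are assumed to follow the released TSV, as reflected in the sample of table 5.4.

```python
import pandas as pd

# Hypothetical file name for the released question-pairs TSV; columns assumed
# to be (id, qid1, qid2, question1, question2, is_duplicate).
pairs = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")

total = len(pairs)
positives = int(pairs["is_duplicate"].sum())   # label 1 = duplicate pairs
negatives = total - positives                  # label 0 = non-duplicate pairs
print(total, positives, negatives)             # table 5.3 reports 404351 / 149306 / 255045
```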

5.2.3 Ad-hoc Document Ranking Datasets

We used two different types of Text REtrieval Conference (TREC)3 collections: news and Web datasets. All these datasets have the same format, specified by the TREC campaign. For each dataset, a set of queries and relevance judgments (Qrels) are provided. Statistics about all the TREC datasets are provided in table 5.5.

Collection    # of documents    TREC query ids
AP 88-89      165 K             51 - 200
Robust04      528 K             301 - 450, 601 - 700
GOV2          25 M              701 - 800

Table 5.5: Statistics of the TREC datasets used for ad-hoc document ranking.

News Datasets

We used two different datasets:

• Associated Press (AP). This is a collection of newswire articles containing over 165K documents from different years (1988-1990). In our experiments, we only used the AP-1988 and AP-1989 documents, which corresponds to the common use in the IR literature, and for which relevance judgments are provided for 150 different queries.

• Robust04. This contains the documents of the TREC 2004 Robust Track4. The dataset consists of newswire documents from discs 4 and 5 of the TREC data, excluding the documents of the Congressional Record set. This dataset contains over 528K documents and 250 judged queries.

3https://trec.nist.gov/data.html
4https://trec.nist.gov/data/t13_robust.html


Web Dataset

We used one Web dataset, GOV2. This dataset is used in the TREC Terabyte Track5. It contains a large proportion of the crawlable pages from .gov, in both HTML and text formats. Text documents may come from original PDF, Word or Postscript files. The GOV2 collection is 426GB in size and contains over 25 million documents. The GOV2 dataset includes 100 queries and the corresponding relevance judgments for the different associated documents.

5.3 Evaluation metrics

We have used different metrics in order to evaluate the performance of our proposed models as well as the different baselines. Several evaluation measures have been previously defined in section 1.5.1. In this section, we list the measures that we used in our experimental process, and give a brief reminder of their definitions and the reasons for our choice.

• MAP. Our aim is to evaluate, for each query, the distribution of relevant documents in the set of returned results.

• nDCG@k. We used this measure at ranks 1, 3 and 5 in the evaluation of the question-answering task, and at ranks 5, 10 and 20 when evaluating the ad-hoc document ranking task.

• P@k. The precision at different ranks is used to promote the relevant documents of a ranked result list. We evaluated the precision at ranks 1, 3, 5, 10, and 20.

• MRR. We used this measure for evaluations in the question-answering task. We assume that the user is likely to be satisfied if the correct answer is found at the top of the result list.

• Acc. In our experiments, we only used this measure when evaluating the neural models in the paraphrase identification task. We evaluated the models' ability to correctly find all the positive question pairs.

A small sketch of how these measures can be computed from a ranked result list is given below.
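The following sketch is an illustrative re-implementation of P@k, AP (averaged into MAP over queries) and MRR from a ranked list of binary relevance labels; it is not the trec_eval code actually used in the experiments.

```python
# Ranking metrics from a ranked list of binary relevance labels
# (1 = relevant, 0 = not relevant).

def precision_at_k(labels, k):
    return sum(labels[:k]) / k

def average_precision(labels, num_relevant=None):
    # Mean of P@k over the ranks k of the relevant retrieved documents,
    # divided by the total number of relevant documents (defaults to the
    # number of relevant documents seen in the ranked list).
    hits, score = 0, 0.0
    for k, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            score += hits / k
    total_rel = num_relevant if num_relevant is not None else hits
    return score / total_rel if total_rel else 0.0

def reciprocal_rank(labels):
    for k, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / k
    return 0.0

# MAP and MRR are the means of the per-query values.
ranked = [0, 1, 0, 1, 1]
print(precision_at_k(ranked, 5), average_precision(ranked), reciprocal_rank(ranked))
```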

5.4 Baseline models

While evaluating our different contributions, we considered several baseline models, namely classical and neural models.

5.4.1 Classical models

We considered two classical models that have already proven their effectiveness in different text matching applications. In the following, we give a brief description of these models; more details are provided in part III.

• BM25 is a probabilistic weighting model defined by Robertson et al. [253].

• LM: we used the language model of Metzler [49].

5https://www-nlpir.nist.gov/projects/terabyte/

71 Chapter 5 Experimental setup

5.4.2 Classical models with word embeddings

We compared our models to some models where embedded word representations have been used to enhance the classical ones.

• RM-Cent [191] is a query expansion model using embedded word representations.

• NWT [14] is a model where all terms of the query are compared indifferently to all terms of the document, using cosine similarities with learned weights.

5.4.3 Neural models

We have also considered a set of neural models from the MatchZoo6 framework.

• ARC-I, ARC-II [17] are two convolutional models for sentence matching. ARC-I is a representation-focused model and ARC-II is an interaction-focused model.

• DSSM [16] is a model made of two symmetric deep MLP structures that compute semantic representations of the input sequences.

• CDSSM [185] is an extension of the DSSM model [16] with a convolutional layer.

• DRMM [178] is a histogram-based model for ad-hoc retrieval.

• DUET [179] is a convolutional model composed of two parallel modules running together, the local model and the distributed model.

• MatchPyramid [19] is a hierarchical convolution-based model for short text matching.

• MV-LSTM [3] is a position-based recurrent model for sequence representation and matching.

• ANMM [157] is an attention-based neural model for short text matching.

• KNRM [180] is an end-to-end neural matching model for ad-hoc retrieval.

5.5 Tools and frameworks

We used different tools to perform the experimental and evaluation processes. Here, we give a brief description of those tools.

• MatchZoo [5] is a framework for implementing, experimenting with and comparing neural-based text matching models. It is developed in Python using different deep learning libraries such as Keras7 and TensorFlow8, in addition to other libraries developed mainly for the different text matching processing steps, such as the computation of the matching tensors. Two main versions of the MatchZoo framework are provided: the MatchZoo.1.09 release, which is a software-like module where models can be run from the command line; and MatchZoo.2.x10, which is still under development and is a library-like package. In our experiments, we only used the first one (MatchZoo.1.0), which is the first stable release, on which we ran several models, modified some of them and added new ones.

6https://github.com/NTMC-Community/MatchZoo/tree/1.0/matchzoo/models
7https://keras.io/
8https://www.tensorflow.org/


• INDRI [254] is a text search engine developed as a part of the Lemur Project11. This tool is designed for academic purposes and is used to pre-process, parse and index different text datasets. INDRI can parse TREC newswire datasets and web collections. It allows running a set of baselines and returning the results in the TREC standard format. We used INDRI to parse and index the different datasets used in the experiments, as well as to run different classical baseline models, such as BM25 and LM.

• Pyndri [255] is a Python interface to the INDRI search engine. It is designed as an integrated Python library dedicated to IR research. Pyndri12 offers read-only access to two levels of a given INDRI index, the dictionary and the tokenized document collection, as well as query evaluation on the constructed index. In our experiments, we first indexed the different datasets using INDRI, in order to have a homogeneous data pre-processing (tokenization, lemmatization, ...), then used Pyndri to access the different indexes.

• Trec_eval13 [256] is the standard tool used by the TREC community for evaluating ad-hoc retrieval models. Given the results file (run) produced by the model being evaluated and a standard set of judged results, this package provides a list of performance values in terms of different metrics, such as MAP, nDCG and precision (a usage sketch is given below).
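The sketch below shows how a run file can be scored with trec_eval from Python. The file names are placeholders; the invocation `trec_eval <qrels> <run>` is the tool's basic usage, and its output is assumed to be the usual tab-separated "metric query value" lines, with aggregated values reported under the pseudo-query "all".

```python
import subprocess

def evaluate_run(qrels_path, run_path):
    # Basic invocation: trec_eval <qrels> <run>.
    out = subprocess.run(["trec_eval", qrels_path, run_path],
                         capture_output=True, text=True, check=True).stdout
    scores = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "all":
            try:
                scores[parts[0]] = float(parts[2])
            except ValueError:
                pass  # skip non-numeric fields such as the run identifier
    return scores

# Placeholder file names.
scores = evaluate_run("qrels.robust04.txt", "bm25.run")
print(scores.get("map"), scores.get("P_10"))
```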

5.6 Conclusion

In this chapter, we described the experimental setup that we adopted during the preparation of this thesis. Mainly, we described the different datasets, evaluation measures, baseline models and, finally, the technical tools. Instead of using MatchZoo to run the different neural baselines, one could use the original implementation of every model. In our case, we used MatchZoo since the different baselines are implemented with the same libraries and use datasets of the same format.

9https://github.com/NTMC-Community/MatchZoo/tree/1.0
10Accessible in Aug 2019: https://github.com/NTMC-Community/MatchZoo/tree/2.2-dev
11The Lemur Project develops search engines and other tools for text analysis to support research and development of IR and text mining. It is accessible at: https://sourceforge.net/projects/lemur/
12https://github.com/cvangysel/pyndri
13https://github.com/usnistgov/trec_eval

CHAPTER 6

QUERY WORDS IMPACT IN DOCUMENT RANKING USING WORD EMBEDDINGS

6.1 Introduction

In the previous chapters 1 and 3, we discussed several limitations of the models using classical BoW representations, mainly the vocabulary mismatch. To solve this problem, recent approaches [257, 191, 154, 173] rely on the use of distributed representations of words [55, 9], enabling a semantic-based matching. Most of the models described in section 3.4 compare all query terms with all document terms in the same way. These models do not distinguish between query terms that are present in the document and those that are not. We believe that the absence of a query word in a relevant document is an important factor in the matching process, because query terms that are absent from the relevant document can appear in another semantic form in that document. In this chapter, we analyze the impact of query terms that are not present in the relevant documents on the document-query matching using embedded word vectors. To do so, we experiment with different document-query matching strategies. We combine the exact matching of classical IR models with the semantic-based matching of distributed word representations. We studied three different matching strategies:

• Comparison of all query terms with all document terms in the same way.

• Comparison of the terms of the query with those of the document, according to their presence/absence in that document.

• Discriminatory comparison of the query terms with the document terms, based on their presence in that document and their similarities based on exact matching.

The chapter is organized as follows: in section 6.2, we give an example of a query and two candidate documents, where the relevant document does not contain as many query words as the irrelevant one. In section 6.3, we describe the classical query-document matching, which is then extended, in section 6.4, to propose different matching strategies using word embeddings. In section 6.5, we describe the experimental setup and the evaluation results.


[Figure 6.1 content: side-by-side snippets of two AP news documents, AP880312-0074 ("Reports Say Smugglers Sell Bangladeshis Into Prostitution, Servitude") and AP881014-0043 ("House Functioned As Debtors Prison For Illegal Aliens"), with the occurrences of the query words and their stems highlighted.]

Q: Combating Alien Smuggling

Figure 6.1: Highlighting query word occurrences in a relevant document D1+ and an irrelevant document D2−.

6.2 Motivation

In most IR models using word embeddings, the absence of query words in the relevant documents is often addressed by means of query expansion [191, 257, 192], but the words in the new query are processed in the same way. Other models [14, 159, 153] compute the query-document similarity score by using all interactions between the words of the document and those of the query. The presence/absence of query words in a document is not explicitly considered in these models. Figure 6.1 shows a concrete example where a relevant document has several missing query words compared to an irrelevant document, in which more query words are used. Let us consider the query Q = "Combating Alien Smuggling", for which the document AP880312-0074 is relevant, and we refer to it as D1+; and the document AP881014-0043 is irrelevant, and we refer to it as D2−. Both D1+ and D2− are taken from the AP TREC dataset. Using the BM25 model implemented in INDRI, the system ranked the document D1+ at position 179, while the irrelevant document D2− is ranked at the first position of the list. In figure 6.1, we show some snippets of the documents D1+ and D2−. We used three different colors to highlight the query words and the occurrences of their stems in each document. We can notice that the relevant document D1+ contains only one query word, whose stem is "smuggl", while the irrelevant one D2− contains more query words, with stems "smuggl" and "alien". This example highlights the fact that the occurrence of all query terms in a document does not imply that this document is relevant.


6.3 Classical Query-Document Matching

In tfidf models based on the standard BoW representation, the relevance score sc assigned to a document D w.r.t. a query Q is generally computed [258] based on the exact match between every query term q_i and document term d_j, as shown in the general equation 6.1.

sc(Q, D) = \sum_{q_i \in Q \cap D} M_{tfidf}(q_i, D)    (6.1)

M_{tfidf}(q_i, D) is the weight corresponding to the query word q_i in the document D, and is computed using a classical tfidf or any IR model. This approach assumes that the query terms contribute to the score sc only if they are present in the document; query terms that are absent from the document are not considered, even though they contribute to the description of the information sought by the user, which results in some missing information. To cope with this limitation, distributed representations of words [2, 9] can be used. These representations make it possible to exploit the semantic similarities between two lexically different words w_i and w_j, such that the semantic relatedness is translated by a function sim computing a distance between the corresponding vectors \vec{w}_i and \vec{w}_j. In the following sections, we describe different matching strategies, introduced in an earlier work [259], which exploit the semantic links between the vectors of the query words and those of the document words, based on the occurrence of the query words in the documents.

6.4 Matching Strategies Using Semantic Word Similarities

One simple way is to compare all query terms with all the terms of the document using their representation vectors, as in [14, 260]. Thus, the importance of every term in the document is assessed w.r.t. the terms of the query, as described in equation 6.2:

sc(Q, D) = \sum_{d_j \in D} \sum_{q_i \in Q} M_{tfidf}(d_j, D) \times sim(q_i, d_j)^{\alpha}    (6.2)

M_{tfidf}(d_j, D) is the weight of the document term d_j, and sim(q_i, d_j) is a normalized semantic similarity between the terms q_i and d_j. α is a parameter used to control the impact of the semantic similarity sim(q_i, d_j). Starting from the comparison of all query words with all document words in equation 6.2, we can transform this equation to better highlight the presence/absence of query terms in the document (equation 6.3) and/or the exact (lexical) matching between the terms (equation 6.4). In both these strategies, we consider a normalized word similarity function sim(q_i, d_j).

6.4.1 Presence/Absence Split

In this strategy, we intend to observe the contribution of query terms that are not present in the document to the assessment of the document's relevance score.

We decompose equation 6.2 into two parts: the first part deals with the terms of the query that are present in the document, while the second part deals with the terms of the query that are not in the document. This process is shown in equation 6.3:

sc(Q, D) = \lambda \times \sum_{d_j \in D} \sum_{q_i \in Q \cap D} M_{tfidf}(d_j, D) \times sim(q_i, d_j)^{\alpha} + (1 - \lambda) \times \sum_{d_j \in D} \sum_{q_i \in Q \setminus D} M_{tfidf}(d_j, D) \times sim(q_i, d_j)^{\alpha}    (6.3)

where λ is used to control the impact of each part of the equation.
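A minimal sketch of equation 6.3 is given below. The embedding lookup `emb` and the weighting function `m_tfidf` (a BM25 or LM term weight) are placeholders for the components described in section 6.3; the cosine similarity is clipped at 0 so that it behaves like the normalized similarity assumed in the text, and the default α and λ values are those retained in section 6.5.2.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def score_eq63(query, doc, emb, m_tfidf, alpha=7, lam=0.4):
    """Presence/absence split of equation 6.3.

    `emb` maps a term to its embedding vector (out-of-vocabulary terms are
    skipped) and `m_tfidf(d, doc)` stands for the term weight M_tfidf(d_j, D).
    """
    doc_terms = set(doc)
    present = [q for q in query if q in doc_terms and q in emb]
    absent = [q for q in query if q not in doc_terms and q in emb]
    sc_present = sc_absent = 0.0
    for d in doc:
        if d not in emb:
            continue
        weight = m_tfidf(d, doc)
        for q in present:
            sc_present += weight * max(cosine(emb[q], emb[d]), 0.0) ** alpha
        for q in absent:
            sc_absent += weight * max(cosine(emb[q], emb[d]), 0.0) ** alpha
    return lam * sc_present + (1 - lam) * sc_absent
```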

6.4.2 Exact/Semantic Matching Split

We have decomposed the first part of equation 6.3 into two components, as shown in equation 6.4. In this equation, we can separately observe the impact of the following elements:

• The exact matching between the query terms and the document terms, described by the first part of equation 6.4, \sum_{q_i \in Q \cap D} M_{tfidf}(q_i, D);

• The semantic matching between the query terms present in the document and the other terms of that document, described by the second part of equation 6.4, \sum_{q_i \in Q \cap D} \sum_{d_j \in D \setminus \{q_i\}} M_{tfidf}(d_j, D) \times sim(q_i, d_j)^{\alpha};

• The semantic similarity between the query terms that are not in the document and the terms of the document, expressed by the third part of equation 6.4, \sum_{d_j \in D} \sum_{q_i \in Q \setminus D} M_{tfidf}(d_j, D) \times sim(q_i, d_j)^{\alpha}.

sc(Q, D) = \lambda_1 \times \sum_{q_i \in Q \cap D} M_{tfidf}(q_i, D) + \lambda_2 \times \sum_{q_i \in Q \cap D} \sum_{d_j \in D \setminus \{q_i\}} M_{tfidf}(d_j, D) \times sim(q_i, d_j)^{\alpha} + (1 - \lambda_1 - \lambda_2) \times \sum_{d_j \in D} \sum_{q_i \in Q \setminus D} M_{tfidf}(d_j, D) \times sim(q_i, d_j)^{\alpha}    (6.4)

λ1 and λ2, with λ1 + λ2 ≤ 1, are parameters used to control the impact of the different items cited above.

6.4.3 Relations Between the Different Matching Strategies

In the previous matching strategies, note that if λ = 0.5 in equation 6.3 and if λ1 = λ2 = 1/3 in equation 6.4, then equations 6.3 and 6.4 are equivalent to equation 6.2. Consider the matching function defined in equation 6.5:

sim(w_i, w_j) = \begin{cases} 1 & \text{if } w_i = w_j \\ 0 & \text{otherwise} \end{cases}    (6.5)

77 Chapter 6 Query words impact in document ranking using word embeddings

If the similarity between the query terms and the document terms is computed by equation 6.5, then equation 6.1 becomes a special case of each of the matching strategies described by equations 6.2, 6.3, and 6.4. Indeed, when λ = 0.5 in equation 6.3 and λ1 = λ2 = 1/3 in equation 6.4, equations 6.2, 6.3, and 6.4 reduce to equation 6.1.

6.5 Experiments

6.5.1 Evaluation methodology

Since the overall experimental framework of this work was described in chapter 5, we only briefly recall the experimental setup used to evaluate and compare the different matching strategies. We used two traditional models to compute the M_tfidf function in each of the equations 6.1, 6.2, 6.3 and 6.4: BM25 [253], where, following the common use in IR, the parameters k1 and b are set to 1.2 and 0.75 respectively; and the language model of Metzler et al. [49], where λD = 0.2 and λC = 0.4. We used the three TREC datasets described in table 5.5. AP 88-89 is used to set the parameters λ, λ1, λ2 and α of the different matching strategies. Performances are reported using Robust04 and GOV2.

6.5.2 Parameter Setting and Impact Analysis

We use the cosine function to compute the similarity sim between every document term d_j and query term q_i. For the embedded word representations, we used the pre-trained Word2Vec1 model, where each word is represented by a vector of 300 dimensions. Out-of-vocabulary terms are simply ignored. We conducted experiments on the AP 88-89 dataset to set the different hyper-parameters. The parameter α takes values in {1, 2, ..., 20} (above 20 the performances were unchanged), and the parameters λ, λ1 and λ2 take values in {0.1, 0.2, ..., 0.9}. We used trec_eval to evaluate our models' performances in terms of MAP, P@5, P@10 and P@20, as well as nDCG@20. The analyses of equations 6.2 and 6.4 were similar to that of equation 6.3, for which we describe the impact of the different parameters. Figure 6.2 shows the evolution of performance, in terms of MAP and P@5, using the BM25 model to compute M_tfidf in equation 6.3. In this figure, each curve corresponds to a value of α. We notice that MAP and P@5 gradually increase with the value of λ for α ∈ {1, 2}. For α ∈ {3, 4, 5, 6} the performances are stable for λ ∈ {0.4, 0.5, 0.6} and then decrease slowly for λ > 0.6. For α = 7 the values of MAP and P@5 become more stable with λ ≥ 0.4. This analysis shows the impact of the λ parameter, used to analyze the influence on the matching process of query terms that are not in the document. In equation 6.3, λ = 0.4 gives the best results, showing the importance of query terms that are not in the document for the matching process. For α ∈ {1, 2, 3, 4} the performance is worse than for α = 7, because the semantic similarity has more impact on the matching process. According to Zamani and Croft [154], since continuous representations of words are based on the notion of context [2], they can lead to the use of irrelevant terms in document-query matching (e.g., the terms secured and dangerous have a high semantic similarity, but dangerous is not relevant for the query "security transport").

1https://code.google.com/archive/p/Word2Vec/
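The similarity sim(q_i, d_j) described above can be computed, for instance, by loading the pre-trained Word2Vec vectors with gensim, as sketched below; the file name is the usual Google News archive and is given only as an example, and out-of-vocabulary terms are ignored as in the text.

```python
from gensim.models import KeyedVectors

# Example file name for the pre-trained 300-dimensional Word2Vec vectors.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

def sim(q_term, d_term):
    if q_term not in vectors or d_term not in vectors:
        return 0.0          # out-of-vocabulary terms are simply ignored
    return float(vectors.similarity(q_term, d_term))

print(sim("smuggling", "trafficking"))
```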



Figure 6.2: Analysis of the sensitivity of equation 6.3 to the parameters α and λ on the AP 88-89 collection, when using BM25 as M_tfidf.

Based on the above analysis, we defined the parameter values that correspond to a compromise between the performance in terms of P@5 and MAP, leading to the following configuration: α = 7 in both equations 6.2 and 6.3, and λ = 0.4 in equation 6.3. For equation 6.4, we use the following configuration: α = 5, λ1 = 0.5 and λ2 = 0.3.

6.5.3 Results and discussion

As explained in section 6.3, we analyze the contribution of query terms that are missing in the document to the semantic-based matching process. In the following, we evaluate the different matching strategies to answer several research questions (RQ), where the labels eq6.2, eq6.3, and eq6.4 correspond to the different matching strategies defined by equations 6.2, 6.3, and 6.4, respectively.

RQ1: How does the parameter α impact the performance of the three matching strategies?

We first analyze the behavior of each of the defined matching strategies according to the different values of the α parameter. Figure 6.3 shows the evolution of the performances, in terms of MAP and P@5, of the different matching strategies. In figure 6.3, according to the different curves, we can notice that all the equations behave in the same way, with a slight difference in performance. The most important difference is that for α ∈ [4, 8], the performances of equations eq6.3 and eq6.4 are slightly better than the results of equation eq6.2. However, for α > 8 the performances of all the matching equations decrease, because the contribution of the semantic similarity between words, sim(q_i, d_j), is largely reduced due to the exponent value. These results show, on the one hand, that the parameter α has the same impact for the different equations 6.2, 6.3 and 6.4; and, on the other hand, that the distinction between the query words that are in the document and those that are not (equations 6.3 and 6.4) has led to some improvement in the MAP and P@5 values compared to the indifferent processing of all the query terms (equation 6.2).



Figure 6.3: Comparison of the performances, in terms of MAP and P@5, of the different matching strategies, w.r.t. the different values of the parameter α on the AP 88-89 collection.

RQ2: What is the impact of a similarity threshold?

As suggested in several state-of-the-art analyses [14, 261, 192] that use the cosine similarity between word vectors, defining a threshold value to control the impact of this similarity is important. Hence, we evaluated the different matching strategies using a minimum threshold ε ∈ {0.0, 0.1, ..., 1.0} on the semantic similarity between words. The objective is that only similarities greater than ε are taken into account. For this purpose, the similarity between a query word q_i and a document word d_j is computed as defined in equation 6.6.

sim_{\varepsilon}(q_i, d_j) = \begin{cases} sim(q_i, d_j) & \text{if } sim(q_i, d_j) > \varepsilon \\ 0 & \text{otherwise} \end{cases}    (6.6)

where sim(q_i, d_j) is the cosine similarity between the representation vectors of q_i and d_j. Figure 6.4 shows the evolution of the performances of the different matching equations, according to the different values of ε, using the BM25 model for M_tfidf as defined in section 6.3. In this figure, we find that the ε threshold has a negative impact on the performance of each of the proposed matching strategies. We can see a similar behaviour for all strategies. For a low threshold value ε ∈ [0.0, 0.2], the performance values gradually decrease as ε becomes larger, ε ∈ [0.3, 1.0]. We used the α parameter to regularize the impact of the semantic similarity between words. Consequently, the small similarities (when sim_ε(q_i, d_j) ≈ 0) are already omitted, due to the behavior of the power function of exponent α near 0, which leads to a drop in performance when a large value of ε is used.
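Equation 6.6 amounts to a simple guard around the similarity function, as in the sketch below; `sim_fn` stands for the cosine similarity between word vectors (e.g., the `sim` function from the earlier sketch), and the default ε is shown only as an example.

```python
def sim_eps(q_term, d_term, sim_fn, eps=0.2):
    """Equation 6.6: keep the semantic similarity only when it exceeds the
    threshold epsilon."""
    value = sim_fn(q_term, d_term)
    return value if value > eps else 0.0
```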

RQ3: How effective are the matching strategies compared with classical models?

We also compared the different matching strategies that we propose with traditional IR models. Tables 6.1 and 6.2 show the results obtained on two datasets, Robust04 and GOV2, using BM25 and LM respectively to compute the value of M_tfidf in each of the matching equations eq6.2, eq6.3 and eq6.4. We report the performances in terms of MAP, P@5, P@10, P@20 and nDCG@20 of each equation. The values in bold type represent the best performance over the different models. Results labelled with (+) or (−) show significant2 improvements or decreases in performance compared to the corresponding baseline model. W/L/T refers to the number of queries whose performance is improved (Win), deteriorated (Loss) or unchanged (Tied) by the corresponding matching strategy, w.r.t. the corresponding baseline. In both tables 6.1 and 6.2, when considering the Robust04 dataset, the different matching strategies perform better than the classical baselines, BM25 and LM. The improvements are significant concerning MAP and P@5 with BM25, and in terms of P@20 with LM. The matching strategies eq6.2 and eq6.3 outperform the BM25 model by more than 2% in terms of P@5, and the matching strategy eq6.4 outperforms the BM25 model by more than 4% in terms of P@5. In the GOV2 dataset, both equations eq6.2 and eq6.3 outperform the traditional models, BM25 and LM, only in terms of P@5. When the LM model is used to compute the values of M_tfidf, the differences are not significant.


Figure 6.4: Evolution of the performances of the different matching equations with respect to the value of ε, in the Robust04 collection.

Dataset    Model    MAP          P@5          P@10         P@20         nDCG@20
Robust04   BM25     0.2362       0.4747       0.4285       0.3528       0.4105
           eq6.2    0.2400+      0.4851       0.4309       0.3548       0.4139
           (W/L/T)  152/85/13    32/19/199    37/32/180    43/38/168    120/86/43
           eq6.3    0.2403+      0.4851       0.4301       0.3550       0.4140
           (W/L/T)  116/73/61    1/3/246      6/8/236      15/12/223    119/89/41
           eq6.4    0.2401       0.4956+      0.4293       0.3772       0.4160
           (W/L/T)  148/95/7     49/29/172    52/53/145    63/55/132    125/96/29
GOV2       BM25     0.2595       0.5463       0.5315       0.4977       0.4401
           eq6.2    0.2554       0.5141       0.5040-      0.4893       0.4262
           (W/L/T)  90/56/4      30/38/82     35/52/63     46/42/62     71/67/12
           eq6.3    0.2553       0.5154-      0.5013-      0.4893       0.4263-
           (W/L/T)  89/57/4      42/44/63     39/55/56     53/53/44     75/65/10
           eq6.4    0.2524       0.5128       0.4933       0.4829       0.4161
           (W/L/T)  86/60/4      45/43/62     45/54/50     58/51/41     73/69/8

Table 6.1: Experimental results using BM25 as M_tfidf(·, ·) in the different matching strategies.


Dataset    Model    MAP          P@5          P@10         P@20         nDCG@20
Robust04   LM       0.2310       0.4265       0.3936       0.3331       0.3807
           eq6.2    0.2337       0.4378       0.4016       0.3392       0.3868
           (W/L/T)  162/75/13    37/25/188    40/32/178    60/32/158    29/37/183
           eq6.3    0.2336       0.4345       0.4016       0.3392       0.3865
           (W/L/T)  119/81/50    2/4/244      8/9/233      16/12/222    29/37/183
           eq6.4    0.2324       0.4249       0.3952       0.3438+      0.3856
           (W/L/T)  166/77/7     45/42/163    59/52/139    82/50/118    132/89/29
GOV2       LM       0.2516       0.4054       0.4289       0.4094       0.3456
           eq6.2    0.2446       0.4201       0.4161       0.4077       0.3392
           (W/L/T)  81/65/4      37/28/85     43/46/60     56/45/49     79/57/14
           eq6.3    0.2444       0.4201       0.4161       0.4064       0.3384
           (W/L/T)  81/64/5      37/28/85     43/46/61     56/46/48     79/58/13
           eq6.4    0.2277-      0.3544       0.3638-      0.3695       0.2954-
           (W/L/T)  67/79/4      43/52/55     45/55/50     60/63/27     70/71/9

Table 6.2: Experimental results using LM as M_tfidf(·, ·) in the different matching strategies.

RQ4: How do the different query terms contribute to the relevance assessment?

In order to explain the results obtained on the different datasets, we analyzed the impact of the different query terms on the computation of the score of relevant documents in every dataset. Let sc_t(D, Q) be the total relevance score computed for a document D according to the query Q, defined as in equation 6.7. Note that in this analysis, we only consider documents that are judged as relevant.

sc_t(D, Q) = sc_a(D, Q) + sc_p(D, Q)    (6.7)

• sc_a(D, Q) = \sum_{q_i \in Q \setminus D} \sum_{d_j \in D} sim(q_i, d_j) is the score computed from the query terms that are absent from the document;

• sc_p(D, Q) = \sum_{q_i \in Q \cap D} \sum_{d_j \in D} sim(q_i, d_j) is the score computed from the query terms that are present in the document (a small sketch of this decomposition is given below).
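The sketch below illustrates the decomposition of equation 6.7; `sim_fn` stands for the cosine similarity between word vectors, and the function returns the relative contributions of sc_p and sc_a to the total score sc_t.

```python
def contribution_split(query, doc, sim_fn):
    """Split the total score of equation 6.7 into the contribution of query
    terms present in the document (sc_p) and of absent terms (sc_a)."""
    doc_terms = set(doc)
    sc_p = sc_a = 0.0
    for q in query:
        total = sum(sim_fn(q, d) for d in doc)
        if q in doc_terms:
            sc_p += total
        else:
            sc_a += total
    sc_t = sc_p + sc_a
    return (sc_p / sc_t, sc_a / sc_t) if sc_t else (0.0, 0.0)
```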

2We used the t-test with a confidence level of 95%.


Figure 6.5: Contributions of the different query terms, according to their presence/absence in relevant documents, to the relevance score computation for relevant documents in the AP 88-89 dataset.

sim(q_i, d_j) is the cosine similarity between the terms q_i and d_j. First, we analyze the proportion of relevant documents that are concerned by the problem of query word absence in the AP 88-89 dataset, which was used to set the different parameters of the studied matching strategies. Figure 6.5 shows the percentage contributions of the query terms that are present in the document, referred to as sc_p, and of the terms that are absent from the document, referred to as sc_a, to the total score sc_t for the relevant documents of the AP 88-89 collection. Remember that we used this dataset to set the different parameters of each of the matching strategies we proposed. In this figure, each bar represents a relevant document whose score is computed by equation 6.7. We can see that the scores of the majority of relevant documents (green area) are significantly affected by the similarities between the query terms that are absent from the documents and the terms of these documents. Now, let us analyze the relevance scores on the test datasets, Robust04 and GOV2. Figure 6.6 shows the contribution percentages of the query terms that are present in the document, sc_p, and of the terms that are not, sc_a, to the computation of the total score sc_t for the relevant documents in each of the GOV2 (figure 6.6a) and Robust04 (figure 6.6b) collections. In figure 6.6, we see that the scores of the relevant documents in the Robust04 dataset (figure 6.6b) are consistently influenced by the similarities of the query terms that are not in the document (score sc_a). This explains the improved retrieval results obtained by the different matching strategies we proposed (tables 6.1 and 6.2). However, in the GOV2 dataset (figure 6.6a), the scores of relevant documents are not significantly influenced by the similarities between the query terms that are absent from the relevant documents and the terms of these documents (sc_a). In the Robust04 collection, nearly 12% of the relevant documents do not contain any query term, so processing query terms that are missing in the relevant documents accordingly has worked well. This differs from the GOV2 collection, which contains only 1.1% of relevant documents without any query term. This explains why splitting the query terms according to their presence/absence in the documents has no influence on the GOV2 dataset.


Figure 6.6: Contribution of query terms, according to their presence/absence in the relevant documents of the GOV2 (a) and Robust04 (b) datasets, to the total relevance score values.

RQ5: How do the proposed matching strategies perform in comparison to state-of-the-art models?

In this analysis, we evaluate how effective the different matching strategies are compared to traditional models and to some state-of-the-art models where embedded word representations have been used to enhance traditional IR models. In table 6.3, MBM25 and MLM refer to the use of the BM25 and LM models, respectively, to compute M_tfidf in equations eq6.2, eq6.3 and eq6.4. RM-Cent corresponds to the model of Kuzi et al. [191], which is a query expansion model using the embedded representations of words. NWT refers to the model of [14], where all terms of the query are compared indifferently to all terms of the document. In this table, we report the results of the different state-of-the-art approaches as presented in the corresponding articles for RM-Cent [191] and NWT [14]. The values in bold type represent the best performance over the different models. In table 6.3, we can notice that the matching strategies we studied outperform the state-of-the-art models in terms of P@5 on the Robust04 dataset and in terms of nDCG@20 on the GOV2 dataset. In all the matching strategies, we exploit all semantic interactions between the word vectors of the document and the word vectors of the query. Differently from models that process all query terms indiscriminately, RM-Cent [191] selects the most suitable terms for expanding the query, and the resulting query is then used in a classic language model to rank the documents of the dataset; in the NWT model [14], word vectors are used to capture semantic similarities between document terms and all query terms without distinction. Although the matching strategies that we propose are not more effective than all the baselines, the results support our intuitions about the importance, in the relevance assessment, of query words that are absent from relevant documents.


Collection  Model              MAP      P@5      P@10     P@20     nDCG@20
Robust04    MBM25   eq6.2      0.2400   0.4851   0.4309   0.3548   0.4139
                    eq6.3      0.2403   0.4851   0.4301   0.3550   0.4140
                    eq6.4      0.2401   0.4956   0.4293   0.3772   0.4160
            MLM     eq6.2      0.2337   0.4378   0.4016   0.3392   0.3868
                    eq6.3      0.2336   0.4345   0.4016   0.3392   0.3865
                    eq6.4      0.2324   0.4249   0.3952   0.3438   0.3856
            RM-Cent            0.2910   0.4950   -        -        -
            NWT                0.2740   -        -        0.3800   0.4260
GOV2        MBM25   eq6.2      0.2554   0.5141   0.5040   0.4893   0.4262
                    eq6.3      0.2553   0.5154   0.5013   0.4893   0.4263
                    eq6.4      0.2524   0.5128   0.4933   0.4829   0.4161
            MLM     eq6.2      0.2446   0.4201   0.4161   0.4077   0.3392
                    eq6.3      0.2444   0.4201   0.4161   0.4064   0.3384
                    eq6.4      0.2277   0.3544   0.3638   0.3695   0.2954
            RM-Cent            0.3350   0.6230   -        -        -
            NWT                0.304    -        -        0.5240   0.4220

Table 6.3: Comparison of the different matching strategies with some state-of-the-art models on two different datasets.

6.6 Conclusion

In this chapter, we have analyzed the document-query matching process using word embeddings in classical IR models. We proposed different matching strategies based on the presence/absence of query terms in the document. We used semantic similarities between document and query terms to address the vocabulary mismatch between the document and the query. The results show that explicitly taking into account query terms that are not used in the document improves the matching process, and provides better results than traditional models based only on exact matching. The results are comparable to those of some state-of-the-art models on the Robust04 dataset. However, on the GOV2 collection, the different matching strategies do not provide a clear improvement, as the documents in this collection differ from those in Robust04. In particular, documents in GOV2 are longer and contain most of the query terms (as shown by the analysis in section 6.5.3). The embedded representations of words are also used in neural matching models, in order to construct representations of the whole sequences being matched. In the next chapter, we present our work on text matching with neural models.

CHAPTER 7

NEURAL MODELS FOR SHORT TEXT MATCHING USING ATTENTION-BASED MODELS

7.1 Introduction

We define the nature of a text sequence according to its length, which can be short or long, and its usefulness, which can be providing or seeking information. Our objective in this chapter is twofold: first, to handle two different matching tasks according to their nature; second, to study and evaluate the impact of some word-level features, such as position and attention, when matching different sequences. To do so, we first define the nature of the matching task, based on the type of the inputs and the relationship between them. We distinguish two types of matching: symmetric matching and asymmetric matching. In symmetric tasks, such as document classification [19], we assume that the inputs are interchangeable. In asymmetric tasks, such as question-answer matching [185], the inputs are different and are not interchangeable. To handle these aspects, we propose a general architecture extending several state-of-the-art neural matching models using attention layers. In a second step, we argue that the combination of different word-level features, such as attention and position, may lead to better results. On the one hand, position-based models [247, 3] rely on the word position to compute representations of the input text. The positional information of the words, such as proximity, word dependencies, and the sequence structure, is important in learning text representations and can have an important impact on the matching process. On the other hand, attention-based representation models [26, 236, 157] learn coefficient vectors that put more attention on some words and designate the most important words in the input text, regardless of their position. Hence, we believe that it would be interesting for a representation learning model to have information about the most important words, as well as about their positions in the input text. In this chapter, we present our neural approach for text matching. We first motivate and detail the proposed models in section 7.2. We then describe the experimental process and the evaluation results in section 7.4.


Q: How African Americans were immigrated to the U.S.?

Puzzle: How? — Immigrated to?

A1: "African immigration to the United States refers to immigrants to the United States who are or were nationals of Africa."

Solution 1: How → ?; Immigrated to → "immigration to"

A2: "African American people are descendants of mostly West and Central Africans who were involuntarily brought to the United States by means of the historic Atlantic slave trade."

Solution 2: How → "by means of ... slave trade"; Immigrated to → "involuntarily brought to"

Figure 7.1: Solution 2 completely fills the missing part described in Q.

7.2 An asymmetry sensitive approach for neural text matching

Most proposed neural text matching models (section 4.3) are based on a Siamese [23] architecture, where both inputs undergo the same type of processing, regardless of the nature of the matching task. To take into account the differences between text matching tasks, we propose a framework that extends several state-of-the-art neural matching models, in order to enable them to handle the inputs in different ways w.r.t. the nature of the matching task. To do so, we use an attention-based model [31] to propose a matching architecture that gives more attention to the most important words in the input sequences. Subsequently, we extend several models from the state-of-the-art in the proposed framework.

7.2.1 The Asymmetry Aspect

In order to highlight the asymmetry aspect, we consider the question-answer matching task. Expressed in natural language, the question describes a specific user's information need. It can be seen as a puzzle where the missing pieces must be found and put together, in a logical way, to solve the problem. The pattern of each gap in a puzzle describes the corresponding missing part. It is the same for the question-answer matching problem: a pattern represents the missing information, and it can be filled by one or more answers. Not all answers are suitable; only those that conform to the pattern (the question) describing the missing part can correctly fill it. The example in figure 7.1, taken from the WikiQA dataset, illustrates this perception. In this figure, the words of the answers A1 and A2 that are in bold characters represent semantic and syntactic relatedness with the question Q. The underlined words represent the keywords (pattern clues) in Q and their corresponding matches in A1 and A2. In this example, to correctly answer the question Q, we have to focus on the terms "How" and "immigrated to", which describe the missing information.

Q is about how the immigration process was performed. The two answers A1 and A2 give two different solutions, solution 1 and solution 2 respectively. Based on semantic and syntactic relatedness, both answers match the question. However, one can notice that solution 2 solves the problem better than solution 1 does: the answer A2 contains the corresponding information. Hence, solution 2 represents the missing piece of the puzzle. This example shows that syntactic and semantic-based similarities are not enough to completely solve the problem. The question contains some words that require particular attention in order to retrieve the correct answer. The previous example highlights the asymmetry aspect of the question-answer matching task, as well as of similar problems such as document-query matching. Hence, we consider the following definitions:

• The symmetric matching task consists in identifying whether two texts are semantically similar, and concerns inputs of the same type, such as sentence completion [262, 3], document classification [19, 263], and paraphrase identification [168, 20, 264].

• The asymmetric matching task consists in determining whether a text provides the information sought in another text. The input texts are supposed to be of different natures, and their similarity is determined not only by their semantic and lexical links but also by their complementarity. This type of task mainly includes question-answer matching [157, 22], document-query matching [185, 21], and textual entailment1 [265].

7.2.2 The Asymmetry Sensitive Matching Architecture

The previous example in figure 7.1 illustrates the main idea behind the asymmetric matching tasks, and the need to take into account the importance of each of the keywords of a text sequence, beyond the semantic and lexical links between the input sequences being matched. In this work, we propose an approach that takes these aspects into account. Specifically, we use attention-based layers to allow the matching model to focus on the most important words in the input sequences, according to the nature of the task being addressed.

Model Description

We consider two generalized input sequences: s_i^{(Q)} = ⟨w_0^{(Q)} ... w_{|s_i^{(Q)}|}^{(Q)}⟩, which can be a query, a question or any input sequence to be compared with another sequence; and s_{ij}^{(D)} = ⟨w_0^{(D)} ... w_{|s_{ij}^{(D)}|}^{(D)}⟩, which can be a document, an answer sentence or any other sequence. Based on the general architecture that we previously described in figure 4.1, we define an extension framework where the model Φ is extended using attention layers in the function ϕ. This process is shown in figure 7.2, where we use Φ' instead of Φ and ϕ' instead of ϕ to refer to the extension of the original model Φ and to the extended representation layer, respectively. ϕ' is intended to process every input sequence in a different way according to the task being addressed.

1Inferring a directional relation between different text sequences.



Figure 7.2: A generalized neural matching framework for extending state-of-the-art models with an attention gating layer.

In figure 7.2, the greyed layer ω represents the extension attention layer. Hence, ϕ'(s_i^{(Q)}) and ϕ'(s_{ij}^{(D)}) are the representations corresponding to the input sequences s_i^{(Q)} and s_{ij}^{(D)}, respectively. The representation function ϕ' is defined based on the attention layer ω with an activation parameter. More precisely, for an input sequence s ∈ {s_i^{(Q)}, s_{ij}^{(D)}}, an activation parameter ρ^{(s)} is defined, and the corresponding representation ϕ'(s) is computed as in equation 7.1.

\varphi'(s) = \begin{cases} \psi \circ \omega(s) & \text{if } \rho^{(s)} = 1 \\ \varphi(s) & \text{otherwise} \end{cases}    (7.1)

where ϕ is a function used to compute the internal representations of the inputs, as defined in equation 4.3, and ψ is a representation function that combines the different word vectors e(w_t) of s to compute a corresponding initial representation. The ω function computes a weighted representation of the input sequence, and is defined in equation 7.2.

\omega(s) = \bar{\alpha}\, s = \left( e(w_1)\,\alpha_1, \ldots, e(w_{|s|})\,\alpha_{|s|} \right)    (7.2)

where w_t is the word at position t in s, and \bar{\alpha} = (\alpha_1, \ldots, \alpha_{|s|}) is the attention weight vector, where α_t is an attention weight computed as in equation 4.8. The matching part of the framework described in figure 7.2 remains the same. The whole extended model Φ' is trained end to end, where the corresponding parameters include both the parameters of the original model Φ (figure 4.1) and those of the added attention gating layer. Thus, the parameters of Φ' are θ_{Φ'} = {θ_Φ, θ_ω}, where θ_Φ and θ_ω are the parameters of the original model Φ and those of the attention layer ω, respectively. Finally, the relevance score of an input sequence s_{ij}^{(D)} w.r.t. another input s_i^{(Q)} is computed by the extended model Φ' defined as in equation 7.3.

\Phi' \stackrel{def}{=} g\left( \varphi'(s_i^{(Q)}), \varphi'(s_{ij}^{(D)}) \right)    (7.3)
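One possible realization of the gating layer ω in Keras is sketched below: a learned score per word, normalized with a softmax over the sequence positions, is used to re-weight the word embeddings e(w_t). This is only an illustration of the gating mechanism, not the exact attention scoring function of equation 4.8, and the vocabulary size, sequence length and embedding dimension are illustrative values.

```python
from tensorflow.keras import layers

def attention_gating(embedded):
    """Sketch of the omega layer of equation 7.2: score each word vector,
    normalize the scores over the sequence, and re-weight the embeddings."""
    scores = layers.Dense(1)(embedded)              # (batch, seq_len, 1)
    alphas = layers.Softmax(axis=1)(scores)         # attention weights alpha_t
    return layers.Lambda(lambda t: t[0] * t[1])([embedded, alphas])

# Asymmetric configuration "Q": the gate is applied to the question input
# only, while the answer input keeps its original embeddings.
question = layers.Input(shape=(20,), dtype="int32")
embedding = layers.Embedding(input_dim=50000, output_dim=300)
gated_question = attention_gating(embedding(question))
```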


[Figure content: (a) Extension of the MatchPyramid model [19]; (b) Extension of the CDSSM model [185].]

Figure 7.3: Extension of state-of-the-art models using the proposed attention-based framework.

The framework proposed in this section can be used to extend different state-of-the-art models. In the following paragraph, we show some neural text matching models that we extended, based on the framework of figure 7.2.

Extension of Some State-of-the-art Neural Models

In section 7.2.2, we described an extension framework that is applicable to most neural text matching models. We used generalized input sequences, s_i^{(Q)} and s_{ij}^{(D)}, to enable the inclusion of the maximum number of text matching models. However, the adaptation of this framework can vary from one model to another. This aspect is illustrated in figure 7.3 with two examples of extended neural text matching models, namely MatchPyramid [19], which is an interaction-focused model, and CDSSM [185], which is a representation-focused model. In figure 7.3a, we show our extended version of the MatchPyramid model [19]. In this case, the function ψ can be seen as a void function, where the inputs do not undergo any transformation. Hence, the matching part of this model uses the attention-based weighted word vectors instead of the original representations.

The rest of the model remains unchanged, with a set of convolution and pooling layers that compute matching signals at different abstraction levels. In figure 7.3b, we show the extension of the CDSSM model [185]. In this figure, the ψ function corresponds to the word-hashing function [16], where letter tri-grams are used as input instead of the sequence words. In this case, one can also consider the original words of a sequence instead of the n-gram letters, by treating ψ as a void function, as in figure 7.3a. Thus, the ω layer computes attention weights for the n-gram letters or for the original words. In both extended models of figure 7.3, the function g represents the interaction part of each model, whose objective is to compute features and intermediate results up to the final output score sc. We have also extended several other neural matching models in the same way, but we show only these two examples in figure 7.3. All the models we have extended, including those shown in figure 7.3, are evaluated and compared to their original versions (section 7.5.1). In the following, we describe the experimental process and the evaluation results of our approach applied to different models. First, in section 7.3 we describe the objective functions used for training. Then, in section 7.4, we describe the experimental setup and discuss the different results.

7.3 Models training

In order to train the different neural models we used two different objective functions. In particular, the rank hinge loss and the categorical cross entropy.

7.3.1 Rank Hinge Loss

The rank hinge loss (RHL) is an objective function used to train models for ranking tasks. This function is used in several text matching models [178, 266]. Given a sequence s and two other sequences s+ and s−, such that s+ is more similar to s and must be ranked better than s−, and given a trainable model with a set of learnable parameters θ, the ranking loss function L_RHL is defined as in equation 7.4.

\mathcal{L}_{RHL}(s, s^+, s^-; \theta) = \max\left(0,\, 1 - sc(s, s^+) + sc(s, s^-)\right)    (7.4)

where θ represents all the trainable parameters of the model being learned, and sc(s, s*) is the matching score predicted by the model for the input sentences s and s*.
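The sketch below expresses equation 7.4 in TensorFlow for a batch in which the positive and negative pair scores are computed separately; it is an illustration of the objective, not necessarily the exact MatchZoo implementation used in the experiments.

```python
import tensorflow as tf

def rank_hinge_loss(score_pos, score_neg, margin=1.0):
    """Equation 7.4 with a margin of 1: the positive pair (s, s+) must be
    scored at least `margin` higher than the negative pair (s, s-)."""
    return tf.reduce_mean(tf.maximum(0.0, margin - score_pos + score_neg))

# Toy usage with predicted scores for a batch of (positive, negative) pairs.
pos = tf.constant([0.9, 0.4])
neg = tf.constant([0.2, 0.6])
print(float(rank_hinge_loss(pos, neg)))
```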

7.3.2 Categorical Cross Entropy

The categorical cross entropy (CCE) is an objective function used to train models designed for classification tasks [267]. Formally, given two sequences s1 and s2, the objective of the cross entropy is to compute the probability that these sequences are similar or belong to the same category. This means that it assesses the probability ŷ that the event x = (s1 ≈ s2) occurs. The loss function L_CCE is defined in equation 7.5.

\mathcal{L}_{CCE}(y, \hat{y}; \theta) = - \sum_{x} y \log(\hat{y})    (7.5)


Parameter            Value (WikiQA)    Value (QuoraQP)             Description
num_iters            400               500                         Number of training epochs.
test_weights_iters   400               500                         Testing epoch.
text1 maxlen         20                20                          Maximum length of the generalized query input.
text2 maxlen         40                20                          Maximum length of the generalized document input.
batch_size           100               1024                        Number of samples in one training batch.
losses               rank hinge loss   categorical crossentropy    The objective function optimized during training.

Table 7.1: Descriptions and settings of the hyper-parameters of the MatchZoo [5] tool, in both WikiQA and QuoraQP datasets.

where ŷ is the predicted probability computed by the model, and y is the id of the true label corresponding to x. θ represents all the trainable parameters of the model being trained.
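For illustration, a minimal NumPy sketch of equation 7.5 for a batch of question pairs, assuming the model outputs a probability distribution over the two classes (duplicate / not duplicate) and y is given as a one-hot vector; the clipping constant is an implementation detail, not part of the thesis.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross entropy of equation 7.5, averaged over a batch.

    y_true: one-hot encoded true labels, shape (batch, n_classes).
    y_pred: predicted probabilities y_hat, shape (batch, n_classes).
    """
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -(y_true * np.log(y_pred)).sum(axis=1).mean()

# Two question pairs: the first is a duplicate (class 1), the second is not (class 0).
y_true = np.array([[0, 1], [1, 0]])
y_pred = np.array([[0.2, 0.8], [0.6, 0.4]])
print(categorical_cross_entropy(y_true, y_pred))
```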

7.4 Experiments

Experiments were performed2 using the MatchZoo [5] framework for neural text matching models. We used two datasets, the WikiQA dataset [4] and the QuoraQP dataset. We provided more details about the used datasets and the different tools in chapter 5. We adopted a cross-validation setup with 80% of the data for training, 10% for testing and 10% for validating the different models. As embeddings, we used the publicly available pre-trained 300-dimensional word vectors of GloVe3, trained on a common crawl dataset. All the evaluated neural models were trained using the rank hinge loss function (section 7.3.1) for 400 epochs on the WikiQA dataset, and the categorical cross entropy loss function (section 7.3.2) for 500 epochs on the QuoraQP4 dataset. For every model, we reported the performance at the end of the training epochs. Table 7.1 shows more details about the hyper-parameters related to MatchZoo [5] that are set to the same values regardless of the model being evaluated. Concerning the parameters of the baselines, we opted for the recommended hyper-parameter configurations, either from the corresponding papers or from MatchZoo5. Exceptionally, in the CDSSM model [185], we used embedded word vectors rather than the tri-letter hashing method [185] in order to assess the impact of the attention layer (equation 7.2) at the word-level, in the same way as in the other evaluated models.
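As a hedged sketch of the embedding setup described above, the following function loads the pre-trained GloVe vectors into an embedding matrix for a given vocabulary; `vocab` is an assumed word-to-index mapping, and the file name refers to the text file extracted from the archive cited in footnote 3 below.

```python
import numpy as np

def load_glove_embeddings(path, vocab, dim=300):
    """Build a |vocab| x dim embedding matrix from a GloVe text file.

    path:  e.g. 'glove.840B.300d.txt' (extracted from the archive in footnote 3).
    vocab: dict mapping word -> row index; words not found keep a zero vector.
    """
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if word in vocab and len(values) == dim:
                matrix[vocab[word]] = np.asarray(values, dtype=np.float32)
    return matrix
```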

2The corresponding code will be available on MatchZoo and public to allow the reproducibility of all the results we show in this chapter.
3http://nlp.stanford.edu/data/glove.840B.300d.zip
4The loss values of some of the models converged after more than 400 epochs in QuoraQP dataset.
5https://github.com/NTMC-Community/MatchZoo/tree/1.0/examples


7.5 Results and Analysis

In this section, we describe and analyze the different results of the proposed models in this chapter, separately, and compare the performance of each model to different baselines.

7.5.1 The Asymmetry Sensitive Approach Analysis
In this section, we evaluate and analyze the asymmetry-sensitive architecture that we propose for text matching. We use a set of labels to refer to every architecture configuration. The symmetric matching includes the basic model (Original) and the application of the attention layer ω (equation 7.2) to both model inputs simultaneously, which we refer to as (Q+A). The asymmetric matching includes applying the ω layer to one of the inputs at a time, the question only (Q) or the answer only (A). As in the previous section, we answer a set of research questions related to the text matching problems, and how they could be solved with our architecture.

RQ1: What is the architecture's impact on matching text sequences of different natures?
Table 7.2 shows the performance, in terms of MRR, P@1, P@3, nDCG@1 and nDCG@3, of the different evaluated neural models on the WikiQA dataset, using the original architecture of each model as well as the extended architectures. The latter are noted by adding the labels (Q), (A) and (Q+A) as described above. In this table, the values in bold characters represent the maximum performances. The symbols ▲ and ▼ represent, respectively, significant6 improvements and decreases in performance.

We note that, for almost all models and metrics, at least one of the asymmetric architectures, either (Q) or (A), outperforms its symmetric counterparts, namely (Q+A) and the original version. The exception is the KNRM model, where the symmetric extended version (Q+A) outperforms all its asymmetric counterparts as well as the corresponding original model. Indeed, in the KNRM model [180] the strongest matching signals in the word-level translation matrix are fed to a simple MLP structure. This kind of network does not enable handling the different inputs individually. Hence, the attention weights computed at the word-level for both input sequences, in the corresponding KNRM.(Q+A) extended model, gave better performance than the original KNRM does. Note that the improvements are significant for the MV-LSTM model. In table 7.2, the results show that taking into account the attention layer in (Q) or (A) allows a significant improvement in the performance of the different models, especially in terms of MRR and P@1, where the (Q) architecture of the ARC-II, MV-LSTM and MatchPyramid models outperforms the original architecture of each model, and the (A) architecture of the DUET and CDSSM models largely exceeds their original architectures.

In order to analyze more closely the results shown in table 7.2, we want to check for improvements in the different questions of the test set of the WikiQA

6T-test with p = 0.05.


Model                 MRR       P@1       P@3       nDCG@1    nDCG@3
ARC-I                 0.6053    0.4430    0.2517    0.4430    0.5642
ARC-I.(Q)             0.5977    0.4346    0.2461    0.4346    0.5529
ARC-I.(A)             0.5913    0.4093    0.2560    0.4093    0.5671
ARC-I.(Q+A)           0.5931    0.4093    0.2531    0.4093    0.5656
ARC-II                0.5708    0.3840    0.2489    0.3840    0.5410
ARC-II.(Q)            0.5748    0.4177    0.2419    0.4177    0.5357
ARC-II.(A)            0.5528    0.3544    0.2531    0.3544    0.5327
ARC-II.(Q+A)          0.5814    0.3924    0.2503    0.3924    0.5485
CDSSM                 0.5586    0.3671    0.2475    0.3671    0.5285
CDSSM.(Q)             0.5222    0.3333    0.2335    0.3333    0.4973
CDSSM.(A)             0.5886    0.4135    0.2461    0.4135    0.5490
CDSSM.(Q+A)           0.5622    0.3924    0.2405    0.3924    0.5266
DUET                  0.6259    0.4599    0.2714    0.4599    0.6016
DUET.(Q)              0.6314    0.4641    0.2742    0.4641    0.6116
DUET.(A)              0.6383    0.4852    0.2714    0.4852    0.6112
DUET.(Q+A)            0.5982    0.4262    0.2531▼   0.4262    0.5589▼
KNRM                  0.5208    0.3333    0.2264    0.3333    0.4788
KNRM.(Q)              0.5158    0.3038    0.2236    0.3038    0.4607
KNRM.(A)              0.5138    0.3249    0.2264    0.3249    0.4744
KNRM.(Q+A)            0.5282    0.3502    0.2321    0.3502    0.4942
MatchPyramid          0.6529    0.4726    0.2869    0.4726    0.6448
MatchPyramid.(Q)      0.6715    0.5063    0.2981    0.5063    0.6649
MatchPyramid.(A)      0.5575▼   0.3544▼   0.2531▼   0.3544▼   0.5381▼
MatchPyramid.(Q+A)    0.4698▼   0.2574▼   0.2110▼   0.2574▼   0.4211▼
MV-LSTM               0.6215    0.4388    0.2813    0.4388    0.6101
MV-LSTM.(Q)           0.6691▲   0.5021▲   0.2897    0.5021▲   0.6539▲
MV-LSTM.(A)           0.6174    0.4388    0.2714    0.4388    0.5984
MV-LSTM.(Q+A)         0.5904    0.4093    0.2517▼   0.4093    0.5562▼

Table 7.2: Comparison of the performance of the different evaluated models on the WikiQA dataset, using the different architectures.

dataset. Table 7.3 shows the performance of the different neural models that we extended and evaluated, using the corresponding different architectures, compared to the classical BM25 and LM models. In this table, the extension '.ω' refers to the application of the ω attention layer (equation 7.2) with the corresponding model, as described in the examples in figure 7.3. W/L/T represent the number of improved (Win), decreased (Loss) and unchanged (Tie) questions, respectively, for each model compared to its Original architecture. The symbols ▲ and ▼ represent, respectively, significant improvements and decreases in performance.

In table 7.3, we find that the MAP is better for at least one of the asymmetric architectures for all neural models, allowing better performance compared to the classical models. Regarding statistics on the different test questions, we note that the number of questions improved by the architectures (Q) and (A) is larger than with the architecture (Q+A); this is true for the different evaluated models. Note that these results strongly support our hypothesis regarding the impact of asymmetric architectures on question-answer asymmetric matching tasks. In addition, the performances obtained with the asymmetric model (Q)-MatchPyramid.ω are the best for this dataset7, while the number of

7(Q)-MatchPyramid.ω also outperforms all methods reported by MatchZoo at https://github.com/NTMC-Community/MatchZoo.


Class                            Model             MAP       W/L/T
Classical Models                 BM25              0.5762    -/-/-
                                 LM                0.5932    -/-/-
Neural Models
  Symmetric — Original           ARC-I             0.5792    -/-/-
                                 ARC-II            0.5606    -/-/-
                                 CDSSM             0.5473    -/-/-
                                 DUET              0.6113    -/-/-
                                 KNRM              0.5093    -/-/-
                                 MatchPyramid      0.6436    -/-/-
                                 MV-LSTM           0.6046    -/-/-
  Symmetric — (Q+A)              ARC-I.ω           0.5829    57/61/119
                                 ARC-II.ω          0.5648    66/66/105
                                 CDSSM.ω           0.5523    85/76/76
                                 DUET.ω            0.5801▼   64/76/97
                                 KNRM.ω            0.5198    83/77/77
                                 MatchPyramid.ω    0.4697▼   56/129/52
                                 MV-LSTM.ω         0.5562    64/83/90
  Asymmetric — (Q)               ARC-I.ω           0.5792    37/45/155
                                 ARC-II.ω          0.5595    42/72/123
                                 CDSSM.ω           0.5134    76/91/70
                                 DUET.ω            0.6158    69/58/110
                                 KNRM.ω            0.4982    76/71/90
                                 MatchPyramid.ω    0.6591    47/42/148
                                 MV-LSTM.ω         0.6507▲   70/41/126
  Asymmetric — (A)               ARC-I.ω           0.5815    58/61/118
                                 ARC-II.ω          0.5439    60/81/96
                                 CDSSM.ω           0.5779    78/75/84
                                 DUET.ω            0.6251    68/53/116
                                 KNRM.ω            0.5015    77/76/84
                                 MatchPyramid.ω    0.5502▼   43/97/97
                                 MV-LSTM.ω         0.6165    68/67/102

Table 7.3: Comparison of the performances, in terms of MAP, of the symmetric and asymmetric matching of the different neural models compared to classical models, on the WikiQA collection.

improved queries is greater for the CDSSM model with the (Q+A) architecture. In table 7.3, we note that when the numbers of improved and deteriorated questions are very close, the difference between the performance of the original model and its corresponding extended architecture, such as (Q), (A) or (Q+A), is not significant. In order to get more explanation of these results, we also analyzed the questions whose results were improved or deteriorated, to see if there is a correlation with the type of question (who, when, where...). Hence, another research question needs to be answered at this point. This question concerns the correlation between the type of questions addressed by the models and the improvements obtained with their extended architectures.

RQ2: Is there any correlation between the question types and the architecture that gets better improvement?
In order to answer this question, we first analyze the different questions in the WikiQA test dataset. There are five different question types: who, when, what, where, and how. Figure 7.4 shows the distribution of the different types over


[Figure 7.4 data: who: 34, when: 15, what: 130, where: 36, how: 22 — total: 237 test questions.]

Figure 7.4: Distribution of the different question types in the test dataset of WikiQA.

the total of 237 test questions. In this figure, we notice that almost 55% of the test questions are what questions. The remaining 45% is distributed among the other four types, and only 15 questions are of type when. Given the non-uniform distribution of the different question types in the test corpus of the WikiQA dataset, as described in figure 7.4, it is not obvious to figure out the type of questions for which every model is more effective. For instance, if a model improves 10 questions of type who and 10 questions of type what, we cannot determine for which of the two types this model is more efficient. Hence, we analyzed the W/L/T variations of the performance, in terms of MAP8, of the different extended models compared to their original counterparts, for each type of question. The objective of this analysis is to find out, for every extended model, the type of questions on which it is most effective. Table 7.4 shows the W/L/T proportions of every extended model. In this table, values in bold characters correspond to the maximum Win (W) value per model. In table 7.4, we notice that every model extension is more effective on one or more question types than its corresponding counterparts. For instance, ARC-II (Q) is more effective on where questions, its counterpart ARC-II (A) is more effective on how questions, and finally ARC-II (Q+A) is more effective on who questions. This behaviour is also noticeable with the other models, in particular DUET, KNRM, and MV-LSTM, compared to their corresponding extensions. Based on this, we approximated the results of an ideal model that can automatically adapt its configuration, either (Q), (A) or (Q+A), depending on the type of question being processed. This corresponds to the oracle version of every evaluated model, whose performances are given in table 7.5. Formally, given a model Φ whose extensions are Φ(Q), Φ(A) and Φ(Q+A), the performances, in terms of an evaluation measure M, of the corresponding oracle model Φoracle

8We only considered the MAP measure since it gives an overall evaluation over the whole test set of questions. However, one can consider another measure, such as nDCG@k or precision, for the same analysis.


Proportions (%)        where              how                who                when               what
Model                  W    L    T        W    L    T        W    L    T        W    L    T        W    L    T
ARC-I (Q)              9.1  22.7 68.2     16.7 33.3 50.0     14.7 8.8  76.5     26.7 6.7  66.7     15.4 18.5 66.2
ARC-I (A)              31.8 13.6 54.5     30.6 30.6 38.9     14.7 17.6 67.6     33.3 13.3 53.3     23.1 30.0 46.9
ARC-I (Q+A)            27.3 18.2 54.5     27.8 33.3 38.9     17.6 14.7 67.6     40.0 13.3 46.7     22.3 29.2 48.5
ARC-II (Q)             27.3 18.2 54.5     19.4 33.3 47.2     17.6 8.8  73.5     13.3 53.3 33.3     16.2 34.6 49.2
ARC-II (A)             27.3 18.2 54.5     33.3 47.2 19.4     26.5 17.6 55.9     26.7 20.0 53.3     22.3 39.2 38.5
ARC-II (Q+A)           27.3 27.3 45.5     27.8 27.8 44.4     32.4 11.8 55.9     26.7 33.3 40.0     26.9 31.5 41.5
CDSSM (Q)              31.8 31.8 36.4     38.9 41.7 19.4     20.6 35.3 44.1     46.7 26.7 26.7     31.5 40.8 27.7
CDSSM (A)              54.5 27.3 18.2     38.9 36.1 25.0     17.6 29.4 52.9     33.3 26.7 40.0     30.8 33.1 36.2
CDSSM (Q+A)            50.0 27.3 22.7     38.9 27.8 33.3     23.5 17.6 58.8     33.3 40.0 26.7     36.2 36.9 26.9
DUET (Q)               36.4 18.2 45.5     33.3 30.6 36.1     26.5 14.7 58.8     26.7 33.3 40.0     27.7 25.4 46.9
DUET (A)               40.9 18.2 40.9     30.6 33.3 36.1     26.5 17.6 55.9     40.0 20.0 40.0     25.4 21.5 53.1
DUET (Q+A)             31.8 22.7 45.5     36.1 38.9 25.0     26.5 23.5 50.0     26.7 40.0 33.3     23.8 33.1 43.1
KNRM (Q)               45.5 13.6 40.9     44.4 27.8 27.8     29.4 20.6 50.0     13.3 40.0 46.7     29.2 34.6 36.2
KNRM (A)               36.4 31.8 31.8     22.2 33.3 44.4     35.3 23.5 41.2     40.0 20.0 40.0     33.1 35.4 31.5
KNRM (Q+A)             40.9 18.2 40.9     41.7 25.0 33.3     35.3 29.4 35.3     20.0 46.7 33.3     33.8 36.2 30.0
MatchPyramid (Q)       36.4 4.5  59.1     22.2 16.7 61.1     17.6 17.6 64.7     6.7  33.3 60.0     18.5 18.5 63.1
MatchPyramid (A)       31.8 36.4 31.8     19.4 55.6 25.0     8.8  38.2 52.9     0.0  53.3 46.7     20.0 36.9 43.1
MatchPyramid (Q+A)     31.8 54.5 13.6     16.7 63.9 19.4     17.6 52.9 29.4     13.3 66.7 20.0     26.9 50.8 22.3
MV-LSTM (Q)            27.3 18.2 54.5     25.0 16.7 58.3     20.6 17.6 61.8     26.7 20.0 53.3     33.8 16.9 49.2
MV-LSTM (A)            31.8 22.7 45.5     36.1 30.6 33.3     20.6 26.5 52.9     20.0 46.7 33.3     29.2 26.9 43.8
MV-LSTM (Q+A)          22.7 27.3 50.0     38.9 30.6 30.6     14.7 32.4 52.9     20.0 46.7 33.3     28.5 36.9 34.6

Table 7.4: Comparison of the W/L/T variations of the performance, in terms of MAP, of the extended models compared to their original counterparts, w.r.t. every question type in the WikiQA dataset.

are estimated as in equation 7.6.

M(Φoracle) ≥ max{M(Φ), M(Φ(Q)), M(Φ(A)), M(Φ(Q+A))}     (7.6)

where M(x) gives the performance value of the model x according to the evaluation measure M. In table 7.5, the underlined values refer to the best results achieved by the models in their original versions. The values in bold characters refer to the best performance w.r.t. every measure, and the symbol ▲ indicates the significance of the improvements according to the t-test. In this table, we notice that all the oracle versions perform better than the original models, with significant improvements. These results approximate the best results that could be produced by a model that can automatically adapt its configuration, specifically to be asymmetric ((Q) or (A)), symmetric ((Q+A)), or to stay in the original one, according to the nature of the matching task being addressed. Constructing this model is the aim of future work.
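To make the oracle estimation of equation 7.6 concrete, the sketch below builds an oracle run by choosing, for each question type, the configuration (Original, (Q), (A) or (Q+A)) with the best mean average precision on that type; the per-question AP values and question types are assumed inputs, and the exact estimation procedure used in the thesis may differ in its details.

```python
import numpy as np

def oracle_map_by_type(ap_per_config, question_types):
    """Approximate the oracle of equation 7.6.

    ap_per_config:  dict mapping a configuration name ('Original', 'Q', 'A',
                    'Q+A') to an array of per-question average precision values.
    question_types: list of question types ('who', 'when', ...), aligned with
                    the AP arrays.
    """
    types = np.asarray(question_types)
    oracle_ap = np.zeros(len(types))
    for qtype in set(types):
        mask = types == qtype
        # pick the configuration that works best on this question type
        best = max(ap_per_config, key=lambda c: ap_per_config[c][mask].mean())
        oracle_ap[mask] = ap_per_config[best][mask]
    return oracle_ap.mean()
```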

RQ3: What is the impact of the attention layer on the behaviour of the different neural models?
To answer this question, we first analyze the behaviour of the different neural models, with and without the attention layer, w.r.t. the ranking of the different candidate answers of a given question from the dataset. Next, we analyze the attention weights computed by this layer for every word in an example question and/or answer. Let us consider the following question from the WikiQA dataset, for which there are 10 candidate answers, only 4 of which are relevant.
q: "What happened to George O'malley on grey's anatomy?"
In figure 7.5, we represent the ranking of the 10 candidate answers using a list of 10 squares. Each square represents one of the candidate answers. Relevant answers are represented by opaque squares. The objective is to push the relevant


Version     Model             MAP       P@1       P@3       nDCG@1    nDCG@3    MRR
Original    ARC-I             0.5879    0.4430    0.2517    0.4430    0.5642    0.6053
            ARC-II            0.5606    0.3840    0.2489    0.3840    0.5410    0.5708
            CDSSM             0.5473    0.3671    0.2475    0.3671    0.5285    0.5586
            DUET              0.6113    0.4599    0.2714    0.4599    0.6016    0.6259
            KNRM              0.5093    0.3333    0.2264    0.3333    0.4788    0.5208
            MatchPyramid      0.6443    0.4726    0.2869    0.4726    0.6448    0.6529
            MV-LSTM           0.6046    0.4388    0.2813    0.4388    0.6101    0.6215
Oracle      ARC-I.ω           0.6575▲   0.5148▲   0.2883▲   0.5148▲   0.6556▲   0.6737▲
            ARC-II.ω          0.6701▲   0.7089▲   0.3024▲   0.5190▲   0.6764▲   0.6854▲
            CDSSM.ω           0.7180▲   0.5992▲   0.3136▲   0.5992▲   0.7220▲   0.7349▲
            DUET.ω            0.7354▲   0.6203▲   0.3136▲   0.6203▲   0.7350▲   0.7485▲
            KNRM.ω            0.6629▲   0.5190▲   0.2939▲   0.5190▲   0.6525▲   0.6784▲
            MatchPyramid.ω    0.7726▲   0.6456▲   0.3375▲   0.6456▲   0.7810▲   0.7862▲
            MV-LSTM.ω         0.7569▲   0.6540▲   0.3263▲   0.6540▲   0.7677▲   0.7778▲

Table 7.5: Results of the oracle version of every evaluated model, compared to the corresponding original ones.

[Figure 7.5: for each of ARC-II, CDSSM, DUET, MatchPyramid and MV-LSTM, in their Original, (Q) and (A) versions, the 10 candidate answers are shown as squares ordered by predicted rank, with relevant answers (1) opaque and irrelevant ones (0) transparent, next to the Ideal rank.]

Figure 7.5: Ranking of the different candidate answers corresponding to the question example, by some of the evaluated models, in their different versions.

answers to the left of the list in order to return them first, which corresponds to the Ideal rank result. In figure 7.5, we considered some of the evaluated models, without and with the attention layer ω (equation 7.2) used to construct the different extensions (Q) and (A). We consider the previous question q, and the answer a, which is one of the corresponding relevant answers and which is returned by all the neural models considered in figure 7.5. For both q and a, we give the corresponding list of keywords indexed in MatchZoo:
q: "What happened to George O'malley on Grey's anatomy?"
keywords: happened, george, o'malley, grey, 's, anatomy.
a: "In 2007, Knight's co-star (Preston Burke) insulted him with a homophobic slur, which resulted in the termination of Washington's Grey's Anatomy contract."
keywords: knight, 's (1), isaiah washington (1), preston, burke, resulted, termination, washington (2), 's (2), grey, 's (3), anatomy, contract.
For each of the neural models considered in figure 7.6, we retain the asymmetric


(a) Weights of the keywords of the question q computed by the attention layer ω of each model with its corresponding asymmetric architecture (Q). (b) Weights of the keywords of the answer a computed by the attention layer ω of each model with its corresponding asymmetric architecture (A).

Figure 7.6: Comparison of the importance weights computed by the attention layer of the asymmetric architectures of three different models.

architecture with which the model performed best in the previous analyses, and then investigate the importance weights computed by each model for the keywords of the question q and the answer a. Figure 7.6 shows the different computed values. The objective is to analyze how the attention layer weights the different words according to their real (literal) importance in the corresponding sentence.

In figure 7.6a, the keywords happened and george are highly weighted in the question q by all three models, which is consistent with the objective of q. Indeed, the question is about what happened to george. In figure 7.6b, the keywords contract, resulted and termination have acquired the highest weights. These words represent the core information provided by the answer a. These results show the contribution of the attention layer in weighting the words of the question/answer according to their real importance.

RQ4: What is the impact of the asymmetric architecture on the matching of texts of the same nature?
So far, we have analyzed the contribution of the architecture we propose (section 7.2.2) in the case of asymmetric matching tasks, such as question-answer matching. We also want to analyze the interest of our approach in the case of symmetric matching tasks. We consider the paraphrase identification task. Specifically, using the QuoraQP dataset (section 5.2), we analyze the performance of our model in detecting similar question pairs. In figure 7.7, we show the performances, in terms of accuracy, of different extended neural models while matching question pairs in the QuoraQP dataset. In this figure, we find that, as expected, most of the models do not benefit from the asymmetric architectures, specifically (A) and (Q). This is due to the symmetric nature of this matching task. The exception is the CDSSM model, for which the performances of the asymmetric architectures, (A) and (Q), substantially exceed those of the original version. Indeed, the CDSSM model uses a convolution layer to take into account the context [185] of a word, and the attention weights computed



Figure 7.7: Performance, in terms of accuracy on the QuoraQP dataset, of different extended neural models using different architectures.

exclusively for one of the inputs at a time make it possible to inform the model about the most important words of each context, thus improving its performance. Similarly, in the case of the MV-LSTM model, where words are handled depending on their positions [3], the attention weights are used to inform the model about the importance of the word at each position. For the rest of the models, the asymmetric architectures do not have a significant impact on their results, and the corresponding original architectures, ARC-II, DUET, and MatchPyramid, perform better than their extended counterparts.

Given the results shown in figure 7.7, we want to analyze the behaviour of our asymmetric matching architecture in the symmetric matching task. To do so, we consider the following three example questions from the QuoraQP dataset:
q1: "Are we living in a simulation?"
q2: "Do we have any proof that we live in a simulation?"
q3: "Is there a possibility that we actually are existing in a programmed platform?"
Here, q1 is similar to q2 (Label = 1), but not similar to q3 (Label = 0). In table 7.6 we present the similarity scores computed by the different models using each of the different architectures. In table 7.6, opaque green cells refer to an improvement of the similarity score of the positive pair (q1, q2) with the corresponding model architecture, compared to the original architecture of that model. We note that the score of the question q2 is improved in the asymmetric architectures (Q) and (A) as well as in the symmetric architecture (Q+A). In particular, in the case of the ARC-II and CDSSM models, the original architecture assigns inappropriate scores to the pairs (q1, q2) and (q1, q3); the attention layer has allowed the models to compute similarity scores that correctly distinguish the positive pair (q1, q2) from the negative one (q1, q3). To verify these results, we computed the attention weights assigned by the ω layer, applied to q1 in the


Question Pair   Label   Architecture   Predicted score sc(qi, qj)
                                       ARC-II    CDSSM     DUET      MatchPyramid   MV-LSTM
(q1, q2)        1       Original       0.0464    0.3694    0.7067    0.1002         0.5658
(q1, q3)        0       Original       0.9641    0.3694    0.1416    0.0131         0.1447
(q1, q2)        1       (Q+A)          0.3621    0.2092    0.6211    0.9084         0.6783
(q1, q3)        0       (Q+A)          0.3464    0.2917    0.0840    0.0456         0.0150
(q1, q2)        1       (Q)            0.2638    0.7184    0.6946    0.7736         0.7060
(q1, q3)        0       (Q)            0.0024    0.2727    0.1576    0.1612         0.0162
(q1, q2)        1       (A)            0.1530    0.6922    0.8057    0.6116         0.6600
(q1, q3)        0       (A)            0.0462    0.2208    0.4703    0.1055         0.2225

Table 7.6: The similarity scores assigned to a pair of similar questions (q1, q2) and a pair of non-similar questions (q1, q3) by the different neural models within the symmetric and asymmetric architectures.

architectures (Q) and (Q+A), and those computed for the other two questions, q2 and q3, in the architectures (A) and (Q+A). Figure 7.8 shows the attention weights assigned to each of the terms of these questions by the different models. In figure 7.8, we notice that, for each model, whether the symmetric (Q+A) or the asymmetric (Q) and (A) architectures are used, the attention layer has the same impact on the computation of the importance weights of the different words of the three questions q1, q2 and q3. In addition, for the models that compute incorrect similarity scores (table 7.6), such as ARC-II and CDSSM, the attention layer allows the model to focus on the keywords of the different inputs. Therefore, these models produced more suitable similarity scores, using the asymmetric architectures (A) and (Q), as shown in table 7.6.

RQ5: How effective is our approach compared to state-of-the-art models that are not implemented in MatchZoo?
We would like to compare our results to other models that have already achieved good results on the datasets used in these experiments. Hence, we considered models external to the MatchZoo tool, and compared the best performances that we obtained in our experiments, with the asymmetric/symmetric architectures, to those published in the papers corresponding to the considered baselines. These baselines are:

• SUM_BASE;PTK [268] is a kernel-based model, combining intra and cross word similarities between text pairs. The intuition behind this model is that similar questions are likely to require similar answer patterns. Hence, the model computes similarities between different training pairs, and uses them as additional features.

• PairwiseRank + SentLevel [231] is an approach for answer selection in question answering. This method directly exploits existing pointwise neural network models, by extending the Noise Contrastive Estimation approach with a triplet ranking loss function, in order to exploit interactions in triplet inputs containing a question paired with positive and negative examples.

Table 7.7 shows the performance of our approach in terms of MAP and MRR. Internal models are the models implemented in MatchZoo; for each model,


[Figure 7.8 panels: (a) q1, (b) q2 and (c) q3 — keyword weight proportions for ARC-II, CDSSM, DUET, MatchPyramid and MV-LSTM in their (Q+A), (Q) and (A) architectures; highly weighted keywords include living and simulation for q1, and live, proof and simulation for q2.]

Figure 7.8: Proportion (%) of the importance weights of the keywords of three different questions, from the QuoraQP dataset, computed by the attention layer of the extended models.


Class                          Model                             MAP       MRR
Internal (extended models)     ARC-II (Q)                        0.5595    0.5748
                               CDSSM (A)                         0.5779    0.5886
                               DUET (A)                          0.6251    0.6383
                               MatchPyramid (Q)                  0.6591    0.6715
                               MV-LSTM (Q)                       0.6507    0.6691
External                       SUM_BASE;PTK [268]                0.7559    0.7700
                               PairwiseRank + SentLevel [231]    0.7010    0.7180

Table 7.7: Comparison of the results of our approach, using Internal models from MatchZoo, against External models.

we consider the asymmetric architecture that gave the best results, specifically (Q) or (A). The values in bold characters show the best performance for each measure. External models represent state-of-the-art models that are not implemented in MatchZoo. In this table, we note that the external model SUM_BASE;PTK gives better results than the other models. In addition, the PairwiseRank + SentLevel model gives better results than the different Internal models with the corresponding architectures. Future work will focus on evaluating our asymmetric approach with these models in order to obtain new state-of-the-art results.

7.5.2 Position VS Attention

In this section, we aim at analyzing the main impact of the word-level attention weights combined with the position features of the different words. To do so, we consider MV-LSTM [3], which is a position-based model for sequence matching, and which we extended and evaluated in the previous analysis. We aim to answer two main research questions:

RQ6: How does the position-based model behave with/without the attention weights?

Table 7.8 shows the first 3 answers corresponding to one sampled question from the WikiQA dataset. The considered question is "Where do crocodiles live?"; there are 21 possible answers in the WikiQA dataset, of which only one is relevant (label = 1). In this table, we show how these answers are ranked by the proposed architectures of the MV-LSTM model. The number between brackets (.) corresponds to the retrieval rank according to every model. Note that the answers retrieved by the MV-LSTM model [3], as well as those retrieved by the different MV-LSTM architectures, deal with the same subject as the question. However, the correct answer is ranked better by the MV-LSTM (Q) model, compared to the ranks obtained with the MV-LSTM (A) and MV-LSTM (Q+A) models. This example shows the advantage of focusing on the input question words.


Predicted scores and rank values (.) per architecture:

Answer 1 (true label 1): "Crocodiles (subfamily Crocodylinae) or true crocodiles are large aquatic tetrapods that live throughout the tropics in Africa, Asia, the Americas and Australia."
  MV-LSTM: (3) 0.3646    (Q): (1) 0.9003    (A): (2) 0.2871    (Q+A): (3) 0.3640

Answer 2 (true label 0): "Crocodiles have more webbing on the toes of the hind feet and can better tolerate saltwater due to specialized salt glands for filtering out salt, which are present but non-functioning in alligators."
  MV-LSTM: (1) 0.8989    (Q): (2) 0.8826    (A): (1) 3.5154    (Q+A): (1) 2.0885

Answer 3 (true label 0): "Also when the crocodile 's mouth is closed, the large fourth tooth in the lower jaw fits into a constriction in the upper jaw."
  MV-LSTM: (2) 0.6290    (Q): (4) -1.1083    (A): (4) -0.2245    (Q+A): (6) -0.2575

Answer 4 (true label 0): "They are carnivorous animals, feeding mostly on vertebrates such as fish, reptiles, birds and mammals, and sometimes on invertebrates such as molluscs and crustaceans, depending on species and age."
  MV-LSTM: (6) -2.0848    (Q): (3) -0.6799    (A): (3) 0.2831    (Q+A): (2) 0.5409

Table 7.8: Ranking of the first answers retrieved by the MV-LSTM model, among the 21 possible answers corresponding to the question "Where do crocodiles live?" in the WikiQA dataset, compared to the extended versions using attention features.

RQ7: What is the gain of the attention weights combined with the position features in a classification task?
Figure 7.9 shows the comparison of the accuracy evolution during the training and validation phases of the different architectures of the proposed model MV-LSTM and the MV-LSTM baseline, on the QuoraQP dataset. The red line on the graphics corresponding to MV-LSTM (A), MV-LSTM (Q) and MV-LSTM (Q+A) shows the accuracy value of the MV-LSTM model [3] after training. Note that all MV-LSTM architectures evolve in the same way and reach higher accuracy values compared to the MV-LSTM performance. MV-LSTM (Q) outperforms the MV-LSTM baseline by 3.8% in terms of accuracy. The attention-based weights have the same effect in the different MV-LSTM architectures on the QuoraQP dataset, because it is a dataset for a symmetric matching task, where the inputs are of the same nature and the task consists of identifying whether an input question is a duplicate of another question or not. So, the comparison process considers inputs with approximately the same length and the same structure.

7.6 Conclusion

In this chapter, we presented our contributions to the neural short text matching state-of-the-art. The experimental results support our hypothesis concerning the impact of considering the asymmetry aspect of several text matching tasks in the architecture of the matching model, and showed the advantage of combining the position features and the word-level attention weights. First, we highlighted the asymmetry aspect of several text matching tasks, and then defined a generalized architecture enabling the model to process the input text sequences w.r.t. their nature.


[Figure 7.9 panels: training/validation accuracy vs. training epochs for MV-LSTM (A), MV-LSTM (Q) and MV-LSTM (Q+A), compared to the MV-LSTM baseline accuracy.]

Figure 7.9: Comparison of the validation and test accuracy values during training on the QuoraQP dataset, of the MV-LSTM [3] model with/without the attention layers.

In a second step, we extended several well-known text matching models and built their corresponding asymmetric architectures. We performed experiments on two different datasets, considering question-answer matching, which is an asymmetric task, and paraphrase identification, which is a symmetric matching task. The results we obtained are promising in addressing the asymmetry aspect thanks to the attention layer. In the case of symmetric tasks, we saw that the asymmetric architecture did not have a significant impact on the performance of some models (RQ4). The attention layer corrected the similarity scores computed by models based on an MLP architecture, such as the CDSSM model [185], and by the MV-LSTM model [3], which is based on positional features of words. In this model, the attention coefficients combined with the position information helped improve the results by more than 3.8% in terms of accuracy. Finally, the analysis (RQ2) that we performed concerning the question types revealed that there is an important correlation between the type of a question and the extended models' performances. Besides, the estimated oracle results outperform those of all the evaluated matching models. Hence, future work will focus on a model that, on the one hand, handles the question type automatically and, on the other hand, takes into account the symmetric/asymmetric aspect of the matching task.

CHAPTER 8

ATTENTION-BASED MULTI-LEVEL RELEVANCE ASSESSMENT

8.1 Introduction

Dealing with the document length is one of the most important challenges in IR [269]. The main problem is that long documents may cover different topics, while only some parts are relevant to the query. This may complicate the matching task and decrease the performance of current neural IR models. To deal with this issue, common neural IR models define a maximum document length in advance and pad shorter documents [21]. However, cutting off the text that exceeds this length could lead to information loss. Pang et al. [21] showed that the truncated part of a long document usually contains important matching information. To highlight this aspect, we consider query number 253 of the AP TREC dataset. For this query, there are six relevant documents, of which the longest contains more than 600 words and the shortest contains about 250 words. Hence, it becomes difficult to set a maximum length for all documents in this dataset for a neural matching model. Figure 8.1 shows an explicit description1 of the relevance signal distribution in a query-document matching matrix, corresponding to the query 253 (y axis), containing three words, and the longest relevant document (x axis), containing more than 600 words. In this figure, note that the strongest matching signals, corresponding to similarity values close to max = 1.0 and highlighted with the white rectangles, are located after the 250th (around the 270th) word, and several other strong matching signals appear in the second half of the document (300th to 600th word). This example highlights the fact that query-document matching information can be lost if the document content is truncated. Another study is provided by Fan et al. [27], where the authors discussed the diverse relevance granularity in a long document, which could be relevant as a whole or include several relevant passages or paragraphs.

1This example is realized by using the word-word cosine similarities, between all the words of a TREC query and all the words of one of the corresponding relevant documents in the AP dataset.



Figure 8.1: Example of a matching signals distribution over a relevant long document w.r.t. a query of three words.

In this chapter, we first revisit the question of whether passage-based methods could help neural models to overcome the document length problem or not. Several passage-based models [28, 29, 30] have already been proposed in IR to rank long documents. The main limitation of the previously proposed methods is that passage-level signals are combined in such a simple way that higher-level relevance patterns are not considered. For instance, the best-passage approach [29] ranks the whole document based on its most relevant passage, while the linear combination-based model [28] merely aggregates the scores of the different passages in a linear function. These methods cannot capture complex relevance information, such as the interaction between passages, which could help to figure out whether the document is truly relevant or not w.r.t. the query. Differently from previous work [29, 28, 27], we assume that some passages of a long document are likely to be more relevant than others and need to be considered differently, and hence we put more attention on them. We focus on the study of two main aspects: (1) the impact of considering a document's passages, in order to capture and combine the relevant parts of a long document and solve the document length problem; (2) the impact of putting more attention on some passages of the same document and their component words, as well as on the words of the query.

In the following, we first (section 8.2) define a multi-level assessment framework based on attention layers, at both the word and passage levels, to extend several neural matching models from the state-of-the-art. The input document is represented with a set of passages that are likely to be relevant. Furthermore, (section 8.3) we propose our attention-based multi-level matching model (AM3), where we consider the document relevance at three different levels, namely the word-level, the passage-level and the document-level.


8.2 A Multi-level Attention-based Architecture

In this section, we present a multi-level attention-based approach to assess a document's relevance. The main goal is to evaluate the impact of putting more attention on some words of a query or a document, and on some passages of the document. We present in the following a general framework that can be used to extend several neural models, as described in figure 8.2. First, we introduce the document representation, then we describe our multi-level attention approach.

8.2.1 Passage-based Document Representation
To deal with the document length problem in neural IR models, we propose to use a set of passages of a long document to construct its representation. Specifically, every document in the dataset can be segmented into passages of k words and represented by D = {P1, ..., Pl}, which is a set of l different passages from this document, where every passage Pi =def < w_ij^(D) ... w_i,j+k−1^(D) > is a sequence window of k words, and w_ij^(D) is the word at position j of the passage Pi. This representation can be used with different neural matching models from the state-of-the-art. In the following, we describe how attention weights can be used to highlight the different passages and their component words in a document, as well as the query words.
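A minimal sketch of the passage-based representation D = {P1, ..., Pl} described above, using a simple whitespace tokenization and non-overlapping windows of k words; note that in the experiments of section 8.4 the passages are actually retrieved by the INDRI search engine rather than produced by this naive segmentation.

```python
def split_into_passages(document, k):
    """Segment a document into passages of k words each (D = {P1, ..., Pl}).

    document: the document text, tokenized here by a simple whitespace split.
    k:        the passage length in words; the last passage may be shorter.
    """
    words = document.split()
    return [words[j:j + k] for j in range(0, len(words), k)]

passages = split_into_passages(
    "crocodiles are large aquatic tetrapods that live throughout the tropics", 5)
# [['crocodiles', 'are', 'large', 'aquatic', 'tetrapods'],
#  ['that', 'live', 'throughout', 'the', 'tropics']]
```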

8.2.2 The Multi-level Attention
We assume that, in a long document, some passages are more relevant than others for a given query. Therefore, we need to consider them differently. In addition, the words of every text (query and passages) are not of the same importance, and we need to put more attention on the most important ones. To do so, we propose to extend several state-of-the-art neural text matching models by adding attention layers. More specifically, given a matching model Φ of the general framework defined in chapter 4 (figure 4.1), where the inputs s_i^(Q) and s_ij^(D) refer to the query Q and a passage of a document D, respectively, Φ can be extended (to construct a model Φ') using attention layers. These layers can be applied at the word-level and/or at the passage-level, corresponding to the layers ω and ωp, respectively, in figure 8.2. In this figure, we consider a query of three words and passages of three words each. These words are weighted by the layer ω as defined in equation 7.2. At the passage-level, the layer ωp computes attention weights for the different passages after aggregating their word vectors, as defined in equation 8.1.

ωp(P1, ..., Pl) = α^(P) φ'(D)     (8.1)

where α^(P) = [α_1^(p), ..., α_l^(p)] is the attention weight vector corresponding to the different passages in the input document D. The parameter α_i^(p) is the attention weight corresponding to the passage Pi, computed using equation 7.2.

The passage-level attention layer ωp can be applied at two different places in the passage-level of the general architecture, as shown in figure 8.2. Before the


Figure 8.2: Extension of the unified neural matching model of figure 4.1, using attention layers applied at several levels, in order to focus on the most important matching signals at each level.

Figure 8.3: Architecture of our AM3 model. The document has three passages, and all the passages and the query have three words. A bi-LSTM layer is used in the document-level.

matching layer, ωp puts attention on the passages based only on their content information, independently of the query. After the matching layer, ωp puts more attention on some passages based on the lower-level interaction features between the query and the document content. The final function g corresponds to the higher interaction layers used to compute the matching score.

8.3 AM3: An Attention-based Multi-level Matching Model

We propose an Attention-based Multi-level neural Matching Model (AM3) that exploits three different levels of information granularity: the word-level, the passage-level and the document-level. In the following, we describe the different levels of our AM3 model accordingly.


8.3.1 Word Level
At the word-level, the aim is to consider the words of the input sequences according to their importance. In our model, the function ψ is considered as an identity function and represented by dotted lines. At this level, we compute the attention weight vectors ᾱ^(Q) and ᾱ^(Pi), corresponding to all the words of the query Q and of the passages, respectively, using the attention layer ω defined in equation 7.2.

8.3.2 Passage Level
At the word-level, the model is provided with relevance signals corresponding to every word in the different passages of the document, but not to the passages themselves. At the passage-level, the aim is to better distinguish the most important passages of the whole document. First, we compute a word-word matching matrix M^(w) between the query Q and the passages of the document D using equation 8.2.

M^(w) = [ᾱ^(P1) P1, ..., ᾱ^(Pl) Pl] ⊛ ᾱ^(Q) Q     (8.2)

where ⊛ is the cosine similarity. Then, we compute a relevance vector ẑi for every passage Pi in D, using a convolutional layer of k × |Q| dimensional filters, where k is the passage length, applied to the word-word matching matrix M^(w), as described in equation 8.3.

ẑi^f =def δ(w^f zi + b^f),   f = 1, 2, ..., F     (8.3)

where zi corresponds to the convolution window and F is the number of filters, which is also the dimension of the vector ẑi. The parameters w^f and b^f stand for the convolution layer weights and bias, respectively, corresponding to the filter f, and δ is the corresponding activation function. Once the vector representation ẑi is computed for every passage, we compute the passage-level attention weight vector ᾱ^(z) = [α_1^(z), ..., α_l^(z)], where α_i^(z) is the attention weight of the passage Pi computed using equation 4.8, where the word vector wt is substituted by the passage vector ẑi. The attention vector ᾱ^(z) enables the matching layers of the document-level to focus on the most important passages. Hence, the passage-level signals to consider at the document-level are given by the matrix M^(p), computed as in equation 8.4.

M^(p) =def ᾱ^(z) ⊗ ẑ = [α_1^(z) ẑ1, ..., α_l^(z) ẑl]     (8.4)

where ⊗ corresponds to the element-wise product.
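The following NumPy sketch illustrates, under simplifying assumptions, the passage-level computation of equations 8.2 to 8.4 for a single query: attention-weighted word vectors are compared by cosine similarity, one k × |Q| convolution window per passage produces the vectors ẑi (with tanh as an assumed activation δ), and a softmax over a learned scoring vector stands in for the passage-level attention of equation 4.8, whose exact form is defined earlier in the thesis.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def passage_level_signals(query, passages, alpha_q, alpha_p, conv_w, conv_b, v):
    """Sketch of equations 8.2-8.4 for one query and l passages.

    query:    (|Q|, d) query word embeddings, alpha_q: (|Q|,) word weights.
    passages: list of l arrays (k, d), alpha_p: list of l arrays (k,).
    conv_w:   (F, k, |Q|) convolution filters, conv_b: (F,) biases.
    v:        (F,) scoring vector, a stand-in for equation 4.8.
    """
    z = []
    for P, a_p in zip(passages, alpha_p):
        # Equation 8.2: cosine similarities between attention-weighted words.
        M_w = np.array([[cosine(a_p[i] * P[i], alpha_q[j] * query[j])
                         for j in range(len(query))] for i in range(len(P))])
        # Equation 8.3: one k x |Q| convolution window per passage, F filters.
        z.append(np.tanh(np.tensordot(conv_w, M_w, axes=([1, 2], [0, 1])) + conv_b))
    z = np.stack(z)                                  # (l, F) passage vectors z_i
    scores = z @ v
    alpha_z = np.exp(scores) / np.exp(scores).sum()  # passage-level attention weights
    return alpha_z[:, None] * z                      # Equation 8.4: M^(p)
```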

8.3.3 Document Level
The whole document's relevance can be assessed based on the passage-level matching signals. The main idea is that the document can be considered relevant if there is an important interaction between its different relevant areas. Therefore, at the document-level we construct document-wide relevance patterns based on the passage-level signals of the matrix M^(p). To this end, we can use one of the following two options as a combination layer.


Recurrent Layer (LSTM)
The aim of using a recurrent layer is to process the different passages according to their positions in the document. We assume that a document where the relevant passage appears at the beginning could be preferred by the user. Hence, the passage-level relevance matrix M^(p) is fed to a long short-term memory (LSTM) layer [100]. In addition, in order to perform a kind of interaction between the different passages, we use a bidirectional recurrent network, where the hidden states (equation 2.5) h→ and h←, of the forward and backward directions respectively, are concatenated to construct the document-level relevance features, as described in equation 8.5.

y_biLSTM = [h→, h←]     (8.5)

where, if a simple LSTM layer is used, we consider y_LSTM = [h→].

Feedforward Layer (MLP)
A simple feedforward network [270] moves the information only in the forward direction of the NN (from input to output), and applies a non-linear transformation at every layer. We feed the matrix M^(p) to a multi-layer perceptron (MLP) in order to transform the passage-level relevance signals, as described in equation 8.6.

y_MLP = δ_F(W^(F) M^(p) + b^(F))     (8.6)

where W^(F) and b^(F) are the feedforward layer parameters, and δ_F is the MLP layer activation function.

Then, a final MLP layer is used to combine the document-level features y and compute the final relevance score sc, as described in equation 8.7.

sc = δ_sc(W y + b)     (8.7)

where W and b are the parameters of the MLP layer, and δ_sc is the final layer's activation function. y ∈ {y_biLSTM, y_LSTM, y_MLP} is the final feature vector computed at the document-level by adopting one of the possible options, namely a bidirectional LSTM, a simple LSTM or an MLP layer.
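A hedged Keras sketch of the document-level combination of equations 8.5 to 8.7: the passage-level matrix M^(p), of shape (l, F), is combined either by a (bidirectional) LSTM or by a feedforward layer, and a final dense layer produces the relevance score sc; the hidden size and activation functions are illustrative assumptions rather than the configuration used in the thesis.

```python
from tensorflow.keras import layers, Model

def build_document_level(l, F, combination="bi-LSTM", hidden=64):
    """Document-level scoring head applied to the passage-level matrix M^(p)."""
    m_p = layers.Input(shape=(l, F), name="passage_level_signals")   # M^(p)
    if combination == "bi-LSTM":
        # Equation 8.5: concatenation of forward and backward hidden states.
        y = layers.Bidirectional(layers.LSTM(hidden))(m_p)
    elif combination == "LSTM":
        y = layers.LSTM(hidden)(m_p)        # y_LSTM: forward hidden state only
    else:
        # Equation 8.6: feedforward (MLP) transformation of the passage signals.
        y = layers.Dense(hidden, activation="tanh")(layers.Flatten()(m_p))
    sc = layers.Dense(1, activation="sigmoid", name="relevance_score")(y)  # eq. 8.7
    return Model(inputs=m_p, outputs=sc)

model = build_document_level(l=10, F=50, combination="bi-LSTM")
model.summary()
```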

8.4 Experiments

In this section, we describe the experimental process, the obtained results and their analysis. We first describe the evaluation setup (section 8.4.1), mainly the dataset and the evaluation protocol; then (section 8.4.3) we describe and discuss the obtained results.

8.4.1 Evaluation Setup
We consider the ad-hoc document ranking task and train our AM3 model, as well as the different evaluated neural models, to optimize the rank hinge loss objective

function (equation 7.4). We used the Robust04 news TREC dataset (details about this dataset are given in table 5.5 of section 5.2.3). Robust04 contains 250 queries, hence we performed a 5-fold cross-validation, with 150 queries for training, 50 queries for validation and 50 queries for test. Experiments are conducted using the MatchZoo [5] framework. We have added our AM3 model to MatchZoo. In order to evaluate our passage-based document representation approach, we extended some pre-implemented neural models in this framework with the attention layers ω and ωp at several levels, as described in section 8.2.2. Instead of using all passages of a document, we use passages that are retrieved by the INDRI2 search engine, with a passage length set3 to 5 words without overlap. We have trained our AM3 model using the Adam optimizer [271] with a batch size of 100. To avoid overfitting the data, we used a dropout rate with different values in {0.0, 0.1, ..., 0.5} in the layers of the document-level (section 8.3.3). The number of filters F used in equation 8.3 takes values in {5, 10, ..., 100}. Our model as well as the neural baselines use the pre-trained GloVe4 word embeddings with embedding size 300. All the models are trained over 600 epochs, then evaluated according to their performance in the last epoch. Concerning the hyper-parameters of the different baselines, we have considered the model configurations corresponding to the best results published in the corresponding papers.
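For illustration, a small sketch of the 5-fold protocol described above, rotating 250 query identifiers over folds of 50 queries so that each run uses 150 queries for training, 50 for validation and 50 for test; the query ids are placeholders.

```python
import numpy as np

def five_fold_splits(query_ids, seed=42):
    """Yield (train, valid, test) splits of sizes 150/50/50 over 250 queries."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(query_ids), 5)   # 5 folds of 50 queries
    for i in range(5):
        test = folds[i]
        valid = folds[(i + 1) % 5]
        train = np.concatenate([folds[j] for j in range(5) if j not in (i, (i + 1) % 5)])
        yield train, valid, test

for train, valid, test in five_fold_splits(np.arange(250)):  # placeholder query ids
    print(len(train), len(valid), len(test))                 # 150 50 50
```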

8.4.2 Models Configuration
We have extended several neural baselines described in section 5.4, using different attention layers as described in section 8.2.2.
• Original refers to the original model without any modification.

• ω.P refers to the use of the attention layer ωp (equation 8.1), at the passage-level, to compute attention weights for the different passages.
• ω.Q refers to the attention weights computed by the layer ω (equation 7.2) at the word-level of the query input.
• ω.D refers to the attention weights computed by the layer ω (equation 7.2) at the word-level of the different passages of the input document.
For the DUET model [179], we considered different cases. In fact, this model is composed of two parallel convolutional models, the local and the distributed one. We refer to each of them with the labels local and distrib, respectively. The localDistrib label refers to the application of the attention layers ω.Q, ω.P and/or ω.D at the corresponding level in both the local and distributed parts of the DUET model.

We have also considered different configurations of our AM3 model for the document-level and the passage-level processing. The evaluated configurations are referred to using the following labels:

2http://lemurproject.org/indri.php
3The passage length took values from {5, 10, ..., 100}, then we chose the value that gave the best results.
4http://nlp.stanford.edu/data/glove.840B.300d.zip


Document Representation    Model           MAP      P@1      P@5      P@10     nDCG@1   nDCG@5   nDCG@10
Document Content           BM25            0.237    0.528    0.468    0.424    0.459    0.444    0.428
                           ARC-II          0.063    0.044    0.057    0.064    0.062    0.041    0.055
                           DUET            0.117    0.237    0.170    0.155    0.218    0.170    0.160
                           MatchPyramid    0.118    0.169    0.162    0.163    0.150    0.146    0.150
                           MV-LSTM         0.102    0.116    0.114    0.112    0.106    0.105    0.106
                           DRMM            0.155    0.341    0.255    0.247    0.317    0.250    0.248
Passages                   BM25            0.220    0.320    0.257    0.232    0.252    0.243    0.234
                           ARC-II          0.149    0.136    0.160    0.156    0.128    0.149    0.157
                           DUET            0.179    0.228    0.218    0.200    0.215    0.209    0.211
                           MatchPyramid    0.130    0.120    0.126    0.121    0.115    0.116    0.122
                           MV-LSTM         0.164    0.172    0.181    0.166    0.153    0.170    0.170
                           DRMM            0.336    0.504    0.410    0.356    0.467    0.402    0.375

Table 8.1: Performance evaluation of several baselines, on the Robust04 dataset, using the document content compared to the passage-based content.

• FL refers to the use of a feedforward layer to combine the passage-level features (section 8.3.3).

• LSTM refers to the use of a simple LSTM layer to combine the passage-level features. In this case, the document-level features computed in equation 8.5 are y = [h→].

• bi-LSTM refers to the use of a bidirectional LSTM layer to combine the passage level features in equation 8.5.

Note that, in case the ω.P label is not used, the passage-level features tensor of equation 8.4 is M^(p) = ẑ.

8.4.3 Results and Discussion
We considered the performance evaluation in terms of precision (P) and normalized discounted cumulative gain (nDCG) at ranks 1, 5 and 10, in addition to the mean average precision (MAP). The objective is to answer several research questions, as follows.

RQ1: Does the passage-based document representation approach correctly capture relevant documents?
First, we evaluate the impact of using the passage-based document representation, compared to the use of the whole document content, to run existing state-of-the-art models, and show how it affects the models' performance. We used some neural models that are implemented in MatchZoo, in addition to the classical BM255 model implemented in INDRI. Table 8.1 shows the performances of the different evaluated models. In this table, note that the BM25 model does not benefit from the passage-based document representation. Indeed, the new document content (based on the concatenated

5While using the BM25 model, we concatenated the different passages, of the same document, that are retrieved by INDRI, to construct the new document content.


Model           Configuration                MAP      P@1      P@5      P@10     nDCG@1   nDCG@5   nDCG@10
ARC-II          Original                     0.149    0.136    0.160    0.156    0.128    0.150    0.157
                ω.P                          0.132    0.124    0.146    0.139    0.105    0.132    0.137
                ω.P+ω.Q                      0.137    0.152    0.143    0.142    0.144    0.141    0.147
                ω.P+ω.D                      0.175    0.172    0.216    0.204    0.168    0.200    0.206
                ω.P+ω.D+ω.Q                  0.153    0.168    0.149    0.140    0.157    0.146    0.149
DUET            Original                     0.179    0.228    0.218    0.200    0.215    0.209    0.211
                ω.P+local                    0.175    0.268    0.212    0.192    0.247    0.209    0.209
                ω.P+distrib                  0.157    0.196    0.187    0.169    0.177    0.177    0.180
                ω.P+localDistrib             0.191    0.280    0.230    0.211    0.251    0.232    0.230
                ω.P+local+ω.Q                0.176    0.240    0.216    0.198    0.221    0.210    0.212
                ω.P+local+ω.D                0.235    0.324    0.281    0.254    0.308    0.284    0.281
                ω.P+local+ω.D+ω.Q            0.226    0.312    0.270    0.250    0.288    0.271    0.270
                ω.P+distrib+ω.Q              0.153    0.188    0.165    0.158    0.169    0.165    0.168
                ω.P+distrib+ω.D              0.178    0.200    0.208    0.204    0.189    0.202    0.210
                ω.P+distrib+ω.D+ω.Q          0.176    0.220    0.192    0.186    0.212    0.191    0.199
                ω.P+localDistrib+ω.Q         0.176    0.240    0.216    0.198    0.221    0.210    0.212
                ω.P+localDistrib+ω.D         0.230    0.304    0.273    0.250    0.283    0.274    0.273
                ω.P+localDistrib+ω.D+ω.Q     0.226    0.336    0.279    0.254    0.307    0.275    0.275
MatchPyramid    Original                     0.130    0.120    0.126    0.121    0.115    0.116    0.122
                ω.P                          0.186    0.232    0.194    0.186    0.197    0.197    0.203
                ω.P+ω.Q                      0.236    0.320    0.278    0.240    0.299    0.282    0.269
                ω.P+ω.D                      0.156    0.168    0.158    0.141    0.151    0.150    0.151
                ω.P+ω.D+ω.Q                  0.135    0.244    0.174    0.146    0.193    0.172    0.160
MV-LSTM         Original                     0.164    0.172    0.181    0.166    0.153    0.170    0.170
                ω.P                          0.177    0.228    0.196    0.183    0.215    0.192    0.191
                ω.P+ω.Q                      0.179    0.176    0.191    0.190    0.171    0.182    0.193
                ω.P+ω.D                      0.168    0.152    0.186    0.178    0.149    0.173    0.178
                ω.P+ω.D+ω.Q                  0.183    0.164    0.204    0.193    0.156    0.188    0.194

Table 8.2: Performance evaluation of several state-of-the-art models extended with attention layers at different levels.

passages) contains words with lower frequencies than the original document content. Hence, the query word frequencies are also reduced, which seems to be the reason why the BM25 performance decreased.

Concerning the neural models, the document content corresponds to only the first 200 words of every document, which is the maximum document length supported by the different evaluated neural models in MatchZoo in these experiments. However, as shown in section 8.1, relevance signals can be located far beyond the first 200 words. We notice that the passage-based document representation has a positive impact on the performance of the different neural models, compared to the use of a truncated document content. For instance, the performances, in terms of MAP, of the ARC-II, DUET and MV-LSTM models are better, by more than 136%, 52% and 60% respectively, when using the passage-based representations than when using the document content.

Moreover, using the passage-based representations, the results of the DRMM model are better than those of BM25 in terms of MAP and nDCG@1. These results support our hypothesis concerning the distribution of relevance over long documents, and the information loss caused by cutting off the document content. Indeed, the passage-based document representation makes it possible to reduce the size of the document, and thus make it manageable by the neural models. In addition, the passage selection process (section 8.2.1) constructs a document content that necessarily contains the query words, which helps to better assess the document relevance.
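To make the passage-based document construction concrete, the following minimal Python sketch, under simplifying assumptions (whitespace tokenization, and a list of pre-ranked passages standing in for an INDRI passage run), contrasts the common truncation at 200 words with a document rebuilt from its top-ranked passages. It is only an illustration of the idea, not the exact implementation used in our experiments.

```python
def truncate_document(doc_tokens, max_len=200):
    """Common workaround: keep only the first max_len words of the document."""
    return doc_tokens[:max_len]

def passage_based_document(passages, max_len=200):
    """Rebuild the document content from its pre-ranked passages.

    `passages` is assumed to be a list of token lists, ordered by a first-stage
    passage retrieval score (e.g. an INDRI passage run), so the most promising
    passages -- those containing query words -- come first.
    """
    content = []
    for passage in passages:
        if len(content) + len(passage) > max_len:
            break
        content.extend(passage)
    return content

# Toy usage: the relevant passage sits beyond the first 200 words, so the
# truncated view misses it while the passage-based view keeps it.
document = ["filler"] * 300 + ["neural", "ranking", "models"] + ["filler"] * 100
passages = [document[300:303], document[:200]]   # hypothetical pre-ranked passages
print(truncate_document(document)[:3])           # ['filler', 'filler', 'filler']
print(passage_based_document(passages)[:3])      # ['neural', 'ranking', 'models']
```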


RQ2: What is the impact of the multi-level attention weights in state-of-the-art models?

In addition to the passage-based document representation, we investigate how the attention weights computed at different levels (figure 8.2) impact the performance of existing neural matching models. Table 8.2 shows the performance results of several extended neural baselines with the attention layers ω and ωp applied at different levels, within different configurations. We use different labels to refer to each of them, as described in section 8.4.2.

In this table, we can notice that at least one configuration of each extended model performs better than its original counterpart. The best performances of every model are highlighted in bold characters. The best results are obtained by the DUET model using the ω.P+local+ω.D extension. The latter improves the performance by more than 31% in terms of MAP and by more than 28% in terms of nDCG@5, compared to the corresponding original model. Note that for this analysis, we do not consider the DRMM model [178]. In fact, this model uses a histogram function to map the word-level matching signals into lower-dimensional vectors, where every vector corresponds to the matching histogram of one query word with the whole document.

Most of the model extensions using the attention layer ωp at the passage level, combined with the attention layer applied at the word level of the document (ω.P+ω.D) or of the query (ω.P+ω.Q), perform better than the original versions, as well as than the extensions using the attention layer at the passage level only, at the word level only, or at all levels at once (ω.P+ω.D+ω.Q). In fact, the word-level weights make it possible to identify the core information of every passage, while the passage-level weights account for the differences in importance among the different passages. For instance, in the case of ARC-II, the configuration ω.P+ω.D outperforms all the other configurations, improving the performance of the original model by more than 35% in terms of P@5. For the MatchPyramid model, the ω.P+ω.Q extension outperforms all the other configurations, with results that are more than 166% better than those of the original MatchPyramid model in terms of P@1. In fact, the attention weights computed for the different words of the query lead the model to focus on the passages containing the most important query words.
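As an illustration of how such attention layers can reweight the inputs of an existing matching model, here is a minimal numpy sketch of the ω.P+ω.D configuration; the scoring vectors v_word and v_passage are hypothetical stand-ins for parameters that are, in our models, trained end-to-end with the extended network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_level_weights(word_embs, v_word):
    """omega.D / omega.Q: one attention weight per word of the input sequence."""
    scores = word_embs @ v_word            # (n_words,)
    return softmax(scores)

def passage_level_weights(passage_vecs, v_passage):
    """omega.P: one attention weight per passage of the document."""
    scores = passage_vecs @ v_passage      # (n_passages,)
    return softmax(scores)

# Toy dimensions: 3 passages of 4 words each, embeddings of size 5.
rng = np.random.default_rng(0)
passages = rng.normal(size=(3, 4, 5))      # word embeddings per passage
v_word = rng.normal(size=5)                # hypothetical learned parameters
v_passage = rng.normal(size=5)

# omega.P + omega.D configuration: reweight document words, then passages.
weighted_words = np.stack([
    word_level_weights(p, v_word)[:, None] * p for p in passages
])
passage_vecs = weighted_words.mean(axis=1)                 # naive passage vectors
omega_p = passage_level_weights(passage_vecs, v_passage)   # importance of each passage
weighted_passages = omega_p[:, None] * passage_vecs        # input to the matching model
print(omega_p)
```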

Moreover, the ω.P+local+ω.D extension of the DUET model yields better results than all the other extensions of this model, except for P@1, where the ω.P+localDistrib+ω.D+ω.Q extension outperforms the original model by more than 47%, compared to the 42% improvement obtained with the ω.P+local+ω.D extension. In fact, ω.D makes it possible to identify the most important words of the different passages, while ω.P focuses the interaction process of both the local and distributed parallel sub-models of DUET on the most relevant passages.

The conclusion to be drawn from these results is that when the attention is put on the word level, attention weights must be computed exclusively for the words of only one of the two inputs at a time, either the query or the document, resulting in an asymmetric configuration of the model.


Model            Configuration               MAP    P@1    P@5    P@10   nDCG@1 nDCG@5 nDCG@10
Feedforward      AM3.FL                      0.218  0.316  0.251  0.235  0.283  0.249  0.251
Layer            AM3.FL+ω.Q                  0.208  0.304  0.258  0.231  0.257  0.252  0.249
                 AM3.FL+ω.D                  0.185  0.212  0.225  0.211  0.171  0.207  0.211
                 AM3.FL+ω.Q+ω.D              0.199  0.276  0.241  0.222  0.252  0.231  0.235
                 AM3.FL+ω.P                  0.241  0.348  0.303  0.252  0.321  0.296  0.279
                 AM3.FL+ω.P+ω.Q              0.213  0.308  0.254  0.233  0.272  0.249  0.249
                 AM3.FL+ω.P+ω.D              0.232  0.304  0.279  0.264  0.272  0.266  0.274
                 AM3.FL+ω.P+ω.Q+ω.D          0.207  0.288  0.254  0.226  0.249  0.244  0.245
Recurrent        AM3.LSTM                    0.210  0.316  0.281  0.244  0.244  0.266  0.259
Layer            AM3.LSTM+ω.Q                0.220  0.328  0.293  0.264  0.287  0.289  0.272
                 AM3.LSTM+ω.D                0.232  0.336  0.302  0.271  0.316  0.296  0.293
                 AM3.LSTM+ω.Q+ω.D            0.184  0.240  0.246  0.210  0.220  0.222  0.226
                 AM3.LSTM+ω.P                0.228  0.328  0.274  0.250  0.307  0.265  0.264
                 AM3.LSTM+ω.P+ω.Q            0.226  0.348  0.296  0.261  0.295  0.287  0.276
                 AM3.LSTM+ω.P+ω.D            0.217  0.344  0.282  0.250  0.308  0.284  0.272
                 AM3.LSTM+ω.P+ω.Q+ω.D        0.181  0.264  0.234  0.207  0.225  0.222  0.214
                 AM3.biLSTM                  0.209  0.280  0.245  0.225  0.257  0.240  0.237
                 AM3.biLSTM+ω.Q              0.220  0.328  0.292  0.260  0.288  0.283  0.277
                 AM3.biLSTM+ω.D              0.199  0.240  0.235  0.218  0.197  0.228  0.231
                 AM3.biLSTM+ω.Q+ω.D          0.161  0.140  0.164  0.169  0.148  0.165  0.167
                 AM3.biLSTM+ω.P              0.202  0.316  0.238  0.229  0.268  0.232  0.237
                 AM3.biLSTM+ω.P+ω.Q          0.245  0.384  0.327  0.290  0.343  0.315  0.304
                 AM3.biLSTM+ω.P+ω.D          0.228  0.292  0.260  0.238  0.268  0.255  0.256
                 AM3.biLSTM+ω.P+ω.Q+ω.D      0.216  0.316  0.254  0.222  0.295  0.251  0.244

Table 8.3: Performance evaluation of the AM3 model using different configurations, on the Robust04 dataset.

Note that, for most of the evaluated models, the results of one of the asymmetric configurations, namely ω.P+ω.D or ω.P+ω.Q, are better than those of the corresponding symmetric configuration ω.P+ω.Q+ω.D. This is due to the asymmetric nature of the query-document matching, as discussed for the similar question-answer matching task in chapter 7.

RQ3: What are the performances of the different configurations of the AM3 model?

To answer this question, we used different configurations of our AM3 model. The objective is to evaluate the impact of putting the attention at different levels (word level and/or passage level). More specifically, we evaluated the impact of the attention weights computed for the different passages, referred to with the label ω.P, and the impact of the attention weights at the word level of the query, referred to with the label ω.Q, and of the document, referred to with the label ω.D. Furthermore, we evaluate the impact of the recurrent layers (LSTM vs. bi-LSTM) used at the document level to combine the passage-level features, compared to the feedforward layer (FL). Table 8.3 shows the performances of the different configurations accordingly. In this table, the underlined values show the best performance among the configurations using the feedforward layer, and among those using the recurrent layers, LSTM or biLSTM. The best performances over all the configurations are highlighted in bold characters. Note that the attention at the passage level (label ω.P) has a clear impact on performance. The results of most of the model variants where the attention layer is applied at the passage level are better than those of their counterparts where this layer is not used. Moreover, the model variant AM3.biLSTM+ω.P+ω.Q gives the best results of the AM3 model. Indeed, the attention layer applied at the word level of the query (label ω.Q) helped to further improve the results when combined with the attention weights computed for the different passages (label ω.P).

However, when the word-level attention weights are computed for the document words (label ω.D), the results are not as good as those of the configurations having the label ω.Q. Hence, when the document level is provided with the weighted passage representations as well as with signals about the most important query words, the results are better. For the second time, the advantage of an asymmetric configuration in performing the asymmetric (query-document) matching task is observed, as was the case in the results of table 8.2.

Concerning the document-level interactions using the recurrent layers or the feedforward layer, we notice that the best performance reached by the variants using a recurrent layer, in particular the AM3.biLSTM+ω.P+ω.Q model, outperforms the best results of the configurations using a simple feedforward layer, namely the AM3.FL+ω.P model. Indeed, the recurrent layer takes into account the sequencing of the passages in its memory cells [100]. Hence, the recurrent layer provides more relevant signals to the final matching layer of our model than the simple feedforward layer does.
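The difference between the two document-level combination layers can be sketched as follows in PyTorch; the dimensions, the pooling of the biLSTM outputs and the final scoring layer are illustrative choices, not the exact AM3 implementation.

```python
import torch
import torch.nn as nn

class FeedforwardCombiner(nn.Module):
    """AM3.FL-style document level: flatten the passage features and score them."""
    def __init__(self, n_passages, passage_dim, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_passages * passage_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
    def forward(self, passage_feats):              # (batch, n_passages, passage_dim)
        return self.mlp(passage_feats.flatten(1))  # (batch, 1) relevance score

class RecurrentCombiner(nn.Module):
    """AM3.biLSTM-style document level: read passages as a sequence, keeping their order."""
    def __init__(self, passage_dim, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(passage_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)
    def forward(self, passage_feats):              # (batch, n_passages, passage_dim)
        out, _ = self.rnn(passage_feats)           # (batch, n_passages, 2*hidden)
        return self.score(out[:, -1, :])           # score from the last time step

# Toy usage with 2 documents of 5 passage vectors each.
feats = torch.randn(2, 5, 16)
print(FeedforwardCombiner(5, 16)(feats).shape)     # torch.Size([2, 1])
print(RecurrentCombiner(16)(feats).shape)          # torch.Size([2, 1])
```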

8.5 Conclusion

In this chapter, we propose a passage-based document representation method for the ad-hoc ranking of documents, unlike the common solution that consists in cutting off the part of a document that exceeds the length supported by a neural model. We first show that several matching signals are located far away in the document (after the first 200 words), and that combining different pre-ranked passages of a document leads to better results. We then suggest representing documents as sets of passages that are likely to be relevant. Based on these representations, we extended several neural matching models from the state of the art using attention layers. We observed improvements in the results of all the extended models when using the attention weights at different levels. Note that the asymmetric configurations, where the attention is put on the passages and their words (configurations ω.P+ω.D) or on the words of the query (configurations ω.P+ω.Q), obtain better results than the symmetric configurations (ω.P+ω.Q+ω.D), where attention is put on both the document and query words as well as on the different passages. When the passages are used instead of the whole document content, results improve by more than 136% in terms of MAP. Besides, when using the extended versions with the attention layers at different levels, the improvements reach up to 166% in terms of P@1, which supports our hypothesis about the multi-level relevance granularity in a document. Furthermore, we proposed a multi-level matching model. Its results confirmed, for a second time, the advantage of an asymmetric configuration of the multi-level matching to handle the query-document matching.

Part V

Overall Conclusion


CONCLUSION

Contributions Overview

The work presented in this thesis is part of the general context of information retrieval, and it focuses specifically on deep learning models. We were particularly interested in IR models that exploit distributed representations of words, and in those using neural models to learn matching features. We focused the state-of-the-art part on these two lines of research, and we devoted two chapters to the description of the different models proposed in this framework. Specifically, chapter 3 focuses on models exploiting distributed representations of words and sequences to improve classical IR models, such as the language model enhanced with word embeddings [159] and the MBM25EMB model [188] that extends BM25 using word embeddings. Chapter 4 focuses on models that use neural networks to first capture distributional semantics in the corpus of documents and construct latent representations, and then learn the matching function that compares these representations to better rank relevant results. In this context, we addressed four main questions:

1. While performing the document-query matching based on semantic matching using word embeddings, most of the proposed models [183, 14] handle the query words regardless of whether or not they occur in the documents. We assume that query words contribute to the matching process in different ways, depending on their presence or absence in every document. Hence, we aim to answer this question: How do the different terms of the query contribute to the relevance assessment of different documents, according to their presence/absence in every document?

In order to answer this question, we analyzed the impact of the presence/absence of query words in the matching process using word embeddings [259]. To do so, we compared three different matching strategies, where every word in a document is weighted using a classical IR model, and then compared to the different query words using the cosine distance between their embedded vectors. Each of the three strategies focuses on a specific aspect:

• All the words of the query are equally important. Hence, we match them with all the document terms in the same way.

• The presence/absence of a query word in a document matters. Hence, we match the query words with those of the document accordingly.


• The presence/absence of a query word in a document is an important aspect, and exact matches do not have the same importance as semantic matches. Hence, we compare the query terms with the document terms based on their presence in every document, while highlighting their similarities based on exact/semantic matching.

The experimental results show that considering the query terms that are absent from the document differently from those that are present improves the performance of traditional models using embedded word vectors. More specifically, in datasets where the majority of relevant documents do not contain any (or most) of the exact matches with the query words, the proposed matching strategies yield better results.
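A minimal numpy sketch of the three strategies is given below; the idf dictionary and the 0.5 discount are toy stand-ins for the classical IR weighting and for the exact/semantic distinction actually used in chapter 6.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def score(query, doc, emb, idf, strategy="presence_exact"):
    """Toy document score for one query under the three matching strategies."""
    total = 0.0
    for q in query:
        sims = [idf.get(d, 1.0) * cos(emb[q], emb[d]) for d in doc]
        best = max(sims)
        if strategy == "uniform":            # all query words treated alike
            total += best
        elif strategy == "presence":         # presence/absence matters
            total += best if q in doc else 0.5 * best
        elif strategy == "presence_exact":   # exact vs. semantic matches differ
            total += 1.0 if q in doc else 0.5 * best
    return total

# Toy vocabulary with 3-d embeddings (hypothetical values).
emb = {"neural": np.array([1.0, 0.2, 0.0]),
       "network": np.array([0.9, 0.3, 0.1]),
       "retrieval": np.array([0.1, 1.0, 0.2])}
idf = {"network": 1.2, "retrieval": 1.5}
query, doc = ["neural", "retrieval"], ["network", "retrieval"]
for s in ("uniform", "presence", "presence_exact"):
    print(s, round(score(query, doc, emb, idf, s), 3))
```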

2. In a second step, we focused our work on text matching using neural models. Indeed, text matching applications include several tasks of various natures. However, most existing neural matching models [25, 236, 231, 110] handle the input sequences in the same way, and their representations are computed using the same function (see footnote 6), whatever the nature of the matching task. In this case, the question is: What are the main distinctions between the different matching tasks? And how can a neural matching model deal with the different inputs accordingly?

To answer this, we first highlighted the asymmetry aspect of the matching tasks involving input sequences of different natures, such as question-answer and query-document matching, and then proposed a neural matching architecture that makes it possible to handle this aspect [252]. More specifically, this architecture uses attention layers, such that the model's configuration corresponds to the nature of the task. The experimental results, on question-answer matching as an asymmetric task and paraphrase identification as a symmetric task, have shown a great performance of the adaptable architecture when extending several state-of-the-art models. We concluded that for an asymmetric matching task, it is better to have an asymmetric model, while for symmetric tasks, the attention layers make it possible to improve the performance of some models, but the differences are not significant.
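The configuration switch itself can be sketched in a few lines; the attend helper below is a hypothetical word-level attention layer, and the underlying matching model is left out.

```python
import numpy as np

def attend(word_embs, v):
    """Reweight a sequence of word embeddings with softmax attention scores."""
    scores = word_embs @ v
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w[:, None] * word_embs

def configure_inputs(question, answer, v, config="A"):
    """Apply attention on one side, both sides, or neither, per the task nature."""
    if config in ("Q", "Q+A"):
        question = attend(question, v)
    if config in ("A", "Q+A"):
        answer = attend(answer, v)
    return question, answer          # fed to the underlying matching model

rng = np.random.default_rng(1)
q, a, v = rng.normal(size=(6, 5)), rng.normal(size=(20, 5)), rng.normal(size=5)
for cfg in ("Q", "A", "Q+A"):
    q_in, a_in = configure_inputs(q, a, v, cfg)
    print(cfg, q_in.shape, a_in.shape)
```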

3. Based on the asymmetry-sensitive approach we proposed [252], we analyzed some word-level features in neural matching models, more specifically the impact of attention and position features on text matching. Indeed, the recently proposed attention models [31] put more attention on the most important words of a given sequence, while the position-based models [3] leverage the position information of every word in a sequence. The question that we aim to answer is: How can the attention-based and position-based approaches help improve the performance of neural models for text matching?

In order to answer this question, we extended the positional model MV-LSTM [3] using attention layers at the word level, and we evaluated different configurations [272] according to the nature of the matching task, symmetric or asymmetric [252].

6 We consider two representation layers having similar architectures to be the same function.


Experimental results have shown that combining the attention weights with the position information, in an asymmetric configuration, improves the results of the original model [3], which is based only on position features, by more than 3.8% and 7% in terms of accuracy and MAP, respectively.
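The combination of position and attention features can be sketched as follows; the positional vectors stand in for MV-LSTM's positional sentence representations, and the k-max pooling and scoring are simplified.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_positional_match(pos_a, pos_b, v, k=3):
    """Interaction between two sequences of positional representations,
    where each position is first reweighted by a word-level attention score."""
    w_a, w_b = softmax(pos_a @ v), softmax(pos_b @ v)          # word-level attention
    inter = (w_a[:, None, None] * pos_a[:, None, :] *
             w_b[None, :, None] * pos_b[None, :, :]).sum(-1)   # weighted dot products
    return np.sort(inter.ravel())[-k:]                         # k strongest matching signals

rng = np.random.default_rng(2)
pos_a, pos_b, v = rng.normal(size=(7, 8)), rng.normal(size=(9, 8)), rng.normal(size=8)
print(attentive_positional_match(pos_a, pos_b, v))             # top-k features for a scorer
```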

4. Finally, when using neural networks for the ad-hoc ranking of long documents, most neural models struggle to deal with the document length, and the common solution is to set a maximum document length [21]. According to several studies that have already been performed [27], this solution leads to an information loss. On the other hand, passage-based models [29, 30] have already shown great results in ranking long documents. In this case, the question that arises is: How can passage-retrieval techniques help neural models to better rank documents?

In order to address this problem, in a more recent work, we consider two assumptions [252]: (1) every document can be represented as a set of passages, where the different passages are not of the same importance; (2) there are three different relevance granularities in a document, namely the word level, the passage level and the document level. To handle these aspects, we proceed as follows:

• We proposed a multi-level architecture to extend different state-of-the-art matching models, using attention layers at different levels. The documents are represented as sets of passages, and attention weights are computed for the words of the query and of the passages, and for the different passages. The aim is to focus the matching function on the most important words of the query and of the document, as well as on the most relevant passages of the latter.

• We extended several state-of-the-art neural models using the proposed architecture, such that for every extended model, we compared different configurations of the attention layers, in order to show the advantages of considering the attention weights at different levels.

• Furthermore, we proposed our attention-based multi-level matching model (AM3). We considered different configurations: at the word level, we compute attention weights for the words of the query and of the document. At the passage level, we use a convolution layer applied to a word-word matching matrix, with a set of filters, to compute the embedded vector of every passage, and then compute attention weights for the different passages. Finally, at the document level, we compared different combination layers in order to compute the relevance score of the whole document.

The experimental results show that the passage-based document representations helped improve the performance of several state-of-the-art neural matching models by more than 136% in terms of MAP, compared to the same models using the limited document content. Besides, the attention put on the different passages improved the results of the extended models by up to 166% in terms of P@1, when using an asymmetric configuration. Finally, the results of our AM3 model, also using an asymmetric configuration, are better than the best results corresponding to the extended models.



Perspectives and Future Work

The work reported in this thesis could be extended in several directions. First of all, it would be interesting to provide additional validations of our models using larger datasets. For instance, the SQuAD dataset [273], which contains more than 100k questions, could be used to evaluate our contributions on the question answering tasks. To evaluate the document ranking models, the LETOR 4.0 dataset [213] provides a large set of evaluated documents and pre-computed features for LTR models. This dataset uses the GOV2 web page collection (about 25M pages) and two query sets from the Million Query track of TREC 2007 and TREC 2008. Another dataset was recently provided in the TREC 2019 Deep Learning Track (see footnote 7), namely MS-MARCO [274]; this dataset is intended to evaluate deep learning models for text matching. It includes six different parts, among which parts dedicated to document ranking using labeled passages, which can be used to validate the results of the contributions in chapters 6 and 8, as well as parts for question answering tasks that can be used to evaluate the contributions in chapter 7.

In all of the different contributions, we only considered traditional word embeddings, such as Word2Vec [2] and GloVe [9]. However, recent works [200, 201] consider contextual word embedding representations, such as BERT [104], where every occurrence of a word in the corpus can be represented differently. The advantage of this kind of embeddings is that the different meanings of a same word are not considered to be the same. Hence, the results of the different matching strategies shown in chapter 6 could be improved by considering contextual word embeddings. Besides, the neural matching architectures proposed in chapters 7 and 8 could be extended to include pre-trained contextual word embeddings, as additional features of the different input sequences, in order to handle the different word contexts accordingly. To do this, the contextual embedding layer can be used instead of the regular embedding layer, and attention weights can then be computed for the different context-based embedded vectors. The aim is to enable the model, trained in a learning-to-rank setup, to focus more on the context of a word in the relevant document, and thus to learn to distinguish between a word used in a relevant context and the same word used in an irrelevant context.

Another important perspective is related to the neural matching architecture proposed in section 7.2. Indeed, the model proposed in this section aims to address the asymmetry aspect of some text matching tasks, such as question-answer matching and similar tasks. Through the different analyses of the results of this model, we drew two main conclusions: (1) when the matching task is asymmetric, the architecture configuration that performs best corresponds to one of the asymmetric configurations, (Q) or (A); the same is true when it is a symmetric task, for which the best configurations are the symmetric ones, either the (Q+A) configuration or the model's original configuration; (2) there is a clear correlation between the question type (e.g. when, who, ...) and the architecture configuration that yields the best results among the different evaluated models.
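Coming back to the contextual-embedding perspective above, a minimal sketch of replacing the regular embedding layer by a contextual one, and of computing attention weights over the resulting vectors, could look as follows. It assumes the HuggingFace transformers library and PyTorch are available; the model name and the attention scoring layer are illustrative choices, not the thesis implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Contextual embedding layer used in place of a static lookup table
# (downloads pre-trained weights on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def contextual_attention(text, scorer):
    """Contextual vectors for each token, followed by word-level attention weights."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, n_tokens, 768)
    scores = scorer(hidden).squeeze(-1)                # (1, n_tokens)
    weights = torch.softmax(scores, dim=-1)
    return weights.unsqueeze(-1) * hidden              # attention-weighted token vectors

scorer = torch.nn.Linear(768, 1)                       # hypothetical attention scorer
weighted = contextual_attention("neural ranking of long documents", scorer)
print(weighted.shape)                                  # torch.Size([1, n_tokens, 768])
```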

7https://microsoft.github.io/TREC-2019-Deep-Learning/

Hence, we estimated the performance of the Oracle version corresponding to every extended model, and we found that its results could reach state-of-the-art results. To achieve this, constructing a model architecture that is able to automatically adapt its configuration, (Q), (A) or (Q+A), according to the question type as well as to the nature of the matching task being addressed, is an important direction to explore. This perspective can also be extended to query-document matching, where the documents can be represented as sets of passages.
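As a first step towards such an adaptive architecture, the configuration selection could be sketched as follows; the mapping from question type to configuration is purely hypothetical and would have to be estimated from validation data.

```python
# Hypothetical mapping from question type to the attention configuration
# ((Q), (A) or (Q+A)) that an adaptive model could select at inference time.
CONFIG_BY_QUESTION_TYPE = {
    "when": "A",      # assumed: factoid questions benefit from attending the answer
    "who": "A",
    "why": "Q+A",
    "how": "Q",
}

def select_configuration(question, default="Q+A"):
    """Pick an architecture configuration from the question's leading word."""
    first_word = question.strip().lower().split()[0] if question.strip() else ""
    return CONFIG_BY_QUESTION_TYPE.get(first_word, default)

print(select_configuration("When was the university founded?"))   # A
print(select_configuration("Explain the ranking process."))       # Q+A (default)
```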

BIBLIOGRAPHY

[1] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[3] Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. A deep architecture for semantic matching with multiple positional sentence representations. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[4] Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[5] Yixing Fan, Liang Pang, JianPeng Hou, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Matchzoo: A toolkit for deep text matching. arXiv preprint arXiv:1707.07270, 2017.

[6] Gerard Salton. Information storage and retrieval. reports on analysis, search, and iterative retrieval. 1968.

[7] Mustapha Baziz, Mohand Boughanem, Nathalie Aussenac-Gilles, and Claude Chrisment. Semantic cores for representing documents in ir. In Proceedings of the 2005 ACM Symposium on Applied Computing, SAC ’05, pages 1011–1017, New York, NY, USA, 2005. ACM.

[8] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196, 2014.

[9] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.


[10] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.

[11] Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W. Bruce Croft, and Xueqi Cheng. A deep look into neural ranking models for information retrieval. Information Processing and Management, page 102067, 2019.

[12] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3294–3302. Curran Associates, Inc., 2015.

[13] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.

[14] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. Semantic matching by non-linear word transportation for information retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 701–710. ACM, 2016.

[15] Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. A comparison of word embeddings for the biomedical natural language processing. Journal of biomedical informatics, 87:12–20, 2018.

[16] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 2333–2338. ACM, 2013.

[17] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems, pages 2042–2050, 2014.

[18] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[19] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[20] Gaurav Singh Tomar, Thyago Duque, Oscar Täckström, Jakob Uszkoreit, and Dipanjan Das. Neural paraphrase identification of questions with noisy pretraining. pages 142–147, 2017.


[21] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. Deeprank: A new deep architecture for relevance ranking in information retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, pages 257–266, New York, NY, USA, 2017. ACM.

[22] Yongxing Peng and Bo Liu. Attention-based neural network for short-text question answering. In Proceedings of the 2018 2nd International Conference on Deep Learning Technologies, ICDLT ’18, pages 21–26, New York, NY, USA, 2018. ACM.

[23] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pages 737–744, 1994.

[24] Yang Song, Qinmin Vivian Hu, and Liang He. P-cnn: Enhancing text matching with positional convolutional neural network. Knowledge-Based Systems, 169:67–79, 2019.

[25] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 1480–1489, 2016.

[26] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, 2016.

[27] Yixing Fan, Jiafeng Guo, Yanyan Lan, Jun Xu, Chengxiang Zhai, and Xueqi Cheng. Modeling diverse relevance patterns in ad-hoc retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, pages 375–384, New York, NY, USA, 2018. ACM.

[28] Xiaoyong Liu and W Bruce Croft. Passage retrieval based on language models. In 11th international conf. CIKM’02, 2002.

[29] Mengqiu Wang and Luo Si. Discriminative probabilistic models for passage based retrieval. In 31st international ACM SIGIR’08 conf., 2008.

[30] Yuanhua Lv and ChengXiang Zhai. Positional language models for information retrieval. In 32nd international ACM SIGIR’09 conf., 2009.

[31] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[32] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Natural Language Engineering, 16(1):100–103, 2010.


[33] Hans Peter Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development, 1(4):309–317, 1957.

[34] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern information retrieval, volume 463. ACM press New York, 1999.

[35] Chengxiang Zhai and John Lafferty. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM ’01, pages 403–410, New York, NY, USA, 2001. ACM.

[36] Susan T Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, & Computers, 23(2):229–236, 1991.

[37] Suzanne Briet. Qu’est-ce que la documentation?, volume 1. Éditions documentaires, industrielles et techniques, 1951.

[38] Michael K Buckland. What is a “document”? Journal of the American society for information science, 48(9):804–809, 1997.

[39] David M Levy. Fixed or fluid?: document stability and new media. In Proceedings of the 1994 ACM European conference on Hypermedia technology, pages 24–31. ACM, 1994.

[40] Bernard J Jansen, Amanda Spink, Judy Bateman, and Tefko Saracevic. Real life information retrieval: A study of user queries on the web. In Acm sigir forum, volume 32, pages 5–17. ACM, 1998.

[41] Craig Silverstein, Hannes Marais, Monika Henzinger, and Michael Moricz. Analysis of a very large web search engine query log. In ACM SIGIR Forum, volume 33, pages 6–12. ACM, 1999.

[42] Carlos A Cuadra. Experimental Studies of Relevance Judgments. Final Report [by Carlos A. Cuadra and Others]. System Development Corporation, 1967.

[43] William S Cooper. A definition of relevance for information retrieval. Information storage and retrieval, 7(1):19–37, 1971.

[44] Stefano Mizzaro. Relevance: The whole history. Journal of the American society for information science, 48(9):810–832, 1997.

[45] Jiaxin Mao, Yiqun Liu, Ke Zhou, Jian-Yun Nie, Jingtao Song, Min Zhang, Shaoping Ma, Jiashen Sun, and Hengliang Luo. When does relevance mean usefulness and user satisfaction in web search? In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, pages 463–472, New York, NY, USA, 2016. ACM.

[46] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21, 1972.


[47] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990.

[48] David D Lewis. Text representation for intelligent text retrieval: A classification-oriented view. Text-based intelligent systems: current research and practice in information extraction and retrieval, pages 179–197, 1992.

[49] Donald Metzler and W Bruce Croft. Combining the language model and inference network approaches to retrieval. Information processing & management, 40(5):735–750, 2004.

[50] Matthew W. Bilotti, Paul Ogilvie, Jamie Callan, and Eric Nyberg. Structured retrieval for question answering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’07, pages 351–358, New York, NY, USA, 2007. ACM.

[51] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.

[52] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.

[53] Thomas K Landauer and Susan T Dumais. A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2):211, 1997.

[54] Thomas K Landauer, Peter W Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse processes, 25(2-3):259–284, 1998.

[55] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[56] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the eighteenth conference on computational natural language learning, pages 171–180, 2014.

[57] Stephen E Robertson and Steve Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In 17th international ACM SIGIR’94 conf. Springer-Verlag New York, Inc., 1994.

[58] Donald Metzler. Generalized inverse document frequency. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 399–408, New York, NY, USA, 2008. ACM.

[59] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, November 1975.


[60] Jay Michael Ponte and W Bruce Croft. A language modeling approach to information retrieval. PhD thesis, University of Massachusetts at Amherst, 1998.

[61] Tie-Yan Liu et al. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

[62] Charu C Aggarwal and ChengXiang Zhai. Mining text data. Springer Science & Business Media, 2012.

[63] Ye Zhang, Md Mustafizur Rahman, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, et al. Neural information retrieval: A literature review. arXiv preprint arXiv:1611.06792, 2016.

[64] Bernard J Jansen and Amanda Spink. Analysis of document viewing patterns of web search engine users. In Web mining: Applications and techniques, pages 339–354. IGI Global, 2005.

[65] James W Perry, Kent Allen, and Madeline M Berry. Machine literature searching x. machine language; factors underlying its design and development. American Documentation (pre-1986), 6(4):242, 1955.

[66] Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 659–666, New York, NY, USA, 2008. ACM.

[67] William Goffman. A searching procedure for information retrieval. Information Storage and Retrieval, 2(2):73–78, 1964.

[68] Andrew Turpin and Falk Scholer. User performance versus precision measures for simple search tasks. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, pages 11–18, New York, NY, USA, 2006. ACM.

[69] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.

[70] Patrick Paroubek, Stéphane Chaudiron, and Lynette Hirschman. Principles of evaluation in natural language processing. Traitement Automatique des Langues, 48(1):7–31, 2007.

[71] Susan T Dumais, George W Furnas, Thomas K Landauer, Scott Deerwester, and Richard Harshman. Using latent semantic analysis to improve access to textual information. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 281–285. ACM, 1988.

[72] Julio Gonzalo, Felisa Verdejo, Irina Chugur, and Juan Cigarran. Indexing with wordnet synsets can improve text retrieval. arXiv preprint cmp-lg/9808002, 1998.


[73] Mark Sanderson. Word sense disambiguation and information retrieval. In SIGIR’94, pages 142–151. Springer, 1994.

[74] Avi Arampatzis and Jaap Kamps. A study of query length. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 811–812. Citeseer, 2008.

[75] Li Deng, Dong Yu, et al. Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3–4):197–387, 2014.

[76] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[77] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.

[78] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.

[79] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

[80] Christopher M Bishop et al. Neural networks for pattern recognition. Oxford university press, 1995.

[81] Mohand Boughanem, Taoufiq Dkaki, Josiane Mothe, and C Soule-Dupuy. Mercure at trec7. In TREC, volume 1998, pages 355–360. Citeseer, 1998.

[82] John Nickolls, Ian Buck, and Michael Garland. Scalable parallel programming. In 2008 IEEE Hot Chips 20 Symposium (HCS), pages 40–53. IEEE, 2008.

[83] Jerome Y Lettvin, Humberto R Maturana, Warren S McCulloch, and Walter H Pitts. What the frog’s eye tells the frog’s brain. Proceedings of the IRE, 47(11):1940–1951, 1959.

[84] Giovanni Alcantara. Empirical analysis of non-linear activation functions for deep neural networks in classification tasks. arXiv preprint arXiv:1710.11272, 2017.

[85] Marvin Minsky and Seymour A Papert. Perceptrons: An introduction to computational geometry. Cambridge MA: MIT press, 1969.

[86] Daniel Svozil, Vladimir Kvasnicka, and Jiri Pospichal. Introduction to multi-layer feed-forward neural networks. Chemometrics and intelligent laboratory systems, 39(1):43–62, 1997.

[87] Richard P Lippmann. Pattern classification using neural networks. IEEE communications magazine, 27(11):47–50, 1989.


[88] Donald F Specht. A general regression neural network. IEEE transactions on neural networks, 2(6):568–576, 1991.

[89] John S Denker, WR Gardner, Hans Peter Graf, Donnie Henderson, Richard E Howard, W Hubbard, Lawrence D Jackel, Henry S Baird, and Isabelle Guyon. Neural network recognizer for hand-written zip code digits. In Advances in neural information processing systems, pages 323–331, 1989.

[90] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.

[91] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.

[92] Michael Egmont-Petersen, Dick de Ridder, and Heinz Handels. Image processing with neural networks—a review. Pattern recognition, 35(10):2279–2301, 2002.

[93] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.

[94] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.

[95] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM, 2006.

[96] Kazuya Kawakami. Supervised sequence labelling with recurrent neural networks. PhD thesis, Ph. D. thesis, Technical University of Munich, 2008.

[97] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.

[98] Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148–157, 2016.

[99] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[100] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.


[101] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.

[102] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

[103] Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, and Ilya Sutskever. Better language models and their implications. OpenAI, 2018b. URL https://openai.com/blog/better-language-models, 2019.

[104] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

[105] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (chapelle, o. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.

[106] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 541–550, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[107] Nikhil Rasiwasia and Nuno Vasconcelos. Scene classification with low-dimensional semantic spaces and weak supervision. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–6. IEEE, 2008.

[108] Shangxuan Tian, Shijian Lu, and Chongshou Li. Wetext: Scene text detection under weak supervision. In Proceedings of the IEEE International Conference on Computer Vision, pages 1492–1500, 2017.

[109] Mostafa Dehghani, Aliaksei Severyn, Sascha Rothe, and Jaap Kamps. Learning to learn from weak supervision by full supervision. arXiv preprint arXiv:1711.11383, 2017.

[110] Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 65–74, New York, NY, USA, 2017. ACM.

[111] Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2017.


[112] Geoffrey E Hinton, Terrence Joseph Sejnowski, and Tomaso A Poggio. Unsupervised learning: foundations of neural computation. MIT press, 1999.

[113] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.

[114] Anthony TC Goh. Back-propagation neural networks for modeling complex systems. Artificial Intelligence in Engineering, 9(3):143–151, 1995.

[115] Robert Hecht-Nielsen. Theory of the backpropagation neural network. In Neural networks for perception, pages 65–93. Elsevier, 1992.

[116] John E Moody. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in neural information processing systems, pages 847–854, 1992.

[117] Russell Reed and Robert J Marks II. Neural smithing: supervised learning in feedforward artificial neural networks. MIT Press, 1999.

[118] Yaochu Jin, Tatsuya Okabe, and Bernhard Sendhoff. Neural network regularization and ensembling using multi-objective evolutionary algorithms. In Proceedings of the 2004 Congress on Evolutionary Computation (IEEE Cat. No. 04TH8753), volume 1, pages 1–8. IEEE, 2004.

[119] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

[120] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335–366, 2014.

[121] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[122] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[123] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[124] Hanna M. Wallach. Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 977–984, New York, NY, USA, 2006. ACM.

[125] Anne Kao and Steve R Poteet. Natural language processing and text mining. Springer Science & Business Media, 2007.


[126] W Bruce Croft, Donald Metzler, and Trevor Strohman. Search engines: Information retrieval in practice, volume 520. Addison-Wesley Reading, 2010.

[127] Robert Krovetz and W Bruce Croft. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems (TOIS), 10(2):115–141, 1992.

[128] Burghard B Rieger. On distributed representation in word semantics. International Computer Science Institute Berkeley, CA, 1991.

[129] Abraham Bookstein and Don R Swanson. Probabilistic models for automatic indexing. Journal of the American Society for Information science, 25(5):312–316, 1974.

[130] Stephen Paul Harter. A probabilistic approach to automatic keyword indexing. PhD thesis, University of Chicago, 1974.

[131] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 238–247, 2014.

[132] Thomas K Landauer and Susan Dumais. Latent semantic analysis. Scholarpedia, 3(11):4356, 2008.

[133] Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, and Nathalie Souf. Learning concept-driven document embeddings for medical information search. In Conference on Artificial Intelligence in Medicine in Europe, pages 160–170. Springer, 2017.

[134] Yuan Ni, Qiong Kai Xu, Feng Cao, Yosi Mass, Dafna Sheinwald, Hui Jia Zhu, and Shao Sheng Cao. Semantic documents relatedness using concept graph representation. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 635–644. ACM, 2016.

[135] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.

[136] Gia-Hung Nguyen. Modèles neuronaux pour la recherche d’information: approches dirigées par les ressources sémantiques. PhD thesis, Université de Toulouse, Université Toulouse III-Paul Sabatier, 2018.

[137] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, 2017.

[138] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Charagram: Embedding words and sentences via character n-grams. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1504–1515, 2016.


[139] George Elmer Forsythe and Peter Henrici. The cyclic jacobi method for computing the principal values of a complex matrix. Transactions of the American Mathematical Society, 94(1):1–23, 1960.

[140] Gene H Golub and Christian Reinsch. Singular value decomposition and least squares solutions. In Linear Algebra, pages 134–151. Springer, 1971.

[141] David W Miller. Computer solution of linear algebraic systems, 1968.

[142] Thomas K Landauer, Danielle S McNamara, Simon Dennis, and Walter Kintsch. Handbook of latent semantic analysis. Psychology Press, 2013.

[143] Thomas Hofmann. Probabilistic latent semantic indexing. In ACM SIGIR Forum, volume 51, pages 211–218. ACM, 2017.

[144] Kevin Lund and Curt Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior research methods, instruments, & computers, 28(2):203–208, 1996.

[145] Douglas LT Rohde, Laura M Gonnerman, and David C Plaut. An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8(627-633):116, 2006.

[146] Rémi Lebret and Ronan Collobert. Word embeddings through hellinger pca. arXiv preprint arXiv:1312.5542, 2013.

[147] Andy Field. Discovering statistics using SPSS. Sage publications, 2009.

[148] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537, 2011.

[149] Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.

[150] Ivan Vulić and Marie-Francine Moens. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pages 363–372. ACM, 2015.

[151] Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. Trans-gram, fast cross-lingual word-embeddings. arXiv preprint arXiv:1601.02502, 2016.

[152] Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. Cross-lingual models of word embeddings: An empirical comparison. arXiv preprint arXiv:1604.00425, 2016.

[153] Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW ’16 Companion, pages 83–84, Republic and Canton of Geneva, Switzerland, 2016. International World Wide Web Conferences Steering Committee.


[154] Hamed Zamani and W Bruce Croft. Relevance-based word embedding. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 505–514. ACM, 2017.

[155] Guangyou Zhou, Tingting He, Jun Zhao, and Po Hu. Learning continuous word embedding with metadata for question retrieval in community question answering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 250–259, 2015.

[156] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.

[157] Liu Yang, Qingyao Ai, Jiafeng Guo, and W Bruce Croft. anmm: Ranking short answer texts with attention-based neural matching model. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 287–296. ACM, 2016.

[158] Guoqing Zheng and Jamie Callan. Learning to reweight terms with distributed representations. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pages 575–584, New York, NY, USA, 2015. ACM.

[159] Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth J.F. Jones. Word embedding based generalized language model for information retrieval. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pages 795–798, New York, NY, USA, 2015. ACM.

[160] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 367–377, 2016.

[161] T Kenter, A Borisov, M de Rijke, K Erk, and NA Smith. Siamese cbow: Optimizing word embeddings for sentence representations. 2016.

[162] Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. In Proceedings of NAACL-HLT, pages 1367–1377, 2016.

[163] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.

[164] Cedric De Boom, Steven Van Canneyt, Thomas Demeester, and Bart Dhoedt. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80:150–156, 2016.

[165] Hamed Zamani and W. Bruce Croft. Estimating embedding vectors for queries. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR ’16, pages 123–132, New York, NY, USA, 2016. ACM.


[166] Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. Unsupervised dis- crete sentence representation learning for interpretable neural dialog gen- eration. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1098–1107, 2018.

[167] Lajanugen Logeswaran and Honglak Lee. An efficient framework for learn- ing sentence representations. 2018.

[168] Wenpeng Yin and Hinrich Sch¨utze. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, pages 901–911, 2015.

[169] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[170] Karl Pearson F.R.S. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

[171] Francis Jeffry Pelletier. The principle of semantic compositionality. Topoi, 13(1):11–24, 1994.

[172] Qingyao Ai, Liu Yang, Jiafeng Guo, and W. Bruce Croft. Improving language estimation with the paragraph vector model for ad-hoc retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '16, pages 869–872, New York, NY, USA, 2016. ACM.

[173] Lingfei Wu, Ian En-Hsu Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, and Michael J Witbrock. Word mover's embedding: From word2vec to document embedding. pages 4524–4534, 2018.

[174] Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. Supervised word mover’s distance. In Advances in Neural Information Processing Systems, pages 4862–4870, 2016.

[175] Eunjeong L Park, Sungzoon Cho, and Pilsung Kang. Supervised paragraph vector: distributed representations of words, documents and class labels. IEEE Access, 2019.

[176] Tom Kenter and Maarten de Rijke. Short text similarity with word embeddings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM '15, pages 1411–1420, New York, NY, USA, 2015. ACM.

[177] Qingyao Ai, Liu Yang, Jiafeng Guo, and W. Bruce Croft. Analysis of the paragraph vector model for information retrieval. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR ’16, pages 133–142, New York, NY, USA, 2016. ACM.


[178] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, 2016.

[179] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 1291–1299, Republic and Canton of Geneva, Switzerland, 2017. International World Wide Web Conferences Steering Committee.

[180] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, pages 55–64, New York, NY, USA, 2017. ACM.

[181] Guido Zuccon, Bevan Koopman, Peter Bruza, and Leif Azzopardi. Integrating and evaluating neural word embeddings in information retrieval. In Proceedings of the 20th Australasian document computing symposium, page 12. ACM, 2015.

[182] Adam Berger and John Lafferty. Information retrieval as statistical translation. SIGIR Forum, 51(2):219–226, August 2017.

[183] Guoqing Zheng and Jamie Callan. Learning to reweight terms with distributed representations. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, pages 575–584, New York, NY, USA, 2015. ACM.

[184] Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.

[185] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14, 2014.

[186] Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, 2018.

[187] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, 2018.

[188] Kamal Al-Sabahi and Zhang Zuping. Document summarization using sentence-level semantic based on word embeddings. International Journal of Software Engineering and Knowledge Engineering, 29(02):177–196, 2019.


[189] Melvin Earl Maron and John L Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the ACM (JACM), 7(3):216–244, 1960.

[190] Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, and Utpal Garain. Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608, 2016.

[191] Saar Kuzi, Anna Shtok, and Oren Kurland. Query expansion using word embeddings. In Proceedings of the 25th ACM international on conference on information and knowledge management, pages 1929–1932. ACM, 2016.

[192] Navid Rekabsaz, Mihai Lupu, Allan Hanbury, and Hamed Zamani. Word embedding causes topic shifting; exploit global context! In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, pages 1105–1108, New York, NY, USA, 2017. ACM.

[193] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.

[194] Navid Rekabsaz, Bhaskar Mitra, Mihai Lupu, and Allan Hanbury. Toward incorporation of relevant documents in word2vec. arXiv preprint arXiv:1707.06598, 2017.

[195] Kaveh Taghipour and Hwee Tou Ng. Semi-supervised word sense disambiguation using word embeddings in general and specific domains. In Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 314–323, 2015.

[196] Oded Avraham and Yoav Goldberg. The interplay of semantics and morphology in word embeddings. EACL 2017, page 422, 2017.

[197] Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. Sensembed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 95–105, 2015.

[198] Yadollah Yaghoobzadeh and Hinrich Schütze. Intrinsic subspace evaluation of word embedding representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 236–246, 2016.

[199] Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL-HLT, pages 142–148, 2016.


[200] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. Contextualized word representations for document re-ranking. arXiv preprint arXiv:1904.07094, 2019.

[201] Zhuyun Dai and Jamie Callan. Deeper text understanding for ir with contextual neural language modeling. arXiv preprint arXiv:1905.09217, 2019.

[202] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316, 2019.

[203] Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Understanding the behaviors of bert in ranking. arXiv preprint arXiv:1904.07531, 2019.

[204] Dwaipayan Roy, Debasis Ganguly, Sumit Bhatia, Srikanta Bedathur, and Mandar Mitra. Using word embeddings for information retrieval: How collection and term normalization choices affect performance. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1835–1838. ACM, 2018.

[205] Hang Li. Learning to rank for information retrieval and natural language processing. Synthesis Lectures on Human Language Technologies, 4(1):1–113, 2011.

[206] Sally Cunningham, James Littin, and Ian H Witten. Applications of machine learning in information retrieval. 1997.

[207] Tie-Yan Liu. Learning to rank for information retrieval. Springer Science & Business Media, 2011.

[208] Kevin Duh and Katrin Kirchhoff. Learning to rank with partially-labeled data. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 251–258. ACM, 2008.

[209] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pages 129–136. ACM, 2007.

[210] Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine learning (ICML-05), pages 89–96, 2005.

[211] Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In Advances in Neural Information Processing Systems, pages 315–323, 2009.


[212] Martin Szummer and Emine Yilmaz. Semi-supervised learning to rank with preference regularization. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 269–278. ACM, 2011.

[213] Tao Qin and Tie-Yan Liu. Introducing letor 4.0 datasets. arXiv preprint arXiv:1306.2597, 2013.

[214] Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of machine learning research, 6(Jul):1019–1041, 2005.

[215] Wei Chu and S Sathiya Keerthi. New approaches to support vector ordinal regression. In Proceedings of the 22nd international conference on Machine learning, pages 145–152. ACM, 2005.

[216] William S Cooper, Fredric C Gey, and Daniel P Dabney. Probabilistic retrieval based on staged logistic regression. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 198–210. ACM, 1992.

[217] Brian Bartell, Garrison W Cottrell, and Richard Belew. Learning to retrieve information. In Proceedings of the Swedish Conference on Connectionism, page 27. Citeseer, 1995.

[218] Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. Adapting ranking svm to document retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193. ACM, 2006.

[219] Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, Xu-Dong Zhang, and Hang Li. Learning to search web pages with query-level loss functions. Technical Report, 156, 2006.

[220] Tao Qin, Xu-Dong Zhang, Ming-Feng Tsai, De-Sheng Wang, Tie-Yan Liu, and Hang Li. Query-level loss functions for information retrieval. Information Processing & Management, 44(2):838–855, 2008.

[221] Xiubo Geng, Tie-Yan Liu, Tao Qin, and Hang Li. Feature selection for ranking. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 407–414. ACM, 2007.

[222] Hua-Liang Wei and Stephen A Billings. Feature subset selection and ranking for data dimensionality reduction. IEEE transactions on pattern analysis and machine intelligence, 29(1):162–166, 2006.

[223] Yoav Goldberg. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017.

[224] Bhaskar Mitra, Nick Craswell, et al. An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval, 13(1):1–126, 2018.


[225] Nick Craswell, W Bruce Croft, Jiafeng Guo, Bhaskar Mitra, and Maarten de Rijke. Report on the sigir 2016 workshop on neural information retrieval (neu-ir). In ACM Sigir forum, volume 50, pages 96–103. ACM, 2017.

[226] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pages 1291–1299. International World Wide Web Conferences Steering Committee, 2017.

[227] Hang Li and Zhengdong Lu. Deep learning for information retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '16, pages 1203–1206, New York, NY, USA, 2016. ACM.

[228] Kezban Dilek Onal, Ye Zhang, Ismail Sengor Altingovde, Md Mustafizur Rahman, Pinar Karagoz, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, Tyler McDonnell, An Thanh Nguyen, Dan Xu, Byron C. Wallace, Maarten de Rijke, and Matthew Lease. Neural information retrieval: at the end of the early years. Information Retrieval Journal, 21(2):111–182, Jun 2018.

[229] Rudolf Schneider, Sebastian Arnold, Tom Oberhauser, Tobias Klatt, Thomas Steffek, and Alexander Löser. Smart-md: Neural paragraph retrieval of medical topics. In Companion Proceedings of The Web Conference 2018, pages 203–206. International World Wide Web Conferences Steering Committee, 2018.

[230] Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632, 2014.

[231] Jinfeng Rao, Hua He, and Jimmy Lin. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM '16, pages 1913–1916, New York, NY, USA, 2016. ACM.

[232] Bingning Wang, Kang Liu, and Jun Zhao. Inner attention based recurrent neural networks for answer selection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1288–1297, 2016.

[233] Xipeng Qiu and Xuanjing Huang. Convolutional neural tensor network architecture for community-based question answering. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[234] Xiaoqiang Zhou, Baotian Hu, Qingcai Chen, Buzhou Tang, and Xiaolong Wang. Answer sequence learning with neural networks for answer selection in community question answering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 713–718, 2015.


[235] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks for text classification. In Twenty-ninth AAAI conference on artificial intelligence, 2015.

[236] Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics, 4:259–272, 2016.

[237] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management, pages 101–110. ACM, 2014.

[238] Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pages 373–382. ACM, 2015.

[239] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(4):694–707, 2016.

[240] Peixin Chen, Wu Guo, Zhi Chen, Jian Sun, and Lanhua You. Gated convolutional neural network for sentence matching. Proc. Interspeech 2018, pages 2853–2857, 2018.

[241] Seonhoon Kim, Inho Kang, and Nojun Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6586–6593, 2019.

[242] Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In Advances in neural information processing systems, pages 1367–1375, 2013.

[243] Sanjay Kamath, Brigitte Grau, and Yue Ma. Predicting and Integrating Expected Answer Types into a Simple Recurrent Neural Network Model for Answer Sentence Selection. In 20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, April 2019.

[244] Xiaoqiang Zhou, Baotian Hu, Qingcai Chen, and Xiaolong Wang. Recurrent convolutional neural network for answer selection in community question answering. Neurocomputing, 274:8–18, 2018.

[245] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval, pages 55–64. ACM, 2017.


[246] Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. Match-srnn: Modeling the recursive matching structure with spatial rnn. arXiv preprint arXiv:1604.04378, 2016.

[247] Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. Position-aware representations for relevance matching in neural information retrieval. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, pages 799–800, Republic and Canton of Geneva, Switzerland, 2017. International World Wide Web Conferences Steering Committee.

[248] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 655–665, 2014.

[249] Daniel Cohen, Qingyao Ai, and W Bruce Croft. Adaptability of neural networks on varying granularity ir tasks. arXiv preprint arXiv:1606.07565, 2016.

[250] Liu Tieyan, Qin Tao, Xu Jun, et al. Letor: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the Workshop on Learning to Rank for Information Retrieval, pages 137–145, 2007.

[251] Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In Proceedings of the eleventh ACM international conference on web search and data mining, pages 700–708. ACM, 2018.

[252] Thiziri Belkacem, Jose G Moreno, Taoufiq Dkaki, and Mohand Boughanem. Asymmetry sensitive architecture for neural text matching. In European Conference on Information Retrieval, pages 62–69. Springer, 2019.

[253] Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at trec-3. Nist Special Publication Sp, 109:109, 1995.

[254] Trevor Strohman, Donald Metzler, Howard Turtle, and W Bruce Croft. Indri: A language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, volume 2, pages 2–6. Citeseer, 2005.

[255] Christophe Van Gysel, Evangelos Kanoulas, and Maarten de Rijke. Pyndri: a python interface to the indri search engine. In European Conference on Information Retrieval, pages 744–748. Springer, 2017.

[256] C Buckley. trec eval program, March 1999. Available from ftp.cs.cornell.edu/pub/smart, 1975.

[257] Hamed Zamani and W Bruce Croft. Embedding-based query language models. In Proceedings of the 2016 ACM international conference on the theory of information retrieval, pages 147–156. ACM, 2016.


[258] Yuanhua Lv and ChengXiang Zhai. Lower-bounding term frequency normalization. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 7–16. ACM, 2011.

[259] Thiziri Belkacem, Taoufiq Dkaki, José G. Moreno, and Mohand Boughanem. Impact de la présence/absence des termes de la requête dans le document sur le processus d'appariement document-requête en utilisant word2vec. In COnférence en Recherche d'Informations et Applications - CORIA 2018, 15th French Information Retrieval Conference, Rennes, France, May 16-18, 2018. Proceedings, 2018.

[260] Dwaipayan Roy, Debasis Ganguly, Mandar Mitra, and Gareth JF Jones. Representing documents and queries as sets of word embedded vectors for information retrieval. 2016.

[261] Navid Rekabsaz, Mihai Lupu, and Allan Hanbury. Exploration of a threshold for similarity based on uncertainty in word embedding. In Joemon M Jose, Claudia Hauff, Ismail Sengor Altıngovde, Dawei Song, Dyaa Albakour, Stuart Watt, and John Tait, editors, Advances in Information Retrieval, pages 396–409, Cham, 2017. Springer International Publishing.

[262] Geoffrey Zweig, John C. Platt, Christopher Meek, Christopher J. C. Burges, Ainur Yessenalina, and Qiang Liu. Computational approaches to sentence completion. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 601–610, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[263] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.

[264] K Abishek, Basuthkar Rajaram Hariharan, and C Valliyammai. An enhanced deep learning model for duplicate question pairs recognition. In Soft Computing in Data Analytics, pages 769–777. Springer, 2019.

[265] Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer, 2005.

[266] Till Haug, Octavian-Eugen Ganea, and Paulina Grnarova. Neural multi-step reasoning for question answering on semi-structured tables. In European Conference on Information Retrieval, pages 611–617. Springer, 2018.

[267] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. Text classification improved by integrating bidirectional lstm with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.

[268] Kateryna Tymoshenko and Alessandro Moschitti. Cross-pair text representations for answer sentence selection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2162–2173, 2018.


[269] James Allan, Jay Aslam, Nicholas Belkin, Chris Buckley, Jamie Callan, Bruce Croft, Sue Dumais, Norbert Fuhr, Donna Harman, David J Harper, et al. Challenges in information retrieval and language modeling: report of a workshop held at the center for intelligent information retrieval, university of massachusetts amherst, september 2002. In ACM SIGIR Forum, volume 37, pages 31–47. ACM, 2003.

[270] Andreas Zell. Simulation neuronaler Netze, volume 1. Addison-Wesley Bonn, 1994.

[271] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[272] Thiziri Belkacem, Taoufiq Dkaki, Jose G Moreno, and Mohand Boughanem. amv-lstm: an attention-based model with multiple positional text matching. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pages 788–795. ACM, 2019.

[273] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.

[274] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: A human-generated machine reading comprehension dataset. 2016.
