French Language DRS Parsing
Ngoc Luyen Le
To cite this version: Ngoc Luyen Le. French language DRS parsing. Computation and Language [cs.CL]. Ecole nationale supérieure Mines-Télécom Atlantique, 2020. English. NNT: 2020IMTA0202. tel-03132658

HAL Id: tel-03132658
https://tel.archives-ouvertes.fr/tel-03132658
Submitted on 5 Feb 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

DOCTORAL THESIS OF THE ÉCOLE NATIONALE SUPÉRIEURE MINES-TELECOM ATLANTIQUE BRETAGNE PAYS DE LA LOIRE - IMT ATLANTIQUE
Doctoral School No. 601: Mathématiques et Sciences et Technologies de l'Information et de la Communication
Specialty: Computer Science

By Ngoc Luyen LE

French Language DRS Parsing

Thesis presented and defended in Plouzané on 15 September 2020
Research unit: Lab-STICC UMR 6285 - CNRS
Thesis No.: 2020IMTA0202

Reviewers before the defense:
Panayota Tita Kyriacopoulou, Professor, Université Gustave Eiffel
Benoît Crabbé, Professor, Université Paris Diderot & Institut Universitaire de France

Composition of the jury:
President: Ismaïl Biskri, Professor, Université du Québec à Trois-Rivières
Examiners: Panayota Tita Kyriacopoulou, Professor, Université Gustave Eiffel
Benoît Crabbé, Professor, Université Paris Diderot & Institut Universitaire de France
Annie Forêt, Maître de Conférences HDR, IRISA-ISTIC
Thesis supervisor: Philippe Lenca, Professor, IMT Atlantique
Thesis co-supervisor: Yannis Haralambous, Directeur d'études, IMT Atlantique

IMT ATLANTIQUE
DOCTORAL THESIS

French Language DRS Parsing

Author: Ngoc Luyen LE
Supervisors: Yannis HARALAMBOUS, Philippe LENCA

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy

Thesis prepared at the Department of Computer Science, DECIDE - Lab-STICC - CNRS, UMR 6285, IMT Atlantique

This research was sponsored by Crédit Mutuel Arkéa Bank

Plouzané, Autumn 2020

Acknowledgements

First and foremost, I express my deep gratitude and profound respect to my thesis advisors, Professor Yannis Haralambous and Professor Philippe Lenca, who have helped and encouraged me at all stages of my thesis work with great patience and immense care. I am especially indebted to Yannis Haralambous, who gave me the golden opportunity to do this thesis and to acquire a great amount of knowledge and experience in the fields of natural language processing and deep learning, and who devoted his time to helping me both in resolving scientific matters and in correcting the manuscripts that I wrote during these years.

I gratefully acknowledge the members of my Individual Thesis Monitoring Committee for their time and their valuable feedback on my reports over the academic years. I would particularly like to acknowledge Dr. Éric Saux and Professor Serge Garlatti, who gave me extremely useful help, advice and encouragement.

I would also like to thank all my colleagues from the Computer Science department of IMT Atlantique and the CNRS Lab-STICC DECIDE research team for making these past years unforgettable. Among others, a very special thank you to Lina Fahed and Kafcah Emani Cheikh for their invaluable advice on my research and for always being supportive of my work. I am also very grateful to Armelle Lannuzel, who helped me in numerous ways during various stages of my PhD.
I took part in many other activities, with countless people teaching and helping me at every step. Thank you to my friends from Plouzané, Brest and other cities, especially tonton Riquet, Francette, Denise, Daniel, Nanette, Landy, Fabrice, Gildas and Polla. I am really thankful to Dr. Pierre Larmande, who gave me the chance to study in France and guided me during my master's internship. I would like to thank my teachers and colleagues at the Information Technology faculty of Da Lat University.

As always, it is impossible to mention everybody who had an impact on this work; however, there are those whose spiritual support is even more important. I feel a deep sense of gratitude to my grandparents, father, mother and brothers, who formed part of my vision and taught me myriad good things that really matter in life. I owe my deepest gratitude to my beloved and supportive wife, Hoai Phuong, who was always by my side when I needed her most and helped a lot in making this study possible, and to my lovable son An Lap, who served as my inspiration to pursue this undertaking. Their unfailing love and support have always been my strength. Their patience and sacrifice will remain my inspiration throughout my life.

This research was financed through a collaboration between Crédit Mutuel Arkéa Bank and IMT Atlantique. I would especially like to thank Dr. Riwal Lefort, Maxime Havez and Bertrand Mignon from the Innovation IA Department of Arkéa, who gave me permission to work in their industrial laboratory. I also very much appreciate the discussions that we had.

Last, I dedicate this thesis to the memory of my grandfather and my grandmother, whose role in my life was, and remains, immense.

Abstract

IMT Atlantique
Department of Computer Science

French Language DRS Parsing
by Ngoc Luyen LE

The rise of the internet, of personal computers and of mobile devices has been changing communication from one-way forms, such as the press or television, to two-way flows of information and interactive communication. In particular, the advent of social networking platforms makes this trend ever more prevalent. User-generated content from social networking services has become a giant source of information that can be useful to organizations and businesses, since users can be regarded as clients or potential clients of businesses, or as members of organizations. Exploiting user-generated texts can help to identify users' sentiments or intentions, and can reduce the effort of the agents in businesses or organizations who are responsible for gathering or receiving information on social networking services. In this thesis, we study semantic analysis and representation for natural language texts in various forms, such as discourses, utterances, and conversations from interactive communication on social networking platforms.

With the purpose of finding an effective way to analyze and represent the semantics of natural language utterances, we examine and discuss various approaches, ranging from rule-based methods to current deep neural network approaches. Deep learning approaches require massive amounts of data, in our case natural language utterances paired with their meaning representations. To address this requirement, we employ an empirical approach and propose a general architecture for a meaning representation framework for the French language.

First of all, for each sequence of input texts, we analyze each word morphologically and syntactically using the formalism of dependency syntax; this constitutes the first module of our architecture.
During this step, we explore lemmas, part-of-speech tags and dependencies as features of words.

Then, a bridge between syntax and semantics is built on the formalism of Combinatory Categorial Grammars (CCG), which provides a transparent syntax-semantics interface. This constitutes the second module of our architecture. The morphological and syntactic data obtained from the previous module are used as input to the extraction of a CCG derivation tree. More precisely, this process consists of two stages: the first assigns a lexical category to each word depending on its position and its relationship with the other words in the sentence; the second binarizes the dependency trees. CCG derivation trees are then obtained by applying the combinatory rules defined in CCG theory to the binary trees.

Finally, we construct a meaning representation for utterances based on Discourse Representation Theory (DRT), using Discourse Representation Structures (DRS) and the Boxer tool by Johan Bos. This constitutes the last module of our architecture. The CCG derivation trees obtained by the previous module are used as input for this module, together with additional information such as chunks and entities. The transformation of input CCG derivation trees into the DRS format is able to handle linguistic phenomena such as anaphora and coreference, among others. As output, we obtain data either in first-order logic (FOL) or in the DRS boxing format.

By implementing our architecture, we have built a French CCG corpus based on the French Treebank corpus (FTB). Furthermore, we have shown that embedding features from lemmas, POS tags and dependency relations improve the accuracy of the CCG supertagging task using deep neural networks.
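
To make the second module more concrete, the following is a minimal Python sketch of how combinatory rules can be applied bottom-up to a binarized tree whose leaves already carry lexical categories. It is an illustration only and not the implementation used in this thesis: the class names `Atom` and `Functor`, the functions `combine` and `derive`, the example sentence and its category assignment are all assumptions, and only the two application rules are covered (CCG theory also defines composition and type-raising).

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """Atomic category such as S or NP (hypothetical helper class)."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Functor:
    """Complex category: result/arg seeks arg to the right, result\\arg to the left."""
    result: "Category"
    slash: str
    arg: "Category"

    def __str__(self) -> str:
        return f"({self.result}{self.slash}{self.arg})"


Category = Union[Atom, Functor]


def combine(left: Category, right: Category) -> Optional[Category]:
    """Apply forward or backward application; return None if neither rule fits."""
    # Forward application:  X/Y  Y  =>  X
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    # Backward application:  Y  X\Y  =>  X
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


def derive(tree) -> Category:
    """Compute the category at the root of a binary tree.

    A leaf is a (word, category) pair; an internal node is a (left, right)
    pair of subtrees.
    """
    if isinstance(tree[0], str):  # leaf: return its lexical category
        return tree[1]
    left, right = derive(tree[0]), derive(tree[1])
    result = combine(left, right)
    if result is None:
        raise ValueError(f"cannot combine {left} and {right}")
    return result


# Assumed toy lexicon for "Marie aime Paul" (illustrative, not from the corpus).
S, NP = Atom("S"), Atom("NP")
TV = Functor(Functor(S, "\\", NP), "/", NP)  # transitive verb: (S\NP)/NP

sentence = (("Marie", NP), (("aime", TV), ("Paul", NP)))
print(derive(sentence))  # S
```

In this sketch the verb first consumes its object by forward application, and the resulting verb phrase then consumes the subject by backward application, which yields the sentence category S.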

Résumé

IMT Atlantique
Computer Science Department

Discourse Representation Structure Parsing for French
by Ngoc Luyen LE

The rise of the Internet, of personal computers, and of digital and mobile devices is changing various forms of communication, moving from one-way, such as