Using NLP Techniques to Enhance Content Discoverability and Reusability for Adaptive Systems
Total Page:16
File Type:pdf, Size:1020Kb
Using NLP Techniques to Enhance Content Discoverability and Reusability for Adaptive Systems A thesis submitted to the University of Dublin, Trinity College in fulfilment of the requirements of the degree of Doctor of Philosophy Mostafa Bayomi Knowledge and Data Engineering Group (KDEG) School of Computer Science and Statistics Trinity College, Dublin Ireland Supervised by Prof. Séamus Lawless 2019 Declaration I declare that this thesis has not been submitted as an exercise for a degree at this or any other university and it is entirely my own work. I agree to deposit this thesis in the University’s open access institutional repository or allow the library to do so on my behalf, subject to Irish Copyright Legislation and Trin- ity College Library conditions of use and acknowledgement. Signature ___________________________ Date ___________________ Mostafa Bayomi ii Permission to lend or copy I, the undersigned, agree that the Trinity College Library may lend or copy this thesis upon request. Signature ___________________________ Date ___________________ Mostafa Bayomi iii Dedication To my mother’s soul iv Acknowledgements First and foremost, all thanks and praise are due to God, who has granted me this great success in my PhD, and has bestowed upon me all the necessary strength, health, wits, patience, and perseverance to complete it. I would like to express my gratitude to numerous people who have helped me to complete my work over the last number of years. I would like to express my utmost gratitude to my supervisor Prof. Séamus Lawless for all his guidance, patience, feedback, encourage- ment, and for his great support personally and professionally throughout my PhD journey. I could not have done it without him. I would like to thank all my colleagues in KDEG and ADAPT who have directly and indirectly supported me both personally and academically during the years of the PhD. My utmost gratefulness and sincerest thanks to my father (Mohamed Bayomi), my broth- ers (Ahmed and Amr) and my sister (Rasha) who have always believed in me and for all their care, support and encouragement. To my lovely wife Alaa, for all her love, devotion, never-ending support and encouragement. To my kids, Hamza and Omar who are the joy of my life. I would like to thank all my Egyptian friends in Dublin who have been a major part of this journey. And a special thanks to two very special people, Ramy Shosha and Rami Ghorab. Words of gratitude and thanks can never do them their rights. And I’ve saved the best for last: My utmost gratefulness and sincerest thanks to my mother (Boushra Shams), whom I wish that she is with me in this moment. She has been and will always be the greatest supportive person in my life, especially in my PhD jour- ney. You have left this world, but you will always be with me. Thanks mum for every- thing, words will never express my feelings for you. May God bless your soul. v Abstract The volume of digital content resources written as text documents is growing every day, at an unprecedented rate. Because this content is generally not structured as easy-to-han- dle units, it can be very difficult for users to find information they are interested in, or to help them accomplish their tasks. This in turn has increased the need for producing tai- lored content that can be adapted to the needs of individual users. A key challenge for producing such tailored content lies in the ability to understand how this content is struc- tured. Hence, the efficient analysis and understanding of unstructured text content has become increasingly important. This has led to the increasing use of Natural Language Processing (NLP) techniques to help with processing unstructured text documents. Amongst the different NLP techniques, Text Segmentation is specifically used to under- stand the structure of textual documents. However, current approaches to text segmenta- tion are typically based upon using lexical and/or syntactic representation to build a struc- ture from the unstructured text documents. However, the relationship between segments may be semantic, rather than lexical or syntactic. Furthermore, text segmentation research has primarily focused on techniques that can be used to process text documents but not on how these techniques can be utilised to produce tailored content that can be adapted to the needs of individual users. In contrast, the field of Adaptive Systems has inherently focused on the challenges associated with dynami- cally adapting and delivering content to individual users. However, adaptive systems have primarily focused upon the techniques of adapting content, not on how to understand and structure this content. Even systems that have focused on structuring content are limited in that they rely upon the original structure of the content resource, which reflects the perspective of its author. Therefore, these systems are limited in that they do not deeply “understand” the structure of the content, which in turn, limits their capability to discover and supply appropriate content for use in defined contexts, and limits the content’s ame- nability for reuse within various independent adaptive systems. In order to utilise the strength of NLP techniques to overcome the challenges of under- standing unstructured text content, this thesis investigates how NLP techniques can be utilised in order to enhance the supply of content to adaptive systems. Specifically, the contribution of this thesis is concerned with addressing the challenges associated with hierarchical text segmentation techniques, and with content discoverability and reusabil- ity for adaptive systems. vi Firstly, this research proposes a novel hierarchical text segmentation approach, named C- HTS, that builds a structure from text documents based on the semantic representation of text. Semantic representation is a method that replaces keyword-based text representation with concept-based features, where the meaning of a piece of text is represented as a vector of knowledge concepts automatically extracted from massive human knowledge repositories such as Wikipedia. Using this approach, C-HTS represents the content of a document as a tree-like hierarchy. This way of structuring the document can be regarded as a hierarchically coherent tree that is useful for supporting a variety of search methods as it provides different levels of granularity for the underlying content. Secondly, this research proposes a novel content-supply service named CROCC. The aim of CROCC is to utilise the produced structure of C-HTS in order to overcome the limitations of the state of the art content-supply approaches. Finally, this research conducts an evaluation of the extent to which the CROCC service enhances content discoverability and reusabil- ity for adaptive systems. vii Table of Contents Abstract ............................................................................................................................ vi 1. Introduction ............................................................................................................... 1 1.1 Motivation ......................................................................................................... 1 1.2 Research Question ............................................................................................ 5 1.2.1 Research Objectives ...................................................................................... 5 1.3 Research Contributions ..................................................................................... 6 1.4 Research Methodology ..................................................................................... 8 1.5 Thesis Overview ............................................................................................. 10 2. State of the Art ........................................................................................................ 13 2.1 Introduction ..................................................................................................... 13 2.2 Natural Language Processing ......................................................................... 13 2.2.1 Overview ..................................................................................................... 13 2.2.2 Low-level NLP Tasks ................................................................................. 14 2.2.3 High-level NLP Tasks ................................................................................. 16 2.2.4 Summary ..................................................................................................... 18 2.3 Text Segmentation .......................................................................................... 18 2.3.1 Overview ..................................................................................................... 18 2.3.2 Content-based and Discourse-based ........................................................... 19 2.3.3 Supervised and Unsupervised ..................................................................... 20 2.3.4 Borderline sentences detection methods ..................................................... 20 2.3.5 Linear and Hierarchical ............................................................................... 21 2.3.6 Hierarchical Text Segmentation Techniques .............................................. 23 2.3.7 Summary ..................................................................................................... 25 2.4 Adaptive Systems ...........................................................................................