Natural Language Processing for Recommender Systems

by

Haifa Alharthi

Thesis submitted in partial fulfillment of the requirements for the

PhD degree in Computer Science

School of Electrical Engineering and Computer Science

Faculty of Engineering

University of Ottawa

© Haifa Alharthi, Ottawa, Canada, 2019

Abstract

The act of reading has benefits for individuals and societies, yet studies show that reading is declining, especially among the young. Recommender systems (RSs) can help stop such decline. There is a lot of research on analyzing literary texts with natural language processing (NLP) methods, but the analysis of textual book content to improve recommendations is relatively rare. We propose content-based recommender systems that extract elements learned from book texts to predict readers’ future interests. One factor that influences reading preferences is writing style; we propose a system that recommends books after learning their authors’ writing style. To our knowledge, this is the first work that transfers the information learned by an author-identification model to a book RS. Another approach that we propose uses over a hundred lexical, syntactic, stylometric, and fiction-based features that might play a role in generating high-quality book recommendations. Previous book RSs include very few stylometric features; hence, our study is the first to include and analyze a wide variety of textual elements for book recommendations. We evaluated both approaches in a top-k recommendation scenario. They give better accuracy when compared with state-of-the-art content-based and collaborative filtering methods. We highlight the significant factors that contributed to the accuracy of the recommendations using a forest of randomized regression trees. We also conducted a qualitative analysis by checking whether similar books/authors were annotated similarly by experts.

Our content-based systems suffer from the new-user problem, well known in the field of RSs, which hinders their ability to make accurate recommendations. Therefore, we propose a Topic Model-Based book recommendation component (TMB) that addresses the issue by using the topics learned from a user’s shared text on social media to recognize their interests and map them to related books. To our knowledge, there is no literature regarding book RSs that exploits public social networks other than book-cataloging websites. With topic modeling techniques, extracting user interests can be automatic and dynamic, without the need to search for predefined concepts. Though TMB is designed to complement other systems, we evaluated it against a traditional content-based book RS (CB). We assessed the top-k recommendations made by TMB and CB and found that both retrieved a comparable number of books, even though CB relied on users’ rating history, while TMB only required their social profiles.

Acknowledgements

I would like to express my deepest gratitude and thanks to Professor Diana Inkpen, my supervisor, mentor, and role model. Her genuine care, help, and patience brought out the best in me and other students. I am grateful for all of the resources that Professor Inkpen has provided, from Tamale seminars and group discussions to the productive and energetic lab working environment. It is an honour working with and learning from such an icon.

Thank you to the Saudi Electronic University (SEU) and the Natural Sciences and Engineering Research Council of Canada (NSERC) for funding my research. Also, I am thankful to NVIDIA Corporation for their generous donation of a Titan X GPU, which helped reduce the experimentation time.

Thank you to all of the professors at the University of Ottawa who sparked my passion in the field and provided me with the knowledge I needed. I am grateful to all my Ph.D. committee members, Professor Anna Feldman, Professor Chris Tanasescu, Professor Stan Szpakowicz, and Professor Tony White, for their insightful comments and feedback. Special thanks to Professor Szpakowicz for his constructive criticism and challenging questions, and to Professor Tanasescu for his valuable suggestions.

Thank you to Dr. Kenton White and Dr. Jelber Shirabad for their time and feedback.

Thank you to all of the Tamale seminar speakers who motivated us to explore new methods.

Thank you to the researchers, especially Dr. Paula Cristina Vaz and Dr. Julian Brooke, who shared their collected datasets and developed tools that facilitated my work. Thank you to my colleagues, especially Zunaira Jamil, Ruba Skaik, Prasadith Buddhitha, SHY, Ehsan Amjadian, and Doaa Ibrahim, for being there when I needed help. Thank you to the city’s baristas who brought colour and joy to my day-to-day research life.

I thank my mom and all my family members who have endlessly supported and believed in me. Special thanks to Khalid, who has always been there through the ups and downs, and to whom I owe this achievement.

None of this could have been achieved without the help of God to whom I am most grateful for inspiring me, putting amazing people along my way, and helping me through hardships.

Contents

1 Introduction
1.1 Overview
1.2 Motivation
1.2.1 The importance of reading and book recommender systems
1.2.2 The use of book textual content
1.2.3 User cold start
1.3 Problem Statement
1.3.1 Book recommendations based on textual content
1.3.2 Topic modeling in book RSs for new users
1.4 Contributions
1.5 Organization of the Thesis
1.6 Published Papers

2 Background
2.1 Collaborative Filtering
2.2 Content-based Recommender Systems
2.2.1 Book recommendations based on their metadata
2.2.2 Recommendations based on books’ textual content
2.2.3 Recommendations of text-based items
2.3 Social Recommender Systems and User-generated Content
2.3.1 Recommendations on book-cataloguing platforms
2.3.2 Recommendations based on social media
2.3.3 Recommendations based on user-generated texts
2.4 Book Recommendations based on Association Rules
2.5 Other Approaches to Book Recommendations
2.6 Deep Learning and Recommender Systems
2.6.1 Basics of deep neural networks
2.6.2 Deep learning recommender systems
2.7 Datasets for Evaluating Book RSs

3 Authorship Identification for Literary Book Recommendations
3.1 Overview
3.2 Methodology
3.2.1 Author identification
3.2.2 Book recommendations
3.3 Evaluation
3.3.1 Dataset and preprocessing
3.3.2 The experimental setting
3.4 Results and Analysis
3.4.1 Limitations
3.5 Conclusion and Future Work

4 Study of Linguistic Features Incorporated in a Literary Book Recommender System
4.1 Overview
4.2 Methodology
4.2.1 Feature selection
4.2.2 Recommendation procedure
4.3 Evaluation
4.3.1 Dataset and preprocessing
4.4 Results and Analysis
4.5 Conclusion and Future Work

5 Topic Model-Based Book Recommendation Component
5.1 Overview
5.2 Topic Model-Based Book Recommender System
5.2.1 Book and user profiles
5.2.2 The recommendation procedure
5.3 Data Preparation and System Implementation
5.3.1 Data
5.3.2 Data preprocessing
5.3.3 System implementation
5.4 Experiments and Results
5.5 Discussion
5.5.1 Error analysis
5.5.2 Limitations
5.6 Conclusion and Future Work

6 Conclusion and Future Work
6.1 Summary
6.2 Highlights and Challenges
6.2.1 Explainability
6.2.2 Overspecialization
6.2.3 Availability of adequate content
6.2.4 Books in multiple languages
6.2.5 Limitations in the evaluation
6.3 Deployment of Our Systems
6.4 Future Work
6.4.1 Deep-learning-based RSs
6.4.2 Graph-based RSs
6.4.3 The effect of mood on book selection
6.4.4 Books for non-readers

A Details about the Linguistic Features
A.1 The full list of LIWC features with examples
A.2 The full list of importance values generated by ET
A.3 The correlation values between highly correlated features

B Additional TMB Preprocessing Information
B.1 100 most frequent nouns in Oxford English Corpus
B.2 Words with low IDF weights

C Additional Experiments
C.1 The difference in performance when predicting binary vs. 1-5 ratings
C.2 Embeddings of fictional characters and places

List of Tables

2.1 An example of a ratings matrix of 4 user-book preferences
2.2 Features of book-recommendation datasets
3.1 Author information of books similar to “The Fitz-Boodle Papers”
4.1 Categories of features included in our study
4.2 Most and least important features according to the ET model
4.3 Top ten similar books to Madame de Treymes by Edith Wharton
A.1 LIWC dimensions, labels and examples
A.2 All features ordered according to ET importance values
A.3 Highly correlated features in descending order
C.1 Extension of the results presented in Table 3.2
C.2 Extension of the results presented in Table 4.1
C.3 The accuracy of book recommendations based on different combinations of embeddings

List of Figures

2.1 The process of recommendation in content-based RSs (Lops et al., 2011)
2.2 The architecture of the Skip-gram model (Mikolov et al., 2013)
2.3 Feed-forward neural network (Goldberg, 2016)
2.4 Illustration of the calculation inside one artificial neuron
3.1 The use of a convolutional neural network for authorship identification
3.2 Precision@10 and recall@10 generated by our system (AuthId_FC and AuthId_MP), and the baselines
3.3 Users’ relevant recommended books versus unread books by the same authors
3.4 Effect of text length on recommendation accuracy
4.1 Recommendation accuracy of our approach against the baselines
4.2 Comparison of linguistic features vs. AuthId book representations
4.3 Most and least influential variables in accurate ET models
4.4 The performance of ET and KNN when only top features are considered
5.1 Dataset collection and statistics
5.2 The comparison of TMB approaches, CB and the random system
5.3 Word embeddings of terms in a user profile and two book descriptions

Chapter 1

Introduction

1.1 Overview

With the current state of information overload, Internet users can often find it difficult to choose from the multitude of available products and services, and this has generated a demand for recommender systems (RSs) that can provide personalized suggestions. The idea behind RSs is not new; just as it is common to ask acquaintances for recommendations when choosing restaurants, movies, books and such, recommender systems predict how likely the target user is to be interested in a particular item, even if the user is unfamiliar with it.

To make recommendations, RSs typically need items (i.e., recommended objects), users and user feedback about the items. Users who receive recommendations interact with the system, and their interactions are stored in a database to be used for future recommendations.

User opinion or feedback is logged and stored whether it is explicit or implicit (Shani and Gunawardana, 2011; Ricci et al., 2011); users provide explicit feedback in the form of ratings (e.g., 1 to 5), and implicit feedback is determined from user behaviour (e.g., reading an article). For simplicity, in this thesis, we refer to user feedback as ratings. To make suggestions, RSs exploit users’ rating history, social-media content, relationships, personality and emotions, as well as product features.
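The two feedback types described above can be illustrated with a toy ratings matrix. All data below, and the scheme for falling back to an implicit score, are illustrative stand-ins, not the thesis’s own storage format.

```python
# Toy illustration of explicit vs. implicit feedback per user-book pair.
# Explicit: 1-5 star ratings (0 = no rating given).
# Implicit: a binary "the user read this book" signal.
import numpy as np

users, books = ["u1", "u2"], ["b1", "b2", "b3"]
explicit = np.array([[5, 0, 3],
                     [0, 4, 0]])
read = np.array([[1, 0, 1],
                 [1, 1, 0]])

# One simple unification (system-dependent): where no explicit rating exists,
# fall back to a neutral score of 3 for books the user has read.
ratings = np.where(explicit > 0, explicit, read * 3)
print(ratings)
```

Real systems use more elaborate mappings from behaviour to preference, but the point is the same: both signals end up in one user-item matrix that later modules treat uniformly as "ratings".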

There are two main types of recommender systems: collaborative filtering (CF) and content-based (CB). CF requires a rating matrix of all the users in the system in order to predict a target user’s reading preferences, while CB relies solely on the target user’s ratings to find patterns in their previous preferences. CB avoids many issues that impact CF, as detailed in Section 2.2. Our proposed approach is a CB system that makes personalized book recommendations by exploiting the textual content of relevant books. Personalized RSs make suggestions specific to each user, rather than recommending items for groups of users (e.g., those clustered according to demographics). To deal with new users, we developed a module that complements the main system, learns users’ interests from social media, and makes appropriate personalized book recommendations.
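The content-based setup just described can be sketched minimally: represent each book’s text as a vector, fit a simple model on the target user’s own ratings, and rank unread books by predicted interest. The texts, ratings, and the TF-IDF/KNN choices below are toy stand-ins, not the systems proposed in this thesis.

```python
# Minimal content-based recommendation sketch (toy data).
# Books are TF-IDF vectors of their text; a nearest-neighbour regressor fit on
# one user's ratings scores unread books, which are then ranked (top-k).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

rated_texts = ["a tale of love and loss", "a detective hunts a killer",
               "love letters from the war", "the spy and the killer"]
ratings = [5, 2, 4, 1]            # the target user's explicit 1-5 ratings
unread_texts = ["a wartime love story", "a killer on the loose"]

vec = TfidfVectorizer()
X_rated = vec.fit_transform(rated_texts)
X_unread = vec.transform(unread_texts)

model = KNeighborsRegressor(n_neighbors=2, metric="cosine")
model.fit(X_rated, ratings)

scores = model.predict(X_unread)  # predicted interest for each unread book
ranked = sorted(zip(unread_texts, scores), key=lambda p: -p[1])
print(ranked)
```

Note that only the target user’s own ratings are used, which is exactly what distinguishes CB from CF: no other user’s rating matrix is required.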

1.2 Motivation

1.2.1 The importance of reading and book recommender systems

The deployment of RSs in e-commerce has advantages for both sellers and consumers. Sellers’ goals are to make their products accessible to interested clients, achieve consumer satisfaction, and gain loyalty. These objectives can be met if users continually receive products that meet their needs. Sellers can also make a profit using automatic RSs, since there is no need for additional overhead (e.g., employees). In addition, consumers receive a list of products most likely to be useful to them, thereby saving the time, effort, and resources required to find items they truly appreciate.

The benefits of recommendations can surpass e-commerce scenarios. In this work, we focus on book recommendations that could be useful for libraries, schools, and e-learning portals. The proliferation of e-books gives readers access to vast and inexpensive resources with little effort. Because of this, one might assume that book-reading would become more widespread. However, the numbers actually prove the opposite; the practice of reading for pleasure is declining, particularly among young people1.

A variety of studies have found that reading is beneficial. A comparison of the well-being of 7,500 adult Canadian readers and non-readers determined that the former are significantly more likely to report better physical and mental health, to do volunteer work, and to feel satisfied with life (Hill, 2013). Such statistics support the assertion by the National Institute of Child Health and Human Development (NICHD) that ‘Reading is the single most important skill necessary for a happy, productive and successful life’2. Moreover, reading fiction has been found to stimulate profound social communication (Mar and Oatley, 2008), and is correlated with greater ability for empathy and social support (Mar et al., 2009). It also has lingering biological effects, particularly with respect to the connectivity of the brain (Berns et al., 2013). Thus, employing artificial intelligence techniques to spark interest in reading by recommending appropriate types of books is a worthwhile endeavor.

In 2010, Google estimated that there were 129,864,880 books in the world3, and the count keeps going up. Approximately one million new and revised titles were published in the UK, China, and the USA in 2013 alone4. It is overwhelming for library staff to go through this massive number of books, comprehend them, and then suggest them to individuals, and it has become evident that automation is required to manage the functions of personalized book recommendations. Moreover, people tend to trust RSs; one study (Chen, 2008) found that consumers were more interested in books labeled ‘customers who bought this book also bought’ than in books marked ‘recommended by the bookstore staff’.

1https://tinyurl.com/jgunwfx
2https://www.ksl.com/?sid=15431484
3https://tinyurl.com/ydz4w8jt
4https://tinyurl.com/yd9a6f9t

1.2.2 The use of book textual content

In this thesis, we propose a book RS that analyzes book texts and determines the user’s reading preferences. The following reasons drive this approach.

The exploitation of book texts in RSs is limited

In Chapter 2, we summarized approximately thirty research papers dedicated to book RSs and found that the majority applied book metadata (e.g., author, genre) to build content-based RSs; only a few took the actual text of the books into account. Vaz et al. (2012a) represented books using topics learned by Latent Dirichlet Allocation (LDA), and the style features of vocabulary richness, document length, part-of-speech bigrams, and the most frequent words in a book.

Zhang and Chow (2015) represented authors as a four-layer tree containing author information and their book texts divided into pages and paragraphs. Garrido et al. (2014) relied on a book’s text to predict its social tags, which were then deployed to generate recommendations.

Though using book texts can raise copyright issues, there are some initiatives that encourage RS researchers to follow this direction. The well-known retrieval system Google Books, which searches the full text of books for terms that appear in a given query, receives books from their authors and publishers through the Google Books Partner Program5. Another project is the HathiTrust Digital Library6, which allows searching the texts of 16 million book volumes, either in the public domain (six million) or copyrighted works (ten million). Commercial recommender systems that perform natural language processing on the text of books are now appearing. For example, BookLamp, which was acquired by Apple in 2014,7 is described by its previous CEO as a Book Genome Project that, upon receipt of a digitized book text from its publisher, “measures the DNA of each scene, looking for 132 different thematic ingredients, and another 2,000 variables”8. In 2014, the number of books indexed by BookLamp each week was 40,000-100,0009.

5https://www.google.com/googlebooks/partners/
6https://www.hathitrust.org

The proliferation of e-book readers such as Kindle and Kobo has contributed to the most severe decline in print book sales since the printing press was developed. Amazon declared that sales of e-books surpassed those of print books in 2012 (Zhang and Chow, 2015). Since the vast majority of books are now available in electronic versions, RS researchers can make use of these resources without converting hard copies to digital texts. E-book promotions are already delivered over Wi-Fi to Kindle, Kobo, and other e-readers (Zhang and Chow, 2015). Our proposed system could also be deployed by e-book providers, and help current online stores boost their deployed systems.

Advances in natural language processing encourage the exploitation of book texts

The use of NLP to analyze literary books is already an active area; it has been applied to profiling characters (Kokkinakis and Malm, 2011) and their personalities (Flekova and Gurevych, 2015), automatic genre identification (Ardanuy and Sporleder, 2016), and the extraction of social networks of characters (Elson et al., 2010). Recent conferences and workshops have been dedicated to the application of computational linguistics to literature, including ICCLL10 and LaTeCH-CLfL11. Though a great deal of work has been devoted to literary books, analysis of their textual content to improve book recommendations is still relatively rare. In fields such as video and music recommendation (Shao et al., 2009; Deldjoo et al., 2016), it is common to rely on the full content of items, such as visual features (e.g., lighting, color) and acoustics, instead of the metadata. This has motivated us to analyse the actual text of the books in order to model user interests.

7http://www.businessinsider.com/apple-buys-booklamp-2014-7
8https://tinyurl.com/3qvq8zj
9https://tinyurl.com/ycpwnvze
10http://www.iccll.org/
11https://tinyurl.com/yaquubks

Books have features that distinguish them from other textual items

Current RSs have applied NLP techniques to analyze textual items, including books (Vaz et al., 2012a), news articles (Kompan and Bieliková, 2010), microblogs (Gurini et al., 2013), movie plots (Bergamaschi and Po, 2015), and scientific papers (Wang and Blei, 2011). Each of these domains has unique characteristics that distinguish it from the others. For example, news articles have a short lifetime and can become irrelevant within days or even hours, and there are always new articles which have not yet been rated by users. The content is dynamic, and changes are continuous. When dealing with scientific papers, the timeframe is critical in a different way; recent papers are more relevant, but older works are fundamental to learning the basics of the research field. In addition, papers must meet the relatively narrow interests of the targeted researcher. Microblogs are short texts, typically written in informal language, that can track interactions among users (e.g., replies) and may contain hashtags that summarize their main topics. Movie plots are conversational, and complement visual and sound effects.

The nature of items must be taken into consideration when making recommendations. Unlike other text-based content, many books are hundreds of years old and are still widely read and recommended. Books include aspects that can be unrelated to present times and actions, and specific elements of books can influence a user’s reading preference. Reader-advisory services that recommend literary books in libraries rely on the following seven appeal factors:

• Pacing represents how fast or slow reading a book feels (e.g., intense, leisurely);

• Characterization describes the nature and number of fictional characters (e.g., well-defined characters, emphasis on one vs. several characters);

• Storyline reflects the plot, theme and genre (e.g., complex, character-driven);

• Style represents the language in a book, such as dry, poetic or conversational, and how readers perceive it. (Readers who tend to appreciate writing style consider it the most appealing factor);

• Setting identifies where and when the story takes place (e.g., descriptive, a historical period);

• Tone/Mood is the feelings a book stimulates in the reader (e.g., upbeat, creepy); and,

• Frame is the reader’s impression of a book, which comprises setting, atmosphere and tone (Smith et al., 2016).

Researchers are encouraged to create book recommendations that stem from features specific to books. To gather information about books’ appeal factors, librarians can subscribe to fee-based reader-advisory databases, such as NoveList12, that are established by professionals (Pera and Ng, 2014b). It is helpful to exploit book texts directly, rather than rely on labels (e.g., genre) tagged by experts. The list of available labeled books is far from complete, due to the difficulty of manually processing the vast number of existing books. For example, in some tests, we did not find information about specific authors/books in NoveList. Also, shallow categorization of books, by just labeling literary works as fiction, is common, and when books are classified under multiple sub-genres, there is no indication of the degree to which a book belongs to a particular genre.

12https://www.ebscohost.com/novelist/our-products/novelist-plus

In this thesis, two CB approaches are proposed. One represents books as features that touch on many appeal factors, while the other recommends books after learning their authors’ style. It is natural to think that an author’s writing style is a factor when it comes to book recommendations. Style is defined by the Oxford English Dictionary as “The manner of expression, characteristic of a particular writer (hence of an orator) or of a literary group or period; a writer’s mode of expression considered in regard to clearness, effectiveness, beauty, and the like” (Jeremy, 2000). Stylometry, the computational analysis of writing styles, emerged in the late nineties (Stamatatos, 2009), and it is usually applied in authorship-identification tasks that recognize the author of a given text. We are motivated to represent books according to their authors’ writing styles, because the use of stylometric features for literary book recommendations is promising, as Vaz et al. (2012a) suggest. In addition, the study of authorship attribution is an active and well-known research area. We use deep neural networks (DNNs) (see Section 2.6), which have been shown to enhance the performance of many NLP applications, including authorship attribution (Solorio et al., 2017; Qian et al., 2017). By applying transfer learning, modeling books according to style becomes straightforward.

1.2.3 User cold start

One challenge facing RSs is the user cold start, which occurs when new users with no rating history are introduced to the system, making it difficult to generate personalized recommendations for them. A commercial portal that provides inaccurate first recommendations to new users risks losing them. This issue has been widely investigated in RSs. One method is to ask new users to rate items at registration, until sufficient ratings are received (Rashid et al., 2002; Kohrs and Mérialdo, 2001). As the signup process can be lengthy and consume much of users’ time and effort, many methods have been proposed to reduce users’ effort, including that of Rashid et al. (2008).

Another method is to make recommendations based on personal information about users, such as demographics (Safoury and Salah, 2013) and personality traits (Fernández-Tobías et al., 2016). Social media is a helpful resource to ‘warm up’ a user’s cold start, as users voluntarily share textual and visual content on these platforms; they also create social networks of friends and followers. User connections on social networks were used in (Castillejo et al., 2012), (Sedhain et al., 2014), (Guy et al., 2010), (Ben-Shimon et al., 2007) and (Mican et al., 2012). In addition to using social media friends’ lists, Sedhain et al. (2014) analyzed user demographics and the pages a user liked.

To our knowledge, there are no book RSs that exploit social networks other than book-cataloging websites. The topics learned from the text a user shares on social media can help recognize the user’s interests and connect them to relevant books. Thanks to topic modeling techniques, the process of extracting user interests can be automatic and dynamic, without the need to search for occurrences of predefined concepts.

Privacy issues are raised whenever users’ personal information (e.g., interests) is involved.

However, the data we used has already been shared publicly by users whose consent is required prior to processing their profiles. Moreover, once deployed, our system can give users the ability to remove any topics they consider sensitive.

1.3 Problem Statement

This thesis addresses two recommendation problems. Subsection 1.3.1 describes standalone systems which use textual book content to make recommendations according to the target user’s explicit ratings. In user cold-start situations, the module described in Subsection 1.3.2 complements the former methods by suggesting books based on user interests inferred from social media, without the need for the user’s rating history.

1.3.1 Book recommendations based on textual content

Traditional RSs do not take advantage of literary books’ texts, the use of which could lead to a better understanding of user reading preferences and to high-quality recommendations. We propose two recommendation approaches, both of which are content-based recommender systems (CB) with a classifier or regressor that analyzes the text of the books read by the target user and predicts future interests. We aim to improve the recommendation accuracy for English literary books, which are typically formal and well written (mostly free of misspellings and grammatical errors). Literary books can be fiction or non-fiction (i.e., memoirs with story elements such as characters and plot (Brown and Krog, 2011)).

Here, we assume the availability of the target user’s explicit ratings, which might exist depending on the deployed system. Our system works in a ranking recommendation scenario, which is widely applied on e-commerce platforms and in multimedia and content recommendations (e.g., news). In this scenario, the system presents a list of the top items in a sidebar, sorted in descending order according to their relevance to the user. In addition to ranking, other RSs apply prediction and classification. The former is useful in situations such as movie rental, to help a customer decide about a product by predicting the level of user interest (e.g., four stars), while the latter predicts whether the user will like or dislike a particular product (Schröder et al., 2011).

The first CB transfers information learned by an author-identification DNN model, inspired by (Solorio et al., 2017), to a book recommendation module. Using the book text as input, a convolutional neural network (CNN) predicts the author, and once it achieves good accuracy, the penultimate hidden layer, which serves as the book representation, is extracted. Then, a regressor is trained over the book representations associated with the target user’s ratings, to generate a list of ranked books. The CNN learns the aspects that are most important to distinguish the authors from one another. Though these facets are latent, we believe they could reflect distinct diction (word choices), topics, themes and more. However, such a network is not expected to capture all the elements of style (e.g., sentence structure).
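The transfer step above can be sketched in a few lines. The forward pass below uses untrained random weights and illustrative layer sizes; it is not the thesis’s architecture, only a demonstration of how the activations of the penultimate dense layer become a fixed-length book vector while the author-softmax layer is used only during CNN training.

```python
# Sketch: extracting a penultimate-layer book representation from a toy
# author-identification CNN (random weights; sizes are illustrative).
import numpy as np

rng = np.random.default_rng(0)

def conv1d_maxpool(embedded, filters):
    # embedded: (seq_len, emb_dim); filters: (n_filters, width, emb_dim)
    n_filters, width, _ = filters.shape
    seq_len = embedded.shape[0]
    feats = np.empty(n_filters)
    for f in range(n_filters):
        acts = [np.sum(embedded[i:i + width] * filters[f])
                for i in range(seq_len - width + 1)]
        feats[f] = max(np.maximum(acts, 0.0))   # ReLU + max-over-time pooling
    return feats

emb_dim, n_filters, width, n_authors = 16, 8, 3, 5
embedding = rng.normal(size=(1000, emb_dim))    # toy vocabulary of 1000 words
filters = rng.normal(size=(n_filters, width, emb_dim))
W_hidden = rng.normal(size=(n_filters, 32))     # penultimate dense layer
W_out = rng.normal(size=(32, n_authors))        # softmax layer over authors

book_tokens = rng.integers(0, 1000, size=200)   # a (toy) tokenized book
pooled = conv1d_maxpool(embedding[book_tokens], filters)
book_repr = np.maximum(pooled @ W_hidden, 0.0)  # <- reused by the recommender
author_scores = book_repr @ W_out               # used only for CNN training
print(book_repr.shape, author_scores.shape)
```

In the actual pipeline, the network is first trained to predict authors; only afterwards is `book_repr` extracted for every book and passed, together with the target user’s ratings, to the regressor.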

The second proposed system characterizes books as vectors of calculated lexical, syntactic, character-based, fiction-specific, and style-based measurements. We include a total of 120 features, as explained and justified in Section 4.2.1. Although our goal is not to model books according to the above appeal factors explicitly, many of the considered dimensions superficially relate to them. For example, we calculated the number of fictional characters in a book, which may reflect aspects of characterization. The book vectors are input to multiple recommendation modules, including a ‘forest’ of randomized regression trees whose variable-importance values can provide insight into the factors that contribute to recommendation accuracy. Identification of significant elements in reading preferences helps writers, publishers, and RS owners understand the elements that appeal to individuals and communities.
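The analysis step of this second approach can be sketched with scikit-learn’s extremely randomized trees. The feature names and synthetic data below are toy stand-ins for the 120 features of Chapter 4; the sketch only shows how a fitted forest exposes importances and ranks unread books.

```python
# Sketch: a forest of randomized regression trees fit on per-book linguistic
# feature vectors and one user's ratings; feature importances hint at what
# drives that user's preferences. Toy data: this synthetic user favours
# character-rich books.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

feature_names = ["avg_sentence_len", "vocab_richness", "num_characters",
                 "pos_bigram_div", "dialogue_ratio"]
rng = np.random.default_rng(42)
X = rng.uniform(size=(60, len(feature_names)))      # 60 rated books
y = 1 + 4 * X[:, 2] + 0.1 * rng.normal(size=60)     # ratings driven by feature 2

et = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(feature_names, et.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")

# Rank unread books by predicted rating (top-k recommendation).
unread = rng.uniform(size=(10, len(feature_names)))
top_k = np.argsort(-et.predict(unread))[:3]
```

On this synthetic user, `num_characters` dominates the importance ranking, which is the kind of signal the thesis uses to interpret reading preferences.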

1.3.2 Topic modeling in book RSs for new users

Book recommender systems can provide new users with quality suggestions, without requiring them to fill in lengthy forms regarding their demographic information, personality traits, or previous reading preferences. The availability of user-generated texts on social media provides a chance for RSs to learn information voluntarily shared by users. We propose an automatic personalization module that uses text shared by the target user on Twitter and matches it to book topics. A profile is created for each user, summarizing the subjects discussed on their social media account using topic modeling techniques. The topics found in the user and book profiles are represented as word embeddings. The user profiles are then matched with book descriptions, and the most similar books are suggested.

In this module, we use book descriptions available online; however, topics can also be learned from books’ full texts. When assessing the system, we found that few users in the dataset are both active on social media and have read books with out-of-copyright text. Furthermore, book descriptions could serve as a sufficient resource that can capture the main topics of interest to users.

Our research investigated using Twitter data to make book recommendations, since Twitter is the most popular microblogging service, with more than 313 million active users writing in 40 languages13. As Twitter is not exclusive to book lovers, it can help address the issue of new users without reading profiles. Since its establishment, Twitter has been used to survey opinions, report news (more than 85% of Twitter activities are related to news events), raise awareness, create social and political movements, and more. Topics discussed on Twitter are up-to-date and diverse (De Francisci Morales et al., 2012). Hence, it offers a chance to understand the reactions of active users to their surroundings, e.g., the social and political scene.

1.4 Contributions

• We summarize the current literature on book recommender systems and categorize them based on features used to characterize books (published in (Alharthi et al., 2018a)).

• We propose a system that recommends books after learning their authors’ style; to our knowledge, this is the first work that applies information learned by an author-identification model to book recommendation. Given its textual content, a book representation is learned by the author-identification classifier and then fed to a recommendation module. The system has higher recommendation accuracy than many competitive content-based and collaborative filtering systems. When analyzing the effect of text length, we observed a trend: the more text fed to the author-identification model, the more accurate the recommendation. We also conducted a qualitative analysis by checking if similar books/authors were annotated similarly by experts from NoveList.

13 https://about.twitter.com/company

• We employ a book RS that applies up to 120 linguistic aspects learned from book text, including lexical, syntactic and fiction-specific features (e.g., the number of fictional characters). In the literature on book RSs, we found one RS (Vaz et al., 2012a) that applies multiple stylometric features, two of which are included in our study (document length and vocabulary richness). Also, (Pera and Ng, 2014a,b, 2015) used readability scores to filter out books recommended for emergent readers. To our knowledge, our study is the first to include and analyze all these types of features for book recommendation. We propose two recommendation methods and show that they outperform several content-based and collaborative filtering recommenders in the top-k recommendation scenario. We studied the effect of multiple linguistic elements on reading preferences; in particular, we highlight the variables considered important using an ensemble of randomized trees and produce book suggestions based on them. We conducted a further qualitative analysis by studying book descriptions from NoveList and authors of similar books.

• We propose an automatic personalization module called the Topic Model-Based book recommendation component (TMB), which exploits text shared by a target user on Twitter and makes book recommendations accordingly. TMB automatically represents the dominant topics discussed by a user without searching for predefined concepts, recognizing named entities or developing ontologies; it also improves any deployed system’s capability to address the user cold-start issue. We believe this is the first book RS that uses social media rather than book-cataloguing websites, as well as the first to extract user-discussed subjects from social media and map them to books. As we did not know of any datasets with such information, we collected one that contains users’ social media accounts (from Twitter), their reading preferences and book information (from Goodreads14). TMB retrieved a number of relevant books comparable to CB in a top-k recommendation scenario, even though CB relied on user rating history, while TMB only needed their social profiles.

1.5 Organization of the Thesis

The rest of this thesis is organized as follows:

• Chapter 2 explains the types of recommender systems and describes the related work of each.

• Chapter 3 explains how book representations are created using the author-identification CNN model and how they are used for recommendations. The chapter also presents the evaluation and the results of this approach.

• Chapter 4 describes the linguistic feature selection and the recommendation algorithms, as well as the evaluation process and results.

• Chapter 5 describes the methodology, evaluation, and results of TMB, our proposed approach that addresses the user cold-start problem. Each of the methodology chapters ends with a summary that highlights its contributions and future work.

• Chapter 6 summarizes the thesis and suggests overall future work.

14 https://www.goodreads.com/

1.6 Published Papers

• H. Alharthi, D. Inkpen, and S. Szpakowicz. A survey of book recommender systems. Journal of Intelligent Information Systems, 51, pages 139–160. Springer, Aug 2018a. doi: 10.1007/s10844-017-0489-9. URL https://doi.org/10.1007/s10844-017-0489-9

• H. Alharthi, D. Inkpen, and S. Szpakowicz. Unsupervised topic modelling in a book recommender system for new users. In SIGIR 2017 Workshop on eCommerce (ECOM17), 2017. ISBN 978-1-4503-5022-8. doi: 10.1145/3077136.3084367. URL http://doi.acm.org/10.1145/3077136.3084367

• H. Alharthi, D. Inkpen, and S. Szpakowicz. Authorship identification for literary book recommendations. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 390–400, 2018b. URL https://aclanthology.info/papers/C18-1033/c18-1033

• H. Alharthi and D. Inkpen. Study of Linguistic Features Incorporated in a Literary Book Recommender System. In Proceedings of the 34th ACM/SIGAPP Symposium On Applied Computing (SAC ’19), New York, NY, USA, April 2019. ACM. doi: https://doi.org/10.1145/3297280.3297382

Chapter 2

Background

This chapter1 explains the different types of recommender systems and surveys book RSs under each category. It also highlights the datasets and resources dedicated to book RSs. Comparisons against our proposed approaches can be found in sections 2.2.2 and 2.3, which discuss RSs that exploit whole book texts and RSs related to TMB, respectively. In addition, sections 2.1 and 2.2.3 describe the algorithms considered as baselines.

2.1 Collaborative Filtering

Collaborative Filtering (CF) assumes that if two users have a similar rating history, their future ratings will be similar as well. This method uses the available ratings of active users to predict the preferences of other users. CF comes in two forms: user-based CF finds the similarity between users, while item-based CF computes the similarity between two co-rated items (rated by common users) (Sarwar et al., 2001). To make recommendations, CF only requires an item-user rating matrix, as depicted in Table 2.1.

1 This chapter is based on (Alharthi et al., 2018a)


Table 2.1: An example of a ratings matrix of 4 user-book preferences

        The Alchemist   Life of Pi   1984      Alice in Wonderland
Anita   like            like         dislike
Bob     like            dislike                like
Maria                   dislike      like
Elsa    like            dislike                ?

Techniques adopted in CF are categorized into memory-based and model-based. Neighborhood-based CF, which falls under the former category, calculates the similarity between two users or two items and predicts ratings by computing the weighted aggregate of the nearest neighbors’ ratings. There are many similarity measures, including Pearson correlation and cosine similarity. From the set of nearest neighbors, the CF recommends the most relevant predicted items as a ranked list (Su and Khoshgoftaar, 2009). One common model-based algorithm is matrix factorization (MF), which represents users and items in a space where each item/user is modeled as a vector of f latent factors. A user-item interaction is characterized as an inner product in this space, so the predicted rating is the dot product between the user and item vectors (Koren et al., 2009). Matrix factorization, which we consider as a baseline, provided more accurate top-k recommendations than a neighborhood algorithm in (Hu et al., 2008).
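As a concrete sketch, the neighborhood approach can be illustrated with item-based CF in plain Python. The tiny rating matrix below is invented, and a production system would add mean-centering, neighborhood truncation and other refinements.

```python
import math

# Toy user-item rating matrix (ratings 1-5); all names are illustrative.
R = {
    "Anita": {"The Alchemist": 5, "Life of Pi": 4, "1984": 1},
    "Bob":   {"The Alchemist": 5, "Life of Pi": 2, "Alice in Wonderland": 4},
    "Maria": {"Life of Pi": 5, "1984": 2, "Alice in Wonderland": 5},
    "Elsa":  {"The Alchemist": 4, "Life of Pi": 5},
}

def item_sim(i, j):
    """Cosine similarity between two items over users who co-rated both."""
    users = [u for u in R if i in R[u] and j in R[u]]
    if not users:
        return 0.0
    dot = sum(R[u][i] * R[u][j] for u in users)
    ni = math.sqrt(sum(R[u][i] ** 2 for u in users))
    nj = math.sqrt(sum(R[u][j] ** 2 for u in users))
    return dot / (ni * nj)

def predict(user, item):
    """Weighted aggregate of the user's ratings of items similar to `item`."""
    pairs = [(item_sim(item, j), r) for j, r in R[user].items()]
    norm = sum(abs(s) for s, _ in pairs)
    return sum(s * r for s, r in pairs) / norm if norm else 0.0

print(predict("Elsa", "Alice in Wonderland"))
```

Elsa rated two books that co-occur with Alice in Wonderland in other users' histories, so the prediction is a similarity-weighted blend of her two ratings.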

However, the rating matrix can be large and sparse, particularly in the case of new items or new users; this is called the cold-start problem, and it can result in inaccurate recommendations. Sparsity issues are common in libraries, where many books have never been checked out. For example, 75% of the books in the library of the Changsha University of Science and Technology were never checked out (Yang et al., 2009). CF also suffers from two other problems. The gray-sheep problem occurs when the system predicts the preferences of a user who has an entirely different taste from other users. The shilling-attacks issue occurs when an item receives fake ratings as a form of promotion (Su and Khoshgoftaar, 2009).

Collaborative filtering is applied in many book RSs. An item-based CF implemented by Vaz et al. (2012b) calculates the cosine and Euclidean distance in users × books and users × authors rating matrices. The author RS and the book RS were evaluated using the LitRec dataset (section 3.3.1). The prediction performance was best when 10% of the author RS and 90% of the book RS were merged. Vaz et al. (2013) assessed the temporal relevance of ratings in item-based CF. Experiments on a closed version of the LitRec dataset found that high prediction errors resulted from using only recent ratings while neglecting early ones2. It was also found that recommendations of good quality require all ratings of the community but only the recent ratings of the target user.

2.2 Content-based Recommender Systems

A content-based recommender system (CB) creates item representations and user profiles and matches them to make recommendations. The CB system consists of the following components:

• A content analyzer that transforms unstructured item data into features. The new item representations are fed to the next components: the profile learner and the filtering component. Items can be represented in many ways, including bags of words, a vector space model, and ontologies (i.e., classes of domain knowledge connected with relations).

• A profile learner that develops a user profile by discovering patterns among the representations of items consumed by the target user, in order to make generalizations. The literature lists several machine learning algorithms used in CB systems to build user profiles, including Naive Bayes (NB), support vector machines (SVM), decision trees (DT) and k-nearest neighbors (KNN).

• A filtering component that recommends items that correspond to the user profile (Lops et al., 2011).

2 In a closed dataset, the rating matrix has no missing values: every user rates every book.

Figure 2.1: The process of recommendation in content-based RSs (Lops et al., 2011)

In other words, a CB recommender is a classifier or regressor that learns the patterns and similarities in a user’s purchase history to predict her future interests. In the field of books, the content might refer to title, summary, outline, whole text, or metadata, including author, year of publication, publisher, genre, size, and so on.

Figure 2.1 illustrates the process of recommendation in content-based RSs. The content of items is gathered from information sources and transformed into structured data by the content analyzer, which passes the already-rated items to the profile learner and the unrated items to the filtering component. After the profile learner constructs users’ profiles, they are sent to the filtering component to match each profile with relevant items. The system provides a user with a list of predicted relevant items and receives the user’s feedback on them, which is later sent back to the profile learner (Lops et al., 2011).
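The three components can be mocked up in a few lines of Python. The book "contents" and the liked list below are invented, and a bag of single words stands in for the richer representations discussed above.

```python
import math
from collections import Counter

# Hypothetical book contents and the target user's liked books.
books = {
    "A": "wizard magic school friendship adventure",
    "B": "magic dragon quest adventure sword",
    "C": "detective murder london investigation clue",
    "D": "crime detective police clue suspect",
}
liked = ["A"]

def analyze(text):
    """Content analyzer: unstructured text -> bag-of-words features."""
    return Counter(text.split())

def learn_profile(item_ids):
    """Profile learner: aggregate features of the items the user liked."""
    prof = Counter()
    for i in item_ids:
        prof.update(analyze(books[i]))
    return prof

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_items(prof):
    """Filtering component: rank unrated items against the user profile."""
    unrated = [i for i in books if i not in liked]
    return sorted(unrated, key=lambda i: cosine(prof, analyze(books[i])),
                  reverse=True)

ranked = filter_items(learn_profile(liked))
print(ranked)  # "B" ranks first: it shares "magic" and "adventure" with "A"
```

The feedback loop in Figure 2.1 would correspond to appending newly rated items to `liked` and relearning the profile.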

CB avoids many of CF’s issues, including the new-item problem, by not relying on the ratings of the community. Once a new item is added to the system, the CB can directly match it with user profiles. Also, it gives justified recommendations based on the attributes of items; for instance, a CB system would indicate that book A is recommended because the target user tends to prefer genre K and author M. Still, CB experiences difficulties when items have inadequate descriptions. Overspecialization is another problem with CB, when the recommendations are not diverse enough. Like CF, CB suffers from the user cold start, because CB expects enough data from one user to build her profile (Pazzani and Billsus, 2007; Lops et al., 2011). CB is widely used in book recommendations, as shown in the following subsections.

2.2.1 Book recommendations based on their metadata

The use of book contents and reviews for recommendation purposes dates back to 1999. Mooney and Roy (2000) extracted book metadata from Amazon, specifically the title, writers, summary, reviews, comments, and similar authors, titles, and terms. Removing the last three features, which are based on collaborative filtering, decreases the performance of the CB significantly. This system represented books as bags of words and adopted a binary classifier based on Naive Bayes.

Book metadata is deployed by Kapusuzoglu and Öguducu (2011), who use an extended ontology. If an ontology has the field author, for example, its extended ontology would include the author’s awards and other publications. The system characterizes books as a relational database (multiple tables connected by foreign keys). Extended forms of cosine and Euclidean similarity measurements are proposed to handle the new ontologies. Experiments run on 2791 server logs of an online bookstore show that the extended ontologies increase the accuracy of recommendations when compared with standard ontologies, especially with the proposed Euclidean distance.

2.2.2 Recommendations based on books’ textual content

Out of more than thirty surveyed RSs, only a few take the actual text of books into account. Vaz et al. (2012a) proposed a Stylometric CB. A book is represented in two ways: as a vector of words extracted by LDA, or as a normalized vector of style features, specifically vocabulary richness, document length, part-of-speech bigrams and the most frequent words in a book. The Rocchio algorithm (Manning et al., 2008) was adopted to discover the stylometric features that matter most to the reader. The experiments on the LitRec dataset showed that merging the Stylometric CB with CF performs better than an individual CF or CB system. When the different representations of books were compared, LDA topics showed the best results.

The widespread use of e-book readers introduces another level of book recommendations, since e-readers allow the exploitation of the digitized book text. Given the whole text of an e-book, Zhang and Chow (2015) recommend authors and e-books after receiving the name of the user’s favorite author. The system builds a four-layer tree for each author, with her background information (e.g., education and political views) at the highest level, followed by the author’s books, the pages of every book, and the paragraphs on each page in the next three layers. This hierarchical structure was proposed to overcome the problem of spatial distribution, which arises when sequences of words are treated without taking their context into consideration. For example, if one works with bags of words, “computer”, “science” and “school” are treated as distinct terms, even if they occur as a trigram.

To deal with tree structures, Zhang and Chow (2015) adopted a multilayer self-organizing map (MLSOM) (Rahman et al., 2007). The book RS uses the last three layers to measure the similarity between two books; the author RS uses all layers to find similar authors. The dataset extracted from Project Gutenberg contained the text of 10,500 books and 3868 items of author information. This dataset contains neither users nor their ratings. The goal of the experiment is to “assess the relevance between two books” (Zhang and Chow, 2015): if the queried book shares similar genres with the retrieved book, it is considered a relevant recommendation. The system is reported to surpass the performance of other CB systems using LSI, LDA and PLSA; however, LDA gives a similar performance.

Givon and Lavrenko (2009) address the cold-start problem for items; their system incorporates social tags. Each book is characterized by tf-idf (term frequency – inverse document frequency) vectors of social tags (extracted from the book-cataloging website LibraryThing3) and book tags (extracted from the whole text of a book). For new books with no available social tags, a relevance model (RM) is adopted to learn from a book’s tags to predict social tags. A comparison of item-based CF, user-based CF and a combination of CF and RM is presented. The pure RM gives results similar to those of the CF systems. Also, book recommendations are no different whether they are based on actual or predicted social tags.

We are not aware of any work exploiting stylometric features other than (Vaz et al., 2012a), which includes only two of our considered linguistic features, namely document length and vocabulary richness. Our linguistic elements contain a variety of syntactic, lexical, stylistic and fiction-based features. Moreover, we illustrate the factors that play a role in the generation of accurate recommendations. Our proposed system based on authorship identification does not require demographic information about authors, unlike (Zhang and Chow, 2015); such information is not always obtainable. Also, unlike (Zhang and Chow, 2015), our CB systems are designed and assessed to make personalized recommendations for each user. Pera and Ng (2014a,b, 2015) incorporate writing style in combination with other factors in a book RS (explained in detail in 2.3.3). The writing style, however, was learned from the book reviewers’ point of view and not automatically from the books’ textual content. Book reviews, which are not always abundant (if any), are influenced by users’ biases and level of judgment.

3 https://www.librarything.com/

2.2.3 Recommendations of text-based items

The basic approach in book retrieval systems is to adopt the bag-of-words method, which creates book representations based on term frequencies. In this case, cosine similarity is often used to retrieve the most similar book representations. A vector space model (VSM) was used to find similar books based on their descriptions, not their actual text, in (Tsuji et al., 2014; Pera and Ng, 2014a).

Other popular information retrieval approaches using topic-modeling techniques are also used in textual item recommendations. Topic modeling methods learn latent topics from a corpus, where a mixture of topics characterizes each document. One technique is latent semantic indexing (LSI) (Deerwester et al., 1990), also called latent semantic analysis (LSA). LSI performs singular value decomposition (SVD), a dimensionality reduction method, on a set of documents to learn words’ contextual meanings. Afterward, LSI represents documents (either seen before or new) in a “semantic space” where relevant documents are considered similar (Landauer et al., 1998). Latent Dirichlet allocation (LDA) (Blei et al., 2003) represents a document as a mixture of topics and measures to what degree a word is associated with each of the topics. It assumes that a document is related to a limited number of topics, and that a topic has a limited number of recurring terms (Girolami and Kabán, 2003). Topic models have helped estimate preferences in many RSs; to name a few, recommendations were based on the topics extracted from movie plots (Bergamaschi and Po, 2015), articles (Nikolenko, 2015; Wang and Blei, 2011) and online course syllabi (Apaza et al., 2014).

A more recent state-of-the-art approach is paragraph2vec, also called Doc2vec, which was proposed by Le and Mikolov (2014) to provide a fixed-size representation for documents regardless of their lengths (see 2.6 for more details). In recommender systems, Doc2vec was used by Gupta and Varma (2017) and Wang et al. (2016) to recommend scientific articles and answers in question-answering systems, respectively. All techniques in this subsection are considered as baselines.

2.3 Social Recommender Systems and User-generated Content

Social recommender systems exploit posts, relationships, tags and other content found on social media to make suggestions (Tang et al., 2013). Recommendations by friends are found to be more trusted than those by other users (Ricci et al., 2011). This type of RS usually complements other RSs to solve the new-user problem. The following three subsections present RSs exploiting book-cataloging networks, social media and users’ shared book reviews.

2.3.1 Recommendations on book-cataloguing platforms

One system by Pera et al. (2011) takes advantage of LibraryThing, a social book-cataloging website that allows users to form friendships, and to catalog and tag books. The proposed system measures the similarity between books cataloged/liked by a target user and books cataloged/liked by her friends. The similarity is calculated in two ways. First, since each book in LibraryThing has a tag cloud attached to it, the system finds the similarity of tag-represented books using a word-correlation matrix. Second, it computes the strength of friendship relations, in that it measures the resemblance between the tags assigned by a user and her friends. The ranking of books is based on a single score calculated as the joint product of the two methods’ similarity values. To assess the system, data of appraisers is extracted from LibraryThing, and the results are compared with Amazon’s and LibraryThing’s lists of recommendations; they significantly outperform both.

Tags are also integrated into the system in (Pera and Ng, 2011), which finds similar books in a user’s friend list. Books are considered similar when they share one or more tags with friends, or when they are highly rated by friends. The system goes further by measuring the reliability of a user’s friends: the most reliable friend is the one with the highest number of mutual tags. Similar to the evaluation in (Pera et al., 2011), this system’s recommendations surpass Amazon’s and LibraryThing’s.

Users in (Zhou, 2010) are represented as nodes in a network, and trust among them is calculated using inference and propagation. When users consider buying books, they request recommendations from their neighbors, who in turn pass the request to their neighbors. Alternatively, a user may receive suggestions from a CB system. Books in the system are also represented as nodes, connected because of the similarity of their content. However, no experiments or results are reported.

To our knowledge, no book RSs exploit social networks other than book-cataloging websites. Also, previous social book RSs relied on the use of tags (Givon and Lavrenko, 2009; Pera and Ng, 2011; Pera et al., 2011) as well as friends’ lists. Thus, TMB is the first social book RS that automatically extracts user interests from general social media and maps them to book subjects.

2.3.2 Recommendations based on social media

Social media has been a great resource to “warm up” the user cold start. A user’s connections on social networks were exploited in (Castillejo et al., 2012), (Sedhain et al., 2014), (Guy et al., 2010), (Ben-Shimon et al., 2007) and (Mican et al., 2012). In addition to using Facebook friends lists, Sedhain et al. (2014) analyzed users’ demographics and the pages liked by a user. Nair et al. (2016) solved the new-user issue by analyzing a target user’s tweets and identifying which movie genres she likes. The cosine similarity between a tweet and a movie storyline is calculated; if the similarity is higher than 0.5, the movie’s genre is added to the user’s favorite genres. Later, movies from the most frequent genres are recommended. Based on topics learned from users’ Twitter accounts, RSs could suggest hashtags (Godin et al., 2013) and friends (Pennacchiotti and Gurumurthy, 2011). TMB, on the other hand, addresses the new-user issue by exploiting tweets to recommend items that are not Twitter-relevant (e.g., not hashtags).

To make news recommendations, Abel et al. (2011) treat a user profile as a query; the k most similar candidate news articles are recommended. User profiles are constructed from three elements: hashtags, entities, and topics. A concept is weighted by counting the times a user mentions it (e.g., #technology = 5). A framework, OpenCalais, is used to spot the names of people, places and other entities, in addition to topics, which are limited to 18 categories (e.g., politics or sports). All articles published by 1619 Twitter users in the last week of the observation time are considered as candidates for recommendation. Entity-based user profiles scored the highest accuracy.
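A minimal sketch of this concept-counting style of profiling, restricted to hashtags for brevity, might look as follows; the tweets and candidate articles are invented for illustration.

```python
from collections import Counter

# Hypothetical tweets from one user; concepts are weighted by mention count,
# in the spirit of the hashtag-based profiles of Abel et al. (2011).
tweets = [
    "new phone review #technology",
    "ai beats humans again #technology #science",
    "great match tonight #sports",
]
profile = Counter(w for t in tweets for w in t.split() if w.startswith("#"))
# e.g. profile["#technology"] == 2: the concept's weight is its mention count

# Candidate articles annotated with concepts (also invented).
articles = {
    "chip-breakthrough": {"#technology", "#science"},
    "cup-final-report": {"#sports"},
    "royal-wedding": {"#celebrity"},
}

# Score each candidate by the total weight of its overlapping concepts.
scores = {a: sum(profile[c] for c in tags) for a, tags in articles.items()}
best = max(scores, key=scores.get)
print(best)
```

The full system also matches on named entities and topic categories, which behave like additional weighted concepts in the same profile.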

Chen et al. (2010) propose a Twitter-based URL recommender. Cosine similarity is computed between user profiles and URL topics, and the system recommends the URL items with the highest scores. For each user, a self-profile and a followee-profile are constructed as bags of words. For a URL, a bag of words is also created from the terms occurring in tweets that embed the URL. In a field experiment, 44 participants rated the recommended URLs. The best performance was 72.1% accuracy, obtained when the RS used self-profiles and candidate URLs from FoF (followees-of-followees).

Unlike the work in (De Francisci Morales et al., 2012; Abel et al., 2011; Jonnalagedda et al., 2016), which looks for news-related and narrow lists of entities and categories, TMB is dynamic and represents the dominant topics discussed by a user without searching for predefined concepts. Our proposed system does not require entity recognition or ontology development. Moreover, our system focuses on a limited number of topics frequently discussed by the user, and this makes it easier to enrich the topics, e.g., with word embeddings. Furthermore, as mentioned in section 1.2.2, news and books have different characteristics. News recommendation using Twitter may require the analysis of hashtags and entities such as names and places that correspond with the rapidly changing news. Literary books, however, may include broad aspects that are mostly unrelated to current names and events.

2.3.3 Recommendations based on user-generated texts

Many readers share their opinions about books on the Web. Such views offer a chance to learn user preferences in detail. The research discussed in this section ranked books based on the similarity of their reviews, and on the correspondence with keywords established in advance.

In (Garrido et al., 2014), for each book preferred by a user, a topic map is created based on the book description and user reviews. A topic map is a form of ontology built using TM-Gen (Garrido et al., 2013), which represents the extracted text as a map. Before the extraction, many NLP techniques are applied, including morphological analysis and named entity recognition. All the topic maps of a user’s favorite books are aggregated and compared to candidate books with the same representation. The system is evaluated using the BookCrossing dataset; it produces fewer errors than some implemented state-of-the-art systems.

In a series of RSs for emergent readers (Pera and Ng, 2014a,b, 2015), book recommendations are made based on four factors: readability and similarity in content, topics, and appealing terms. The readability level of a user is measured and matched with a set of candidate books. To measure readability, TRoL (Pera, 2014) and ReLAT (Pera and Ng, 2013) are used to decide the level of a book without the need for an excerpt. For content similarity, publicly available summaries of books are represented as bags of words, and the similarity is calculated using word-correlation factors (WCF) (Koberstein and Ng, 2006). The topical similarity depends on the Library of Congress Subject Headings (LCSH). To find the resemblance of the topic distribution of books, a VSM is created, and the similarity is calculated between vectors. In (Pera and Ng, 2015), a considerable number of LCSH associated with a book is penalized, because they entail content complexity.

Finally, to analyze the appealing terms of books in (Pera and Ng, 2014a,b, 2015), the literature of readers’ advisory is consulted, and six facets of books are determined: characterization, frame, language and writing style, pacing, special topics, storyline, and tone. Usually, the appealing terms associated with each book can be obtained from readers’ advisory databases, such as NoveList Plus, which requires paid access. Therefore, in this work, the appealing terms are extracted from readers’ reviews, automatically collected from websites such as Amazon.com, Bertrams.com, Bookfinder4u.com, Bookmooch.com, Dogobooks.com, and Fishpond.com. 124 predefined terms classified under each of the six facets are extracted from reviews, as Pera and Ng (2014b) explain in detail. The resemblance between the vectors of appealing terms of the favored and candidate books is calculated.

The RS in (Pera and Ng, 2015) also considers the illustrations on book covers. It uses the Open Source Computer Vision (OpenCV) library to check the resemblance of covers, which are freely available via the APIs of Google Books and LibraryThing. Pera and Ng (2013) extend the RS, which comprises the readability level and content similarity, by including readership similarity, which is simply an item-based CF. The above similarity scores are combined using multiple linear regression in (Pera and Ng, 2014b), CombMNZ in (Pera and Ng, 2014a, 2015), and Borda counts in (Pera and Ng, 2013). The systems are compared to other popular RSs, e.g., Amazon and Goodreads, or to previous versions of the system. For the comparison and additional qualitative analysis, appraisers rate the system using Amazon Mechanical Turk4.
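Borda-count fusion, one of the combination schemes mentioned above, is easy to sketch. The component rankings below are invented: each list awards n-1 points to its top item down to 0 for its last, and the totals give the fused ranking.

```python
# Hypothetical ranked lists from three scoring components
# (e.g., readability, content similarity, readership similarity).
rank_readability = ["B1", "B2", "B3", "B4"]
rank_content     = ["B2", "B1", "B4", "B3"]
rank_readership  = ["B2", "B3", "B1", "B4"]

def borda(*rankings):
    """Fuse several rankings of the same n items by Borda counts."""
    n = len(rankings[0])
    points = {}
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            points[item] = points.get(item, 0) + (n - 1 - pos)
    return sorted(points, key=points.get, reverse=True)

print(borda(rank_readability, rank_content, rank_readership))
```

Unlike multiple linear regression, Borda counts need no training data, and unlike CombMNZ they use only rank positions, not raw scores.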

The reviews of a specific user can be extracted to recognize her fine-grained interests. To make personalized book recommendations to an individual, Priyanka et al. (2015) analyze each user’s reviews. The sentiment of each review is assessed simply by counting the occurrences of positive and negative words. For each user, a matrix is created whose rows and columns represent books and features extracted from the user’s reviews. For example, one column can be “understand”, which can have a positive, negative or neutral value in each row (book). The sentiment of the total value of a feature is computed. The approach is not evaluated, and no performance was reported. Sohail et al. (2013) adopt a non-automatic book recommendation method based on opinion mining. To discover the top-rated computer science books, the reviews of the books are exploited. Seven categories of features are analyzed, and each is assigned a weight. The weights are aggregated, and books are re-ranked accordingly. No evaluation is reported.

2.4 Book Recommendations based on Association Rules

Book recommendations can also use unsupervised learning by employing association rules, as in (Rajpurkar et al., 2015; Tsuji et al., 2014; Maneewongvatana and Maneewongvatana, 2010; Zhu and yan Wang, 2007). This approach works by finding patterns in an extensive database of library transactions. If two co-preferred books are associated, the occurrence of one of them in a transaction implies the occurrence of the other. To evaluate the reliability of a rule, the confidence and support are computed. The confidence of a rule (X,Y) → Z is the proportion of transactions containing X and Y that also contain Z, while the support is the number of transactions that contain X and Y (Hahsler et al., 2005).

4 https://www.mturk.com/mturk/welcome
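These two measures can be computed directly from a transaction list; the toy transactions below are invented placeholders for library loan records. Support counts the transactions containing an itemset, and confidence is the fraction of antecedent transactions that also contain the consequent.

```python
# Toy loan transactions (each set is one member's borrowed books);
# "X", "Y", "Z" are placeholders for book identifiers.
transactions = [
    {"X", "Y", "Z"},
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Z"},
    {"Y"},
]

def support(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"X", "Y"}))            # 3 transactions contain both X and Y
print(confidence({"X", "Y"}, {"Z"}))  # 2 of those 3 also contain Z
```

Algorithms such as Apriori simply search for all itemsets and rules whose support and confidence exceed chosen thresholds.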

In libraries, data about members’ demographics and reading activities are stored and can be retrieved. To increase the efficiency of public and school libraries, one must consider a personalized system that understands each patron’s needs. Such a system allows members to take advantage of a library’s abundant resources, some of which may have never been checked out (Yang et al., 2009).

Circulation analysis is an extensively researched field; we only highlight some of that research. To deal with library loan records, association rules and clustering techniques are usually applied to recognize patterns in the circulation. Also, the categories of the Library of Congress Classification (LCC) and Dewey Decimal Classification (DDC), classification systems already used in libraries, are useful in making recommendations. In LCC, all books are categorized into twenty-one classes; each is denoted by a letter, extended into two or three letters for further subclasses. For example, class H refers to social sciences, and its subclass HM denotes sociology.5 DDC consists of ten broad classes, each involving ten divisions, which in turn contain ten sections each6. For evaluation, most researchers had access to student records in a university library. They conducted surveys which included asking participants to rate some items in order to calculate the accuracy of their systems.

Tsuji et al. (2014) analyze over two million loan records at a university library. A user of the system gives a query on which to base the recommendations, and several sources for making recommendations are compared. First, a loan is considered as a transaction, and association rules are applied: a book is recommended if it co-occurs with the queried book with high confidence and support. Second, nouns in the titles and outlines of books are represented as vectors of tf-idf weights to retrieve, by cosine similarity, books similar to the query.

5 http://tinyurl.com/je3662l
6 http://tinyurl.com/hgc4tn2

Third, a book is also recommended if its categories, divisions, and sections in the Nippon Decimal Classification (NDC) — a Japanese library system based on DDC — agree with those of the queried book. When the different sources were compared, association rules combined with titles gave the best results. These were, however, compared with the results of the Amazon RS and found to be less accurate.

Maneewongvatana and Maneewongvatana (2010) also study the circulation of a university library. Members' check-outs, reservations, and renewals were collected, and each member was represented by a vector of LCC categories. After members had been clustered using K-means, Apriori (Agrawal and Srikant, 1994) and Tertius (Flach and Lachiche, 2001) were used to discover association rules in every cluster. Patrons are given recommendations of books associated with books they have already read. Based on the ratings of 14 members, an accuracy of 42% was reported. Another system (Yang et al., 2009) creates user profiles based on members' demographics and on the attributes of checked-out books, i.e., LCC and the Chinese classification system. Similar user profiles are clustered, and recommendations are made correspondingly.

2.5 Other Approaches to Book Recommendations

One method used for book recommendations is the context-aware recommender system (CARS), which includes contextual data about a target user in the recommendation procedure (e.g., the times and locations of movies in theaters) (Shani and Gunawardana, 2011). Pathak et al. (2013) combine three RSs sequentially: CB, CF, and a context-aware RS. First, the CB filters books based on a user's predefined topics, from which the highest-ranked recent books are recommended. This hybrid system outperforms each of the three individual systems.

Another approach is demographic RSs, which associate items with users' demographic classes based on age, gender, country, etc. They overcome the new user problem because they do not need a rating history to make recommendations. Moreover, they are not dependent on a specific domain, e.g., movies vs. books (Shani and Gunawardana, 2011). However, the collection of users' information may disrupt their privacy. Mikawa et al. (2011) employ an SVM to classify the gender and age of people walking into a library. The SVM is fed sequences of images captured by a camera at the library entrance, and books are filtered by whether a patron is male or female, young or middle-aged.

Figure 2.2: The architecture of the Skip-gram model (Mikolov et al., 2013)

2.6 Deep Learning and Recommender Systems

2.6.1 Basics of deep neural networks

In recent years, word embeddings have gained much attention. They are dense representations of words built with the use of neural networks. There are many unsupervised learning approaches to building word embeddings, including word2vec, GloVe, FastText and ELMo (Embeddings from Language Models). Word2vec (Mikolov et al., 2013) is a shallow NN that takes a large text corpus and generates word vectors in a space where two words occurring in similar contexts are neighbors. Word2vec comes in two forms: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model. Some parameters are set before training the model, such as the word vector's length and the window size (number of context words). CBOW takes the vectors of the words that occur in a specified window and predicts one word. In Skip-Gram (as in Figure 2.2), the model takes one input word vector and predicts the words that occur in the same window. The Skip-Gram model is found to be more accurate when dealing with rare words (Mikolov et al., 2013; Lau and Baldwin, 2016). Doc2vec was proposed later by Le and Mikolov (2014) to represent texts of various lengths, whether a paragraph or a large document. It extends the CBOW model by introducing an additional vector for the document id. The model predicts the next word given an input of the document vector and the context word vectors (concatenated or averaged). After training, a document can be represented by its id vector (Le and Mikolov, 2014).

In (Mikolov et al., 2013), a high cosine similarity is found between semantically similar words. Cosine similarity does not rely on vector magnitude, which in the case of word embeddings depends on word frequency (except for words that occur in conflicting contexts, such as May, the month) (Schakel and Wilson, 2015). This makes it suitable for measuring the similarity between two semantically similar words regardless of their frequency.
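This magnitude invariance is easy to see numerically; the vector values below are invented:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: the dot product of the two vectors
    divided by the product of their norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

frequent = np.array([4.0, 2.0, 0.0])  # toy embedding of a frequent word
rare = 0.5 * frequent                 # same direction, smaller magnitude
other = np.array([0.0, 1.0, 3.0])     # a word pointing elsewhere

print(cosine(frequent, rare))   # 1.0: identical direction, magnitude ignored
print(cosine(frequent, other))  # ~0.14: different direction
```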

Figure 2.3: Feed-forward neural network (Goldberg, 2016)

The simplest neural network is a feed-forward NN, which tries to learn a function that maps an input x to an output y, i.e., y = f(x; Θ), by searching for the Θ that gives the best estimation. The feed-forward NN consists of input, output and one or more hidden layers, as depicted in Figure 2.3. The circles represent neurons, and the arrows show the flow of information in the network, which, as the name suggests, always moves forward (from the input to the output layer). In a hidden layer (also called a fully-connected or dense layer), every neuron is connected to all neurons in the next layer. The basic unit in an NN is the artificial neuron (Figure 2.4),7 which receives the inputs passed from the previous layer multiplied by different weights, calculates their weighted sum (plus a bias) and feeds it into an activation function that transmits the output to the following layer (Goldberg, 2016).
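The calculation inside one neuron amounts to a few lines of numpy; the weights and inputs below are invented:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, fed into an activation."""
    return relu(np.dot(weights, inputs) + bias)

x = np.array([0.5, -1.0, 2.0])   # outputs passed from the previous layer
w = np.array([0.8, 0.3, -0.4])   # weights attached to this neuron
b = 0.1

out = neuron(x, w, b)   # relu(0.4 - 0.3 - 0.8 + 0.1) = relu(-0.6) = 0.0
```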

An activation function (non-linearity) decides which neurons fire (i.e., are activated). One popular non-linearity is the rectified linear unit (ReLU) (Nair and Hinton, 2010), which, as in Equation 2.1, only activates the neurons with positive outputs. ReLU deals with the vanishing gradient issue, which happens when the gradient becomes very small and the weights are not updated efficiently, better than other functions such as sigmoid (Xu et al., 2015).

7 Illustration by https://tinyurl.com/yb72e8tk

Figure 2.4: Illustration of the calculation inside one artificial neuron

relu(x) = max(0, x) (2.1)

Training the network requires a loss function, such as the categorical cross-entropy loss (Equation 2.2) used in multi-class classification. The loss computes the difference between the probabilistic distribution of the actual labels y and the predicted labels ŷ produced by a softmax layer. The goal is to minimize the loss over the training samples by iteratively updating the parameters of the network in the direction opposite to the gradient of the loss function. The parameters are the weights, biases, and sometimes the embeddings (if not fixed) (Goldberg, 2016).

L_cross-entropy(ŷ, y) = − Σ_i y_i log(ŷ_i)    (2.2)

Unlike feed-forward NNs, Recurrent Neural Networks (RNNs) have feedback connections which allow signals to loop. In a simple RNN, the input of the network is combined with the previously computed hidden-node activations, which allows the RNN to work with sequential data. Nevertheless, RNNs suffer from the vanishing gradient issue, which results in a decay of information in a long sequence. Long short-term memory (LSTM) and gated recurrent units (GRU) are two variations of RNN that overcome this issue (Chen, 2016; Goldberg, 2016).
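Softmax and the cross-entropy loss of Equation 2.2 can be sketched directly in numpy; the logits below are invented:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(y, y_hat):
    """Equation 2.2: L(y_hat, y) = -sum_i y_i * log(y_hat_i)."""
    return float(-np.sum(y * np.log(y_hat)))

logits = np.array([2.0, 0.5, -1.0])   # invented raw scores for three classes
y_hat = softmax(logits)               # predicted probability distribution
y = np.array([1.0, 0.0, 0.0])         # one-hot vector: the true class is 0

confident_loss = cross_entropy(y, y_hat)
wrong_loss = cross_entropy(y, softmax(np.array([-1.0, 0.5, 2.0])))
# The loss is smaller when the softmax puts more mass on the true class.
```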

Another deep learning architecture, invented by researchers in computer vision, is the Convolutional Neural Network (CNN). It has shown state-of-the-art performance in multiple NLP tasks, including text classification (Johnson and Zhang, 2015) and author identification (Solorio et al., 2017). In text-related tasks, a one-dimensional CNN is applied. A CNN layer has one or more associated filters, where a filter, which is basically a matrix of randomly-initialized parameters, moves over all the instantiations of windows of tokens (i.e., regions). The filter should eventually identify the patterns distinguishing one class from another regardless of their place in the text. Each filter produces a variable-length activation map (feature map), which is a vector in which each dimension represents the result of the filter multiplication with a specific region, followed by a non-linearity (see Section 3.2 for a formal description) (Goldberg, 2016). One hyperparameter of a CNN is the stride length, which determines the steps a filter takes while moving forward. It is common in the literature to move the filter one step at a time to capture all the variations of the input. It is also common to apply padding, which means adding values (e.g., zeros) to the beginning and end of short text sequences to make them equal to the maximum text length (Tixier, 2018).

A CNN layer is usually followed by a pooling layer that decreases the dimensionality of the generated feature maps, which helps in reducing the number of network parameters and accelerating the training. Two main pooling approaches are usually used: max and average pooling. The former returns the maximum value per feature map, whereas the latter computes the mean value of each feature map. The idea behind max pooling is to obtain the most salient information that helps in the prediction task, regardless of its position in the text (Goldberg, 2016; Kim, 2014).
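The filtering and pooling operations above can be sketched in a few lines of numpy; the token embeddings are random stand-ins, and in a real network the filter weights would be learned rather than fixed:

```python
import numpy as np

rng = np.random.default_rng(42)

k, n, h = 2, 6, 3                 # embedding size, tokens in the text, window
x = rng.random((n, k))            # toy token-embedding matrix, one row per token
w = rng.standard_normal(h * k)    # one randomly initialized filter
b = 0.0

# Stride 1: the filter visits every window of h consecutive tokens.
feature_map = np.array([
    max(0.0, float(np.dot(w, x[i:i + h].ravel()) + b))   # ReLU non-linearity
    for i in range(n - h + 1)
])                                # length n - h + 1 = 4

max_pooled = feature_map.max()    # most salient activation, position-free
avg_pooled = feature_map.mean()   # average activation
```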

2.6.2 Deep learning recommender systems

Neural collaborative filtering (NCF) (He et al., 2017) deploys a multilayer perceptron (MLP) over the concatenated embeddings of a user and an item. With the help of hidden layers, the non-linear interactions between users and items are learned. NCF is extended in (Wang et al., 2017) to tackle a new issue called cross-domain social recommendation. That system (NSCR) takes advantage of the user-item interactions in an information domain (e.g., Tripcase) to make recommendations in the social domain (e.g., to Facebook friends). The user embeddings learned in the information domain are propagated to the social domain to help in creating the embeddings of other users.

Musto et al. (2016) propose a system called Ask Me Any Rating (AMAR), which uses an RNN to learn the sequences of words describing items. The RNN output is fed to a mean pooling layer, which generates an embedding for each item. An item embedding is concatenated with a unique user embedding and fed to a logistic regression layer, which predicts a user's interest in an item. We implemented approaches similar to (He et al., 2017) and (Musto et al., 2016) and customized them to work with book features. The preliminary experiments did not show high performance in top-k ranking settings. As NN algorithms require a large dataset to perform well, it is possible that our dataset does not have enough user-item interactions to learn a meaningful function. We also considered using embeddings of fictional characters, similar to (Grayson et al., 2016), in book RSs, but they did not achieve high recommendation accuracy, as shown in Appendix C.2.

Furthermore, unsupervised NNs have been used to generate recommendations. A CF approach based on Restricted Boltzmann Machines (RBM) was first proposed by Salakhutdinov et al. (2007). For each user, an RBM model is created, with her seen movies represented as nodes in the visible layer connected to all hidden nodes. The weights and biases corresponding to a specific movie are shared among all RBM models. Moreover, generative adversarial networks (GANs) have recently been applied to RSs, including in (He et al., 2018).

2.7 Datasets for Evaluating Book RSs

There are existing datasets with distinctive attributes, as shown in Table 2.2; they can help in evaluating different book RSs. Due to the lack of a dataset that contains information on both books and users' social media accounts, we collected one from Twitter and Goodreads, as described in Section 5.3.1. We also used the LitRec dataset (Vaz et al., 2012c), which has users' ratings from Goodreads and book texts from Project Gutenberg (further details on the LitRec dataset are in Section 3.3.1).

Goodreads has about 55 million users, 1.5 billion books and 50 million reviews. It delivers information about users' demographics, tags, reviews, friend lists, reading groups, and favorite quotes. It also provides full access to book metadata, including the number of ratings and reviews received and the average rating. The API user can also extract information about authors.8

Project Gutenberg9 has made available more than 50,000 out-of-copyright digital books, published at least 50 years ago. Even though it mainly consists of novels, short stories and other literary works, it also has many nonfiction works. The collection includes approximately all publications of the English literary canon that precede 1923. Unlike the texts in HathiTrust and Google Books, which were scanned by OCR, items in Gutenberg have gone through proofreading or were even hand-typed (Brooke et al., 2015).

8 http://www.goodreads.com/about/us
9 https://www.gutenberg.org/
10 http://www.macle.nl/tud/LT/
11 http://inex.mmci.uni-saarland.de/data/documentcollection.html#books

Columns: Book-Crossing (Ziegler et al., 2005) | LitRec (Vaz et al., 2012c) | Amazon reviews (McAuley and Leskovec, 2013) | LibraryThing10 | INEX11

Rating form: 1-10 | 1-5 | 1-10 | 1-10 | 1-5
Demographics: locations, ages | — | Amazon user id | locations | —
Book metadata: title, authors, year, publisher, cover image | title, authors | book id, title, price | title, authors, publisher, year | —
Book summary: — | — | — | — | yes
Complete text of a book: — | yes | — | — | —
Reading start and end dates: — | yes | — | — | —
User-generated tags: — | — | — | yes | yes
Semantics (mapped to DBpedia): — | — | — | yes | —
Textual reviews: — | — | yes | — | yes
Users' requests for recommendations: — | — | — | — | yes

Table 2.2: Features of book-recommendation datasets

Chapter 3

Authorship Identification for Literary Book Recommendations

3.1 Overview

One factor that influences reading preferences is writing style (Smith et al., 2016). In this chapter,1 we propose an authorship-based RS that recommends books after learning their authors' style. It is a content-based RS that analyzes the texts of books to learn users' reading interests.

Our book RS transfers information learned by an authorship identification (AuthId) classifier to a book recommendation module. It is common in the neural network literature, especially in image processing, to train a model (the source model) on a dataset for a specific task and then transfer the learned features to another model (the target model) working with a different dataset and task. In particular, CNN models are used as feature extractors (Athiwaratkun and Kang, 2015). Features are considered general if they are learned by the first layers of a neural network. General features tend to hold basic information (e.g., color blobs in image processing) and are suitable for a different dataset/task. Specific features, on the other hand, generated by the last layers, are very dependent on the dataset/task (Yosinski et al., 2014). Transferability in NLP applications is explored in (Mou et al., 2016), which concludes that it is useful for tasks that are mutually semantically similar.

1 This chapter is based on (Alharthi et al., 2018b)

One paper (Razavian et al., 2014) trained an SVM over image representations extracted from a fully connected layer of a pre-trained CNN model and achieved superior results in image retrieval, object image classification and other tasks. Another work (Athiwaratkun and Kang, 2015) reports an increase in performance when feeding CNN-extracted features into Random Forests and SVMs to perform the original CNN prediction task. That study also shows that features extracted from an overfitted or underfitted (stopped at early epochs) model can still result in accurate classification.

Many NN-based authorship identification approaches have achieved high accuracy. The work in (Qian et al., 2017) achieved 89% accuracy using a Gated Recurrent Unit (GRU), an RNN algorithm, on a dataset from the Gutenberg Project. Their model uses a GRU to represent each sequence of words (initialized as GloVe pre-trained embeddings), followed by average pooling, and then another GRU that represents the sequence of sentences in an article. Our preliminary experiments showed that using RNN-based author identification models with books results in poor accuracy; therefore, we did not adopt RNN models. A simple system using a CNN is proposed in (Solorio et al., 2017). It takes the sequence of tokens of a tweet as input and predicts its author, a Twitter user. The system's performance is evaluated with different inputs (including character bigrams, character unigrams and words). Although it was proposed for short texts, we noticed that the CNN gives acceptable accuracy when predicting the authors of books; that is why we adopted it.

The authorship-based recommender system has two components: authorship identification and book recommendation. For the former, we adopt an approach inspired by (Solorio et al., 2017); it uses a CNN over a sequence of words. Given the text of a book as input, the AuthId classifier predicts its author; once it has achieved good accuracy, we extract features from the last hidden layer (before the output layer) and use them for another task (book recommendation). These AuthId book features, then, are specific features. In fact, the first task, author identification, is just a way of representing books as vectors that encode information about their authors' writing styles. The recommendation module, on the other hand, is a content-based RS that makes recommendations after finding patterns in the representations of the books read by the target user. To achieve this, a regressor is trained over the AuthId features of books associated with the target user's ratings to generate a ranked list of books.

As presented earlier, learning users' reading preferences from the texts of books is rarely practiced. Recommendations based on a limited number of stylometric features are explored in (Vaz et al., 2012a); however, the features are not associated with authors. The author's writing style is also considered in (Pera and Ng, 2014b,a, 2015), but it is learned from the online reviewers' point of view and not automatically from the textual content of books. The reliance on users' reviews might introduce bias and would require the books in the system to be widely read and reviewed. Furthermore, an RS is proposed in (Zhang and Chow, 2015) that exploits the whole text of an e-book to suggest authors and e-books. This approach does not focus on authorship identification, and it requires writers' demographic information. Also, one could exploit experts' descriptions of authors' writing styles in book RSs, but not all books are manually tagged (e.g., NoveList lacks this information for many authors).

Figure 3.1: The use of a convolutional neural network for authorship identification

3.2 Methodology

This section describes the two components of the system. It first illustrates and explains the author identification system. Next, the recommendation procedure is described.

3.2.1 Author identification

As shown in Figure 3.1, the neural network consists of the layers described below. Our work differs from (Solorio et al., 2017) in that we tried different hyperparameter settings, did not use n-grams, and added a fully connected layer whose output serves as the AuthId features.

Embedding layer (also called a lookup table). It takes a sequence of words as input and maps each word to a non-static (trainable) pretrained embedding of size k. The embedding of the ith token is a dense k-dimensional vector x_i ∈ R^k. Each book is represented as a matrix with one word embedding per row. A book has sequence length n (n is the number of tokens), where the sequence x_{1:n} = x_1 ⊕ x_2 ... ⊕ x_n is the concatenation of all tokens from x_1 to x_n.

Convolution layer (CNN). This is a one-dimensional CNN layer that deals with temporal sequences (i.e., not spatial ones). A CNN layer consists of one or more filters that are applied to windows of tokens to generate feature maps. A filter w ∈ R^{hk} slides over a window of h words x_{i:i+h−1} to generate the feature c_i = f(w · x_{i:i+h−1} + b), where f is the activation function and b is a bias. A feature map c = [c_1, c_2, ..., c_{n−h+1}] is created by a filter w that slides over all the possible windows of words in a book. This layer generates m feature maps, equal to the number of filters (a specified parameter), with varying lengths depending on the book sizes.

Pooling. This layer decreases the dimensionality of the feature maps. Given a feature map c, global max pooling returns max{c}; as a result, only the features with the highest importance (maximum value) are kept (Kim, 2014). It generates a vector, which we call AuthId_MP, of the same size as the number of feature maps.

Fully connected layer (FC). Each neuron in this layer is connected to all neurons in the previous layer. The weights associated with every node are initialized and updated independently from other nodes, which allows a node to represent a function that would ideally specialize in a specific aspect of the input. We added this layer to increase the classifier's performance and to discover interactions and dependencies between the features extracted by the pooling layer. This layer produces a fixed-size vector with dimensions equal to its number of neurons. We call this vector the AuthId book representation, or AuthId_FC.

Output layer. It is a fully connected softmax layer that outputs a vector with dimensions equal to the number of authors (labels). Each dimension represents the probability of the book belonging to a specific author; the values of all dimensions in one vector sum to one.

The network is trained by minimizing the loss function over batches of the training set. After each epoch, the errors are back-propagated, and the parameters of the CNN filters, the word embeddings, the weights of the FC layer and the biases are updated to reduce the loss. This iterative process eventually helps the CNN filters focus on the aspects that are most important for distinguishing authors from one another (Goldberg, 2016; Kim, 2014). After the model has been trained and achieves accurate predictions, it is used to extract the features of the max pooling or FC layer. Because ReLU is used as the activation function, each dimension in the generated vectors has a value of zero or higher.
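The layers above can be sketched in Keras (the library used for our experiments); this is a hedged outline, not the exact training script. The filter count, kernel size and FC width match the best configuration reported in section 3.3.2, while the vocabulary size is a placeholder:

```python
from tensorflow.keras import layers, models

vocab_size = 20000      # placeholder vocabulary size
embed_dim = 100         # word-embedding length, as in section 3.3.2
seq_len = 100000        # first 100,000 words of each book
num_authors = 157

inputs = layers.Input(shape=(seq_len,))
# Embedding layer: trainable lookup table of word vectors
h = layers.Embedding(vocab_size, embed_dim)(inputs)
# One-dimensional convolution over windows of 7 tokens, ReLU non-linearity
h = layers.Conv1D(filters=400, kernel_size=7, activation="relu")(h)
# Global max pooling -> the AuthId_MP vector (length 400)
h = layers.GlobalMaxPooling1D()(h)
# Fully connected layer -> the AuthId_FC book representation (length 200)
authid_fc = layers.Dense(200, activation="relu", name="authid_fc")(h)
# Softmax output over the authors
outputs = layers.Dense(num_authors, activation="softmax")(authid_fc)

model = models.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])

# After training, the AuthId_FC features come from the penultimate layer:
extractor = models.Model(inputs, model.get_layer("authid_fc").output)
```

Reading the penultimate layer through a second `Model` is what turns the classifier into a feature extractor for the recommendation module.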

3.2.2 Book recommendations

Given a user's reading history associated with book representations, a regressor predicts her future ratings. Book recommendations are ranked according to the continuous predicted rating values; hence, the ranking list is unique. We applied Support Vector Regression (SVR),2 an extension of the Support Vector Machine (SVM) (Vapnik et al., 1996). In a non-linear SVR, the training samples X are mapped to a high-dimensional feature space, which allows the algorithm to learn a linear model in that space. The mapping is achieved by a kernel function such as the Gaussian kernel, also called the Radial Basis Function (RBF); see Equation 3.1. For every AuthId book representation x_i that has the rating y_i, the SVR algorithm aims to learn a function f(x) as in Equation 3.2, where α_i and α_i^* are Lagrange multipliers and N is the number of data points (Gunn, 1998; Basak et al., 2007).

2 Experiments showed that SVR gives the best performance compared to other regressors

k(x_i, x_j) = exp(− ‖x_i − x_j‖² / (2σ²))    (3.1)

f(x) = Σ_{i=1}^{N} (α_i − α_i^*) k(x_i, x) + b    (3.2)
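A minimal sketch of this ranking step with scikit-learn follows; the book representations and ratings are synthetic stand-ins, not thesis data:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-ins for AuthId book representations (ReLU outputs,
# hence non-negative) and the target user's raw 1-5 star ratings.
train_X = np.abs(rng.normal(size=(30, 200)))
train_y = rng.integers(1, 6, size=30).astype(float)
candidates = np.abs(rng.normal(size=(100, 200)))   # unread books

reg = SVR(kernel="rbf")          # Gaussian (RBF) kernel, as in Equation 3.1
reg.fit(train_X, train_y)

scores = reg.predict(candidates)          # continuous predicted ratings
top10 = np.argsort(scores)[::-1][:10]     # indices of the top-10 ranked books
```

Because the predicted scores are continuous, ties are rare and the resulting top-10 list is essentially unique, as noted above.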

3.3 Evaluation

3.3.1 Dataset and preprocessing

We use the LitRec dataset (Vaz et al., 2012c), which contains data on 1,927 users who rated 3,710 literary works. The dataset incorporates data from Goodreads and Project Gutenberg. It also has the complete texts of books, labeled with part-of-speech tags (Vaz et al., 2012b). In Goodreads, books are rated on a scale of 1-5, where 1-2 indicates a dislike and 3-5 a like; we adopted this range when converting the ratings to binary. In the dataset, a book can have a rating of 0 to indicate that the user has read the book but not rated it. We filtered out all zero-rated books (17,976 ratings) because it is not clear whether the user likes them or not. Also, in order to train the author identification network, we need multiple books per author; thus, we only kept authors with a minimum of two books. Data for users with fewer than ten ratings were also deleted. The lowest number of ratings needed to develop a CB with quality recommendations is ten, a threshold adopted by many researchers, including (Wang et al., 2009). The remaining 351 users rated 1,010 unique items authored by 157 distinct authors.

The Gutenberg texts have copyright information at the beginning and the end, which we removed using heuristics or manually. The part-of-speech tags were also removed. Some books are very long: the maximum is 565,570 words, while the average is 99,601 words per book. Because of the high memory requirements, the processing of large books resulted in a system crash. That is why we considered only the first 100,000 words of each book, slightly higher than the average length.3
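The three filtering steps above can be sketched as plain Python over hypothetical (user, book, author, rating) tuples; the function name and tuple layout are illustrative, not from the thesis code:

```python
from collections import Counter

def filter_ratings(ratings, min_books_per_author=2, min_ratings_per_user=10):
    """Apply the filters described above to (user, book, author, rating)
    tuples: drop zero ratings, then rare authors, then sparse users."""
    # 1. Zero-rated books are read-but-unrated; drop them.
    kept = [r for r in ratings if r[3] > 0]
    # 2. Keep only authors with at least two distinct books.
    books_of = {}
    for _, book, author, _ in kept:
        books_of.setdefault(author, set()).add(book)
    kept = [r for r in kept if len(books_of[r[2]]) >= min_books_per_author]
    # 3. Keep only users with at least ten remaining ratings.
    n_ratings = Counter(r[0] for r in kept)
    return [r for r in kept if n_ratings[r[0]] >= min_ratings_per_user]
```

Note that the filters interact: dropping zero ratings can push an author below two books, and dropping an author can push a user below ten ratings, so the order of the steps matters.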

3.3.2 The experimental setting

We measure the predictive power of the system using off-line evaluation, which is appropriate for obtaining the accuracy of an RS. An online appraisal would provide more performance insights, but it is an expensive option that requires the deployment of a real-time system. A user study is another option; it was avoided because it usually includes a limited number of users. In the off-line assessment, systems are evaluated in three ways: rating prediction, ranking and classification. A recent survey (Zhang et al., 2017) states that the percentage of the surveyed RS publications that report ranking metrics is 66%, rating prediction metrics 28%, and usage metrics 6%. We adopt ranking metrics to evaluate our system in a top-k recommendation scenario, where systems provide a user with a ranked list of books. Recommending the top-k items is a common practice in real-world RSs, particularly in content and multimedia recommendations (Schröder et al., 2011).

In content-based RSs, a user's books are divided into training and testing sets, and the CB learns the user's preferences only from the training set. Though the same split of training and testing sets per user is applied in the MF system, MF is trained on the whole rating matrix excluding the target user's ratings on the items in the test set. For a user, each system generates a ranked list from all the books except those in the user's training set. The test set has only preferred books, which are considered more relevant than the unread books on the ranked list; so, an ideal system would rank items from the test set in the top-k list of books. Similar to many related projects (Matuszyk et al., 2015; Braunhofer et al., 2015; Wibowo et al., 2018), we set k to 10. Three-fold cross-validation is adopted per user,4 and the results are averaged. Macro averaging is applied by first calculating the recommendation accuracy per test case and then averaging. We measure whether the differences between two versions of our models are statistically significant using t-tests at a p-value less than 0.05 or 0.01.

3 To download: https://tinyurl.com/y998428p

Metrics. We compute the recommendation accuracy by precision at k (P@k) and recall at k (R@k), Equations 3.3-3.4, where a relevant book means a preferred book, and recommended means ranked in the top-k list.

P@k = (# relevant books in top-k recommended) / k    (3.3)

R@k = (# relevant books in top-k recommended) / (# of relevant books)    (3.4)
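Both metrics reduce to a few lines of Python; the book ids in the example are invented:

```python
def precision_recall_at_k(ranked_books, relevant, k=10):
    """Equations 3.3-3.4: `relevant` is the set of preferred (test-set)
    books; `ranked_books` is the system's ranked recommendation list."""
    hits = sum(1 for book in ranked_books[:k] if book in relevant)
    return hits / k, hits / len(relevant)

# Invented example: two of the four recommended books are relevant.
p, r = precision_recall_at_k(["b1", "b7", "b3", "b9"], {"b1", "b3", "b5"}, k=4)
# p == 2/4 and r == 2/3
```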

Baselines. We implemented multiple content-based baselines, namely LDA, LSI, VSM and Doc2vec, using gensim,5 a Python library. The texts of books, truncated to the same lengths considered in the AuthId classifier, were tokenized and down-cased, and NLTK stopwords and the least frequent words were filtered out. In LDA, LSI, and VSM, the relevant books from a target user's training set are aggregated to create a query. In VSM, the recommended books are the top-k books most similar to the query according to cosine similarity. In LDA and LSI, we conducted many trials to find the best number of topics among 10, 50, 100 and 200 topics, and we report the best results. A Doc2vec model was created from the book texts, with book IDs as the labels. We experimented with multiple Doc2vec parameters, including 100 or 300 dimensions and window sizes of 5 or 10, and report the best results. Using cosine similarity, we recommend the k books most similar to the user query, which is the average of the vectors of the books in the training set. We also compare with a plain author-based RS (a content-based system), which uses SVR similarly to our proposed system, with one difference: instead of the AuthId book representations, author ids are used. For collaborative filtering, we implemented a matrix factorization (MF) ranking method (Hu et al., 2008) in Graphlab.6 MF represents users and items as vectors of latent factors, whose number we chose empirically from 8, 16, 32 and 64. Given the users' binary feedback, the algorithm is fitted using logistic loss.

4 Some baselines take a long time to complete 3-fold cross-validation; more folds would require extensive time.
5 https://radimrehurek.com/gensim/

Parameter Settings. We stop training the author identification model at the point when the validation accuracy stops increasing for five epochs, or if validation accuracy keeps increasing while the training accuracy is becoming close to 100%. We save the model that provides the best validation accuracy and used it to generate the AuthId book representations. We used the dataset in section 3.3.1 as a training set for AuthId classifier. The validation set contains

119 books, which were written by 67 unique authors, that were randomly sampled from a collection of books that do not exist in the training set but written by the same authors where a maximum of three books per author are selected. To create the pretrained word embeddings, we trained word2vec over lowercased Gutenberg texts after eliminating NLTK stopwords and punctuations. The embeddings size and context window are set to 100 and five respectively.

The author identification network is trained using RMSprop optimizer over 32 or 16 batches.

For the non-linearity, rectified linear units (ReLU) is adopted. We also experimented with various combinations of parameters: number of filters={400,500,600}, kernel size={3,5,7}, number of neurons in the first fully connected layer={100,200}. We followed Keras7 default settings for the remaining parameters such as stride length. The step (stride) in the CNN layer

6https://turi.com/ 7https://keras.io/ 50 is set by default to one; hence, the algorithm goes over all the windows of words in the book.

The best validation accuracy (31%) was achieved at the 16th epoch, over mini-batches of 32 samples, when using 400 filters, a kernel size of 7 and a dense layer of 200 dimensions. These experiments were implemented using Keras8, a Python library, on an NVIDIA GeForce Titan X Pascal GPU with a memory of 12184 MiB.
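The grid search over these three hyperparameters can be sketched as follows; `validation_accuracy` is a hypothetical stand-in for training the CNN once per configuration and returning its validation accuracy:

```python
from itertools import product

filters = [400, 500, 600]
kernel_sizes = [3, 5, 7]
dense_units = [100, 200]

# Hypothetical stand-in: in the real experiments this trains the CNN
# with the given configuration and returns its validation accuracy.
def validation_accuracy(n_filters, kernel, units):
    return 0.31 if (n_filters, kernel, units) == (400, 7, 200) else 0.25

best = max(product(filters, kernel_sizes, dense_units),
           key=lambda cfg: validation_accuracy(*cfg))
print(best)  # prints (400, 7, 200), the winning configuration reported above
```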

The SVR was developed using scikit-learn with an RBF kernel. The performance of the SVR when predicting raw ratings (1–5 stars) is better than when predicting binary ratings; hence, the former is reported here (see Appendix C.1 for more details). For some users who have no negative ratings, the regressor ended up not distinguishing between relevant and irrelevant items. To solve this issue, we include in the training stage some randomly selected books not read by the target user to act as irrelevant books (excluded from the baselines as well)9. The final number of irrelevant books in the training set is the same as the number of relevant ones.
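The negative-sampling step can be sketched as follows (a hypothetical helper; the actual regressor is a scikit-learn SVR trained on these labeled books):

```python
import random

def build_training_books(rated, all_books, seed=0):
    """Augment a user's rated (relevant) books with an equal number of
    randomly sampled unread books treated as irrelevant (label 0)."""
    relevant = [(book, 1) for book in rated]
    unread = [book for book in all_books if book not in rated]
    rng = random.Random(seed)
    irrelevant = [(book, 0) for book in rng.sample(unread, len(relevant))]
    return relevant + irrelevant
```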

3.4 Results and Analysis

Figure 3.2 illustrates that the proposed system, whether using the fully connected layer or the max pooling layer, retrieves more relevant books than the baselines, with the former achieving statistically significantly higher accuracy at a p-value of 0.01 when compared to the CB baselines.

Many users have fewer than ten books in their test set. This means that P@10 can never reach 1, and it also explains why R@10 is greater than P@10. On the other hand, some users have more than ten books in their test set, making R@10 low even for an ideal system.
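The behavior of the two metrics can be sketched for a single user as follows:

```python
def precision_recall_at_k(ranked, relevant, k=10):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    hits = sum(1 for book in ranked[:k] if book in relevant)
    return hits / k, hits / len(relevant)

# A user with only four relevant test books: P@10 is capped at 0.4,
# while R@10 can still reach 1. Here two of the four are retrieved.
p_at_10, r_at_10 = precision_recall_at_k(list("abcdefghij"), {"a", "c", "x", "y"})
```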

The best baseline is MF, followed by Doc2vec. LDA’s low accuracy is surprising, yet it is possible that more preprocessing is required for LDA to work correctly. We expected the plain

8 https://keras.io/
9 This is similar to real-world situations where some users have no negative feedback. It is expected that the performance would increase if negative ratings were specified.

Figure 3.2: Precision@10 and recall@10 generated by our system (AuthId_FC and AuthId_MP), and the baselines.

author-based system to have high R@10 by just assigning high predictions to the target user’s favorite authors, but its performance is the poorest. A closer look shows that a preferred author might have many books not read by the user, and when the author-based system recommends a random sample of these books, many of them are considered irrelevant (not rated by the user).

This observation led us to ask how many unread (unrated) books there are by the authors of the books retrieved by AuthId_FC10. To investigate, for each user we obtain the authors of her relevant recommended books and count the unread books they wrote on the list of books to rank. This analysis is shown in Figure 3.3, where one circle refers to one test case (one fold for one user) and darker circles refer to multiple test cases. The y-axis represents the number of relevant books in the top 10 recommended list. The x-axis refers to the number of irrelevant books by the same authors who wrote the relevant books. For example, the mark at

(37,3) means that our system could recommend three books relevant to a target user from a list

10 As AuthId_FC provides the best performance, we adopted it in all the following analyses.

Figure 3.3: Users’ relevant recommended books versus unread books by the same authors

of items containing 37 books (irrelevant to the user) written by the same authors as the three retrieved books. In 294 cases, the system could retrieve one or more relevant books from a list with more than ten irrelevant books by the same authors. In 29 test cases, the number of unread books exceeded 80.

Furthermore, in 17 test cases, the system could rank in the top ten list at least one relevant book that was written by an author who did not appear in the training set. For example, one user was recommended a relevant book, “Barchester Towers” by “Anthony Trollope”, an author who did not appear in the training set, which has books authored by Elizabeth Cleghorn Gaskell, William Shakespeare, George Meredith, Honoré de Balzac, Joseph Conrad, Robert Louis Stevenson, E. Nesbit, John Buchan, Frederick Marryat, Anthony Hope, G. K. Chesterton, Alexandre Dumas, Frances Hodgson Burnett and Rafael Sabatini. This indicates that the system could learn latent characteristics from the target user’s previous preferences that allow it to discover books by authors the user has not read.

Figure 3.4: Effect of text length on recommendation accuracy

To assess the effect of text length on the accuracy of recommendations, we developed book AuthId representations using shorter texts. We iteratively halved the text length and fed it to the author identification model. Each model was chosen after searching for the most accurate combination of number of filters and kernel size, as in Section 3.3.2. Figure 3.4 shows that the shorter the text, the less accurate the recommendations. A statistically significant difference starts to occur when using 25,000 or fewer words. We assume that feeding texts longer than 100,000 words would result in better performance. It is possible that longer texts reveal more discriminative word choices, topics, etc., which distinguish one author from another.
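The halving schedule can be sketched as follows; the starting length of 100,000 words is illustrative:

```python
def length_schedule(full_length, n_steps=4):
    """Successively halve the number of words fed to the AuthId model."""
    return [full_length // (2 ** i) for i in range(n_steps)]

print(length_schedule(100_000))  # prints [100000, 50000, 25000, 12500]
```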

We went further and analyzed the quality of the AuthId book representations by measuring whether similar representations have overlapping descriptions in NoveList Plus. Using cosine similarity, we studied the ten books most similar to “The Fitz-Boodle Papers” by William Makepeace Thackeray. We selected this book because NoveList has information on all its related authors.

Table 3.1 shows Gutenberg books in descending order according to their similarity values, as well as author information. Thackeray himself authored the first four books. Most of the similar books share a genre or topics with the queried book. One book has a similar writing style description.
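The nearest-neighbor query over AuthId vectors can be sketched as follows (toy two-dimensional vectors; the real representations come from the network's fully connected layer):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(query_id, vectors, n=10):
    """Rank all other books by cosine similarity to the query book's vector."""
    q = vectors[query_id]
    scored = [(book, cosine(q, v)) for book, v in vectors.items() if book != query_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]
```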

Table 3.1: Author information of books similar to “The Fitz-Boodle Papers”

Book name (similarity) / Description on NoveList

The Fitz-Boodle Papers by William Makepeace Thackeray #2823 (1); The Book of Snobs by William Makepeace Thackeray #2686 (0.963); Ballads by William Makepeace Thackeray #2732 (0.957); Barry Lyndon by William Makepeace Thackeray #4558 (0.948)
Author Characteristics: Male; India, England, Great Britain; British, English
Genre: Classics; Satirical fiction; Literary fiction; Historical fiction
Character: Flawed; Unlikeable; Complex
Storyline: Character-driven; Intricately plotted
Pace: Leisurely paced
Writing Style: Richly detailed; Descriptive; Witty; Engaging
Time Period: 19th century; 18th century
Subject headings: Young women – England – History – 19th century; Upward mobility; Men/women relations; Inheritance and succession; Manipulation by women; Social status
Location: England – Social life and customs – 19th century

Waverley; Or ’Tis Sixty Years Since — by Walter Scott #4966 (0.937)
Author Characteristics: Male; Scotland, Great Britain; British, Scottish
Genre: Classics; Historical fiction
Time Period: Medieval period (476-1492); 17th century; 12th century; Jacobite Rebellions (1689-1746); Scottish Stewart period (1371-1603); Stuart period (1603-1714); Plantagenet period (1154-1485); 1700s (Decade)
Subject headings: Richard I, King of England, 1157-1199; John, King of England, 1167-1216; Charles Edward, Prince, grandson of James II, King of England, 1720-1788; Nobility – Scotland; Romantic love; Knights and knighthood; Religion; Inheritance and succession; Jacobites; Kidnapping; Love triangles; Crusades – Third, 1189-1192
Location: Scotland – History – 18th century; Great Britain – History – Richard I, 1189-1199

Stray Pearls: Memoirs of Margaret De Ribaumont, Viscountess of Bellaise by Yonge #5708 (0.935)
Genre: Love stories; Classics

The Adventures of Peregrine Pickle by T. Smollett #4084 (0.934)
Author Characteristics: Male; Scotland, Great Britain; British, Scottish
Genre: Satirical fiction; Picaresque fiction
Tone: Amusing; Strong sense of place; Offbeat
Writing Style: Witty; Descriptive
Time Period: 1760s
Subject headings: Scots; Misadventures; Stable hands; Voyages and travels; Nobility; Character; Fathers and sons; Men/women relations; Rescues; Human nature; Travelers
Location: England – History – 18th century

Captain Blood by Rafael Sabatini #1965 (0.9328)
Author Characteristics: Male; Italy, England, Great Britain; British, English, Italian
Genre: Historical fiction; Swashbuckling tales; Adventure stories; Sea stories
Tone: Dramatic; Atmospheric
Writing Style: Engaging
Time Period: Revolutionary France (1789-1799); 1780s; 17th century; 16th century
Subject headings: Actors and actresses; Pirates; Buccaneers; Physicians; Swordfighters; Traveling theater; Disguises; French Revolution, 1789-1799; Swordplay; Nobility; Injustice; Revenge; Class conflict; British in the Caribbean Area; Pirates – Mediterranean Region; Brothers
Location: France – History – Revolution, 1789-1799; Caribbean Area

Diana of the Crossways — Complete by George Meredith #4470 (0.9325)
Genre: Satirical fiction; Fantasy fiction; Middle Eastern-influenced fantasy
Subject headings: Egoism in men; Courtship – England; Barbers; Magical swords
Location: Arab countries

Mont-Saint-Michel and Chartres by Henry Adams #4584 (0.92)
Genre: Political fiction

It is expected that the AuthId classifier learns words, topics, and other discriminative signals of an author’s style. However, we recognize that such a network may not capture all style elements, such as the structure of sentences and paragraphs, or identify figures of speech and rhythm. Other deep learning approaches, such as RNNs and hierarchical CNN/RNN models, may learn more complex style aspects. It is also worth mentioning that we tried global average pooling instead of max pooling and obtained acceptable accuracy in both author identification and book recommendations. Average pooling returns the mean of each feature map and does not extract an author’s most distinguishing features as max pooling does. However, we observed when conducting a qualitative analysis that max pooling helped to create AuthId book representations that share more common expert annotations.

3.4.1 Limitations

• Authors are not always consistent in their stylistic aims and genres. One author might write essays, poems and mysteries, and a user is not expected to like them all. Our AuthId classifier does not consider such differences. However, the book recommendation component tries to find patterns in the user’s previous reading preferences and learns what kind of books the user likes.

• Any CB is a classifier/regressor that needs to be re-trained once the target user has accumulated fresh responses. Moreover, the AuthId classifier must be re-trained to generate representations for new books/authors. The process of updating both components in this system requires extended time and effort.

• In some cases, even though not very common in literary works, a book is authored by multiple people. To adapt to this situation, the authorship identification model could learn an AuthId book vector for each author, and the vectors could then be merged (averaged or summed). Also, when multiple authors collaborate regularly, they can be considered as one entity (one label). In our implementation, we filtered out authors with fewer than two books; if multiple authors have two or more collaborations, they would be considered as one label. More general limitations of content-based systems are discussed in Section 6.2.

3.5 Conclusion and Future Work

In this chapter, we:

• Represented books as vectors learned in relation to authors and used such representations to make recommendations. To our knowledge, this is the first work that transfers the information extracted from an author-identification model to book recommendations.

• Evaluated the system according to a top-k recommendation scenario and found that the use of AuthId book representations gives statistically significantly higher accuracy when compared with many competitive CB recommendation methodologies. Our system, as any CB, avoids several shortcomings of CF by relying solely on the target user’s ratings rather than the whole community’s ratings; yet, it results in higher accuracy.

• Showed that the longer the texts used in creating AuthId book representations, the more accurate the recommendations. This encourages the exploitation of books’ whole texts for recommendations.

• Conducted a small qualitative analysis which concluded that similar books share similar expert annotations.

This work can be extended in the following ways:

• An author’s writing style may change with time or when writing in different genres. Here, we trained the model to predict the authors regardless of the genre, topics or time of their writing. Taking these factors into consideration may help develop a more accurate author-identification model, which is expected to result in better book representations. One proposed method is to use a multi-task classifier that predicts both genre and author.

• In general, there is no ideal neural network architecture; there is always room for improvement, and our architecture is by no means perfect. Enhancements could be made by searching for the best hyperparameters, increasing the number of layers to provide a higher-level representation of the text, or using different architectures. Other approaches to authorship identification should be investigated in the context of book recommendations. Using more powerful machines would be helpful for advanced implementations.

• One possible recommendation approach is to adopt multi-task learning, where a neural network has two outputs, specifically the author name and the user preferences. We tried to use NN models to predict user reading preferences from the text of books, but we could not achieve high accuracy in the top-k scenario. We think that performance would improve if we used a larger dataset with more user-item interactions.

• Even though we only experimented with an English corpus, it would be interesting to apply the AuthId method to other languages.

Chapter 4

Study of Linguistic Features Incorporated in a Literary Book Recommender System

4.1 Overview

This chapter1 proposes a content-based recommender system that considers books’ textual features. We are motivated to investigate the textual elements that may play a role in generating high-quality book recommendations. Two recommendation algorithms were trained on 120 linguistic aspects learned from book texts, including lexical, syntactic, stylometric and fiction-specific features. One of the two adopted recommendation methodologies ranks books using a forest of randomized regression trees, and its variable importance scores can provide insight into the factors that contribute to the accuracy of the recommendations. The identification of the elements that contribute the most to users’ book preferences may help system owners and publishers understand the factors that appeal to individuals and communities.

It is a common practice to use linguistic features to analyze the content and style of literature. One book RS (Vaz et al., 2012a) employs a limited set of stylometric features: document length, vocabulary richness, part-of-speech bigrams, and most frequent words. Only two of our considered elements are similar to those of (Vaz et al., 2012a), namely document length and vocabulary richness. Moreover, Pera and Ng (2014a,b, 2015) used a readability score to filter out books recommended for emergent readers. We cover a wide set of lexical, syntactic, character-based, fiction-specific and style-based features that, to our knowledge, were not included in book RSs before.

1 This chapter is based on (Alharthi and Inkpen, 2019).

4.2 Methodology

4.2.1 Feature selection

We considered multiple feature sets learned from book texts; the one exception is the author’s name, which is converted into a numerical value using one-hot encoding. Here, we exploit book texts directly instead of relying on labels tagged by experts (e.g., genre). Table 4.1 depicts the categories of our linguistic features, classified into lexical, syntactic, character-based, fiction-specific and style-based. Appendix A.2 shows the full list of all features.

One considered feature set is Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2015a), a widespread resource that was originally developed for psychological analysis and thus focuses on psychological categories in addition to grammatical and content word categories. Psychological studies have consistently shown that style (i.e., function words such as articles) is more useful than content (i.e., emotions). For example, prepositions and conjunctions are found to be reliable indicators of cognitive complexity. Function words may give a way to learn how (not what) people think (Pennebaker et al., 2015b). Other analysis tasks using LIWC include authorship profiling (Argamon et al., 2009), where an unknown writer’s gender, age and personality are identified from a given text. Another study explored differences in the way men and women write (Newman et al., 2008).

Table 4.1: Categories of features included in our study

Lexical:
Token-based: average length of paragraph, sentence and words; average number of commas per sentence; average variance in paragraph length and in sentence length; length of book; words > 6 letters; dictionary words; average syllables per word
Lexical density (type-token ratio)
Word frequencies: percent of latinate words; function words; affect words; social words; cognitive processes; perceptual processes; biological processes; core drives and needs; time orientation; relativity; personal concerns2

Character-based:
Numbers; all punctuation

Syntactic:
Percentage of adjectives, adverbs, nouns, and verbs; comparatives; interrogatives; quantifiers

Fiction-based:
Percentage of dialog text; the number of dialogs; the average length of dialog texts; the number of unique fictional characters; the number of times fictional characters are mentioned; the number of unique places; the number of times places are mentioned; percentage of dialogue by female characters; percentage of characters which are female

Six-styles:
Literary; abstract; objective; colloquial; concrete; subjective

Moreover, a quantitative analysis of literary genre profiling is conducted in (Nicholsa et al., 2014), which tests seven directional hypotheses about the differences in content and style between science fiction, mystery and fantasy. Measurements of the content and style of carefully selected literary books are conducted using seven LIWC categories. It is found that, compared to mystery, science fiction has more cognition and religious terms and fewer social words, pronouns and auxiliary verbs. Compared to fantasy, science fiction has a higher rate of cognition words and auxiliary verbs∗3, and a lower rate of social, biological∗ and religious terms∗. Also, fantasy has significantly more biological∗ and religious∗ terms than mystery. One of the statistically confirmed hypotheses is that fantasy has fewer auxiliary verbs, which indicate mood, voice and modality, due to its focus on the story settings rather than actions. Another confirmed hypothesis is that fantasy, which describes multiple social aspects including creatures’ beliefs, has more religious terms than science fiction, which focuses more on science and secular aspects.

Also, an established assumption is that fantasy has a high rate of biological terms because imaginary creatures are described physically (e.g., “She has red hair”). The previous studies encourage us to apply content and style word classes to make book recommendations. Many LIWC dimensions can be categorized under the reader’s advisory appeal factors. Dimensions that would relate to a book’s mood/tone are emotional tone, affect words, sentiment and subjectivity. Also, time orientation and relativity may capture some aspects of the book settings.

We computed all categories using the LIWC 2015 dictionary, which contains more than

6,400 terms including word stems and emoticons. Each of the 92 categories has an associated sub-dictionary. Whereas many LIWC classes simply reflect their names (e.g., the articles dictionary has only three words, “a”, “an” and “the”), other classes are constructed by experts (e.g., the emotion category). An agreement of two out of three judges is needed to add or remove a word from a dictionary. A word may belong to multiple dictionaries, such as sadness and emotions.

3 * indicates a statistically significant difference

The value of a category is the percentage of its words in the book (e.g., 4.54% of the words in a book are auxiliary verbs) (Tausczik and Pennebaker, 2010). Table A.1 shows the full list of LIWC dimensions, supported by examples.
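A category value can be computed as sketched below; this is a simplification, since the real LIWC dictionaries also match word stems and emoticons, and the tiny auxiliary-verb list here is made up:

```python
def liwc_category_value(tokens, category_words):
    """A LIWC category's value is the percentage of the book's words that
    belong to the category's sub-dictionary."""
    hits = sum(1 for token in tokens if token.lower() in category_words)
    return 100.0 * hits / len(tokens)

auxiliary_verbs = {"is", "was", "be"}           # tiny made-up sub-dictionary
tokens = "The dog is small and was happy".split()
value = liwc_category_value(tokens, auxiliary_verbs)  # 2 of 7 words, ~28.6%
```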

In authorship analysis, it is assumed that writers have a limited vocabulary, a certain diction and a certain syntactic complexity. However, an author’s writing style may differ according to the target readers and the genre. Researchers have considered some standard text measurements to analyze author styles. One measure is word length, which, for example, helped Brinegar (1963) successfully demonstrate that Mark Twain did not author a certain book (i.e., The Quintus Curtius Snodgrass Letters). Yet, it was noticed by Smith (1983) that the texts of two contemporaneous authors had almost identical word-length distributions, and that it is difficult to identify one author’s writings across genres or eras (Holmes, 1985). Another feature is the mean length of sentences, a debatable stylometric measurement adopted by Smith (1888) and Sherman (1893), who argued for its utility in detecting style variations over time (Holmes, 1992); others consider it not very valuable in authorship detection (Yule, 1939). Moreover, the average number of syllables per word, along with other measurements, helped Fucks (1952) identify unique characteristics of authors (Holmes, 1985). All text measurements are calculated by GutenTag (Brooke et al., 2015), which uses NLTK for sentence and word tokenization, and the average number of syllables per word is computed by pyphen4.

Other measurements of writing style are also computed by GutenTag (Brooke et al., 2015), which, similarly to our dataset, was built on the Project Gutenberg texts. GutenTag’s built-in tagger uses a stylistic lexicon (Brooke and Hirst, 2013, 2014) to calculate stylistic aspects usually considered when analyzing English literature. Text styles are not exclusively meant to reflect aesthetic characteristics, but to represent generic differences between texts, such as genres. The styles are colloquial (informal) vs. literary (traditional), concrete (related to physical objects) vs. abstract (related to complex concepts), subjective (influenced by feelings) vs. objective (not affected by feelings), and polarity (Brooke et al., 2017; Brooke and Hirst, 2013).

4 http://pyphen.org/

The stylistic lexicon is automatically created by first using 829 seed words related to each defined style, then discovering and refining the terms that co-occur with them in a large corpus (the Gutenberg dataset). The adopted six-styles model considers not only n-grams but also phrases.

It only analyzes style at the lexical level rather than the syntactic level. A word may belong to multiple stylistic categories. Evaluation of this approach resulted in more than 90% accuracy for all categories except subjectivity (85%). In the model, each word is represented as a vector of six values associated with the styles. To create a ‘stylistic profile’ of a given text, the vectors of its words (word types only, not tokens) are averaged (Brooke et al., 2017). The six-styles model has helped in literary analyses; it is found in (Brooke et al., 2017) that free indirect discourse is quantitatively distinguished from, and centered between, direct discourse and narration. It is also found, in a close study of one of Virginia Woolf’s books, that the feminist author indeed defied the traditional image of women by expressing their indirect discourse in a language which is more objective and abstract, and less subjective and colloquial, than men’s language.
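The profile construction can be sketched as follows (toy lexicon; the real one is induced from the Gutenberg corpus):

```python
def stylistic_profile(tokens, lexicon):
    """Average the six-dimensional style vectors of a text's word *types*
    (duplicate tokens are ignored). `lexicon` maps a word to a vector of
    (literary, abstract, objective, colloquial, concrete, subjective)."""
    vectors = [lexicon[word] for word in set(tokens) if word in lexicon]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

toy_lexicon = {"moonlight": (1, 0, 0, 0, 0, 0),   # purely literary
               "cool": (0, 0, 0, 1, 0, 0)}        # purely colloquial
profile = stylistic_profile(["moonlight", "cool", "moonlight"], toy_lexicon)
```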

Moreover, similar to other book RSs (Pera and Ng, 2014a,b, 2015, 2013) which were directed at young readers, we incorporate a text readability measurement. We calculated the Flesch reading ease score, a formula based on the length of words and sentences, using textstat5. Its score ranges from 100 (very easy) to zero (very hard).
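textstat computes this score from raw text; the underlying formula, transcribed directly with the counts supplied by the caller, is:

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch reading ease: ~100 is very easy, ~0 very hard. Longer
    sentences and longer words both lower the score."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# Ten-word sentences with 1.3 syllables per word: fairly easy text (~86.7).
score = flesch_reading_ease(100, 10, 130)
```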

In addition, some elements specific to fictional works are also considered. As Bakhtin suggests (Bakhtin, 1981), the number of fictional characters can provide information about the type of book. A small number of characters could indicate that the story takes place in a rural setting where characters typically know one another (e.g., family), while many characters suggest urban settings where interactions with unknown people (e.g., in a theatre) occur (Elson et al., 2010). To identify characters, we used the fiction-aware named entity recognizer (LitNER) (Brooke et al., 2016), which outperforms other popular POS taggers. We counted the number of characters in a book, and how many times they were mentioned. Since LitNER also recognizes locations, we counted the number of places in a novel as well. To identify characters’ gender, GutenTag considers how a character is referred to (e.g., her and miss) and compares the character’s first name to a list of female and male names. We calculated the number of fictional characters, the number of their occurrences, the number of dialogues, the average dialogue length, the percentage of female characters and the percentage of their dialogues. Such features may reflect the appeal factor of characterization. Other measurements, such as the number of places in a book and the number of times places are mentioned, may describe the setting factor.

5 http://neon.niederlandistik.fu-berlin.de/en/textstat/

In this study, we encompassed numerical characteristics that may play a role in reading preferences. When selecting the linguistic features, we relied on previous studies about factors related to reading preferences, such as book length (Hussain and Munshi, 2011) and readability (Pera and Ng, 2014a); on the features’ usability in literary analysis of content and style, such as LIWC (Nicholsa et al., 2014) and the six styles (Brooke et al., 2017); and on the availability of text measurements within a utilized package, such as the percentage of female characters, which is calculated via GutenTag (Brooke et al., 2015). Even though we consider some elements to be related to reader’s advisory appeal factors, it is not our aim to model books according to the appeal factors, which would require the knowledge and expertise of professionals.

4.2.2 Recommendation procedure

In content-based recommendations, for each user, a classifier or regressor is created to learn from previous preferences to predict future interests. In this section, we describe the two approaches which empirically provided the best results.

K-nearest neighbors (KNN) is commonly used in RSs; in our case, it takes as input books represented as vectors of features and finds the most similar ones. A predicted value for book x, which is unrated by user u, is computed as the average of the similarity values between x and the K books preferred by u. The similarity between two items is first calculated at the feature level, then averaged (equation 4.1, where m is the number of features). Books with the highest similarity values are recommended in the top k list (Lops et al., 2011). We experimented with multiple similarity measures, and a combination of the Gaussian kernel distance for numerical features (equation 4.2, where σ² is a constant (Phillips and Venkatasubramanian, 2010)) with the cosine distance for categorical features (equation 4.3, where x_i and y_i are binary vectors) gives the best results. The similarity ranges from zero to one, with one being the highest.
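The combination of distances just described (equations 4.1–4.3) can be sketched in code; σ² and the toy feature values are arbitrary:

```python
import math

def gaussian_kernel_distance(x, y, sigma2=1.0):
    """Eq. 4.2: distance in [0, 1) between two numerical feature values."""
    return 1.0 - math.exp(-((x - y) ** 2) / (2.0 * sigma2))

def cosine_distance(x, y):
    """Eq. 4.3: distance between two binary (one-hot) feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms

def overall_similarity(per_feature_distances):
    """Eq. 4.1: one minus the average per-feature distance."""
    return 1.0 - sum(per_feature_distances) / len(per_feature_distances)

# Identical books: every per-feature distance is 0, so similarity is 1.
sim = overall_similarity([gaussian_kernel_distance(3.5, 3.5),
                          cosine_distance([1, 0], [1, 0])])
```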

Overall_similarity(x, y) = 1 − (Σ_{i=1}^{m} sim(x_i, y_i)) / m    (4.1)

Gaussian_kernel_distance(x_i, y_i) = 1 − exp(−‖x_i − y_i‖² / (2σ²))    (4.2)

Cosine_distance(x_i, y_i) = 1 − (x_i · y_i) / (‖x_i‖ ‖y_i‖)    (4.3)

The other approach constructs for each user a forest of randomized regression trees. A single tree-based model (Breiman et al., 1984) uses the training sample of size N, where each input X is a vector (X_1, X_2, ..., X_P) labeled with output Y, to build a tree T with internal and external nodes (leaves) representing features and labels, respectively. The feature space χ is divided at the tree root into sub-regions, which are recursively divided at each node t. At an internal node t, a test s_t = (X_m < c) is performed to further split its sub-region into t_L and t_R.6 In regression trees, an external node t is labeled with ȳ(t), the average output in the subset at t (Louppe et al., 2013).

To ensure sub-regions with as homogeneous an output as possible, the algorithm finds the best split, the one that maximizes the impurity reduction Δi(s, t) measured when splitting the N_t samples at node t into two partitions t_L and t_R, where p_L = N_{t_L}/N_t and p_R = N_{t_R}/N_t. In equation 4.4, i(t) refers to any impurity measure, such as the variance in regression trees.

Δi(s, t) = i(t) − p_L i(t_L) − p_R i(t_R)    (4.4)

The growth of a tree is stopped once a stopping criterion is met. After the model is constructed, a new sample can be propagated through the tree and is eventually labeled with ŷ, the value of the leaf it ends up at (Louppe et al., 2013).

Ensemble approaches, such as randomization-based algorithms, are adopted to enhance the performance of single trees that may have low accuracy due to high variance. Random Forests

(RF) (Breiman, 2001), a popular extension of tree-based models, constructs trees based on multiple bootstrapped copies, and considers a random subset of the features at each node when deciding the best split. Another ensemble method is Extremely Randomized Trees, also called Extra Trees (ET) (Geurts et al., 2006), which provides competitive results in a learning-to-rank task in (Geurts and Louppe, 2011). ET generates multiple trees from the whole training set, with no bootstrapping required. When performing a split, ET also considers a random subset of features. The smaller the number of features to consider at each node, the more the randomization and the less the dependence on the target value (i.e., when it equals one, the trees are completely randomized). ET takes the randomness a step further by choosing the split point at each internal node at random, without considering the output. Using random splitting decreases the chance of overfitting (high variance), and the use of the whole training sample can reduce underfitting (bias) (Louppe et al., 2013; Geurts and Louppe, 2011; Geurts et al., 2006).

6 In the case of a binary test.

In addition to their high accuracy, forests of randomized trees can provide insight into the reasons for good performance by identifying the key variables for predicting the output.

One way to calculate the importance of a feature X_m is through the Mean Decrease Impurity importance (MDI), as in equation 4.5. In the context of RF and ET, the average of the MDI over all N_T trees is used (Louppe et al., 2013).

imp(X_m) = (1/N_T) Σ_T Σ_{t∈T : v(s_t)=X_m} p(t) Δi(s_t, t)    (4.5)

Here, p(t) Δi(s_t, t) refers to the weighted impurity reduction for each internal node t where the feature X_m appears, p(t) is N_t/N, and v(s_t) is the feature used in the split s_t.
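Equation 4.5 can be sketched over toy trees as follows; each tree is flattened into (split feature, p(t), Δi) node records, a simplification of what scikit-learn exposes per feature as `feature_importances_`:

```python
def mdi_importance(trees, feature):
    """Eq. 4.5: average over trees of the weighted impurity decreases
    of all internal nodes that split on `feature`."""
    per_tree = (sum(p * delta for f, p, delta in tree if f == feature)
                for tree in trees)
    return sum(per_tree) / len(trees)

# Two toy trees; "book_length" splits near the root (large p(t), large delta i).
trees = [[("book_length", 1.0, 0.40), ("commas_per_sentence", 0.5, 0.10)],
         [("book_length", 1.0, 0.30)]]
importance = mdi_importance(trees, "book_length")  # (0.40 + 0.30) / 2
```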

4.3 Evaluation

4.3.1 Dataset and preprocessing

Similar to the previous chapter, experiments were conducted on the LitRec dataset (Vaz et al., 2012c); however, more users, books and authors are included, because there is no need to filter out authors without enough items to train the AuthId classifier. Also, to calculate the fiction-based features, any book without characters was excluded (only 71 titles). We used the GutenTag tool (Brooke et al., 2015) to generate clean book texts as well as texts tagged with characters, places and dialogs. Similar to many CB-related works, such as (Wang et al., 2009), we filtered out any user with fewer than ten ratings, to provide adequate data for the CB. The remaining number of users was 367, and they rated 1,050 unique books authored by 410 writers. We utilized books’ whole texts, with an average length of 91,580.07 words7.

Baselines. The same algorithms as in Section 3.3.2 are used. However, to report their best results, we ran them again to choose the best parameters: the number of factors in MF, the number of topics in LDA and LSI, and the context and vector sizes in doc2vec.

Parameter Settings. To search for the best parameters, we built multiple feature-based systems using a validation set of thirty randomly selected users. We experimented with 1, 5, 15, 30 and 64 neighbors in KNN, and 15 led to the best accuracy. For ET and RF, we tried 500, 1000, 1500, 2000 and 3000 trees, and 120, 60, 10, 7 and 3 features to consider at a split. We observed that a smaller number of features leads to better accuracy, and found that the best performance was recorded at 1000 trees and 3 features for ET, and 1500 and 10 for RF. As it showed non-competitive performance on the validation and test sets, we dropped RF from all subsequent results. Since many users lacked negative feedback, we randomly included in the ET training set as many irrelevant books for the target user as relevant ones (these were excluded from the other algorithms as well). Prior to applying KNN, Z-score normalization was performed in order to transform each feature independently to a scale of zero mean and unit standard deviation.
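The normalization-then-KNN step can be sketched as follows, assuming scikit-learn; the feature matrix is random stand-in data, while the 15-neighbor setting matches the validation result reported above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
features = rng.rand(100, 120) * 50   # 100 books x 120 linguistic features, mixed scales

# Z-score: each feature independently to zero mean and unit standard deviation
scaler = StandardScaler()
X = scaler.fit_transform(features)

# 15 neighbors gave the best validation accuracy in our runs
knn = NearestNeighbors(n_neighbors=15).fit(X)
distances, indices = knn.kneighbors(X[:1])   # books most similar to book 0
print(indices.shape)
```

Without the normalization step, features with large numeric ranges (e.g., book length) would dominate the distance computation.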

The systems were implemented using scikit-learn8 and Graphlab9. In the current experiments, the average number of books in the training set is 9.2; as a result, the average number of books to rank is about 1,040.8 (1,050 unique titles minus 9.2).

7To download the calculated features: https://tinyurl.com/y9svvgwg
8http://scikit-learn.org/stable/index.html
9https://turi.com/

Figure 4.1: Recommendation accuracy of our approach against the baselines

4.4 Results and Analysis

More than a third of the relevant books in the test sets are ranked in the top ten recommended list generated by each feature-based system (Figure 4.1). Both the KNN and ET algorithms generated statistically significantly more accurate top ten recommendation lists than the CB baselines, and of the two, KNN was the most accurate. There is also a statistically significant difference when comparing KNN-based CB to MF. All content-based baselines are statistically significantly lower than MF, with LDA being the lowest-performing recommendation algorithm.

The nature of KNN makes it an appropriate choice for learning short-term interests, such as recently read articles, as studied in (Billsus and Pazzani, 1999). If the user likes a new item of a different type, KNN can easily find similar items. Other machine learning algorithms, on the other hand, may need more than a few training instances to recognize a new pattern. The number of training examples in our dataset has an average of 9.2, which may explain the superiority of KNN over ET.

It is worth mentioning that recall@10 cannot equal one in the case of users with more than

Figure 4.2: Comparison of linguistic features vs. AuthId book representations

ten books in their test set, and precision@10 can only be ideal if the target user has ten or more books in their test set, which was not the case for many users in our dataset. This also explains why precision@10 is lower than recall@10.

We also compared with the AuthId approach proposed in Chapter 3. We removed any book with no AuthId representation and any user with fewer than ten items. The number of remaining users is 264, and the number of unique items is 639. We also tried concatenating our 120 linguistic features (lingFeat) with the AuthId book representations. As plotted in Figure 4.2, both the recall and precision of the lingFeat approaches are statistically significantly higher than those of SVR_AuthId.

The use of the merged book representations increased the accuracy of KNN and slightly decreased that of ET.10 However, the AuthId book representations are based on 100,000-word excerpts, and it was noticed that the more text included in the creation of AuthId representations, the higher the system's accuracy. Since our linguistic features are based on word counting, it is not meaningful to measure frequencies of content and style words on such book excerpts (i.e., 100,000 words) rather than whole texts.

10SVR gives low accuracy when dealing with linguistic features, and ET and KNN give a competitive performance (yet less than SVR) when working with AuthId representations.

One advantage of using ET is learning which key features supported the predictions of the model. In the following analysis, we consider only the 730 ET models that resulted in accurate recommendations with at least one relevant retrieved book; hence, they could learn what the target user likes/dislikes in terms of linguistic aspects. The variable importance values of these ET models were then averaged, which should smooth out any bias (e.g., a user who dramatically likes books with a high percentage of female characters). The randomization in ET reduces the chance that choosing one predictor will significantly suppress the importance values of its correlated features. The most and least influential variables are presented in Figure 4.3, and the full list is presented in Appendix A.2. The top two variables are Seeing, which indicates an act of observation such as view, and author name. While the latter is expected to be ranked highly, the former is not. Seeing is strongly correlated (using Spearman's rank correlation coefficient) with other top-ranked features, including concreteness (0.59) and perceptual processes (0.82), which is composed of Seeing words plus Hearing and Feeling. Many studies, including (Holcomb et al., 1999), show that concrete language, which is ranked higher than and negatively correlated with abstract (-0.79), is easier and more precise to process than abstract language. This is explained by the dual-coding theory, which indicates that while abstract words are processed only by a human's verbal system, concrete words use the same system in addition to the non-verbal one (Holcomb et al., 1999). The full list of highly correlated variables is in Appendix A.3.
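The reported correlations can be reproduced in form (not in value, since the feature columns below are synthetic) with SciPy's `spearmanr`:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.RandomState(0)
seeing = rng.rand(300)                      # per-book "Seeing" scores (illustrative)
perceptual = seeing + 0.1 * rng.rand(300)   # Seeing + Hearing + Feeling: highly correlated
abstract = -seeing + 0.2 * rng.rand(300)    # negatively correlated, as observed

rho_p, _ = spearmanr(seeing, perceptual)
rho_a, _ = spearmanr(seeing, abstract)
print(round(rho_p, 2), round(rho_a, 2))     # strong positive vs strong negative
```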

Similar to the findings of (Hussain and Munshi, 2011), the length of a book has a strong impact on users' reading preferences. Two characterization measurements are also important: the number of characters mentioned and the number of dialogs, both of which are positively correlated with the length of the book. In general, all characterization features and six style features are placed above the 60th rank (i.e., in the first half of the feature list). One study (C. Beyard-Tyler and J. Sullivan, 1980) showed that female and male students preferred story summaries with a

Figure 4.3: Most and least influential variables in accurate ET models

main character similar to their own gender; this preference diminished as female readers' grade increased.

While the percentage of adverbs is one of the top predictors, the percentages of nouns and adjectives are ranked low, at 98 and 93, respectively. From the punctuation list, period is ranked best, at 20th, while the lowest is other-punctuations, which counts only uncommon punctuation and ASCII characters. All informal language indicators, such as Filler, Swear, Netspeak and Informal speech, are ranked at the bottom. This is expected, as classic Gutenberg literary books do not typically include modern informal words; thus, such findings could differ in other datasets. It is also worth mentioning that colloquial, which measures the casual language of literary texts (e.g., dude), is the 46th most important feature.

It would be interesting to use the previous list of top variables to generate recommendations using KNN and ET. Figure 4.4 shows that the more features ET is fed, the higher its accuracy.11

11We experimented with many feature selection methods, such as chi2, but they did not result in competitive results.

Figure 4.4: The performance of ET and KNN when only top features are considered

This indicates that the least important features are not entirely irrelevant to ET's predictions. This is not the case for KNN, whose performance increased after the removal of the twenty and forty least significant features.

Further analysis was conducted to determine the best and worst features for the majority of users. We counted the number of test cases (ET models) in which a feature was considered the most or least influential, and present the most frequent in Table 4.2, which overlaps with the list in the previous figure but with different rankings. Some variables, such as length of book, did not appear in Table 4.2; though they seemed to have high significance for many users, they were never the single most important. However, the top feature, drives and needs, is not in Figure 4.3 but is ranked 18th. It counts words that describe affiliation (e.g., ally or friend), achievement (e.g., winning), power (e.g., bully), rewards (e.g., prize) and risks (e.g., danger).

For drives and needs, the average, maximum and minimum values are 5.8, 12.9 and 3.42, respectively; for seeing, they are 1.51, 4.7 and 0.53. Of the 301 relevant books for the twenty users with drives and needs as the best feature, only 26 have a value higher than the average, suggesting

Table 4.2: Most and least important features according to the ET models

Top Feature                                   | Number of test cases | Last Feature                        | Number of test cases
----------------------------------------------|----------------------|-------------------------------------|---------------------
Core drives and needs, Percent of text        | 25-31                | Fillers                             | 151
adverbs, objective, author                    |                      |                                     |
Seeing, Feeling                               | 20-25                | Other punctuation                   | 45
Number of characters mentioned, Dictionary    | 15-20                | Swear, Netspeak                     | 15-25
words, Period, Auxiliary verbs                |                      |                                     |
Number of dialogs, Female referents,          | 10-15                | Friends, Sexuality, Tentativeness,  | 10-15
Discrepancies, Future focus, Abstract,        |                      | Percent of characters which are     |
Negations, Perceptual processes, Semicolons,  |                      | female, 2nd person, Achievement,    |
3rd pers singular, Subjective, Power,         |                      | cause, home                         |
Reward focus, motion, Parentheses             |                      |                                     |

that these users tend to prefer books with less focus on drives and needs. Twenty users with seeing as the top feature have 103 relevant books out of 253, and 18 of 22 irrelevant books have seeing values higher than the average. It seems that these users prefer to read books with smaller seeing values.

A qualitative analysis was conducted by assessing whether books considered similar by the KNN system share similar authors or descriptions on NoveList. We represented books as vectors of all features except author name. All of the top ten books similar to a randomly selected book by Anthony Trollope were also written by him, and eight of the top ten books similar to one of Honore de Balzac's books were written by him. Table 4.3 shows the books similar to Madame de Treymes by Edith Wharton. Only the first two books were written by her, though all the retrieved books have at least one description (highlighted in bold) in common with the queried book. Most have a topical resemblance; however, only two share the same storyline and character type. To determine whether fiction-specific features affected the retrieval of the two books by Henry James, a reevaluation was done after removing them. One of his books (Sir Dominick Ferrand) was no longer retrieved, and one book by Emile Gaboriau, which has nothing in common with the queried book, was added.

4.5 Conclusion and Future Work

In this chapter, we:

• Took advantage of book texts, learning their style and content to predict future reading preferences. In the literature on book RSs, we are aware of (Pera and Ng, 2014a,b, 2015), which include a readability score, and (Vaz et al., 2012a), which studies a few stylometric features, two of which are included in our study (document length and vocabulary richness). Our investigation covered a wide variety of style and content features that were not studied before.

• Designed two recommendation algorithms that outperformed several CB systems and a collaborative filtering system in the top-k recommendation scenario.

• Compared the use of linguistic features to the AuthId approach and tested their joint book representations.

• Studied the effect of multiple linguistic elements on reading preferences. In particular, we highlighted the variables considered important using an ensemble of randomized trees and produced book suggestions based on them.

• Conducted further qualitative analysis, which showed that similar books generated by our system share the same author or similar descriptions on NoveList.

This work can be extended in the following ways:

• Word sense disambiguation could be conducted before word counting to calculate word categories more precisely, which may lead to better overall accuracy.

• Other text measurements that require further preprocessing, such as the number of characters in each scene, could also be applied. Moreover, semantic aspects of book content could be considered.

• Due to copyright issues, texts of books other than Gutenberg texts were not accessible for this study. It would be interesting to conduct similar research on books from various times and different genres.

• Sequential patterns of words and POS tags could be discovered at the author/user level.

Table 4.3: Top ten similar books to Madame de Treymes by Edith Wharton

1- Madame de Treymes by Edith Wharton (1)
2- The Descent of Man and Other Stories by Edith Wharton (0.72)
3- The Glimpses of the Moon by Edith Wharton (0.71)
   Description on NoveList: Genre: Love stories, Modern classics; Character: Complex; Storyline: Character-driven; Writing Style: Descriptive, Lush, Richly detailed; Time Period: 1920s; Location: Europe – Social life and customs – 20th century; Subject headings: Social status, Freeloaders, Americans in Europe, Honeymoons, Married people, Misunderstanding, Men/women relations

4- Our Friend the Charlatan by George Gissing (0.71)
5- Denzil Quarrier by George Gissing (0.698)
   Description on NoveList: Genre: Psychological fiction, Domestic fiction, Love stories, Historical fiction, Satirical fiction; Time Period: 19th century; Subject headings: Women – Employment – England, Middle class women – England, Women – England, Single women – England, Sisters – England, Married women – England, Self-discovery in women, Authors, Fiction writing, Married people; Location: London, England

6- A Daughter of To-Day by Sara Jeannette Duncan (0.697)
   Description on NoveList: Genre: Canadian fiction; Time Period: 20th century; Subject headings: Brothers and sisters, Politicians, Romantic love, Clergy, Political science, Men/women relations; Location: Ontario

7- In the Year of Jubilee by George Gissing (0.69)
   Description on NoveList: See above

8- The Tragic Muse by Henry James (0.687)
9- Sir Dominick Ferrand by Henry James (0.682)
   Description on NoveList: Genre: Classics, Psychological fiction; Character: Complex, Flawed; Storyline: Character-driven; Pace: Leisurely paced; Tone: Melancholy, Reflective, Thought-provoking; Writing Style: Stylistically complex

10- The Tragic Bride by Francis Brett Young (0.678)
   Description on NoveList: Genre: Domestic fiction; Subject headings: Young women, Quests, Love, Men/women relations, Marriage, Remarriage, Self-discovery in women, Happiness in women; Location: Midlands, England

Chapter 5

Topic Model-Based Book Recommendation Component

5.1 Overview

Both CB approaches in Chapters 3 and 4, like any content-based or collaborative filtering system, suffer from the new user issue. It is difficult for RSs to generate accurate recommendation lists without the rating history of the target user. Real-world RSs adopt the ask-to-rate technique, presenting a list of items to new users and asking them to rate what they have already purchased/consumed, until sufficiently many ratings have been acquired. As the signup process can be lengthy and consume much of users' time and effort, many approaches have been proposed to reduce this effort, including (Rashid et al., 2008). One solution to alleviate such a burden is to exploit social media, where users willingly share their opinions and interests.

This chapter1 proposes a recommendation component that learns users' interests from social media data and recommends books accordingly. This automatic personalization module, which we call the Topic Model-Based book recommendation component (TMB), can help existing RSs deal with new users who have no rating history. We believe this is the first book RS that uses social media rather than book-cataloguing Web sites, as well as the first to extract user-discussed subjects from social media and match them with books.

1This chapter is based on (Alharthi et al., 2017)

For each user, a topic profile is created that summarizes the subjects discussed on her social media account. Our method for modelling users' interests acquires a user's distinctive topics using tf-idf and represents them as word embeddings. User profiles are matched with descriptions of books, and the most similar books are suggested. To evaluate TMB, a dataset was collected that encompasses user profiles on Twitter and Goodreads2, a social book-cataloguing Web site. Even though the system is designed to complement other systems, we evaluated it against a traditional content-based RS that relies on books' metadata. We compared the top k recommendations made by TMB and CB. Both retrieved a comparable number of books, even though CB relied on users' rating history while TMB only needed their social profiles. Thus, a new TMB user would receive recommendations as accurate as those received by existing users in CB.

5.2 Topic Model-Based Book Recommender System

This section explains the TMB components, book and user profiles, and formally defines the recommendation process.

5.2.1 Book and user profiles

A book profile (BP) is represented as a vector of terms comprising its description. We used short descriptions of books available online. A user profile (UP), on the other hand, is a vector that consists of terms extracted from the target user's Twitter timeline. Terms are elicited from the textual content of tweets and their embedded links. Retweets and replies are included with tweets so as to avoid sparsity, while hashtags are included if they are spelled correctly. User profiles are built automatically using topic modelling techniques, without being mapped to an external ontology or to predefined categories. For topic modelling, we considered two techniques: Term Frequency - Inverse Document Frequency (tf-idf) and Non-Negative Matrix Factorization (NMF). Preliminary tests showed that LDA performed poorly, generating more inconsistent topics (e.g., cricket, test, essay, history, memory), whereas NMF produced more coherent topics (e.g., cricket, test, India, literature, player); therefore, we dropped LDA from the following comparisons.

2https://www.goodreads.com/

Term Frequency - Inverse Document Frequency

The tf-idf weighting approach is widely used in information retrieval. The term frequency tf_{t,d} of a term t is the number of times it occurs in document d. A document in this context is all tweets and/or links in one user's timeline. The inverse document frequency (Equation 5.1) helps distinguish the terms that are specific to a user/document.

idf_t = log(N / df_t)    (5.1)

N is the number of users and df_t is the number of documents in which term t occurs. Equation 5.2 defines the tf-idf weight of term t in document d.

tf-idf_{t,d} = tf_{t,d} × idf_t    (5.2)

The terms with the highest weights are considered the tf-idf topic model (Manning et al., 2008).

Non-Negative Matrix Factorization

This dimensionality reduction and topic modelling technique has been found to work well with short texts (Cheng et al., 2013; Yan et al., 2012; Godfrey et al., 2014); hence, we adopt it as a baseline. For each user, a term-document matrix is created; a document here is one tweet or link. NMF factorizes the m × n term-document matrix A into two non-negative matrices W and H. The former is the m × k term-topic matrix, whereas the latter is the k × n topic-document matrix. The number of NMF topics k must be defined ahead of the decomposition. The product WH approximates the original matrix A; every document in WH is a linear combination of the k topic vectors in W, with coefficients given by H (Godfrey et al., 2014).

Topic embeddings

Pre-trained word embeddings increased the performance of many applications and are com- monly exploited for augmentation. In TMB, we map terms in book and user profiles to pre- trained word embeddings developed using word2vec (see section 2.6) on the Google News dataset of around 100 billion words. It comprises vectors of 300 dimensions for 3 million words and phrases3. Other available pre-trained models (e.g., Global Vectors for Word Rep- resentation4) have been built using text from Twitter (1.2M vocab), Common Crawl (1.9M vocab) and Wikipedia 2014 (400K vocab). The Google news model covers larger vocabu- lary (3M) and its embeddings are more relevant to both formal books descriptions and casual tweets. 3https://code.google.com/archive/p/word2vec/ 4http://nlp.stanford.edu/projects/glove/ 83

5.2.2 The recommendation procedure

Let U = {u1, u2, ..., un} be a set of Twitter users. For user ui, a time threshold T_ui is established to avoid overlap between the learning and prediction times. The learning timeframe LT_ui involves all tweets and links created by ui before T_ui, whereas the recommendation timeframe RT_ui contains the books read by ui after T_ui. For user ui ∈ U, the user profile UP_ui is a vector comprising terms w1, w2, ..., wm extracted from tweets or links shared by ui during LT_ui. Let B_ui = {b1, b2, ..., bl} be the set of books read by user ui during RT_ui. For book bj ∈ B_ui, the book profile BP_j is a vector of words w1, w2, ..., wh found in bj's description. To recommend books to ui, TMB calculates the cosine similarity (Equation 5.3) between UP_ui and BP_j for every book in B_ui, and suggests the books with the k most similar BPs.

similarity = (UP_ui · BP_j) / (‖UP_ui‖ × ‖BP_j‖)    (5.3)

If terms are replaced by their word embeddings, an average vector is created for the word vectors in UP_ui and another for those in BP_j. Cosine similarity is then computed between the resulting average vectors.

5.3 Data Preparation and System Implementation

This section describes how the dataset was collected and preprocessed. It also presents the implementation of the system, unfolding technical details of the creation of book and user profiles.

5.3.1 Data collection

We collected user data from Goodreads and Twitter because there are no datasets containing users' social profiles, their reading lists and book information.5 On Goodreads, once a user finishes reading, she can write a review and share it with her followers on other social media Web sites, including Twitter. A Goodreads review shared on Twitter has the default format "(1-5) of 5 stars to book name by author name http://link–to–review", so finding users with both Twitter and Goodreads accounts becomes an easy task.

The Twitter API was queried to retrieve any review shared by Goodreads users, and more than 1,000 tweets were found, from which we accessed their authors and IDs. The Twitter API allows the collection of a maximum of 3,500 tweets per user. We gathered the text, ID and creation date of tweets for users with Goodreads reviews. Links were extracted from user timelines and their textual contents (if any) were collected. This was achieved by applying an efficient Python library called Newspaper, which obtains a clean, tag-free text from a given Web page. Once the Twitter user profiles were complete, we collected data from Goodreads for the book profiles.

User Goodreads IDs were obtained from the tweets of default reviews. Next, a scraper was developed to retrieve all review IDs and dates from users' "read books" lists, which contain only completed books. The Goodreads API was consulted to extract information about all books read by a user, including book metadata, text reviews, ratings, read dates and added dates. The book metadata, which can be used to build content-based recommender systems, include title, authors, language, the average rating of all readers, the number of pages, publisher, publication date, text review count and book description. Figure 5.1 shows the statistics of the dataset after preprocessing.

5To download: https://tinyurl.com/ycc9qfhw

Figure 5.1: Dataset collection and statistics.

When users insert new books into their lists, they may discuss them on their social media.

Therefore, in TMB, the recommendation timeframe RT considers added dates instead of read dates. The read date indicates when a book was finished, while the added date is when the book was catalogued.

The rating scale, according to Goodreads, treats 1-2 stars as "dislike" and 3-5 stars as "like"; books rated 3-5 will be called relevant in the remainder of the chapter. The number of users shrank to 69 after the deletion of non-English users, inactive users and those with private Goodreads accounts. Even though many datasets with a large number of users exist, some recommendation methodologies, such as TMB, require personal information about users. This makes it hard to experiment on large datasets. Examples of such work include (Tkalčič et al., 2013), which used a dataset of 52 users to test an affective-based RS, and (Odić et al., 2011), which tested a context-aware RS on an 89-user dataset.

5.3.2 Data preprocessing

Before topic extraction, we cleaned the text of tweets, links and book descriptions as follows:

- Tokenization and POS tagging:

The tweets were tokenized using the Tweet Tokenizer from NLTK (Bird et al., 2009), which is Twitter-aware, and tagged using the GATE Twitter part-of-speech tagger (Derczynski et al., 2013). For links and book descriptions, the regular NLTK word tokenizer (Bird et al., 2009) and the Stanford part-of-speech tagger (Toutanova et al., 2003) were applied after the deletion of HTML tags using BeautifulSoup, a Python library.

- Noun-based user and book profiles:

Only nouns (singular or plural) were kept, then lemmatized with the NLTK WordNet Lemmatizer (Bird et al., 2009). A noun, according to Merriam-Webster, represents an "entity, quality, state, action, or concept"; nouns, then, can capture the interests of users better than any other part of speech. Other researchers have also considered only nouns to model user interests based on their social media accounts (Choi et al., 2014) and to model books (Tsuji et al., 2014). Also, Jockers and Mimno (2013) found that the best way to identify literary themes was to extract noun-based LDA topics from books' full texts.

- Removal of useless information from user profiles:

• Repeated content of links.

• Tweets with Goodreads links, as they included information about users' reading history. Any Goodreads link was also discarded.

• Web-related terms, e.g., website and Facebook.

• The 100 most common English nouns in the Oxford English Corpus, because such words mostly do not reflect interests, e.g., man and world (see Appendix B.1).

• Words of fewer than four letters, to eliminate noise (i.e., unfiltered abbreviations such as RT).

• Misspelled hashtags, because the goal is to match them with book descriptions, which are free of spelling errors. They were checked using aspell.6

• Words with the lowest idf weights, which indicate that the topics are not specific to the target user (see Appendix B.2).
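The noun filtering and the removal rules above can be sketched in pure Python. The POS tags are supplied pre-computed here (the real pipeline uses the NLTK/GATE tokenizers and taggers cited earlier), and the two-word common-noun set stands in for the 100-noun Oxford list:

```python
# Tokens pre-tagged with Penn Treebank POS tags (in practice produced by
# NLTK's TweetTokenizer plus the GATE Twitter tagger)
tagged = [("Reading", "VBG"), ("norse", "JJ"), ("myths", "NNS"),
          ("tonight", "NN"), ("!", "."), ("#mythology", "NN"),
          ("RT", "NN"), ("man", "NN")]

COMMON_NOUNS = {"man", "world"}          # stand-in for the 100-noun Oxford list

def keep(token, tag):
    return (tag in ("NN", "NNS")                  # singular or plural nouns only
            and len(token.lstrip("#")) >= 4       # drop short noise such as RT
            and token.lower() not in COMMON_NOUNS)

profile_terms = [t.lstrip("#").lower() for t, tag in tagged if keep(t, tag)]
print(profile_terms)   # ['myths', 'tonight', 'mythology']
```

In the real pipeline the surviving nouns would additionally be lemmatized with the WordNet Lemmatizer before weighting.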

User profiles were built automatically, in that we did not select the topics for each individual. However, we evaluated the topic models based on our observations. For example, after building topic models from tweets and links, we noticed that unimportant (generic) terms such as website were dominant, which led to the deletion of the Web-related terms and the 100 common nouns. Also, when creating NMF-based profiles, we selected the number of topics that resulted in coherent and non-redundant topics.

5.3.3 System implementation

A user's Twitter timeline was divided in half, and the date of the middle tweet was considered a time threshold that separates the learning and recommendation periods. This division keeps enough tweets to build user profiles and enough books to predict. To ensure that tweets do not address the predicted books, a one-month gap was set between the timeframes. The average numbers of tweets and books included in the learning period are 758 and 802, while the minimum numbers are 121 and 13, respectively. Many researchers, including (Wang et al., 2009), adopted 10 as the lowest number of ratings needed to develop a CB.

6http://aspell.net/

We developed twelve variations of user profiles. They differ in the topic modelling technique (NMF or tf-idf), in the source of data (tweets alone, links alone, or tweets and links) and in the word representations (embeddings [emb] or none). The NMF algorithm was implemented using the scikit-learn Python package (Buitinck et al., 2013). After many trials, the number of NMF topics was set to five, with six words each, because topics became redundant beyond that. The maximum number of tf-idf topics was set to 100 to avoid including unimportant words. To calculate cosine similarity between word vectors, we used gensim, a Python library. Not all topics have corresponding word vectors, so a reduction in the number of topics is expected.

5.4 Experiments and Results

We measure the predictive power of the system using off-line evaluation, which is appropriate for assessing the accuracy of an RS. Our approach was compared to a traditional metadata-based CB and to a random system in a top-k recommendation scenario. The CB was implemented using the default settings of Graphlab, a well-established framework for RSs. Books in the CB training and test sets were represented by their metadata (see Section 5.3.1). Although we considered a comparison with collaborative filtering, the rating matrix is highly sparse, which means that the results would not reflect a typical CF.

Since each user's data is divided according to a time threshold, we cannot apply k-fold cross-validation; therefore, we tested TMB in a fashion similar to the leave-one-out evaluation applied in (Koren, 2008; Lu et al., 2012). We created one set of 1,000 random books that are unique and not rated by any user. For each user, we randomly selected one relevant book from the recommendation timeframe, added it to the 1,000 books and asked our systems to rank them. If the rank of the relevant book is f, the RS should produce the lowest possible f (preferably 1). If f ≤ k, it is a hit; otherwise, it is a miss. Similar to many related projects, we set k to 10. The metrics adopted are the hit-rate (Equation 5.4), sometimes called recall@k, and the average reciprocal hit-rank (Equation 5.5) (Deshpande and Karypis, 2004). For each test case, recall@10 is either one (hit) or zero (miss). To avoid bias, five trials were conducted, and the reported results were averaged. We measured the statistical significance of differences in results using the t-test at a maximum p-value of 0.05. Some of the randomly chosen 1,000 books might have topics similar to a user profile, or might share content, e.g., author or description, with a user's read books. However, we did not filter out such books, because this would introduce a bias and favour one system over the other. In addition, for each of the 69 users, we examined five books, so the overall number of tested books is 345. If there is a bias with a few books, it should not affect the majority of test cases.

HR = #hits / #users    (5.4)

ARHR = (1/#users) Σ_{i=1}^{#hits} 1/f_i    (5.5)
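The two metrics follow directly from Equations 5.4 and 5.5; the rank list in the sketch below is invented for illustration:

```python
def hit_rate_and_arhr(ranks, k=10):
    """ranks: rank f of the withheld relevant book per test case (1 = top)."""
    hits = [f for f in ranks if f <= k]
    hr = len(hits) / len(ranks)                     # Equation 5.4
    arhr = sum(1.0 / f for f in hits) / len(ranks)  # Equation 5.5
    return hr, arhr

# Five test users: relevant book ranked 1st, 3rd, 12th, 7th and 400th
print(hit_rate_and_arhr([1, 3, 12, 7, 400]))   # hr = 3/5 = 0.6
```

ARHR rewards systems that place the relevant book near the top of the list, whereas HR only checks whether it entered the top k.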

Results

Figure 5.2 shows the HR and ARHR scores of the fourteen recommendation techniques. The best-performing methods are CB and tf-idf-emb built with links; the latter achieved the highest HR, while CB reached the best ARHR. Tweet-based tf-idf-emb has results similar to CB's; the results of these two methods are not statistically significantly different. In general, tf-idf gives better results than NMF, which is expected given the difference in the numbers of topic terms. Comparing the algorithms with and without word embedding vectors, the addition of word embeddings enhances performance; the differences are statistically significant except for the tf-idf-emb of tweets and the tf-idf-emb of tweets and links. There is no consistency in

Figure 5.2: The comparison of TMB approaches, CB and the random system.

the effect of using tweets or links. For example, using links with tf-idf gives the highest score, but with NMF-emb it gives the lowest score among all data categories. The random system could not bring any relevant book into the top k.

5.5 Discussion

TMB gives performance similar to a traditional CB system without the need for a user rating history. This does not mean that TMB can work independently as a standalone system; more comprehensive experiments are required to verify that.

One suggested method, which gives the best results, is to use word embeddings of top tf-idf terms. The use of tf-idf weighting allows the capturing of distinctive topics frequently discussed by one user in contrast with those discussed by her community. To keep the most discriminative topics, we only kept the top tf-idf words. Otherwise, the average word embedding of all terms in a Twitter timeline would be skewed towards less significant terms. We think that this method obtains fine-grained interests not extensively shared among users. For example, a popular interest such as fiction would have a low idf value and would not be expected to appear in the top tf-idf list. On the other hand, a term that is not as popular, like mythology, may have a high idf value and be in the top tf-idf list. At the same time, many users may share common interests/topics, which leads to a low idf weight; however, if a topic is predominantly discussed by one user, it will have a high tf and consequently a high tf-idf value.

Figure 5.3: Word embeddings of terms in a user profile and two book descriptions.
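The profile-building step described above can be sketched as follows. This is a simplified illustration rather than our exact implementation: the `embeddings` dictionary stands in for whatever pre-trained word vectors are used, and the function names are ours.

```python
import math
from collections import Counter

def top_tfidf_terms(user_text, community_texts, top_n=10):
    """Rank a user's terms by tf-idf against the community.

    tf: term frequency in the user's own tweets/links.
    idf: rarity across all users, so topics everyone discusses
    (e.g., 'fiction') drop out while distinctive ones ('mythology') stay.
    """
    docs = [t.split() for t in community_texts] + [user_text.split()]
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                    # document frequency per term
    tf = Counter(user_text.split())
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

def profile_vector(terms, embeddings):
    """Average the word vectors of the retained top tf-idf terms
    (embeddings: term -> list of floats, e.g., pre-trained vectors)."""
    vecs = [embeddings[t] for t in terms if t in embeddings]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

With a community that mostly tweets about "fiction", a user who repeatedly mentions "mythology" gets that term at the top of her profile, as described above.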

All variations of TMB could identify books that interest the user out of a thousand other books, with the link-based tf-idf-emb retrieving the highest number of books. To illustrate how word embeddings contribute to the recommendations made by link-based tf-idf-emb, we plotted (Figure 5.3) the word embeddings of one user profile (b) and his two book profiles (c, d). Section (a) of Figure 5.3 shows the closeness of word vectors found in the UP and BPs.

One can notice the variety of topics in the user profile. Also, some topics in the user profile might be semantically similar which may indicate high interest. User interests might be broad and not only related to the books they already preferred.

The textual content of the links can be longer than that of the tweets, and so possibly capture a wider range of interests. Thanks to word embeddings, however, the performance of models that adopt tweets increased dramatically. Word vectors could enrich the topics by including the context of terms. Their improvement of tweet-based algorithms could be due to the presence of hashtags, which summarize a whole subject or event.
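The comparison between a user profile (UP) and book profiles (BPs) reduces to cosine similarity over the averaged embedding vectors. A minimal sketch (function names and the toy profiles are ours):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_books(user_profile, book_profiles):
    """Order candidate books by similarity to the user profile; the
    held-out relevant book's position is the rank f used in ARHR."""
    scored = sorted(book_profiles.items(),
                    key=lambda kv: -cosine(user_profile, kv[1]))
    return [title for title, _ in scored]
```

For example, a user profile vector close to a mythology book's profile vector places that book above an unrelated one in the ranked list.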

5.5.1 Error analysis

We conducted error analysis to investigate the differences in performance between CB and

TMB (tf-idf based on links and word embeddings). In a leave-one-out evaluation, we tested

five books for each user. The two systems retrieved the same number of relevant books when giving recommendations to 31 users. CB could retrieve more relevant books than TMB for 17 users, whereas TMB surpassed CB when dealing with 21 users. For better understanding, we analyzed each system’s best recommendations.

CB retrieved three out of five relevant books for users A and B, while TMB suggested only one book to user A and none to B. User A had 512 books in the CB training set, while user B had 542. Of the three books recommended to A, only one had the same author as a book in the training set; that is to say, CB relied on book descriptions to make the recommendation.

The one book which TMB recommended to user A had a cosine similarity of 0.69 and shared words that were semantically close to the user's topics (e.g., drawing vs. illustrator). As for user A, only one book recommended by CB to user B shared the same author with a book in the CB training set. The possible reason why TMB could not suggest any book to user B is that the user's topics were related to political issues (e.g., abortion, immigration), while the user's readings were diverse. For example, user B's five relevant books addressed history, romance, philosophy and education. The user's interests were broad, while his discussed topics on Twitter were narrow and related to current issues.

Users C, D and E received five, three and two recommended books from TMB, respectively, while CB could recommend two books for user C and none for users D and E. User C had 140 books in the CB training set. Most of his readings were related to religious matters. The user's topic profile reflected these interests: the top tf-idf words were glory, theology and gospel.

The lowest cosine similarity between the user topic profile and the five retrieved books was 0.72. User D had 13 books in the CB training set. All books in the training and test sets were written by distinct authors. The user's topics were related to philosophy and religion, as were the user's readings (see Figure 5.3). User E had 528 books in the CB training set. Her topic profile covered wide interests (e.g., courtroom, femininity, mutiny, and heroin) and the two recommended books were slightly similar. One of them, titled "Against the Country", was described with words such as offender, antihero and blast. The other book was described with such words as assassination and murder.

5.5.2 Limitations

TMB is a useful solution for the cold-start issue when users are active on social media and willing to share their data; however, this is not the case for all users. Therefore, a new user can have the option to fill in a questionnaire about their reading preferences or to share their social media account. Also, a deployed system should allow users to modify their tf-idf topics, as users may share unpleasant experiences on Twitter and would not like to receive book recommendations on the subject. We considered relying solely on tweets with positive sentiment; however, users may write about their interests in a negative context (e.g., an environment-conscious user discussing climate change).

Another limitation is that we experimented with readers who shared a link to Goodreads, which indicates they may discuss their readings on Twitter. To evaluate TMB for other users who do not address their readings on social media, a user study is required because their reading history is not available.

Also, social media accounts might be used for purposes other than personal use (e.g., book promotion). In this research, we assume that no user's data is fake; however, in a deployed system such issues have to be addressed.

5.6 Conclusion and Future Work

In this chapter, we:

• Proposed TMB, a recommendation component that helps existing RSs to overcome the user cold start issue. To our knowledge, this is the first book RS that exploits social media other than book-cataloguing Web sites.

• Automatically represented the dominant topics discussed by a user without searching for predefined concepts, recognizing named entities or developing ontologies.

• Modeled a user's interests by acquiring her distinctive topics using tf-idf, which yields the topics frequently discussed by the user in contrast with those discussed by her community, and then representing them as word embeddings.

• Collected a dataset that contains user profiles on two Web sites, namely Twitter and Goodreads.

• Evaluated TMB in a top-k recommendation scenario, where it retrieved a number of books comparable to CB's, even though CB relied on users' rating history while TMB only needed their social profiles.

• Tested many variations of TMB, which could identify books that interest users out of a thousand other books, with the link-based tf-idf-emb retrieving the highest number of books.

This work can be extended in the following ways:

• Since hashtags carry more meaning than other terms on Twitter, an interesting approach would be to create hashtag-based profiles that are enriched with word vectors. Also, user profiles could include other parts of speech, particularly verbs and adjectives.

• Investigate how the use of recent tweets/links vs. old ones affects the recommendation accuracy.

• To enhance the performance of the system, embedding models can be trained over tweets and books.

• Since the calculation of one user's tf-idf topics relies on the number and interests of the other users in the system, it is interesting to investigate the effect of the number of users on the quality of tf-idf topics.

• The same approach can be tested on other social media platforms and on other user-generated texts.

• It is interesting to test TMB's ability to generate users' interests in real time.

Chapter 6

Conclusion and Future Work

6.1 Summary

Literary book recommender systems (RSs) can promote the practice of reading for pleasure, which has been declining in recent years. Only a few previously proposed book RSs actually take the book texts into consideration. In this thesis, we developed book representations learned from their texts and showed that they lead to improved recommendation performance compared to many state-of-the-art content-based and collaborative filtering systems. The systems we designed are complemented with TMB (Topic Model-Based Book Recommendation Component), which recommends books for new users.

The first system, discussed in chapter 3, represents books as the penultimate layer of a convolutional neural network that learns from books' word sequences to predict their authors.

To our knowledge, this is the first work that applies the information learned by an author-identification model to book recommendations. The AuthId book representations are input to a regressor (support vector regression) which finds patterns in books read by the user. We showed that this method outperforms many state-of-the-art CB and CF systems. We also evaluated the recommendation accuracy given AuthId book representations built from shorter texts, and concluded that the longer the included texts, the more accurate the recommendations.

Our qualitative analysis showed that similar book representations are annotated similarly by experts. We also performed error analysis and found that many relevant books ranked in the top ten list are retrieved from a large collection of books written by the same authors.

In chapter 4, we investigated the textual elements that could play a role in generating high-quality book recommendations. Two recommendation algorithms were trained on more than a hundred features learned from the full text of the books, and both algorithms generated more accurate lists of suggested books than many competitive baselines. We highlighted the variables that are considered important using an ensemble of randomized trees and produced book suggestions based on them. Further qualitative analysis showed that similar books generated by our system share the same author and similar descriptions on NoveList. We also compared an approach that uses linguistic features against the AuthId approach and tested their concatenated book representations.

Since both CBs above are prone to the new user issue, we proposed TMB (in chapter 5), a module that builds a topic model for users from textual content shared voluntarily on their social media, and recommends the books most related to these topics. For each user, we acquired distinctive topics by tf-idf weighting, which yields the topics frequently discussed by the user in contrast with those discussed by the community, and represented them as word embeddings to capture their context. For evaluation purposes, we collected a dataset that contains user profiles from two Web sites, namely Twitter and Goodreads. TMB achieved a recommendation accuracy similar to a traditional CB (a commonly used RS), particularly when word embeddings were deployed.

6.2 Highlights and Challenges

6.2.1 Explainability

As CB systems rely solely on the target user's ratings, they overcome many CF-associated issues such as rating matrix sparsity, the new item issue and shilling attacks. Moreover, CB is a transparent system that provides explainable suggestions. For example, our proposed feature-based RS can provide users with accurate recommendations and a breakdown of the most and least important features that contributed to the accuracy. The system also allows publishers and authors to identify the aspects that matter most for individuals and communities.

The same applies to AuthId and TMB, which recommend books because they have similar writing styles or topics, respectively.

6.2.2 Overspecialization

CB recommender systems' dependence on target user ratings results in an issue called overspecialization, which means they generate non-diverse recommendations. In the literature, overspecialization is addressed by recommending some random items, eliminating the most similar items (e.g., fresh news stories on previously read topics) (Billsus and Pazzani, 2000) and increasing diversity (see (Kaminskas and Bridge, 2016)) (Adomavicius and Tuzhilin, 2005).

One way to help our feature-based RS overcome overspecialization is to allow users to retrieve books based on the features that matter most to them. For example, when conducting the qualitative analysis in section 4.4, we eliminated fiction-based features from book representations, and a book by Henry James that was described as similar in terms of character and storyline was not retrieved. Another technique to manage overspecialization is to develop a hybrid system that uses other recommendation methods (e.g., CF) in addition to CB.

6.2.3 Availability of adequate content

Another drawback of CB is the difficulty of accurately suggesting items without adequate content. While the full texts of books provide abundant content for analysis, the trade-off is the cost of copyright limitations. As stated in section 1.2.2, though there are many recommendation and retrieval systems that process the full texts of books, this is not widely practiced, which could be due to the difficulty of obtaining publisher agreements. The analysis of texts also requires applying NLP techniques that are specific to each natural language.

6.2.4 Books in multiple languages

To generate recommendations for users who read books in multiple languages, the feature-based CB would require tools to analyze texts in each language and create book representations that can eventually be fed to one recommendation algorithm. This is not the case for the authorship-based RS, which needs a language-specific AuthId classifier whose generated book representations are input to different recommendation components to generate separate recommendation lists. In terms of TMB, if user profiles and/or book profiles contain terms from several languages, all the terms can be translated into one language in order to contribute to the word embedding representations. Alternatively, one can map terms from different languages to multilingual word embeddings, as in (Conneau et al., 2018).

Furthermore, learning about translated book preferences raises questions, such as whether a book's likes or dislikes should be attributed to translation quality rather than content. Translated books are expected to share the same topics and themes as their original versions, so our systems should discover similar books whether they are translated or not. For example, in the feature-based approach, if a book is characterized as having subjective language, rich lexical density and few fictional characters, the RS should find books with similar representations regardless of whether they are translated. TMB, on the other hand, considers topics and interests, which should not differ between original and translated books. AuthId representations are also expected to capture style features that reflect an author's topics and word choices, but not sentence structure and more advanced stylistic aspects.

6.2.5 Limitations in the evaluation

As with most work in the RS literature, we conducted an offline assessment using a previously collected dataset, and reported the RS performance in terms of accuracy. However, online evaluations using a deployed RS and user case studies could provide an in-depth analysis of the system, and highlight additional benefits and limitations. It would also be useful to have experts perform a qualitative analysis of books considered similar. Also, though accuracy profoundly influences user satisfaction (Hijikata et al., 2012), other beyond-accuracy goals can be achieved by RSs to ensure quality recommendations, including diversity, serendipity, and novelty. Diversity in RSs involves recommending a list of various item types, serendipity is recommending items that are surprising to the user, and novelty is recommending unknown items. Kaminskas and Bridge (2016) investigated these objectives and surveyed metrics and methods to achieve them. With respect to accuracy, we measured the performance of the system in the top-k scenario, rather than predicting the user's level of interest in the items (e.g., five stars).

While ranking is useful on e-commerce platforms and for multimedia and content recommendations (e.g., news), rating prediction helps users decide about a product.

Due to copyright issues, texts of books other than Gutenberg texts were not accessible for this study, as they were authored in the past 50 years. Gutenberg books are typically categorized as classics, and such books might not appeal to a wide range of ages. Furthermore, genres such as graphic fiction and manga, which are included in readers' advisory recommendations, are not considered in our study. It would be interesting to conduct similar research on books from various times and in different genres.

6.3 Deployment of Our Systems

Our CB systems can work independently or jointly, by recommending items that were highly rated by multiple systems or generating a combined recommendation list. We believe that users should have the choice of which recommendation methodology they are interested in. According to (Smith et al., 2016), readers who tend to appreciate writing style consider it the most appealing factor. The AuthId approach gives users who are interested in writing style a chance to find books/authors similar to the style they prefer. In addition, users can choose the feature-based CB to retrieve books with similar content and/or style. A deployed system can also provide a service to discover the book representations most similar to a queried book/author.

RSs can always complement each other, and it is better to use a hybrid RS. With Netflix, for example, users are presented with lists of movies/shows that include recommendations based on content, popularity, other users' ratings (CF), recency and other aspects. In our system, users could receive lists generated by multiple methods. If permission to use a book's text is not granted, a different RS can include the book in its recommended list. This also applies to TMB, which can be applied if new users have social media accounts and are willing to share their information. New users might prefer to share their social media information instead of completing a questionnaire on their reading preferences, personality traits, and demographics.

6.4 Future Work

Further investigation of our methods could include applying word sense disambiguation, incorporating more factors in the feature-based approach, trying other author identification methods (which could also consider differences in genre and time) and examining the use of hashtags, verbs, and adjectives when building TMB user profiles. Furthermore, recent, more advanced forms of word embeddings such as FastText and ELMo (Embeddings from Language Models) have shown enhanced performance in various NLP applications, and it could be interesting to use them in TMB or as pre-trained embeddings to create AuthId representations.

6.4.1 Deep-learning-based RSs

Neural networks offer many ways to improve RSs. A basic architecture could apply a feedforward NN in RSs (as proposed by He et al. (2017)): representing users and items as embeddings, merging them, and passing them to hidden layers to learn the interactions between users and items.

In a book RS, the system would incorporate additional lookup tables for authors, fictional characters and user demographic information. Such embeddings can be initialized arbitrarily or pre-built, and those generated by the trained model can contribute to understanding user preferences.

For example, a model that captures the interactions between embeddings representing users' gender and fictional characters could provide insights into the fictional character preferences of females or males. We implemented such a system but were unable to achieve high results, which could be due to the limited size of our dataset. However, this is still a promising approach, and further investigation is strongly recommended. It is possible to train the network over multiple large datasets, such as Amazon reviews and BookCrossing. Then, the learned item and author embeddings can be transferred to a model working on smaller datasets with available texts, such as LitRec. Multitask learning is another promising approach: to jointly learn to predict authors (author identification) and user ratings (book recommendations) given book texts.
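The basic architecture can be illustrated with a forward-pass sketch loosely following He et al. (2017); the dimensions are arbitrary, the weights are randomly initialized rather than trained, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, h = 100, 500, 16, 32

# Embedding lookup tables; arbitrary init here, but they could be pre-built
user_emb = rng.normal(scale=0.1, size=(n_users, d))
item_emb = rng.normal(scale=0.1, size=(n_items, d))
# One hidden layer learning user-item interactions
W1 = rng.normal(scale=0.1, size=(2 * d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, 1));     b2 = np.zeros(1)

def predict(u, i):
    """Merge user and item embeddings, pass them through the hidden
    layer, and squash to a preference score in (0, 1)."""
    x = np.concatenate([user_emb[u], item_emb[i]])
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU hidden layer
    z = (hidden @ W2 + b2).item()           # scalar logit
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid score
```

In a trained system, gradient descent on observed ratings would shape the lookup tables, and additional tables (authors, fictional characters, demographics) would be concatenated into `x` in the same way.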

6.4.2 Graph-based RSs

Graph theory has many benefits for book RS researchers. For digital libraries, Huang et al. (2002) recommend books after searching a two-layer graph and traversing three types of links, namely user-user (demographic similarity), item-item (similarity in content) and item-user (purchase transaction). Resemblance among books could be extended to include many elements specific to literature. In the case of poetry, the Graph Poem (Tanasescu et al., 2018) represents poems as nodes in a graph and connects them by edges characterizing topics, themes, metaphors and other shared aspects of poems; the aspects are calculated using tools dedicated to poetic features. It would be exciting to extend graph-based book RSs to include such artistic features. Furthermore, we believe it is worthwhile to investigate how to find similar books using social networks of fictional characters that are linked based on dialogue interactions (as proposed in (Elson et al., 2010)).
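A toy sketch of this kind of two-layer traversal, in the spirit of Huang et al. (2002); the graph, node names and link types below are invented for illustration.

```python
from collections import defaultdict

# Nodes are users (u*) and books (b*); each edge carries one of the
# three hypothetical link types: demographic, content, purchase.
edges = [
    ("u1", "b1", "purchase"), ("u2", "b1", "purchase"),
    ("u2", "b2", "purchase"), ("u1", "u3", "demographic"),
    ("u3", "b3", "purchase"), ("b2", "b3", "content"),
]
adj = defaultdict(set)
for a, b, _ in edges:
    adj[a].add(b)
    adj[b].add(a)

def recommend(user, hops=3):
    """Books reachable within `hops` link traversals, excluding books
    the user is already directly connected to."""
    frontier, seen = {user}, {user}
    for _ in range(hops):
        frontier = {n for f in frontier for n in adj[f]} - seen
        seen |= frontier
    owned = adj[user]
    return sorted(n for n in seen if n.startswith("b") and n not in owned)
```

Here `recommend("u1")` reaches b2 via u1 → b1 → u2 → b2 and b3 via the demographic link u1 → u3 → b3, while b1 is excluded as already purchased.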

6.4.3 The effect of mood on book selection

In one study, 194 participants shared several elements that influence why they choose or reject a particular book. Reader mood was found to be the core element when choosing a book, as it influences the desired reading experience. If readers are stressed and tired, they typically read familiar, safe, easy and positive books, while if they are relaxed they usually go for unusual, risky, challenging or dark books (Ross, 2000). The field of emotion- and mood-aware RSs is already active, though to our knowledge it has not yet been applied to book recommendations.

6.4.4 Books for non-readers

Book RSs can also appeal to non-readers and encourage them to take up reading. An automatic personalized system can understand individual interests and reading-related issues (e.g., readability) using social media data, then make recommendations accordingly. The non-reader issue resembles user cold start in RSs, except that non-readers are not interested in receiving recommendations in the first place. Recognizing what kindles the spark of reading can help attract new users.

Appendix A

Details about the Linguistic Features

A.1 The full list of LIWC features with examples

Table A.1: LIWC Dimensions, labels and examples

LIWC Dimension Output Label Example

Word Count WC

Summary Variable

Analytical Thinking Analytic

Clout Clout

Authentic Authentic

Emotional Tone Tone

Language Metrics

Words per sentence WPS

Words>6 letters Sixltr

Dictionary words Dic


Function Words function it, to, no, very

Total pronouns pronoun I, them, itself

Personal pronouns ppron I, them, her

1st pers singular i I, me, mine

1st pers plural we we, us, our

2nd person you you, your, thou

3rd pers singular shehe she, her, him

3rd pers plural they they, their, they’d

Impersonal pronouns ipron it, it’s, those

Articles article a, an, the

Prepositions prep to, with, above

Auxiliary verbs auxverb am, will, have

Common adverbs adverb very, really

Conjunctions conj and, but, whereas

Negations negate no, not, never

Grammar Other

Regular verbs verb eat, come, carry

Adjectives adj free, happy, long

Comparatives compare greater, best, after

Interrogatives interrog how, when, what

Numbers number second, thousand

Quantifiers quant few, many, much

Affect Words affect happy, cried

Positive emotion posemo love, nice, sweet

Negative emotion negemo hurt, ugly, nasty

Anxiety anx worried, fearful

Anger anger hate, kill, annoyed

Sadness sad crying, grief, sad

Social Words social mate, talk, they

Family family daughter, dad, aunt

Friends friend buddy, neighbor

Female referents female girl, her, mom

Male referents male boy, his, dad

Cognitive Processes cogproc cause, know, ought

Insight insight think, know

Cause cause because, effect

Discrepancies discrep should, would

Tentativeness tentat maybe, perhaps

Certainty certain always, never

Differentiation differ hasn’t, but, else

Perceptual Processes percept look, heard, feeling

Seeing see view, saw, seen

Hearing hear listen, hearing

Feeling feel feels, touch

Biological Processes bio eat, blood, pain

Body body cheek, hands, spit

Health/illness health clinic, flu, pill

Sexuality sexual horny, love, incest

Ingesting ingest dish, eat, pizza

Core Drives and Needs drives

Affiliation affiliation ally, friend, social

Achievement achieve win, success, better

Power power superior, bully

Reward focus reward take, prize, benefit

Risk/prevention focus risk danger, doubt

Time Orientation

Past focus focuspast ago, did, talked

Present focus focuspresent today, is, now

Future focus focusfuture may, will, soon

Relativity relativ area, bend, exit

Motion motion arrive, car, go

Space space down, in, thin

Time time end, until, season

Personal Concerns

Work work job, majors, xerox

Leisure leisure cook, chat, movie

Home home kitchen, landlord

Money money audit, cash, owe

Religion relig altar, church

Death death bury, coffin, kill

Informal Speech informal

Swear words swear fuck, damn, shit

Netspeak netspeak btw, lol, thx

Assent assent agree, OK, yes

Nonfluencies nonfl er, hm, umm

Fillers filler Imean, youknow

All Punctuation Allpunc

Periods Period

Commas Comma

Colons Colon

Semicolons SemiC

Question marks QMark

Exclamation marks Exclam

Dashes Dash

Quotation marks Quote

Apostrophes Apostro

Parentheses (pairs) Parenth

Other punctuation OtherP

A.2 The full list of importance values generated by ET

Table A.2: All Features ordered according to ET importance

values.

Feature importance value

see 0.013115

author 0.011821

percept 0.011377

auxverb 0.010999

num_char_mention 0.010717

shehe 0.010623

Length of text 0.010545

num_said 0.010375

concrete 0.01029

Percent of text adverbs 0.009883

num_places 0.009876

prep 0.00986

article 0.009777

objective 0.009756

Average variance in sentence length 0.009679

Average length of sentence 0.009588

abstract 0.009544

drives 0.009516

Period 0.009499

bio 0.009443

negate 0.009435

certain 0.009432

Analytic 0.009335

Dic 0.009322

function 0.009265

focusfuture 0.009192

average_syllable_per_word 0.009091

polarity 0.009015

body 0.008949

num_uniuqe_char 0.008926

Average number of commas per sentence 0.008911

sad 0.00891

subjective 0.008905

num_unique_places 0.008901

literary 0.008883

QMark 0.008805

discrep 0.008753

power 0.008701

female 0.0087

Percent dialogue by female characters 0.008679

Parenth 0.008655

average_dialog_length 0.008624

differ 0.008618

ipron 0.008606

Lexical density 0.008599

colloquial 0.008582

achieve 0.008571

reward 0.008555

SemiC 0.008518

feel 0.008514

they 0.008489

conj 0.008487

Authentic 0.008467

relativ 0.008449

Apostro 0.00842

Tone 0.008333

Sixltr 0.008323

focuspresent 0.008321

AllPunc 0.008259

anger 0.008228

Percent of female characters 0.008225

social 0.008191

motion 0.00819

family 0.008173

compare 0.008145

interrog 0.008145

male 0.008143

cogproc 0.008138

Percent of text verbs 0.008034

death 0.008017

space 0.008012

Quote 0.007942

precentage_dialog_text 0.007935

posemo 0.007929

Dash 0.007917

Average variance in paragraph length 0.007913

number 0.007887

i 0.007878

work 0.007852

health 0.00784

Average length of paragraph 0.00782

Exclam 0.007805

insight 0.007796

home 0.007758

Comma 0.007743

quant 0.007738

focuspast 0.007691

hear 0.007641

pronoun 0.007588

flesch_reading_ease_score 0.007559

ppron 0.007513

affect 0.007407

Percent of text adjectives 0.007347

we 0.007288

risk 0.007285

relig 0.007254

nonflu 0.007241

Percent of text nouns 0.007213

negemo 0.007209

you 0.007191

friend 0.007177

affiliation 0.007156

tentat 0.00713

money 0.007112

Colon 0.007068

Percent of text Latinate words 0.007011

time 0.006879

assent 0.006863

Average length of words 0.006849

cause 0.006838

Clout 0.006629

ingest 0.006626

sexual 0.006617
informal 0.006605

leisure 0.006556
netspeak 0.006444

anx 0.00636

swear 0.006266

filler 0.005838

A.3 The correlation values between highly correlated fea-

tures

Table A.3: Highly correlated features in descending order

feature1 feature2 spearmanr

number of times places are mentioned number of unique places 0.945322

relativ space 0.914507

average_syllable_per_word Sixltr 0.914247

Average length of sentence Average number of commas per sentence 0.896296

Average length of sentence Average variance in sentence length 0.891417

Analytic article 0.882791

number of fictional characters mentions number of dialogs 0.873118

objective Sixltr 0.873

affect posemo 0.866654

Average number of commas per sentence Average variance in sentence length 0.85241

number of fictional characters mentions Length of text 0.851341

bio body 0.843713

Dic function 0.837702

AllPunc Quote 0.829261
percept see 0.824278
number of fictional characters mentions number of unique fictional characters 0.823692
number of dialogs Length of text 0.821914
number of unique fictional characters Length of text 0.820087
colloquial AllPunc 0.818877
informal nonflu 0.818846

Tone posemo 0.818409

Average length of words Sixltr 0.816651

Percent of text Latinate Sixltr 0.814427 words

Percent of text Latinate words average_syllable_per_word 0.811298

Average length of paragraph Average variance in paragraph length 0.809659
concrete relativ 0.80277
negate differ 0.800693
number of unique fictional characters number of dialogs 0.790596
objective average_syllable_per_word 0.790023

Clout social 0.788127
concrete space 0.787548

Average length of words Percent of text Latinate words 0.781971
cogproc insight 0.777673
number of unique places Length of text 0.772651
abstract objective 0.772035
number of times places are mentioned Length of text 0.771608
auxverb discrep 0.770456

Average number of commas per sentence Comma 0.765014
relativ motion 0.762224
cogproc tentat 0.759638
cogproc discrep 0.753661
concrete motion 0.751549
number of unique fictional characters number of unique places 0.751487
colloquial Apostro 0.740577

Percent of text Latinate words objective 0.738185
number of unique fictional characters number of times places are mentioned 0.73571

Average length of words objective 0.731746

Average length of words average_syllable_per_word 0.73095

Percent dialogue by female characters Percent of female characters 0.727571
cogproc differ 0.727364
we affiliation 0.720784

QMark Quote 0.71701
subjective affect 0.71085
number of fictional characters mentions number of times places are mentioned 0.709968

Average length of para- Average length of sentence 0.706295 graph shehe social 0.704421 abstract Sixltr 0.703331

AllPunc QMark 0.702464 feel body 0.701584 negemo anx 0.701336 cogproc certain 0.699618

Lexical density Percent of text nouns 0.696846 auxverb cogproc 0.695565 abstract average_syllable_per_word 0.688145 drives affiliation 0.686134

Percentage of dialog text you 0.685684 121

social female 0.678266 function auxverb 0.67692 literary relig 0.676712

Dic pronoun 0.676322 colloquial Quote 0.674849 percept feel 0.673496 number of fictional char- number of unique places 0.672415 acters mentions drives power 0.670726

Percent of text Latinate abstract 0.669729 words subjective posemo 0.662715

Percent of text verbs focuspast 0.662678 you QMark 0.660814 function pronoun 0.660038

Percent of female charac- female 0.658421 ters

Percentage of dialog text Quote 0.648597 motion space 0.64802 auxverb focuspresent 0.643581 pronoun i 0.642205 informal assent 0.640961 feel bio 0.639746 122

ipron cogproc 0.635165 you focuspresent 0.633838 percept body 0.631744 family female 0.627357 function differ 0.626209 subjective female 0.62268 informal AllPunc 0.621785

Dic auxverb 0.62141

AllPunc Apostro 0.621276 percept hear 0.618708 number of dialogs number of times places are 0.616882

mentioned literary abstract 0.61599 affect sad 0.613916 negemo sad 0.612396 colloquial informal 0.61116 assent nonflu 0.608588

Average length of words Analytic 0.608014 social family 0.606708 number of dialogs number of unique places 0.605526 concrete feel 0.604919 shehe female 0.60271 negate cogproc 0.597715 123

concrete see 0.59678 cogproc cause 0.596508

Authentic we 0.596437 function negate 0.596292

Average length of sentence prep 0.595493 affect negemo 0.594628

Average length of para- Average variance in sen- 0.594148 graph tence length pronoun social 0.591462

Percent dialogue by fe- female 0.590371 male characters you Quote 0.590343 see body 0.589186 concrete percept 0.588753 shehe male 0.587868

Average variance in para- Average variance in sen- 0.587081 graph length tence length insight tentat 0.586467

Average length of words prep 0.586078 we drives 0.584292 average_syllable_per_word work 0.581434 auxverb differ 0.580869 focuspresent focusfuture 0.579804 124

Percent of text verbs auxverb 0.57978 informal netspeak 0.577181

Dic cogproc 0.577118 discrep tentat 0.576683 relativ time 0.576284 objective Analytic 0.575388

Percentage of dialog text focuspresent 0.57354

Average length of words Percent of text adjectives 0.571221 negemo anger 0.570031

Average length of sentence Average variance in para- 0.569986

graph length function ipron 0.568424 average_dialog_length Average length of para- 0.567453

graph

Average length of para- prep 0.567062 graph subjective social 0.565026 auxverb focusfuture 0.564562

Clout shehe 0.564429 discrep differ 0.563602 see feel 0.563587 discrep focuspresent 0.56011 percept bio 0.556452 125

Average length of para- Average number of com- 0.555968 graph mas per sentence

Average number of com- SemiC 0.554327 mas per sentence compare quant 0.551965 posemo friend 0.551721 auxverb negate 0.551566

Average variance in sen- SemiC 0.551301 tence length function cogproc 0.550061 pronoun insight 0.549095 bio health 0.547782

Dic negate 0.544489 colloquial QMark 0.54304

Lexical density colloquial 0.542985 nonflu AllPunc 0.540903

Percentage of dialog text AllPunc 0.534266

AllPunc Period 0.534225 informal Quote 0.533969 informal Apostro 0.533773 objective article 0.531588

Percent of text verbs discrep 0.531541

Average length of sentence compare 0.531071 126

Percentage of dialog text QMark 0.530546

Percent of text nouns Analytic 0.528371 quant certain 0.528215 pronoun you 0.525025 anger death 0.523704 pronoun cogproc 0.523235

Analytic Sixltr 0.51951 you AllPunc 0.518611

Average variance in sen- literary 0.518331 tence length

Percent of text nouns colloquial 0.51783 ipron auxverb 0.517657

Average length of sentence SemiC 0.516775 ipron insight 0.516529 cogproc focuspresent 0.514917

Dic discrep 0.512462

Percent of text adverbs cogproc 0.511709 pronoun auxverb 0.511313

flesch_reading_ease_score Period 0.510352

Average length of words article 0.509311

Average length of para- compare 0.508174 graph posemo female 0.508158 127

pronoun discrep 0.508118 concrete body 0.507745

Average number of com- literary 0.506625 mas per sentence

Average length of words abstract 0.506438 posemo family 0.506266 polarity Tone 0.505959

Average length of sentence conj 0.50536

Average variance in sen- prep 0.505066 tence length function discrep 0.50334

Dic differ 0.502888 ipron tentat 0.50195

Percent of text nouns AllPunc 0.50145

Average variance in sen- Comma 0.501399 tence length concrete negate -0.50199 subjective motion -0.5027

Percent of text Latinate motion -0.50286 words auxverb body -0.50416

Percent of text nouns differ -0.50608 cogproc space -0.50751 128

average_syllable_per_word focuspast -0.50947

Average variance in para- Quote -0.51048 graph length

Average variance in para- Period -0.51068 graph length

Period SemiC -0.51092

Sixltr conj -0.51179 colloquial function -0.51356 concrete posemo -0.51413 concrete cogproc -0.51553 abstract feel -0.51608 social space -0.51611

Average length of words AllPunc -0.51622 percept achieve -0.51675 abstract see -0.51699

Average length of words informal -0.5209

Average length of words focuspresent -0.52257

Tone anger -0.52406 prep Period -0.52435

Analytic differ -0.52578

Percent of text Latinate concrete -0.52953 words abstract space -0.53026 129

Percent of text nouns cogproc -0.53136 prep Apostro -0.53159

Percent of text Latinate percept -0.53195 words

Average length of sentence flesch_reading_ease_score -0.53386

Percent of text nouns pronoun -0.53396

Average length of words colloquial -0.53673

Percent of text verbs literary -0.5371

Average length of words hear -0.53733 average_syllable_per_word conj -0.53807

Percent of text verbs Analytic -0.53855 negate space -0.5415 abstract focuspast -0.54166

Percent of text adverbs Analytic -0.54855 article social -0.54883 article cogproc -0.55084 posemo relativ -0.55387 prep QMark -0.55665 affect relativ -0.55839

Average number of com- flesch_reading_ease_score -0.562 mas per sentence article focuspresent -0.56286 130

Average length of para- colloquial -0.56612 graph

Percent of text adverbs article -0.56704

Average variance in sen- flesch_reading_ease_score -0.56901 tence length

Average length of sentence colloquial -0.56922

Analytic social -0.57215 abstract percept -0.57243

Average length of sentence Quote -0.57512

Lexical density Dic -0.57882

Average length of sentence AllPunc -0.59006 concrete subjective -0.59208

Authentic social -0.59215 subjective relativ -0.59584

Analytic function -0.59648 affect space -0.59982

Average length of para- Period -0.60639 graph article discrep -0.60818 prep Quote -0.61081

Average length of sentence QMark -0.61478 abstract flesch_reading_ease_score -0.61522 posemo space -0.61974 131

Percent of text Latinate flesch_reading_ease_score -0.62445 words

Average length of words flesch_reading_ease_score -0.62627 subjective space -0.62741

Analytic cogproc -0.63047

Analytic focuspresent -0.63346 abstract relativ -0.63359

flesch_reading_ease_score average_syllable_per_word -0.63398

Clout Authentic -0.63851

Average length of para- QMark -0.63989 graph

Analytic Dic -0.63992 prep AllPunc -0.64475 objective flesch_reading_ease_score -0.64976 colloquial prep -0.67435 abstract motion -0.67524

flesch_reading_ease_score Sixltr -0.67541

Average length of para- AllPunc -0.68092 graph

Analytic auxverb -0.68319

Analytic discrep -0.68608 we shehe -0.6913 pronoun article -0.71055 132

Average length of para- Quote -0.7361 graph

Lexical density function -0.74088

Authentic shehe -0.74155

Percent of text nouns Dic -0.74807 abstract concrete -0.79531

Percent of text nouns function -0.79635

Analytic pronoun -0.83538

Average number of com- Period -0.84055 mas per sentence

Average variance in sen- Period -0.86743 tence length

Average length of sentence Period -0.91302 Appendix B

Additional TMB Preprocessing Information

B.1 100 most frequent nouns in Oxford English Corpus

time, year, people, way, day, man, thing, woman, life, child, world, school, state, family, student, group, country, problem, hand, part, place, case, week, company, system, program, question, work, government, number, night, Mr, point, home, water, room, mother, area, money, story, fact, month, lot, right, study, book, eye, job, word, business, issue, side, kind, head, house, service, friend, father, power, hour, game, line, end, member, law, car, city, community, name, president, team, minute, idea, kid, body, information, back, parent, face, others, level, office, door, health, person, art, war, history, party, result, change, morning, reason, research, girl, guy, food, moment, air, teacher.


B.2 Words with low IDF weights

The nouns with the lowest IDF weights, which means they are the ones discussed by most users: book, love, day, review, read, win, time, today, giveaway, page, story, author, blog, romance, life, video, way, work, series, release, kindle, post, feel, writing, hope, star, photo, novel, end, tour, show, copy, amazon, tweet, news, christmas, game, movie, fun, twitter, word, something, fantasy, help.

Appendix C

Additional Experiments

This appendix shows complementary experiments as follows:

C.1 The difference in performance when predicting binary vs. 1-5 ratings

The tables in this section extend tables 3.2 and 4.1, respectively, by showing the performance of predicting different forms of user feedback. In general, a regressor trained on binary output is expected to learn to place relevant books in the top ranks. This is the case for the feature-based RSs, as shown below. However, SVR with AuthId book representations performs better when trained on books rated 1-5, as shown in the table below.
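The Precision@k and Recall@k figures reported in these tables can be computed per user and then averaged over all users. A minimal sketch of the per-user computation (the function name is ours, not from the thesis code):

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision@k and Recall@k for a single user.

    ranked   -- recommended item ids, best first
    relevant -- set of item ids the user actually liked (test set)
    k        -- cutoff rank
    """
    top_k = ranked[:k]
    # Count relevant items that made it into the top-k list.
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a user with two relevant books, one of which appears in the top 3 recommendations, gets a precision of 1/3 and a recall of 1/2.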

Table C.1: Extension of the results presented in table 3.2

Method  Precision@k  Recall@k

SVR_AuthId (1-5) 0.162 0.325

SVR_AuthId (binary) 0.15038 0.3053


Table C.2: Extension of the results presented in table 4.1

Method  Precision@k  Recall@k

ER_1000_3features (Binary) 0.168756 0.363426

ER_1000_3features (1-5 ratings) 0.168483 0.361962

KNN (Binary) 0.169846 0.372023

KNN (1-5 ratings) 0.167847 0.367564

C.2 Embeddings of fictional characters and places

We developed embeddings of fictional characters, originally intending to use them in a deep-learning-based RS that would learn from interactions between a user's demographic information and her favorite characters. However, due to the size of the dataset, that RS could not perform well. Therefore, we instead tried to retrieve books with similar fictional-character and place embeddings, as in the table below. Similar to novel2vec (Grayson et al., 2016), which was not used in RSs, we extract the occurrences of characters and create word embeddings for them. In novel2vec, the annotation of characters was done manually; to automate the process, we used a fiction-aware entity recognition tool (LitNER). The goal is to characterize a book by two vectors representing the average of the embeddings of persons and places, respectively. One could use only the main character to represent a book, but it is not easy to decide automatically which character is the main one. Is it the most frequently mentioned name? The average of all person/place embeddings allows us to capture the variability in characters, with an emphasis on the frequent ones.

First, names of people and places mentioned in a book’s text are extracted. To ensure that named entities are book-specific, all occurrences of an identified named entity in the text of a book are labeled with the book id (e.g., Raj_11924_PERSON). Then, word embeddings are generated for all words occurring in the book corpus (with tagged entities). Finally, an average vector of a book’s characters is calculated to create what we call a person embedding.

Place embeddings are set up in the same way.

To develop the word2vec embeddings, we filtered out stop words and punctuation. We experimented with multiple word2vec parameters, including 100 or 300 dimensions and window sizes of 5 or 10. We created average book vectors (whether based on persons/places or on all words) and then, using cosine similarity, recommended the k books most similar to the user query, which is the average of the vectors of the books in the training set. Some books contain no mentions of places; hence, we discarded them while ranking books. The results in the table below show that the best-performing doc2vec from table 4.1¹ was statistically significantly better than all the presented approaches. Yet, an interesting observation is that person embeddings perform better than embeddings of all words. One apparent reason is that person embeddings are tagged with their book id, which makes them similar to doc2vec. Even though embeddings of persons and places do not give the best accuracy, they could be applied for users who wish to find books with characters similar to their favorites. To download the fictional character and place embeddings, visit https://tinyurl.com/ybn6uaya.
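The steps above (averaging a book's tagged PERSON token vectors into one book vector, then ranking books by cosine similarity against the average vector of the training books) can be sketched as follows. This assumes the per-token vectors have already been trained, e.g. with word2vec; the helper names are ours:

```python
import numpy as np

def book_person_embedding(tokens, vectors, book_id):
    """Average the vectors of a book's tagged PERSON tokens
    (e.g. 'Raj_11924_PERSON' for book 11924) into one book vector."""
    suffix = "_%s_PERSON" % book_id
    ents = [vectors[t] for t in tokens if t.endswith(suffix) and t in vectors]
    return np.mean(ents, axis=0) if ents else None

def rank_books(query_vec, book_vecs, k):
    """Ids of the k books whose person embeddings are most
    cosine-similar to the query vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(book_vecs, key=lambda b: cos(query_vec, book_vecs[b]),
                    reverse=True)
    return ranked[:k]
```

In the evaluation described above, `query_vec` would be the mean of the person embeddings of the books in a user's training set, and the place-embedding variant is identical except for the entity tag.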

¹ Slightly different performance due to the removal of randomly selected items from the ranking list.

Table C.3: The accuracy of book recommendations based on different combinations of embeddings

Method  Precision@k  Recall@k

Doc2Vec_300_5 0.135604 0.301843

person_300_10 0.129155 0.286955

person&place_300_10 0.128701 0.285503

person_100_10 0.125341 0.278149

person&place_100_10 0.125159 0.277872

person_300_5 0.117711 0.263297

person_100_5 0.117348 0.262912

person&place_100_5 0.116349 0.261538

person&place_300_5 0.116258 0.261361

all_words_emb_100_10 0.108356 0.245038

all_words_emb_300_10 0.106812 0.240755

no_person_all_words_100_10 0.105995 0.238612

no_person_all_words_300_10 0.105813 0.237855

all_words_emb_100_5 0.105268 0.237189

all_words_emb_300_5 0.10445 0.234977

no_person_all_words_100_5 0.103906 0.23373

no_person_all_words_300_5 0.102634 0.230241

place_100_10 0.088011 0.198061

place_300_10 0.083197 0.188456

place_100_5 0.074024 0.168023

place_300_5 0.069846 0.158104

F. Abel, Q. Gao, G.-J. Houben, and K. Tao. Analyzing user modeling on twitter for person-

alized news recommendations. In Proc. 19th International Conference on User Modeling,

Adaption, and Personalization, UMAP’11, pages 1–12, Berlin, Heidelberg, 2011. Springer-

Verlag. ISBN 978-3-642-22361-7. URL http://dl.acm.org/citation.cfm?id=

2021855.2021857.

G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems:

a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowl-

edge and Data Engineering, 17(6):734–749, June 2005. ISSN 1041-4347. doi: 10.1109/

TKDE.2005.99.

R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases.

In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94,

pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. ISBN

1-55860-153-8. URL http://dl.acm.org/citation.cfm?id=645920.672836.

H. Alharthi and D. Inkpen. Study of Linguistic Features Incorporated in a Literary Book

Recommender System. In Proceedings of the 34th ACM/SIGAPP Symposium On Applied

Computing (SAC ’19), New York, NY, USA, April 2019. ACM. doi: https://doi.org/10.1145/

3297280.3297382.

139 140

H. Alharthi, D. Inkpen, and S. Szpakowicz. Unsupervised topic modelling in a book rec-

ommender system for new users. In SIGIR 2017 Workshop on eCommerce (ECOM17),

2017. ISBN 978-1-4503-5022-8. doi: 10.1145/3077136.3084367. URL http://

doi.acm.org/10.1145/3077136.3084367.

H. Alharthi, D. Inkpen, and S. Szpakowicz. A survey of book recommender systems. In Jour-

nal of Intelligent Information Systems, volume 51, pages 139–160. Springer, Aug 2018a.

doi: 10.1007/s10844-017-0489-9. URL https://doi.org/10.1007/s10844-017-

0489-9.

H. Alharthi, D. Inkpen, and S. Szpakowicz. Authorship identification for literary book recom-

mendations. In Proceedings of the 27th International Conference on Computational Lin-

guistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 390–400,

2018b. URL https://aclanthology.info/papers/C18-1033/c18-1033.

R. G. Apaza, E. V. Cervantes, L. C. Quispe, and J. O. Luna. Online courses recommendation

based on lda, 2014.

M. C. Ardanuy and C. Sporleder. Clustering of Novels Represented as Social Networks.

LiLT (Linguistic Issues in Language Technology), 12(4), 2016. URL http://csli-

lilt.stanford.edu/ojs/index.php/LiLT.

S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler. Automatically profiling the

author of an anonymous text. Commun. ACM, 52(2):119–123, Feb. 2009. ISSN

0001-0782. doi: 10.1145/1461928.1461959. URL http://doi.acm.org/10.1145/

1461928.1461959.

B. Athiwaratkun and K. Kang. Feature representation in convolutional neural networks. CoRR,

abs/1507.02313, 2015. 141

M. M. Bakhtin. Forms of time and of the chronotope in the novel. In M. Holquist, editor, The

Dialogic Imagination: Four Essays, pages 84–258+. University of Texas Press, 1981. URL

http://books.google.com/books?id=JKZztxqdIpgC.

D. Basak, S. Pal, and D. C. Patranabis. Support Vector Regression. Neural Information Pro-

cessing – Letters and Reviews, 11(10):478–486, 2007.

D. Ben-Shimon, A. Tsikinovsky, L. Rokach, A. Meisles, G. Shani, and L. Naamani. Rec-

ommender system from personal social networks. In K. M. Wegrzyn-Wolska and P. S.

Szczepaniak, editors, Advances in Intelligent Web Mastering: Proc. 5th Atlantic Web In-

telligence Conference – AWIC’2007, Fontainbleau, France, June 25 – 27, 2007, pages 47–

55. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007. ISBN 978-3-540-72575-6. doi:

10.1007/978-3-540-72575-6_8. URL http://dx.doi.org/10.1007/978-3-540-

72575-6_8.

S. Bergamaschi and L. Po. Comparing lda and lsa topic models for content-based movie rec-

ommendation systems. In V. Monfort and K.-H. Krempels, editors, Web Information Sys-

tems and Technologies: 10th International Conference, WEBIST 2014, Barcelona, Spain,

April 3-5, 2014, Revised Selected Papers, pages 247–263. Springer International Publish-

ing, Cham, 2015. ISBN 978-3-319-27030-2. doi: 10.1007/978-3-319-27030-2_16. URL

http://dx.doi.org/10.1007/978-3-319-27030-2_16.

G. S. Berns, K. Blaine, M. J. Prietula, and B. E. Pye. Short- and Long-Term Effects of a Novel

on Connectivity in the Brain. Brain Connectivity, 3(6):590–600, 2013.

D. Billsus and M. J. Pazzani. A hybrid user model for news story classification. In J. Kay,

editor, UM99 User Modeling, pages 99–108, Vienna, 1999. Springer Vienna. ISBN 978-3-

7091-2490-1. 142

D. Billsus and M. J. Pazzani. User modeling for adaptive news access. User Modeling and

User-Adapted Interaction, 10(2-3):147–180, Feb. 2000. ISSN 0924-1868. doi: 10.1023/A:

1026501525781. URL http://dx.doi.org/10.1023/A:1026501525781.

S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media,

Inc., 1st , 2009. ISBN 0596516495, 9780596516499.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–

1022, Mar. 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=

944919.944937.

M. Braunhofer, I. Fernández-Tobías, and F. Ricci. Parsimonious and adaptive contextual in-

formation acquisition in recommender systems. In Proceedings of the Joint Workshop on

Interfaces and Human Decision Making for Recommender Systems co-located with ACM

Conference on Recommender Systems (RecSys 2015), 2015.

L. Breiman. Random forests. Mach. Learn., 45(1):5–32, Oct. 2001. ISSN 0885-

6125. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:

1010933404324.

L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.

Wadsworth and Brooks, Monterey, CA, 1984.

C. S. Brinegar. Mark twain and the quintus curtius snodgrass letters: A statistical test of

authorship. Journal of the American Statistical Association, 58(301):85–96, 1963. ISSN

01621459. URL http://www.jstor.org/stable/2282956.

J. Brooke and G. Hirst. Hybrid models for lexical acquisition of correlated styles. In Sixth Inter-

national Joint Conference on Natural Language Processing, IJCNLP 2013, Nagoya, Japan, 143

October 14-18, 2013, pages 82–90, 2013. URL http://aclweb.org/anthology/

I/I13/I13-1010.pdf.

J. Brooke and G. Hirst. Supervised ranking of co-occurrence profiles for acquisition of contin-

uous lexical attributes. In COLING, 2014.

J. Brooke, A. Hammond, and G. Hirst. Gutentag: an nlp-driven tool for digital humanities

research in the project gutenberg corpus. In Proceedings of the Fourth Workshop on Com-

putational Linguistics for Literature, CLfL@NAACL-HLT 2015, June 4, 2015, Denver, Col-

orado, USA, pages 42–47, 2015. URL http://aclweb.org/anthology/W/W15/

W15-0705.pdf.

J. Brooke, A. Hammond, and T. Baldwin. Bootstrapped text-level named entity recognition for

literature. In Proceedings of the 54th Annual Meeting of the Association for Computational

Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers, 2016.

URL http://aclweb.org/anthology/P/P16/P16-2056.pdf.

J. Brooke, A. Hammond, and G. Hirst. Using models of lexical style to quantify free indirect

discourse in modernist fiction. DSH, 32:234–250, 2017.

D. Brown and A. Krog. Creative non-fiction: A conversation. Current Writing: Text and

Reception in Southern Africa, 23(1):57–70, 2011. doi: 10.1080/1013929X.2011.572345.

URL https://doi.org/10.1080/1013929X.2011.572345.

L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Pret-

tenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varo-

quaux. API design for machine learning software: experiences from the scikit-learn project.

In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages

108–122, 2013. 144

K. C. Beyard-Tyler and H. J. Sullivan. Adolescent reading preferences for type of theme and

sex of character. Reading Research Quarterly, 16:104, 01 1980. doi: 10.2307/747350.

E. Castillejo, A. Almeida, and D. López-De-Ipiña. Alleviating cold-user start problem with

users’ social network data in recommendation systems. In Workshop on Problems and Ap-

plications in AI, ECAI-12, 2012.

G. Chen. A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation. ArXiv

e-prints, Oct. 2016.

J. Chen, R. Nairn, L. Nelson, M. Bernstein, and E. Chi. Short and tweet: Experiments on

recommending content from information streams. In Proc. SIGCHI Conference on Human

Factors in Computing Systems, CHI ’10, pages 1185–1194. ACM, 2010. ISBN 978-1-

60558-929-9. doi: 10.1145/1753326.1753503. URL http://doi.acm.org/10.1145/

1753326.1753503.

Y.-F. Chen. Herd behavior in purchasing books online. Computers in Human Be-

havior, 24(5):1977–1992, 2008. ISSN 0747-5632. doi: http://dx.doi.org/10.1016/

j.chb.2007.08.004. URL http://www.sciencedirect.com/science/article/

pii/S0747563207001458. Including the Special Issue: Internet Empowerment.

X. Cheng, J. Guo, S. Liu, Y. Wang, and X. Yan. Learning topics in short texts by non-negative

matrix factorization on term correlation matrix. In Proc. SIAM International Conference on

Data Mining, pages 749–757. SIAM, 2013. ISBN 978-1-61197-283-2.

D. Choi, J. Kim, E. Lee, C. Choi, J. Hong, and P. Kim. Research for the pattern analysis

of individual interest using sns data: Focusing on facebook. In 2014 Eighth International

Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, pages

36–40, July 2014. doi: 10.1109/IMIS.2014.94. 145

A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without

parallel data. CoRR, abs/1710.04087, 2018.

G. De Francisci Morales, A. Gionis, and C. Lucchese. From chatter to headlines: Har-

nessing the real-time web for personalized news recommendation. In Proc. Fifth ACM

International Conference on Web Search and Data Mining, WSDM ’12, pages 153–162.

ACM, 2012. ISBN 978-1-4503-0747-5. doi: 10.1145/2124295.2124315. URL http:

//doi.acm.org/10.1145/2124295.2124315.

S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent

semantic analysis. Journal of the American Society for Information Science 41, pages 391–

407, 1990.

Y. Deldjoo, M. Elahi, P. Cremonesi, F. Garzotto, P. Piazzolla, and M. Quadrana. Content-

based video recommendation system based on stylistic visual features. Journal on Data

Semantics, 5(2):99–113, Jun 2016. ISSN 1861-2040. doi: 10.1007/s13740-016-0060-9.

URL https://doi.org/10.1007/s13740-016-0060-9.

L. Derczynski, A. Ritter, S. Clark, and K. Bontcheva. Twitter part-of-speech tagging for all:

Overcoming sparse and noisy data. In Proc. International Conference on Recent Advances

in Natural Language Processing. Association for Computational Linguistics, 2013.

M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Trans.

Inf. Syst., 22(1):143–177, Jan. 2004. ISSN 1046-8188. doi: 10.1145/963770.963776. URL

http://doi.acm.org/10.1145/963770.963776.

D. K. Elson, N. Dames, and K. R. McKeown. Extracting social networks from literary fiction.

In Proceedings of the 48th Annual Meeting of the Association for Computational Linguis- 146

tics, ACL ’10, pages 138–147, Stroudsburg, PA, USA, 2010. Association for Computational

Linguistics. URL http://dl.acm.org/citation.cfm?id=1858681.1858696.

I. Fernández-Tobías, M. Braunhofer, M. Elahi, F. Ricci, and I. Cantador. Alleviating the new

user problem in collaborative filtering by exploiting personality information. User Modeling

and User-Adapted Interaction, 26(2):221–255, Jun 2016. ISSN 1573-1391. doi: 10.1007/

s11257-016-9172-z. URL https://doi.org/10.1007/s11257-016-9172-z.

P. A. Flach and N. Lachiche. Confirmation-guided Discovery of First-order Rules with Ter-

tius. Machine Learning, 42(1/2):61–95, Jan. 2001. ISSN 0885-6125. doi: 10.1023/A:

1007656703224. URL http://dx.doi.org/10.1023/A:1007656703224.

L. Flekova and I. Gurevych. Personality profiling of fictional characters using sense-level links

between lexical resources. In Proceedings of the 2015 Conference on Empirical Methods in

Natural Language Processing, pages 1805–1816. Association for Computational Linguis-

tics, 2015. doi: 10.18653/v1/D15-1208. URL http://aclanthology.coli.uni-

saarland.de/pdf/D/D15/D15-1208.pdf.

W. Fucks. On mathematical analysis of style. Biometrika, 39(1/2):122–129, 1952. ISSN

00063444. URL http://www.jstor.org/stable/2332470.

A. L. Garrido, M. G. Buey, S. Escudero, S. Ilarri, E. Mena, and S. B. Silveira. TM-Gen: A

Topic Map Generator from Text Documents. In 2013 IEEE 25th International Conference on

Tools with Artificial Intelligence, pages 735–740, Nov 2013. doi: 10.1109/ICTAI.2013.113.

A. L. Garrido, M. S. Pera, and S. Ilarri. SOLE-R: A Semantic and Linguistic Approach for

Book Recommendations. In Advanced Learning Technologies (ICALT), 2014 IEEE 14th

International Conference on, pages 524–528, July 2014. doi: 10.1109/ICALT.2014.155. 147

P. Geurts and G. Louppe. Learning to rank with extremely randomized trees. In O. Chapelle,

Y. Chang, and T.-Y. Liu, editors, Proceedings of the Learning to Rank Challenge, volume 14

of Proceedings of Machine Learning Research, pages 49–61, Haifa, Israel, 25 Jun 2011.

PMLR. URL http://proceedings.mlr.press/v14/geurts11a.html.

P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63

(1):3–42, Apr 2006. ISSN 1573-0565. doi: 10.1007/s10994-006-6226-1. URL https:

//doi.org/10.1007/s10994-006-6226-1.

M. Girolami and A. Kabán. On an equivalence between plsi and lda. In Proceedings of

the 26th Annual International ACM SIGIR Conference on Research and Development in

Informaion Retrieval, SIGIR ’03, pages 433–434, New York, NY, USA, 2003. ACM. ISBN

1-58113-646-3. doi: 10.1145/860435.860537. URL http://doi.acm.org/10.1145/

860435.860537.

S. Givon and V. Lavrenko. Predicting Social-tags for Cold Start Book Recommendations. In

Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pages

333–336, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-435-5. doi: 10.1145/

1639714.1639781. URL http://doi.acm.org/10.1145/1639714.1639781.

D. Godfrey, C. Johns, C. D. Meyer, S. Race, and C. Sadek. A case study in text mining:

Interpreting twitter data from world cup tweets. CoRR, abs/1408.5427, 2014. URL http:

//arxiv.org/abs/1408.5427.

F. Godin, V. Slavkovikj, W. De Neve, B. Schrauwen, and R. Van de Walle. Using topic

models for twitter hashtag recommendation. In Proc. 22Nd International Conference on

World Wide Web, WWW ’13 Companion, pages 593–596. ACM, 2013. ISBN 978-1- 148

4503-2038-2. doi: 10.1145/2487788.2488002. URL http://doi.acm.org/10.1145/

2487788.2488002.

Y. Goldberg. A primer on neural network models for natural language processing. J. Artif.

Intell. Res., 57:345–420, 2016.

S. Grayson, M. Mulvany, K. Wade, G. Meaney, and D. Greene. Novel2vec: Characteris-

ing 19th century fiction via word embeddings. In Proceedings of the 24th Irish Confer-

ence on Artificial Intelligence and Cognitive Science, AICS 2016, Dublin, Ireland, Septem-

ber 20-21, 2016., pages 68–79, 2016. URL http://ceur-ws.org/Vol-1751/

AICS_2016_paper_48.pdf.

S. R. Gunn. Support vector machines for classification and regression, 1998.

S. Gupta and V. Varma. Scientific article recommendation by using distributed representa-

tions of text and graph. In Proceedings of the 26th International Conference on World

Wide Web Companion, WWW ’17 Companion, pages 1267–1268, Republic and Can-

ton of Geneva, Switzerland, 2017. International World Wide Web Conferences Steering

Committee. ISBN 978-1-4503-4914-7. doi: 10.1145/3041021.3053062. URL https:

//doi.org/10.1145/3041021.3053062.

D. F. Gurini, F. Gasparetti, A. Micarelli, and G. Sansonetti. A sentiment-based approach

to twitter user recommendation. In Proceedings of the Fifth ACM RecSys Workshop on

Recommender Systems and the Social Web co-located with the 7th ACM Conference on

Recommender Systems (RecSys 2013), Hong Kong, China, October 13, 2013., 2013. URL

http://ceur-ws.org/Vol-1066/Paper5.pdf.

I. Guy, N. Zwerdling, I. Ronen, D. Carmel, and E. Uziel. Social media recommendation

based on people and tags. In Proc. 33rd International ACM SIGIR Conference on Research 149

and Development in Information Retrieval, SIGIR ’10, pages 194–201. ACM, 2010. ISBN

978-1-4503-0153-4. doi: 10.1145/1835449.1835484. URL http://doi.acm.org/

10.1145/1835449.1835484.

M. Hahsler, B. Grün, and K. Hornik. arules – A Computational Environment for Mining

Association Rules and Frequent Item Sets. JSS Journal of Statistical Software, 14(15):1–25,

2005. ISSN 1548-7660. doi: 10.18637/jss.v014.i15.

X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filter-

ing. In Proceedings of the 26th International Conference on World Wide Web, WWW

’17, pages 173–182, Republic and Canton of Geneva, Switzerland, 2017. International

World Wide Web Conferences Steering Committee. ISBN 978-1-4503-4913-0. doi:

10.1145/3038912.3052569. URL https://doi.org/10.1145/3038912.3052569.

X. He, Z. He, X. Du, and T.-S. Chua. Adversarial personalized ranking for recommendation. In

The 41st International ACM SIGIR Conference on Research & Development in Infor-

mation Retrieval, SIGIR ’18, pages 355–364, New York, NY, USA, 2018. ACM. ISBN

978-1-4503-5657-2. doi: 10.1145/3209978.3209981. URL http://doi.acm.org/

10.1145/3209978.3209981.

Y. Hijikata, Y. Kai, and S. Nishida. The relation between user intervention and user sat-

isfaction for information recommendation. In Proceedings of the 27th Annual ACM

Symposium on Applied Computing, SAC ’12, pages 2002–2007, New York, NY, USA,

2012. ACM. ISBN 978-1-4503-0857-1. doi: 10.1145/2245276.2232109. URL http:

//doi.acm.org/10.1145/2245276.2232109.

K. Hill. The Arts and Individual Well-Being in Canada, February 2013. URL

http://www.hillstrategies.com/content/arts-and-individual-

well-being-canada. [Online; posted 13 February 2013].

P. J. Holcomb, J. Kounios, J. E. Anderson, and W. C. West. Dual-coding, context-availability,

and concreteness effects in sentence comprehension: an electrophysiological investigation.

Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(3):721–742, 1999.

D. I. Holmes. The analysis of literary style–a review. Journal of the Royal Statisti-

cal Society. Series A (General), 148(4):328–341, 1985. ISSN 00359238. URL http:

//www.jstor.org/stable/2981893.

D. I. Holmes. A stylometric analysis of Mormon scripture and related texts. Journal of

the Royal Statistical Society. Series A (Statistics in Society), 155(1):91–120, 1992. ISSN

09641998, 1467985X. URL http://www.jstor.org/stable/2982671.

Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In

2008 Eighth IEEE International Conference on Data Mining, pages 263–272, Dec 2008.

doi: 10.1109/ICDM.2008.22.

Z. Huang, W. Chung, T.-H. Ong, and H. Chen. A graph-based recommender system for digital

library. In Proceedings of the 2Nd ACM/IEEE-CS Joint Conference on Digital Libraries,

JCDL ’02, pages 65–73, New York, NY, USA, 2002. ACM. ISBN 1-58113-513-0. doi:

10.1145/544220.544231. URL http://doi.acm.org/10.1145/544220.544231.

I. Hussain and P. Munshi. Identifying reading preferences of secondary school students.

Creative Education, 2(5), 2011. doi: 10.4236/ce.2011.25062. URL http:

//www.scirp.org/journal/CTA.aspx?paperID=8853.

J. Hawthorn. A glossary of contemporary literary theory. An Arnold Publication Series. Blooms-

bury Academic, 2000. ISBN 9780340761953. URL https://books.google.ca/

books?id=oFc6tKi_Bm8C.

M. L. Jockers and D. Mimno. Significant themes in 19th-century literature. Poetics, 41(6):

750–769, Dec. 2013. ISSN 0304422X. doi: 10.1016/j.poetic.2013.08.005. URL http:

//dx.doi.org/10.1016/j.poetic.2013.08.005.

R. Johnson and T. Zhang. Effective use of word order for text categorization with convolutional

neural networks. In Proceedings of the 2015 Conference of the North American Chapter of

the Association for Computational Linguistics: Human Language Technologies, pages 103–

112. Association for Computational Linguistics, 2015. doi: 10.3115/v1/N15-1011. URL

http://www.aclweb.org/anthology/N15-1011.

N. Jonnalagedda, S. Gauch, K. Labille, and S. Alfarhood. Incorporating popularity in a per-

sonalized news recommender system. PeerJ Computer Science, 2:e63, 2016.

M. Kaminskas and D. Bridge. Diversity, serendipity, novelty, and coverage: A survey and

empirical analysis of beyond-accuracy objectives in recommender systems. ACM Trans.

Interact. Intell. Syst., 7(1):2:1–2:42, Dec. 2016. ISSN 2160-6455. doi: 10.1145/2926720.

URL http://doi.acm.org/10.1145/2926720.

H. Kapusuzoglu and S. G. Öguducu. A Relational Recommender System Based on Domain

Ontology. In Emerging Intelligent Data and Web Technologies (EIDWT), 2011 International

Conference on, pages 36–41, Sept 2011. doi: 10.1109/EIDWT.2011.15.

Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the

2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014,

October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of

the ACL, pages 1746–1751, 2014. URL http://aclweb.org/anthology/D/D14/

D14-1181.pdf.

J. Koberstein and Y.-K. Ng. Using Word Clusters to Detect Similar Web Documents. In

Proceedings of the First International Conference on Knowledge Science, Engineering

and Management, KSEM’06, pages 215–228, Berlin, Heidelberg, 2006. Springer-Verlag.

ISBN 3-540-37033-1, 978-3-540-37033-8. doi: 10.1007/11811220_19. URL http:

//dx.doi.org/10.1007/11811220_19.

A. Kohrs and B. Mérialdo. Improving collaborative filtering for new-users by smart object

selection. In Proc. International Conference on Media Futures (ICME 2001), Florence,

Italy, May 2001.

D. Kokkinakis and M. Malm. Character profiling in 19th century fiction. In Workshop: Lan-

guage Technologies for Digital Humanities and Cultural Heritage in conjunction with the

Recent Advances in Natural Language Processing (RANLP). Hissar, Bulgaria., 2011.

M. Kompan and M. Bieliková. Content-Based News Recommendation, pages 61–72. Springer

Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-15208-5. doi: 10.1007/978-

3-642-15208-5_6. URL https://doi.org/10.1007/978-3-642-15208-5_6.

Y. Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model.

In Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data

Mining, KDD ’08, pages 426–434. ACM, 2008. ISBN 978-1-60558-193-4. doi: 10.1145/

1401890.1401944. URL http://doi.acm.org/10.1145/1401890.1401944.

Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems.

Computer, 42(8):30–37, Aug. 2009. ISSN 0018-9162. doi: 10.1109/MC.2009.263. URL

http://dx.doi.org/10.1109/MC.2009.263.

T. Landauer, P. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse

Processes, 25:259–284, 1998.

J. H. Lau and T. Baldwin. An empirical evaluation of doc2vec with practical insights into

document embedding generation. In Proceedings of the 1st Workshop on Representation

Learning for NLP, pages 78–86. Association for Computational Linguistics, 2016. doi:

10.18653/v1/W16-1609. URL http://www.aclweb.org/anthology/W16-1609.

Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Pro-

ceedings of the 31st International Conference on International Conference on Machine

Learning - Volume 32, ICML’14, pages II–1188–II–1196. JMLR.org, 2014. URL http:

//dl.acm.org/citation.cfm?id=3044805.3045025.

P. Lops, M. de Gemmis, and G. Semeraro. Content-based Recommender Systems: State of the Art

and Trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Sys-

tems Handbook, pages 73–105. Springer US, Boston, MA, 2011. ISBN 978-0-387-85820-

3. doi: 10.1007/978-0-387-85820-3_3. URL http://dx.doi.org/10.1007/978-0-

387-85820-3_3.

G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts. Understanding variable impor-

tances in forests of randomized trees. In Proceedings of the 26th International Confer-

ence on Neural Information Processing Systems - Volume 1, NIPS’13, pages 431–439,

USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=

2999611.2999660.

Q. Lu, T. Chen, W. Zhang, D. Yang, and Y. Yu. Serendipitous personalized ranking for top-n

recommendation. In Proc. The 2012 IEEE/WIC/ACM International Joint Conferences on

Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT ’12, pages 258–

265, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4880-7.

URL http://dl.acm.org/citation.cfm?id=2457524.2457692.

S. Maneewongvatana and S. Maneewongvatana. A recommendation model for personalized

book lists. In Communications and Information Technologies (ISCIT), 2010 International

Symposium on, pages 389–394, Oct 2010. doi: 10.1109/ISCIT.2010.5664873.

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cam-

bridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

R. A. Mar and K. Oatley. The Function of Fiction is the Abstraction and Simula-

tion of Social Experience. Perspectives on Psychological Science, 3(3):173–192,

May 2008. ISSN 1745-6924. doi: 10.1111/j.1745-6924.2008.00073.x. URL

http://dx.doi.org/10.1111/j.1745-6924.2008.00073.x.

R. A. Mar, K. Oatley, and J. B. Peterson. Exploring the link between reading fiction and

empathy: Ruling out individual differences and examining outcomes. Communications, 34

(4):407–428, Dec. 2009. ISSN 0341-2059. doi: 10.1515/comm.2009.025. URL http:

//dx.doi.org/10.1515/comm.2009.025.

P. Matuszyk, J. Vinagre, M. Spiliopoulou, A. M. Jorge, and J. Gama. Forgetting meth-

ods for incremental matrix factorization in recommender systems. In Proceedings of the

30th Annual ACM Symposium on Applied Computing, SAC ’15, pages 947–953, New York,

NY, USA, 2015. ACM. ISBN 978-1-4503-3196-8. doi: 10.1145/2695664.2695820. URL

http://doi.acm.org/10.1145/2695664.2695820.

J. McAuley and J. Leskovec. Hidden Factors and Hidden Topics: Understanding Rating Di-

mensions with Review Text. In Proceedings of the 7th ACM Conference on Recommender

Systems, RecSys ’13, pages 165–172, New York, NY, USA, 2013. ACM. ISBN 978-1-

4503-2409-0. doi: 10.1145/2507157.2507163. URL http://doi.acm.org/10.1145/

2507157.2507163.

D. Mican, L. Mocean, and N. Tomai. Building a Social Recommender System by Harvesting

Social Relationships and Trust Scores between Users, pages 1–12. Springer Berlin Heidel-

berg, Berlin, Heidelberg, 2012. ISBN 978-3-642-34228-8. doi: 10.1007/978-3-642-34228-

8_1. URL https://doi.org/10.1007/978-3-642-34228-8_1.

M. Mikawa, S. Izumi, and K. Tanaka. Book Recommendation Signage System Using

Silhouette-Based Gait Classification. In Machine Learning and Applications and Work-

shops (ICMLA), 2011 10th International Conference on, volume 1, pages 416–419, Dec

2011. doi: 10.1109/ICMLA.2011.43.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of

words and phrases and their compositionality. In Proceedings of the 26th International Con-

ference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 3111–3119,

USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=

2999792.2999959.

R. J. Mooney and L. Roy. Content-based Book Recommending Using Learning for Text

Categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, DL

’00, pages 195–204, New York, NY, USA, 2000. ACM. ISBN 1-58113-231-X. doi:

10.1145/336597.336662. URL http://doi.acm.org/10.1145/336597.336662.

L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin. How Transferable are Neural

Networks in NLP Applications? ArXiv e-prints, Mar. 2016.

C. Musto, C. Greco, A. Suglia, and G. Semeraro. Ask me any rating: A content-based

recommender system based on recurrent neural networks. In Proceedings of the 7th

Italian Information Retrieval Workshop, Venezia, Italy, May 30-31, 2016. URL

http://ceur-ws.org/Vol-1653/paper_11.pdf.

P. Nair, M. Moh, and T. S. Moh. Using social media presence for alleviating cold start problems

in privacy protection. In 2016 International Conference on Collaboration Technologies and

Systems (CTS), pages 11–17, Oct 2016. doi: 10.1109/CTS.2016.0022.

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In

Proceedings of the 27th International Conference on International Conference on Machine

Learning, ICML’10, pages 807–814, USA, 2010. Omnipress. ISBN 978-1-60558-907-7.

URL http://dl.acm.org/citation.cfm?id=3104322.3104425.

M. L. Newman, C. J. Groom, L. D. Handelman, and J. W. Pennebaker. Gender differences

in language use: An analysis of 14,000 text samples. Discourse Processes, 45(3):211–

236, 2008. doi: 10.1080/01638530802073712. URL https://doi.org/10.1080/

01638530802073712.

R. Nichols, J. Lynn, and B. G. Purzycki. Toward a science of science fiction. Scientific Study

of Literature, 4(1):25–45, 2014. ISSN 2210-4380. doi: 10.1075/ssol.4.1.02nic.

S. Nikolenko. Svd-lda: Topic modeling for full-text recommender systems. In O. Pichardo La-

gunas, O. Herrera Alcántara, and G. Arroyo Figueroa, editors, Advances in Artificial Intel-

ligence and Its Applications: 14th Mexican International Conference on Artificial Intel-

ligence, MICAI 2015, Cuernavaca, Morelos, Mexico, October 25-31, 2015, Proceedings,

Part II, pages 67–79. Springer International Publishing, Cham, 2015. ISBN 978-3-319-

27101-9. doi: 10.1007/978-3-319-27101-9_5. URL http://dx.doi.org/10.1007/

978-3-319-27101-9_5.

A. Odić, M. Tkalčič, A. Košir, and J. F. Tasič. Relevant context in a movie recommender

system: Users’ opinion vs. statistical detection. In Proc. of the 4th Workshop on Context-

Aware Recommender Systems, 2011.

D. Pathak, S. Matharia, and C. N. S. Murthy. NOVA: Hybrid book recommendation engine. In

Advance Computing Conference (IACC), 2013 IEEE 3rd International, pages 977–982, Feb

2013. doi: 10.1109/IAdCC.2013.6514359.

M. J. Pazzani and D. Billsus. Content-Based Recommendation Systems. In P. Brusilovsky,

A. Kobsa, and W. Nejdl, editors, The Adaptive Web: Methods and Strategies of Web Per-

sonalization, pages 325–341. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007. ISBN

978-3-540-72079-9. doi: 10.1007/978-3-540-72079-9_10. URL http://dx.doi.org/

10.1007/978-3-540-72079-9_10.

M. Pennacchiotti and S. Gurumurthy. Investigating topic models for social media user rec-

ommendation. In Proc. 20th International Conference Companion on World Wide Web,

WWW ’11, pages 101–102. ACM, 2011. ISBN 978-1-4503-0637-9. doi: 10.1145/

1963192.1963244. URL http://doi.acm.org/10.1145/1963192.1963244.

J. W. Pennebaker, R. L. Boyd, K. Jordan, and K. Blackburn. The development and psychome-

tric properties of LIWC2015, 2015a.

J. W. Pennebaker, C. K. Chung, J. Frazee, G. M. Lavergne, and D. I. Beaver. When small

words foretell academic success: The case of college admissions essays. PLOS ONE, 9

(12):1–10, 12 2015b. doi: 10.1371/journal.pone.0115844. URL https://doi.org/

10.1371/journal.pone.0115844.

M. S. Pera. Using Online Data Sources to Make Recommendations on Reading Materi-

als for K-12 and Advanced Readers, 2014. URL http://search.proquest.com/

docview/1545892907?accountid=14701.

M. S. Pera and Y.-K. Ng. With a Little Help from My Friends: Generating Personalized

Book Recommendations Using Data Extracted from a Social Website. In Proceedings of

the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent

Agent Technology – Volume 01, WI-IAT ’11, pages 96–99, Washington, DC, USA, 2011.

IEEE Computer Society. ISBN 978-0-7695-4513-4. doi: 10.1109/WI-IAT.2011.9. URL

http://dx.doi.org/10.1109/WI-IAT.2011.9.

M. S. Pera and Y.-K. Ng. What to Read Next?: Making Personalized Book Recommen-

dations for K-12 Users. In Proceedings of the 7th ACM Conference on Recommender

Systems, RecSys ’13, pages 113–120, New York, NY, USA, 2013. ACM. ISBN 978-1-

4503-2409-0. doi: 10.1145/2507157.2507181. URL http://doi.acm.org/10.1145/

2507157.2507181.

M. S. Pera and Y. K. Ng. How Can We Help Our K-12 Teachers?: Using a Recommender to

Make Personalized Book Suggestions. In Proceedings of the 2014 IEEE/WIC/ACM Inter-

national Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies

(IAT) – Volume 02, WI-IAT ’14, pages 335–342, Washington, DC, USA, 2014a. IEEE

Computer Society. ISBN 978-1-4799-4143-8. doi: 10.1109/WI-IAT.2014.116. URL

http://dx.doi.org/10.1109/WI-IAT.2014.116.

M. S. Pera and Y.-K. Ng. Automating Readers’ Advisory to Make Book Recommenda-

tions for K-12 Readers. In Proceedings of the 8th ACM Conference on Recommender

Systems, RecSys ’14, pages 9–16, New York, NY, USA, 2014b. ACM. ISBN 978-1-

4503-2668-1. doi: 10.1145/2645710.2645721. URL http://doi.acm.org/10.1145/

2645710.2645721.

M. S. Pera and Y.-K. Ng. Analyzing Book-Related Features to Recommend Books for

Emergent Readers. In Proceedings of the 26th ACM Conference on Hypertext & So-

cial Media, HT ’15, pages 221–230, New York, NY, USA, 2015. ACM. ISBN 978-1-

4503-3395-5. doi: 10.1145/2700171.2791037. URL http://doi.acm.org/10.1145/

2700171.2791037.

M. S. Pera, N. Condie, and Y.-K. Ng. Personalized book recommendations created by using so-

cial media data. In Proc. 2010 International Conference on Web Information Systems Engi-

neering, WISS’10, pages 390–403, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-

642-24395-0. URL http://dl.acm.org/citation.cfm?id=2044492.2044531.

J. M. Phillips and S. Venkatasubramanian. A gentle introduction to the kernel distance. CoRR,

abs/1103.1625, 2011.

K. Priyanka, A. S. Tewari, and A. G. Barman. Personalised book recommendation system

based on opinion mining technique. In Communication Technologies (GCCT), 2015 Global

Conference on, pages 285–289, April 2015. doi: 10.1109/GCCT.2015.7342668.

C. Qian, T. He, and R. Zhang. Deep learning based authorship identification, 2017.

M. K. M. Rahman, W. Pi Yang, T. W. S. Chow, and S. Wu. A Flexible Multi-layer Self-

organizing Map for Generic Processing of Tree-structured Data. Pattern Recognition, 40

(5):1406–1424, May 2007. ISSN 0031-3203. doi: 10.1016/j.patcog.2006.10.010. URL

http://dx.doi.org/10.1016/j.patcog.2006.10.010.

S. Rajpurkar, D. Bhatt, and P. Malhotra. Book Recommendation System. IJIRST – Interna-

tional Journal for Innovative Research in Science & Technology, 1(11):314–316, 2015.

A. M. Rashid, I. Albert, D. Cosley, S. K. Lam, S. M. McNee, J. A. Konstan, and J. Riedl.

Getting to know you: Learning new user preferences in recommender systems. In Proc.

7th International Conference on Intelligent User Interfaces, IUI ’02, pages 127–134. ACM,

2002. ISBN 1-58113-459-2. doi: 10.1145/502716.502737. URL http://doi.acm.org/

10.1145/502716.502737.

A. M. Rashid, G. Karypis, and J. Riedl. Learning preferences of new users in recommender

systems: An information theoretic approach. SIGKDD Explor. Newsl., 10(2):90–100, Dec.

2008. ISSN 1931-0145. doi: 10.1145/1540276.1540302. URL http://doi.acm.org/

10.1145/1540276.1540302.

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf:

An astounding baseline for recognition. In Proceedings of the 2014 IEEE Conference

on Computer Vision and Pattern Recognition Workshops, CVPRW ’14, pages 512–519,

Washington, DC, USA, 2014. IEEE Computer Society. ISBN 978-1-4799-4308-1. doi:

10.1109/CVPRW.2014.131. URL http://dx.doi.org/10.1109/CVPRW.2014.131.

F. Ricci, L. Rokach, and B. Shapira. Introduction to Recommender Systems Handbook. In

F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook,

pages 1–35. Springer US, Boston, MA, 2011. ISBN 978-0-387-85820-3. doi: 10.1007/978-

0-387-85820-3_1. URL http://dx.doi.org/10.1007/978-0-387-85820-3_1.

C. S. Ross. Making Choices: What Readers Say about Choosing Books to Read for Plea-

sure. The Acquisitions Librarian, 13(25):5–21, 2000. doi: 10.1300/J101v13n25_02. URL

http://dx.doi.org/10.1300/J101v13n25_02.

L. Safoury and A. Salah. Exploiting user demographic attributes for solving cold-start prob-

lem in recommender system. In 2nd International Conference on Software and Computer

Applications, Paris, 2 June 2013.

R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collabora-

tive filtering. In Proceedings of the 24th International Conference on Machine Learn-

ing, ICML ’07, pages 791–798, New York, NY, USA, 2007. ACM. ISBN 978-1-

59593-793-3. doi: 10.1145/1273496.1273596. URL http://doi.acm.org/10.1145/

1273496.1273596.

B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based Collaborative Filtering Recom-

mendation Algorithms. In Proceedings of the 10th International Conference on World

Wide Web, WWW ’01, pages 285–295, New York, NY, USA, 2001. ACM. ISBN 1-

58113-348-0. doi: 10.1145/371920.372071. URL http://doi.acm.org/10.1145/

371920.372071.

A. M. J. Schakel and B. J. Wilson. Measuring word significance using distributed representa-

tions of words. CoRR, abs/1508.02297, 2015.

G. Schröder, M. Thiele, and W. Lehner. Setting goals and choosing metrics for recommender

system evaluations. In Proceedings of the 5th ACM Conference On Recommender Systems,

Chicago, USA, January 2011.

S. Sedhain, S. Sanner, D. Braziunas, L. Xie, and J. Christensen. Social collaborative filter-

ing for cold-start recommendations. In Proc. 8th ACM Conference on Recommender Sys-

tems, RecSys ’14, pages 345–348. ACM, 2014. ISBN 978-1-4503-2668-1. doi: 10.1145/

2645710.2645772. URL http://doi.acm.org/10.1145/2645710.2645772.

G. Shani and A. Gunawardana. Evaluating Recommendation Systems. In F. Ricci, L. Rokach,

B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, pages 257–297.

Springer US, Boston, MA, 2011. ISBN 978-0-387-85820-3. doi: 10.1007/978-0-387-

85820-3_8. URL http://dx.doi.org/10.1007/978-0-387-85820-3_8.

B. Shao, D. Wang, T. Li, and M. Ogihara. Music recommendation based on acoustic features

and user access patterns. IEEE Transactions on Audio, Speech, and Language Processing,

17(8):1602–1611, Nov 2009. ISSN 1558-7916. doi: 10.1109/TASL.2009.2020893.

L. A. Sherman. Analytics of literature: a manual for the objective study of English prose and

poetry. Ginn, 1893. URL http://hdl.handle.net/2027/uc1.%24b239859.

M. Smith. Recent experiences and new developments of methods for the determination of

authorship. Association for Literary and Linguistic Computing Bulletin, 11:73–82, 1983.

S. Smith, S. Warburton, and J. Rutledge. Connecting patrons with library materials:

A readers’ advisory crash course innovative libraries online conference, 2016. URL

https://www.statelibraryofiowa.org/ld/c-d/continuing-ed/iloc/

iloc-2016/handouts/connecting-patrons/connecting-patrons.pdf.

W. B. Smith. Curves of Pauline and pseudo-Pauline style I. Unitarian Review, pages 452–460,

1888.

S. S. Sohail, J. Siddiqui, and R. Ali. Book recommendation system using opinion min-

ing technique. In Advances in Computing, Communications and Informatics (ICACCI),

2013 International Conference on, pages 1609–1614, Aug 2013. doi: 10.1109/

ICACCI.2013.6637421.

T. Solorio, P. Rosso, M. Montes-y-Gómez, P. Shrestha, S. Sierra, and F. A. González.

Convolutional neural networks for authorship attribution of short texts. In Proceedings

of the 15th Conference of the European Chapter of the Association for Computational

Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers,

pages 669–674, 2017. URL http://aclanthology.info/papers/E17-2106/convolutional-neural-networks-for-authorship-attribution-of-

short-texts.

E. Stamatatos. A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci.

Technol., 60(3):538–556, Mar. 2009. ISSN 1532-2882. doi: 10.1002/asi.v60:3. URL

http://dx.doi.org/10.1002/asi.v60:3.

X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Adv. in Artif.

Intell., 2009:4:2–4:2, Jan. 2009. ISSN 1687-7470. doi: 10.1155/2009/421425. URL http:

//dx.doi.org/10.1155/2009/421425.

C. Tanasescu, V. Kesarwani, and D. Inkpen. Metaphor detection by deep learning and the place

of poetic metaphor in digital humanities. In Proceedings of the Thirty-First International

Florida Artificial Intelligence Research Society Conference, FLAIRS 2018, Melbourne,

Florida, USA, May 21-23, 2018, pages 122–127, 2018. URL https://aaai.org/ocs/

index.php/FLAIRS/FLAIRS18/paper/view/17704.

J. Tang, X. Hu, and H. Liu. Social recommendation: a review. Social Network Analysis and

Mining, 3(4):1113–1133, 2013. ISSN 1869-5469. doi: 10.1007/s13278-013-0141-9. URL

http://dx.doi.org/10.1007/s13278-013-0141-9.

Y. Tausczik and J. Pennebaker. The psychological meaning of words: LIWC and computerized

text analysis methods. Journal of Language and Social Psychology, 29(1):24–54, Mar. 2010.

ISSN 0261-927X. doi: 10.1177/0261927X09351676.

A. J.-P. Tixier. Notes on Deep Learning for NLP. ArXiv e-prints, Aug. 2018.

M. Tkalčič, A. Košir, and J. Tasič. The LDOS-PerAff-1 corpus of facial-expression video clips

with affective, personality and user-interaction metadata. Journal on Multimodal User In-

terfaces, 7(1):143–155, 2013. ISSN 1783-8738. doi: 10.1007/s12193-012-0107-7. URL

http://dx.doi.org/10.1007/s12193-012-0107-7.

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tag-

ging with a cyclic dependency network. In Proc. 2003 Conference of the North Ameri-

can Chapter of the Association for Computational Linguistics on Human Language Tech-

nology - Volume 1, NAACL ’03, pages 173–180, Stroudsburg, PA, USA, 2003. Asso-

ciation for Computational Linguistics. doi: 10.3115/1073445.1073478. URL http:

//dx.doi.org/10.3115/1073445.1073478.

K. Tsuji, N. Takizawa, S. Sato, U. Ikeuchi, A. Ikeuchi, F. Yoshikane, and H. It-

sumura. Book recommendation based on library loan records and bibliographic informa-

tion. Procedia - Social and Behavioral Sciences, 147(Supplement C):478 – 486, 2014.

ISSN 1877-0428. doi: https://doi.org/10.1016/j.sbspro.2014.07.142. URL http://

www.sciencedirect.com/science/article/pii/S1877042814040531. 3rd

International Conference on Integrated Information (IC-ININFO).

V. Vapnik, S. E. Golowich, and A. Smola. Support Vector Method for Function Approximation,

Regression Estimation and Signal Processing. In Proc. 9th International Conference on Neu-

ral Information Processing Systems, NIPS’96, pages 281–287, Cambridge, MA, USA, 1996.

MIT Press. URL http://dl.acm.org/citation.cfm?id=2998981.2999021.

P. C. Vaz, D. Martins de Matos, and B. Martins. Stylometric Relevance-feedback Towards

a Hybrid Book Recommendation Algorithm. In Proceedings of the Fifth ACM Work-

shop on Research Advances in Large Digital Book Repositories and Complementary Me-

dia, BooksOnline ’12, pages 13–16, New York, NY, USA, 2012a. ACM. ISBN 978-1-

4503-1714-6. doi: 10.1145/2390116.2390125. URL http://doi.acm.org/10.1145/

2390116.2390125.

P. C. Vaz, D. Martins de Matos, B. Martins, and P. Calado. Improving a Hybrid Literary Book

Recommendation System Through Author Ranking. In Proceedings of the 12th ACM/IEEE-

CS Joint Conference on Digital Libraries, JCDL ’12, pages 387–388, New York, NY, USA,

2012b. ACM. ISBN 978-1-4503-1154-0. doi: 10.1145/2232817.2232904. URL http:

//doi.acm.org/10.1145/2232817.2232904.

P. C. Vaz, R. Ribeiro, and D. M. de Matos. LitRec vs. Movielens – A comparative study. In

KDIR 2012 – Proceedings of the International Conference on Knowledge Discovery and

Information Retrieval, Barcelona, Spain, 4-7 October, 2012, pages 370–373, 2012c.

P. C. Vaz, R. Ribeiro, and D. M. de Matos. Understanding Temporal Dynamics of Ratings in

the Book Recommendation Scenario. In Proceedings of the 2013 International Conference

on Information Systems and Design of Communication, ISDOC ’13, pages 11–15, New

York, NY, USA, 2013. ACM. ISBN 978-1-4503-2299-7. doi: 10.1145/2503859.2503862.

URL http://doi.acm.org/10.1145/2503859.2503862.

C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles.

In Proc. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data

Mining, KDD ’11, pages 448–456. ACM, 2011. ISBN 978-1-4503-0813-7. doi: 10.1145/

2020408.2020480. URL http://doi.acm.org/10.1145/2020408.2020480.

J. Wang, C. Man, Y. Zhao, and F. Wang. An answer recommendation algorithm for medical

community question answering systems. In 2016 IEEE International Conference on Service

Operations and Logistics, and Informatics (SOLI), pages 139–144, July 2016. doi: 10.1109/

SOLI.2016.7551676.

X. Wang, X. He, L. Nie, and T. Chua. Item silk road: Recommending items from in-

formation domains to social users. In Proceedings of the 40th International ACM SI-

GIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo,

Japan, August 7-11, 2017, pages 185–194, 2017. doi: 10.1145/3077136.3080771. URL

http://doi.acm.org/10.1145/3077136.3080771.

Y. Wang, N. Stash, L. Aroyo, L. Hollink, and G. Schreiber. Using semantic relations

for content-based recommender systems in cultural heritage. In Proceedings of the

2009 International Conference on Ontology Patterns - Volume 516, WOP’09, pages 16–

28, Aachen, Germany, Germany, 2009. CEUR-WS.org. URL http://dl.acm.org/

citation.cfm?id=2889761.2889763.

A. T. Wibowo, A. Siddharthan, J. Masthoff, and C. Lin. Incorporating constraints into matrix

factorization for clothes package recommendation. In Proceedings of the 26th Conference

on User Modeling, Adaptation and Personalization, UMAP ’18, pages 111–119, New York,

NY, USA, 2018. ACM. ISBN 978-1-4503-5589-6. doi: 10.1145/3209219.3209228. URL

http://doi.acm.org/10.1145/3209219.3209228.

B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolu-

tional network. CoRR, abs/1505.00853, 2015.

X. Yan, J. Guo, S. Liu, X.-q. Cheng, and Y. Wang. Clustering short text using ncut-weighted

non-negative matrix factorization. In Proc. 21st ACM International Conference on Informa-

tion and Knowledge Management, CIKM ’12, pages 2259–2262. ACM, 2012. ISBN 978-1-

4503-1156-4. doi: 10.1145/2396761.2398615. URL http://doi.acm.org/10.1145/

2396761.2398615.

X. Yang, H. Zeng, and W. Huang. ARTMAP-Based Data Mining Approach and Its Application

to Library Book Recommendation. In Intelligent Ubiquitous Computing and Education,

2009 International Symposium on, pages 26–29, May 2009. doi: 10.1109/IUCE.2009.43.

J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural

networks? In Proceedings of the 27th International Conference on Neural Information

Processing Systems - Volume 2, NIPS’14, pages 3320–3328, Cambridge, MA, USA, 2014.

MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969197.

G. U. Yule. On sentence-length as a statistical characteristic of style in prose: With application

to two cases of disputed authorship. Biometrika, 30(3/4):363–390, 1939. ISSN 00063444.

URL http://www.jstor.org/stable/2332655.

H. Zhang and T. W. S. Chow. Organizing Books and Authors by Multilayer SOM. Neural

Networks and Learning Systems, IEEE Transactions on, PP(99):1–14, 2015. ISSN 2162-

237X. doi: 10.1109/TNNLS.2015.2496281.

S. Zhang, L. Yao, and A. Sun. Deep Learning based Recommender System: A Survey and

New Perspectives. ArXiv e-prints, July 2017.

M. Zhou. Book recommendation based on web social network. In Artificial Intelligence

and Education (ICAIE), 2010 International Conference on, pages 136–139, Oct 2010. doi:

10.1109/ICAIE.2010.5641415.

Z. Zhu and J. yan Wang. Book Recommendation Service by Improved Association Rule Min-

ing Algorithm. In Machine Learning and Cybernetics, 2007 International Conference on,

volume 7, pages 3864–3869, Aug 2007. doi: 10.1109/ICMLC.2007.4370820.

C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving Recommendation

Lists Through Topic Diversification. In Proceedings of the 14th International Conference

on World Wide Web, WWW ’05, pages 22–32, New York, NY, USA, 2005. ACM. ISBN 1-

59593-046-9. doi: 10.1145/1060745.1060754. URL http://doi.acm.org/10.1145/

1060745.1060754.