Fast Data Analytics by Learning

by Yongjoo Park

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan, 2017

Doctoral Committee:
Associate Professor Michael Cafarella, Co-Chair
Assistant Professor Barzan Mozafari, Co-Chair
Associate Professor Eytan Adar
Professor H. V. Jagadish
Associate Professor Carl Lagoze

Yongjoo Park
[email protected]
ORCID iD: 0000-0003-3786-6214
© Yongjoo Park 2017

To my mother

Acknowledgements

I thank my co-advisors, Michael Cafarella and Barzan Mozafari. I am deeply grateful that I could start my graduate studies with Mike Cafarella. Mike is one of the nicest people I have met (I can say, throughout my life). He cared deeply about my personal life as well as my graduate study. He always encouraged and supported me so I could make the best decisions for my life. He was certainly more than a simple academic advisor. Mike also has an excellent talent for presentation. His descriptions (on any subject) are well-organized and straightforward. Sometimes, I find myself trying to imitate his presentation style when I give my own presentations, which, I believe, is natural since I have been advised by him for more than five years now.

I started working with Barzan a few years after I came to Michigan. He is excellent at writing research papers and presenting work in the most interesting, formal, and concise way. He devoted a tremendous amount of time to me. I learned a great deal about writing and presentation from him. I am glad I will be his first PhD student, and I hope he is glad about the fact as well.

Now as I finish my long (and very enjoyable) PhD study, I deeply understand the importance of having good advisors, and I am sincerely grateful that I was advised by Mike and Barzan. I always think my life has been full of fortunes I cannot explain, and having them as my advisors was one of them.

I am sincerely grateful to my thesis committee members. Both Carl Lagoze and Eytan Adar were kind enough to serve on my thesis committee, and they were supportive. Eytan Adar gave detailed feedback on this dissertation, which helped me improve its completeness greatly. H. V. Jagadish gave me wonderful advice on presentations and writing. Even from my relatively short experience with him, I could immediately sense that he is wise and is a great mentor.

My PhD study was greatly helped by the members of the database group at the University of Michigan, Ann Arbor. Dolan Antenucci helped me set up systems and shared with me many interesting real-world datasets. Some of the results in this dissertation are based on the datasets provided by him. I also had good discussions with Mike Anderson, Shirley Zhe Chen, and Matthew Burgess. I regret that I did not have more collaborations with the people in the database group.

My graduate study would not have been possible without the financial support from Jeongsong Cultural Foundation. When I first came to the United States as a Masters student (without guaranteed funding), Jeongsong Cultural Foundation provided me with tuition support. I am always thankful to it, and I am going to pay my debt to society. Also, Kwanjeong Educational Foundation supported me for my PhD studies. With its help, I could focus solely on my research without any financial concerns. I will be someone like them, who can contribute to society. My research is one part of such efforts, but I hope I can do more.

Last, but most importantly, I deeply thank my parents and my sister who always supported me and my study. I was fortunate enough to have my family. I have always been a taker. I should mention my mother, my beloved mother, who devoted her entire life to her children. My life has become better and better, over time.
Now I do not have to worry too much about paying phone bills or bus fares. Now I can easily afford Starbucks coffee and commute by car. But I want to go back to the time when I was young. I miss my mother. I miss the time when I fell asleep by her, watching a late night movie on television. I miss the food she made for me, I miss the conversations with her. I miss her smile, and I miss that there always was a person in this world who would stand and speak for me however bad the things I did. I received the best education in the world only due to her selfless and stupid amount of sacrifice. She only wanted her children to receive the best support, have the tastiest food, and enjoy everything other kids would enjoy. Yet, she was foolish enough to never enjoy any of these herself. To her, raising her kids, me and my sister, was her only happiness, and she carried out the task better than anyone. Both of her children grew up wonderfully. I conducted fine research during my graduate studies, and I will continue to do so. My sister became a medical doctor. I want to tell my mom that her life was worth more than any, and she should be proud of herself.

Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract

Chapter

1 Introduction
  1.1 Data Analytics Tasks and Synopses Construction
    1.1.1 Data Analytics Tasks
    1.1.2 Approaches to Building Synopses
  1.2 Overview and Contributions
    1.2.1 Exploiting Past Computations
    1.2.2 Task-Aware Synopses
    1.2.3 Summary of Results

2 Background
  2.1 Techniques for Exact Data Analytics
    2.1.1 Horizontal Scaling
    2.1.2 Columnar Databases
    2.1.3 In-memory Databases
  2.2 Approaches for Approximate Query Processing
    2.2.1 Random Sampling
    2.2.2 Random Projections

Part I Exploiting Past Computations

3 Faster Data Aggregations by Learning from the Past
  3.1 Motivation
  3.2 Verdict Overview
    3.2.1 Architecture and Workflow
    3.2.2 Supported Queries
    3.2.3 Internal Representation
    3.2.4 Why and When Verdict Offers Benefit
    3.2.5 Limitations
  3.3 Inference
    3.3.1 Problem Statement
    3.3.2 Inference Overview
    3.3.3 Prior Belief
    3.3.4 Model-based Answer
    3.3.5 Key Challenges
  3.4 Estimating Query Statistics
    3.4.1 Covariance Decomposition
    3.4.2 Analytic Inter-tuple Covariances
  3.5 Verdict Process Summary
  3.6 Deployment Scenarios
  3.7 Formal Guarantees
  3.8 Parameter Learning
  3.9 Model Validation
  3.10 Generalization of Verdict under Data Additions
  3.11 Experiments
    3.11.1 Experimental Setup
    3.11.2 Generality of Verdict
    3.11.3 Speedup and Error Reduction
    3.11.4 Confidence Interval Guarantees
    3.11.5 Memory and Computational Overhead
    3.11.6 Impact of Data Distributions and Workload Characteristics
    3.11.7 Accuracy of Parameter Learning
    3.11.8 Model Validation
    3.11.9 Data Append
    3.11.10 Verdict vs. Simple Answer Caching
    3.11.11 Error Reductions for Time-Bound AQP Engines
  3.12 Prevalence of Inter-tuple Covariances in Real-World
  3.13 Technical Details
    3.13.1 Double-integration of Exp Function
    3.13.2 Handling Categorical Attributes
    3.13.3 Analytically Computed Parameter Values
  3.14 Related Work
  3.15 Summary

Part II Building Task-aware Synopses

4 Accurate Approximate Searching by Learning from Data
  4.1 Motivation
  4.2 Hashing-based kNN Search
    4.2.1 Workflow
    4.2.2 Hash Function Design
  4.3 Neighbor-Sensitive Hashing
    4.3.1 Formal Verification of Our Claim
    4.3.2 Neighbor-Sensitive Transformation
    4.3.3 Our Proposed NST
    4.3.4 Our NSH Algorithm
  4.4 Experiments
    4.4.1 Setup
    4.4.2 Validating Our Main