Data Fusion Ontology: Enabling a Paradigm Shift from Data Warehousing to Crowdsourcing For
Total Page:16
File Type:pdf, Size:1020Kb
DATA FUSION ONTOLOGY: ENABLING A PARADIGM SHIFT FROM DATA WAREHOUSING TO CROWDSOURCING FOR ACCELERATED PACE OF RESEARCH Dissertation Presented in Partial Fulfillments of the Requirements for the Degree Doctor of Philosophy in the Graduate School of Ohio State University By Satyajeet Raje, MS Graduate Program in Computer Science and Engineering The Ohio State University 2016 Dissertation Committee Dr. Jayashree Ramanathan, PhD Dr. Philip Payne, PhD Dr. Rajiv Ramnath, PhD Dr. Gagan Agarwal, PhD Copyright by Satyajeet Raje 2016 “The Web as I envisaged it, we have not seen it yet. The future is still so much bigger than the past.” - Tim Berners-Lee, Inventor of the World Wide Web ABSTRACT The last decade has seen an explosion of publicly available, crowd-sourced research datasets in various basic and applied science domains. Typically, research datasets are much smaller than the industry counter-parts, but still contain rich knowledge that can be reused to maximize new scientific discoveries. This availability of data has made data-driven academic research across domain at large- scale a real possibility. However, the availability of the data has not translated to its effective usage in research. A large amount of data within the data repositories remains underutilized. This research determined that a critical factor for this inefficiency is that majority of the researchers' time is devoted to the pre-processing activities rather than actual analysis. The amount of time and resources required to complete the beginning of this progression appear to limit the efficacy of the remaining research process. The thesis proposes research-centric "data fusion platforms" to accelerate the research process through reduction in the time required for the quantitative data analysis cycle. This is achieved by automating the critical data preprocessing steps in the workflow. First, the thesis defines the required computational framework to enable implementation of such platforms. It conceptualizes a novel crowdsourcing solution ii to enable a paradigm shift from “in-advance” to “on-demand” data integration mechanisms. This is achieved through a middle-layer ontological architecture placed over a shared, crowd-sourced data repository. Highly granular semantic annotations anchor the column-level entities and relationships within the datasets to global reference ontologies. The core idea is that these annotations provide the declarative meta-data that can leverage the reference ontologies. The thesis demonstrates how these annotations provide global interoperability across datasets without a pre-determined schema. This enables automation of the data pre- processing operations. The research is centered on a Data Fusion Ontology (DFO) as the required representational framework to enable the above computational solution. The DFO is designed to enable sophisticated data integration platforms with capabilities to support the research data cycle of discovery, mediation, integration and visualization. The DFO is validated using a prototype data fusion platform that can leverage the DFO specified annotations to meet the functional requirements to accelerate the pace of research. iii DEDICATION To my parents, Sanjeev and Swati Raje, and my sister Surabhi for their constant belief in my ability to achieve any goal set and for teaching me to believe in myself. iv ACKNOWLEDGEMENTS A special thank you to my advisors, Dr. Jayashree Ramanathan, Dr. Philip Payne and Dr. Rajiv Ramnath for their guidance throughout this project and my graduate studies. I have benefited greatly from your advice, continual support and direction and hope to continue to do so in the future. I would like to thank Lakshmikanth Krishnan Kaushik, Justin Dano, Marques Mayoras, Kuhn, Bradley Myers and Tyler Kuhn for their contributions in implementation of the prototype data fusion platform demonstrated in this dissertation. I would like to acknowledge my colleagues at CETI and the Payne lab; Kayhan Moharreri, Zachery Abrams, Kelly Regan, Tasneem Motiwala, Bobbie Kite among many others. I have been fortunate to be involved in several projects with them during the course of my PhD at OSU. I have thoroughly enjoyed working with all of you and hope to continue to collaborate in future research endeavors. v VITA EDUCATION Degree University Graduation Date Field of Study BE University of Pune, India July 2009 Computer Engineering The Ohio State University, Columbus, Computer Science & MS December 2012 Ohio, USA Engineering The Ohio State University, Columbus, Computer Science & PhD May 2016 Ohio, USA Engineering RELATED PUBLICATIONS Raje, S. (2015, November). Accelerating Biomedical Informatics Research with Interactive Multidimensional Data Fusion Platforms. American Medical Informatics Association (AMIA) Annual Symposium [Poster]. Raje, S., Kite, B., Ramanathan, J., & Payne, P. (2014). Real-time Data Fusion Platforms: The Need of Multi-dimensional Data-driven Research in Biomedical Informatics. Studies in health technology and informatics, 216, 1107-1107. Lele, O., Raje, S., Yen, P. Y., & Payne, P. (2015). ResearchIQ: Design of a Semantically Anchored Integrative Query Tool. AMIA Summits on Translational Science Proceedings, 2015, 97. Raje, S. (2012). ResearchIQ: An End-To-End Semantic Knowledge Platform For Resource Discovery in Biomedical Research (Masters thesis, The Ohio State University). OTHER PUBLICATIONS Regan, K., Raje, S., Saravanamuthu, C., & Payne, P. R. (2014). Conceptual Knowledge Discovery in Databases for Drug Combinations Predictions in Malignant Melanoma. Studies in health technology and informatics, 216, 663-667. vi Motiwala T, Shivade C, Raje S, Lai A, Payne, P. (2015, March) RECRUIT: Roadmap to Enhance Clinical Trial Recruitment Using Information Technology. AMIA Joint Summits on Translational Science - Clinical Research Informatics Summit, March 2015. Abstract. Raje, S., Davuluri, C., Freitas, M., Ramnath, R., & Ramanathan, J. (2012, July). Using Semantic Web Technologies for RBAC in Project-Oriented Environments. In Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual (pp. 521-530). IEEE. Raje, S., Davuluri, C., Freitas, M., Ramnath, R., & Ramanathan, J. (2012, March). Using ontology-based methods for implementing role-based access control in cooperative systems. In Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC) (pp. 763-764). ACM. Herold, M. J., Raje, S., Ramanathan, J., Ramnath, R., & Xu, Z. (2012). Human Computation Recommender for Inter-Enterprise Data Sharing and ETL Processes. Tech Report. The Ohio State University. Raje, S., Tulangekar, S., Waghe, R., Pathak, R., & Mahalle, P. (2009, July). Extraction of key phrases from document using statistical and linguistic analysis. In 2009 4th International Conference on Computer Science & Education (ICCSE). IEEE FIELDS OF STUDY Major: Computer Science and Engineering Minor: Biomedical Informatics vii TABLE OF CONTENTS Abstract …………………………………………………………………………………………………………………………………....………………………. ii Dedication …...……...………………………………………………………………………………………………………………...…..……………………… iv Acknowledgements ..………………………………………………………………………………………………………………….………………………. v Vita ..………………………………………………………………………………………………………………………………………………….……………… vi Table of Contents ..…………………………………………………………………………………………………………………………………………….. viii List of Figures ..………………………………………………………………………………………………………………………………………………….. xii List of Tables .……………………………………………………………………………………………………………………………………………………. xiv CHAPTER 1: INTRODUCTION ............................................................................................................. 1 1.1 METHODOLOGY .................................................................................................................................................. 5 1.1.1 Identifying the Research Problem ................................................................................................... 7 1.1.2 Traditional data warehousing approach is not suitable for research environment ... 9 1.2 PROPOSED RESEARCH SOLUTION: DATA CROWDSOURCING .................................................................. 10 1.3 THE DATA FUSION ONTOLOGY METHODOLOGY ....................................................................................... 11 1.4 PROTOTYPE DATA FUSION PLATFORM IMPLEMENTATION .................................................................... 12 CHAPTER 2: NEED FOR DATA FUSION PLATFORMS IN RESEARCH ..................................... 13 2.1 METHODS .......................................................................................................................................................... 17 2.1.1 Semi-structured interviews .............................................................................................................. 17 2.1.2 Analysis of current available tools ................................................................................................ 19 2.1.3 Comparison of research process versus business analytics process ................................. 19 2.2 INTERVIEW QUESTIONS AND RESPONSES .................................................................................................. 19 2.2.1 Summary ................................................................................................................................................. 35 2.3 ANALYSIS OF CURRENT AVAILABLE TOOLS AND SYSTEMS ....................................................................... 36 2.3.1 Data Repositories ...............................................................................................................................