Data Fusion Ontology: Enabling a Paradigm Shift from Data Warehousing to Crowdsourcing for Accelerated Pace of Research

Total Pages: 16

File Type: PDF, Size: 1020 KB

DATA FUSION ONTOLOGY: ENABLING A PARADIGM SHIFT FROM DATA WAREHOUSING TO CROWDSOURCING FOR ACCELERATED PACE OF RESEARCH

Dissertation presented in partial fulfillment of the requirements for the degree Doctor of Philosophy in the Graduate School of The Ohio State University. By Satyajeet Raje, MS. Graduate Program in Computer Science and Engineering, The Ohio State University, 2016. Dissertation Committee: Dr. Jayashree Ramanathan, PhD; Dr. Philip Payne, PhD; Dr. Rajiv Ramnath, PhD; Dr. Gagan Agarwal, PhD. Copyright by Satyajeet Raje, 2016.

"The Web as I envisaged it, we have not seen it yet. The future is still so much bigger than the past." - Tim Berners-Lee, Inventor of the World Wide Web

ABSTRACT

The last decade has seen an explosion of publicly available, crowd-sourced research datasets in various basic and applied science domains. Research datasets are typically much smaller than their industry counterparts, but they still contain rich knowledge that can be reused to maximize new scientific discoveries. This availability of data has made large-scale, data-driven academic research across domains a real possibility. However, the availability of the data has not translated into its effective usage in research: a large amount of data within data repositories remains underutilized. This research determined that a critical factor in this inefficiency is that the majority of researchers' time is devoted to pre-processing activities rather than actual analysis. The time and resources required to complete these early stages appear to limit the efficacy of the remaining research process.

The thesis proposes research-centric "data fusion platforms" that accelerate the research process by reducing the time required for the quantitative data analysis cycle. This is achieved by automating the critical data pre-processing steps in the workflow. First, the thesis defines the computational framework required to implement such platforms. It conceptualizes a novel crowdsourcing solution that enables a paradigm shift from "in-advance" to "on-demand" data integration mechanisms. This is achieved through a middle-layer ontological architecture placed over a shared, crowd-sourced data repository. Highly granular semantic annotations anchor the column-level entities and relationships within the datasets to global reference ontologies. The core idea is that these annotations provide the declarative metadata needed to leverage the reference ontologies. The thesis demonstrates how these annotations provide global interoperability across datasets without a pre-determined schema, which enables automation of the data pre-processing operations.

The research is centered on a Data Fusion Ontology (DFO) as the representational framework required to enable the above computational solution. The DFO is designed to enable sophisticated data integration platforms with capabilities to support the research data cycle of discovery, mediation, integration and visualization. The DFO is validated using a prototype data fusion platform that leverages DFO-specified annotations to meet the functional requirements for accelerating the pace of research.

DEDICATION

To my parents, Sanjeev and Swati Raje, and my sister Surabhi, for their constant belief in my ability to achieve any goal I set and for teaching me to believe in myself.

ACKNOWLEDGEMENTS

A special thank you to my advisors, Dr. Jayashree Ramanathan, Dr. Philip Payne and Dr. Rajiv Ramnath, for their guidance throughout this project and my graduate studies. I have benefited greatly from your advice, continual support and direction, and hope to continue to do so in the future. I would like to thank Lakshmikanth Krishnan Kaushik, Justin Dano, Marques Mayoras, Bradley Myers and Tyler Kuhn for their contributions to the implementation of the prototype data fusion platform demonstrated in this dissertation. I would like to acknowledge my colleagues at CETI and the Payne lab: Kayhan Moharreri, Zachery Abrams, Kelly Regan, Tasneem Motiwala and Bobbie Kite, among many others. I have been fortunate to be involved in several projects with them during the course of my PhD at OSU. I have thoroughly enjoyed working with all of you and hope to continue to collaborate in future research endeavors.

VITA

EDUCATION

BE, Computer Engineering, University of Pune, India, July 2009
MS, Computer Science & Engineering, The Ohio State University, Columbus, Ohio, USA, December 2012
PhD, Computer Science & Engineering, The Ohio State University, Columbus, Ohio, USA, May 2016

RELATED PUBLICATIONS

Raje, S. (2015, November). Accelerating Biomedical Informatics Research with Interactive Multidimensional Data Fusion Platforms. American Medical Informatics Association (AMIA) Annual Symposium [Poster].
Raje, S., Kite, B., Ramanathan, J., & Payne, P. (2014). Real-time Data Fusion Platforms: The Need of Multi-dimensional Data-driven Research in Biomedical Informatics. Studies in Health Technology and Informatics, 216, 1107.
Lele, O., Raje, S., Yen, P. Y., & Payne, P. (2015). ResearchIQ: Design of a Semantically Anchored Integrative Query Tool. AMIA Summits on Translational Science Proceedings, 2015, 97.
Raje, S. (2012). ResearchIQ: An End-To-End Semantic Knowledge Platform for Resource Discovery in Biomedical Research (Master's thesis, The Ohio State University).

OTHER PUBLICATIONS

Regan, K., Raje, S., Saravanamuthu, C., & Payne, P. R. (2014). Conceptual Knowledge Discovery in Databases for Drug Combinations Predictions in Malignant Melanoma. Studies in Health Technology and Informatics, 216, 663-667.
Motiwala, T., Shivade, C., Raje, S., Lai, A., & Payne, P. (2015, March). RECRUIT: Roadmap to Enhance Clinical Trial Recruitment Using Information Technology. AMIA Joint Summits on Translational Science - Clinical Research Informatics Summit [Abstract].
Raje, S., Davuluri, C., Freitas, M., Ramnath, R., & Ramanathan, J. (2012, July). Using Semantic Web Technologies for RBAC in Project-Oriented Environments. In Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual (pp. 521-530). IEEE.
Raje, S., Davuluri, C., Freitas, M., Ramnath, R., & Ramanathan, J. (2012, March). Using ontology-based methods for implementing role-based access control in cooperative systems. In Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC) (pp. 763-764). ACM.
Herold, M. J., Raje, S., Ramanathan, J., Ramnath, R., & Xu, Z. (2012). Human Computation Recommender for Inter-Enterprise Data Sharing and ETL Processes. Tech report, The Ohio State University.
Raje, S., Tulangekar, S., Waghe, R., Pathak, R., & Mahalle, P. (2009, July). Extraction of key phrases from document using statistical and linguistic analysis. In 2009 4th International Conference on Computer Science & Education (ICCSE). IEEE.

FIELDS OF STUDY

Major: Computer Science and Engineering
Minor: Biomedical Informatics

TABLE OF CONTENTS

Abstract; Dedication; Acknowledgements; Vita; Table of Contents; List of Figures; List of Tables

CHAPTER 1: INTRODUCTION
  1.1 Methodology
    1.1.1 Identifying the Research Problem
    1.1.2 Traditional data warehousing approach is not suitable for research environment
  1.2 Proposed Research Solution: Data Crowdsourcing
  1.3 The Data Fusion Ontology Methodology
  1.4 Prototype Data Fusion Platform Implementation

CHAPTER 2: NEED FOR DATA FUSION PLATFORMS IN RESEARCH
  2.1 Methods
    2.1.1 Semi-structured interviews
    2.1.2 Analysis of current available tools
    2.1.3 Comparison of research process versus business analytics process
  2.2 Interview Questions and Responses
    2.2.1 Summary
  2.3 Analysis of current available tools and systems
    2.3.1 Data Repositories ...
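The abstract's core mechanism, column-level semantic annotations anchored to global reference ontologies, can be sketched concretely. The following minimal Python example uses rdflib; the DFO namespace, the property names, and the choice of NCI Thesaurus term are illustrative assumptions, not the dissertation's actual vocabulary.

```python
# A minimal sketch (not the dissertation's actual DFO vocabulary) of how a
# column-level semantic annotation might anchor a dataset column to a
# reference-ontology term. All URIs below are hypothetical.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

DFO = Namespace("http://example.org/dfo#")                 # hypothetical DFO namespace
NCIT = Namespace("http://purl.obolibrary.org/obo/NCIT_")   # NCI Thesaurus prefix

g = Graph()
g.bind("dfo", DFO)

col = DFO["dataset42/column/age"]            # one column in a shared dataset
g.add((col, RDF.type, DFO.ColumnAnnotation))
g.add((col, RDFS.label, Literal("age")))
# Anchor the column to a global reference-ontology concept (NCIT C25150, "Age").
# This declarative metadata is what would let a platform match and join columns
# across datasets without a pre-determined warehouse schema.
g.add((col, DFO.denotesConcept, NCIT["C25150"]))
g.add((col, DFO.unit, Literal("years")))

print(g.serialize(format="turtle"))
```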
Recommended publications
  • Normalized Form Snowflake Schema
The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancy, while the star schema de-normalizes them. Typically, most of the fact tables in a star schema are in third normal form, while the dimension tables are de-normalized. A relation is said to be in first normal form if each attribute holds only atomic values. Assume there are 500,000 product dimension rows falling under 500 normalized sub-dimension rows; the snowflake design trades storage redundancy for extra joins. The comparison applies across warehouses such as Redshift, Snowflake and BigQuery. The CWM repository schema is a standalone product that other products can share; each product owns only its own extensions. The main difference between the two schemas is normalization: the dimension tables in the snowflake schema are normalized, even for attributes such as price, weight, speed and quantities, i.e., data in a numerical format.
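A small pandas sketch of the distinction described above, with invented tables: the snowflake form keeps the product dimension normalized across two tables, and one join flattens it into the de-normalized star form.

```python
# Toy illustration (hypothetical tables): a snowflaked, normalized product
# dimension versus its flattened star-schema equivalent.
import pandas as pd

category = pd.DataFrame({"category_id": [1, 2],
                         "category_name": ["Beverage", "Snack"]})
product = pd.DataFrame({"product_id": [10, 11, 12],
                        "product_name": ["Cola", "Tea", "Chips"],
                        "category_id": [1, 1, 2]})  # FK into the snowflaked table

# Star-schema style: one flat, de-normalized dimension table.
product_star = product.merge(category, on="category_id").drop(columns="category_id")
print(product_star)
```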
  • A Data-Driven Framework for Assisting Geo-Ontology Engineering Using a Discrepancy Index
University of California, Santa Barbara. A Data-Driven Framework for Assisting Geo-Ontology Engineering Using a Discrepancy Index. A Thesis submitted in partial satisfaction of the requirements for the degree Master of Arts in Geography by Bo Yan. Committee in charge: Professor Krzysztof Janowicz, Chair; Professor Werner Kuhn; Professor Emerita Helen Couclelis. June 2016. The Thesis of Bo Yan is approved: Professor Werner Kuhn, Professor Emerita Helen Couclelis, Professor Krzysztof Janowicz, Committee Chair. May 2016. Copyright © 2016 by Bo Yan.

Acknowledgements: I would like to thank the members of my committee for their guidance and patience in the face of obstacles over the course of my research. I would like to thank my advisor, Krzysztof Janowicz, for his invaluable input on my work. Without his help and encouragement, I would not have been able to find the light at the end of the tunnel during the last stage of the work, because he provided insight that helped me think outside the box. There is no better advisor. I would like to thank Yingjie Hu, who has offered me numerous feedback, suggestions and inspirations on my thesis topic. I would like to thank all my other intelligent colleagues in the STKO lab and the Geography Department, those who have moved on and started anew, those who are still in the quagmire, and those who have just begun, for their support and friendship. Last, but most importantly, I would like to thank my parents for their unconditional love.
  • The Design of Multidimensional Data Model Using Principles of the Anchor Data Modeling: an Assessment of Experimental Approach Based on Query Execution Performance
WSEAS TRANSACTIONS on COMPUTERS. The Design of Multidimensional Data Model Using Principles of the Anchor Data Modeling: An Assessment of Experimental Approach Based on Query Execution Performance. RADEK NĚMEC, FRANTIŠEK ZAPLETAL, Department of Systems Engineering, Faculty of Economics, VŠB - Technical University of Ostrava, Sokolská třída 33, 701 21 Ostrava, CZECH REPUBLIC. [email protected], [email protected]

Abstract: Decision-making processes need to reflect changes in the business world in a multidimensional way. This also includes a similar way of viewing the data used for carrying out key decisions that ensure the competitiveness of the business. In this paper we focus on the Business Intelligence system as a main toolset that helps in carrying out complex decisions and which requires a multidimensional view of data for this purpose. We propose a novel experimental approach to the design of a multidimensional data model that uses principles of the anchor modeling technique. The proposed approach is expected to bring several benefits, such as better query execution performance and better support for temporal querying. In this paper we assess the approach mainly from the query execution performance perspective. The emphasis is placed on the assessment of this technique as a potentially innovative approach for the field of data warehousing, with some implicit principles that could make the design, implementation and maintenance of the data warehouse more effective. The query performance testing was performed in a row-oriented database environment using a sample of 10 star queries executed against 10 sample multidimensional data models.
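As a rough sketch of the anchor modeling principles the paper assesses (the schema below is an illustrative guess, not one of the authors' test models): each entity becomes an anchor table holding only a surrogate key, every attribute lives in its own table, and new attribute versions are inserted rather than updated, which is what supports the temporal querying mentioned above.

```python
# Minimal anchor-modeling sketch with sqlite3; schema and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_anchor (pr_id INTEGER PRIMARY KEY);
CREATE TABLE product_name  (pr_id INTEGER REFERENCES product_anchor,
                            name TEXT, valid_from DATE);
CREATE TABLE product_price (pr_id INTEGER REFERENCES product_anchor,
                            price REAL, valid_from DATE);
INSERT INTO product_anchor VALUES (1);
INSERT INTO product_name  VALUES (1, 'Cola', '2013-01-01');
INSERT INTO product_price VALUES (1, 0.99, '2013-01-01');
INSERT INTO product_price VALUES (1, 1.19, '2014-01-01');  -- new version, no UPDATE
""")
# Reassembling the entity is a join over attribute tables; this is the kind of
# query whose execution cost the authors benchmark against classic designs.
for row in con.execute("""
    SELECT a.pr_id, n.name, p.price, p.valid_from
    FROM product_anchor a
    JOIN product_name  n ON n.pr_id = a.pr_id
    JOIN product_price p ON p.pr_id = a.pr_id
    ORDER BY p.valid_from"""):
    print(row)
```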
  • Query-Driven Text Analytics for Knowledge Extraction, Resolution, and Inference
QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE. By Christan Earl Grant. A dissertation presented to the Graduate School of the University of Florida in partial fulfillment of the requirements for the degree of Doctor of Philosophy. University of Florida, 2015. © 2015 Christan Earl Grant. To Jesus my Savior, Vanisia my wife, my daughter Caliah, my soon-to-be-born son, and my parents and siblings, whom I strive to impress. Also, to all my brothers and sisters battling injustice while I battled bugs and deadlines.

ACKNOWLEDGMENTS: I had the opportunity to see my dad, a software engineer from Jamaica, work extremely hard to get a master's degree and work as a software engineer. I even had the privilege of sitting in on some of his classes as he taught at a local university. Watching my dad work toward intellectual endeavors made me believe that anything is possible. I am extremely privileged to have someone I could look up to as an example of being a man, father, and scholar. I had my first taste of research when Dr. Joachim Hammer went out of his way to find a task for me on one of his research projects because I was interested in attending graduate school. After working with the team for a few weeks he was willing to give me increased responsibility: he let me attend the 2006 SIGMOD Conference in Chicago. It was there that my eyes were opened to the world of database research. As an early graduate student, Dr. Joseph Wilson exercised superhuman patience with me as I learned to grasp the fundamentals of paper writing.
  • Usage-Dependent Maintenance of Structured Web Data Sets
Usage-dependent maintenance of structured Web data sets. Dissertation submitted for the academic degree of Doctor of Natural Sciences (Dr. rer. nat.) at the Institute of Computer Science, Department of Mathematics and Computer Science, Freie Universität Berlin. Submitted by Dipl.-Inform. Markus Luczak-Rösch, Berlin, August 2013. Referee: Prof. Dr.-Ing. Robert Tolksdorf (Freie Universität Berlin). First co-referee: Natalya F. Noy, PhD (Stanford University). Second co-referee: Dr. rer. nat. Elena Simperl (University of Southampton). Date of defense: 13.01.2014.

To Johanna. To Levi, Yael and Mili. Thank you for letting me learn through you an attitude to life "... that is ready to listen to critical arguments and to learn from experience. It is fundamentally an attitude of admitting that I may be wrong, that you may be right, and that together we may perhaps get on the track of the truth." - Karl Popper

Abstract: The Web of Data is the current shape of the Semantic Web, which has gained momentum outside of the research community and become publicly visible. It is a matter of fact that the Web of Data does not fully exploit the primarily intended technology stack. Instead, the so-called Linked Data design issues [BL06], which are the basis for the Web of Data, rely on much more lightweight technologies. Openly available structured Web data sets are at the beginning of being used in real-world applications. The Linked Data research community investigates the overall goal of approaching the Web-scale data integration problem in a way that distributes efforts between three contributing stakeholders on the Web of Data: the data publishers, the data consumers, and third parties.
  • Chapter 7 Multi Dimensional Data Modeling
Chapter 7: Multi-Dimensional Data Modeling. Content of this presentation has been taken from the book "Fundamentals of Business Analytics" by RN Prasad and Seema Acharya, published by Wiley India Pvt. Ltd., and it remains the copyright of the book's authors and publisher.

Basis: You are already familiar with the concepts relating to the basics of RDBMS, OLTP, and OLAP, the role of ERP in the enterprise, as well as the "enterprise production environment" for IT deployment. In the previous lectures, you were introduced to Types of Digital Data, Introduction to OLTP and OLAP, Business Intelligence Basics, and Data Integration. With this background, it is now time to think about how data is modelled.
• Just as a circuit diagram is to an electrical engineer, an assembly diagram is to a mechanical engineer, and a blueprint of a building is to a civil engineer, so are data models/data diagrams to a data architect.
• But is "data modelling" only the responsibility of a data architect? No: the Business Intelligence (BI) application developer today is involved in designing, developing, deploying, supporting, and optimizing storage in the form of data warehouses/data marts. To play this role efficiently, the BI application developer relies heavily on data models/data diagrams to understand the schema structure, the data, the relationships between data, etc.

In this lecture, we will learn:
• the basics of data modelling;
• how to go about designing a data model at the conceptual and logical levels;
• the pros and cons of popular modelling techniques such as ER modelling and dimensional modelling.

Case Study, "TenToTen Retail Stores": A new range of cosmetic products has been introduced by a leading brand, which TenToTen wants to sell through its various outlets.
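A minimal sketch of the kind of star schema such a case study leads to. The tables and columns below are illustrative guesses, not the book's actual TenToTen design.

```python
# Hypothetical star schema for a retail case study: one fact table with
# foreign keys into small dimension tables, queried with an ad hoc "slice".
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales  (store_id INTEGER, product_id INTEGER,
                          date_id INTEGER, units INTEGER, revenue REAL);
INSERT INTO dim_store VALUES (1, 'Pune'), (2, 'Mumbai');
INSERT INTO dim_product VALUES (10, 'Lipstick');
INSERT INTO dim_date VALUES (100, '2014-06-01');
INSERT INTO fact_sales VALUES (1, 10, 100, 3, 450.0), (2, 10, 100, 5, 750.0);
""")
# Typical ad hoc query: revenue by city for the new cosmetics range.
for row in con.execute("""
    SELECT s.city, SUM(f.revenue)
    FROM fact_sales f JOIN dim_store s ON s.store_id = f.store_id
    GROUP BY s.city"""):
    print(row)
```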
  • ER/Studio Enterprise Data Modeling
ER/Studio Enterprise Data Modeling. ER/Studio®, a model-driven data architecture and database design solution, helps companies discover, document, and reuse data assets. With round-trip database support, data architects have the power to thoroughly analyze existing data sources as well as design and implement high quality databases that reflect business needs. The highly-readable visual format enhances communication across job functions, from business analysts to application developers. ER/Studio Enterprise also enables team and enterprise collaboration with its repository.
• Enhance visibility into your existing data assets
• Effectively communicate models across the enterprise
• Improve data consistency
• Trace data origins and whereabouts to enhance data integration and accuracy

ENHANCE VISIBILITY INTO YOUR EXISTING DATA ASSETS: As data volumes grow and environments become more complex, corporations find it increasingly difficult to leverage their information. ER/Studio provides an easy-to-use visual medium to document, understand, and publish information about data assets so that they can be harnessed to support business objectives. Powerful reverse engineering of industry-leading database systems allows a data modeler to compare and consolidate common data structures without creating unnecessary duplication. Using industry standard notations, data modelers can create an information hub by importing, analyzing, and repurposing metadata from data sources such as business intelligence applications, ETL environments, XML documents, and other modeling solutions.

Related Products:
• ER/Studio Viewer: view, navigate and print ER/Studio models in a view-only environment.
• Describe™: design, document, and maintain enterprise applications written in Java, C++, and IDL for better code quality and shorter time to market.
• DT/Studio®: an easy-to-use visual medium to ...
  • Advantages of Dimensional Data Modeling
Advantages of Dimensional Data Modeling. 2997 Yarmouth Greenway Drive, Madison, WI 53711, (608) 278-9964, www.sys-seminar.com

Top Ten Reasons Why Your Data Model Needs a Makeover:
1. Ad hoc queries are difficult to construct for end-users or must go through database "gurus."
2. Even standard reports require considerable effort and detailed knowledge of the database.
3. Data is not integrated or is inconsistent across sources.
4. Changes in data values or in data sources cannot be handled gracefully.
5. The structure of the data does not mirror business processes or business rules.
6. The data model limits which BI tools can be used.
7. There is no system for maintaining change history or collecting metadata.
8. Disk space is wasted on redundant values.
9. Users who might benefit from the data don't use it.
10. Maintenance is tedious and ad hoc.

Part 1 - Data Model Overview
• What is data modeling and why is it important?
• Three common data models: de-normalized (SAS data sets), normalized, dimensional
• Benefits of the dimensional model

What is data modeling? The generalized logical relationship among tables, usually reflected in the physical structure of the tables; not tied to any particular product or DBMS; a critical design consideration. Why is data modeling important? It allows you to optimize performance and minimize costs, and it facilitates system documentation and maintenance. The dimensional data model is the foundation of a well designed data mart or data warehouse. Three general data models we will review: de-normalized (expected by many SAS procedures), normalized (often used in transaction-based systems such as order entry), and dimensional (often used in data warehouse systems and systems subject to ad hoc queries).
  • The Application of Data Fusion Method in the Analysis of Ocean and Meteorology Observation Data
International Journal of Hydrology. Research Article, Open Access, Volume 3, Issue 3, 2019. The application of data fusion method in the analysis of ocean and meteorology observation data. Wang Wei (1), Liu Zhi-jie (2), Zhao Yue (3), Gong Xiao-qing (1). (1) Tianjin Institute of Meteorological Science, China; (2) Meteorology Administrative of Xiqing District, China; (3) Meteorology Administrative of Jinnan District, China. Correspondence: Wang Wei, Tianjin Institute of Meteorological Science, Tianjin, China, Tel 086-022-23358783. Received: May 13, 2019 | Published: June 03, 2019.

Abstract: Data fusion is a new data processing technology for multi-source ocean meteorology data. In order to clearly understand how to apply this technology to the processing of ocean meteorology observation data, the authors first divide the fusion methods into three categories: pixel-level fusion, feature-level fusion and decision-making-level fusion. The corresponding methods of each category are then discussed in detail. Afterward, the SST fusion data from the China Meteorological Administration is analyzed to show how the fused data is applied in a numerical forecast model for China's Bohai Sea region. Finally, the authors argue that data fusion technology for multi-source observation data has good application value when observation data is processed into gridded ocean meteorology data, which can serve as the gridded ground truth for numerical model verification. In addition, fusion at the decision-making level can develop together with intelligent technology to deeply mine applications of data fusion to ocean meteorology observation data.

Keywords: data fusion, ocean meteorology, observation data, data processing
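A schematic numpy sketch of two of the three fusion levels distinguished above, with invented numbers: feature-level fusion concatenates per-source features before a single decision, while decision-level fusion combines per-source decisions, for example by weighted voting.

```python
# Illustrative only: two hypothetical SST sources feeding two fusion levels.
import numpy as np

sst_satellite = np.array([18.2, 18.6])  # features from source 1 (e.g. SST, gradient)
sst_buoy      = np.array([17.9, 18.4])  # features from source 2

# Feature-level fusion: one joint feature vector for a single classifier.
joint_features = np.concatenate([sst_satellite, sst_buoy])

# Decision-level fusion: each source decides alone ("warm anomaly?"),
# then the decisions are merged with reliability weights.
decisions = np.array([1, 0])            # per-source binary decisions
weights   = np.array([0.7, 0.3])        # trust assigned to each source
fused_decision = int(np.dot(weights, decisions) >= 0.5)

print(joint_features, fused_decision)
```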
  • Bio-Ontologies Submission Template
Relational to RDF mapping using D2R for translational research in neuroscience. Rudi Verbeeck* (1), Tim Schultz (2), Laurent Alquier (3) and Susie Stephens (4). Johnson & Johnson Pharmaceutical Research and Development. (1) Turnhoutseweg 30, Beerse, Belgium; (2) Welch & McKean Roads, Spring House, PA, United States; (3) 1000 Route 202, Raritan, NJ, United States; (4) 145 King of Prussia Road, Radnor, PA, United States.

ABSTRACT. Motivation: To support translational research and external innovation, we are evaluating the potential of the semantic web to integrate data from discovery research through to the clinical environment. This paper describes our experiences in mapping relational databases to RDF for data sets relating to neuroscience. Implementation: We describe how classes were identified in the original data sets and mapped to RDF, and how connections were made to public ontologies. Special attention was paid to the mapping of experimental measures to RDF and how it was impacted by the relational schemata. Results: Mapping from relational databases to RDF can benefit from techniques borrowed from dimensional modeling.

Relational database technology has been developed as an approach for managing and integrating data in a highly available, secure and scalable architecture. With this approach, all metadata is embedded or implicit in the application or metadata schema itself, which results in performant queries. However, this architecture makes it difficult to share data across a large organization where different database schemata and applications are being used. The semantic web offers a promising approach to interconnecting databases across an organization, since the technology was designed to function within the distributed environment of the web. Resource Description Framework (RDF) and Web Ontology Language (OWL) are the two main semantic web standard recommendations.
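The paper itself uses D2R, a Java-based mapping server. Purely to illustrate the underlying row-to-triples idea, here is a minimal Python analogue with rdflib and sqlite3; the table, namespace and property names are invented.

```python
# Sketch of relational-to-RDF mapping: each row becomes a subject URI, each
# column a predicate, with some columns re-pointed at ontology terms.
import sqlite3
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/neuro#")  # hypothetical namespace

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurement (id INTEGER, subject TEXT, region TEXT, value REAL)")
con.execute("INSERT INTO measurement VALUES (1, 'S01', 'hippocampus', 2.4)")

g = Graph()
for mid, subj, region, value in con.execute("SELECT * FROM measurement"):
    row_uri = EX[f"measurement/{mid}"]            # one subject URI per row
    g.add((row_uri, RDF.type, EX.Measurement))
    g.add((row_uri, EX.subject, Literal(subj)))
    g.add((row_uri, EX.brainRegion, EX[region]))  # could map to a public ontology term
    g.add((row_uri, EX.value, Literal(value)))

print(g.serialize(format="turtle"))
```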
  • L Dataspaces Make Data Ntegration Obsolete?
DBKDA 2011, January 23-28, 2011, St. Maarten, The Netherlands Antilles. DBKDA 2011 Panel Discussion: Will Dataspaces Make Data Integration Obsolete? Moderator: Fritz Laux, Reutlingen Univ., Germany. Panelists: Kazuko Takahashi, Kwansei Gakuin Univ., Japan; Lena Strömbäck, Linköping Univ., Sweden; Nipun Agarwal, Oracle Corp., USA; Christopher Ireland, The Open Univ., UK; Fritz Laux, Reutlingen Univ., Germany.

[Slide: "The Dataspace Idea: Space of Data Management", a chart plotting semantic integration (low to high) against functionality and cost, ranging from schema-first federated DBMS, enterprise portals and repositories to schemaless, pay-as-you-go dataspaces, web search and desktop search; adapted from Franklin, Halevy, Maier, 2005.]

Dataspaces (DS) [Franklin, Halevy, Maier, 2005] are a new abstraction for information management. DS are [paraphrasing and commenting on Franklin, 2009]:
– Inclusive: they deal with all the data of interest, in whatever form => but semantics matters, and we need access to the metadata! Derive schema from instances? Discovering new data sources => the Münchhausen bootstrap problem? [Illustration: Theodor Hosemann (1807-1875)]
– Co-existence
  • Causal Inference and Data Fusion in Econometrics∗
TECHNICAL REPORT R-51. First version: Dec 2019; last revision: March 2021. Causal Inference and Data Fusion in Econometrics. Paul Hünermund, Copenhagen Business School, [email protected]; Elias Bareinboim, Columbia University, [email protected].

Learning about cause and effect is arguably the main goal in applied econometrics. In practice, the validity of these causal inferences is contingent on a number of critical assumptions regarding the type of data that has been collected and the substantive knowledge that is available about the phenomenon under investigation. For instance, unobserved confounding factors threaten the internal validity of estimates, data availability is often limited to non-random, selection-biased samples, causal effects need to be learned from surrogate experiments with imperfect compliance, and causal knowledge has to be extrapolated across structurally heterogeneous populations. A powerful causal inference framework is required to tackle all of these challenges, which plague essentially any data analysis to varying degrees. Building on the structural approach to causality introduced by Haavelmo (1943) and the graph-theoretic framework proposed by Pearl (1995), the artificial intelligence (AI) literature has developed a wide array of techniques for causal learning that make it possible to leverage information from various imperfect, heterogeneous, and biased data sources (Bareinboim and Pearl, 2016). In this paper, we discuss recent advances in this literature that have the potential to contribute to econometric methodology along three broad dimensions. First, they provide a unified and comprehensive framework for causal inference, in which the above-mentioned problems can be addressed in full generality.
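A small simulated example of the first challenge listed above: with synthetic data, the naive treatment-control contrast is biased by a confounder Z, while a backdoor adjustment in the spirit of Pearl (1995), stratifying on Z, approximately recovers the true effect. All numbers here are invented for illustration.

```python
# Synthetic confounding example: true causal effect of X on Y is 2.0.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                       # confounder
x = (z + rng.normal(size=n) > 0) * 1.0       # treatment influenced by z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)   # outcome; true effect of x is 2.0

naive = y[x == 1].mean() - y[x == 0].mean()  # biased: mixes in z's effect

# Backdoor adjustment: average the within-stratum contrasts over strata of z.
strata = np.digitize(z, np.quantile(z, np.linspace(0, 1, 21)[1:-1]))
adjusted = np.mean([y[(strata == s) & (x == 1)].mean() -
                    y[(strata == s) & (x == 0)].mean()
                    for s in np.unique(strata)])

# naive comes out near 5.4; adjusted lands close to 2.0
# (a small residual bias remains because the strata are coarse).
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```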