Compiling Mappings to Bridge Applications and Databases


Sergey Melnik (Microsoft Research), Atul Adya (Microsoft Corporation), Philip A. Bernstein (Microsoft Research)
One Microsoft Way, Redmond, WA 98052, U.S.A.
SIGMOD'07, June 12-14, 2007, Beijing, China. Copyright 2007 ACM 978-1-59593-686-8/07/0006.

ABSTRACT

Translating data and data access operations between applications and databases is a longstanding data management problem. We present a novel approach to this problem, in which the relationship between the application data and the persistent storage is specified using a declarative mapping, which is compiled into bidirectional views that drive the data transformation engine. Expressing the application model as a view on the database is used to answer queries, while viewing the database in terms of the application model allows us to leverage view maintenance algorithms for update translation. This approach has been implemented in a commercial product. It enables developers to interact with a relational database via a conceptual schema and an object-oriented programming surface. We outline the implemented system and focus on the challenges of mapping compilation, which include rewriting queries under constraints and supporting non-relational constructs.

Categories and Subject Descriptors: H.2 [Database Management], D.3 [Programming Languages]
General Terms: Algorithms, Performance, Design, Languages
Keywords: mapping, updatable views, query rewriting

1. INTRODUCTION

Developers of data-centric solutions routinely face situations in which the data representation used by an application differs substantially from the one used by the database. The traditional reason is the impedance mismatch between programming language abstractions and persistent storage [12]: developers want to encapsulate business logic into objects, yet most enterprise data is stored in relational database systems. A second reason is to enable data independence. Even if applications and databases start with the same data representation, they can evolve, leading to differing data representations that must be bridged. A third reason is independence from DBMS vendors: many enterprise applications run in the middle tier and need to support backend database systems of varying capabilities, which require different data representations. Thus, in many enterprise systems the separation between application models and database models has become a design choice rather than a technical impediment.

The data transformations required to bridge applications and databases can be very complex. Even relatively simple object-to-relational (O/R) mapping scenarios where a set of objects is partitioned across several relational tables may require transformations that contain outer joins, nested queries, and case statements in order to reassemble objects from tables (we will see an example shortly). Implementing such transformations is difficult, especially since the data usually needs to be updatable, a common requirement for many enterprise applications. A study referred to in [22] found that coding and configuring O/R data access accounts for up to 40% of total project effort.
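To make such reassembly transformations concrete, here is a minimal SQL sketch under a hypothetical schema (the table and column names are ours, not the paper's running example): Person entities are stored in HR_Person, and the extra fields of the Customer subtype are stored in CRM_Customer under the same key.

    -- Reassembling the partitioned objects needs an outer join plus a
    -- case statement that decides each row's type.
    CREATE VIEW Persons AS
    SELECT p.Id,
           p.Name,
           CASE WHEN c.Id IS NULL THEN 'Person' ELSE 'Customer' END AS Type,
           c.CreditScore                 -- NULL for plain Persons
    FROM HR_Person p
    LEFT OUTER JOIN CRM_Customer c ON c.Id = p.Id;

Updating through such a view is where handcoding gets hard: inserting a Customer must touch both tables, while inserting a plain Person touches only one.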
Since the mid-1990s, client-side data mapping layers have become a popular alternative to handcoding the data access logic, fueled by the growth of Internet applications. A core function of such a layer is to provide an updatable view that exposes a data model closely aligned with the application's data model, driven by an explicit mapping. Many commercial products (e.g., Oracle TopLink) and open source projects (e.g., Hibernate) have emerged to offer these capabilities. Virtually every enterprise framework provides a client-side persistence layer (e.g., EJB in J2EE). Most packaged business applications, such as ERP and CRM applications, incorporate proprietary data access interfaces (e.g., BAPI in SAP R/3).

Today's client-side mapping layers offer widely varying degrees of capability, robustness, and total cost of ownership. Typically, the mapping between the application and database artifacts is represented as a custom structure or schema annotations that have vague semantics and drive case-by-case reasoning. A scenario-driven implementation limits the range of supported mappings and often yields a fragile runtime that is difficult to extend. Few data access solutions leverage data transformation techniques developed by the database community. Furthermore, building such solutions using views, triggers, and stored procedures is problematic for a number of reasons. First, views containing joins or unions are usually not updatable. Second, defining custom database views and triggers for every application accessing mission-critical enterprise data is rarely acceptable due to security and manageability risks. Moreover, SQL dialects, object-relational features, and procedural extensions vary significantly from one DBMS to the next.

In this paper we describe an approach to building a mapping-driven data access layer that addresses some of these challenges. As a first contribution, we present a novel mapping architecture that provides a general-purpose mechanism for supporting updatable views. It enables building client-side data access layers in a principled way but could also be exploited inside a database engine. The architecture is based on three main steps:

• Specification: Mappings are specified using a declarative language that has well-defined semantics and puts a wide range of mapping scenarios within reach of non-expert users.

• Compilation: Mappings are compiled into bidirectional views, called query and update views, that drive query and update processing in the runtime engine.

• Execution: Update translation is done using algorithms for materialized view maintenance. Query translation uses view unfolding.
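As a rough illustration of the compiled artifacts, continuing the hypothetical schema above (our sketch, not the Entity Framework's actual output): the query view expresses the entity set over the tables, and the update views express each table over the entity set.

    -- Query view (entities over tables): the Persons view sketched
    -- earlier plays this role.

    -- Update views (tables over entities):
    CREATE VIEW UV_HR_Person AS
    SELECT Id, Name
    FROM Persons;                        -- every entity owns an HR_Person row

    CREATE VIEW UV_CRM_Customer AS
    SELECT Id, CreditScore
    FROM Persons
    WHERE Type = 'Customer';             -- only Customer entities get a CRM row

Under this reading, a query against Persons is answered by unfolding the query view, while an inserted Customer entity changes both update views, so incremental view maintenance can derive the corresponding row inserts into HR_Person and CRM_Customer.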
[Figure 1: Interacting with data in the Entity Framework]
1. Data manipulation using Entity SQL:
   "SELECT p FROM SalesPeople AS p WHERE p.Hired.Year > 2000"
2. CRUD on objects (create/read/update/delete):
   foreach(Person p in aw.SalesPeople)
     if(p.Hired.Year > 2000) {...}
   aw.SaveChanges();
3. Language-integrated queries (LINQ):
   var people = from p in aw.SalesPeople
                where p.Hired.Year > 2000
                select p;

[Figure 2: System architecture. Design time: a mapping compiler turns the mapping produced by UI tools and the mapping generator into query, update, and merge views. Runtime: Entity SQL and LINQ requests pass through language integration and object services into a query pipeline (unfold query views, push relational ops, optimize) that ships expression trees to data providers and the database system; ΔEntities pass through an update pipeline (apply view maintenance to bidirectional views, assemble entities) that emits ΔTables.]

To the best of our knowledge, this mapping approach has not been exploited previously in the research literature or in commercial products. It raises interesting research challenges.

Our second contribution addresses one of these challenges: the formulation of the mapping compilation problem and algorithms that solve it. The algorithms that we developed are based on techniques for answering queries using views and incorporate a number of novel aspects. One is the concept of bipartite mappings, which are mappings that can be expressed as a composition of a view and an inverse view. Using bipartite mappings enables us to focus on either the left or the right 'part' of mapping constraints at a time when generating query and update views. Another novel aspect is a schema partitioning approach that allows us to rewrite the mapping constraints using union queries and facilitates view generation. Furthermore, we discuss support for object-oriented constructs, case statement generation, bookkeeping of tuple provenance, and simplification of views under constraints.

The presented approach has been implemented in a commercial product, the Microsoft ADO.NET Entity Framework [1, 11]. A pre-release of the product is available for download.

This paper is structured as follows. In Section 2 we give an overview of the Entity Framework. In Section 3 we outline our mapping approach and illustrate it using a motivating example. The mapping compilation problem is stated formally in Section 4. The implemented algorithms are presented in Section 5. Experimental results are in Section 6. Related work is discussed in Section 7. Section 8 is the conclusion.

2. THE ENTITY FRAMEWORK

EDM is an extended entity-relationship model. It distinguishes entity types, complex types, and primitive types. Instances of entity types, called entities, can be organized into persistent collections called entity sets. An entity set of type T holds entities of type T or any type that derives from T. Each entity type has a key, which uniquely identifies an entity in the entity set. Entities and complex values may have properties holding other complex values or primitive values. Like entity types, complex types can be specialized through inheritance. However, complex values can exist only as part of some entity. Entities can participate in 1:1, 1:n, or m:n associations, which essentially relate the keys of the respective entities. Notice that a relational schema with key and foreign key constraints can be viewed as a simple EDM schema that contains a distinct entity set and entity type for each relational table. We exploit this property to uniformly specify and manipulate mappings. Henceforth we use the general term extent to refer to entity sets, associations, and tables.
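For instance, read as EDM (a minimal sketch with hypothetical names), each table below becomes an entity set whose entity type is keyed by its primary key, and the link table plays the role of an m:n association relating the two entity keys:

    -- Hypothetical relational schema viewed as a simple EDM schema.
    CREATE TABLE Product  (Id INT PRIMARY KEY, Name VARCHAR(100));
    CREATE TABLE Supplier (Id INT PRIMARY KEY, Name VARCHAR(100));
    CREATE TABLE Supplies (                -- m:n association between keys
      ProductId  INT NOT NULL REFERENCES Product(Id),
      SupplierId INT NOT NULL REFERENCES Supplier(Id),
      PRIMARY KEY (ProductId, SupplierId)
    );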
Entity SQL is a data manipulation language based on SQL. Entity SQL allows retrieving entities from entity sets and navigating from an entity to a collection of entities reachable via a given association. Path expressions can be used to 'dot' into complex values.
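A hypothetical Entity SQL query in the style of the Figure 1 snippets (the SalesPeople entity set is from Figure 1; the Address complex property is our assumption):

    -- Retrieve salespeople hired since 2000, using path expressions to
    -- 'dot' into the Hired date and the complex Address value.
    SELECT p.Name, p.Address.City
    FROM   AdventureWorks.SalesPeople AS p
    WHERE  p.Hired.Year > 2000 AND p.Address.Country = 'USA'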