Data Transformations


CHAPTER 9: Data Transformations

Most data sets benefit from one or more data transformations. The reasons for transforming data can be grouped into statistical and ecological reasons:

Statistical
• improve assumptions of normality, linearity, homogeneity of variance, etc.
• make units of attributes comparable when measured on different scales (for example, if you have elevation ranging from 100 to 2000 meters and slope from 0 to 30 degrees)

Ecological
• make distance measures work better
• reduce the effect of total quantity (sample unit totals) to put the focus on relative quantities
• equalize (or otherwise alter) the relative importance of common and rare species
• emphasize informative species at the expense of uninformative species.

Monotonic transformations are applied to each element of the data matrix, independent of the other elements. They are "monotonic" because they change the values of the data points without changing their rank. Relativizations adjust matrix elements by a row or column standard (e.g., maximum, sum, mean, etc.). One transformation described below, Beals smoothing, is unique in being a probabilistic transformation based on both row and column relationships. In this chapter, we also describe other adjustments to the data matrix, including deleting rare species, combining entities, and calculating first differences for time series data.

It is difficult to overemphasize the potential importance of transformations. They can make the difference between illusion and insight, fog and clarity. To use transformations effectively requires a good understanding of their effects and a clear vision of your goals.

Notation. In all of the transformations described below,

x_ij = the original value in row i and column j of the data matrix
b_ij = the adjusted value that replaces x_ij

Domains and ranges

Bear in mind that some transformations are unreasonable or even impossible for certain types of data. Table 9.1 lists the kinds of data that are potentially usable for each transformation.

Monotonic transformations

Power transformation

b_ij = x_ij^p

Different parameters (exponents) change the effect of the transformation: p = 0 gives presence/absence, p = 0.5 gives the square root, and so on. The smaller the parameter, the more compression is applied to high values (Fig. 9.1).

The square root transformation is similar in effect to, but less drastic than, the log transform. Unlike the log transform, it requires no special treatment of zeros. The square root transformation is commonly used. Less frequent is a higher root, such as a cube root or fourth root (Fig. 9.1). For example, Smith et al. (2001) applied a cube root to count data, a choice supported by an optimization procedure. Roots of a power higher than three nearly transform the data to presence-absence: nonzero values become close to one, while zeros remain at zero.

Figure 9.1. Effect of square root and higher root transformations, b = f(x), shown for exponents 1/2, 1/3, 1/4, and 1/10. Note that roots higher than three are essentially presence-absence transformations, yielding values close to 1 for all nonzero values.
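As an illustration of how the exponent controls compression, the following is a minimal sketch in Python with NumPy; the code and the function name are our own, not part of the chapter or any particular package:

```python
import numpy as np

def power_transform(x, p):
    """Monotonic power transformation b_ij = x_ij ** p, applied to each element independently.
    p = 1 leaves the data unchanged, p = 0.5 is the square root, and p = 0
    reduces the data to presence/absence (any nonzero value becomes 1, zeros stay 0)."""
    x = np.asarray(x, dtype=float)
    if p == 0:
        return (x > 0).astype(float)      # presence/absence
    return x ** p

abundances = np.array([0.0, 1.0, 4.0, 25.0, 100.0])
for p in (1.0, 0.5, 1/3, 0.1):
    print(p, np.round(power_transform(abundances, p), 2))
# The smaller the exponent, the more the high values are compressed toward 1,
# while the ranks of the values are unchanged.
```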
Table 9.1. Domain of input and range of output from transformations.

Transformation                      | Reasonable and acceptable domain of x | Range of f(x)
MONOTONIC TRANSFORMATIONS
x^p (power), p = 0                  | all           | 0 or 1 only
x^p (power), p > 0                  | nonnegative   | nonnegative
log(x)                              | positive      | all
(2/π)·arcsin(x)                     | 0 ≤ x ≤ 1     | 0 to 1 inclusive
(2/π)·arcsin(√x)                    | 0 ≤ x ≤ 1     | 0 to 1 inclusive
SMOOTHING
Beals smoothing                     | 0 or 1 only   | 0 to 1 inclusive
ROW/COLUMN RELATIVIZATIONS
general                             | nonnegative   | 0 to 1 inclusive
by maximum                          | nonnegative   | 0 to 1 inclusive
by mean                             | all           | all
by standard deviates                | all           | generally between -10 and 10
binary by mean                      | all           | 0 or 1 only
rank                                | all           | positive integers
binary by median                    | all           | 0 or 1 only
ubiquity                            | nonnegative   | nonnegative
information function of ubiquity    | nonnegative   | nonnegative

Logarithmic transformation

b_ij = log(x_ij)

The log transformation compresses high values and spreads low values by expressing the values as orders of magnitude. It is often useful when there is a high degree of variation within variables or a high degree of variation among attributes within a sample, both of which are commonly true of count data and biomass data.

Log transformations are extremely useful for many kinds of environmental and habitat variables, the lognormal distribution being one of the most common in nature. See Limpert et al. (2001) for a general introduction to lognormal distributions and applications in various sciences. They claim that the abundance of a species follows a truncated lognormal distribution, citing Sugihara (1980) and Magurran (1988). While the nonzero values of community data sets often resemble a lognormal distribution, excluding zeros often amounts to ignoring half of a data set. The lognormal distribution is fundamentally flawed when applied to community data because a zero value is, more often than not, the most frequent abundance value for a species. Nevertheless, the log transformation is extremely useful in community analysis, provided that one carefully handles the problem of log(0) being undefined.

To log-transform data containing zeros, a small number must be added to all data points. If the lowest nonzero value in the data is one (as in count data), then it is best to add one before applying the transformation:

b_ij = log(x_ij + 1)

If, however, the lowest nonzero value of x differs from one by more than an order of magnitude, then adding one will distort the relationship between zeros and other values in the data set. For example, biomass data often contain many small decimal fractions (values such as 0.00345 and 0.00332) ranging up to fairly large values (in the hundreds). Adding a one to the whole data set will tend to compress the resulting distribution at the low end of the scale. The order-of-magnitude difference between 0.003 and 0.03 is lost if you add a one to both values before log transformation: log(1.003) is about the same as log(1.03).

The following transformation is a generalized procedure that (a) tends to preserve the original order of magnitudes in the data and (b) results in values of zero when the initial value was zero. Given:

Min(x) is the smallest nonzero value in the data
Int(x) is a function that truncates x to an integer by dropping digits after the decimal point
c = order of magnitude constant = Int(log(Min(x)))
d = decimal constant = log^-1(c)

then the transformation is

b_ij = log(x_ij + d) - c

Subtracting the constant c from each element of the data set after the log transformation shifts the values such that the lowest value in the data set will be a zero. For example, if the smallest nonzero value in the data set is 0.00345, then log(min(x)) = -2.46, c = Int(log(min(x))) = -2, and d = log^-1(c) = 0.01. Applying the transformation to some example values: if x = 0, then b = log(0 + 0.01) - (-2), so b = 0; if x = 0.00345, then b = log(0.00345 + 0.01) - (-2), so b = 0.128.
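To make the arithmetic above concrete, here is a minimal sketch in Python with NumPy of both the log(x + 1) transformation and the generalized procedure, using base-10 logs as in the worked example; the function names are illustrative, not from the chapter or any statistical package:

```python
import numpy as np

def log_plus_one(x):
    """b_ij = log10(x_ij + 1); appropriate when the smallest nonzero value is one (e.g., counts)."""
    x = np.asarray(x, dtype=float)
    return np.log10(x + 1.0)

def generalized_log(x):
    """Generalized log transformation: b_ij = log10(x_ij + d) - c,
    where c = Int(log10(Min(x))) for the smallest nonzero value Min(x) and d = 10**c.
    Zeros map to zero and the original orders of magnitude are preserved."""
    x = np.asarray(x, dtype=float)
    min_nonzero = x[x > 0].min()
    c = int(np.log10(min_nonzero))   # int() truncates toward zero, like Int() in the text
    d = 10.0 ** c
    return np.log10(x + d) - c

# Reproducing the worked example: smallest nonzero value 0.00345 gives c = -2 and d = 0.01
x = np.array([0.0, 0.00345, 0.03, 150.0])
print(generalized_log(x))   # first value is exactly 0, second is about 0.128
```

Note that when the smallest nonzero value is one, c = 0 and d = 1, so the generalized procedure reduces to the log(x + 1) transformation above.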
Arcsine squareroot transformation

b_ij = (2/π) · arcsin(√x_ij)

The arcsine-squareroot transformation spreads the ends of the scale for proportion data, while compressing the middle (Fig. 9.2). This transformation is recommended by many statisticians for proportion data, often improving normality (Sokal and Rohlf 1995). The data must range between zero and one, inclusive. The arcsine squareroot is multiplied by 2/π to rescale the result so that it ranges from 0 to 1.

The logit transformation, b = ln(x/(1-x)), is also sometimes used for proportion data (Sokal and Rohlf 1995). However, if x = 0 or x = 1, then the logit is undefined. Often a small constant is added to prevent ln(0) and division by zero. Alternatively, empirical logits may be used (see Sokal and Rohlf 1995, p. 762). Because zeros are so common in community data, it seems reasonable to use the arcsine squareroot or squareroot transformations to avoid this problem.

Arcsine transformation

b_ij = (2/π) · arcsin(x_ij)

The constant 2/π scales the result of arcsin(x) [in radians] to range from 0 to 1, assuming that 0 ≤ x ≤ 1. The function arcsin is the same as sin^-1, the inverse sine. Data must range between zero and one, inclusive; if they do not, you should relativize before selecting this transformation. Unlike the arcsine-squareroot transformation, an arcsine transformation is usually counterproductive in community ecology, because it tends to spread the high values and compress the low values (Fig. 9.2). This might be useful for distributions with negative skew, but community data almost always have positive skew.

Figure 9.2. Effect of several transformations on proportion data, comparing arcsin(√x) and arcsin(x).

Beals smoothing

Beals smoothing replaces each cell in the community matrix with a probability of the target species occurring in that particular sample unit, based on the joint occurrences of the target species with the species that are actually in the sample unit. The purpose of this transformation (also known as the sociological favorability index; Beals 1984) is to relieve the "zero-truncation problem" (Beals 1984). Avoid this transformation if your data are quantitative and you do not want to lose this information.

Beals smoothing can be slow to compute. If you have a large data set and a slow computer, be sure to allocate plenty of time. This transformation is available in PC-ORD but apparently not in other packages for statistical analysis.

Relativizations

"To relativize or not to relativize, that focuses the question." (Shakespeare ????)
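The Relativizations section is only beginning here, but the earlier definition (adjusting each matrix element by a row or column standard such as the maximum, sum, or mean) and the entries of Table 9.1 are enough for a minimal sketch. The Python/NumPy code and function names below are our own illustration under those assumptions, not the chapter's implementation:

```python
import numpy as np

def relativize_by_column_maximum(x):
    """Divide each element by its column (species) maximum, so each species ranges 0 to 1."""
    x = np.asarray(x, dtype=float)
    col_max = x.max(axis=0)
    return np.divide(x, col_max, out=np.zeros_like(x), where=col_max > 0)

def relativize_by_row_total(x):
    """Divide each element by its row (sample unit) total, shifting the focus from
    absolute quantities to relative quantities within each sample unit."""
    x = np.asarray(x, dtype=float)
    row_sum = x.sum(axis=1, keepdims=True)
    return np.divide(x, row_sum, out=np.zeros_like(x), where=row_sum > 0)

community = np.array([[0.0, 5.0, 20.0],
                      [2.0, 0.0, 10.0],
                      [1.0, 1.0,  0.0]])
print(relativize_by_column_maximum(community))
print(relativize_by_row_total(community))
```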