IDACT TRANSFORMATION MANAGER by Brian Hay A

Total Page:16

File Type:pdf, Size:1020Kb

IDACT TRANSFORMATION MANAGER by Brian Hay A IDACT TRANSFORMATION MANAGER by Brian Hay A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science MONTANA STATE UNIVERSITY Bozeman, Montana November 2005 © COPYRIGHT by Brian Hay 2005 All Rights Reserved ii APPROVAL of a dissertation submitted by Brian Hay This dissertation has been read by each member of the dissertation committee and has been found to be satisfactory regarding content, English usage, format, citations, bibliographic style, and consistency, and is ready for submission to the College of Graduate Studies. Michael Oudshoorn Approved for the Department of Computer Science Michael Oudshoorn Approved for the College of Graduate Studies Joseph J. Fedock iii STATEMENT OF PERMISSION TO USE In presenting this dissertation in partial fulfillment of the requirements for a doctoral degree at Montana State University, I agree that the Library shall make it available to borrowers under rules of the Library. I further agree that copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for extensive copying or reproduction of this dissertation should be referred to ProQuest Information and Learning, 300 North Zeeb Road, Ann Arbor, Michigan 48106, to whom I have granted “the exclusive right to reproduce and distribute my dissertation in and from microform along with the non- exclusive right to reproduce and distribute my abstract in any format in whole or in part.” Brian Hay November 2005 iv TABLE OF CONTENTS 1. PROBLEM STATEMENT....................................................................................... 1 Overview............................................................................................................ 1 Detailed Problem Description............................................................................ 2 Unorganized Data.................................................................................. 3 Data Format Issues................................................................................. 4 Incomplete or Missing Metadata........................................................... 7 Problem Statement............................................................................................. 8 2. FOUNDATIONAL WORK AND CURRENT APPROACHES............................. 10 General Foundational Work............................................................................... 10 Middleware and Data Portals…............................................................. 10 Integration of Heterogeneous Schemas.................................................. 10 Transformation Tools............................................................................. 12 Transformation Components.................................................................. 12 Foundational Work............................................................................................ 12 SynCon Project Data System................................................................. 13 SIMON................................................................................................... 18 Other Systems...…............................................................................................. 20 Atmospheric Radiation Monitoring Program........................................ 20 Storage Resource Broker…................................................................... 21 3. SOLUTION SCOPE................................................................................................. 22 Intelligent Data Acquisition, Collection, and Transformation (IDACT) System.......................................................................................... 22 Overview................................................................................................ 22 Data Source Registry............................................................................. 23 Query Manager...................................................................................... 24 Transformation Manager........................................................................ 24 Schedule Manager.................................................................................. 25 Data Warehouse..................................................................................... 25 Solution Scope................................................................................................... 26 4. SOLUTION DESCRIPTION.................................................................................... 27 Overview............................................................................................................ 27 Transformation Mechanisms.............................................................................. 28 Introduction............................................................................................ 28 Identity Transformations........................................................................ 29 Predefined Transformations................................................................... 29 v TABLE OF CONTENTS (CONTINUED) Transformation Sequences..................................................................... 29 Custom Transformations........................................................................ 34 Custom Transformation Creation Process......................................................... 35 Transformation Components.................................................................. 35 Common Fields...................................................................................... 36 Simple Associations............................................................................... 37 Split Associations................................................................................... 38 Combine Associations............................................................................ 39 Association Database............................................................................. 40 Adding New Data Sources..................................................................... 41 Conflict Resolution in Building the Candidate Association List……... 44 Partial String Matching for Name-Based Field Mapping...................... 45 Context-Based Conflict Resolution....................................................... 46 Creation of New Custom Transformations............................................ 48 Conclusions........................................................................................................ 55 5. PROTOTYPE IMPLEMENTATION...................................................................... 56 Architecture........................................................................................................ 56 DSR Database.................................................................................................... 56 Transformation Manager Database.................................................................... 57 Transformation Manager Engine....................................................................... 58 Implementation Tools and Languages................................................... 60 Identity Transformations........................................................................ 61 Transformation Sequences..................................................................... 61 Creating Custom Transformations......................................................... 64 Actions................................................................................................... 75 TM User Interface.............................................................................................. 76 Command Line Interface....................................................................... 76 Graphical User Interface........................................................................ 76 Security Issues…................................................................................................ 77 Source Code……............................................................................................... 77 6. TESTING.................................................................................................................. 78 Overview............................................................................................................ 78 Test Environments............................................................................................. 78 Testing Tools.......................................................................................... 78 Transformation Sequence Testing......................................................... 79 Action Testing........................................................................................ 81 Transformation Testing.......................................................................... 84 Performance Testing.............................................................................. 86 vi TABLE OF CONTENTS (CONTINUED) Custom Transformation Generation Testing.......................................... 89 Evaluation Methods........................................................................................... 89 7. SUMMARY............................................................................................................. 92 Conclusions........................................................................................................ 92 Future Work......................................................................................................
Recommended publications
  • Informal Data Transformation Considered Harmful
    Informal Data Transformation Considered Harmful Eric Daimler, Ryan Wisnesky Conexus AI out the enterprise, so that data and programs that depend on that data need not constantly be re-validated for ev- ery particular use. Computer scientists have been develop- ing techniques for preserving data integrity during trans- formation since the 1970s (Doan, Halevy, and Ives 2012); however, we agree with the authors of (Breiner, Subrah- manian, and Jones 2018) and others that these techniques are insufficient for the practice of AI and modern IT sys- tems integration and we describe a modern mathematical approach based on category theory (Barr and Wells 1990; Awodey 2010), and the categorical query language CQL2, that is sufficient for today’s needs and also subsumes and unifies previous approaches. 1.1 Outline To help motivate our approach, we next briefly summarize an application of CQL to a data science project undertaken jointly with the Chemical Engineering department of Stan- ford University (Brown, Spivak, and Wisnesky 2019). Then, in Section 2 we review data integrity and in Section 3 we re- view category theory. Finally, we describe the mathematics of our approach in Section 4, and conclude in Section 5. We present no new results, instead citing a line of work summa- rized in (Schultz, Spivak, and Wisnesky 2017). Image used under a creative commons license; original 1.2 Motivating Case Study available at http://xkcd.com/1838/. In scientific practice, computer simulation is now a third pri- mary tool, alongside theory and experiment. Within
    [Show full text]
  • Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation Jie Song George Alter H
    C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation Jie Song George Alter H. V. Jagadish University of Michigan University of Michigan University of Michigan Ann Arbor, Michigan Ann Arbor, Michigan Ann Arbor, Michigan [email protected] [email protected] [email protected] ABSTRACT CCS CONCEPTS Datasets are often derived by manipulating raw data with • Information systems → Data provenance; Extraction, statistical software packages. The derivation of a dataset transformation and loading. must be recorded in terms of both the raw input and the ma- nipulations applied to it. Statistics packages typically provide KEYWORDS limited help in documenting provenance for the resulting de- data transformation, data documentation, data provenance rived data. At best, the operations performed by the statistical ACM Reference Format: package are described in a script. Disparate representations Jie Song, George Alter, and H. V. Jagadish. 2019. C2Metadata: Au- make these scripts hard to understand for users. To address tomating the Capture of Data Transformations from Statistical these challenges, we created Continuous Capture of Meta- Scripts in Data Documentation. In 2019 International Conference data (C2Metadata), a system to capture data transformations on Management of Data (SIGMOD ’19), June 30-July 5, 2019, Am- in scripts for statistical packages and represent it as metadata sterdam, Netherlands. ACM, New York, NY, USA, 4 pages. https: in a standard format that is easy to understand. We do so by //doi.org/10.1145/3299869.3320241 devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a 1 INTRODUCTION large fraction of data manipulation performed in practice.
    [Show full text]
  • Improving E-Learning Environments for Pen and Multi-Touch Based Interaction a Study Case on Blog Tools and Mobile Devices
    eLmL 2014 : The Sixth International Conference on Mobile, Hybrid, and On-line Learning Improving e-Learning Environments for Pen and Multi-touch Based Interaction A study case on blog tools and mobile devices André Constantino da Silva Heloísa Vieira da Rocha Institute of Computing (PG) Institute of Computing, NIED UNICAMP, IFSP UNICAMP Campinas, Brazil, Hortolândia, Brazil Campinas, Brazil [email protected] [email protected] Abstract — e-Learning environments are applications that use power to process Web pages. So, it is possible to access blog the Web infra-structure to support teaching and learning tools, read the messages, post new messages and write activities; they are designed to have good usability using a comments through mobile devices. But, it is important to desktop computer with keyboard, mouse and high resolution consider that these tools (and so their Web pages) are medium-size display. Devices equipped with pen and touch developed to be accessed by desktop computers equipped sensitive screen have enough computational power to render with keyboard, mouse and a medium size display; in your Web pages and allow users to navigate through the e-learning previous work we described that when a user interface environments. But, pen-based or touch sensitive devices have a designed for a set of interaction styles is accessed by a different input style; decreasing the usability of e-learning different set of interaction styles the users face interaction environments due the interaction modality change. To work on problems [5]. Another problem is that it is not possible to mobile contexts, e-learning environments must be improved to consider the interaction through pen and touch.
    [Show full text]
  • What Is a Data Warehouse?
    What is a Data Warehouse? By Susan L. Miertschin “A data warehouse is a subject oriented, integrated, time variant, nonvolatile, collection of data in support of management's decision making process.” https: //www. bus iness.auc. dk/oe kostyr /file /What_ is_a_ Data_ Ware house.pdf 2 What is a Data Warehouse? “A copy of transaction data specifically structured for query and analysis” 3 “Data Warehousing is the coordination, architected, and periodic copying of data from various sources, both inside and outside the enterprise, into an environment optimized for analytical and informational processing” ‐ Alan Simon Data Warehousing for Dummies 4 Business Intelligence (BI) • “…implies thinking abstractly about the organization, reasoning about the business, organizing large quantities of information about the business environment.” p. 6 in Giovinazzo textbook • Purpose of BI is to define and execute a strategy 5 Strategic Thinking • Business strategist – Always looking forward to see how the company can meet the objectives reflected in the mission statement • Successful companies – Do more than just react to the day‐to‐day environment – Understand the past – Are able to predict and adapt to the future 6 Business Intelligence Loop Business Intelligence Figure 1‐1 p. 2 Giovinazzo • Encompasses entire loop shown Business Strategist • Data Storage + ETC = OLAP Data Mining Reports Data Warehouse Data Storage • Data WhWarehouse + Tools (yellow) = Extraction,Transformation, Cleaning DiiDecision Support CRM Accounting Finance HR System 7 The Data Warehouse Decision Support Systems Central Repository Metadata Dependent Data DtData Mar t EtExtrac tion DtData Log Administration Cleansing/Tranformation External Extraction Source Extraction Store Independent Data Mart Operational Environment Figure 1-2 p.
    [Show full text]
  • POLITECNICO DI TORINO Repository ISTITUZIONALE
    POLITECNICO DI TORINO Repository ISTITUZIONALE Rethinking Software Network Data Planes in the Era of Microservices Original Rethinking Software Network Data Planes in the Era of Microservices / Miano, Sebastiano. - (2020 Jul 13), pp. 1-175. Availability: This version is available at: 11583/2841176 since: 2020-07-22T19:49:25Z Publisher: Politecnico di Torino Published DOI: Terms of use: Altro tipo di accesso This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository Publisher copyright (Article begins on next page) 08 October 2021 Doctoral Dissertation Doctoral Program in Computer and Control Enginering (32nd cycle) Rethinking Software Network Data Planes in the Era of Microservices Sebastiano Miano ****** Supervisor Prof. Fulvio Risso Doctoral examination committee Prof. Antonio Barbalace, Referee, University of Edinburgh (UK) Prof. Costin Raiciu, Referee, Universitatea Politehnica Bucuresti (RO) Prof. Giuseppe Bianchi, University of Rome “Tor Vergata” (IT) Prof. Marco Chiesa, KTH Royal Institute of Technology (SE) Prof. Riccardo Sisto, Polytechnic University of Turin (IT) Politecnico di Torino 2020 This thesis is licensed under a Creative Commons License, Attribution - Noncommercial- NoDerivative Works 4.0 International: see www.creativecommons.org. The text may be reproduced for non-commercial purposes, provided that credit is given to the original author. I hereby declare that, the contents and organisation of this dissertation constitute my own original work and does not compromise in any way the rights of third parties, including those relating to the security of personal data. ........................................ Sebastiano Miano Turin, 2020 Summary With the advent of Software Defined Networks (SDN) and Network Functions Virtualization (NFV), software started playing a crucial role in the computer net- work architectures, with the end-hosts representing natural enforcement points for core network functionalities that go beyond simple switching and routing.
    [Show full text]
  • Data Warehousing on AWS
    Data Warehousing on AWS March 2016 Amazon Web Services – Data Warehousing on AWS March 2016 © 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only. It represents AWS’s current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers. Page 2 of 26 Amazon Web Services – Data Warehousing on AWS March 2016 Contents Abstract 4 Introduction 4 Modern Analytics and Data Warehousing Architecture 6 Analytics Architecture 6 Data Warehouse Technology Options 12 Row-Oriented Databases 12 Column-Oriented Databases 13 Massively Parallel Processing Architectures 15 Amazon Redshift Deep Dive 15 Performance 15 Durability and Availability 16 Scalability and Elasticity 16 Interfaces 17 Security 17 Cost Model 18 Ideal Usage Patterns 18 Anti-Patterns 19 Migrating to Amazon Redshift 20 One-Step Migration 20 Two-Step Migration 20 Tools for Database Migration 21 Designing Data Warehousing Workflows 21 Conclusion 24 Further Reading 25 Page 3 of 26 Amazon Web Services – Data Warehousing on AWS March 2016 Abstract Data engineers, data analysts, and developers in enterprises across the globe are looking to migrate data warehousing to the cloud to increase performance and lower costs.
    [Show full text]
  • Clarifying the Legacy Data Conversion Plan & Introducing The
    PharmaSUG 2017 - Paper SS11 Documenting Traceability for the FDA: Clarifying the Legacy Data Conversion Plan & Introducing the Study Data Traceability Guide David C. Izard, Chiltern International Ltd. Kristin C. Kelly, Merck & Co. Inc. Jane A. Lozano, Eli Lilly & Company ABSTRACT Traceability from data collection through to the presentation of analysis results has always been a concern of the US Food & Drug Administration (FDA). The introduction of electronic data as part of submission added additional steps to confirm provenance of information. Now the requirement to provide clinical and non-clinical data based on a set of FDA endorsed data standards adds exponentially to the challenge, especially if legacy format data structures were utilized when the study was originally executed and reported but data meeting FDA requirements must be present in your submission. The PhUSE organization, made up of volunteers across industry, has worked closely with the FDA to develop tools to support the organization, presentation and interpretation of clinical and non-clinical data to regulatory bodies. Examples include the Study & Analysis Data Reviewer's Guides and the Study Data Standardization Plan. These documents describe routine situations where FDA endorsed data standards are deployed at the time a study is initiated; additional support is needed when the provenance of the data is not as straightforward. The FDA's Study Data Technical Conformance Guide calls out for the need to provide a Legacy Data Conversion Plan & Report when legacy data is the source of deliverables based on FDA endorsed data standards, but it is not very clear as to when you must provide one.
    [Show full text]
  • Design and Implementation of ADPCM Based Audio Compression Using Verilog
    The International Journal Of Engineering And Science (IJES) || Volume || 3 || Issue || 5 || Pages || 50-55 || 2014 || ISSN (e): 2319 – 1813 ISSN (p): 2319 – 1805 Design and Implementation of ADPCM Based Audio Compression Using Verilog Hemanthkumar M P, Swetha H N Department of ECE, B.G.S. Institute of Technology, B.G.Nagar, Mandya-571448 Assistant Professor ,Department of ECE, B.G.S. Institute of Technology, B.G.Nagar, Mandya-571448 ---------------------------------------------------------ABSTRACT----------------------------------------------------------- Internet-based voice transmission, digital telephony, intercoms, telephone answering machines, and mass storage, we need to compress audio signals. ADPCM is one of the techniques to reduce the bandwidth in voice communication. More frequently, the smaller file sizes of compressed but lossy formats are used to store and transfer audio. Their small file sizes allow faster Internet transmission, as well as lower consumption of space on memory media. However, lossy formats trade off smaller file size against loss of audio quality, as all such compression algorithms compromise available signal detail. This paper discusses the implementation of ADPCM algorithm for audio compression of .wav file. It yields a compressed file ¼ of the size of the original file. The sound quality of the audio file is maintained reasonably after compression. KEYWORDS: ADPCM, audio compression. --------------------------------------------------------------------------------------------------------------------------------------
    [Show full text]
  • The Development of Algorithms for On-Demand Map Editing for Internet and Mobile Users with Gml and Svg
    THE DEVELOPMENT OF ALGORITHMS FOR ON-DEMAND MAP EDITING FOR INTERNET AND MOBILE USERS WITH GML AND SVG Miss. Ida K.L CHEUNG a, , Mr. Geoffrey Y.K. SHEA b a Department of Land Surveying & Geo-Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, email: [email protected] b Department of Land Surveying & Geo-Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, email: [email protected] Commission VI, PS WG IV/2 KEY WORDS: Spatial Information Sciences, GIS, Research, Internet, Interoperability ABSTRACT: The widespread availability of the World Wide Web has led to a rapid increase in the amount of data accessing, sharing and disseminating that gives the opportunities for delivering maps over the Internet as well as small mobile devices. In GIS industry, many vendors or companies have produced their own web map products which have their own version, data model and proprietary data formats without standardization. Such problem has long been an issue. Therefore, Geographic Markup Language (GML) was designed to provide solutions. GML is an XML grammar written in XML Schema for the modelling, transport, and storage of geographic information including both spatial and non-spatial properties of geographic features. GML is developed by Open GIS Consortium in order to promote spatial data interoperability and open standard. Since GML is still a newly developed standard, this promising research field provides a standardized method to integrate web-based mapping information in terms of data modelling, spatial data representation mechanism and graphic presentation. As GML is not for data display, SVG is an ideal vector graphic for displaying geographic data.
    [Show full text]
  • Master Data Simplified
    MASTER THE BEST IN MASTER DATA GOVERNANCE DATA SIMPLIFICATION & MANAGEMENT SIMPLIFIED CHARLIE MASSOGLIA & ANBARASAN MURUGAN ‘A MUST READ FOR DATA ENTHUISIASTS’ - AUSTIN DAVIS About the Authors Charlie Massoglia VP & CIO, Chain-Sys Corporation Former CIO for Dawn Food Products For 13+ years. 25+ years experience with a variety of ERP systems. Extensive experience in system migrations & conversions. Participated in 9 acquisitions ranging from a single US location to 14 sites in 11 countries. Author of numerous technical books, articles, presentations, and seminars globally. Anbarasan Murugan Product Lead, Master Data Management Master Data Simplification & Governance expert. Industry experience of more than 11 years. Chief Technical Architect for more than 10 products TM within the Chain Sys Platform . Has designed complex analytical & transactional Master data processes for Fortune 500 companies. Master Data Simplified An Introduction to Master Data Simplification, Governance, and Management By Charles L. Massoglia VP & CIO Chain●Sys Corporation [email protected] and Anbarasan Murugan Product Manager Chain●Sys Corporation [email protected] No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without written permission of the publisher. For information regarding permission, write to Chain-Sys Corporation, Attention: Permissions Department, 325 S. Clinton Street, Suite 205, Grand Ledge, MI 48837 Trademarks: Chain●Sys Platform is a trademark of Chain-Sys Corporation in the United States and other countries and may not be used without permission.
    [Show full text]
  • Mobile Pen-And-Paper Interaction
    Mobile Pen-and-Paper Interaction Infrastructure Design, Conceptual Frameworks of Interaction and Inter- action Theory Dissertation genehmigt zur Erlangung des akademischen Grades eines Doktor-Ingenieurs (Dr.-Ing.) vorgelegt von Dipl.-Inform. Farid Hamza Felix Heinrichs geb. in Frankfurt am Main am Fachbereich Informatik der Technischen Universit at¨ Darmstadt Gutachter Prof. Dr. Max M uhlh¨ auser¨ (TU Darmstadt) Prof. Dr. Johannes Sch oning¨ (Universiteit Hasselt) Tag der Einreichung 26. August 2015 Tag der Disputation 12. November 2015 Darmstadt 2015 D17 ÕækQË @ á ÔgQË @ é <Ë @ Õ æ . ”In the name of God, the Most Gracious, the Most Merciful” Abstract Although smartphones, tablets and other mobile devices become increasingly popu- lar, pen and paper continue to play an important role in mobile settings, such as note taking or creative discussions. However, information on paper documents remains static and usage practices involving sharing, researching, linking or in any other way digitally processing information on paper are hindered by the gap between the digi- tal and physical worlds. A considerable body of research has leveraged digital pen technology in order to overcome this problem with respect to static settings, however, systematically neglecting the mobile domain. Only recently, several approaches began exploring the mobile domain and develop- ing initial insights into mobile pen-and-paper interaction (mPPI), e.g., to publish digi- tal sketches, [Cowan et al., 2011], link paper and digital artifacts, [Pietrzak et al., 2012] or compose music, [Tsandilas, 2012]. However, applications designed to integrate the most common mobile tools pen, paper and mobile devices, thereby combining the benefits of both worlds in a hybrid mPPI ensemble, are hindered by the lack of sup- porting infrastructures and limited theoretical understanding of interaction design in the domain.
    [Show full text]
  • Merchandising Data Conversion Implementation Guide
    Oracle® Retail Merchandising Conversion Implementation Guide Release 19.0 F24057-02 May 2020 Oracle® Retail Merchandising Conversion Implementation Guide, Release 19.0 Copyright © 2020, Oracle and/or its affiliates. All rights reserved. Primary Author: Contributors: This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this software or related documentation is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs. No other rights are granted to the U.S.
    [Show full text]