Union Types for Semistructured Data


Citation for published version: Buneman, P & Pierce, B 1999, Union Types for Semistructured Data. in Union Types for Semistructured Data: 7th International Workshop on Database Programming Languages, DBPL'99 Kinloch Rannoch, UK, September 1–3, 1999 Revised Papers. Lecture Notes in Computer Science, vol. 1949, Springer-Verlag GmbH, pp. 184-207. https://doi.org/10.1007/3-540-44543-9_12

Document Version: Peer reviewed version

Peter Buneman, Benjamin Pierce
University of Pennsylvania, Dept. of Computer and Information Science
South rd Street, Philadelphia, PA, USA
{peter,bcpierce}@cis.upenn.edu
Technical report MS-CIS. Corrected version, July.

Abstract

Semistructured databases are treated as dynamically typed: they come equipped with no independent schema or type system to constrain the data. Query languages that are designed for semistructured data, even when used with structured data, typically ignore any type information that may be present. The consequences of this are what one would expect from using a dynamic type system with complex data: fewer guarantees on the correctness of applications. For example, a query that would cause a type error in a statically typed query language will return the empty set when applied to a semistructured representation of the same data.

Much semistructured data originates in structured data. A semistructured representation is useful when one wants to add data that does not conform to the original type, or when one wants to combine sources of different types. However, the deviations from the prescribed types are often minor, and we believe that a better strategy than throwing away all type information is to preserve as much of it as possible. We describe a system of untagged union types that can accommodate variations in structure while still allowing a degree of static type checking.

A novelty of this system is that it involves non-trivial equivalences among types, arising from a law of distributivity for records and unions: a value may be introduced with one type (e.g., a record containing a union) and used at another type (a union of records). We describe programming and query language constructs for dealing with such types, prove the soundness of the type system, and develop algorithms for subtyping and typechecking.

Introduction

Although semistructured data has, by definition, no schema, there are many cases in which the data
obviously possesses some structure, even if it has mild deviations from that structure. Moreover, it typically has this structure because it is derived from sources that have structure. In the process of annotating data or combining data from different sources, one needs to accommodate the irregularities that are introduced by these processes. Because there is no way of describing "mildly irregular" structure, current approaches start by ignoring the structure completely, treating the data as some dynamically typed object such as a labelled graph, and then perhaps attempting to recover some structure by a variety of pattern matching and data mining techniques [NAM, Ali]. The purpose of this structure recovery is typically to provide optimization techniques for query evaluation or efficient storage structures, and it is partial. It is not intended as a technique for preserving the integrity of data or for any kind of static typechecking of applications.

When data originates from some structured source, it is desirable to preserve that structure if at all possible. The typical cases in which one cannot require rigid conformance to a schema arise when one wants to annotate or modify the database with unanticipated structure, or when one merges two databases with slight differences in structure. Rather than forgetting the original type and resorting to a completely dynamic type, we believe a more disciplined approach to maintaining type information is appropriate. We propose here a type system that can "degrade gracefully" if sources are added with variations in structure, while preserving the common structure of the sources where it exists.

The advantages of this approach include:

  • The ability to check the correctness of programs and queries on semistructured data. Current semistructured query languages [BDHS, AQM+, DFF+] have no way of providing type errors: they typically return the empty answer on data whose type does not conform to the type assumed by the query.
  • The ability to create data at one type and query it at another (equivalent) type. This is a natural consequence of using a flexible type system for semistructured data.
  • New query language constructs that permit the efficient implementation of "case" expressions and increase the expressive power of OQL-style query languages.

As an example, biological databases often have a structure that can be expressed naturally using a combination of tuples, records, and collection types. They are typically cast in special-purpose data formats, and there are groups of related databases, each expressed in some format that is a mild variation on some original format. These formats have an intended type, which could be expressed in a number of notations. For example, a source (source1) could have type

    set[id: Int, description: Str, bibl: set[title: Str, authors: list[name: Str, address: Str], year: Int]]

A second source (source2) might yield a closely related structure:

    set[id: Int, description: Str, bibl: set[title: Str, authors: list[fn: Str, ln: Str, address: Str], year: Int]]

This differs only in the way in which author names are represented. (This example is fictional, but not far removed from what happens in practice.)

The usual solution to this problem in conventional programming languages is to represent the union of the sources using some form of tagged union type:

    set⟨tag1: [id: Int, ...], tag2: [id: Int, ...]⟩

The difficulty with this solution is that a program such as

    for each x in source1 do print(x.description)

that worked on source1 must now be modified to

    for each x in source1 union source2 do
      case x of ⟨tag1 = y1⟩ ⇒ print(y1.description)
              | ⟨tag2 = y2⟩ ⇒ print(y2.description)

in order to work on the union of the sources, even though the two branches of the case statement contain identical code. This is also true for the (few) database query languages that deal with tagged union types [BLS+].

Contrast this with a typical semistructured query:

    select [description = d, title = t]
    where [description = d, bibl.Title = t] ← source1

This query works by pattern matching
based on the (dynamically determined) structure of the data. Thus the same query works equally well against either of the two sources, and hence also against their union. The drawback of this approach, however, is that incorrect queries (for example, queries that use a field that does not exist in either source) yield the empty set rather than an error.

In this paper we define a system that combines the advantages of both approaches, based on a system of type-safe untagged union types. As a first example, consider the two forms of the author field in the types above. We may write the union of these types as:

    [name: Str, address: Str] ∨ [ln: Str, fn: Str, address: Str]

It is intuitively obvious that an address can always be extracted from a value of such a type. To express this formally, we begin by writing a multi-field record type [l1: T1, l2: T2, ...] as a product of single-field record types: [l1: T1] × [l2: T2] × .... In this more basic form, the union type above is:

    ([name: Str] × [address: Str]) ∨ ([ln: Str] × [fn: Str] × [address: Str])

We now invoke a distributivity law that allows us to treat

    [a: Ta] × ([b: Tb] ∨ [c: Tc])   and   ([a: Ta] × [b: Tb]) ∨ ([a: Ta] × [c: Tc])

as equivalent types. Using this, the union type above rewrites to:

    ([name: Str] ∨ ([fn: Str] × [ln: Str])) × [address: Str]

In this form, it is evident that the selection of the address field is an allowable operation. Type equivalences like this distributivity rule allow us to introduce a value at one type and operate on it at another type. Under this system, both the program and the query above will typecheck when extended to the union of the two sources. On the other hand, queries that reference a field that is not in either source will fail to typecheck.

Some care is needed in designing the operations for manipulating values of union types. Usually, the interrogation operation for records is field selection, and the corresponding operation for unions is a case expression. However, it is not enough simply to use these two operations. Consider the type

    ([a1: T1] × [b1: U1]) ∨ ... ∨ ([an: Tn] × [bn: Un])

The form of this type warrants neither selecting a field nor using a case expression.
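
The distributivity law and the field-selection argument can be approximated in TypeScript, whose structural union types are untagged. What follows is only a rough sketch under stated assumptions, not the type system defined in the paper: every type name, function name, and data value below (Author1, AuthorUnion, SourceEntry, addressOf, and so on) is invented for illustration, and TypeScript's assignability rules only loosely mirror the type equivalences described here.

```typescript
// Illustrative sketch only: all names and sample values are hypothetical.
// TypeScript's structural unions stand in for the untagged union types.

// The two author shapes found in source1 and source2.
type Author1 = { name: string; address: string };
type Author2 = { fn: string; ln: string; address: string };

// Union of records: (name x address) v (fn x ln x address).
type AuthorUnion = Author1 | Author2;

// Distributed form: (name v (fn x ln)) x address.
type AuthorDistributed =
  ({ name: string } | { fn: string; ln: string }) & { address: string };

// Each form is assignable to the other, so a value introduced at one
// type can be used at the other without any conversion code.
function toDistributed(a: AuthorUnion): AuthorDistributed {
  return a;
}
function toUnion(a: AuthorDistributed): AuthorUnion {
  return a;
}

// Field selection is accepted because every alternative has `address`.
// Selecting a field found in neither alternative (say `a.phone`) would be
// rejected by the compiler instead of silently yielding nothing.
function addressOf(a: AuthorUnion): string {
  return a.address;
}

// Only the part that really differs needs a case-like test.
function displayName(a: AuthorUnion): string {
  return "name" in a ? a.name : `${a.fn} ${a.ln}`;
}

// A database entry, parameterised by the author representation.
type SourceEntry<A> = {
  id: number;
  description: string;
  bibl: { title: string; authors: A[]; year: number }[];
};

const source1: SourceEntry<Author1>[] = [
  {
    id: 1,
    description: "first",
    bibl: [
      { title: "T1", authors: [{ name: "A. Author", address: "Philadelphia" }], year: 1999 },
    ],
  },
];
const source2: SourceEntry<Author2>[] = [
  {
    id: 2,
    description: "second",
    bibl: [
      { title: "T2", authors: [{ fn: "B.", ln: "Writer", address: "Edinburgh" }], year: 1999 },
    ],
  },
];

// The merged source is typed with the union pushed inside the record.
const merged: SourceEntry<AuthorUnion>[] = [...source1, ...source2];

// The "print every description" program, unchanged by the merge.
function descriptions(source: SourceEntry<AuthorUnion>[]): string[] {
  return source.map((x) => x.description);
}

// Collect every author address without inspecting which source it came from.
function allAddresses(source: SourceEntry<AuthorUnion>[]): string[] {
  const out: string[] = [];
  for (const entry of source) {
    for (const b of entry.bibl) {
      for (const a of b.authors) out.push(addressOf(a));
    }
  }
  return out;
}
```

In this sketch, descriptions and allAddresses need no case analysis because they touch only fields common to both sources, while displayName needs one because the name representation is the only place where the two author shapes differ.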