Reactively Querying an Immutable Log of Relational Facts. Design And
Total Page:16
File Type:pdf, Size:1020Kb
Reactively Querying an Immutable Log of Relational Facts Design and Implementation of a Bitemporal EAV Database with a Subscription Query Language Master Thesis Submitted in partial fulfillment of the requirements for the degree of Master of Science in Engineering to the University of Applied Sciences FH Campus Wien Master Degree Program Software Design and Engineering Author: Albert Zak Student identification number: 1810838004 Supervisor: Priv.-Doz. Mag.rer.soc.oec. Dipl.-Ing. Dipl.-Ing. Dr.techn. Karl Michael Göschka Date: 2020-05-31 Declaration of authorship: I declare that this Master Thesis has been written by myself. I have not used any other than the listed sources, nor have I received any unauthorized help. I hereby certify that I have not submitted this Master Thesis in any form (to a reviewer for assessment) either in Austria or abroad. Furthermore, I assure that the (printed and electronic) copies I have submitted are identical. Date: Signature: Abstract Growing customer demands lead to increased incidental complexity of data-intensive distributed applications. Relational Database Management Systems (RDBMS), es- pecially those implementing Structured Query Language (SQL) possess performant but destructive-by-default write semantics. Modeling domains such as healthcare in classic RDBMS leads to an explosion in the number of columns and tables. Their structure has to be known in advance, discouraging an explorative development process. Requirements of real time collaboration and auditability of changes push the limit of established paradigms. The core of the data layer presented in this thesis is a simple relational model based on facts in the form of Entity-Attribute-Value (EAV) triples. A central append- only immutable log accretes these facts via assertions and retractions. Transactions guarantee atomicity/consistency/isolation/durability (ACID) and are themselves first class queryable entities carrying arbitrary meta facts, realizing strict bitemporal auditability of all changes by keeping two timestamps: transaction time tx and valid time tv. Changes are replicated to clients which subscribe to the result of a query. Multiple incrementally maintained indices (EAVT, AEVT, AVET, VAET) grant efficient direct access to tuples, regarding the database analogous to a combined graph, column, and document store. The database itself is an immutable value which can be passed within the program and queried locally. This work demonstrates the feasibility of implementing various desirable features in less than 400 lines of Clojure: bitemporality, audit logging, transactions, server- client reactivity, consistency criteria, derived facts, and a simple relational query language based on Datalog. i Key Terms EAV Database Bitemporal Immutable Datalog Lisp ii List of Abbreviations tv valid time tx transaction time 6NF 6th normal form ACID Atomicity, Consistency, Isolation, Durability API Application Programming Interface CAP Consistency, Availability, Partition tolerance CQRS Command Query Responsibility Segregation CRDT Conflict-free Replicated Data Type CRUD Create, Read, Update, Delete DDD Domain-Driven Design DDP Decentralized Data Protocol DRP Distributed Reactive Programming DSL Domain-specific language edn Extensible Data Notation ES Event Sourcing FRP Functional-Reactive Programming IETF Internet Engineering Task Force JSON JavaScript Object Notation lvar logic variable MUMPS Massachusetts General Hospital Utility Multi- Programming System ORM Object-Relational Mapping OT Operational Transformation PACELC Partitions: Availability or Consistency, Else Latency or Consistency iii RAD Rapid Application Development RDBMS Relational Database Management Systems RDF Resource Description Framework REST Representational State Transfer RPC Remote Procedure Call SaaS Software-as-a-Service SPA Single Page Application SQL Structured Query Language UI User Interface UUID Universally Unique Identifier XML Extensible Markup Language iv Contents 1 Introduction 1 1.1 Problem . .3 1.2 Contribution . .3 1.3 State of the art . .4 2 Related Work 8 3 Design 12 3.1 Problems . 12 3.2 Goals . 14 3.3 Non-goals . 15 3.4 Conceptual model . 15 3.5 Query language . 17 4 Implementation 19 4.1 Data model . 21 4.2 Writing . 23 4.3 Querying . 27 4.4 Publishing and subscribing . 30 5 Discussion 32 5.1 Advantages . 32 5.2 Limitations . 33 6 Future Work 35 7 Conclusion 37 v 1 Introduction This thesis describes a possible design and a proof-of-concept implementation of a data layer for near real time distributed collaborative applications targeted at operational workloads in small to medium sized businesses with strict auditability requirements. Context. The ideas presented in this work are based on the author’s experience running a Software-as-a-Service (SaaS) business developing and operating a highly available information system in the medical sector for the past five years (see figure 1). The deployed system covers all of the daily operational needs of medium-sized practices, handling medical records, personal data, and associated imagery for more than 80.000 patients a year, scheduling appointments, employees, departments, and rooms subject to complex business rules constraining available treatments, forecasting, planning and reporting clinical workload, sending appointment reminders, handling inbound calls and messages from patients, recording internal referrals and escalations, along with support for distributed staff (e.g. external call centers, telemedicine) owing to fine-grained roles and authorization controls, thorough audit logging of all changes, and compliance with relevant regulations. Figure 1: The author’s medical information system 1 Background. The system is built on top of the open source Meteor framework (presented in section 2) and MongoDB, making heavy use of their "live query" mechanisms, along with Redux, React and Minimongo in the front end, packaged with Electron and React Native. To increase availability, the back end is deployed over three data centers using primary-backup replication. Currently, the tracked code base comprises about 42.000 lines of JavaScript. Being a system designed, developed, and operated by a single person, the primary challenge by far has been to keep runaway incidental complexity at bay. While the abstractions around "live" publication/subscription provided by the Meteor framework turned out to be very expressive, other design decisions of the underlying framework and database caused unnecessary complexity in the application code. Drawbacks include pervasive mutability, the need to denormalize data, the lack of a relational query language, no server-side joins, and no (bi-)temporality or notions of time or memory for auditing purposes. Outline. Section 2 presents related work. Section 3 exhibits common problems experienced in data-intensive business applications and shows a different approach towards solving some of them in a functional, immutable, and reactive way. A proof- of-concept implementation is described in section 4 and its merits and deficiencies are discussed in section 5. 2 1.1 Problem Classic Relational Database Management Systems (RDBMS) are not suitable for modeling sparse, hierarchical or multi-valued data in domains where evolution and dynamism are hard requirements. They also lack a notion of memory or change over time because writes are destructive by default. Attempts to add concepts of temporality to Structured Query Language (SQL) are complicated and, as a result, not widely used. Auditing changes to a mutable black box with no history is hard. Developers fear the networking aspects of sending queries on a round trip "over there" [Hic12a] to the database, as opposed to having the data in memory and being able to query it directly. On top of all, customers demand increasingly interactive distributed collaboration environments, which pushes the limits of established request-response mechanisms. 1.2 Contribution This thesis explains a possible design of a data layer for business applications and describes its prototypal implementation. The core is a simple relational data model based on facts in the form of Entity-Attribute-Value (EAV) tuples similar to the 6th normal form (6NF) or the Resource Description Framework (RDF) as seen in Semantic Web literature. Such EAV tuples are accreted in an append-only (immutable) log as assertions and retractions together with two timestamps: transaction time (tx) at which the fact was added to the log, and valid time (tv) at which the fact became true in the wider context of the system [Sno92] i.e., the real world. Goal. The original goal of this work was to demonstrate the feasibility of an im- plementation of various desirable data layer features in less than 1000 lines total of readily comprehensible Lisp (Clojure, more specifically ClojureScript [Hic08]): Bitemporality, practically immutable audit logging, transaction metadata, server-client reactivity, schemalessness, consistency criteria, derived facts, versatile indexing, transactions with guarantees of Atomicity, Consistency, Isolation, Durability (ACID), and a simple relational query language based on Datalog. Method. The thesis explains the decisions made while iterating on the design and its implementation and includes a discussion on limitations and advantages. Results. Embracing the EAV+tx+tv data model together with the expressive pro- gramming language ClojureScript reveals that it is possible to implement a proof- of-concept data layer and query language for business applications with various