Joins Vs. Links Or Relational Join Considered Harmful

Joins vs. Links or Relational Join Considered Harmful Alexandr Savinov Bosch Software Innovations GmbH, Stuttgarterstr. 130, 71332 Waiblingen, Germany Keywords: Data Processing, Data Modeling, Join Operation, Links, References, Connectivity. Abstract: Since the introduction of the relational model of data, the join operation is part of almost all query languages and data processing engines. Nowadays, it is not only a formal operation but rather a dominating pattern of thought for the concept of data connectivity. In this paper, we critically analyze properties of this operation, its role and uses by demonstrating some of its fundamental drawbacks in the context of data processing. We also analyze an alternative approach which is based on the concept of link by showing how it can solve these problems. Based on this analysis, we argue that link-based mechanisms should be preferred to joins as a main operation in data model and data processing systems. 1 INTRODUCTION identifies a data element (referent) and is used to access it. It is a very simple and natural concept A data model is a definition of data, that is, it which dominates in programming but is also used in answers the question what is data. It defines how many data processing systems and models. data is organized and how it is manipulated by This paper is devoted to comparing joins and providing structural elements and operations. For links as alternative mechanisms for accessing related example, the relational model (Codd, 1970) data elements which are based on completely organizes data by using tuples, domains and different patterns of thought. We critically analyze relations while the functional data model (Sibley and join operation and demonstrate that it has some Kerschberg; 1977) does the same by using functions. significant drawbacks while links on the other hand One of the major concerns in any data model is can solve these problems. We argue that joins should describing how elements are related or connected as not be used in data modeling and data processing well as providing means for retrieving related systems or at least their role should be significantly elements. The mechanism of connectivity diminished. We also want to show that links can be determines such important aspects as semantic used as a basis for a new mechanism that can replace clarity, conciseness of queries, maintainability and joins and do what joins have been intended for. performance. The paper has the following layout. Section 2 Probably the most wide spread approach to and Section 3 are devoted to describing joins and retrieving related elements in a database is based on links, respectively. Each of these sections starts from the operation of relational algebra called join. A join describing the corresponding mechanism and then takes two or more relations as input and produces a we analyze their more specific properties. Section 4 new relation as output. The output relation contains makes concluding remarks. tuples composed of related tuples from the inputs. The way they are related is specified in the join condition which is a parameter of the operation. 2 “WHO IS TO BLAME?” JOINS Although the join operation dominates in the area of data management, there are also alternative 2.1 What is in a Join? Common Value approaches. One of them defines how data elements are related and accessed by using the notion of link Let us assume that there are two CSV files with a list or reference (we will use these terms of employees and departments which might have the interchangeably because their differences are not followings structure and data: important for this paper). A link is a value which 362 Savinov, A. Joins vs. Links or Relational Join Considered Harmful. DOI: 10.5220/0005932403620368 In Proceedings of the International Conference on Internet of Things and Big Data (IoTBD 2016), pages 362-368 ISBN: 978-989-758-183-0 Copyright c 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved Joins vs. Links or Relational Join Considered Harmful emp, name, dept_id result table is a combination of matching tuples from 25, Smith, RD the two input tables which share the same value. dept, mngr, location Employees RD, 30, Stuttgart emp name dept_id Departments These two lines are stored in different files and 25 Alex RD dept mngr location formally are not connected because there is no 26 John HR RD 30 Dresden indication of any relation between them. However, 27 Anna HR HR 20 Berlin we can easily see that both of these lines store the emp name dept_id dept mngr location 25 Alex RD RD 30 Dresden same value ‘RD’ and we also know that this value (semantically) represents the same entity. Emps_Depts This fact of being characterized by the same emp name dept_id dept mngr locatioin value is a basis for establishing relationships 25 Alex RD RD 30 Dresden between data elements. In our example, the fact that 26 John HR HR 20 Berlin one employee and one department are both 27 Anna HR HR 20 Berlin characterized by the value ‘RD’ is not a coincidence Figure 2: Output relation contains combinations of related but rather a way to represent that they are connected tuples from input relations. (Fig. 1). In other words, two data elements are supposed to be related if they have something in The idea of using common values for matching common or, more specifically, share the same value. data elements has its formal roots in predicate Therefore, this general connectivity principle could calculus and is used various technologies, for be called a common value or shared value approach. example, deductive databases (Ullman & Zaniolo; Importantly, in addition to the two data items being 1990) or query languages. If two predicates in a connected, there is a third item stored in both of logical expression have the same free variables then them and treated as a means of connectivity. There they have to be bound to the same value in order for is no way to connect two elements without the resulting proposition to be true. For example, specifying a third element they share. given two predicates Employees(emp, ename, city) Shared value RD Departments(dept, dname, city) Employees Departments we can define a logical expression Employees(emp, 25 Smith RD RD 30 Stuttgart ename, city) & Departments(dept, dname, Figure 1: Two elements are related if they share a third city) which will be true only if the city variable element (data value). takes the same value. In this way, we can retrieve all employees who are located in (share) the same city Having something in common is a conceptual as their department. definition of related elements. It allows us to determine whether two elements are related or not. 2.2 Join is Symmetric However, the problem is not only to define related elements but also to retrieve them by materializing The underlying semantics of the join operation is this conceptual relationship. In the relational two elements are defined to be related if they have algebra, such a mechanism is provided by the join the same property. This property makes it a operation which is applied to relations and returns a symmetric operation where all inputs have the same new relation. In addition to input relations, this roles. For example, if we look at this join condition operation needs a parameter which provides a Employees.dept_id = Departments.dept criterion of connectivity. The output relation will contains only tuples satisfying this condition. then it is not possible to assign a special role to one Although joins can specify any condition the of the input tables Employees or Departments (Fig. combinations of input tuples have to satisfy (theta- 3). join), the most common case is to select only input Here we see one major problem of joins: in fact, tuples which contain equal values of some attributes these two tables have different semantic roles and (equijoin). Fig. 2 provides an example of joining two the relationship between them is not symmetric. One tables Employees and Departments by producing a indication of asymmetricity is that it is a many-to- new table Emps_Depts. Note that every tuple of the one relationship where many employees belong to one department. Another observation is that this 363 IoTBD 2016 - International Conference on Internet of Things and Big Data same semantics can be (correctly) represented as an query time while FK is a declaration which is used employee record referencing one department (but at design time. Fourth, it can be difficult to not vice versa). Also, the shared value is a understand how to use FKs in the case of arbitrarily department identifier and employee identifier. If we complex join conditions. Essentially, FK is an change the direction of this relationship then we attempt to introduce a mechanism of links but they change the meaning of the connection. Yet, joins are have incompatible semantics and therefore their not able to represent this semantics because all input simultaneous use is quite controversial and eclectic. relations have equal rights in a join. The use of FKs in combination with joins is analogous to introducing constructs for structural join programming along with goto operator. Their String domains simultaneous use will result in strange mixtures of dept_id dept different patterns. Employee Departments relations 2.3 Join is a Cross-cutting Concern Figure 3: Join is symmetric. Let us assume that we want to get a list of In the relational model, a unit of connectivity employees working at some department. It can be (between domains) is one relation and composition done by means of the following query: means joining relations. In other words, join allows SELECT E.emp, D.dept, D.location to compose or chain relations.

Load more