What is a multi-model database and why use it?

An ArangoDB White Paper (April 2020)

Creating applications is very much like life itself. You never know what might happen tomorrow or two months from now. We commit to technology choices based on assumptions about the future.

Choosing the right database for an application is fraught with difficulty, since it involves committing up front to a particular data model that can easily be invalidated by unpredictably changing requirements. Making the right choice involves prioritizing capabilities and making compromises based on what you know at the time, what you think might happen, the skills and resources you have at the time, and a lot of luck.

If your use case requires relational, document, and graph capabilities, do you pick three different kinds of database technologies or do you pick the one you think is the most dominant? Often, compromises lead to multiple kinds of databases in the same project, resulting in operational friction, inconsistency, redundancy, and latency.

Picking the wrong database model can be very costly. What if you could future-proof your database choice? Native multi-model databases insulate you from changing requirements and from lock-in.

This white paper explains what multi-model databases are, why it makes sense to use them, and illustrates how to apply them using an aircraft fleet management use case.

Copyright ArangoDB Inc.

Table of Contents

1 What is a native multi-model database?
2 Why native multi-model?
3 Data modeling with native multi-model databases
3.1 Aircraft fleet maintenance: A case study
3.1.1 A data model for an aircraft fleet
3.1.2 Queries for aircraft fleet maintenance
3.1.3 Using multi-model querying
3.2 Lessons learned for data modeling
4 Further use cases for native multi-model databases

1 What is a native multi-model database?

Polyglot persistence has been becoming the new normal since the explosion of single-model NoSQL databases in 2010 [zdnet 2020]. However, the proliferation of different kinds of databases, and the orchestration needed to keep the resulting systems-of-systems in sync, can become extremely complex, burdensome, and expensive. By contrast, native multi-model databases, like ArangoDB, allow polyglot persistence in the same datastore. This eliminates the complex data orchestration, data transformation, and diversity of database knowledge needed to support polyglot persistence.

Native multi-model databases, like ArangoDB, support multiple data models (e.g., key-value, document, graph, SQL) seamlessly and at the same time in one core system using a single unified query language. This definition excludes databases that support one model at a time and need to be switched between data models. It also excludes databases that use views to create facades for different data models (layered multi-model). In a true multi-model database, the different models coexist as first-class citizens and are interoperable in a multi-model query language.

So what is a native multi-model database?

In short, a native multi-model database has one core, one query language, but multiple data models. A native multi-model database is a combination of several data stores in one system. You can store data as key/value pairs, graphs or documents and can access your data seamlessly with one declarative query language, combining different models in a single query. You can build high-performance applications and scale horizontally using all data models to their full extent.

Figure 1: A native multi-model database has multiple coexisting data models accessed via one query language.

2 Why native multi-model?

Over the past decade, software architects have realized that there are benefits to leveraging a variety of data models for different parts of the persistence layer of software projects. As a result, polyglot persistence has become increasingly popular.

However, earlier applications of polyglot persistence combined multiple single-model databases in one software system. In this earlier approach, you would use a relational database to persist structured tabular data; a document store for unstructured object-like data; a key/value store for a hash table; and a graph database for highly linked referential data. Combining multiple single-model databases in the same project often leads to operational friction (more complicated deployment, more frequent upgrades) as well as data consistency and duplication issues.

A single native multi-model database provides a superior solution for the following three reasons:

Firstly, most use cases require diverse data models. Choosing a single data model forces you either to force-fit the other data models into the chosen one or to use multiple databases, accompanied by increasingly byzantine orchestration logic, high latencies, and data consistency issues. A multi-model database lets you avoid both.

Secondly, requirements do change over time, and multi-model databases allow you to adapt to these changes without having to “rip and replace” databases.

Thirdly, database proliferation within enterprises is costly: it can require expensive specialized skills as well as complex orchestrations and transformations to keep data in sync and consistent. Multi-model databases allow enterprises to replace multiple systems while avoiding external orchestration and transformation.

Figure 2: tables, documents, graphs and key/value pairs: different data models.

How do native multi-model databases work? How do key-values, documents and graphs coexist seamlessly? The answer is very simple.

Documents in a document collection usually have a unique primary key that encodes document identity, which makes a document store into a key/value store, where the keys are strings and the values are JSON documents. The fact that the values are JSON does not impose a performance penalty, but offers a good amount of flexibility.

The graph data model can be implemented by storing a JSON document for each vertex and a JSON document for each edge. The edges are kept in special edge collections that ensure that every edge has _from and _to attributes, which reference the starting and ending vertices of the edge and thereby also encode the direction of the relationship.
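As a minimal illustration of this unification, the following Python sketch models a document collection as a dictionary keyed by document key, and an edge collection as a list of JSON-like documents carrying _from and _to. It is an in-memory toy, not ArangoDB's storage engine or API; the collection names, keys, and the helper functions lookup and outbound_neighbors are hypothetical.

```python
# In-memory toy model; all names and data are illustrative assumptions.

# A document collection doubles as a key/value store: key -> JSON document.
components = {
    "Engine765": {"_key": "Engine765", "name": "Engine"},
    "Turbine12": {"_key": "Turbine12", "name": "Turbine"},
}

# An edge collection holds documents whose _from/_to reference vertex ids
# of the form "<collection>/<key>"; the pair also fixes the edge direction.
part_of = [
    {"_from": "components/Engine765", "_to": "components/Turbine12"},
]

collections = {"components": components}


def lookup(doc_id):
    """Key/value lookup of a document by its "<collection>/<key>" id."""
    collection, key = doc_id.split("/", 1)
    return collections[collection][key]


def outbound_neighbors(doc_id):
    """Follow edges in the forward (_from -> _to) direction."""
    return [lookup(e["_to"]) for e in part_of if e["_from"] == doc_id]


print(outbound_neighbors("components/Engine765")[0]["name"])
```

The same dictionary serves both as a document store (the values are whole documents) and as a key/value store (lookup by key), while the edge list adds the graph view on top of it.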

Having unified the data for the three data models in this way, it remains to implement a common query language that allows users to express document queries, key/value lookups, “graphy queries,” and arbitrary combinations of these. By “graphy queries”, we mean queries that involve the particular connectivity features coming from the edges, e.g.

● Graph traversals
● Shortest path
● Pattern matching

A pattern matching query in a multi-model database identifies all paths that follow an arbitrarily complex combination of conditions. These conditions are composed of conditions on each single document or edge and conditions on the overall layout created by these objects.

After all the theory, let’s dive into an actual use case, aircraft fleet management, including data modeling aspects and actual multi-model queries.

3 Data modeling with native multi-model databases

3.1 Aircraft fleet maintenance: A case study

One area for which the flexibility of a native multi-model database is extremely well suited is the management of large amounts of hierarchical data, such as in an aircraft fleet.

An aircraft fleet consists of several types of aircraft, and a typical aircraft is composed of several million parts: from assemblies and subassemblies to individual components. One can think of an aircraft as a collection of parts flying in formation where the “formation” is a hierarchy of “items”. To organize the maintenance of a fleet of aircraft, one has to manage a multitude of data at different levels in this hierarchy.

Parts and components have:

● names
● serial numbers
● manufacturer information
● maintenance intervals
● maintenance dates
● information about subcontractors
● links to manuals and documentation
● contact persons
● warranty and service contract information, to name but a few.

This data for each item is attached to nodes in an aircraft’s bill of materials hierarchy. It is important to note that there may be three component hierarchies for an aircraft: as-designed, as-built, and as-maintained.

In this paper, we focus on the as-maintained component hierarchy of an aircraft. The difference is that in an as-maintained hierarchy we see the aircraft in terms of replaceable and repairable components. In addition, the configuration of the aircraft, and therefore the hierarchy, will change over time as parts are replaced or refurbished on the individual aircraft.

This data is tracked in order to provide information and answer questions. Questions can include but are not limited to the following examples:

● What are all the component parts in a given part?
● Given a (broken) part, what is the smallest component of the aircraft that contains the part and for which there is a maintenance procedure?
● Which parts of this aircraft need maintenance next week?

3.1.1 A data model for an aircraft fleet

How do we model the data about our aircraft fleet using a multi-model database?

A multi-model database provides a number of ways of accomplishing this. One approach that is very natural and offers fast query performance is to use a JSON document to hold the data for each item in the hierarchy and to represent the hierarchical relationships between the item documents as a graph. The schema-free, recursive nature of JSON documents allows us to store arbitrary information about each item, so it does not matter that the data being managed for an aircraft, an engine, or a small fastener is completely different.

A graph is an excellent representation for modeling the hierarchical structure (partonomy) from the aircraft fleet to the individual aircraft in the fleet, and through the bill of materials for each aircraft down to individual parts. In this model, the aircraft fleet is represented by a vertex that has an edge to each aircraft vertex. Each aircraft vertex in turn has edges to every top-level component vertex in the bill of materials. In the bill of materials, the component assembly vertices have edges to the subcomponents they are made of, and so on, down to the individual parts forming the leaves of the tree (see Figure 3).

Figure 3: A graph-structured tree of item documents.

How should one structure the document collections? We could either put all of the item documents into a single collection or define multiple collections for different classes of items or groups of similar items. For the bill of materials graph, it does not matter whether the items are organized into one large collection or multiple collections for similar items. However, when you consider secondary indexing, the organization of the items into collections is important. Multiple collections containing similar groupings or classes of item documents are better for indexing and will result in more performant queries for this use case. A single collection containing a great variety of items may result in sparse, non-overlapping properties and a preponderance of indexes needed to serve the different query access patterns over those properties.

3.1.2 Queries for aircraft fleet maintenance

What are the typical questions we might ask of the data and how do we query a graph-structured tree of documents to answer them? This section describes typical questions and gives code examples showing how to answer them using queries written in the ArangoDB Query Language (AQL).

● What are all the constituent parts of a given component?

Finding the sub-parts of a particular component involves starting at a particular vertex (the given component) in the graph and navigating to all vertices “below” that vertex — all vertices that can be reached by following edges in the forward direction are the sub-parts of the given component. This navigation is known as graph traversal, which is a typical characteristic of graph queries.

Figure 4: Finding all parts that are part of a component.

Here is an example of this query in AQL (ArangoDB Query Language). It finds all vertices that can be reached from “components/Engine765” by doing a graph traversal into the subtree below that vertex, limited to a depth of 4 steps.

FOR part IN 1..4 OUTBOUND "components/Engine765" GRAPH "FleetGraph"
  RETURN part

The example query illustrates the use of named graphs in AQL, where traversals can be specified without having to enumerate the specific edge collections in the traversal.

A named graph is a logical grouping of edges and vertices comprising a graph and is defined in ArangoDB by specifying a name and enumerating the edge collections and the connected vertex collections of the graph. The traversal only needs the graph name "FleetGraph", the starting vertex, and OUTBOUND for the direction of the edges to be followed, without having to explicitly specify the edge collections to follow. You can specify additional options to control and constrain traversals.
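To make the traversal semantics concrete, the following Python sketch mimics a depth-bounded outbound traversal (the 1..4 OUTBOUND pattern) over a tiny in-memory graph. It is only an illustration under stated assumptions: the vertex ids other than "components/Engine765", the EDGES mapping, and the function traverse_outbound are hypothetical, the graph is assumed to be a tree (no cycle handling), and the breadth-first visiting order used here need not match the order in which a database returns results.

```python
from collections import deque

# Hypothetical edges of a tiny bill of materials: parent id -> child ids.
EDGES = {
    "components/Engine765": ["components/Turbine12", "components/Pump3"],
    "components/Turbine12": ["parts/Blade7"],
    "parts/Blade7": [],
    "components/Pump3": [],
}


def traverse_outbound(start, min_depth, max_depth, edges=EDGES):
    """Return vertices reachable from start in min_depth..max_depth steps."""
    found = []
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        # Only vertices inside the requested depth window are reported.
        if min_depth <= depth <= max_depth:
            found.append(vertex)
        # Stop expanding once the maximum depth has been reached.
        if depth < max_depth:
            for child in edges.get(vertex, []):
                queue.append((child, depth + 1))
    return found


print(traverse_outbound("components/Engine765", 1, 4))
```

A minimum depth of 1 excludes the start vertex itself, which is why the engine does not appear among its own sub-parts.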

● Given a (broken) part, find the maintenance procedure for the part or for parts containing that part.

You have an individual part that is broken and you are looking for a repair procedure. A repair procedure may exist directly for the part, or it may be included in the repair procedures of the parent part hierarchy above the individual part. What we are looking for is the smallest component of the aircraft that contains the part and for which there is a maintenance procedure.

In the graph-structured bill of materials, this involves starting at a leaf vertex representing the individual part, and searching upward in the tree until a component is found that has a maintenance procedure referenced in the JSON document for the component. This is a typical graph query that involves a traversal where the number of steps to go is not known a priori. This particular case is relatively easy since, in traversing up a tree, there is always only one edge above each vertex.

Figure 5: Finding the smallest maintainable component.

The following AQL query traverses from “parts/Screw56744” along the edges in the INBOUND direction and returns the first vertex it reaches whose isMaintainable attribute has the Boolean value true. Because each vertex in the tree has only one edge above it, this is the vertex at the end of the shortest such path:

FOR component IN 0..4 INBOUND "parts/Screw56744" GRAPH "FleetGraph"
  FILTER component.isMaintainable == true
  LIMIT 1
  RETURN component

Note that here, we specify the graph name, the _id of the start vertex, and a filter on the target vertex. We are only interested in the first vertex matching the filter, hence we apply LIMIT 1. We see again that AQL directly supports this type of graph query.
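The upward search can be mimicked in a few lines of Python. This is a sketch under stated assumptions rather than a description of how ArangoDB executes the query: the PARENT and DOCS dictionaries, the intermediate vertex ids, and the function nearest_maintainable are hypothetical, and every item is assumed to have at most one parent.

```python
# Hypothetical data: each item has at most one parent, as in a tree.
PARENT = {
    "parts/Screw56744": "components/Bracket9",
    "components/Bracket9": "components/Engine765",
    "components/Engine765": "aircraft/A1",
}
DOCS = {
    "parts/Screw56744": {"isMaintainable": False},
    "components/Bracket9": {"isMaintainable": False},
    "components/Engine765": {"isMaintainable": True},
    "aircraft/A1": {"isMaintainable": True},
}


def nearest_maintainable(start, max_depth=4):
    """Walk up at most max_depth edges; return the first maintainable id."""
    vertex = start
    for _ in range(max_depth + 1):  # depth 0 is the start vertex itself
        if DOCS[vertex].get("isMaintainable") is True:
            return vertex
        vertex = PARENT.get(vertex)
        if vertex is None:  # reached the root without a match
            break
    return None


print(nearest_maintainable("parts/Screw56744"))
```

Stopping at the first match plays the role of LIMIT 1, and the depth bound corresponds to the 0..4 range in the query.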

● Which parts of this aircraft need maintenance next week?

This question illustrates the flexibility of the multi-model approach. The query that answers it is associative, not navigational: it does not need to traverse the graph structure, and we can query the document data model directly, with an appropriate secondary index to optimize query performance.

Figure 6: Query whose result is orthogonal to the graph structure.

This kind of query would be very cumbersome in a pure graph query language and database. That is because the query needs to rely on secondary indexes, in this case on the attribute storing the date of the next maintenance time.

We use a document query to find the components that are due for maintenance:

FOR c IN components
  FILTER c.nextMaintenance <= "2016-12-15"
  RETURN {id: c._id,
          nextMaintenance: c.nextMaintenance}

AQL uses a looping metaphor to describe the processing of documents in the components collection. The query optimizer identifies that it can use the secondary index on the nextMaintenance attribute to generate an efficient query execution path that uses the index rather than performing a full collection scan that checks the FILTER condition over all documents.

Note that AQL specifies document projections by simply forming a new JSON document in the RETURN statement from known data. The query above illustrates how AQL also supports queries that are typical of document stores.
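The benefit of a secondary index for such a range filter can be sketched in Python with a sorted list and binary search. This is a simplified stand-in, not ArangoDB's index implementation; the sample documents, the index list, and the function due_on_or_before are hypothetical, and dates are assumed to be ISO-formatted strings so that string order equals chronological order.

```python
import bisect

# Hypothetical component documents; ISO dates sort correctly as strings.
components = {
    "Engine765": {"_id": "components/Engine765", "nextMaintenance": "2016-12-10"},
    "Pump3": {"_id": "components/Pump3", "nextMaintenance": "2017-03-01"},
    "Gear21": {"_id": "components/Gear21", "nextMaintenance": "2016-12-15"},
}

# A sorted secondary index on nextMaintenance: (attribute value, key) pairs.
index = sorted((doc["nextMaintenance"], key) for key, doc in components.items())


def due_on_or_before(date):
    """Range lookup via the index instead of scanning every document."""
    # "\uffff" sorts after any key used here, so entries equal to date are kept.
    end = bisect.bisect_right(index, (date, "\uffff"))
    # Projection: build a new small document per match, as in RETURN {...}.
    return [
        {"id": components[key]["_id"], "nextMaintenance": value}
        for value, key in index[:end]
    ]


print(due_on_or_before("2016-12-15"))
```

The binary search touches only a logarithmic number of index entries to find the cut-off point, whereas a scan would have to evaluate the filter on every document.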

3.1.3 Using multi-model querying

We will use the following question to illustrate the power of combining multiple query models in one query: Which parts need to be maintained next week and who is the point of contact for each of those parts?

The query that answers this question finds the parts with maintenance due, finds the nearest parent component containing the maintenance procedure, and then performs a JOIN operation with the contacts collection to add contact information for maintenance in the result:

FOR p IN parts
  FILTER p.nextMaintenance <= "2016-12-15"
  LET nearest = (
    FOR v IN 0..4 INBOUND p GRAPH "FleetGraph"
      FILTER v.isMaintainable == true
      LIMIT 1
      RETURN v
  )
  FOR c IN nearest
    FOR person IN contacts
      FILTER person._key == c.contact
      RETURN {part: p._id, component: c, contact: person}

The traversal is wrapped in a subquery so that LIMIT 1 applies to each part individually rather than to the result of the whole query. This query also illustrates how to perform JOINs in AQL. The innermost FOR statement brings the contacts collection into consideration for joining against the contact on the first maintainable component above the part scheduled for maintenance. The query optimizer recognizes that the FILTER statement can be implemented most efficiently by doing a JOIN using the primary index of the contacts collection for a fast hash lookup.

This is an example of the power and flexibility of the multi-model approach. The query needs all three data models: documents with secondary indexes, graph traversal, and a JOIN powered by fast key/value lookup. Imagine the hoops that we would have to jump through if the three data models did not coexist in the same database engine and in the same query language!
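To show how the three access patterns fit together, here is a compact Python sketch of the same question over in-memory data. It is a toy under stated assumptions, not ArangoDB's execution plan: the items, parent, and contacts dictionaries, the contact key "jdoe", and both functions are hypothetical, and a plain loop stands in for the index-backed filter.

```python
# Hypothetical in-memory data standing in for the collections and the graph.
items = {
    "parts/Screw56744": {"nextMaintenance": "2016-12-10", "isMaintainable": False},
    "parts/Blade7": {"nextMaintenance": "2017-06-01", "isMaintainable": False},
    "components/Bracket9": {"isMaintainable": False},
    "components/Engine765": {"isMaintainable": True, "contact": "jdoe"},
}
parent = {
    "parts/Screw56744": "components/Bracket9",
    "components/Bracket9": "components/Engine765",
    "parts/Blade7": "components/Engine765",
}
contacts = {"jdoe": {"_key": "jdoe", "name": "J. Doe"}}


def nearest_maintainable(start_id, max_depth=4):
    """Graph step: walk up the tree to the first maintainable item."""
    vertex = start_id
    for _ in range(max_depth + 1):  # depth 0 is the start vertex itself
        if items[vertex].get("isMaintainable"):
            return vertex
        vertex = parent.get(vertex)
        if vertex is None:
            break
    return None


def maintenance_contacts(due_date):
    results = []
    for item_id, doc in items.items():
        # Document step: attribute filter (index-backed in a real database).
        if not item_id.startswith("parts/") or doc["nextMaintenance"] > due_date:
            continue
        component_id = nearest_maintainable(item_id)
        if component_id is None:
            continue
        # Key/value step: join by primary-key lookup in contacts.
        person = contacts.get(items[component_id].get("contact"))
        if person is not None:
            results.append(
                {"part": item_id, "component": component_id, "contact": person}
            )
    return results


print(maintenance_contacts("2016-12-15"))
```

In separate single-model systems, each of the three commented steps would typically be a round trip to a different database, with the application code responsible for stitching the results together.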

This case study shows that the three different data models were indeed necessary to achieve good performance for all queries arising from the application. Without a graph database, the traversal queries with a priori unknown path lengths would require notoriously nasty and inefficient multi-JOIN operations. However, a pure graph database would not satisfy our needs for the document queries that were achieved efficiently using the right secondary indexes. The key/value lookups complement the picture by allowing performant JOIN operations that provide additional flexibility in data representation. For example, in the case study, we did not have to redundantly embed the full contact details in every document, because we could perform the JOIN operation in the last query.

It is worth noting that ArangoDB can be used as a pure graph database, pure document store or pure key/value store. The combination of models is not always necessary but always possible even within the same query.

3.2 Lessons learned for data modeling

● JSON is very versatile for unstructured and structured data. The recursive nature of JSON allows embedding of subdocuments and variable length lists. You can even store the rows of a table as JSON documents. Modern data stores are so good at compressing data that there is no memory overhead in comparison to relational databases. For structured data, schema validation can be implemented as needed using microservices.

● Graphs are a good data model for relations. In many real world cases, a graph is a natural data model. It captures relations and using JSON can store complex data on edges and vertices. Learn the Graph Database Capabilities in ArangoDB.

● A graph database is particularly good for navigational queries. The crucial thing here is that the query language must implement routines like “shortest path” and “graph traversal”. The fundamental capability for these is to rapidly access the list of all outgoing or incoming edges of a vertex.

● A multi-model database can compete with specialised solutions. The particular choice of the three data models allows us to combine them in a coherent engine. This combination does not mean compromise: As a document store it is as efficient as a dedicated document store. As a graph database it is as efficient as a dedicated graph database.

● A multi-model database allows you to choose different data models with less operational overhead. Having multiple data models coexisting in a single database engine alleviates the many challenges of using different data models at the same time. It means less operational overhead and less data synchronisation, allowing for a huge leap in data modeling flexibility. You have the ability to keep related data together in the same data store, even if it requires different data models to do that. Mixing different data models within a single query increases options for application design and performance optimizations. And if you choose to split the persistence layer into several different database instances, you still have the benefit of only having to deploy a single technology. Furthermore, data model lock-in is prevented.

● Multi-model has a larger solution space than relational. Considering all these possibilities for queries, the flexibility in data modeling and the benefits of polyglot persistence without the ensuing friction, the multi-model approach covers a solution space that is larger than that of the relational model.

4 Further use cases for native multi-model databases

Content management

Content has a nonhomogeneous structure, which makes a document store a good data model. However, links and connections between different pieces of content are most conveniently and naturally described by a graph structure.

Knowledge graphs

Knowledge graphs are enormous data collections. Most queries from expert systems use only the edges and graph queries, but often enough you need “orthogonal” and associative queries that consider only the vertex data. The multi-model approach also allows source data to be staged in the knowledge graph, and curated data can be served directly from the knowledge graph to support applications consuming the data. Get the full Multi-Model Enterprise Knowledge Graph White Paper.

Enterprise hierarchies

Enterprise hierarchies naturally fit a graph model. However, data access rights management typically requires a blend of graph and document queries.

Fraud Detection

This use case involves assembling and integrating a huge amount of data involving connections between different entities like accounts, IP addresses, machines and the like. In many cases, this can sensibly be modelled by a graph structure. Detecting fraud involves complex pattern matching that also considers the graph structure (e.g. an unusual number of connections to a single host or account, or long fraud chains), as well as statistical analysis, associative queries and JOINs.

Identity and Access Management

Similar to the enterprise hierarchies use case above, identity and access management often involves data that has a hierarchical structure, and usually, people or entities higher up in the hierarchy have their own access rights plus all the access rights of their subordinates. This data is best described by a tree or a directed acyclic graph. Deciding access rights often involves the graph structure, but there are also a lot of queries about the identities which completely ignore the hierarchy.

Complex, user-defined data structures

Any application that deals with complex, user-defined data structures benefits dramatically from the flexibility of a document store and often has good applications for graph data as well.

E-commerce systems

E-commerce systems need to store customer and product data (JSON), shopping carts (key/value), orders and sales (JSON or graph) and data for recommendations (graph), and need a multitude of queries featuring all of these data items.

Real-time Recommendation Engine

Coming up with sensible and effective real-time recommendations for customers in e-commerce requires path pattern matching in graphs, since one would like to recommend things to a customer A that have been bought by another customer B who is linked to A in some way, for example by both having bought similar products. At the same time, queries also use secondary indexes on the product catalog, for example to take sales rank and the like into account.

Internet of things

The IoT produces a very high volume of status data, geo location information, sensor data and the like. At the same time, the actual things in the IoT typically come in a hierarchical structure. For example, all home devices in the same house would report up to the house, which in turn would aggregate some data and report to higher levels. This means that the data about the devices is naturally modelled by a graph and the high volume of sensor data has a diverse structure and often needs to be joined to the much smaller set of thing data.

Logistics

Logistics involves geo locations, tasks, task dependencies, and resources needed for tasks. The data is both of a rather diverse structure and highly connected. Queries involve graph queries considering dependencies and associative queries backed by indices.

Network and IT Operations

Computer networks, the associated hosts and their components, as well as virtualizations of software-defined infrastructure, form a graph. Management of such infrastructure involves queries about this very graph structure, but also queries about the set of hosts or similar things.

Social networks

Social networks are the prime example of large, highly connected graphs and typically involve graph algorithms and graph traversal queries. However, applications often need additional queries which query and aggregate information across individuals and thus need secondary indexes and possibly JOINs with key lookups.

Traffic Management

Street networks are naturally modelled as a graph. Traffic flow data produces a high volume of time-based data which is closely related to the street network. Making good decisions about traffic management involves querying all this data and running intelligent algorithms using aggregations, graph traversals and joins.

Version management applications

Version management applications usually work with a directed acyclic graph and need graphy queries as well as other kinds of queries.

Workflow management software

Workflow management software often models the dependencies between tasks with a graph. Some queries need these dependencies; others ignore them and only look at the remaining data.
