
MEAP Edition Manning Early Access Program Graph Databases in Action Version 1

Copyright 2019 Manning Publications

For more information on this and other Manning titles go to manning.com

©Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders. https://forums.manning.com/forums/graph-databases-in-action

welcome

Thank you for purchasing the MEAP for Graph Databases in Action. Though cliché, the motivation behind this book is to write the book that I wish had existed when I started working with graph databases. Though there is a large amount of information available on the web about graph databases, it tends to be either very rudimentary or extremely advanced. What’s lacking is information to help people go from just getting started with graph databases to being proficient in the practical aspects of building applications with them. This is the void that I am looking to fill with this book.

My approach to teaching graph databases is to draw on the familiar concepts of relational databases for comparison, so a background in data modeling and querying relational databases is suggested. To make the most of this book you will also need to be familiar with building Java applications on top of relational systems.

In this book you will gain an understanding of how graph databases work, how they differ from relational databases, and how to use them to develop applications. You’ll learn the fundamentals of graph databases by following along as we build a fictitious graph-based application called GluttonApp.

The book is divided into three parts. Part 1 explains what a graph database is, when to use one, and what the ecosystem of graph database options looks like. It also covers the fundamental principles of graph data modeling in action, as we apply them to our sample application. In Part 2, you will learn how to query a graph database and build a basic Java application using a TinkerPop-enabled graph database named TinkerGraph. Part 3 moves on to more advanced concepts of graph databases, such as performance tuning and application pitfalls and anti-patterns.

My goal in this book is to remain vendor agnostic in the concepts and techniques, so you can easily transfer the skills you learn here to a variety of popular databases when you actually go to put your application into production. To build out the application, I decided to go with the Apache TinkerPop project, due to its wide adoption amongst database vendors, including AWS and Azure, as well as its open source in-memory database, TinkerGraph, which we will use throughout the book.

If you have any questions, comments, or suggestions, please share them in Manning’s Author Online forum for my book.

—Dave Bechberger

brief contents

PART 1: GETTING STARTED WITH GRAPH DATABASES
1 What is a Graph and what can I do with it?
2 Do I have a Graph problem?
3 Graph Data Modeling
4 Data modeling in practice

PART 2: BUILDING ON GRAPH DATABASES
5 Querying Graph databases
6 Developing our application in Java
7 Beyond basic querying

PART 3: MOVING BEYOND THE BASICS
8 Performance tuning our application
9 Graph pitfalls and anti-patterns
10 Graph analytics for non-Graph people

APPENDIXES:
A Apache TinkerPop installation and overview
B Gremlin steps cheatsheet
C An overview of property model graph databases and tools


1 What is a Graph and What Can it Do?

This chapter covers

• What graph databases are and why you want to use them to solve highly connected data problems
• How graph databases compare to relational and NoSQL databases
• Introduction to “just enough” graph theory and terminology to get started

In May of 2016, a massive leak of over 11 million documents, measuring ~2.6 terabytes of data, was released by the International Consortium of Investigative Journalists (ICIJ) (https://www.icij.org/investigations/panama-papers/), in what has become known as the Panama Papers. This release was a coordinated effort between journalists in nearly 80 countries to examine and connect information on more than 200,000 secret offshore companies based in Panama (https://www.icij.org/investigations/panama-papers/pages/panama-papers-about-the-investigation/). Their investigation led to the naming of many celebrities, politicians, and their families as potentially using offshore bank accounts to hide their fortunes. Due to the sheer volume, the number of records, and the highly connected nature of the data, the ICIJ decided to use a graph database named Neo4j to handle and coordinate the distributed efforts to connect the various pieces of data. Why would you choose a graph database over a more standard tool, such as a relational database, to answer these sorts of questions?

Graph databases are the only option when trying to make sense of the vast terabytes of connected data that we are producing more and more of, and are an essential tool for international agencies, governments, financial services, and security firms trying to uncover the truth.

Emil Eifrem – CEO Neo4j Inc. in reference to the Panama Papers


In other words, Mr. Eifrem's quote above references the fundamental power of graph databases: to show the richness of highly interconnected data in a manner that is unavailable with other types of databases. Graphs and graph databases are powerful tools that enable us to better understand real-world problems that deal with highly connected data such as social networking, search, infrastructure management, recommendation engines, or in the case of the Panama Papers, fraud detection.

The driving goal of this book is to empower you, the reader, to leverage graph databases as a tool to build applications. Throughout this book, we will examine how graphs and graph databases give the end user a tremendous amount of additional power to navigate and explore data in ways that cannot be accomplished easily within a traditional relational database. We will achieve this goal by walking through the process of building a fictitious application, which we will call GluttonApp.

GluttonApp is an application that provides personalized restaurant reviews, similar to Yelp. This application will also allow you to connect with your friends and develop a social network, similar to Facebook or Twitter. Finally, the app will use your friends' restaurant reviews to personalize your restaurant recommendations. Each chapter will comprise a step in the process and will build upon the previous chapters' work on GluttonApp. By the end of this book, we will have created a functioning application on a graph database using the skills learned along the way.

In this chapter, we begin our journey by gaining an understanding of what graphs and graph databases are and how they compare with traditional tools, such as relational and document databases.

1.1 What are Graph Databases?

Graph databases are a type of database that uses graph structures, specifically vertices (also called nodes) and edges, to store and query complex data.

Figure 1.1 A Simple Graph showing a Vertex and an Edge


These databases combine these basic graph structures with the fundamental constructs of graph theory to provide a database that facilitates fast and straightforward retrieval of complex data and relationships. In this section, we will introduce what a graph is and how graph databases fundamentally differ from relational databases.

1.1.1 What is a Graph and Why Use one?

A graph is a mathematical construct used to model relationships between items. It provides an abstract method for connecting objects and representing the relationships between them. In the figure below, we see a small social network graph where the people (items) appear as vertices (circles) and the relationships are represented by the lines connecting them, known as edges.

Figure 1.2 A small social network graph

It is human nature to attempt to view systems of interconnected entities in the real world as graphs. When thinking about data that contains a vast array of highly interconnected items, such as the Panama Papers, we naturally analyze and describe it as a web of interconnected things, which is simply another way to describe a graph. In the real world, items are related to other items in rich and varying ways that are not well represented by the uniform and rigid structure of columns, rows, and tables used by relational databases. In the figure below, we visualize the business connections of the family members of Syrian President Bashar al-Assad (https://www.icij.org/investigations/panama-papers/the-power-players/). Representing this data as a graph makes it natural to visually comprehend and navigate highly connected data in a way that would be far harder in a tabular view.


Figure 1.3 A graph of Syrian President Bashar al-Assad’s family business connections

Second, graphs are powerful tools for recognizing patterns within data. In the case of the Panama Papers, for example, graphs were used to provide valuable insight into the data by finding hidden offshore assets associated with celebrities and politicians. In the figure below, we can see the interconnected data of former Icelandic Prime Minister Sigmundur Davíð Gunnlaugsson. He stepped aside after the ICIJ Panama Papers investigation revealed that he was hiding offshore assets through his wife. Understanding this deeply tangled network of relationships would have been much more complicated and tedious using the tabular data formats of standard relational databases.


Figure 1.4 Image from the Panama Papers of former Prime Minister of Iceland Sigmundur Davíð Gunnlaugsson's hidden assets (https://linkurio.us/blog/panama-papers-how-linkurious-enables-icij-to-investigate-the-massive-mossack-fonseca-leaks/)

Finally, a graph is optimized to handle highly connected data with strong internal relationships. It does this by making the connections as important and powerful as the items they connect. In a relational database, the relationships between entities are represented via a foreign key constraint. In a graph, relationships between objects are represented as rich connections between items, known as edges, which have the additional benefit of adding semantic meaning to the data. As demonstrated in the figure below, we can easily discern how each of the elements is related to the others. These relationships enable you to navigate your data effectively in ways that traditional databases do not allow.


Figure 1.5 Image from the Panama Papers of Ilham Aliyev, President of Azerbaijan (https://www.icij.org/investigations/panama-papers/the-power-players/)
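To make the idea of edges as labeled, first-class connections more concrete, here is a minimal Python sketch of a tiny in-memory graph. The people, the company, the edge labels, and the `out` helper are all hypothetical and invented for illustration; this is not how any particular graph database stores or queries data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    id: int
    label: str   # what kind of item this is, e.g. "person"
    name: str


@dataclass(frozen=True)
class Edge:
    out_v: int   # id of the vertex the edge leaves
    label: str   # the meaning of the connection, e.g. "friends_with"
    in_v: int    # id of the vertex the edge points to


# Hypothetical sample data, invented for this illustration.
vertices = {
    1: Vertex(1, "person", "Ann"),
    2: Vertex(2, "person", "Raj"),
    3: Vertex(3, "company", "Acme Holdings"),
}
edges = [
    Edge(1, "friends_with", 2),
    Edge(1, "shareholder_of", 3),
    Edge(2, "director_of", 3),
]


def out(vertex_id, edge_label):
    """Names of the vertices reached from vertex_id over edges with edge_label."""
    return [vertices[e.in_v].name
            for e in edges
            if e.out_v == vertex_id and e.label == edge_label]


print(out(1, "shareholder_of"))
print(out(2, "director_of"))
```

Unlike a bare foreign key, each edge here carries its own label, so the same pair of items can be connected in several distinct, named ways.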

1.1.2 Why Can't I use SQL?

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

Abraham Maslow

This quote by Abraham Maslow (The Psychology of Science, 1966), known as the law of the instrument, describes the common over-reliance on familiar tools instead of choosing the optimal tool for the job. As developers, we are all frequently guilty of choosing the familiar tool over the correct one, especially when dealing with databases. The short answer is that you can do almost anything in a relational database that you can do in a graph database, but that does not make it the optimal tool for the job. Most development teams have in-depth knowledge of the ins and outs of relational databases but little to no expertise in other types of databases. Because of this gap, we often default to the relational database "hammer" when there are better tools in the toolbox to solve a given problem. Let's look at a few types of problems where the underlying nature of a graph database provides a simpler, more elegant solution than a relational one.


RECURSIVE QUERIES

Recursive queries represent one type of problem that is difficult to write in SQL and even more challenging to make work well. Given a list of employees and managers in a company, let’s see how we would determine a person's management structure.

Figure 1.6 Organizational Structure of a Company

In a relational database, we would construct a table like this:

CREATE TABLE org_chart (
    employee_id SMALLINT,
    manager_employee_id SMALLINT,
    employee_name VARCHAR(20)
);


employee_id   manager_employee_id   employee_name
1             3                     You
2             3                     Co-Worker
3             4                     Team Lead
4             5                     Manager #2
5             8                     VP
6             5                     Manager #1
7             5                     Manager #3
8             NULL                  President/CEO

And then use a recursive function as shown below to query and retrieve this data:

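WITH RECURSIVE org AS (
    -- Anchor: start at the top of the hierarchy (the employee with no manager)
    SELECT employee_id, manager_employee_id, employee_name, 1 AS level
    FROM org_chart
    WHERE manager_employee_id IS NULL
    UNION
    -- Recursive step: add everyone who reports to an employee already found
    SELECT e.employee_id, e.manager_employee_id, e.employee_name, m.level + 1 AS level
    FROM org_chart AS e
    INNER JOIN org AS m ON e.manager_employee_id = m.employee_id
)
SELECT * FROM org ORDER BY level ASC;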
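To experiment with a recursive query over the sample rows above, the following sketch loads them into an in-memory SQLite database using Python's built-in sqlite3 module. It is only an illustration under stated assumptions: the choice of SQLite, the CTE name `chain`, and the decision to start from the employee named "You" and walk upward are mine, and other databases may need slightly different syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE org_chart (
        employee_id SMALLINT,
        manager_employee_id SMALLINT,
        employee_name VARCHAR(20)
    )""")
conn.executemany("INSERT INTO org_chart VALUES (?, ?, ?)", [
    (1, 3, "You"), (2, 3, "Co-Worker"), (3, 4, "Team Lead"),
    (4, 5, "Manager #2"), (5, 8, "VP"), (6, 5, "Manager #1"),
    (7, 5, "Manager #3"), (8, None, "President/CEO"),
])

# Anchor on one employee, then repeatedly join to that row's manager
# until a row with no manager is reached.
chain = conn.execute("""
    WITH RECURSIVE chain AS (
        SELECT employee_id, manager_employee_id, employee_name, 1 AS level
        FROM org_chart
        WHERE employee_name = ?
        UNION ALL
        SELECT e.employee_id, e.manager_employee_id, e.employee_name, c.level + 1
        FROM org_chart AS e
        INNER JOIN chain AS c ON e.employee_id = c.manager_employee_id
    )
    SELECT employee_name, level FROM chain ORDER BY level
""", ("You",)).fetchall()

for name, level in chain:
    print(level, name)
```

Even in this small form, the query needs an anchor, a self-join, and a level counter just to follow a single relationship repeatedly.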

If you have ever had to write Common Table Expressions (CTEs) in SQL, you know that they are complex and challenging to write. Recursive CTEs, such as the one above, are notorious for performing very poorly and for being nearly impossible to optimize and debug. On the other hand, nested and recursive queries such as this are a natural fit for a graph database and something that graph databases are optimized to answer. Take for example the same dataset shown as a graph:


Figure 1.7 Graph of Organizational Structure

As you can see in the figure above, it is straightforward to visually navigate the hierarchical nature of the organization. Additionally, querying this graph is much simpler and more performant than the comparable SQL. The query below demonstrates how straightforward it is to recursively query a graph:

g.V()
  .repeat( out('works_for') ).().next()

NOTE The query language you see above is a graph query language called Gremlin, which we will be using throughout this book. At this point it is unnecessary for you to understand precisely how it works; we will go into detail in a later chapter. For now, simply note the relative simplicity of this query compared to the previous SQL query.


DIFFERENT RESULT TYPES

Have you ever needed to return different data types from a database? For example, let’s say that we had an order processing system, and we wanted the results of our query to contain not only the order information but also the product information for that order. Returning results containing different types is not something that SQL can do without creating a union of all the fields in all the tables. Let's take a look at what it would take to return this data using SQL.

Figure 1.8 Relational Schema of Orders and Products

SELECT *, null AS product_name, null AS cost
FROM orders
UNION
SELECT id, null AS name, null AS address, product_name, cost
FROM products;

id    name         address       product_name   cost
1     John Smith   123 Main St   NULL           NULL
2     Jane Right   234 Park St   NULL           NULL
123   NULL         NULL          widget 1       5.95
234   NULL         NULL          widget 2       10.76

As we can see from the results returned, this union of two data types dictates that our answer contains a large number of null values. This abundance of null data is because the columns of the two tables are inconsistent, but the dataset returned by the query must contain a consistent set of columns. This need for homogeneous data not only inflates the amount of data returned, but it also removes the descriptive nature of the data. One of the strengths of a graph database is the ability to return differing data types in the results. In most graph frameworks and databases, the data returned can be a combination of vertices and edges, data created from the series of steps the query takes as it moves through the graph (known as the path), and types mapped from any part of the results to a new data structure. In the case above, this is the data a graph database returns:

gremlin> g.V().valueMap(true).next()
==>[label:order,address:[123 Main St],name:[John Smith],id:1]
==>[label:order,address:[234 Park St],name:[Jane Right],id:2]
==>[label:product,cost:[10.76],id:234,product_name:[widget 2]]
==>[label:product,cost:[5.95],id:123,product_name:[widget 1]]

As you can see, the data returned retains all the semantic meaning of what the object is and represents. No additional null data fields are required; only the data associated with each vertex is returned. Without the large quantity of null data, we can create much cleaner code when working with highly varied data types than is possible using a relational database.
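The contrast between the two result shapes can be sketched in a few lines of Python. This is only an illustration using the order and product values from the example above; the variable names are my own, and plain dictionaries stand in for whatever result objects a real driver would return.

```python
orders = [
    {"id": 1, "name": "John Smith", "address": "123 Main St"},
    {"id": 2, "name": "Jane Right", "address": "234 Park St"},
]
products = [
    {"id": 123, "product_name": "widget 1", "cost": 5.95},
    {"id": 234, "product_name": "widget 2", "cost": 10.76},
]

# Relational-style UNION: every row must carry every column,
# so missing values are padded with None (SQL NULL).
columns = ["id", "name", "address", "product_name", "cost"]
union_rows = [tuple(row.get(c) for c in columns) for row in orders + products]
null_count = sum(value is None for row in union_rows for value in row)

# Graph-style result: each element keeps its own label
# and only the properties it actually has.
graph_rows = ([{"label": "order", **o} for o in orders]
              + [{"label": "product", **p} for p in products])

print(null_count)
for row in graph_rows:
    print(row)
```

In the padded form, the only way to tell an order from a product is to inspect which columns are empty, whereas each labeled record says what it is.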

PATHS

If you have ever gotten directions using Google or Apple Maps, then you have used a fundamental concept of graphs known as a path. A path is the sequence of vertices and edges that describes how the traversal moved through the graph. In Google or Apple Maps, a set of directions between two locations would be an example of a path. The ability to return how two objects are connected to each other from within the database is a feature unique to graph databases. What I find compelling about paths is that they allow us to solve problems in novel ways. For example, let us look at a classic puzzle known as a river crossing puzzle. In our river crossing puzzle, we have a fox, a goose, and a bag of barley that must be transported across a river by a farmer on a boat.

• The boat can only carry one item in addition to the farmer on each trip.
• The farmer must go on each trip.
• The fox cannot be left alone with the goose or it will eat it.
• The goose cannot be left alone with the barley or it will eat it.

Using a relational database, I am not aware of a way to solve this riddle without using a brute force method to calculate all possibilities. However, with a little clever data modeling and the power of a graph database, it is rather straightforward to answer this riddle.

NOTE In the figure below, we are representing the boat as a T, the goose as a G, the fox as an F, and the barley as a B. The underscore represents the river, and each letter appears on the side of the river where it currently resides.

We start by modeling the initial state of our system as a vertex in our graph. We then add an edge to represent each potential state change leading to a vertex that represents the new state. We repeat this process until the new vertex violates one of the constraints above or we reach the desired state. By doing this for all potential options, we get the graph below:

Figure 1.9 River Crossing Problem Represented in a Graph

If we remove any state (vertex) that violates a constraint, its adjoining relationships (edges), and any edge that connects back to a previous state, we get the figure below:

Figure 1.10 River Crossing Problem with the Invalid States and Cycles Removed


From looking at the figure above, it becomes apparent that we have two separate paths to get to our desired state. To query the graph to return these paths, it is simply a matter of leveraging the pathfinding capabilities of graph databases to return the two appropriate paths.

g.V('TFGB_')
  .repeat( out() ).until(hasId('_TGFB'))
  .path().next()

The query above will return not only the first and last vertex visited but the entire set of vertices and edges that were visited along the way.

TFGB_ -> FB_TG -> TFB_G -> F_TGB -> TFG_B -> G_TBF -> TG_FB -> _TGFB
TFGB_ -> FB_TG -> TFB_G -> B_TFG -> TGB_F -> G_TBF -> TG_FB -> _TGFB
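The modeling steps described above (states as vertices, crossings as edges, invalid states discarded, no returning to an earlier state) can be sketched without a database. The Python below is an illustration only: the function names are hypothetical, it prints the letters of each bank in a fixed T, F, G, B order (the order of letters within a bank carries no meaning), and it uses a plain depth-first search rather than Gremlin's path step.

```python
ITEMS = "TFGB"  # T = farmer and boat, F = fox, G = goose, B = barley
FORBIDDEN = [{"F", "G"}, {"G", "B"}]  # pairs that cannot be left unattended


def label(left):
    """Render a state as '<left bank>_<right bank>', for example 'FB_TG'."""
    right = set(ITEMS) - left
    return ("".join(c for c in ITEMS if c in left) + "_"
            + "".join(c for c in ITEMS if c in right))


def is_valid(left):
    """A state is valid if the bank without the farmer holds no forbidden pair."""
    unattended = set(ITEMS) - left if "T" in left else set(left)
    return not any(pair <= unattended for pair in FORBIDDEN)


def neighbors(left):
    """Yield every valid state reachable with one crossing (one edge)."""
    here = left if "T" in left else frozenset(ITEMS) - left  # the farmer's bank
    for cargo in [None] + sorted(here - {"T"}):
        moving = {"T"} | ({cargo} if cargo else set())
        new_left = left - moving if "T" in left else left | moving
        if is_valid(new_left):
            yield frozenset(new_left)


def all_paths(start, goal, path=None):
    """Depth-first search for every path that never revisits a state."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for nxt in neighbors(start):
        if nxt not in path:
            found.extend(all_paths(nxt, goal, path))
    return found


# Everything starts on the left bank; the goal is an empty left bank.
paths = all_paths(frozenset(ITEMS), frozenset())
for p in paths:
    print(" -> ".join(label(state) for state in p))
```

Each printed line is one path: the full sequence of states visited between the starting vertex and the goal vertex, not just the two endpoints.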

Even though this example was just a riddle, it represents the same fundamental problem found in many real-world applications such as route finding on a map, modeling constraints in a logistics system, or finding connections between people in a social network. Each of these cases is fundamentally about determining the set of steps it takes to get from one entity to another. Because graph databases are built on a graph data structure, we can use these pathfinding capabilities, which are unavailable in other database types.

1.2 Graph Databases on the Database Spectrum

If you are thinking to yourself, "I am reading this book to learn about graph databases, so why am I looking at these other types of data stores?", then you likely aren't alone. We take the time to understand the types of databases available on the market today so that we can compare graph databases to more familiar technologies, such as relational and document databases, and learn what differentiates them from their competitors. While there is a wide array of offerings in the database market, there are five generally recognized categories of databases, each of which is defined by the data structures used to store data and differentiated by the complexity of the data it is capable of storing.


Figure 1.11 NoSQL Data Spectrum and Increasing Data Complexity

To demonstrate the differences, we are going to compare how we would solve a “friends of friends” (FofF) problem. A "friends of friends" problem is a common use case for a graph database due to the recursive nature of retrieving the data. This type of problem is one faced by companies such as LinkedIn or Facebook when recommending friends or people you might know within their systems. In our example, we want to locate all the people that you follow (Alice) as well as anyone that they follow (Ted/Josh) who also follows you. In a graph database, we can solve this problem by creating a graph like the figure below:

Figure 1.12 Friends of Friends Problem Graph


By creating this type of graph, we can navigate through it, as shown in the figure below, to locate our answer of Ted.

NOTE Throughout this book, we will be using this little green gremlin to represent where we are located in our graph as we move through it. This green character is the mascot for the Gremlin query language of the Apache TinkerPop project. This graphic is the trademark of the Apache Software Foundation (http://www.apache.org/) and the Apache TinkerPop project. (http://tinkerpop.apache.org/)

Figure 1.13 Navigating the Friends of Friends Graph

Now let us look at how we can write a query, in Apache TinkerPop Gremlin, to return Ted from the graph above:

g.V().has('person', 'name', 'Bob')
  .repeat( out('follows') ).times(2)
  .values('name').next()

And the same query when written in Cypher:

MATCH (bob:person {name: 'Bob'})-[f:follows*2]->(friend:person)
RETURN friend.name
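As a rough illustration of what these two-hop traversals do, here is a plain Python sketch over the same small "follows" network (Bob follows Alice; Alice follows Ted and Josh; Ted follows Bob). The dictionary and function names are my own; this is not the TinkerPop or Neo4j API. The second function adds the "who also follows you" filter from the problem statement as a separate step.

```python
# Adjacency list: each person maps to the people they follow.
follows = {
    "Bob": ["Alice"],
    "Alice": ["Ted", "Josh"],
    "Ted": ["Bob"],
    "Josh": [],
}


def two_hops(graph, start):
    """Everyone reached by following exactly two 'follows' edges from start."""
    return [fof for friend in graph[start] for fof in graph[friend]]


def two_hops_following_back(graph, start):
    """The two-hop results, kept only if that person also follows start."""
    return [person for person in two_hops(graph, start) if start in graph[person]]


print(two_hops("follows" and follows, "Bob"))
print(two_hops_following_back(follows, "Bob"))
```

Here the traversal is just two nested lookups in memory; the sections that follow look at what the same walk costs when the links live in other kinds of databases.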

Now that we know how we would solve this in a graph, let us examine how we can address this example in other database types on the Database Spectrum and how the data looks when stored for each category. Since we have already compared and contrasted graph and relational databases in the last section, we will instead focus on the remaining three types: Key/Value, Wide Column, and Document.

KEY VALUE

Key/Value databases store data represented by a unique identifier, the key, and an associated data object, the value. This key-value model uses data structures such as dictionaries or hash tables to store and retrieve data in an efficient manner. Key-value databases treat data as if it is a single array of data with each record being a single atomic row identified by the key. In key-value databases, there is no schema enforced by the database, so the data structure of the value can vary per row. Unlike the other database types we describe below, key-value databases do not provide the ability to query and retrieve multiple records or records matching specific criteria. In the case of our "friends of friends" problem, you would need to store data as unique and atomic rows. Key-value databases do not provide a means to "link" associated records within the database. To accomplish the linking between data, you will need to embed the relationship, "follows", into each record stored. Adding this data to each record is essentially creating a bi-directional linked list in each connected data record.

Key/Value Representation of the Friends of Friends Problem

Key      Value
Bob      follows: ['Alice']
Alice    follows: ['Ted', 'Josh']
Ted      follows: ['Bob']
Josh     follows: []

To query this data, we need to pull back records one at a time, starting with "Bob", find all the linked records, retrieve those records, and repeat this process until we have located all friends of our friends. Since key-value databases do not provide a way to link or join data across records, the application would have to perform this process. In general, key-value databases are excellent tools for handling the quick storage and retrieval of non-uniform atomic data, such as that needed by application caches, but are not well suited to handle highly interconnected data.
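The record-at-a-time process described above might look like the following sketch. The `KeyValueStore` class is a hypothetical stand-in that exposes only a get-by-key operation and counts how often it is called; it is not the client API of any real key-value database.

```python
class KeyValueStore:
    """Hypothetical stand-in for a key-value database: lookup by key only."""

    def __init__(self, data):
        self._data = dict(data)
        self.lookups = 0  # number of round trips the application has made

    def get(self, key):
        self.lookups += 1
        return self._data[key]


store = KeyValueStore({
    "Bob": {"follows": ["Alice"]},
    "Alice": {"follows": ["Ted", "Josh"]},
    "Ted": {"follows": ["Bob"]},
    "Josh": {"follows": []},
})


def friends_of_friends_following_back(store, me):
    """Walk the embedded 'follows' links in application code, one record at a time."""
    result = []
    for friend in store.get(me)["follows"]:
        for candidate in store.get(friend)["follows"]:
            # One more fetch per candidate to see whether they follow us back.
            if me in store.get(candidate)["follows"]:
                result.append(candidate)
    return result


print(friends_of_friends_following_back(store, "Bob"))
print(store.lookups)
```

Every hop costs the application a separate fetch, so the number of round trips grows with the number of people reached rather than staying at a single query.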

COLUMN-ORIENTED

Column-Oriented (or Wide Column) databases store data in rows with the ability to hold a potentially large and arbitrary set of columns. Column-oriented databases store data in such a way that there is a unique record or row key associated with a specific row of data. Inside that row, the data stored in the columns may be arbitrary and schema-less, as is the case with Apache HBase and Azure Cosmos DB, or may require a predefined schema and data types, as is the case with Apache Cassandra. While the implementations of the underlying databases differ, the user thinks of their data in familiar terms such as Table, Row, and Column. One significant difference between Column-oriented databases and the more familiar RDBMS models is that Column-oriented databases do not usually support advanced query features such as joins. All the data required to satisfy a specific query must reside inside a single table. It is helpful to think of tables inside Column-oriented databases as if they are pre-computed views on your data created at write time. To achieve this, we are usually required to denormalize our data into multiple tables at write time from within our application’s data access layer. While this additional overhead may seem unwieldy and difficult to optimize, the tradeoff for denormalizing your data at write time is that the query time to retrieve the data can be dramatically less than with a properly normalized model in an RDBMS. In the case of our "friends of friends" problem, we would store data in two tables: one contains the unique properties of a vertex and one contains the links between the items in the other table, as shown below:

Column-oriented Representation of the Friends of Friends Problem

People                  Linked People
person_id   name        person_id   linked_person_id
1           Bob         1           2
2           Alice       2           3
3           Ted         2           4
4           Josh        3           1

As with key-value databases, column-oriented databases do not allow for joins between rows at the database level. They require that the application handle all the work of retrieving rows individually and navigating to the linked rows. Column-oriented databases suffer from the same lack of relationships that we discussed with key-value databases, and so they have the same drawbacks when dealing with highly connected data. Column-oriented databases are generally optimized to serve out specific rows of data from a single table quickly and at an enormous scale.
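The idea of treating a table as a view pre-computed at write time can be sketched as follows. This is a simplified illustration using the two tables above: plain Python structures stand in for the tables, and the extra lookup table and the `build_fof_table` function are hypothetical names of my own, not part of any column-oriented database's API.

```python
# The two tables from the example above.
people = {1: "Bob", 2: "Alice", 3: "Ted", 4: "Josh"}
linked_people = [(1, 2), (2, 3), (2, 4), (3, 1)]  # (person_id, linked_person_id)


def build_fof_table(people, links):
    """Denormalize at write time: one row per person holding the names
    of everyone exactly two links away."""
    follows = {person_id: [] for person_id in people}
    for person_id, linked_id in links:
        follows[person_id].append(linked_id)
    return {
        person_id: [people[fof] for friend in follows[person_id]
                    for fof in follows[friend]]
        for person_id in people
    }


# Built (and rebuilt) by the application whenever links are written.
fof_by_person = build_fof_table(people, linked_people)

# At read time the answer is a single-row lookup with no joins.
print(fof_by_person[1])
```

The read becomes trivial, but the application now owns the job of keeping this extra table consistent every time a link is added or removed.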

DOCUMENT

Document databases store schema-free data, usually in a standard format such as JSON or XML. Each document has a unique identifier, or key. Within each document, the contents can include complex, nested, and hierarchical data structures. Each document is an atomic entity that contains all the information related to that specific entity within your system. Even though the database is schema-less across documents, most document databases provide querying functionality, and some support joining across documents. One drawback to the schema-less approach is that the management and validation of the data schema becomes the responsibility of the application(s) using the database instead of the database itself.

In the case of our "friends of friends" problem, you would need to store data as a series of documents. Each document must also contain a data element that stores its links to other documents as a list of their identifiers.

Listing 1.1 Document Representation of the Friends of Friends Problem

[
  { "id": 1, "name": "Bob", "follows": [2] },
  { "id": 2, "name": "Alice", "follows": [3, 4] },
  { "id": 3, "name": "Ted", "follows": [1] },
  { "id": 4, "name": "Josh" }
]

From this data model, you can see that our document data model is quite similar to the key-value and column-oriented data models, as the relationships between items are stored inside each document as a list pointing to the unique identifiers of the related documents. Unlike key-value and column-oriented databases, document databases frequently have the ability to join documents. However, document databases still have several drawbacks when dealing with highly interconnected data. First, links between data are stored at the document level, which requires retrieving and updating all associated documents when a change occurs. Second, they only support relationships in one direction, so I can answer the question "Who do I follow?" but not "Who follows me?" In general, document datastores are useful in scenarios where the data tends to be atomic with only weak connections between items.

As our examination of the other database types shows, each has a different and unique approach to storing data. They each provide a useful tool in the right scenarios; however, when dealing with highly interconnected data, they each lack pieces of functionality, which makes them a less than optimal choice. Graph databases are the only type to offer rich, robust relationships between data, which enable navigating the links found in highly interconnected data in ways that are impossible with the other database types.
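The one-directional nature of the links in Listing 1.1 can be illustrated with a short Python sketch. It is a simplified, in-memory stand-in rather than the API of any real document database, and the helper names `who_do_i_follow` and `who_follows_me` are hypothetical. Following a link forward is a direct lookup by identifier, while answering the reverse question means scanning every document.

```python
# The documents from Listing 1.1, held in memory for illustration.
documents = [
    {"id": 1, "name": "Bob", "follows": [2]},
    {"id": 2, "name": "Alice", "follows": [3, 4]},
    {"id": 3, "name": "Ted", "follows": [1]},
    {"id": 4, "name": "Josh"},
]
by_id = {doc["id"]: doc for doc in documents}  # simulates lookup by key


def who_do_i_follow(person_id):
    """Forward direction: read one document, then look up each linked id."""
    return [by_id[i]["name"] for i in by_id[person_id].get("follows", [])]


def who_follows_me(person_id):
    """Reverse direction: no stored link exists, so scan every document."""
    return [doc["name"] for doc in documents
            if person_id in doc.get("follows", [])]


print(who_do_i_follow(2))  # Alice follows Ted and Josh
print(who_follows_me(1))   # only Ted follows Bob
```

A real document database could speed up the reverse lookup with a secondary index, but the relationship itself would still be stored in only one direction.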

1.3 “Just Enough” Graph Theory

The purpose of this book is to deliver a practical guide to graph databases, not to give an in-depth dive into graph theory. If, like me, you have no formal background in mathematics and do not relish the idea of learning complex mathematical theory to build an application, then you are in luck. Building applications using graph databases does not require a deep understanding of graph theory; instead, it requires “just enough” theory to understand what the database is doing. In this section, we will walk through “just enough” of the fundamentals and terminology of graph theory to understand how to use graphs when developing an application.

Graph theory is a sub-discipline of mathematics devoted to studying graphs and graph structures. In mathematical terms, a graph is a set of vertices and edges.

Figure 1.14 Simple Representation of a Graph

In the figure above, each of the circles represents a vertex, and each of the lines represents a relationship between two vertices, known as an edge. It is the collection of all vertices and edges together that comprises the graph. Vertices represent elements in the model, and edges represent the relationships between those elements. In much the same way that relational databases are based on the mathematical rules of relational algebra and relational calculus, graph databases and frameworks are based on the mathematical principles of graph theory. Graph theory is a type of math that is easily explainable using everyday examples. In the following sections, we will take a look at the history of graph theory, define some basic terminology, and examine some of the fundamental types of graphs that form the basis for how graph databases function.

1.3.1 Euler and Graphs

Graph theory is a branch of mathematics that can be traced to 1735, when a Swiss mathematician named Leonhard Euler solved the now famous Seven Bridges of Königsberg problem. It is a cliché that every book on graph theory or graph databases uses this example, but it is a foundational problem that shows how a graph can efficiently represent an abstracted version of a real-world situation.

Königsberg was an old Prussian city located on the Pregel river with two islands and seven bridges. The thought experiment was to devise a path that would allow citizens of the town to cross all seven bridges exactly once. Euler approached this problem by creating an abstract representation in which the land masses became nodes and the bridges became the connections (edges) between the nodes. Based on this abstraction, Euler was able to show that it was not the items themselves that mattered but how these items were connected that played the most significant role.

Figure 1.15 Seven Bridges of Königsberg and Euler’s abstraction of the problem

In his Seven Bridges of Königsberg paper, Euler stated that for the problem to be solvable, the graph must have either zero or two nodes with an odd degree, that is, an odd number of connected edges. Any graph that meets this condition is known as an Eulerian graph. If a graph has a path that visits each edge exactly once, then it contains an Eulerian path. If the start and end vertex of that path are the same, then it has an Eulerian circuit, which is also known as an Eulerian cycle.

Eulerian Paths and Circuits in the Seven Bridges Problem

An Eulerian path that starts and ends at different vertices requires that exactly two vertices have an odd number of incident edges, that is, edges that have the vertex as one of their endpoints. In an Eulerian circuit, all vertices must have an even number of incident edges. In the Seven Bridges problem, all four vertices have an odd number of edges, so the Seven Bridges of Königsberg graph does not contain an Eulerian path or an Eulerian circuit and is therefore not an Eulerian graph.
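The degree check described above is easy to reproduce. The following Python sketch models the seven bridges as a list of edges and counts the vertices with an odd degree. The vertex labels are names chosen here for illustration, and the check assumes the graph is connected, which is true for Königsberg.

```python
from collections import Counter

# Each tuple is one bridge joining two land masses (illustrative labels).
bridges = [
    ("north bank", "Kneiphof"),
    ("north bank", "Kneiphof"),
    ("south bank", "Kneiphof"),
    ("south bank", "Kneiphof"),
    ("north bank", "Lomse"),
    ("south bank", "Lomse"),
    ("Kneiphof", "Lomse"),
]

# The degree of a vertex is the number of bridge endpoints that touch it.
degree = Counter()
for a, b in bridges:
    degree[a] += 1
    degree[b] += 1

odd_vertices = sorted(v for v, d in degree.items() if d % 2 == 1)

# For a connected graph: zero odd vertices allows a circuit,
# and zero or two odd vertices allows a path.
has_eulerian_circuit = len(odd_vertices) == 0
has_eulerian_path = len(odd_vertices) in (0, 2)

print(dict(degree))
print(len(odd_vertices), has_eulerian_path, has_eulerian_circuit)
```

Because every land mass ends up with an odd degree, both checks fail, which matches Euler's conclusion that no such walk exists.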

In the context of this book, the critical takeaway from this paper is that Euler described a method for solving the problem that used vertices to describe the objects in question and edges to describe the relationships between those objects. This method of abstracting a real-world problem into things and relationships provided the foundational framework for the study of graphs and graph theory.

1.3.2 Basic Graph Terminology

Before we dive into some basics of graph theory, let’s first define common graph theory terminology:

Figure 1.16 Example Graph to demonstrate terminology

• Graph – A graph (G) is defined as the set of all the vertices (V) of the graph and all the edges (E). In the figure above, the graph would be the set where V = {A, B, C, D, E, F} and E = {1, 2, 3, 4, 5, 6, 7}.

• Vertex – A point in a graph where zero or more edges meet. This point is often also referred to as a node. In the figure above, a vertex would be any of the items in {A, B, C, D, E}.

• Edge – A relationship between two vertices within a graph. In the figure above, an edge would be any of the items in {1, 2, 3, 4, 5, 6, 7}.

• Adjacency – If two vertices are connected via an edge, then they are said to be adjacent to each other. In the figure above, B and C are adjacent to A.

• Incident – An edge is incident to a vertex if the vertex is one of its endpoints. In the figure above, 1 and 2 are incident to A.

• Subgraph – A subset of the vertices and edges in a graph. In the figure above, one subgraph would be the set where V = {B, D, E} and E = {3, 4}.

• Path – The sequence of vertices and edges crossed to navigate the graph from one vertex to another. In the figure above, a path from A to D would consist of A – 1 -> B – 3 -> D.

• Traversal – The process of visiting the elements of a graph as specified by the parameters of a query. This term is often used interchangeably with the term query.

• Query – See Traversal. These two terms are used interchangeably.

• Loop – An edge that has the same start and end vertex. In the figure above, 6 is an example of a loop.

• Cycle – A path that starts and ends at the same vertex. In the figure above, A – 1 -> B – 3 -> D – 7 -> A is an example of a cycle.

• Degree – The number of edges that are incident to a specific vertex. In a directed graph, degree is often further defined as the in-degree, the number of edges incoming to the vertex, or the out-degree, the number of edges going out of the vertex. In the figure above, B would have a degree=3, an in-degree=1, and an out-degree=2.

• Multigraph – A graph that has edges parallel to one another; in other words, a graph that has more than one edge between the same two vertices. In this book, the graph instances we will be referring to throughout are multigraphs.
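Several of these terms map directly onto a few lines of code. The Python sketch below uses a small made-up directed multigraph (it is not a reproduction of Figure 1.16), and the `Edge` tuple and helper function names are hypothetical, introduced only to illustrate adjacency, incidence, degree, and loops.

```python
from collections import namedtuple

# An edge has a label, a vertex it leaves (out_v), and a vertex it enters (in_v).
Edge = namedtuple("Edge", ["label", "out_v", "in_v"])

# A made-up graph for illustration only.
vertices = {"A", "B", "C", "D"}
edges = [
    Edge(1, "A", "B"),
    Edge(2, "A", "C"),
    Edge(3, "B", "D"),
    Edge(4, "B", "D"),  # parallel to edge 3, which makes this a multigraph
    Edge(5, "C", "C"),  # a loop: same start and end vertex
    Edge(6, "D", "A"),  # closes the cycle A -> B -> D -> A
]


def incident_edges(v):
    """Labels of the edges that have v as one of their endpoints."""
    return [e.label for e in edges if v in (e.out_v, e.in_v)]


def adjacent(v):
    """Vertices connected to v by an edge in either direction."""
    return ({e.in_v for e in edges if e.out_v == v}
            | {e.out_v for e in edges if e.in_v == v})


def in_degree(v):
    return sum(1 for e in edges if e.in_v == v)


def out_degree(v):
    return sum(1 for e in edges if e.out_v == v)


def degree(v):
    # A loop contributes to both the in-degree and the out-degree.
    return in_degree(v) + out_degree(v)


loops = [e.label for e in edges if e.out_v == e.in_v]

print(incident_edges("A"), sorted(adjacent("A")))
print(degree("B"), in_degree("B"), out_degree("B"))
```

In this made-up graph, vertex B happens to have the same degree, in-degree, and out-degree as the B described in the Degree definition, which makes the numbers easy to compare.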

In graph theory, a graph is a data structure that represents a network of things and the connections between those things. As shown by Euler in the Seven Bridges problem, abstracting common systems into graphs is highly natural in domains such as finding directions on a map, connecting people in a social network, or tracing financial transactions.

1.3.3 Basic Graph Types

Graphs are not a new concept to software developers. They form the basis of many data structures that we commonly use in software development, likely without even realizing it. After all, common data structures such as linked lists and trees are graphs with a specific set of constraints. In this section, we are going to examine the three basic types of graphs: undirected, directed, and weighted. We will discuss what defines each type of graph and provide examples of how each is used. To explain these concepts, we will find directions between three towns and discuss how each type of graph affects the problem.

UNDIRECTED GRAPH

Figure 1.17 Undirected Graph showing Three Towns

In our first example, we have three towns that are connected by three roads. These roads all allow traffic to flow in either direction, as shown in the figure above. In this scenario, our graph is an undirected graph. An undirected graph is a graph where the edges between two vertices do not specify a direction, which is represented by the absence of an arrow. In our example, none of the roads (edges) has a direction specified, which means that you can travel them in either direction. To find the shortest route from Town A to Town C, we travel the road (edge) that connects the two, since there is no restriction on the direction of travel.

The use of undirected graphs to describe real-world problems has risen dramatically over the past few years. Undirected graphs represent bi-directional relationships between items, such as electrical connections, communication networks, and even some social networks such as Facebook, where friends can each see the other one.

NOTE: Unlike the bi-directional “friends” relationship of Facebook, where you are a friend of the people you friend, Twitter is unidirectional because you can “follow” people who do not “follow” you.

In the context of graph databases and frameworks, few systems rely on undirected graphs as the underlying graph representation. However, most systems allow undirected graphs to be represented, either by allowing traversal across the edges in either direction or by creating a pair of symmetric edges between vertices.
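Both ways of representing an undirected graph can be sketched in a few lines of Python. This is a simplified illustration rather than the storage model of any particular graph database, and the function names `neighbors` and `out_neighbors` are hypothetical.

```python
# The three roads between the three towns, each stored once.
roads = [("A", "B"), ("B", "C"), ("A", "C")]


# Option 1: keep one edge per road and ignore direction when traversing.
def neighbors(town):
    """Towns reachable over one road, whichever end we start from."""
    return sorted({b for a, b in roads if a == town}
                  | {a for a, b in roads if b == town})


# Option 2: store a symmetric pair of directed edges for every road.
directed_edges = set(roads) | {(b, a) for a, b in roads}


def out_neighbors(town):
    """Follow only outgoing edges; the mirrored edges make this symmetric."""
    return sorted(b for a, b in directed_edges if a == town)


print(neighbors("C"))      # ['A', 'B']
print(out_neighbors("C"))  # ['A', 'B']
```

The first option stores half as many edges, while the second keeps every traversal a simple "follow the outgoing edges" operation at the cost of keeping the two copies in sync.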

DIRECTED GRAPH

In our second example, the road between Town A and Town B is a highway represented by a pair of symmetric edges. The roads from Town B to Town C and from Town A to Town C have both been changed into one-way streets. Each of these roads has a direction associated with it that limits travel to only that direction. For example, you can travel directly from Town B to Town C but not from Town C to Town B.

Figure 1.18 Directed Graph showing Three Towns

In this scenario, our graph is a directed graph. A directed graph is a graph where each edge has a direction associated with it. Even with the added limitations on the direction of travel, we can still travel from Town A to Town C by traveling the directed road between them. However, no travel is allowed out of Town C, as all the roads (edges) only permit travel into Town C, not out of it.

Directed graphs are the basis of most graph databases on the market today. While the restriction on moving in only one direction on an edge may seem like a problematic constraint, in practice most graph databases allow for traversing edges in both the in and out directions. Directed graphs are used to model many real-world use cases such as dependency graphs, flow networks like roads or rivers, and social networking use cases such as Twitter, where connections are not automatically symmetric.
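The directed version of the three-town example can be sketched in Python as a list of (tail, head) pairs. This is an illustrative in-memory model with hypothetical helper names, not the API of a specific graph database. It shows both the direction-respecting traversal and the reverse lookup that most graph databases also permit.

```python
# Each tuple is a directed edge (tail, head): travel is allowed tail -> head.
edges = [
    ("A", "B"), ("B", "A"),  # the highway: a symmetric pair of edges
    ("B", "C"),              # one-way street into Town C
    ("A", "C"),              # one-way street into Town C
]


def out_neighbors(town):
    """Towns you can drive to directly, respecting edge direction."""
    return sorted(head for tail, head in edges if tail == town)


def in_neighbors(town):
    """Towns with a road leading here: the same edges walked backwards."""
    return sorted(tail for tail, head in edges if head == town)


print(out_neighbors("A"))  # ['B', 'C']
print(out_neighbors("C"))  # [] because no road leads out of Town C
print(in_neighbors("C"))   # ['A', 'B']
```

Walking an edge "in" rather than "out" does not change the stored direction; it only changes which end of the edge the traversal starts from.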

WEIGHTED GRAPH

In our third type of graph, we have added a value, or weight, to each edge to represent the relative amount of time it takes to travel between the towns. The road between Town A and Town B is a highway, so it has the shortest travel time, meaning it has the lowest relative weight. The road between Town A and Town C is a hilly mountain road, so it has the longest travel time and hence the highest relative weight.

Figure 1.19 Weighted Graph showing Three Towns

In this scenario, the graph is a weighted graph. A weighted graph is a graph where each edge has a relative value associated with it, known as the “weight” of the edge. While counterintuitive, the quickest route from Town A to Town C is to go from Town A to Town B and then on to Town C, instead of directly to Town C. The sum of the weights of the edges from Town A to Town B to Town C (0.8) is lower than the weight of the edge from Town A to Town C (0.9), giving it a shorter travel time. In a weighted graph, we use weights to provide additional information about the topology of the graph that helps us to determine the least expensive path between two vertices, which may or may not be the path with the fewest hops.

Weighted graphs are commonly used in real-world use cases because it is rare that all paths through a system have the same relative cost. By expressing these inequalities within our graph, we can apply them to our traversals and algorithms to provide a realistic representation of the information contained within the graph.
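Finding the least expensive path in a weighted graph is commonly done with Dijkstra's algorithm, sketched below in Python. The text only gives the total for the route through Town B (0.8) and the weight of the direct road (0.9), so the individual weights of 0.3 and 0.5 are assumptions chosen to add up to that total. The roads are also assumed to be two-way, and `cheapest_path` is a hypothetical function name.

```python
import heapq

# Assumed weights: only the 0.8 total via B and the 0.9 direct road come
# from the text; the 0.3 / 0.5 split is illustrative.
roads = {
    "A": {"B": 0.3, "C": 0.9},
    "B": {"A": 0.3, "C": 0.5},
    "C": {"A": 0.9, "B": 0.5},
}


def cheapest_path(graph, start, goal):
    """Dijkstra's algorithm: always expand the cheapest known route first."""
    queue = [(0.0, start, [start])]  # (cost so far, current vertex, path taken)
    visited = set()
    while queue:
        cost, vertex, path = heapq.heappop(queue)
        if vertex == goal:
            return cost, path
        if vertex in visited:
            continue
        visited.add(vertex)
        for neighbor, weight in graph.get(vertex, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None  # the goal cannot be reached from the start


cost, path = cheapest_path(roads, "A", "C")
print(path, round(cost, 2))
```

With these weights, the two-hop route through Town B is returned even though the direct road has fewer hops, which is the behavior described above.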

APPLYING THEORY TO REAL WORLD PROBLEMS

Now that we have learned a bit of graph theory, how do we apply it to building applications? While the specifics of graph theory are not often used when working with graph databases, a fundamental understanding of the constructs a database uses helps to create a clearer mental picture of how our system functions. A robust set of fundamentals allows us to begin thinking about problems as graphs, instead of the more familiar relational concepts of rows and columns. In brief, a strong understanding of graph terminology and of which type of graph to apply allows us to immerse ourselves in the graph world by enabling us to think in the common terms of graph databases.

In the relational world, understanding the basics of set theory allows us to fundamentally understand what our database is doing when we use INNER JOIN, OUTER JOIN, or FULL JOIN. In graph databases, an understanding of the fundamental concepts of graph theory allows us to understand how we are interacting with our graph. This conceptual understanding of the inner workings not only helps us optimize a solution but also enables us to debug it much more quickly.

The items presented in this section are the ones I consider most useful for giving graph database developers the foundation they need to work with graph databases. As you build more complex applications on graph databases, it is likely that you will need to dive much deeper into aspects of graph theory beyond what was discussed here, but these fundamentals will provide a stepping stone on that path.

1.4 Summary

• Graph databases treat the relationships between items as rich features of the data in a way that other types of databases do not. These rich relationships allow for better or novel ways of working with our data using concepts such as recursive queries, differing result types, and paths.

• Graph databases allow you to think about problems in fundamentally different ways than other data stores by enabling you to model items such as constraints and dependencies directly in the data itself.

• Graph databases are a powerful type of NoSQL database that handles highly interconnected data that is difficult to store and retrieve in relational databases.

• Undirected graphs are graphs where the edges do not specify a direction, and they are used to represent problems such as electronic circuits or communication networks.

• Directed graphs are graphs where each edge has a direction associated with it, and they are frequently utilized to describe problems such as dependency graphs or social networks.

• Weighted graphs are graphs where each edge has a relative cost, or weight, associated with it. Weighted graphs are employed to represent problems where the effort required to walk each relationship is not necessarily equal, such as map directions or network routing.
