Graph Databases for E-Commerce Platforms

Dr. Simon Wong, The Hong Kong Polytechnic University

Dr. Jack Wu, The Hong Kong Polytechnic University

ABSTRACT

Organizations today need not only to manage larger volumes of data, but also to generate insights from the data they already hold. In this situation, the relationships between data points become critical for extracting more value from data, which creates a demand for technology that can manage data relationships effectively. This paper introduces graph database technology, which is based on a graph data model, and addresses why graph databases may be selected over relational databases for storing and analyzing data with better performance. Graph databases not only store data relationships effectively, but are also flexible when a data model must be expanded or adapted to dynamic business needs. Furthermore, this paper presents how graph databases can be applied to e-commerce platforms in terms of data management and analysis. Some application scenarios are highlighted as examples.

Keywords: E-Commerce, Graph Database, Recommendation Models and Algorithms

1. INTRODUCTION

The evolution of technology is always driven by business needs and changes. The need to handle large-scale data volumes, a greater variety of data (e.g., structured, unstructured and semi-structured data) and higher data velocity has driven this evolution, especially in the database field. Data modeling and databases evolved together, and their history dates back to the 1960s.

The first wave consists of network, hierarchical, inverted list and object-oriented database management systems (DBMSs) [1]; it took place from roughly 1960 to 1999. The second wave was led by relational DBMSs around 1990, and relational DBMSs dominated the market for more than two decades [8]. The third wave was introduced by Online Analytical Processing (OLAP) and business intelligence tools [10]. The most recent wave includes NoSQL and graph databases, driven by the big data demands of the Internet and mobile technologies [7].

The researchers in this study are interested in the graph databases of the most recent wave. The study addresses why graph databases are well suited to storing and analyzing the data of e-commerce platforms, and how they can bring additional value to the next generation of e-commerce platforms. A literature review on database development is presented first. Then, graph theory and the graph data model [3] on which graph databases are based [11] are introduced. Graph data modeling sets a new standard for the visualization of data models: using graphs to represent data models changes the way data and their patterns are understood. Furthermore, key advantages of using graph databases are addressed, and relational, NoSQL and graph databases are compared in terms of analytical capability and performance. Lastly, some examples of adopting graph databases in the e-commerce industry are presented.

2. LITERATURE REVIEW

Among the contributions to data modeling recognized by the Turing Award, one of the most influential is “Relational Database: A Practical Foundation for Productivity”, presented by E. F. Codd in 1981 [5, 6]. At that time, however, critics of the relational model raised two major issues: the complexity of data modeling due to normalization, and performance. What really turned relational databases into reliable production tools was the advent of robust query optimizers in 1990 [4]. That started the golden age of relational databases, which took over most database processing in most industries. Afterwards, there was no new breakthrough in relational technology until 2000. Some companies started developing alternatives for specialized niches such as documents, graphs, semantics and high-volume applications. Around 2008, NoSQL attracted considerable market attention because of Facebook’s open source releases of Hive and Cassandra. That was the critical period driving the evolution from relational databases to NoSQL databases [7]. Developers started building robust and scalable systems with NoSQL databases for the Internet and mobile world.

With respect to database structure, the Relational Database Management System (RDBMS) has long been the default data storage engine. It has been the major engine for handling most types of transaction processing, operational and reporting applications. However, to fit data into relational tables and columns, data structures must be normalized and data duplication across tables minimized. As a result, the stored structures differ substantially from the things they represent, and an object-relational mapping (ORM) layer is needed to map the data to the object model, which is a graph in nature. This relational structure limits the capability to execute more sophisticated types of analysis and to explore connections between various entities. Though these connections can be expressed with primary key and foreign key relationships, the relational model cannot optimally capture all the valuable information associated with entity connections.

Graph databases use a graph model to store data as a graph, a structure consisting of vertices (or nodes) and edges (or relationships). Each node represents an entity such as a person, a place, a thing or another piece of data, and each relationship represents how two nodes are associated. This structure allows modeling all kinds of scenarios defined by relationships. A graph database acts as a database management system and supports Create, Read, Update and Delete (CRUD) operations on the graph data model. Unlike a relational database, a graph database does not need to infer data connections from foreign keys.

As graph database technology is new to the market, including the e-commerce industry, this paper focuses on analyzing the strengths of graph databases by comparing them with other database types, and on introducing some techniques for achieving better performance when adopting graph databases specifically for e-commerce platforms. Furthermore, some core areas that leverage graph database technology to add new value to businesses are also a major focus of this paper.

3. GRAPH THEORY & DATA MODEL

3.1 Graph and Graph Theory

As presented by Vukotic et al. [11], graphs are regarded as the most natural structure to represent relationships between entities. A solution to the seven bridges of Königsberg problem, published by Leonhard Euler in 1736, was the first scientific paper on graph theory [11]. Its model and algorithms have been adopted in various engineering domains. The idea of a graph database originated from the graph theory of mathematics, which represents a graph by a diagram in the plane where the vertices are represented by points and each edge is indicated by a line segment or curve joining the two points that correspond to its vertices [11]. Figure 1 demonstrates an example of a graph.

Figure 1. A graph with 6 nodes, denoted by N1 to N6, and the set of edges connecting these nodes

3.2 Graph Data Model

Graph data modeling is the process of expressing an arbitrary domain as a connected graph of nodes and relationships, which are the elemental components for building a graph. Both nodes and relationships can have properties, which provide additional information about the data points and the connections between them. This concept is illustrated in Figure 2.

(Person A) -[Friend of]-> (Person B)
(Person B) -[Friend of]-> (Person A)
(Person A) -[Purchase]-> (Product 1)
(Person B) -[Purchase]-> (Product 1)

Figure 2. A graph data model representing relationships between two persons and a product

In Figure 2, person A (node) knows person B (node) and they are friends (relationships) of each other. Both of them purchase product 1 (node). Additional information can be represented as attributes of both nodes and relationships, as indicated in Figure 3.

Person A (node): Name: Peter Jackson; Gender: Male; Age: 30
Person B (node): Name: Amy Licester; Gender: Female; Age: 25
Product 1 (node): Product: hand-bag; Brand: Poker
Friend of (relationship, in both directions between Person A and Person B): Friendship since 2008
Purchase (relationship, from each person to Product 1): one On 01/09/2017 with Rating: 5, the other On 01/11/2017 with Rating: 1

Figure 3. Nodes and relationships with properties

The attributes in Figure 3 are presented as key-value properties. This relationship information makes it possible to answer questions such as who purchased the product (i.e., the Poker hand-bag) earlier and who gave a higher rating to the purchasing experience. Furthermore, the relationship information between data points prompts the question of whether the purchasing decision is influenced by the relationship between these two persons.
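The property graph of Figure 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Node` and `Relationship` and the property keys are hypothetical, and so is the assignment of each purchase (date and rating) to a particular person, which is chosen here purely for demonstration. Dates are read as day/month/year; the ordering of the two dates is the same under a month/day/year reading.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    kind: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


# Nodes with key-value properties, as in Figure 3.
person_a = Node("Person", {"name": "Peter Jackson", "gender": "Male", "age": 30})
person_b = Node("Person", {"name": "Amy Licester", "gender": "Female", "age": 25})
product_1 = Node("Product", {"product": "hand-bag", "brand": "Poker"})

relationships = [
    Relationship("FRIEND_OF", person_a, person_b, {"since": 2008}),
    Relationship("FRIEND_OF", person_b, person_a, {"since": 2008}),
    # ASSUMPTION: which person made which purchase is illustrative only.
    Relationship("PURCHASE", person_a, product_1, {"on": "01/09/2017", "rating": 5}),
    Relationship("PURCHASE", person_b, product_1, {"on": "01/11/2017", "rating": 1}),
]

# All purchase relationships that end at Product 1.
purchases = [r for r in relationships if r.kind == "PURCHASE" and r.end is product_1]


def purchase_date(rel):
    """Parse the 'on' property of a purchase relationship."""
    return datetime.strptime(rel.properties["on"], "%d/%m/%Y")


# Who purchased earlier, and who gave the higher rating?
earliest = min(purchases, key=purchase_date)
best_rated = max(purchases, key=lambda r: r.properties["rating"])

print(earliest.start.properties["name"], "purchased first")
print(best_rated.start.properties["name"], "gave the higher rating")
```

Because the answers are read directly from relationship properties, no separate lookup structure is needed to connect a person to a purchase.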

In short, graph data modeling can help to explore additional information between various data points. This kind of capability makes graph databases a good choice for storing and analyzing data of e-commerce platforms.

4. CHOICES OF DATABASES FOR E-COMMERCE

Choosing the right data store to house essential data is critical in terms of application performance, scalability and business development. In this section, three major types of database engine, namely relational, NoSQL and graph databases, are analyzed and compared. Figure 4 exhibits an overview of the database family.

Database Family
    Relational Databases
    NoSQL Databases
        Column-family stores
        Document-oriented databases
        Key-Value Stores
        Graph Databases

Figure 4. An overview of the database family

As shown in Figure 4, the database family consists of relational and NoSQL databases. The NoSQL database family further divides into key-value stores, column-family stores, document-oriented databases and graph databases.

4.1 Relational Databases

Relational databases have been the core database engine of most applications for the past two decades. They require designers to strictly structure data into tables and columns with well-defined data types and lengths in the database schema. They also typically undergo normalization, which organizes data in fields and tables for the purpose of reducing data redundancy. This process involves breaking down larger tables with many columns into smaller tables and joining tables using primary and foreign key pairs. Therefore, joining tables is a common and frequent operation when querying relational databases. Conceptually, each table join creates the workload of retrieving all potential combinations of rows (or records) and then filtering out those rows that do not match the criteria specified in the SQL WHERE clause. This kind of operation can be intensive in both computing and memory costs. When a query involves a large data volume and joins many tables, filtering out all the records that do not match the criteria becomes very expensive. That is the major reason why relational databases perform slowly when handling large datasets. Moreover, as the database schema (i.e., the number of tables, the number of columns in each table and the data type of each column) is defined at the system design stage, it is difficult to accommodate changes (e.g., adding more columns) after implementation. Therefore, the flexibility of relational databases is reduced.

On the other hand, the ACID (atomicity, consistency, isolation and durability) properties are an advantage of relational databases in transactional processing. The ACID properties give transactions “all or nothing” behavior, which ensures strong database consistency.
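The “all or nothing” behavior can be illustrated with a minimal sketch using SQLite through Python's standard `sqlite3` module. The table names, the stock constraint and the `place_order` helper are hypothetical and chosen only for this example; they are not taken from any particular e-commerce system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0))")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")
conn.execute("INSERT INTO stock VALUES ('hand-bag', 1)")
conn.commit()


def place_order(item, qty):
    """Record an order and decrement stock as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty))
            conn.execute("UPDATE stock SET qty = qty - ? WHERE item = ?", (qty, item))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the order row was rolled back too.
        return False


first = place_order("hand-bag", 1)   # enough stock: both statements commit
second = place_order("hand-bag", 1)  # stock would go negative: nothing is kept

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_left = conn.execute("SELECT qty FROM stock WHERE item = 'hand-bag'").fetchone()[0]
print(first, second, order_count, stock_left)
```

The second call inserts an order row before the stock update fails, yet that row does not survive: the transaction is undone as a unit.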

4.2 NoSQL Databases

NoSQL databases store data in the key-value format using dynamic database schemas. This means that there is no need to define the table structure, such as its fields (or attributes), before using the table. The structure of records can easily be changed by adding new fields or deleting existing ones, and records need not have an identical set of fields. This data model can represent hierarchical relationships and store complex data types (i.e., structured, semi-structured and unstructured data). NoSQL databases are thus designed to cope with the scale and agility challenges of modern applications, which are delivered as services accessible from many devices and designed to scale globally to serve millions of users. These requirements have made NoSQL databases one of the popular types of database engine. It is often considered that NoSQL databases sacrifice some data consistency (ACID) in order to gain higher scalability and flexibility [2].
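As a rough illustration of a dynamic schema, the sketch below uses a plain Python dictionary as a stand-in for a key-value store. The keys and field names are hypothetical, and a real NoSQL engine adds persistence, distribution and querying on top of this idea.

```python
# A dictionary standing in for a key-value store: key -> record.
# No table structure is declared before records are written.
store = {}

store["customer:1"] = {"name": "Peter", "age": 30}
store["customer:2"] = {
    "name": "Amy",
    "email": "amy@example.com",
    "addresses": [{"city": "Hong Kong"}],  # nested (hierarchical) data
}

# Adding a field to one record later requires no schema migration.
store["customer:1"]["loyalty_tier"] = "gold"

# Records in the same store need not share an identical set of fields.
field_sets = {key: sorted(record) for key, record in store.items()}
print(field_sets)
```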

4.3 Graph Databases

The NoSQL movement was born as an acknowledgement that new technologies were required to cope with changes in data and with data volume. Graph databases, as a part of the NoSQL movement, are rising in popularity among the ranks of NoSQL databases. They store data as entities (nodes) and relationships (edges), and allow users to query the data as a graph. Queries written against graph databases are closer to how the data is modeled than queries in other query languages. A great advantage of graph queries is that they eliminate the need to join multiple tables to find the relationships between data points, because the relationships are embedded in the data itself. On the one hand, graph databases share the advantages of NoSQL databases, namely high flexibility and scalability. On the other hand, graph databases can also offer strong data integrity (the ACID properties), as relational databases do. These are the reasons why graph databases have drawn much attention in recent research and practice.

Recent research [9] also shows that efficient pattern matching and search can be performed on very large graphs (up to 10 million vertices and 250 million edges) in graph databases. Regarding performance, graph traversal, the process of visiting nodes in a graph by following its relationships, plays an essential role because it determines the search order during querying. Taking an optimal path through the graph lets the traversal complete quickly with less computing and memory resources. There are two major algorithms for controlling the search order, namely breadth-first and depth-first. The breadth-first algorithm explores the graph as widely as possible: the traversal visits all nodes on the first level below the current node before moving on to their child nodes. The depth-first algorithm, in contrast, explores as deeply as possible: the traversal moves to the first child node that has not yet been visited, and goes back to the parent node once every child of the current node has been visited.

These two algorithms have different characteristics. The breadth-first algorithm gives better performance when the result is close to the starting node. The depth-first algorithm performs better when the result lies in the left part of the graph, that is, along the branches explored first. Therefore, selecting the right algorithm for a specific scenario is critical to performance.
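The difference in search order can be made concrete with a minimal Python sketch. The six-node graph below is hypothetical (it is not the graph of Figure 1), and children are listed from left to right. The position of a node in each visiting order indicates how soon that traversal would find it.

```python
from collections import deque

# Hypothetical directed graph as adjacency lists; children listed left to right.
graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def breadth_first(graph, start):
    """Visit nodes level by level, nearest to the start node first."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


def depth_first(graph, start):
    """Follow the first unvisited child as deep as possible, then backtrack."""
    order, seen = [], set()

    def visit(node):
        seen.add(node)
        order.append(node)
        for child in graph[node]:
            if child not in seen:
                visit(child)

    visit(start)
    return order


bfs_order = breadth_first(graph, "A")
dfs_order = depth_first(graph, "A")
print("breadth-first:", bfs_order)
print("depth-first:  ", dfs_order)
```

In this graph, node C sits one step from the start and is reached sooner by the breadth-first order, whereas node D lies deep in the leftmost branch and is reached sooner by the depth-first order.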

4.4 Comparisons

Table 1 presents the comparison among relational, NoSQL and graph databases.

Table 1. Comparison among relational, NoSQL and graph databases

Criterion | Relational Databases | NoSQL Databases | Graph Databases
Database Structure | Table-based database (in the form of tables consisting of n rows) | Key-value stores, column-family stores, document-oriented databases | Wide-column store (schema-less model without a standard schema definition)
Database Schema | Pre-defined schemas | Dynamic schemas / schemaless | Dynamic schemas / schemaless
Scalability | Vertically scalable | Horizontally scalable | Horizontally scalable
Query Languages | SQL (structured query language) | UnQL (unstructured query language) | Cypher, SPARQL, GraphQL, Gremlin
Data Storage Types | Table-based storage | Hierarchical data storage | Hierarchical data storage
Data Relationships | Tables associated with primary and foreign keys | Not strong at handling data relationships | Nodes are connected with each other by relationships
Data Handling | Handling structured data | Handling unstructured data | Handling structured and unstructured data
Data Integrity | Strong | Medium | Strong
Strength / Purpose | Handling heavy-duty transactional applications | Handling high-data-volume applications | Handling high-data-volume and analytical applications

Organizations adopt NoSQL databases because they can build applications faster, handle highly diverse data types and manage applications more efficiently at scale. NoSQL databases remove the complex ORM layer that translates objects in code to relational tables. This flexibility allows database schemas to evolve easily with business changes. NoSQL databases can also scale within and across multiple distributed environments, a capability that sets them apart from relational databases, for which scaling involves significant engineering work. As a result, NoSQL databases are widely adopted for a variety of use cases including the Internet of Things, real-time analytics, personalization, and catalog and content management. Graph databases add further value by allowing business data to be analyzed, and insights to be drawn from it, in a more natural and effective way.

5. APPLICATIONS OF GRAPH DATABASES FOR E-COMMERCE PLATFORMS

A paradigm shift has taken place in the e-commerce business, from the Business-to-Business and Business-to-Customer models to the more recent Business-to-Individual model. The shift was induced by e-commerce players who want to engage their customers by providing more tailor-made products or services in a competitive market. This in turn drove technology development towards that purpose and made graph database technology a key component of the technology spectrum. This section introduces two scenarios of using graph databases to create new business advantages.

5.1 Building Real-time Recommendation Engines

Recommendation engines have been used in many industries to understand customer behaviors and preferences. They were originally popularized by e-commerce players like Amazon and eBay, who leveraged this kind of technology to deliver better customer services. Traditionally, these engines ran in a batch processing mode in which records were processed at night through a series of calculations or algorithms to generate product recommendations for each customer. When first launched, this application gave those pioneers a competitive advantage over other market players. However, the advantage could not last once more players adopted the same approach. This pushed companies to develop more advanced engines that could set them apart from the competition, namely engines that could handle more data in a shorter period of time or even in real time. Such engines were expected to leverage the most up-to-date data and recalculate their recommendations in real time. The aim was to give customers a personalized experience, like having a skillful salesperson who understands their needs and provides more accurate suggestions. The ultimate expectation, of course, was to generate more revenue through up-selling or cross-selling.

However, e-commerce players faced growing difficulties in handling more complex algorithms and massively increased data volumes. The common short-term solution was to increase hardware capabilities, such as computing power and memory size, in order to speed up the calculations. A bottleneck was nevertheless to be expected eventually, as the underlying system architecture or data model struggled to cope with large-scale data volumes brought by business growth.

The primary design principle of recommendation engines is to find objects that are similar, such as customers who purchase the same products, products that have similar functionality and products purchased by the same customer. With this kind of information, the engines assign a score to these similarities and model the shopping behavior and preferences of customers. The secondary principle is to find out what actions customers have taken. Imagine that Person A purchases product X and product Y. Person B, who is scored as behaving similarly to Person A, purchases product X and may therefore have an interest in, or an intention of, purchasing product Y as well.
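The Person A / Person B example can be turned into a minimal scoring sketch. The paper does not prescribe a similarity measure; Jaccard similarity over purchase sets is used here purely as an assumption, and the data, the extra customer Person C and the function names are hypothetical.

```python
# Hypothetical purchase history: customer -> set of purchased products.
purchases = {
    "Person A": {"X", "Y"},
    "Person B": {"X"},
    "Person C": {"Z"},
}


def jaccard(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(customer, purchases):
    """Score products the customer has not bought by the similarity of their buyers."""
    own = purchases[customer]
    scores = {}
    for other, items in purchases.items():
        if other == customer:
            continue
        similarity = jaccard(own, items)
        if similarity == 0:
            continue  # customers with nothing in common contribute nothing
        for product in items - own:
            scores[product] = scores.get(product, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


suggestions = recommend("Person B", purchases)
print(suggestions)
```

Person B shares product X with Person A, so product Y is suggested with a score equal to their similarity, while Person C, who shares nothing with Person B, has no influence on the result.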

The current problem is that most recommendation engines in the market are not designed to analyze relationships between data points efficiently and effectively. These engines are built on relational databases, which model relationships by joining tables via primary and foreign key pairs. The table-joining process can be costly in terms of computing power and memory size. With more table joins or higher data volumes, the process becomes slower and takes more time to complete. Therefore, it is hard to achieve the objective of building real-time engines at scale.

NoSQL databases struggle with another problem. They are not strong at handling data relationships because they have no built-in means of connecting data. Additional logic must be built at the application level to express these relationships, which increases the computing workload and complexity at that layer.

This problem can be resolved with graph databases, whose essence is to handle records and relationships at scale with powerful search capability. In graph databases, data are not segregated into separate tables, and there is no need to join tables. Instead, data are represented by entities (or nodes) with relationships (or edges) specifying how these nodes are linked to each other. Because these relationships are made explicit by edge elements, traversing the graph model from one node to another is computationally inexpensive. This property makes it possible to handle complicated queries effectively in real time.
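The contrast between joining tables and traversing edges can be sketched as follows. This is a deliberately simplified illustration with hypothetical data: the nested-loop join shown is a worst case, and real relational engines use indexes and query optimizers to avoid examining every pair of rows. The point is only that a traversal touches the neighbors of the starting node rather than whole tables.

```python
# Relational-style rows (hypothetical data).
friend_rows = [(1, 2), (2, 1), (2, 3), (3, 2)]           # (customer_id, friend_id)
purchase_rows = [(1, "hand-bag"), (2, "hand-bag"),
                 (2, "wallet"), (3, "scarf")]            # (customer_id, product)


def products_of_friends_join(customer_id):
    """Naive nested-loop join: every pair of rows is examined, then filtered."""
    comparisons, result = 0, set()
    for c, f in friend_rows:
        for p_customer, product in purchase_rows:
            comparisons += 1
            if c == customer_id and f == p_customer:
                result.add(product)
    return result, comparisons


# Graph-style storage: each node keeps direct references to its neighbors.
friends = {1: [2], 2: [1, 3], 3: [2]}
purchased = {1: ["hand-bag"], 2: ["hand-bag", "wallet"], 3: ["scarf"]}


def products_of_friends_traversal(customer_id):
    """Follow edges outward from the starting node only."""
    steps, result = 0, set()
    for friend in friends[customer_id]:
        steps += 1
        for product in purchased[friend]:
            steps += 1
            result.add(product)
    return result, steps


join_result, join_cost = products_of_friends_join(1)
walk_result, walk_cost = products_of_friends_traversal(1)
print(join_result, join_cost)
print(walk_result, walk_cost)
```

Both functions return the same products, but the work done by the traversal depends only on how many friends and purchases surround the starting customer, not on the total size of the dataset.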

5.2 Building Knowledge Graphs for Enterprise Data Management

Enterprise data are an enterprise's most strategic asset. However, in most cases, enterprise data are diverse, heterogeneous and distributed. Enterprises often find it difficult to manage and leverage their data thoroughly, even though they understand its potential value. As the data volume keeps increasing massively, data management and analytics become even more difficult.

Graph database technology plays an important role in this area, as it facilitates building knowledge graphs, which store information in a graph model and use graph queries to let users easily navigate relationship-based datasets. For example, by using a knowledge graph, enterprises can add topical information to product catalogs, and build and query complex models of regulatory rules or even general information such as Wiki data. The value proposition of a knowledge graph for enterprises is that all data, data sources and databases of every type can be represented in it. The concept resembles data warehousing, but at scale, at speed, and in the graph data model.
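A very small knowledge graph can be sketched as a set of (subject, predicate, object) triples. The triple representation, the predicate names and the facts below are assumptions made for illustration; they show how data from different hypothetical sources (a product catalog, topical reference data, order records) can sit in one graph and be navigated together.

```python
# Hypothetical facts merged from several sources into one set of triples.
triples = {
    ("Product 1", "is_a", "hand-bag"),            # from a product catalog
    ("Product 1", "brand", "Poker"),              # from a product catalog
    ("hand-bag", "belongs_to", "accessories"),    # from topical reference data
    ("Peter Jackson", "purchased", "Product 1"),  # from order records
}


def match(subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def categories_bought_by(person):
    """Three-hop navigation: person -> product -> product type -> category."""
    found = set()
    for _, _, product in match(person, "purchased"):
        for _, _, kind in match(product, "is_a"):
            for _, _, category in match(kind, "belongs_to"):
                found.add(category)
    return found


peter_categories = categories_bought_by("Peter Jackson")
print(peter_categories)
```

The query crosses three sources without any of them knowing about the others, which is the kind of relationship-based navigation a knowledge graph is meant to support.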

In short, building knowledge graphs gives enterprises competitive advantages, as it delivers real-time insights from the relationships in enterprise data and enables users to visualize those relationships more easily and effectively.

6. CONCLUDING REMARKS AND FUTURE WORK

In this study, the researchers highlighted some advantages of using graph databases by explaining the evolution of database technologies, comparing major database types in terms of database characteristics, and addressing why using graphs to represent data models changes the way data patterns are understood. Graph data modeling can be a new standard for the visualization of data patterns. Furthermore, two major applications were presented as examples of how graph databases can be used to add value to the e-commerce industry specifically.

With respect to future work of this study, the researchers propose:
• carrying out some experiments to verify the performance of the database types,
• running some sample datasets to simulate e-commerce scenarios with graph databases, and
• finding suitable configurations, such as searching algorithms, for e-commerce scenarios.

7. REFERENCES

[1] Cattell, R. (1994). Object Data Management: Object-Oriented and Extended Relational Database Systems, 2nd Edition, Boston: Addison-Wesley.

[2] Cattell, R. (2011). Scalable SQL and NoSQL Data Stores, ACM Sigmod Record, 39(4), 12-27.

[3] Chartrand, G. and Zhang, P. (2005). A First Course in Graph Theory, Boston: McGraw-Hill Higher Education.

[4] Chaudhuri, S. (1998, May). An Overview of Query Optimization in Relational Systems, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, Seattle, Washington, 34-43.

[5] Codd, E. F. (1981, November). The 1981 ACM Turing Award Lecture, ACM '81, Los Angeles, California.

[6] Codd, E. F. (1982). Relational Database: A Practical Foundation for Productivity, Communications of the ACM, 25(2), 109-117.

[7] Frisendal, T. (2016). Graph Data Modeling for NoSQL and SQL: Visualize Structure and Meaning, New Jersey: Technics Publications.

[8] Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems, 3rd Edition, Boston: McGraw-Hill.

[9] Saltz, M., Jain, A., Kothari, A., Fard, A., Miller, J. A., and Ramaswamy, L. (2014, June). Dualiso: An algorithm for Subgraph Pattern Matching on Very Large Labeled Graphs, Proceedings of the 3rd IEEE International Congress on Big Data, Anchorage, Alaska, 498-505.

[10] Turban, E., Sharda, R. and Delen, D. (2011). Decision Support and Business Intelligence Systems, 9th Edition, Boston: Prentice-Hall.

[11] Vukotic, A., Watt, N., Abedrabbo, T., Fox, D. and Partner, J. (2015). Neo4j In Action, New York: Manning Publications.