Improving Performance of Schemaless Document Storage in Postgresql Using BSON
Total Page:16
File Type:pdf, Size:1020Kb
Improving performance of schemaless document storage in PostgreSQL using BSON Geoffrey Litt Seth Thompson John Whittaker Yale University geoff[email protected] [email protected] [email protected] Abstract database systems has exploded, in terms of both size NoSQL database systems have recently been gaining and variety. The traditional Relational Database Man- in popularity. These databases provide schema flexi- agement Systems (RDBMSs) that have ruled the mar- bility compared to traditional relational databases like ketplace for decades are still the best option for many PostgreSQL, but consequently give up certain desir- use cases, but NoSQL systems such as MongoDB [1] able features such as ACID guarantees. One compro- are increasing in popularity. Some benefits provided mise is to store schemaless documents within a re- by NoSQL databases include a simpler key-value data lational database—an architecture which PostgreSQL model, strong support for cluster environments (espe- has recently made possible by adding native support for cially those composed of commodity hardware), and JSON (JavaScript Object Notation), a common format emphasis on availability over consistency. for multi-level key-value documents. Some have suggested that a way for RDBMSs to Currently, PostgreSQL provides validation of JSON compete with NoSQL competitors—or at least to inno- documents upon input and native operators for query- vate in a time of changing needs—is to offer the ability ing operators within a JSON document. However, it to store schemaless documents within a traditional rela- stores JSON documents internally as text, which is in- tional framework [2]. This feature allows users to lever- efficient for many use cases. In this paper, we imple- age the ACID guarantees, performance benefits, code ment support for BSON (Binary JSON), a lightweight maturity, and other advantages of a relational database, binary encoding format for JSON-like documents which while also offering schema flexibility when required. enables fast traversals. We observe that using BSON to The open-source database PostgreSQL [3] recently store documents increases performance by two to eight released such support. In late 2012, the latest stable times when querying keys within documents, without build, version 9.2, added a native JSON data type. compromising document insertion times. JSON is a human-readable format for schemaless doc- ument storage. Postgres’ implementation simply ap- General Terms NoSQL, RDBMS plied JSON validation to a stored text field contain- Keywords PostgreSQL, JSON, BSON ing a string representing each document. At the time of the release, querying values within a document re- 1. Introduction quired using an external JavaScript engine for pars- ing sub-document elements. The current development With the rapid growth over the past decade of big build (9.3dev) has improved the built-in functionality data fueled by data-driven applications, the field of by adding native operators to query values within a JSON document, without the need for shared libraries Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are or external functions. not made or distributed for profit or commercial advantage and that copies The operator syntax is simple: an operation like bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific SELECT data->'key1' FROM json documents will permission and/or a fee. return the value associated with the key "key1", in the CPSC 438 Final Project April 29, 2013, New Haven, CT. JSON-type data column ”data”. Copyright c 2013 . $15.00 Although these developments are clearly useful, ples, suggesting that our results hold for a wide variety their scope is limited by the fact that the JSON type in of data sets. Ultimately, we believe that BSON’s ad- PostgreSQL remains a simple syntax validation. JSON vantages merit its replacement of JSON in almost all documents are stored in human-readable text format in use cases. the database—without any optimizations for efficient The rest of the paper is organized as follows: Section storage or fast traversal. This approach enables doc- 2 provides background information on PostgreSQL, uments to be easily passed to clients and end users, JSON, and BSON. Section 3 describes our implemen- but does not necessarily enable performant querying of tation of BSON in PostgreSQL. Section 4 presents per- values within a document. formance results when querying values inside a docu- In contrast, the popular NoSQL database Mon- ment in JSON and BSON, with a variety of data sets goDB stores documents internally in the BSON binary and use cases. Section 5 gives our conclusions and format. BSON is a lightweight binary encoding for plans for future work. JSON-like documents, which is specifically designed for speedy traversal compared to the human-readable JSON format. 2. Background To clarify a matter of semantics, this paper uses 2.1 PostgreSQL ”JSON-like” to refer specifically to the format of the hi- PostgreSQL [3] is an open-source relational database erarchical data document shared by both JSON proper system. It implements the majority of the most recent and BSON. This format is identical to the end user (hu- ISO and ANSI standard for the SQL database query man or machine) in that both JSON proper and BSON language, has ACID-guarantees, and is fully transac- documents are simply strings of keys and values orga- tional. Among PostgreSQL’s advantages are its avail- nized into dictionaries or arrays by braces and brackets, ability for numerous platforms, highly extensible archi- respectively. In the case of JSON proper, this format is tecture, support for complex data types, and mature, 15- also how the document is encoded for storage. But in year-old codebase. We decided to modify PostgreSQL the case of BSON, the interchange format is further en- because it is a reliable and efficient database system, coded for storage into a binary blob, a process which is and because it already had support for schemaless doc- described in more detail in Section 2.3. ument storage. In this paper, we describe our implementation of a PostgreSQL currently has three methods for storing native BSON type in PostgreSQL. We implement a schemaless data: XML, hstore, and JSON. XML is user-facing interface which is nearly identical to the a hierarchical structured data format which became a JSON interface—users insert documents using JSON W3C standard in 1998. The format was the backbone syntax (which enables nesting of objects, arrays, etc.), of the SOAP network message passing protocol and is and query values within a document using the oper- still used today by many Java applications as a markup ator syntax introduced by PostgreSQL 9.3. However, language. However, XML’s age is starting to show: the backend storage mechanism is completely differ- many criticize its verbosity and complex feature set [4]. ent. Our implementation parses the JSON document It is also the slowest of PostgreSQL’s document storage into key-value pairs upon insertion, and then constructs types, and features no indexing [5]. a BSON document which encodes the same informa- The second unstructured storage type implemented tion in BSON format. Upon querying, instead of having in PostgreSQL is hstore, which maps keys to string to parse the text document as in JSON, the BSON im- values or other hstore values. Although hstore is highly plementation can directly iterate over keys, providing expressive, the lack of nested documents and multiple significant performance advantages. types limits its functionality as a versatile schemaless We found performance gains across data sets of store. widely varying tuple size and document size. In some The last, and most recently added schema, is JSON, tests, queries performed on keys in BSON documents described in detail in the following section. Initially ran over eight times as fast as those on correspond- conceived as JavaScript’s structure declaration format, ing JSON documents. BSON was performant for many JSON was turned into a language-agnostic protocol. data types and on deep queries of large numbers of tu- Today it enjoys ubiquity as a data exchange format { strings in BSON, just as in JSON.) Additionally, BSON "firstName": "George", enforces a maximum document size of 16MB (which is "lastName": "Washington, unlikely to be exceeded in most circumstances). "phoneNumbers": [ BSON has several advantages compared to JSON { which enable fast traversal. Many types (e.g. integers, "type": "home", booleans) are stored as fixed length fields, as opposed "number": "203 123 4567" to variable length fields which must be parsed in JSON. }, For example, a boolean in BSON is represented by a { single byte, as opposed to several bytes representing the "type": "business", word ”true” or ”false” in JSON. Additionally, BSON "number: "202 345 6789" uses native C data types for many types such as integers } and floats, which enables compact storage and more ] efficient parsing. Finally, BSON also encodes length } information for variable length fields like strings, so they can be skipped over easily when querying for Figure 1. An example JSON document a specific key deep in the document. BSON is not necessarily more compact on disk than the equivalent across many modern website. JSON is the most perfor- JSON, but can almost always be traversed more quickly mant of PostgreSQL’s document storage methods [5] for these reasons. 2.2 JSON JavaScript Object Notation [6] is a lightweight data- 3. Implementing BSON in PostgreSQL interchange format derived from JavaScript. It is de- signed to be easy for humans to read and write and We heavily utilized two open-source libraries to com- for machines to parse. An object in JSON is a set plete this project.