Parsing Gigabytes of JSON Per Second
Total Page:16
File Type:pdf, Size:1020Kb
Parsing Gigabytes of JSON per Second Geoff Langdale · Daniel Lemire Abstract JavaScript Object Notation or JSON is a The JSON syntax can be viewed as a restricted form ubiquitous data exchange format on the Web. Ingesting of JavaScript, but it is used in many programming lan- JSON documents can become a performance bottleneck guages. JSON has four primitive types or atoms (string, due to the sheer volume of data. We are thus motivated number, Boolean, null) that can be embedded within to make JSON parsing as fast as possible. composed types (arrays and objects). An object takes Despite the maturity of the problem of JSON pars- the form of a series of key-value pairs between braces, ing, we show that substantial speedups are possible. where keys are strings (e.g., f"name":"Jack","age":22g). We present the first standard-compliant JSON parser An array is a list of comma-separated values between to process gigabytes of data per second on a single brackets (e.g., [1,"abc",null]). Composed types can core, using commodity processors. We can use a quar- contain primitive types or arbitrarily deeply nested com- ter or fewer instructions than a state-of-the-art refer- posed types as values. See Fig. 1 for an example. The ence parser like RapidJSON. Unlike other validating JSON specification defines six structural characters (`[', parsers, our software (simdjson) makes extensive use of `f', `]', `g', `:', `,'): they serve to delimit the locations Single Instruction, Multiple Data (SIMD) instructions. and structure of objects and arrays. To ensure reproducibility, simdjson is freely available as To access the data contained in a JSON document open-source software under a liberal license. from software, it is typical to transform the JSON text into a tree-like logical representation, akin to the right- hand-side of Fig. 1, an operation we call JSON parsing. 1 Introduction We refer to each value, object and array as a node in the parsed tree. After parsing, the programmer can access JavaScript Object Notation (JSON) is a text format each node in turn and navigate to its siblings or its used to represent data [4]. It is commonly used for children without need for complicated and error-prone browser-server communication on the Web. It is sup- string parsing. ported by many database systems such as MySQL, Post- Parsing large JSON documents is a common task. greSQL, IBM DB2, SQL Server, Oracle, and data-science Palkar et al. state that big-data applications can spend arXiv:1902.08318v6 [cs.DB] 2 Jan 2020 frameworks such as Pandas. Many document-oriented 80{90% of their time parsing JSON documents [25]. databases are centered around JSON such as CouchDB Boncz et al. identified the acceleration of JSON parsing or RethinkDB. as a topic of interest for speeding up database process- Geoff Langdale ing [2]. branchfree.org JSON parsing implies error checking: arrays must Sydney, NSW start and end with a bracket, objects must start and end Australia E-mail: geoff[email protected] with a brace, objects must be made of comma-separated pairs of values (separated by a colon) where all keys are Daniel Lemire Universit´edu Qu´ebec (TELUQ) strings. Numbers must follow the specification and fit Montreal, Quebec within a valid range. Outside of string values, only a Canada few ASCII characters are allowed. Within string val- E-mail: [email protected] ues, several characters (like ASCII line endings) must root { "Width": 800 "Width": 800, "Height": 600, "Height": 600 "Title":"View from my room", "Title": "View from my room" "Url":"http://ex.com "Url": "http://ex.com/img.png" /img.png", "Private": false, "Private": false "Thumbnail":{ "Thumbnail" "Url":"http://ex. com/th.png", "Url": "http://ex.com/th.png" "Height": 125, "Height": 125 "Width": 100 }, "Width": 100 "array":[ "array" 116 , 943 , 116 234 ], 943 "Owner": null 234 } "Owner": null Fig. 1: JSON example be escaped. The JSON specification requires that docu- To our knowledge, publicly available JSON validat- ments use a unicode character encoding (UTF-8, UTF- ing parsers make little use of SIMD instructions. Due 16, or UTF-32), with UTF-8 being the default. Thus to its complexity, the full JSON parsing problem may we must validate the character encoding of all strings. not appear immediately amenable to vectorization. JSON parsing is therefore more onerous than merely One of our core results is that SIMD instructions locating nodes. Our contention is that a parser that combined with minimal branching can lead to new speed accepts erroneous JSON is both dangerous|in that it records for JSON parsing|often processing gigabytes will silently accept malformed JSON whether this has of data per second on a single core. We present sev- been generated accidentally or maliciously|and poorly eral specific performance-oriented strategies that are of specified—it is difficult to anticipate or widely agree on general interest. what the semantics of malformed JSON files should be. { We detect quoted strings, using solely arithmetic To accelerate processing, we should use our proces- and logical operations and a fixed number of in- sors as efficiently as possible. Commodity processors structions per input bytes, while omitting escaped (Intel, AMD, ARM, POWER) support single-instruction- quotes (x 3.1.1). multiple-data (SIMD) instructions. These SIMD instruc- { We differentiate between sets of code-point values tions operate on several words at once unlike regular in- using vectorized classification thus avoiding the bur- structions. For example, starting with the Haswell mi- den of doing N comparisons to recognize that a croarchitecture (2013), Intel and AMD processors sup- value is part of a set of size N (x 3.1.2). port the AVX2 instruction set and 256-bit vector reg- { We validate UTF-8 strings using solely SIMD in- isters. Hence, on recent x64 processors, we can com- structions (x 3.1.5). pare two strings of 32 characters in a single instruction. It is thus straightforward to use SIMD instructions to 2 Related Work locate significant characters (e.g., `"', `=') using few in- structions. We refer to the application of SIMD instruc- A common strategy to accelerate JSON parsing in the tions as vectorization. Vectorized software tends to use literature is to parse selectively. Alagiannis et al. [1] fewer instructions than conventional software. Every- presented NoDB, an approach where one queries the thing else being equal, code that generates fewer in- JSON data without first loading it in the database. It structions is faster. relies in part on selective parsing of the input. Bonetta A closely related concept to vectorization is branch- and Brantner use speculative just-in-time (JIT) com- less processing: whenever the processor must choose be- pilation and selective data access to speed up JSON tween two code paths (a branch), there is a risk of in- processing [3]. They find repeated constant structures curring several cycles of penalty due to a mispredicted and generate code targeting these structures. branch on current pipelined processors. In our experi- Li et al. present their fast parser, Mison which can ence, SIMD instructions are most likely to be beneficial jump directly to a queried field without parsing inter- in a branchless setting. mediate content [17]. Mison uses SIMD instructions to 2 quickly identify some structural characters but other- FishStore [29] parses JSON data and selects sub- wise works by processing bit-vectors in general purpose sets of interest, storing the result in a fast key-value registers with branch-heavy loops. Mison does not at- store [6]. While the original FishStore relied on Mison, tempt to validate documents; it assumes that docu- the open-source version1 uses simdjson by default for ments are pure ASCII as opposed to unicode (UTF- fast parsing. 8). We summarize the most relevant component of the Pavlopoulou et al. [26] propose a parallelized JSON Mison architecture as follows: processor that supports advanced queries and rewrite rules. It avoids the need to first load the data. 1. In a first step, the input document is compared Sparser filters quickly an unprocessed document to against each of the structural characters (`[', `f', `]', find mostly just the relevant information [25], and then `g', `:', `,') as well as the backslash (`n'). Each com- relies on a parser. We could use simdjson with Sparser. parison uses a SIMD instruction, comparing 32 pairs Systems based on selective parsing like Mison or of bytes at a time. The comparisons are then con- Sparser might be beneficial when only a small subset verted into bitmaps where the bit value 1 indicate of the data is of interest. However, if the data is ac- the presence of the corresponding structural char- cessed repeatedly, it might be preferable to load the acter. Mison omits the structural characters related data in a database engine using a standard parser. Non- to arrays (`[', `]') when they are unneeded. Mison validating parsers like Mison might be best with tightly only uses SIMD instructions during this first step; integrated systems where invalid inputs are unlikely. it also appears to be the only step that is essentially branch-free. 2. During a second step, Mison identifies the starting 2.1 XML Parsing and ending point of each string in the document. It uses the quote and backslash bitmaps. For each Before JSON, there has been a lot of similar work done quote character, Mison counts the number of pre- on parsing XML. Noga et al. [24] report that when fewer ceding backslashes using a fast instruction (popcnt): than 80% of the values need to be parsed, it is more quotes preceded by an odd number of backslashes economical to parse just the needed values. Marian et are turned off and ignored.