An LSM-based Tuple Compaction Framework for Apache AsterixDB (Extended Version)

Wail Y. Alkowaileet, University of California, Irvine, Irvine, CA, [email protected]
Sattam Alsubaiee, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia, [email protected]
Michael J. Carey, University of California, Irvine, Irvine, CA, [email protected]

ABSTRACT

Document database systems store self-describing semi-structured records, such as JSON, "as-is" without requiring the users to pre-define a schema. This provides users with the flexibility to change the structure of incoming records without worrying about taking the system offline or hindering the performance of currently running queries. However, the flexibility of such systems does not come for free. The large amount of redundancy in the records can introduce an unnecessary storage overhead and impact query performance. Our focus in this paper is to address the storage overhead issue by introducing a tuple compactor framework that infers and extracts the schema from self-describing semi-structured records during data ingestion. As many prominent document stores, such as MongoDB and Couchbase, adopt Log Structured Merge (LSM) trees in their storage engines, our framework exploits LSM lifecycle events to piggyback the schema inference and extraction operations. We have implemented and empirically evaluated our approach to measure its impact on storage, data ingestion, and query performance in the context of Apache AsterixDB.

PVLDB Reference Format:
Wail Y. Alkowaileet, Sattam Alsubaiee and Michael J. Carey. An LSM-based Tuple Compaction Framework in Apache AsterixDB. PVLDB, 13(9): xxxx-yyyy, 2020.
DOI: https://doi.org/10.14778/3397230.3397236

1. INTRODUCTION

Self-describing semi-structured data formats like JSON have become the de facto format for storing and sharing information as developers move away from the rigidity of schemas in the relational model. Consequently, NoSQL Database Management Systems (DBMSs) have emerged as popular solutions for storing, indexing, and querying self-describing semi-structured data. In document store systems such as MongoDB [11] and Couchbase [10], users are not required to define a schema before loading or ingesting their data since each data instance is self-describing (i.e., each record embeds metadata that describes its structure and values). The flexibility of the self-describing data model provided by NoSQL systems attracts applications where the schema can change in the future by adding, removing, or even changing the type of one or more values without taking the system offline or slowing down the running queries.

The flexibility provided in document store systems over the rigidity of the schemas in Relational Database Management Systems (RDBMSs) does not come without a cost. For instance, storing a boolean value for a field named hasChildren, which takes roughly one byte to store in an RDBMS, can take a NoSQL DBMS an order of magnitude more bytes to store. Defining a schema prior to ingesting the data can alleviate the storage overhead, as the schema is then stored in the system's catalog and not in each record. However, defining a schema defies the purpose of schema-less DBMSs, which allow adding, removing, or changing the types of the fields without manually altering the schema [18, 23]. From a user perspective, declaring a schema requires a thorough a priori understanding of the dataset's fields and their types.

Let us consider a scenario where a data scientist wants to ingest and analyze a large volume of semi-structured data from a new external data source without prior knowledge of its structure. Our data scientist starts by acquiring a few instances from the data source and tries to analyze their structures; she then builds a schema according to the acquired sample. After ingesting a few data instances, our data scientist discovers that some fields can have more than one type, which was not captured in her initial sample. As a result, she stops the ingestion process, alters the schema to accommodate the irregularities in the types of those fields, and then reinitiates the data ingestion process. In this case, our data scientist has to continuously monitor the system and alter the schema if necessary, which may result in taking the system offline or stopping the ingestion of new records. Having an automated mechanism to infer and consolidate the schema information for the ingested records without losing the flexibility and the experience of schema-less stores would clearly be desirable.

In this work, we address the problem of the storage overhead in document stores by introducing a framework that infers and compacts the schema information for semi-structured data during the ingestion process. Our design utilizes the lifecycle events of Log Structured Merge (LSM) tree [31] based storage engines, which are used in many prominent document store systems [10, 11], including Apache AsterixDB [19]. In LSM-backed engines, records are first accumulated in memory (the LSM in-memory component) and then subsequently written sequentially to disk (the flush operation) in a single batch (an LSM on-disk component).

Our framework takes the opportunity provided by LSM flush operations to extract and strip the metadata from each record and construct a schema for each flushed LSM component. We have implemented and empirically evaluated our framework to measure its impact on the storage overhead, data ingestion rate and query performance in the context of AsterixDB. Our main contributions can be summarized as follows:

• We propose a mechanism that utilizes the LSM workflow to infer and compact the schema for NoSQL systems' records during flush operations. Moreover, we detail the steps required for distributed query processing using the inferred schema.
• We introduce a non-recursive physical data layout that separates data values from their metadata, which allows us to infer and compact the schema efficiently for nested data.
• We introduce page-level compression in AsterixDB. This is similar to the solutions adopted by other NoSQL DBMSs to reduce the storage overhead of self-describing records.
• We evaluate the feasibility of our design, prototyped using AsterixDB, to ingest and query a variety of large semi-structured datasets. We compare our "semantic" approach of reducing the storage overhead to the "syntactic" approach of compression.

The remainder of this paper is structured as follows: Section 2 provides a preliminary review of the AsterixDB architecture and our implementation of page-level compression. Section 3 details the design and implementation of our tuple compaction framework in AsterixDB. Section 4 presents an experimental evaluation of the proposed framework. Section 5 discusses related work on utilizing the LSM lifecycle and on schema inference for semi-structured data. Finally, Section 6 presents our conclusions and discusses potential future directions for our work.

2. APACHE ASTERIXDB OVERVIEW

In this paper, we use Apache AsterixDB to prototype our tuple compactor framework. AsterixDB is a parallel semi-structured Big Data Management System (BDMS) which runs on large, shared-nothing, commodity computing clusters. To prepare the reader, we give a brief overview of AsterixDB [18, 23] and its query execution engine Hyracks [21]. Finally, we present our design and implementation of page-level compression in AsterixDB.

2.1 User Model

The AsterixDB Data Model (ADM) extends the JSON data model to include types such as temporal and spatial types as well as data modeling constructs (e.g., bag or multiset). Defining an ADM datatype (akin to a schema in an RDBMS) that describes at least the primary key(s) is required to create a dataset (akin to a table in an RDBMS). There are two options when defining a datatype in AsterixDB: open and closed. Figure 1 shows an example of defining a dataset of employee information. In this example, we first define DependentType, which declares two fields name and age of types string and int, respectively. Then, we define EmployeeType, which declares id, name and dependents of types int, string and a multiset of DependentType, respectively. The symbol "?" indicates that a field is optional. Note that we defined the type EmployeeType as open, where data instances of this type can have additional undeclared fields. On the other hand, we define the DependentType as closed, where data instances can only have declared fields. In both the open and closed datatypes, AsterixDB does not permit data instances that do not have values for the specified non-optional fields. Finally, in this example, we create a dataset Employee of the type EmployeeType and specify its id field as the primary key.

    CREATE TYPE DependentType AS CLOSED {
        name: string,
        age: int
    };

    CREATE TYPE EmployeeType AS OPEN {
        id: int,
        name: string,
        dependents: {{DependentType}}?
    };

    CREATE DATASET Employee(EmployeeType) PRIMARY KEY id;

Figure 1: Defining Employee type and dataset in ADM

To query the data stored in AsterixDB, users can submit their queries written in SQL++ [24, 32], a SQL-inspired declarative query language for semi-structured data. Figure 2 shows an example of a SQL++ aggregate query posed against the dataset declared in Figure 1.

    SELECT VALUE nameGroup FROM Employee AS emp
    GROUP BY emp.name GROUP AS nameGroup

Figure 2: An example of a SQL++ query

2.2 Storage and Data Ingestion

In an AsterixDB cluster, each worker node (Node Controller, or NC for short) is controlled by a Cluster Controller (CC) that manages the cluster's topology and performs routine checks on the NCs. Figure 3 shows an AsterixDB cluster of three NCs, each of which has two data partitions that hold data on two separate storage devices. Data partitions in the same NC (e.g., Partition 0 and Partition 1 in NC0) share the same buffer cache and memory budget for LSM in-memory components; however, each partition manages the data stored in its storage device independently. In this example, NC0 also acts as a metadata node, which stores and provides access to AsterixDB metadata such as the defined datatypes and datasets.

AsterixDB stores the records of its datasets, spread across the data partitions in all NCs, in primary LSM B+-tree indexes. During data ingestion, each new record is hash-partitioned using the primary key(s) into one of the configured partitions (Partition 0 to Partition 5 in Figure 3) and inserted into the dataset's LSM in-memory component. AsterixDB implements a no-steal/no-force buffer management policy with write-ahead logging (WAL) to ensure the durability and atomicity of ingested data. When the in-memory component is full and cannot accommodate new records, the LSM Tree Manager (called the "tree manager" hereafter) schedules a flush operation. Once the flush operation is triggered, the tree manager writes the in-memory component's records into a new LSM on-disk component on the partition's storage device (Figure 4a).
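To make this ingestion path concrete, the following is a minimal Python sketch (not AsterixDB code) of routing a record to a partition by hashing its primary key and buffering it in an in-memory component until a flush is scheduled; the partition count, the MD5-based routing, and the record-count flush budget are assumptions made only for the example.

    import hashlib

    NUM_PARTITIONS = 6  # Partition 0 to Partition 5 in Figure 3

    def route_record(record, num_partitions=NUM_PARTITIONS):
        # Hash-partition a record on its primary key (id in the Employee example).
        key = str(record["id"]).encode("utf-8")
        digest = hashlib.md5(key).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions

    class InMemoryComponent:
        # Illustrative in-memory LSM component with a record-count flush budget.
        def __init__(self, budget=2):
            self.records = {}
            self.budget = budget

        def insert(self, record):
            self.records[record["id"]] = record
            return len(self.records) >= self.budget  # True => schedule a flush

    partitions = [InMemoryComponent() for _ in range(NUM_PARTITIONS)]
    for rec in [{"id": 0, "name": "Kim", "age": 26}, {"id": 1, "name": "John", "age": 22}]:
        target = route_record(rec)
        if partitions[target].insert(rec):
            print("partition", target, ": flush scheduled")

In the real system, the flush threshold is a memory budget rather than a record count, and durability is provided by the write-ahead log described above.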

Figure 3: Apache AsterixDB cluster configured with two partitions in each of the three NCs

Figure 4: (a) Flushing component C1 (b) Merging the two components C0 and C1 into a new component [C0, C1]

On-disk components are considered INVALID components while their flush operation is in progress. Once the flush is completed, the tree manager marks the flushed component as VALID by setting a validity bit in the component's metadata page. After this point, the tree manager can safely delete the logs for the flushed component. During crash recovery, any disk component with an unset validity bit is considered invalid and removed. The recovery manager can then replay the logs to restore the state of the in-memory component before the crash.

Once flushed, LSM on-disk components are immutable and, hence, updates and deletes are handled by inserting new entries. A delete operation adds an "anti-matter" entry [19] to indicate that a record with a specified key has been deleted. An upsert is simply a delete followed by an insert with the same key. For example, in Figure 4a, we delete the record with id = 0. Since the target record is stored in C0, we insert an "anti-matter" entry to indicate that the record with id = 0 is deleted. As on-disk components accumulate, the tree manager periodically merges them into larger components according to a merge policy [19, 29] that determines when and what to merge. Deleted and updated records are garbage-collected during the merge operation. In Figure 4b, after merging C0 and C1 into [C0, C1], we do not write the record with id = 0, as the record and the anti-matter entry annihilate each other. As in the flush operation, on-disk components created by a merge operation are considered INVALID until their operation is completed. After completing the merge, the older on-disk components (C0 and C1) can be safely deleted.

On-disk components in AsterixDB are identified by their component IDs, where flushed components have monotonically increasing component IDs (e.g., C0 and C1) and merged components have component IDs that represent the range of component IDs that were merged (e.g., [C0, C1]). AsterixDB infers the recency ordering of components by inspecting the component ID, which can be useful for maintenance [29]. We explain how our work uses this property later, in Section 3.2.
pression e.name in the query shown in Figure 2 is first trans- On-disk components in AsterixDB are identified by their lated into a function call getF ield(emp, “name”) where the component IDs, where flushed components have monotoni- argument emp is a record and “name” is the name of the re- cally increasing component IDs (e.g., C0 and C1) and merged quested field. Since name is a declared field, Algebricks can components have components IDs that represent the range rewrite the field access function to getF ield(emp, 1) where of component IDs that were merged (e.g., [C0,C1]). As- the second argument 1 corresponds to the field’s index in terixDB infers the recency ordering of components by in- the schema provided by the Metadata Node. specting the component ID, which can be useful for mainte- nance [29]. In this work, we explain how to use this property 2.4 Page-level Compression later in Section 3.2. As shown in Figure 4, the information of the undeclared Datasets’ records (of both open and closed types) in the field age is stored within each record. This could incur an LSM primary index are stored in a binary-encoded physi- unnecessary higher storage overhead if all or most records cal ADM format [3]. Records of open types that have un- declared fields are self-describing, i.e., the records contain 1The default number of query executors is equal to the num- additional information about the undeclared fields such as ber of data partitions in AsterixDB.

Figure 5: A compiled Hyracks job for the query in Figure 2

2.4 Page-level Compression

As shown in Figure 4, the information of the undeclared field age is stored within each record. This could incur an unnecessarily high storage overhead if all or most records of the Employee dataset have the same undeclared field, as records store redundant information. MongoDB and Couchbase have introduced compression to reduce the impact of storing redundant information in self-describing records. In AsterixDB, we introduce page-level compression, which compresses the leaf pages of the B+-tree of the primary index.

AsterixDB's new page-level compression is designed to operate at the buffer-cache level. On write, pages are compressed and then persisted to disk. On read, pages are decompressed to their original configured fixed size and stored in memory in AsterixDB's buffer cache. Compressed pages can be of any arbitrary size. However, the AsterixDB storage engine was initially designed to work with fixed-size data pages, where the size is a configurable parameter. Larger data pages can be stored as multiple fixed-size pages, but there is no mechanism to store smaller compressed pages. Any proposed solution to support variable-size pages must not change the current physical storage layout of AsterixDB.

Figure 6: Compressed file with its Look-Aside File (LAF)

To address this issue, we use Look-Aside Files (LAFs) to store offset-length entry pairs for the stored compressed data pages. When a page is compressed, we store both the page's offset and its length in the LAF before writing it to disk. Figure 6 shows a data file consisting of n compressed pages and its corresponding LAF. The number of entries in the LAF equals the number of pages in the data file, where each entry (e.g., e0) stores the size and the offset of its corresponding compressed data page (e.g., Page0). LAF entries can occupy more than one page, depending on the number of pages in the data file. Therefore, to access a data page, we first need to read the LAF page that contains the required data page's size and offset and then use them to access the compressed data page. This may require AsterixDB to perform an extra IO operation to read a data page. However, the number of LAF pages is usually small due to the fact that the entry size is small (12 bytes in our implementation). For instance, a 128KB LAF page can store up to 10,922 entries. Thus, LAF pages can be easily cached and read multiple times.
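The extra LAF indirection amounts to simple arithmetic over fixed-size entries. The sketch below assumes the 12-byte entry size and 128KB LAF page mentioned above; the function name and the way entries are addressed are illustrative rather than the exact AsterixDB code.

    LAF_ENTRY_SIZE = 12          # bytes per (offset, length) entry, as stated above
    LAF_PAGE_SIZE = 128 * 1024   # a 128KB LAF page holds 10,922 entries
    ENTRIES_PER_LAF_PAGE = LAF_PAGE_SIZE // LAF_ENTRY_SIZE

    def laf_location(data_page_id):
        # Return (LAF page number, byte offset within that LAF page) of the entry
        # holding the compressed data page's offset and length.
        laf_page = data_page_id // ENTRIES_PER_LAF_PAGE
        entry_offset = (data_page_id % ENTRIES_PER_LAF_PAGE) * LAF_ENTRY_SIZE
        return laf_page, entry_offset

    # Reading compressed data page 20,000 first consults LAF page 1 (usually cached),
    # then issues one read for the page's (offset, length) range in the data file.
    print(ENTRIES_PER_LAF_PAGE)   # 10922
    print(laf_location(20_000))   # (1, 108936)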
In addition to this "syntactic" approach based on compression, the next section introduces a "semantic" approach to reducing the storage overhead by inferring and stripping the schema out of self-describing records in AsterixDB. In Section 4, we evaluate both approaches (syntactic and semantic) when they are applied separately and when they are combined, and we show their impact on storage size, data ingestion rate, and query performance.

3. LSM-BASED SCHEMA INFERENCE AND TUPLE COMPACTION FRAMEWORK

The flexibility of schema-less NoSQL systems attracts applications where the schema can change without declaring those changes. However, this flexibility is not free. In the context of AsterixDB, Pirzadeh et al. [33] explored query execution performance when all the fields are declared (closed type) and when they are left undeclared (open type). One conclusion from their findings, summarized in Figure 7, is that queries with non-selective predicates (using secondary indexes) and scan queries took twice as much time to execute against open type records compared to closed type records due to their storage overhead.

Figure 7: Summary of the findings in [33]: (a) On-disk storage size (b) Query execution time

In this section, we present a tuple compactor framework (called the "tuple compactor" hereafter) that addresses the storage overhead of storing self-describing semi-structured records in the context of AsterixDB. The tuple compactor automatically infers the schema of such records and stores them in a compacted form without sacrificing the user experience of schema-less document stores. Throughout this section, we run an example of ingesting and querying data in the Employee dataset declared as shown in Figure 8. The Employee dataset here is declared with a configuration parameter, {"tuple-compactor-enabled": true}, which enables the tuple compactor.

    CREATE TYPE EmployeeType AS OPEN { id: int };
    CREATE DATASET Employee(EmployeeType)
        PRIMARY KEY id WITH {"tuple-compactor-enabled": true};

Figure 8: Enabling the tuple compactor for a dataset

We present our implementation of the tuple compactor by first showing the workflow of inferring the schema and compacting records during data ingestion, along with the implications for crash recovery, in Section 3.1. In Section 3.2, we show the structure of an inferred schema and a way of maintaining it on update and delete operations. Then, in Section 3.3, we introduce a physical format for self-describing records that is optimized for the tuple compactor operations (schema inference and record compaction). Finally, in Section 3.4, we address the challenges of querying compacted records stored in distributed partitions of an AsterixDB cluster.

3.1 Tuple Compactor Workflow

We first discuss the tuple compactor workflow during normal operation of data ingestion and during crash recovery.

3.1.1 Data Ingestion

When creating the Employee dataset (shown in Figure 8) in the AsterixDB cluster illustrated in Figure 3, each partition in every NC starts with an empty dataset and an empty schema. During data ingestion, newly incoming records are hash-partitioned on the primary keys (id in our example) across all the configured partitions (Partition 0 to Partition 5 in our example). Each partition inserts the received records into the dataset's in-memory component until it cannot hold any new record. Then, the tree manager schedules a flush operation on the full in-memory component. During the flush operation, the tuple compactor, as shown in the example in Figure 9a, factors the schema information out of each record and builds a traversable in-memory structure that holds the schema (described in Section 3.2). At the same time, the flushed records are written into the on-disk component C0 in a compacted form where their schema information (such as field names) is stripped out and stored in the schema structure. After inserting the last record into the on-disk component C0, the inferred schema S0 in our example describes two fields, name and age, with their associated types, denoted as FieldName:Type pairs. Note that we do not store the schema information of any explicitly declared fields (the field id in this example) as they are stored in the Metadata Node (Section 2.2). At the end of the flush operation, the component's inferred in-memory schema is persisted in the component's Metadata Page before setting the component as VALID (Section 2.2). Once persisted, on-disk schemas are immutable.

Figure 9: (a) Flushing the first component C0 (b) Flushing the second component C1 (c) Merging the two components C0 and C1 into the new component [C0, C1]

As more records are ingested by the system, new fields may appear or fields may change, and the newly inferred schema has to incorporate the new changes. The newly inferred schema will be a super-set (or union) of all the previously inferred schemas. To illustrate, during the second flush of the in-memory component to the on-disk component C1 in Figure 9b, the records of the new in-memory component, with ids 2 and 3, have their age values as missing and string, respectively. As a result, the tuple compactor changes the type of the inferred age field in the in-memory schema from int to union(int, string), which describes the records' fields for both components C0 and C1. Finally, C1 persists the latest in-memory schema S1 into its metadata page.

Given that the newest schema is always a super-set of the previous schemas, during a merge operation we only need to store the most recent schema of all the mergeable components, as it covers the fields of all the previously flushed components. For instance, Figure 9c shows that the resulting on-disk component [C0, C1] of the merged components C0 and C1 needs only to store the schema S1, as it is the most recent schema of {S0, S1}. Note that the merge operation does not need to access the in-memory schema and, hence, merge and flush operations can be performed concurrently without synchronization.

We chose not to compact the records of the in-memory component because (i) the in-memory component size is relatively small compared to the total size of the on-disk components, so any storage savings would be negligible, and (ii) maintaining the schema for the in-memory component, which permits concurrent modifications (inserts, deletes and updates), would complicate the tuple compactor's workflow and slow down the ingestion rate.
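The type-widening behavior of the flush-time inference (e.g., age becoming union(int, string) in S1) can be sketched as follows. This is a simplified Python model that tracks only top-level fields and two primitive types; the real tuple compactor builds the full schema tree structure described in Section 3.2.

    def type_name(value):
        return {int: "int", str: "string"}.get(type(value), type(value).__name__)

    def infer_into(schema, record, declared=("id",)):
        # Fold one flushed record into the component schema: field -> set of types.
        for field, value in record.items():
            if field in declared:          # declared fields live in the metadata node
                continue
            schema.setdefault(field, set()).add(type_name(value))
        return schema

    def render(schema):
        return {f: ("union(" + ", ".join(sorted(t)) + ")") if len(t) > 1 else next(iter(t))
                for f, t in schema.items()}

    s = {}
    for rec in [{"id": 0, "name": "Kim", "age": 26},      # first flush -> S0
                {"id": 1, "name": "John", "age": 22}]:
        infer_into(s, rec)
    print(render(s))   # {'name': 'string', 'age': 'int'}

    for rec in [{"id": 2, "name": "Ann"},                 # second flush -> S1
                {"id": 3, "name": "Bob", "age": "old"}]:
        infer_into(s, rec)
    print(render(s))   # {'name': 'string', 'age': 'union(int, string)'}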
3.1.2 Crash Recovery Implications

Apache AsterixDB's LSM-based engine guarantees that (i) records of in-memory components during the flush operation are immutable and (ii) flush operations of in-memory components are atomic (no half-flushed components are allowed). Those guarantees make the flush operation suitable for applying transformations on the flushed records before writing them to disk. The tuple compactor takes this opportunity to infer the schema during the flush operation, as (i) guarantees that records cannot be modified. At the same time, the tuple compactor transforms the flushed records into a compacted form, and the compaction process is guaranteed to be atomic as in (ii).

Now, let us consider the case where a system crash occurs during the second flush shown in Figure 9b. When the system restarts, the recovery manager will start by activating the dataset and then inspecting the validity of the on-disk components by checking their validity bits. The recovery manager will discover that C1 is not valid and remove it. As C0 is the "newest" valid flushed component, the recovery manager will read and load the schema S0 into memory. Then, the recovery manager will replay the log records to restore the state of the in-memory component before the crash. Finally, the recovery manager will flush the restored in-memory component to disk as C1, during which time the tuple compactor operates normally.
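A compact way to see this restart sequence is the sketch below. It only models the steps described in this subsection (validity check, loading the newest valid schema, log replay, re-flush); the component representation and the callback names are assumptions made for the example.

    def recover(components, replay_log, flush):
        # Components are ordered oldest to newest; each carries a validity bit and
        # the schema persisted in its metadata page.
        valid = [c for c in components if c["valid"]]       # drop INVALID components
        schema = valid[-1]["schema"] if valid else {}       # newest valid schema (S0)
        restored_records = replay_log()                     # rebuild the in-memory component
        flush(restored_records, schema)                     # re-flush; the tuple compactor runs normally
        return valid

    components = [
        {"id": "C0", "valid": True,  "schema": {"name": "string", "age": "int"}},   # S0
        {"id": "C1", "valid": False, "schema": None},                               # half-flushed
    ]
    recover(components,
            replay_log=lambda: [{"id": 2, "name": "Ann"}, {"id": 3, "name": "Bob", "age": "old"}],
            flush=lambda records, schema: print("re-flushing", len(records), "records; starting schema:", schema))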

3.2 Schema Structure

Previously, we showed the flow of inferring the schema and compacting the tuples during data ingestion. In this section, we focus on the inferred schema and present its structure. We also address the issue of maintaining the schema in case of delete and update operations, which may result in removing inferred fields or changing their types.

3.2.1 Schema Structure Components

Semi-structured records in document store systems are represented as a tree where the inner nodes of the tree represent nested values (e.g., JSON objects or arrays) and the leaf nodes represent scalar values (e.g., strings). ADM records in AsterixDB are represented similarly. Let us consider the example where the tuple compactor first receives the ADM record shown in Figure 10a during a flush operation, followed by five other records that have the structure {"id": int, "name": string}. The tuple compactor traverses the six records and constructs: (i) a tree structure that summarizes the records' structure, shown in Figure 10b, and (ii) a dictionary that encodes the inferred field name strings into FieldNameIDs, as shown in Figure 10c. The Counter in the schema tree structure represents the number of occurrences of a value, which we further explain in Section 3.2.2.

The schema tree structure starts with the root object node, which has the fields at the first level of the record (name, dependents, employment_date, branch_location, and working_shifts). We do not store any information here about the dataset's declared field id, as explained previously in Section 3.1. Each inner node (e.g., dependents) represents a nested value (object, array, or multiset) and the leaf nodes (e.g., name) represent the scalar (or primitive) values. Union nodes are used for object fields or collection (array and multiset) items whose values can be of different types. In this example, the tuple compactor infers the array item type of the field working_shifts as a union type of an array of integers and a string.

The edges between the nodes in the schema tree structure represent the nested structure of an ADM record. Each inner node of a nested value in the schema tree structure can have one or more children, depending on the type of the inner node. Children of object nodes (e.g., fields of the Root object) are accessed by FieldNameIDs (shown as integers on the edges of object nodes in Figure 10b) that reference the stored field names in the dictionary shown in Figure 10c. Each field name (or FieldNameID in the schema tree structure) of an object is unique, i.e., no two children of an object node share the same field name. However, children of different object nodes can share the same field name. Therefore, storing field names in a dictionary allows us to canonicalize repeated field names such as the field name name, which appears twice in the ADM record shown in Figure 10a. A collection node (e.g., dependents) has only one child, which represents the items' type. An object field or a collection item can be of heterogeneous value types, so their types may be inferred as a union of different value types. In a schema tree structure, the number of children a union node can have depends on the number of supported value types in the system. For instance, AsterixDB has 27 different value types [2]. Hence, a union node could have up to 27 children.

3.2.2 Schema Structure Maintenance

In Section 3.1, we described the flow involved in inferring the schema of newly ingested records, where we "add" more information to the schema structure. However, when deleting or updating records, the schema structure might need to be changed by "removing" information. For example, the record with id 3 shown in Figure 9 is the only record that has an age field of type string. Therefore, deleting this record should result in changing the type of the field age from union(int, string) to int, as the dataset no longer has the field age as a string. From this example, we see that on delete operations, we need to (i) know the number of appearances of each value, and (ii) acquire the old schema of a deleted or updated record.

During the schema inference process, the tuple compactor counts the number of appearances of each value and stores it in the schema tree structure's nodes. In Figure 10b, each node has a Counter value that represents the number of times the tuple compactor has seen this node during the schema inference process. From the schema tree structure in Figure 10b, we can see that there are six records that have the field name, including the record shown in Figure 10a. Also, we can infer from the schema structure that all fields other than name belong to the record shown in Figure 10a. Therefore, after deleting this record, the schema structure should only have the field name, as shown in Figure 11.

On delete, AsterixDB performs a point lookup to get the old record, from which the tuple compactor extracts its schema (we call the schema of a deleted record the "anti-schema"). Then, it constructs an anti-matter entry that includes the primary key of the deleted record and its anti-schema and inserts it into the in-memory component. During the flush operation, the tuple compactor processes the anti-schema by traversing it and decrementing the Counter of each node of the schema tree structure. When the counter value of a node in the schema tree structure reaches zero, we know that there is no record that still has this value (whether it is nested or a primitive value). Then, the tuple compactor can safely delete the node from the schema structure. As shown in Figure 11, after deleting the record in Figure 10a, the counter value corresponding to the field name is decremented from 6 to 5, whereas the other nodes of the schema structure (shown in Figure 10b) have been deleted as they were unique to the deleted record. After processing the anti-schema, the tuple compactor discards it before writing the anti-matter entry to disk. Upserts can be performed as deletes followed by inserts.

It is important to note that performing point lookups for maintenance purposes is not unique to the schema structure. For instance, AsterixDB performs point lookups to maintain secondary indexes [19] and LSM filters [17]. Luo et al. [28, 29] showed that performing point lookups for every upsert operation can degrade data ingestion performance. More specifically, checking for key-existence for every upserted record is expensive, especially in cases where the keys are mostly new. As a solution, a primary key index, which stores primary keys only, can be used to check for key-existence instead of using the larger primary index. In the context of retrieving the anti-schema on upsert, one can first check if a key exists by performing a point lookup using the primary key index. Only if the key exists is an additional point lookup performed on the primary index to get the anti-schema of the upserted record. If the key does not yet exist (a new key), the record can be inserted as a new record. In Section 4, we evaluate the data ingestion performance of our tuple compactor under heavy updates using the suggested primary key index.
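The counter bookkeeping of the schema structure can be sketched with a small tree of counted nodes, where inference increments counters and anti-schema processing decrements them and prunes nodes that reach zero. The sketch below is a simplification (it ignores value types, union nodes, and the field-name dictionary) and is not the AsterixDB implementation.

    class SchemaNode:
        # Simplified schema-tree node: a per-node counter plus named children.
        def __init__(self):
            self.counter = 0
            self.children = {}

    def observe(node, value):
        # Schema inference: increment counters along the record's structure.
        node.counter += 1
        if isinstance(value, dict):
            for field, child_value in value.items():
                observe(node.children.setdefault(field, SchemaNode()), child_value)

    def retract(node, anti_schema):
        # Anti-schema processing on delete: decrement counters and prune nodes at zero.
        node.counter -= 1
        if isinstance(anti_schema, dict):
            for field, child_value in anti_schema.items():
                child = node.children[field]
                retract(child, child_value)
                if child.counter == 0:
                    del node.children[field]

    root = SchemaNode()
    rich = {"name": "Ann", "dependents": {"name": "Bob", "age": 6}}
    observe(root, rich)
    for _ in range(5):
        observe(root, {"name": "x"})   # five more {"id": int, "name": string} records
    retract(root, rich)                 # delete the rich record using its anti-schema
    print({field: child.counter for field, child in root.children.items()})  # {'name': 5}

This mirrors Figure 11: after the deletion, only name survives, with its counter decremented from 6 to 5.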

    {
        "id": 1,
        "name": "Ann",
        "dependents": {{ {"name": "Bob", "age": 6},
                         {"name": "Carol", "age": 10} }},
        "employment_date": date("2018-09-20"),
        "branch_location": point(24.0, -56.12),
        "working_shifts": [[8, 16], [9, 17], [10, 18], "on_call"]
    }
    ... + 5 more {"id": int, "name": string} records

    Dictionary: 1 = name, 2 = dependents, 3 = age, 4 = employment_date,
                5 = branch_location, 6 = working_shifts

Figure 10: (a) An ADM record (b) Inferred schema tree structure (c) Dictionary-encoded field names

Figure 11: The schema structure after deleting the record shown in Figure 10a: only the field name (FieldNameID 1) remains, with its counter decremented to 5

3.3 Compacted Record Format

Since the tuple compactor operates during data ingestion, the process of inferring the schema and compacting the records needs to be efficient and should not degrade the ingestion rate. As the schema can change significantly over time, previously ingested records must not be affected or require updates. Additionally, sparse records should not need to store additional information about missing values, such as the null bitmaps in RDBMSs' records. For example, storing the record {"id": 5, "name": "Will"} with the schema shown in Figure 10b should not include any information about the other fields (e.g., dependents). Moreover, uncompacted records (in-memory components) and compacted records (on-disk components) should be evaluated and processed using the same evaluation functions to avoid any complexities when generating a query plan. To address those issues, we introduce a compaction-friendly physical record data format into AsterixDB, called the vector-based format.

3.3.1 Vector-based Physical Data Format

The main idea of the vector-based format is that it separates the metadata and values of self-describing records into vectors that allow us to manipulate the record's metadata efficiently during the schema inference and record compaction processes. In contrast to a columnar format, the vectors are stored within each record and the records are stored contiguously in the primary index (Section 2.2). Figure 12 depicts the structure of a record in the vector-based format. First comes the record's header, which contains information about the record such as its length. Next comes the values' tags vector, which enumerates the types of the stored primitive and nested values. Fixed-length primitive (or scalar) values such as integers are stored in the fixed-length values vector. The next vector is split into two sub-vectors, where the first stores the lengths and the second stores the actual values of the variable-length values. Lastly, the field names sub-vectors (lengths and values) store the field name information for all objects' fields in the record.

Figure 12: The structure of the vector-based format: a header (record length (4 bytes), number of values (4 bytes), lengths bit-widths (1 byte), and vector offsets (16 bytes)) followed by the values' type tags, the fixed-length values, the variable-length values (lengths and values), and the field names (lengths and values)

Figure 13 shows an example of a record in the vector-based format (see Appendix B for an additional example). The record has four fields: id, name, salaries, and age with the types integer, string, array of integers, and integer, respectively. Starting with the header, we see that the record's total size is 73 bytes and there are nine tags in the values' type tags vector. Lengths for variable-length values and field names are stored using the minimum number of bits. In our example, the maximum lengths of the variable-length values and field names are 3 (Ann) and 8 (salaries), respectively. Thus, we need at most 3 bits and 5 bits to store the length of each variable-length value or field name, respectively. We only actually need 4 bits for field name lengths; however, the extra bit is used to distinguish inferred fields (e.g., name) from declared ones (e.g., id), as we explain next.

Figure 13: An example record in the vector-based format: the record {"id": 6, "name": "Ann", "salaries": [70000, 90000], "age": 26} with a 25-byte header (total length 73 bytes, nine type tags, 3-bit value lengths and 5-bit field name lengths), the type tags (9 bytes), the fixed-length values 6, 70000, 90000, 26 (16 bytes), the variable-length values (a 1-byte length and the 3-byte value "Ann"), and the field names (4 bytes of lengths and 15 bytes of names)

After the header, we store the values' type tags. The values' type tags encode the tree structure of the record in a depth-first-search order. In this example, the record starts with an object tag to indicate the root's type. The first value of the root object is of type integer, and it is stored in the first four bytes of the fixed-length values. Since an object tag precedes the integer tag, this value is a child of that object (the root) and, hence, the first field name corresponds to it. Since the field id is a declared field, we only store its index (as provided by the metadata node) in the lengths sub-vector. We distinguish index values from length values by inspecting the first bit. If it is set, we know the length value is an index value of a declared field. The next value in the example record is of type string, which is the first variable-length value in the record. The string value is stored in the variable-length values' vector with its length. Similar to the previous integer value, this string value is also a child of the root and its field name (name) is next in the field names' vector. As the field name is not declared, the record stores both the name of the field and its length.

After the string value, we have the array tag of the field salaries. The subsequent integer tags indicate the array items' types. The array items do not correspond to any field name, and their integer values are stored in the fixed-length values' vector. After the last item of the array, we store a control tag to indicate the end of the array as the current nesting type and a return to the parent nesting type (the object type in this example). Hence, the subsequent integer value (age) is again a child of the root object type. At the end of the values' tags, we store a control tag EOV to mark the end of the values of the record.

As can be inferred from the previous example, the complexity of accessing a value in the vector-based format is linear in the number of tags, which is inferior to the logarithmic time provided by some traditional formats [3, 7]. We address this issue in more detail in Section 3.4.2.

3.3.2 Schema Inference and Tuple Compaction

Records in the vector-based format separate values from metadata. The example shown in Figure 13 illustrates how the fixed-length and variable-length values are separated from the record's nested structure (the values' type tags) and field names. When inferring the schema, the tuple compactor needs only to scan the values' type tags and the field names' vectors to build the schema structure.

Compacting vector-based records is a straightforward process. Figure 14 shows the compacted structure of the record in Figure 13 along with its schema structure after the compaction process. The compaction process simply replaces the field name string values with their corresponding FieldNameIDs after inferring the schema. It then sets the fourth offset in the header (Figure 12), which points to the field names' values sub-vector, to zero to indicate that the field names were removed and stored in the schema structure. As shown in the example in Figure 14, the record after the compaction needs just two bytes to store the field names' information, where each FieldNameID takes three bits (one bit for distinguishing declared fields and two for field name IDs), as compared to the 19 (4+15) bytes in the uncompacted form in Figure 13.

Figure 14: The record in Figure 13 after compaction: fixed-length values 6, 70000, 90000, 26 (16 bytes), variable-length values (a 1-byte length and "Ann", 3 bytes), and field name IDs 0, 1, 2, 3 (2 bytes), with the schema structure's dictionary mapping 1 to name, 2 to salaries, and 3 to age
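To make the vectors and the compaction step concrete, the following Python sketch builds the four vectors for a flat record and then swaps the field-name strings for FieldNameIDs. It is a simplification of Figures 12-14: nested values, control tags other than EOV, the bit-packed lengths, and the header are omitted, and the encoding choices (8-byte integers, a dictionary that assigns IDs on first use) are assumptions made for the example.

    import struct

    def to_vectors(record):
        # Split a flat record into (type tags, fixed-length bytes, variable-length
        # values, field names); nested values and bit-packing are omitted here.
        tags, fixed, var_values, field_names = ["object"], b"", [], []
        for name, value in record.items():
            field_names.append(name)
            if isinstance(value, int):
                tags.append("int")
                fixed += struct.pack(">q", value)   # 8-byte big-endian integer
            else:
                tags.append("string")
                var_values.append(str(value))
        tags.append("EOV")                           # end-of-values control tag
        return tags, fixed, var_values, field_names

    def compact(vectors, dictionary):
        # Compaction: replace field-name strings with FieldNameIDs kept in the
        # inferred schema's dictionary (IDs are assigned on first use here).
        tags, fixed, var_values, field_names = vectors
        ids = [dictionary.setdefault(name, len(dictionary) + 1) for name in field_names]
        return tags, fixed, var_values, ids

    dictionary = {}
    vectors = to_vectors({"name": "Ann", "salary": 70000, "age": 26})
    print(compact(vectors, dictionary))  # field names become [1, 2, 3]
    print(dictionary)                    # {'name': 1, 'salary': 2, 'age': 3}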

3.4 Query Processing

In this section, we explain our approach to querying compacted records in the vector-based format. We first show the challenges of having distributed schemas in different partitions and propose a solution that addresses this issue. Next, we zoom into each query executor and show the optimizations needed to process compacted records.

3.4.1 Handling Heterogeneous Schemas

As a scalability requirement, the tuple compaction framework operates in each partition without any coordination with other partitions. Therefore, the schema in each partition can be different from the schemas in other partitions. When a query is submitted, each distributed partition executes the same job. Having different schemas becomes an issue when the requested query needs to repartition the data to perform a join or group-by. To illustrate, suppose we have two partitions for the same dataset but with two different inferred schemas, as shown in Figure 15. We see that the schemas in both partitions have the field name of type string. However, the second field is age in partition 0 and salary in partition 1. After hash-partitioning the records by the name value, the resulting records are shuffled between the two query executors, and the last field can be either age or salary. Recall that partitions can be on different machines within the AsterixDB cluster and have no runtime access to the schema information of other partitions. Consequently, query executors cannot readily determine whether the last field corresponds to age or salary.

Figure 15: Two partitions with two different schemas: Partition 0's local schema is {name: string, age: int} and Partition 1's is {name: string, salary: int}; each executor runs the query SELECT VALUE emp FROM Employee AS e GROUP BY e.name AS name GROUP AS emp, the partitions' schemas are broadcast, and each record is prepended with its source partition ID

To solve the schema heterogeneity issue, we added functionality to broadcast the schema information of each partition to all nodes in the cluster at the beginning of a query's execution. Each node receives each partition's schema information along with its partition ID and serves the schemas to each executor in the same node. Then, we prepend each record resulting from the scan operator with the source partition ID. When an operator accesses a field, the operator uses both the prepended partition ID of the record and the distributed schema to perform the field access. Broadcasting the partitions' schemas can be expensive, especially in clusters with a large number of nodes. Therefore, we only broadcast the schemas when the query plan contains a non-local exchange operator, such as the hash-partition-exchange in our example in Figure 15. When comparing the schema broadcasting mechanism to handling self-describing records, a broadcast schema represents a batch of records, whereas the redundant schemas embedded in self-describing records are carried through the operators on a record-by-record basis. Thus, transmitting the schema once per partition instead of once per record is more efficient.
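The effect of broadcasting the per-partition schemas and prepending the source partition ID can be sketched as follows; the data layout (a record represented as a list of values in its local schema's field order) is a simplification of the compacted vector-based records, and the names used are illustrative.

    # Broadcast view of the per-partition schemas from Figure 15 (field order matters).
    broadcast_schemas = {
        0: ["name", "age"],      # Partition 0's local schema
        1: ["name", "salary"],   # Partition 1's local schema
    }

    def access_field(tagged_record, field):
        # tagged_record = (prepended source partition ID, values in local field order).
        partition_id, values = tagged_record
        schema = broadcast_schemas[partition_id]
        return values[schema.index(field)] if field in schema else None

    # Records shuffled to one executor after hash-partitioning on `name`:
    print(access_field((0, ["Ann", 34]), "age"))        # 34
    print(access_field((1, ["Ann", 80000]), "salary"))  # 80000
    print(access_field((1, ["Ann", 80000]), "age"))     # None: missing in that partition

Without the partition ID tag, the executor could not tell whether the second value of a shuffled record is an age or a salary.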

3.4.2 Processing Compacted Records

One notable difference between the vector-based format and the ADM physical format is the time complexity of accessing a value (as discussed in Section 3.3.1). The AsterixDB query optimizer can move field access expressions within the plan when doing so is advantageous. For instance, AsterixDB's query optimizer inlines field access expressions with WHERE clause conjunct expressions, as in:

    emp.age > 25 AND emp.name = "Ann"

The inlined field access expression emp.name is evaluated only if the expression emp.age > 25 is true. However, in the vector-based format, each field access requires a linear scan over the record's vectors, which could be expensive. To minimize the cost of scanning the record's vectors, we added one rewrite rule to the AsterixDB query optimizer to consolidate field access expressions into a single function expression. Therefore, the two field access expressions in our example will be rewritten as follows:

    [$age, $name] ← getValues(emp, "age", "name")

The function getValues() takes a record and path expressions as inputs and outputs the requested values of the provided path expressions. The two output values are assigned to the two variables $age and $name, and the final conjunct expression of our WHERE clause example is transformed into:

    $age > 25 AND $name = "Ann"

The function getValues() is also used for accessing array items by providing the item's index. For example, the expression emp.dependents[0].name is translated as follows:

    [$d_name] ← getValues(emp, "dependents", 0, "name")

Additionally, we allow a "wildcard" index to access nested values of all items of an array. For instance, the output of the expression emp.dependents[*].name is an array of all the name values in the array of objects dependents.
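The following Python sketch only illustrates the intent of the getValues() rewrite, i.e., gathering several requested paths (including an array index and a wildcard) in one call. It operates on plain Python dictionaries and lists rather than on the record's vectors, where the actual rule resolves all paths in a single scan; the helper name get_values is a hypothetical stand-in.

    def get_values(record, *paths):
        # Illustrative stand-in for getValues(): resolve several paths in one call.
        # A wildcard step ("*") maps the remaining path over the array's items.
        results = []
        for path in paths:
            value = record
            for position, step in enumerate(path):
                if step == "*":
                    rest = path[position + 1:]
                    value = [get_values(item, rest)[0] for item in value]
                    break
                value = value[step] if isinstance(step, int) else value.get(step)
            results.append(value)
        return results

    emp = {"age": 30, "name": "Ann",
           "dependents": [{"name": "Bob"}, {"name": "Carol"}]}
    age, name = get_values(emp, ("age",), ("name",))
    print(age > 25 and name == "Ann")                    # the rewritten WHERE clause
    print(get_values(emp, ("dependents", 0, "name")))    # ['Bob']
    print(get_values(emp, ("dependents", "*", "name")))  # [['Bob', 'Carol']]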
4. EXPERIMENTS

In this section, we experimentally evaluate the implementation of our tuple compactor in AsterixDB. In our experiments, we compare our compacted record approach with AsterixDB's current closed and open records in terms of (i) on-disk storage size after data ingestion, (ii) data ingestion rate, and (iii) the performance of analytical queries. We also conduct additional experiments to evaluate:

1. The performance of accessing values in records in the vector-based format (Section 3.3.1) with and without the optimization techniques explained in Section 3.4.2.
2. The impact of our approach on query performance using secondary indexes.
3. The scalability of our framework using computing clusters with different numbers of Amazon EC2 instances.

Experiment Setup. We conducted our initial experiments using a single machine with an 8-core (Intel i9-9900K) processor and 32GB of main memory. The machine is equipped with two storage drive technologies, SATA SSD and NVMe SSD, both of which have 1TB of capacity. The SATA SSD drive can deliver up to 550 MB/s for sequential read and 520 MB/s for sequential write, and the NVMe SSD drive can deliver up to 3400 MB/s for sequential read and 2500 MB/s for sequential write. Section 4.5 details the setup for our additional scale-out experiments.

We used AsterixDB v9.5.0 after extending it with our tuple compaction framework. We configured AsterixDB with 15GB of total memory, where we allocated 10GB for the buffer cache and 2GB for the in-memory component budget. The remaining 3GB is allocated as temporary buffers for operations such as sort and join.

In Section 2.4, we introduced our implementation of page-level compression in AsterixDB. Throughout our experiments, we also evaluate the impact of compression (using the Snappy [12] compression scheme) on the storage size, data ingestion rate, and query performance.

Schema Configuration. In our experiments, we evaluated the storage size, data ingestion rate, and query performance when defining a dataset as (i) open, (ii) closed, and (iii) inferred using our tuple compactor. For the open and inferred datasets, we only declare the primary key field, whereas in closed datasets, we pre-declare all the fields. The records of open and closed datasets are stored using the ADM physical format, whereas the inferred datasets use the new vector-based format. Note that the AsterixDB open case is similar to what schema-less NoSQL systems, like MongoDB and Couchbase, do for storage.

4.1 Datasets

In our experiments, we used three datasets (summarized in Table 1) which have different characteristics in terms of their records' structure, size, and value types.

Using the first dataset, we want to evaluate ingesting and querying social network data. We obtained a sample of tweets using the Twitter API [13]. Due to the daily limit on the number of tweets that one can collect from the Twitter API, we replicated the collected tweets ten times to have 200GB worth of tweets in total. Replicating the data would not affect the experiment results as (i) the tuple compactor's scope is the records' metadata (not the values) and (ii) the original data is larger than the compressible page size.

The second dataset we used is the Web of Science (WoS) [8] publication dataset.² The WoS dataset encompasses meta-information about scientific publications (such as authors, fundings and abstracts) from 1980 to 2016, with a total dataset size of 253GB. We transformed the dataset's record format from its original XML structure to a JSON one using an existing XML-to-JSON converter [14]. The resulting JSON documents contain some fields with heterogeneous types, specifically a union of object and array of objects. The reason behind using such a converter is to mimic the challenges a data scientist can experience when resorting to existing solutions. (Due to a lack of support for declared union types in AsterixDB, we could only pre-declare the fields with homogeneous types in the closed schema case.)

To evaluate more numeric Internet of Things (IoT)-like workloads, we generated a third synthetic dataset that mimics data generated by sensors. Each record in the Sensors dataset contains captured readings and their timestamps along with other information that monitors the health status of the sensor. The sensor data contains mostly numerical values and has a larger field-name-size to value-size ratio. The total size of the raw Sensors data is 122GB.

²We obtained the dataset from Thomson Reuters. Currently, Clarivate Analytics maintains it [9].

4.2 Storage Size

In this experiment, we evaluate the on-disk storage size after ingesting the Twitter, WoS and Sensors datasets into AsterixDB using the three formats (open, closed and inferred), and we compare it with MongoDB's storage size. Our goal in comparing with MongoDB's size is simply to show that the compressed open case is comparable to what other NoSQL systems take for storage using the same compression scheme (Snappy). (It is not the focus of this paper to compare both systems' data ingestion and query performance.)

9 Table 1: Datasets summary Twitter WoS Sensors Source Scaled Real-world Synthetic Total Size 200GB 253GB 122GB # of Records 77.6M 39.4M 25M Record Size ∼2.7KB ∼6.2KB 5.1KB # of Scalar val. (min, max, avg) 53, 208, 88 71, 193, 1430 248, 248, 248 Max. Depth 8 7 3 Dominant Type String String Double Union Type? No Yes No systems take for storage using the same compression scheme for the reasons explained earlier. When combined, the ap- (Snappy). (It is not the focus of this paper to compare both proaches were able to reduce the overall storage sizes by 5x, systems’ data ingestion and query performance.) 3.7x and 9.8x for the Twitter, WoS and Sensors datasets, re- We first evaluate the total on-disk sizes after ingesting spectively, compared to the open schema case in AsterixDB. the data into the open, closed and inferred datasets. We begin with the Twitter dataset. Figure 16a shows its total Uncompressed Open Closed Inferred on-disk sizes. We see that the inferred and closed schema Compressed Open Closed Inferred MongoDB datasets have lower storage footprints compared to the open 300 300 140 schema dataset, as both avoid storing field names in each 250 250 120 record. When compression is enabled, both formats still 200 200 100 80 have smaller size compared to the open format and to Mon- 150 150 60

Size (GB) 100 goDB’s compressed collection size. The size of the inferred 100 40 dataset is slightly smaller than the closed schema dataset 50 50 20 since the vector-based format does not store offsets for ev- 0 0 0 ery nested value (as opposed to the ADM physical format (a) Twitter dataset (b) WoS dataset (c) Sensors dataset in the closed schema dataset). For the WoS dataset, Figure 16b shows that the inferred Figure 16: On-disk sizes dataset again has the lowest storage overhead. Even after compression, the open dataset (and the compressed Mon- 4.3 Ingestion Performance goDB collection) had about the same size as the uncom- We evaluated the performance of continuous data inges- pressed inferred dataset. The reason is that WoS dataset tion for the different formats using AsterixDB’s data feeds structure has more nested values compared with the Twit- for the Twitter dataset. We first evaluate the insert-only in- ter dataset. The vector-based format has less overhead for gestion performance, without updates. In the second experi- such data, as it does not store the 4-byte offsets for each ment, we evaluate the ingestion performance for an update- nested value. intensive workload, where previously ingested records are The Sensors dataset contains only numerical values that updated by either adding or removing fields or changing the describe the sensors’ status along with their captured read- types of existing data values. The latter experiment mea- ings, so this dataset’s field name size to value size ratio is sures the overhead caused by performing point lookups to higher compared to the previous datasets. Figure 16c shows get the anti-schemas of previously ingested records. The that, in the uncompressed dataset, the closed and inferred Sensor dataset was also ingested through a data feed and datasets have about 2x and 4.3x less storage overhead, re- showed similar behavior to the Twitter dataset; we omit spectively, than the open dataset. The additional savings for these results here. the inferred dataset results from eliminating the offsets for Continuous data ingestion from a data feed is sensitive readings objects, which contain reading values along with to LSM configurations such as the merge-policy and the their timestamps — {"value": double, "timestamp": bigint}. memory budget. For instance, when cutting the memory Compression reduced the sizes of the open and closed datasets budget by half, the size of flushed components would be- by a factor of 6.2 and 3.8, respectively, as compared to their come 50% smaller. AsterixDB’s default ”prefix-merge pol- uncompressed counterparts. For the inferred dataset, com- icy” [19] could then suffer from higher write-amplification by pression reduced its size only by a factor of 2.1. This in- repeatedly merging smaller on-disk components until their dicates that both the open and closed dataset records in- combined size reaches a certain threshold. To eliminate curred higher storage overhead from storing redundant off- those factors, we also evaluated the performance of bulk- sets for nested fixed-length values (readings objects). As in loading, which builds a single on-disk component for the the Twitter and WoS datasets, the sizes of both the com- loaded dataset. (We evaluated the performance of bulk- pressed open dataset in AsterixDB and the compressed col- loading into open, closed and inferred datasets using the lection in MongoDB were comparable in the Sensors dataset. WoS dataset.) 
Data Feed (Insert-only). To evaluate the performance of continuous data ingestion, we measured the time to ingest the Twitter dataset using a data feed to emulate Twitter's firehose. We set the maximum mergeable component size to 1GB and the maximum tolerable number of components to 5, after which the tree manager triggers a merge operation.
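A sketch of this setup, assuming the prefix merge-policy parameter names and the socket feed adapter documented for AsterixDB (the dataset, type, and socket address here are illustrative):

  -- 1GB maximum mergeable component size; merge once 5 components accumulate.
  CREATE DATASET Tweets(TweetType)
    PRIMARY KEY id
    WITH {
      "merge-policy": {
        "name": "prefix",
        "parameters": { "max-mergable-component-size": 1073741824,
                        "max-tolerance-component-count": 5 }
      }
    };

  CREATE FEED TweetFeed WITH {
    "adapter-name": "socket_adapter",
    "sockets": "127.0.0.1:10001",
    "address-type": "IP",
    "type-name": "TweetType",
    "format": "adm"
  };

  CONNECT FEED TweetFeed TO DATASET Tweets;
  START FEED TweetFeed;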

[Figure 17: Data ingestion time. Panels: (a) Twitter dataset — feed, (b) Twitter with 50% updates, (c) WoS dataset — bulkload.]

Figure 17a shows the time needed to complete the data ingestion for the 200GB Twitter dataset. Ingesting records into the inferred dataset took less time than ingesting into the open and closed datasets. Two factors played a role in the data ingestion rate. First, we observed that the record construction cost of the system's current ADM physical format was higher than that of the vector-based format by ∼40%. Due to its recursive nature, the ADM physical format requires copying the values of the child to the parent from the leaf to the root of the record, which means multiple memory copy operations for the same value. Closed records took even more time due to enforcing integrity constraints such as the presence and types of non-nullable fields. The second factor was the IO cost of the flush operation. We noticed that the inferred dataset's flushed on-disk components are ∼50% and ∼25% smaller than those of the open and closed datasets, respectively. This is due to the fact that compacted records in the vector-based format were smaller in size than the closed and open records in the ADM format (see Figure 16a). Thus, the cost of writing the larger LSM components of both the open and closed datasets was higher.

The ingestion rates for the SATA SSD and the NVMe SSD were comparable, as both were actually bottlenecked by flushing transaction log records to the disk. Enabling compression had a slight negative impact on the ingestion rate for each format due to the additional CPU cost.

Data Feed (50% Updates). As explained in Section 3.2.2, updates require point lookups to maintain the schema, which can negatively impact the data ingestion rate. We evaluated the ingestion performance for an update-intensive workload when the tuple compactor is enabled. In this experiment, we randomly updated 50% of the previously ingested records by either adding or removing fields or changing existing value types. The updates followed a uniform distribution, where all records are updated equally. We created a primary key index, as suggested in [28, 29], to reduce the cost of point lookups of non-existent (new) keys. Figure 17b shows the ingestion time of the Twitter dataset, using the NVMe SSD drive, for the open, closed and inferred datasets with updates. The ingestion times for both the open and closed datasets were about the same as with no updates (Figure 17a). For the inferred dataset, the ingestion time with updates took ∼27% and ∼23% more time for the uncompressed and compressed datasets, respectively, compared to the one with no updates. The ingestion times of the inferred and open datasets were comparable, and both took less time than the closed dataset.
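Concretely, the primary key index uses AsterixDB's CREATE PRIMARY INDEX statement, and each update in this workload has the effect of an upsert on an already-ingested key. The sketch below is illustrative only; the field names and values are hypothetical, and in the experiment the updates arrive through the feed rather than as ad-hoc statements:

  -- Reduces the cost of point lookups for keys that do not exist yet [28, 29].
  CREATE PRIMARY INDEX tweets_pk_idx ON Tweets;

  -- Re-ingesting an existing id replaces the old record; this hypothetical
  -- update changes the type of "coordinates" to a string and drops "lang".
  UPSERT INTO Tweets (
    { "id": 42,
      "text": "updated tweet",
      "coordinates": "24.0,-56.12" }
  );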
Bulk-load. As mentioned earlier, continuous data ingestion is sensitive to LSM configurations such as the allocated memory budget for in-memory components. Additionally, the sizes of flushed on-disk components are smaller for the inferred and closed datasets, as they have smaller storage overhead than the open dataset (as we saw while ingesting the Twitter dataset). Smaller on-disk components may trigger more merge operations to reach the maximum mergeable component size. To eliminate those factors, we also evaluated the time it takes AsterixDB to bulk-load the WoS dataset. When loading a dataset, AsterixDB sorts the records and then builds a single on-disk component of the B+-tree in a bottom-up fashion. The tuple compactor infers the schema and compacts the records during this process. When loading finishes, the single on-disk component has a single inferred schema for the entire set of records.

Figure 17c shows the time needed to load the WoS dataset into the open, closed, and inferred datasets. As for continuous data ingestion, the lower per-record construction cost of the vector-based format was the main contributor to the performance gain for the inferred dataset. We observed that the cost of the sort was relatively the same for the three schema datasets. However, the cost of building the B+-tree was higher for both the open and closed schema datasets due to their higher storage overheads (Figure 16b).

As loading a dataset in AsterixDB does not involve maintaining transaction logs, the higher throughput of the NVMe SSD was noticeable here compared to continuous data ingestion. When compression is enabled, the SATA SSD slightly benefited from the lower IO cost; however, the faster NVMe SSD was negatively impacted by the compression due to its CPU cost.
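A minimal sketch of such a bulk-load, assuming AsterixDB's local-filesystem adapter and a hypothetical path to the WoS records in ADM form:

  -- Sorts the records and builds a single on-disk B+-tree component;
  -- the tuple compactor infers the schema and compacts records as it builds.
  LOAD DATASET Publications USING localfs
    (("path" = "127.0.0.1:///data/wos/publications.adm"),
     ("format" = "adm"));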

4.4 Query Performance

We next evaluated the impact of our work on query performance by running analytical queries against the ingested Twitter, WoS, and Sensors datasets. The objective of our experiments is to evaluate the IO cost of querying against the open, closed, and inferred datasets. Each executed query was repeated six times, and we report the average execution time of the last five runs.

4.4.1 Twitter Dataset

We ran four queries (listed in Appendix A.1) against the Twitter dataset which retrieve:

Q1. The number of records in the dataset — COUNT(*).
Q2. The top ten users whose tweets' average length is the largest — GROUP BY/ORDER BY.
Q3. The top ten users who have the largest number of tweets that contain a popular hashtag — EXISTS/GROUP BY/ORDER BY.
Q4. All records of the dataset ordered by the tweets' posting timestamps — SELECT */ORDER BY. (For Q4, we report only the time for executing the query, excluding the time for actually retrieving the final formatted query result.)

Figure 18 shows the execution time for the four queries in the three datasets (open, closed, and inferred) when the data is on the SATA SSD drive and the NVMe SSD drive. On the SATA SSD, the execution times of the four queries, with and without compression, correlated with their on-disk sizes from Figure 16a. This correlation indicates that the IO cost dominates the execution time. However, on the NVMe SSD drive, the CPU cost becomes more evident, especially when page-level compression is enabled. For Q2 and Q4, the ∼2X reduction in storage after compression reduced their execution times in the SATA case in all three datasets. However, the execution times for Q2 and Q4 in the closed and inferred datasets did not improve as much after compression in the NVMe case, as the CPU became the bottleneck there. Q3, which filters out all records that do not contain the required hashtag, took less time to execute in the inferred dataset. This is due to the way that nested values of records in the vector-based format are accessed. In the Twitter dataset, hashtags are modeled as an array of objects; each object contains the hashtag text and its position in the tweet's text. We consolidate field access expressions for records in the vector-based format (as discussed in Section 3.4.2), and the query optimizer was able to push the consolidated field access through the unnest operation and extract only the hashtag text instead of the hashtag objects. Consequently, Q3's intermediate result size was smaller in the inferred dataset compared to the other two datasets, and executing Q3 against the inferred dataset was faster. This experiment shows that our schema inference and tuple compaction approach can match (or even improve, in some cases) the performance of querying datasets with fully declared schemas — without a need for pre-declaration.

[Figure 18: Query execution time for the Twitter dataset.]

4.4.2 WoS Dataset

We also ran four queries (listed in Appendix A.2) against the WoS dataset:

Q1. The number of records in the dataset — COUNT(*).
Q2. The top ten scientific fields with the highest number of publications — GROUP BY/ORDER BY.
Q3. The top ten countries that co-published the most with US-based institutes — UNNEST/EXISTS/GROUP BY/ORDER BY.
Q4. The top ten pairs of countries with the largest number of co-published articles — UNNEST/GROUP BY/ORDER BY.

As Figure 19 illustrates, the execution times for Q1 and Q2 are correlated with the storage sizes of the three datasets (Figure 16b). For Q3 and Q4, the execution times were substantially higher in the open and closed datasets as compared to the inferred dataset. Similar to Q3 in the Twitter dataset, field access expression consolidation and pushdown were beneficial. Even after enabling compression, the open and closed schema execution times for Q3 and Q4 remained about the same despite the storage savings. (We will evaluate that behavior in more detail in Section 4.4.4.)

[Figure 19: Query execution time for the WoS dataset.]

4.4.3 Sensors Dataset

We again ran four queries (listed in Appendix A.3):

Q1. The number of records in the dataset — COUNT(*).
Q2. The minimum and maximum reading values that were ever recorded across all sensors — UNNEST/GROUP BY.
Q3. The IDs of the top ten sensors that have recorded the highest average reading value — UNNEST/GROUP BY/ORDER BY.
Q4. Similar to Q3, but looks only at the readings recorded in a given day — WHERE/UNNEST/GROUP BY/ORDER BY.

The execution times are shown in Figure 20. The execution times for Q1 on both the uncompressed and compressed datasets correlate with the storage sizes of the datasets from Figure 16c. Q2 and Q3 exhibit the effect of consolidating and pushing down field value accesses of the vector-based format, where both queries took significantly less time to execute in the inferred dataset. However, pushing the field access down is not always advantageous. When compression is enabled, the execution time of Q4 for the inferred dataset using the NVMe SSD was the slowest. This is because the consolidated field accesses (of sensor ID, reading, and reporting timestamp) are evaluated before filtering using a highly selective predicate (0.001%). In the open and closed datasets, delaying the evaluation of field accesses until after the filter for Q4 was beneficial. However, the execution times for the inferred dataset were comparable to the open case.

[Figure 20: Query execution time for the Sensors dataset.]

4.4.4 Impact of the Vector-based Optimizations

Breakdown of the storage savings. As we showed in our experiments, the time it takes to ingest and query records in the vector-based format (inferred) was smaller even when the schema is fully declared for the ADM format (closed). This is due to the fact that the vector-based format encodes nested values more efficiently, using only the type tags (as in Section 3.3.1). To measure the impact of the newly proposed format, we reevaluate the storage size of the vector-based format without inferring the schema or compacting the records (i.e., a schema-less version using the vector-based format), which we refer to as SL-VB.

In Figure 21a, we see the total sizes of the four datasets (open, closed, inferred, and SL-VB) after ingesting the Twitter dataset. We see that the SL-VB dataset is smaller than the open dataset but slightly larger than the closed one. More specifically, about half of the storage savings in the inferred dataset (compared to the open dataset) comes from the more efficient encoding of nested values in the vector-based format, and the other half comes from compacting the records. For the Sensors dataset, Figure 21b shows a similar pattern; however, the SL-VB Sensors dataset is smaller than the closed dataset for the reasons explained in Section 4.2.

[Figure 21: Impact of the vector-based format on storage. Panels: (a) Twitter, (b) Sensors.]

Linear-time field access. Accessing values in the vector-based format is sensitive to the position of the requested value. For instance, accessing a value that appears first in a record is faster than accessing a value that resides at the end. To measure the impact of linear access in the vector-based format, we ran four queries against the Twitter dataset (using the NVMe SSD drive), where each counts the number of appearances of a value. The positions (or indexes) of those values in the vector-based format are 1, 34, 68, and 136 for Q1, Q2, Q3 and Q4, respectively, where position 1 means the first value in the record and position 136 is the last. Figure 22a shows the time needed to execute the queries. In the inferred dataset, the position of the requested value affected the query times, where Q1 was the fastest and Q4 was the slowest. For the open and closed datasets, the execution times for all queries were about the same. However, all four queries took less time to execute in the inferred case, due to the storage savings. When all the data fits in memory, the CPU cost becomes more apparent, as shown in Figure 22b. In the case of a single core, the vector-based format was the slowest to execute Q3 and Q4. When using all 8 cores, the execution times for all queries were about the same for the three datasets.

[Figure 22: Impact of linear-time field access in the vector-based format. Panels: (a) Large: 200GB, (b) Small: 5GB.]
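Each of the four probe queries used for Figure 22 has the shape sketched below; the field name and constant are hypothetical, and only the position of the accessed field within the record differs across Q1-Q4:

  -- Counts how many records contain a given value in one specific field.
  SELECT VALUE count(*)
  FROM Tweets t
  WHERE t.lang = "en";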
Field-access consolidation and pushdown. Also in our experiments, we showed that our optimizations of consolidating and pushing down field access expressions can tremendously improve query execution time. To isolate the factors that contributed to the performance gains, we reevaluated the execution times for Q2-Q4 of the Sensors dataset with and without these optimizations.

The execution times of the queries are shown in Figure 23. We refer to Inferred (un-op) as querying the inferred dataset without our optimization of consolidating and pushing down field access expressions. When we disable our optimizations, the linear-time field accesses of the vector-based format are performed as many times as there are field access expressions in the query. For instance, Q3 has three field access expressions to get the (i) sensor ID, (ii) readings array, and (iii) reporting timestamp. Each field access requires scanning the record's vectors, which is expensive. Additionally, the sizes of the intermediate results of Q2 and Q3 were then larger (an array of objects vs. an array of doubles). As a result, Q2 and Q3 took twice as much time to finish for Inferred (un-op). Q2 is still faster to execute in the Inferred (un-op) case than in the closed case, whereas Q3 took slightly more time to execute. Finally, disabling our optimizations improved the execution time for Q4 on the NVMe SSD, as delaying the evaluation of field accesses can be beneficial for queries with highly selective predicates.

[Figure 23: Impact of consolidating and pushing down field access expressions.]

Vector-based format vs. others. Other formats, such as Apache Avro, Apache Thrift, and Protocol Buffers, also exploit schemas to store semi-structured data more efficiently. In fact, providing a schema is not optional for writing records in such formats — as opposed to the vector-based format, where the schema is optional. Nonetheless, we compared the vector-based format to Apache Avro, Apache Thrift (using both the Binary Protocol (BP) and the Compact Protocol (CP)), and Protocol Buffers to evaluate 1) the storage size and 2) the time needed to construct the records in each format, using 52MB of the Twitter dataset.

Table 2 summarizes the results of our experiment. We see that the storage sizes of the different formats were mostly comparable. In terms of the time needed to construct the records, Apache Thrift (for both protocols) took the least construction time, followed by the vector-based format. Apache Avro and Protocol Buffers took 1.9x and 2.9x more time, respectively, to construct the records compared to the vector-based format.

Table 2: Writing 52MB of Tweets in different formats

                  Space (MB)    Time (msec)
  Avro            27.49         954.90
  Thrift (BP)     34.30         341.05
  Thrift (CP)     25.87         370.93
  ProtoBuf        27.16         1409.13
  Vector-based    29.49         485.48

4.4.5 Secondary Index Query Performance

Pirzadeh et al. [33] previously showed that pre-declaring the schema in AsterixDB did not notably improve the performance of range queries with highly selective predicates in the presence of a secondary index. In this experiment, we evaluated the impact of having a secondary index using the Twitter dataset.

We modified the scaled Twitter dataset by generating monotonically increasing values for the attribute timestamp to mimic the time at which users post their tweets. We created a secondary index on this generated timestamp attribute and ran multiple range queries with different selectivities. For each query selectivity, we executed queries with different range predicates to warm up the system's cache and report the average stable execution time.
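The index and one representative range probe can be sketched as follows; the index name and the timestamp bounds are illustrative, and the statement assumes timestamp is a declared attribute of the dataset's type:

  CREATE INDEX tweetTimeIdx ON Tweets(timestamp) TYPE BTREE;

  -- A range predicate whose selectivity is varied across runs.
  SELECT VALUE count(*)
  FROM Tweets t
  WHERE t.timestamp >= 1556496000000 AND t.timestamp < 1556497000000;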
Figures 24a and 24b show the execution times for queries with both low and high selectivity predicates, using the NVMe SSD, for the uncompressed datasets. The execution times for all queries correlated with the storage sizes (Figure 16a), where the closed and inferred datasets have lower storage overhead compared to the open dataset. The execution times for the compressed datasets (Figures 24c and 24d) showed similar relative behavior.

[Figure 24: Query with secondary index (NVMe). Panels: (a) low selectivity and (b) high selectivity, uncompressed; (c) low selectivity and (d) high selectivity, compressed.]

4.5 Scale-out Experiment

Finally, to evaluate the scalability of our approach, we conducted a scale-out experiment using a cluster of Amazon EC2 instances of type c5d.2xlarge (each with 16GB of memory and 8 virtual cores). We evaluate the ingestion and query performance of the Twitter dataset using clusters with 4, 8, 16 and 32 nodes. We configured each node with 10GB of total memory, with 6GB for the buffer cache and 1GB for the in-memory component budget; the remaining 3GB is allocated for working buffers. We used the instances' ephemeral storage to store the ingested data. Due to the limited storage space of a c5d.2xlarge instance (200GB), we only evaluate the performance on compressed datasets.

Figure 25a shows the total on-disk size after ingesting the Twitter data into the open, closed and inferred datasets. The raw sizes of the ingested data were 400, 800, 1600 and 3200 GB for the 4-, 8-, 16- and 32-node clusters, respectively. Figure 25b shows the time taken to ingest the Twitter data into the three datasets. As expected, we observe the same trends seen for the single-node cluster (see Figure 16a and Figure 17a), where the inferred dataset has the lowest storage overhead and the highest data ingestion rate.

[Figure 25: Storage and ingestion performance (scale-out). Panels: (a) on-disk size, (b) ingestion time.]

To evaluate query performance, we ran the same four Twitter queries as in Section 4.4.1. Figure 26 shows the execution times for the queries against the open, closed and inferred datasets. All four queries scaled linearly, as expected, and all four queries were faster in the inferred dataset. Since the data is shuffled in Q2 and Q3 to perform the parallel aggregation, each partition broadcasts its schema to the other nodes in the cluster (Section 3.4.1) at the start of a query. However, the performance of the queries was essentially unaffected, and the queries were still faster to execute in the inferred dataset.

[Figure 26: Query performance (scale-out).]

5. RELATED WORK

Schema inference for self-describing, semi-structured data has appeared in early work for the Object Exchange Model (OEM) and later for XML and JSON documents. For OEM (and later for XML, after the Lore project changed from OEM to the XML data model), [26] presented the concept of a dataguide, which is a summary structure for schema-less semi-structured documents. A dataguide can be accompanied by summaries and samples (annotations) about the data's values, which we also use in our schema structure to keep the number of occurrences of each value. In [36], Wang et al. present an efficient framework for extracting, managing and querying schema views of JSON datasets. Their work targeted data exploration, where showing a frequently appearing structure can be good enough. However, in our work, the purpose of inferring the schema is to use it for compacting and querying the records, so we infer the exact schema of the ingested dataset. In another work [25], the authors detail an approach for automatically inferring and generating a normalized (flat) schema for JSON-like datasets, which can then be utilized in an RDBMS to store the data. Our work here is orthogonal; we target document store systems with LSM-based storage engines.

Creating secondary indexes is related to declaring attributes in a schema-less document store. Systems such as Azure DocumentDB [34] and MongoDB support indexing all fields at once without declaring the indexed fields explicitly. For instance, MongoDB allows users to create an index on all fields using a wildcard index, initially introduced in v4.2. Doing so requires the system to "infer" the fields of a collection a priori. Despite the similarities, our objective is different: in our work, we infer the schema to reduce storage overhead by compacting the self-describing records residing in the primary index.

Semantically compacting self-describing, semi-structured records using schemas appears in popular big data systems such as Apache Spark [37] and Apache Drill [5]. For instance, Apache Drill uses schemas of JSON datasets (provided by the user or inferred by scanning the data) to transform records into a compacted in-memory columnar format (Apache Arrow [1]). File formats such as Apache Parquet [6] (or Google Dremel [30]) use the provided schema to store nested data in a columnar format to achieve higher compressibility. An earlier effort to semantically compact and query XML data is presented in [20, 27]. Our work is different in targeting more "row"-oriented document stores with LSM-based storage engines. Also, we support data values with heterogeneous types, in contrast to formats such as Parquet.

Exploiting LSM lifecycle events to piggyback other operations to improve the query execution time is not new by itself and has been proposed in several contexts [15, 17, 35]. LSM-backed operations can be categorized as either non-transformative operations, such as computing information about the ingested data, or transformative operations, e.g., in which the records are transformed into a read-optimized format. An example of a non-transformative operation is [17], which shows how to utilize LSM flush and merge operations to compute range filters that can accelerate time-correlated queries by skipping on-disk components that do not satisfy the filter predicate. [15] proposes a lightweight statistics collection framework that utilizes LSM lifecycle events to compute statistical summaries of ingested data that the query optimizer can use for cardinality estimation. An example of a transformative operation is [35], which utilizes LSM-like operations to transform records in the writeable-store into a read-optimized format for the readable-store.

6. CONCLUSION AND FUTURE WORK

In this paper, we introduced a tuple compaction framework that addresses the overhead of storing self-describing records in LSM-based document store systems. Our framework utilizes the flush operations of LSM-based engines to infer the schema and compact the ingested records without sacrificing the flexibility of schema-less document store systems. We also addressed the complexities of adopting such a framework in a distributed setting, where multiple nodes run independently, without requiring synchronization. We further introduced the vector-based record format, a compaction-friendly format for semi-structured data. Experiments showed that our tuple compactor is able to reduce the storage overhead significantly and improve the query performance of AsterixDB. Moreover, it achieves this without impacting data ingestion performance; in fact, the tuple compactor and vector-based record format can actually improve the data ingestion performance of insert-heavy workloads. In addition to this semantic approach, we also added support for a syntactic approach using page-level compression in AsterixDB. With both approaches combined, we were able to reduce the total storage size by up to 9.8x and improve query performance by the same factor.

Our tuple compactor framework targets LSM-backed row-oriented document-store systems. We plan to extend this work to introduce a schema-adaptive column-oriented document store. First, we plan to explore the viability of adopting the PAX [16] page format, which could potentially eliminate the CPU cost of the linear access time of the vector-based format. In a second direction, we want to explore ideas from popular static columnar file formats (such as Apache Parquet and Apache CarbonData [4]) to build an LSM-ified version of columnar indexes for self-describing, semi-structured data.

Acknowledgements. This work was supported by a graduate fellowship from KACST. It was also supported by NSF awards IIS-1838248 and CNS-1925610, industrial support from Amazon, Google, Microsoft and Couchbase, and the Donald Bren Foundation (via a Bren Chair).

7. REFERENCES
[1] Apache Arrow. https://arrow.apache.org.
[2] AsterixDB Documentation. https://ci.apache.org/projects/asterixdb/index.html.
[3] AsterixDB Object Serialization Reference. https://cwiki.apache.org/confluence/display/ASTERIXDB/AsterixDB+Object+Serialization+Reference.
[4] Apache CarbonData. https://carbondata.apache.org.
[5] Apache Drill. https://drill.apache.org.
[6] Apache Parquet. https://parquet.apache.org.
[7] Binary JSON: BSON specification. http://bsonspec.org/.
[8] CADRE: Collaborative Archive Data Research Environment. http://iuni.iu.edu/resources/cadre.
[9] Clarivate Web of Science. https://clarivate.com/products/web-of-science.
[10] Couchbase. https://couchbase.com.
[11] MongoDB. https://www.mongodb.com.
[12] Snappy. http://google.github.io/snappy/.
[13] Twitter API Documentation. https://developer.twitter.com/en/docs.html.
[14] xml-to-json: Library and command line tool for converting XML files to JSON. http://hackage.haskell.org/package/xml-to-json.
[15] I. Absalyamov, M. J. Carey, and V. J. Tsotras. Lightweight cardinality estimation in LSM-based systems. In ACM SIGMOD, pages 841–855, 2018.
[16] A. Ailamaki, D. J. DeWitt, and M. D. Hill. Data page layouts for relational databases on deep memory hierarchies. The VLDB Journal, 11(3):198–215, 2002.
[17] S. Alsubaiee, M. J. Carey, and C. Li. LSM-based storage and indexing: An old idea with timely benefits. In Second International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, pages 1–6, 2015.
[18] S. Alsubaiee et al. AsterixDB: A scalable, open source BDMS. PVLDB, 7(14), 2014.
[19] S. Alsubaiee et al. Storage management in AsterixDB. PVLDB, 7(10), 2014.
[20] A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I. Manolescu, and A. Pugliese. Efficient query evaluation over compressed XML data. In EDBT, pages 200–218, 2004.
[21] V. Borkar et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, 2011.
[22] V. Borkar et al. Algebricks: A data model-agnostic compiler backend for big data languages. In SoCC, 2015.
[23] M. J. Carey. AsterixDB mid-flight: A case study in building systems in academia. In ICDE, pages 1–12, 2019.
[24] D. Chamberlin. SQL++ for SQL Users: A Tutorial. Couchbase, Inc., 2018. (Available at Amazon.com.)
[25] M. DiScala and D. J. Abadi. Automatic generation of normalized relational schemas from nested key-value data. In ACM SIGMOD, pages 295–310, 2016.
[26] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In VLDB, pages 436–445, 1997.
[27] H. Liefke and D. Suciu. XMill: An efficient compressor for XML data. In ACM SIGMOD, pages 153–164, 2000.
[28] C. Luo and M. J. Carey. LSM-based storage techniques: A survey. arXiv preprint arXiv:1812.07527, 2018.
[29] C. Luo and M. J. Carey. Efficient data ingestion and query processing for LSM-based storage systems. PVLDB, 12(5):531–543, 2019.
[30] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1-2):330–339, 2010.
[31] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[32] K. W. Ong, Y. Papakonstantinou, and R. Vernoux. The SQL++ query language: Configurable, unifying and semi-structured. arXiv preprint arXiv:1405.3631, 2014.
[33] P. Pirzadeh, M. J. Carey, and T. Westmann. BigFUN: A performance study of big data management system functionality. In 2015 IEEE International Conference on Big Data (Big Data), pages 507–514. IEEE, 2015.
[34] D. Shukla, S. Thota, K. Raman, M. Gajendran, A. Shah, S. Ziuzin, K. Sundaram, M. G. Guajardo, A. Wawrzyniak, S. Boshra, et al. Schema-agnostic indexing with Azure DocumentDB. PVLDB, 8(12):1668–1679, 2015.
[35] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al. C-store: A column-oriented DBMS. In VLDB, pages 553–564, 2005.
[36] L. Wang, S. Zhang, J. Shi, L. Jiao, O. Hassanzadeh, J. Zou, and C. Wang. Schema management for document stores. PVLDB, 8(9):922–933, 2015.
[37] M. Zaharia et al. Spark: Cluster computing with working sets. In Proc. HotCloud, 2010.

APPENDIX

A. QUERIES

In this section, we show the queries we ran in our experiments against the Twitter, WoS and Sensors datasets.

A.1 Twitter Dataset's Queries

Q1:
  SELECT VALUE count(*)
  FROM Tweets

Q2:
  SELECT VALUE uname
  FROM Tweets t
  GROUP BY t.users.name AS uname
  WITH a AS avg(length(t.text))
  ORDER BY a DESC
  LIMIT 10

Q3:
  SELECT uname, count(*) AS c
  FROM Tweets t
  WHERE (
    SOME ht IN t.entities.hashtags
    SATISFIES lowercase(ht.text) = "jobs"
  )
  GROUP BY user.name AS uname
  ORDER BY c DESC
  LIMIT 10

Q4:
  SELECT *
  FROM Tweets
  ORDER BY timestamp_ms

A.2 WoS Dataset's Queries

Q1:
  SELECT VALUE count(*)
  FROM Publications AS t

Q2:
  SELECT v, count(*) AS cnt
  FROM Publications AS t,
       t.static_data.fullrecord_metadata
        .category_info.subjects.subject AS subject
  WHERE subject.ascatype = "extended"
  GROUP BY subject.`value` AS v
  ORDER BY cnt DESC
  LIMIT 10

Q3:
  SELECT country, count(*) AS cnt
  FROM (
    SELECT VALUE distinct_countries
    FROM Publications AS t
    LET address = t.static_data
                   .fullrecord_metadata
                   .addresses.address_name,
        countries = (
          SELECT DISTINCT VALUE a.address_spec.country
          FROM address AS a
        )
    WHERE is_array(address)
      AND array_count(countries) > 1
      AND array_contains(countries, "USA")
  ) AS collaborators
  UNNEST collaborators AS country
  WHERE country != "USA"
  GROUP BY country
  ORDER BY cnt DESC
  LIMIT 10

Q4:
  SELECT pair, count(*) AS cnt
  FROM (
    SELECT VALUE country_pairs
    FROM Publications AS t
    LET address = t.static_data
                   .fullrecord_metadata
                   .addresses.address_name,
        countries = (
          SELECT DISTINCT VALUE a.address_spec.country
          FROM address AS a
          ORDER BY a.address_spec.country
        ),
        country_pairs = (
          SELECT VALUE [countries[x], countries[y]]
          FROM range(0, array_count(countries) - 1) AS x,
               range(x + 1, array_count(countries) - 1) AS y
        )
    WHERE is_array(address)
      AND array_count(countries) > 1
  ) AS country_pairs
  UNNEST country_pairs AS pair
  GROUP BY pair
  ORDER BY cnt DESC
  LIMIT 10

A.3 Sensors Dataset's Queries

Q1:
  SELECT count(*)
  FROM Sensors s, s.readings r

Q2:
  SELECT max(r.temp), min(r.temp)
  FROM Sensors s, s.readings r

Q3:
  SELECT sid, avg_temp
  FROM Sensors s, s.readings AS r
  GROUP BY s.sensor_id AS sid
  WITH avg_temp AS AVG(r.temp)
  ORDER BY avg_temp DESC
  LIMIT 10

Q4:
  SELECT sid, avg_temp
  FROM Sensors s, s.readings AS r
  WHERE s.report_time > 1556496000000
    AND s.report_time < 1556496000000 + 24 * 60 * 60 * 1000
  GROUP BY s.sensor_id AS sid
  WITH avg_temp AS AVG(r.temp)
  ORDER BY avg_temp DESC
  LIMIT 10

[Figure 27: A record in a vector-based format (another example). Panels: (a) type tag vector, (b) fixed-length values' vector, (c) variable-length values' vector, (d) field names' vector; vectors (b)-(d) store lengths and values.]

B. VECTOR-BASED FORMAT: ADDITIONAL EXAMPLE

In Section 3.3.1, we showed an example of a record in the vector-based format. However, that example may not clearly illustrate the structure of a record with complex nested values. In this section, we walk through another example of how to interpret a record in the vector-based format with more nested values.

  {
    "id": 1,
    "name": "Ann",
    "dependents": {{
      {"name": "Bob", "age": 6},
      {"name": "Carol", "age": 10},
      "Not_Available"
    }},
    "employment_date": date("2018-09-20"),
    "branch_location": point(24.0, -56.12)
  }

Figure 28: JSON document with more nested values

Figure 27 shows the structure of the JSON document of Figure 28 in the vector-based format. Starting with the header (not shown), we determine the four vectors, namely (i) the type tag vector (Figure 27a), (ii) the fixed-length values' vector (Figure 27b), (iii) the variable-length values' vector (Figure 27c), and finally (iv) the field names' vector (Figure 27d).

After processing the header, we start by reading the first tag (object), which determines the root type. As explained in Section 3.3.1, the tags of nested values (i.e., object, array, and multiset) and control tags (i.e., object, array, multiset, and EOV) are neither fixed- nor variable-length values. We continue to the next tag, which is (int). Since int is a fixed-length value, we know it is stored in the first four bytes (assuming int is a 4-byte int). Also, because it is preceded by the nested tag (object), we know it is a field of this object, and thus its field name corresponds to the first field name id in the field names' vector. Next, we see the tag (string). As it is the first variable-length value, the length (i.e., 3) and the value "Ann" in the variable-length vector belong to this string value.

Following the string, we get the tag (multiset), which is a nested value and a child of the root object type. Therefore, the third field name (length: 10, value: dependents) corresponds to the multiset value. We see in Figure 28 that the field dependents is a multiset of three elements of types (object), (object) and (string). Thus, the following tag (object) corresponds to the first element of the multiset. As it is a child of a multiset, this object value does not have a field name. The next two tags are of type (string) and (int), which correspond to the field names name and age, respectively, as they are children of the preceding (object) tag. The following tag (multiset) marks the end of the current nesting type (i.e., object), and we are back in the enclosing nesting type (i.e., multiset). The following tag (object) marks the beginning of the second element of the multiset (see Figure 28), and we process it as we did for the first element. After the second control tag (multiset), we get the tag (string), which is the type of the third and last element of our multiset. The following control tag (object) tells us that it is the end of the multiset and that we are going back to the root object type.

The next two tags (date) and (point) are the last two children of the root object type, and they have the last two field names employment_date and branch_location, respectively. Finally, the control tag (EOV) marks the end of the record.
