Clotho: Decoupling Memory Page Layout from Storage Organization
Minglong Shao, Jiri Schindler*, Steven W. Schlosser†, Anastassia Ailamaki, Gregory R. Ganger
Carnegie Mellon University

* Now with EMC Corporation
† Now with Intel Corporation

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

Abstract

As database application performance depends on the utilization of the memory hierarchy, smart data placement plays a central role in increasing locality and in improving memory utilization. Existing techniques, however, do not optimize accesses to all levels of the memory hierarchy and for all the different workloads, because each storage level uses different technology (cache, memory, disks) and each application accesses data using different patterns. Clotho is a new buffer pool and storage management architecture that decouples in-memory page layout from data organization on non-volatile storage devices to enable independent data layout design at each level of the storage hierarchy. Clotho can maximize cache and memory utilization by (a) transparently using appropriate data layouts in memory and non-volatile storage, and (b) dynamically synthesizing data pages to follow application access patterns at each level as needed. Clotho creates in-memory pages individually tailored for compound and dynamically changing workloads, and enables efficient use of different storage technologies (e.g., disk arrays or MEMS-based storage devices). This paper describes the Clotho design and prototype implementation and evaluates its performance under a variety of workloads using both disk arrays and simulated MEMS-based storage devices.

1 Introduction

Page structure and storage organization have been the subject of numerous studies [1, 3, 6, 9, 10], because they play a central role in database system performance. Research continues as no single data organization serves all needs within all systems. In particular, the access patterns resulting from queries posed by different workloads can vary significantly. One query, for instance, might access all the attributes in a table (full-record access), while another accesses only a subset of them (partial-record access). Full-record accesses are typical in transactional (OLTP) applications where insert and delete statements require the entire record to be read or written, whereas partial-record accesses are often met in decision-support system (DSS) queries. Moreover, when executing compound workloads, one query may access records sequentially while others access the same records "randomly" (e.g., via a non-clustered index). Currently, database storage managers implement a single page layout and storage organization scheme, which is utilized by all applications running thereafter. As a result, in an environment with a variety of workloads, only a subset of query types can be serviced well.

Several data page layout techniques have been proposed in the literature, each targeting a different query type. Notably, the N-ary Storage Model (NSM) [13] stores records consecutively, optimizing for full-record accesses, while penalizing partial-record sequential scans. By contrast, the Decomposition Storage Model (DSM) [7] stores values of each attribute in a separate table, optimizing for partial-record accesses, while penalizing queries that need the entire record. More recently, PAX [1] optimizes cache performance, but not memory utilization. Fractured mirrors [14] reduce DSM's record reconstruction cost by using an optimized structure and scan operators, but need to keep an NSM-organized copy of the database as well to support full-record access queries. None of the previously proposed schemes provides a universally efficient solution, however, because they all make a fundamental assumption that the pages used in main memory must have the same contents as those stored on disk.

This paper proposes Clotho, a buffer pool and storage management architecture that decouples the memory page layout from the non-volatile storage data organization. This decoupling allows memory page contents to be determined dynamically according to queries being served and offers two significant advantages. First, it optimizes storage access and memory utilization by requesting only the data accessed by a given query. Second, it allows new two-dimensional storage mechanisms to be exploited to mitigate the trade-off between the NSM and DSM storage models. Clotho chooses data layouts at each level of the memory hierarchy to match NSM where it performs best, DSM where it performs best, and outperforms both for query mixes and access types in between. It leverages the Atropos logical volume manager [16] to efficiently access two-dimensional data structures stored both in disk arrays and in MEMS-based storage devices (MEMStores) [19, 22].

This paper also describes and evaluates a prototype implementation of Clotho within the Shore database storage manager [4]. Experiments with disk arrays show that, with only a single storage organization, performance of DSS and OLTP workloads is comparable to the page layouts best suited for the respective workload (i.e., DSM and PAX, respectively). Experiments with a simulated MEMStore confirm that similar benefits will be realized with these future devices as well.

The remainder of this paper is organized as follows. Section 2 gives background and related work. Section 3 describes the architecture of a database system that enables decoupling of the in-memory and storage layouts. Section 4 describes the design of a buffer pool manager that supports query-specific in-memory page layout. Section 5 describes the design of a volume manager that allows efficient access when an arbitrary subset of table attributes is needed by a query. Section 6 describes our initial implementation, and Section 7 evaluates this implementation for several database workloads using both a disk array logical volume and a simulated MEMStore.

2 Background and related work

Conventional relational database systems store data in fixed-size pages (typically 4 to 64 KB). To access individual records of a relation (table) requested by a query, a scan operator of a database system accesses main memory. Before accessing data, a page must first be fetched from non-volatile storage (e.g., a logical volume of a disk array) into main memory. Hence, a page is the basic allocation and access unit for non-volatile storage. A database storage manager facilitates this access and sends requests to a storage device to fetch the necessary blocks.

A single page contains a header describing what records are contained within and how they are laid out. In order to retrieve data requested by a query, a scan operator must understand the page layout (a.k.a. storage model). Since the page layout determines what records and which attributes of a relation are stored in a single page, the storage model employed by a database system has far-reaching implications on the performance of a particular workload [2].

The page layout prevalent in commercial database systems, called the N-ary storage model (NSM), is optimized for queries with full-record access common in an on-line transaction processing (OLTP) workload. NSM stores all attributes of a relation in a single page [13] and full records are stored within a page one after another. Accessing a full record is accomplished by accessing a particular record from consecutive memory locations. Using an unwritten rule that access to consecutive logical blocks (LBNs) in the storage device is more efficient than random access, a storage manager maps a single page to consecutive LBNs. Thus, an entire page can be accessed by a single I/O request.

An alternative page layout, called the Decomposition Storage Model (DSM) [7], is optimized for decision support systems (DSS) workloads. Since DSS queries typically access a small number of attributes and most of the data in the page is not touched in memory by the scan operator, DSM stores only one attribute per page. To ensure efficient storage device access, a storage manager maps DSM pages with consecutive records containing the same attribute into extents of contiguous LBNs. In anticipation of a sequential scan through records stored in multiple pages, a storage manager can prefetch all pages in one extent with a single large I/O, which is more efficient than accessing each page individually by a separate I/O.

A page layout optimized for CPU cache performance, called PAX [1], offers good CPU-memory performance for both individual attribute scans of DSS queries and full-record accesses in OLTP workloads. The PAX layout partitions data into separate minipages. A single minipage contains data of only one attribute and occupies consecutive memory locations. Collectively, a single page contains all attributes for a given set of records. Scanning individual attributes in PAX accesses consecutive memory locations and thus can take advantage of cache-line prefetch logic. With proper alignment to cache-line sizes, a single cache miss can effectively prefetch data for several records, amortizing the high latency of memory access compared to cache access. However, PAX does not address memory-storage performance.

All of the described storage models share the same characteristics. They (i) are highly optimized for one workload type, (ii) focus predominantly on one level of the memory hierarchy, (iii) use a static data layout that is determined a priori when the relation is created, and (iv) apply the same layout across all levels of the memory hierarchy, even though each level has unique (and very different) characteristics. As a consequence, there are inherent performance trade-offs for each layout that arise when a workload changes. For example, NSM or PAX layouts waste memory capacity and storage device bandwidth for DSS workloads, since most data within a page is never touched. Similarly, a DSM layout is inefficient for OLTP queries accessing random full records. To reconstruct a full record with n attributes, n pages must be fetched and n - 1 joins on record identifiers performed to assemble the full record. In addition to wasting memory capacity and storage bandwidth, this access is inefficient at the storage device level; accessing these pages results in random one-page I/Os. In summary, each page layout exhibits good performance for a specific type of access at a specific level of the memory hierarchy, as shown in Table 1.

Data            Cache-Memory      Memory-Storage
Organization    OLTP    DSS       OLTP    DSS
NSM             √       ×         √       ×
DSM             ×       √         ×       √
PAX             √       √         √       ×

Table 1: Summary of performance with current page layouts.

Several researchers have proposed solutions to address these performance trade-offs. Ramamurthy et al. proposed fractured mirrors that store data in both NSM and DSM layouts [14] to eliminate the need to reload and reorganize data when access patterns change. Based on the workload type, a database system can choose the appropriate data organization. Unfortunately, this approach doubles the required storage space and complicates data management; two physically different layouts must be maintained in synchrony to preserve data integrity.
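
The layout trade-offs described in this section (one page for a full record under NSM or PAX, n pages and n - 1 joins under DSM, and wasted bandwidth when NSM or PAX pages are scanned for a few attributes) can be made concrete with a small cost model. This is only an illustrative sketch: the function names are hypothetical, and the assumption that all attributes have equal size is ours, not the paper's.

```python
def full_record_fetch(layout: str, n_attrs: int) -> tuple[int, int]:
    """Return (pages fetched, record-id joins) needed to assemble one full record."""
    if layout in ("NSM", "PAX"):
        # All attributes of a record are stored in the same page.
        return 1, 0
    if layout == "DSM":
        # One page per attribute, stitched together with n - 1 joins.
        return n_attrs, n_attrs - 1
    raise ValueError(f"unknown layout: {layout}")


def scan_useful_fraction(layout: str, n_attrs: int, k_needed: int) -> float:
    """Fraction of bytes fetched from storage that a scan of k_needed attributes uses.

    Assumption (ours): every attribute has the same size.
    """
    if not 0 < k_needed <= n_attrs:
        raise ValueError("k_needed must be between 1 and n_attrs")
    if layout in ("NSM", "PAX"):
        # Whole pages are fetched, so unneeded attributes come along.
        return k_needed / n_attrs
    if layout == "DSM":
        # Only pages of the requested attributes are fetched.
        return 1.0
    raise ValueError(f"unknown layout: {layout}")


if __name__ == "__main__":
    for layout in ("NSM", "DSM", "PAX"):
        print(layout, full_record_fetch(layout, 8), scan_useful_fraction(layout, 8, 2))
```

Under these assumptions, no single row of the model is best for both the full-record and the partial-scan case, which is the gap the remainder of the paper addresses.
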
Hankins and Patel [9] proposed data morphing as a technique to reorganize data within individual pages based on the needs of workloads that change over time. Since morphing takes place within memory pages that are then stored in that format on the storage device, these fine-grained changes cannot address the trade-offs involved in accessing non-volatile storage. The multi-resolution block storage model (MBSM) [23] groups DSM table pages together into superpages, improving DSM performance when running decision-support systems. The Lachesis database storage manager [15] exploits unique disk drive characteristics to improve performance of DSS workloads and compound workloads that consist of DSS and OLTP queries competing for the same storage device. It matches page allocation and access policies to leverage these characteristics, but the storage model itself is not different; Lachesis transparently stores the in-memory NSM pages in the storage device's logical blocks.

MEMStores [5] are a promising new type of storage device that has the potential to provide efficient accesses to two-dimensional data. Schlosser et al. proposed a data layout for MEMStores that exploits their inherent access parallelism [19]. Yu et al. devised an efficient mapping of database tables to this layout that takes advantage of the unique characteristics of MEMStores [22] to improve query performance. Similarly, Schindler et al. [16] proposed a data organization for two-dimensional access to database tables mapped to logical volumes composed of several disk drives. However, these initial works did not explore the implications of this new data organization on in-memory access performance.

In summary, these solutions either address only some of the performance trade-offs or are applicable to only one level of the memory hierarchy. Clotho builds on the previous work and uses a decoupled data layout that can adapt to dynamic changes in workloads without the need to maintain multiple copies of data, reorganize data layout, or compromise between memory and I/O access efficiency.

3 Decoupling data layouts

From the discussion in the previous section, it is clear that designing a static scheme for data placement in memory and on non-volatile storage that performs well across different workloads and different device types and technologies is difficult. Instead of accepting the trade-offs inherent to a particular page layout that affects all levels of the memory hierarchy, we propose a new approach.

As each level of the memory hierarchy has vastly different performance characteristics, the data organization at each level should be different. Therefore, Clotho decouples the in-memory data layout from the storage organization and implements different data layouts tailored to each level, without compromising performance at the other levels. The challenge is to ensure that this decoupling works seamlessly within current database systems. This section introduces the different data organizations and describes the key components of the Clotho architecture.

3.1 Data organization in Clotho

Clotho allows for decoupled data layouts and different representations of the same table at the memory and storage levels. Figure 1 depicts an example table, R, with three attributes: ID, Name, and Age. At the storage level, the data is organized into A-pages. An A-page contains all attributes of the records; only one A-page needs to be fetched to retrieve a full record. Exploiting the idea used in PAX [1], an A-page organizes data into minipages that group values from the same attribute for efficient predicate evaluation, while the rest of the attributes are in the same A-page. To ensure that the record reconstruction cost is minimized regardless of the size of the A-page, Clotho allows the device to use optimized methods for placing the contents of the A-page onto the storage medium. Therefore, not only does Clotho fully exploit sequential scan for evaluating predicates, but it also places A-pages carefully on the device to ensure near-sequential (or semi-sequential [16]) access when reconstructing a record. The placement of A-pages on the disk is further explained in Section 5.1.

[Figure 1: Decoupled storage device and in-memory layouts. The figure shows the example table R (ID, Name, Age), its storage format as two A-pages with one minipage per attribute, and the memory format as a single C-page with header (ID, Age) built for the query SELECT ID FROM R WHERE Age>30.]

The rightmost part of Figure 1 depicts a C-page, which is the in-memory representation of a page. The page frame is sized by the buffer pool manager and is on the order of 8 KB. A C-page is similar to an A-page in that it also contains attribute values grouped in minipages, to maximize processor cache performance. Unlike an A-page, however, a C-page only contains values for the attributes the query accesses. Since the query in the example uses only ID and Age, the C-page includes only these two attributes, maximizing memory utilization. Note that the C-page uses data from two A-pages to fill up the space "saved" from omitting Name. In the rest of this paper, we refer to the C-page layout as the Clotho storage model (CSM).

3.2 System architecture
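
Before turning to the individual components, the A-page and C-page relationship from Section 3.1 can be sketched in a few lines. The sketch uses the rows of table R from Figure 1; the dictionary representation and the function names `build_c_page` and `select_id_where_age_gt` are hypothetical illustrations, not the data structures of the Clotho prototype.

```python
# Table R from Figure 1, stored as two A-pages of four records each.
# Each A-page keeps one minipage (a list of values) per attribute.
A_PAGES = [
    {"ID": [1237, 4322, 1563, 7658],
     "Name": ["Jane", "John", "Jim", "Susan"],
     "Age": [30, 52, 45, 20]},
    {"ID": [2865, 1015, 2534, 8791],
     "Name": ["Tom", "Jerry", "Jean", "Kate"],
     "Age": [31, 25, 54, 33]},
]


def build_c_page(a_pages, attrs):
    """Build a C-page holding only the requested attributes.

    The minipages of each requested attribute are copied from every
    A-page, so the space saved by omitted attributes is reused for
    records from additional A-pages.
    """
    c_page = {"schema": tuple(attrs), "minipages": {a: [] for a in attrs}}
    for a_page in a_pages:
        for attr in attrs:
            c_page["minipages"][attr].extend(a_page[attr])
    return c_page


def select_id_where_age_gt(c_page, threshold):
    """Evaluate SELECT ID FROM R WHERE Age > threshold on a C-page."""
    ids = c_page["minipages"]["ID"]
    ages = c_page["minipages"]["Age"]
    return [rec_id for rec_id, age in zip(ids, ages) if age > threshold]


c_page = build_c_page(A_PAGES, ["ID", "Age"])
print(select_id_where_age_gt(c_page, 30))  # [4322, 1563, 2865, 2534, 8791]
```

The Name minipages are never copied, which mirrors the point that a C-page spends memory only on the attributes a query accesses.
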
The difficulty in building a database system that can decouple in-memory page layout from storage organization lies in implementing the necessary changes without undue increase in system and code complexity. To allow decoupled data layouts, Clotho changes parts of two database system components, namely the buffer pool manager and the storage manager. The changes span limited areas in these components and do not alter the query processing interface. Clotho also takes advantage of the Atropos logical volume manager (LVM) that allows efficient access to two-dimensional structures in both dimensions.

Figure 2 shows the relevant components of a database system to highlight the interplay between Clotho and other components. Each component can independently take advantage of enabling hardware/OS technologies at each level of the memory hierarchy, while hiding the details from the rest of the system. This section outlines the role of each component. The changes to the components are further explained in Sections 4 and 5.1 while details specific to our prototype implementation are provided in Section 6.

[Figure 2: Interaction of Clotho with other components. From top to bottom: operators (tblscan, idxscan, ...), which request pages with a subset of attributes (payload) and access the payload; the buffer pool manager; the storage manager, whose payload data is placed directly via scatter/gather I/O; and the Atropos logical volume manager over a disk array logical volume, whose storage interface exposes efficient access to non-contiguous blocks.]

The operators are essentially predicated scan and store procedures that access data from in-memory pages stored in a common buffer pool. They take advantage of the query-specific page layout of C-pages that leverages the L1/L2 CPU cache characteristics and cache prefetch logic for efficient access to data.

The buffer pool manager manages C-pages in the buffer pool and enables sharing across different queries that need the same data. In traditional buffer pool managers, a buffer page is assumed to have the same schema and contents as the corresponding relation. In Clotho, however, this page may contain a subset of the table schema attributes. To ensure sharing, correctness during updates, and high memory utilization, the Clotho buffer pool manager maintains a page-specific schema that denotes which attributes are stored within each buffered page (i.e., the page schema). The challenge of this approach is to ensure minimal I/O by determining sharing and partial overlapping across concurrent queries with minimal book-keeping overhead. Section 4 describes the buffer pool manager operation in detail.

The storage manager maps A-pages to specific logical blocks of the logical volume, called LBNs. Since the A-page format is different from the in-memory layout, the storage manager rearranges A-page data on-the-fly into C-pages using the query-specific CSM layout. Unlike traditional storage managers where pages are also the smallest access units, the Clotho storage manager selectively retrieves a portion of a single A-page. With scatter/gather I/O and direct memory access (DMA), the pieces of individual A-pages can be delivered directly into the proper memory frame(s) in the buffer pool as they arrive from the logical volume. The storage manager simply sets up the appropriate I/O vectors with the destination address ranges for the requested LBNs. The data is placed directly at its destinations without the storage manager's involvement or the need for data shuffling and extraneous memory copies.

To efficiently access data for a variety of access patterns, the storage manager relies on explicit hints provided by the logical volume manager. These hints convey which LBNs can be accessed together efficiently and Clotho uses them for allocating A-pages.

The logical volume manager (LVM) maps volume LBNs to the physical blocks of the underlying storage device(s). It is independent of the database system and is typically implemented with the storage system (e.g., disk array). The Atropos LVM leverages device-specific characteristics to create mappings that yield efficient access to a collection of LBNs. In particular, the Atropos storage interface exports two functions that establish explicit relationships between individual LBNs of the logical volume and enable the Clotho storage manager to effectively map A-pages to individual LBNs. One function returns the set of consecutive LBNs that yield efficient access (e.g., all blocks mapped onto one disk or MEMStore track). Clotho maps minipages containing the same attribute to these LBNs for efficient scans of a subset of attributes. Another function returns a set of non-contiguous LBNs that can be efficiently accessed together (e.g., parallel-accessible LBNs mapped to different MEMStore tips or disks of a logical volume). Clotho maps a single A-page to these LBNs. The Atropos LVM is briefly described in Section 5.1 and detailed elsewhere [16, 19]. Clotho need not rely on the Atropos LVM to create query-specific C-pages. With conventional LVMs, it can map a full A-page to a contiguous run of LBNs with each minipage mapped to one or more discrete LBNs. However, with these conventional LVMs, access only along one dimension will be efficient.

3.3 Benefits of decoupled data layouts

The concept of decoupling data layouts at different levels of the memory hierarchy offers several benefits.

Leveraging unique device characteristics. At the volatile
(main memory) level, Clotho uses CSM, a data layout that maximizes processor cache utilization by minimizing unnecessary accesses to memory. CSM organizes data in C-pages and also groups attribute values to ensure that only useful information is brought into the processor caches [1, 9]. At the storage-device level, the granularity of accesses is naturally much coarser. The objective is to maximize memory utilization for all types of queries by only bringing into the buffer pool data that the query needs.

Query-specific memory layout. With memory organization decoupled from storage layout, Clotho can decide what data is needed by a particular query, request only the needed data from a storage device, and arrange the data on-the-fly into an organization that is best suited for the particular query needs. This fine-grained control over what data is fetched and stored also puts less pressure on buffer pool and storage system resources. By not requesting data that will not be needed, a storage device can devote more time to servicing requests for other queries executing concurrently and hence speed up their execution.

Dynamic adaptation to changing workloads. A system with flexible data organization does not experience performance degradation when query access patterns change over time. Unlike systems with static page layouts, where the binding of data representation to workload occurs during table creation, this binding is done in Clotho only during query execution. Thus, a system with decoupled data organizations can easily adapt to changing workloads and also fine-tune the use of available resources when they are under contention.

4 Buffer pool manager

The Clotho buffer pool manager organizes relational table data in C-pages using the CSM data layout. CSM is a query-optimized in-memory page layout that stores only the subset of attributes needed by a query. Consequently, a C-page can contain a single attribute (similar to DSM), a few attributes, or all attributes of a given set of records (similar to NSM and PAX) depending on query needs. This section describes how the buffer pool manager constructs and maintains C-pages and ensures data sharing and consistency in the buffer pool.

4.1 In-memory C-page layout

Figure 3 depicts two examples of C-pages for a table with four attributes of different sizes. In our design, C-pages only contain fixed-size attributes. Variable-size attributes are stored separately in other page layouts (see Section 6.2). A C-page contains a page header and a set of minipages, each containing data for one attribute and collectively holding all attributes needed by queries. In a minipage, a single attribute's values are stored in consecutive memory locations to maximize processor cache performance. The current number of records and presence bits are distributed across the minipages. Because the C-page only handles fixed-size attributes, the size of each minipage is determined at the time of table creation.

[Figure 3: C-page layout. (a) Full records: a header (1 block) followed by minipages 1 to 4 (2, 3, 5, and 5 blocks). (b) Partial records: a header (1 block) followed by three groups of minipages 1 and 2 (2 and 3 blocks each), one group per A-page.]

The page header stores the following information: page id of the first A-page, the number of partial A-pages contained, the starting address of each A-page, a bit vector indicating the schema of the C-page's contents, and the maximal number of records that can fit in an A-page.

Figure 3(a) and Figure 3(b) depict C-pages with complete and partial records, respectively. The leftmost C-page is created for queries that access full records, whereas the rightmost C-page is customized for queries touching only the first two attributes. The space for minipages 3 and 4 on the left is used to store more partial records from additional A-pages on the right. In this example, a single C-page can hold the requested attributes from three A-pages, increasing memory utilization by a factor of three.

On the right side of the C-page we list the number of storage device blocks each minipage occupies. In our example each block is 512 bytes. Depending on the relative attribute sizes, as we fill up the C-page using data from more A-pages there may be some unused space. Instead of performing costly operations to fill up that space, we choose to leave it unused. Our experiments show that, with the right page size and aggressive prefetching, this unused space does not cause a detectable performance deterioration (details about space utilization are in Section 7.6).

4.2 Data sharing in buffer pool

Concurrent queries do not necessarily access the same sets of attributes; concurrent sets of accessed attributes may be disjoint, inclusive, or otherwise overlapping. The Clotho buffer pool manager must (a) maximize sharing, ensuring memory space efficiency, (b) minimize book-keeping to keep the buffer pool operations light-weight, and (c) maintain consistency in the presence of updates.

As an example, query Q1 asks for attributes a1 and a2 while query Q2 asks for attributes a2 and a3. Using a simple approach, the buffer manager could create two separate C-pages tailored to each query. This approach ignores the sharing possibilities in case these queries scan the table concurrently. To achieve better memory utilization, the buffer manager can instead dynamically reorganize the minipages of a1 and a2 inside the two C-pages, fetching only the needed values and keeping track of the progress of each query, dynamically creating new C-pages for Q1 and
if read-only query then
    if p_sch ⊇ q_sch then
        Do nothing

When looking for a record, a traditional database buffer manager looks for the corresponding page id in the page table, and determines whether the record is in memory.
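
The comparison between a page schema and a query schema in the fragment above can be illustrated with bit vectors, which is how the C-page header is said to record its schema. The following is a minimal sketch under our own assumptions: the mapping of attribute i to bit i, the `Overlap` names, and the helper functions are hypothetical, and what the buffer pool manager does in the partial and disjoint cases is not covered by this excerpt.

```python
from enum import Enum


class Overlap(Enum):
    COVERED = "page schema includes every attribute the query needs"
    PARTIAL = "page schema holds some, but not all, needed attributes"
    DISJOINT = "page schema holds none of the needed attributes"


def schema_mask(attr_indexes):
    """Encode a set of attribute positions as a bit vector (bit i = attribute i)."""
    mask = 0
    for index in attr_indexes:
        mask |= 1 << index
    return mask


def classify(p_sch: int, q_sch: int) -> Overlap:
    """Compare a page schema with a query schema, both given as bit vectors."""
    shared = p_sch & q_sch
    if shared == q_sch:
        # The page already holds everything a read-only query needs.
        return Overlap.COVERED
    if shared:
        return Overlap.PARTIAL
    return Overlap.DISJOINT


# Q1 reads a1 and a2; Q2 reads a2 and a3 (attributes numbered from bit 0).
q1 = schema_mask([0, 1])
q2 = schema_mask([1, 2])
print(classify(q1, q2).name)       # PARTIAL: only a2 is shared
print(classify(q1 | q2, q2).name)  # COVERED: a page with a1, a2, a3 serves Q2
```

A single AND and one comparison per lookup keeps the check cheap, which is in line with the stated goal of minimal book-keeping overhead.
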