The Database Architectures Research Group at CWI
Martin Kersten, Stefan Manegold, Sjoerd Mullender
CWI Amsterdam, The Netherlands
fi[email protected]
http://www.cwi.nl/research-groups/Database-Architectures/

SIGMOD Record, December 2011 (Vol. 40, No. 4) 39

1. INTRODUCTION

The Database research group at CWI was established in 1985. It has steadily grown from two PhD students to a group of 17 people ultimo 2011. The group is supported by a scientific programmer and a system engineer to keep our machines running. In this short note, we look back at our past and highlight the multitude of topics being addressed.

2. THE MONETDB ANTHOLOGY

The workhorse and focal point for our research is MonetDB, an open-source columnar database system. Its development goes back as far as the early eighties, when our first relational kernel, called Troll, was shipped as an open-source product. It spread to ca. 1000 sites world-wide and became part of a software case-tool until the beginning of the nineties. None of the code of this system has survived, but ideas and experiences on how to obtain a fast relational kernel by simplification and explicit materialization found their origin during this period.

The second half of the eighties was spent on building the first distributed main-memory database system in the context of the national Prisma project. A fully functional system of 100 processors and a, for that time, wealthy 1 GB of main memory showed the road to developing database technology from a different perspective: design from the processor to the slow disk, rather than the other way around.

Immediately after the Prisma project, a new kernel based on Binary Association Tables (BATs) was laid out. This storage engine became accessible through MIL, a scripting language intended as a target for compiling SQL queries. The target application domain was to better support scientific databases with their (archaic) file structures. It quickly shifted to a more urgent and emerging area: several datamining projects called for better database support. This culminated in our first spin-off company, Data Distilleries, in 1995, which based its analytical customer relationship suite on the power provided by the early MonetDB implementations. In the years following, many technical innovations were paired with strong industrial maturing of the software base. Data Distilleries became a subsidiary of SPSS in 2003, which in turn was acquired by IBM in 2009.

Moving MonetDB Version 4 into the open-source domain required a large number of extensions to the code base. It became of the utmost importance to support a mature implementation of the SQL-03 standard, and the bulk of application programming interfaces (PHP, JDBC, Python, Perl, ODBC, RoR). The result of this activity was the first official open-source release in 2004. A very strong XQuery front-end was developed with partners and released in 2005 [1].

MonetDB remains a product well-supported by the group. All its members carry out part of the development and maintenance work, handle user inquiries, or act as guinea pigs for newly added features. A thorough daily regression-testing infrastructure ensures that changes applied to the code base survive an attack of ca. 20 platform configurations, including several Linux flavors, Windows, FreeBSD, Solaris, and MacOS X. A monthly bugfix release and ca. 3 feature releases per year support our ever-growing user community. The web portal (http://www.monetdb.org/) provides access to this treasure chest of modern database technology. It all helped us to create and maintain a stable platform for innovative research directions, as summarized below. The MonetDB spin-off company was set up to support its market take-up, and to provide a foundation for QA, support, and development activities that are hard to justify in a research institute on an ongoing basis.

3. HARDWARE-CONSCIOUS DATABASE TECHNOLOGY

A key innovation in the MonetDB code base is its reliance on hardware-conscious algorithms. For, advances in the speed of commodity CPUs have far outpaced advances in RAM latency. Main-memory access has therefore become a performance bottleneck for many computer applications, including database management systems; a phenomenon widely known as the "memory wall." A revolutionary redesign of database architecture was called for in order to take advantage of modern hardware, and in particular to avoid hitting this memory wall. Our pioneering research on columnar and hardware-aware database technology, as materialized in MonetDB, is widely recognized, as indicated by the VLDB 2009 10-year Best Paper Award [19, 2] and two DaMoN best paper awards [22, 6]. Here, we briefly highlight important milestones.

Vertical Storage. Whereas traditional relational database systems store data in a row-wise fashion (which favors single-record lookups), MonetDB uses columnar storage, which favors analysis queries by making better use of CPU cache lines.

Bulk Query Algebra. Much like the CISC vs. RISC idea applied to CPU design, the MonetDB relational algebra is deliberately simplified compared to the traditional relational set algebra. Paired with an operator-at-a-time bulk execution model, rather than the traditional tuple-at-a-time pipelining model, this allows for much faster implementation on modern hardware, as the code requires far fewer function calls and conditional branches.

Cache-conscious Algorithms. The crucial aspect in overcoming the memory wall is good use of CPU caches, which requires careful tuning of memory access patterns. This led to a new breed of query processing algorithms. Their key requirement is to restrict any random data access pattern to data regions that fit into the CPU caches, thus avoiding cache misses and the resulting performance degradation. For instance, partitioned hash-join [2] first partitions both relations into H separate clusters that each fit into the CPU caches. The join is then performed per pair of matching clusters, building and probing the hash-table on the inner relation entirely inside the CPU cache. With large relations and small CPU caches, efficiently creating a large number of clusters can become a problem in itself: if H exceeds the number of TLB entries or cache lines, each memory reference will trigger a TLB or cache miss, compromising performance significantly. With radix-cluster [17], we prevent that problem by performing the clustering in multiple passes, such that each pass creates at most as many new sub-clusters as there are TLB entries or cache lines. With radix-decluster [18], we complement partitioned hash-join with a projection (tuple reconstruction) algorithm with a cache-friendly data access pattern.

Memory Access Cost Modeling. For query optimization to work in a cache-conscious environment, and to enable automatic tuning of our cache-conscious algorithms on different types of hardware, we developed a methodology for creating cost models that take the cost of memory access into account [16]. The key idea is to abstract data structures as data regions and model the complex data access patterns of database algorithms in terms of simple compounds of a few basic data access patterns. We developed cost functions to estimate the cache misses for each basic pattern, and rules to combine basic cost functions and derive the cost functions of arbitrarily complex patterns. The total cost is then the number of cache misses multiplied by their latency. In order to work on diverse computer architectures, these models are parametrized at run-time using automatic calibration techniques.

Vectorized Execution. In the "X100" project, we explored a compromise between classical tuple-at-a-time pipelining and operator-at-a-time bulk processing [3]. The idea of vectorized execution is to operate on chunks (vectors) of data that are large enough to amortize function call overheads, but small enough to fit in CPU caches and avoid materialization into main memory. Combined with just-in-time light-weight compression, it lowers the memory wall somewhat. The X100 project has been commercialized into the Actian/VectorWise company and product line (http://www.actian.com/products/vectorwise/).

4. DISTRIBUTED PROCESSING

After more than a decade of rest at the frontier of distributed database processing, we embarked upon several innovative projects in this area again.

Armada. An adventurous project called Armada searched for technology to create a fully autonomous and self-regulating distributed database system [5]. The research hypothesis was to organize a large collection of database instances around a dynamically partitioned database. Each time an instance ran out of resources, it could solicit a spare machine and decide autonomously on what portion to delegate to its peer. The decisions were reflected in the SQL catalog, which triggered continuous adaptive query modification to hunt after the portions in the loosely connected network of workers. It never matured as part of the MonetDB distribution, because at that time we did not have all the basic tools to let it fly.

Since then, the Merovingian toolkit has been developed and now provides the basis for massive distributed processing. It provides server administration, server discovery features, client proxying and funneling to accommodate large numbers of (web) clients, basic distributed (multiplex) query processing, and fail-over functionality for a large number of MonetDB servers in a network.

query processing can take place. We start from a single master node in control of the database, and with a variable number of worker nodes to be used for delegated query processing. Data is shipped just-in-time to the worker nodes using a need-to-
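
The vertical storage layout of Section 3 can be illustrated with a toy decomposition of rows into per-column arrays: an analysis query then scans one contiguous, tightly packed array instead of striding over whole records. This is a sketch of the idea only, not MonetDB's storage code; the data and column names are invented for illustration.

```python
from array import array

# Toy contrast of row-wise vs. column-wise layout. In the columnar
# layout each attribute is one contiguous array, so a scan of a single
# attribute touches consecutive memory (cache-line friendly).

rows = [(i, i * 2.0, i % 7) for i in range(1000)]   # row-store view

# column-store view: one tightly packed array per attribute
col_id  = array("q", (r[0] for r in rows))   # 8-byte ints
col_val = array("d", (r[1] for r in rows))   # 8-byte floats
col_grp = array("q", (r[2] for r in rows))

# analysis query: sum one attribute -- reads only col_val, sequentially
col_sum = sum(col_val)

# the row-wise equivalent drags every attribute through the cache
row_sum = sum(r[1] for r in rows)

assert col_sum == row_sum
```

The same decomposition underlies the BAT model of Section 2: each column is, conceptually, a two-column (oid, value) table sharing a dense object identifier with its siblings.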
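
The multi-pass radix-cluster scheme [17] and the per-cluster hash-join described in Section 3 can be sketched in a few lines. This is a simplified model, not MonetDB's implementation: `MAX` fan-out per pass stands in for the number of TLB entries or cache lines, the parameter names are ours, and keys are bare integers.

```python
# Toy sketch of radix-cluster: partition keys on their low `total_bits`
# bits, but in several passes whose fan-out never exceeds
# 2**bits_per_pass, so each pass stays TLB/cache friendly.

def radix_cluster(keys, total_bits, bits_per_pass):
    """Partition `keys` into 2**total_bits clusters on their low bits."""
    clusters = [keys]
    shift = 0
    while shift < total_bits:
        step = min(bits_per_pass, total_bits - shift)
        fanout = 1 << step
        next_clusters = []
        for cluster in clusters:
            buckets = [[] for _ in range(fanout)]
            for k in cluster:
                buckets[(k >> shift) & (fanout - 1)].append(k)
            next_clusters.extend(buckets)
        clusters = next_clusters
        shift += step
    return clusters

def partitioned_hash_join(r, s, total_bits, bits_per_pass):
    """Join equal keys of r and s per pair of matching clusters,
    building the hash table on the (small) inner cluster only."""
    out = []
    r_clusters = radix_cluster(r, total_bits, bits_per_pass)
    s_clusters = radix_cluster(s, total_bits, bits_per_pass)
    for r_cl, s_cl in zip(r_clusters, s_clusters):
        table = set(r_cl)                  # fits "in cache" by construction
        out.extend(k for k in s_cl if k in table)
    return out

keys = [7, 12, 3, 8, 15, 0, 5, 10]
clusters = radix_cluster(keys, total_bits=2, bits_per_pass=1)
# 4 clusters on the low 2 bits, built in 2 passes of fan-out 2
assert len(clusters) == 4
assert sorted(sum(clusters, [])) == sorted(keys)
```

Because both relations are clustered with the same pass structure, matching clusters line up by position, so each build/probe pair stays within the cache-sized regions the text describes.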
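
The memory-access cost-model idea of Section 3 — estimate cache misses per basic access pattern, combine them, and multiply by the miss latency — can be made concrete with a deliberately crude sketch. The cache-line size and latency below are invented constants; in the methodology of [16] such parameters come from run-time calibration, and the real models cover far richer patterns than the two shown here.

```python
# Toy memory-access cost model: two basic access patterns and a
# combinator. Constants are illustrative assumptions, not calibrated.

LINE_BYTES = 64          # assumed cache line size
MISS_LATENCY_NS = 100    # assumed memory latency per miss

def seq_misses(n_items, item_bytes):
    """Sequential traversal: one miss per cache line touched."""
    return (n_items * item_bytes + LINE_BYTES - 1) // LINE_BYTES

def rand_misses(n_accesses, region_bytes, cache_bytes):
    """Random traversal of a data region."""
    if region_bytes <= cache_bytes:
        # region fits in cache: at most one cold miss per line
        return (region_bytes + LINE_BYTES - 1) // LINE_BYTES
    # region exceeds cache: essentially every access misses
    return n_accesses

def cost_ns(misses):
    """Total cost = number of cache misses x their latency."""
    return misses * MISS_LATENCY_NS

# e.g. sequentially scanning one million 8-byte values:
scan_cost = cost_ns(seq_misses(1_000_000, 8))
```

The contrast the model captures: a million random probes into a region larger than the cache cost a million misses, while the same probes into a cache-resident region cost only its cold misses.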
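
The vectorized execution compromise of the X100 project [3] can be sketched as a pipeline of operators that pass fixed-size chunks ("vectors") rather than single tuples or whole columns. The vector size of 1024 and the operator names are assumptions for illustration, not X100's actual kernel.

```python
# Toy vector-at-a-time pipeline: each operator consumes and produces
# cache-sized chunks, amortizing per-call overhead over a whole vector
# while never materializing a full intermediate column.

VECTOR_SIZE = 1024   # assumed; chosen to fit comfortably in cache

def vectors(column, size=VECTOR_SIZE):
    """Source operator: cut a stored column into vectors."""
    for i in range(0, len(column), size):
        yield column[i:i + size]

def select_gt(vecs, threshold):
    """Selection: filters each vector, emitting (smaller) vectors."""
    for v in vecs:
        yield [x for x in v if x > threshold]

def sum_agg(vecs):
    """Aggregation: folds vectors into a running total."""
    total = 0
    for v in vecs:
        total += sum(v)
    return total

data = list(range(10_000))
result = sum_agg(select_gt(vectors(data), 9_000))
assert result == sum(x for x in data if x > 9_000)
```

One interpreter-level function call per vector, instead of per tuple, is exactly the overhead amortization the text describes; keeping vectors small avoids the full materialization of operator-at-a-time execution.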