Computational Database Systems for Massive Scientific Datasets
David O'Hallaron, Anastassia Ailamaki, and Gregory Ganger
Carnegie Mellon University
Pittsburgh, PA 15213

(This page will not be included in the submitted version)

March 1, 2004

Proposal submitted to the National Science Foundation, Directorate for Computer and Information Sciences and Engineering, SEIII Program Announcement (NSF 04-528), Division of Information and Intelligent Systems (IIS), Science and Engineering Informatics Cluster.

Part A  Project Summary

We propose to push the frontiers of scientists' ability to explore and understand massive scientific datasets by developing a new approach called computational database systems. The idea is to integrate scientific datasets stored as spatial databases with specialized, tightly coupled, and highly optimized functions that query and manipulate these databases. Doing so will simultaneously simplify domain-specific programming (since datasets are now in a form that can be queried generally) and greatly increase the scale of problems that can be studied efficiently (since the complete dataset no longer needs to fit into memory).

Many scientific problems are constrained by the complexity of manipulating massive datasets. Recently, for example, a group of us at Carnegie Mellon received the Gordon Bell Prize for producing and simulating the largest unstructured finite element model ever built. The model simulates ground motion during an earthquake in the Los Angeles basin across a mesh with over 1B nodes and elements. Although exciting, this achievement came with some painful lessons about the shortcomings of our current capabilities for dealing with massive datasets. In particular, the dataset that we generated for the Gordon Bell Prize has limited scientific value because it is too large to be queried and explored effectively, requiring heroic efforts just to extract a few waveforms for publication. In a nutshell, dramatic increases in computing power and storage capacity have allowed scientists to build simulations that model nature in more detail than ever. However, such simulations require massive datasets that introduce massive headaches for the modelers. Only with a new approach, such as that proposed here, can we bypass these hurdles.

Intellectual merit. Exploring computational database systems creates a new interdisciplinary research area that lies at the intersection of scientific computing, database management, and storage systems. Our general approach seeks to build on decades of database research, but since we are using databases for a new domain (scientific computing), we will need to augment existing techniques with new algorithms and storage structures. A significant intellectual challenge will be understanding the strengths and limitations of previous database work so that we do not reinvent things unnecessarily.

The fundamental problem to be addressed is the efficient storage and querying of the massive datasets produced by unstructured finite element simulations. Solving this problem requires the original input mesh to be in a database that can be efficiently queried, so we first must tackle the problem of generating massive queryable unstructured finite element meshes. Preliminary experience suggests an approach based on linear octrees indexed by B-trees using locational code keys. The key result will be the ability to query the mesh for points that do not align with existing mesh nodes.
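To make the locational-code idea concrete, the sketch below shows one common way to form such keys: interleave the bits of an octant's corner coordinates into a Morton (Z-order) code and pack the refinement level into the low bits, so that the resulting integers can serve directly as B-tree keys. This is a minimal illustration under assumed conventions (the constant MAX_LEVEL, the 5-bit level field, and all names are our own illustrative choices), not the proposal's final design.

```c
/*
 * Sketch: locational-code keys for a linear octree, assuming each octant
 * is identified by the integer coordinates (x, y, z) of its lower corner
 * on the finest-level 2^MAX_LEVEL grid plus its refinement level.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_LEVEL 19   /* up to 2^19 octants per axis; 3*19 + 5 bits fit in 64 */

/* Interleave the low MAX_LEVEL bits of x, y, and z into one Morton code. */
static uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t code = 0;
    for (int i = 0; i < MAX_LEVEL; i++) {
        code |= ((uint64_t)((x >> i) & 1)) << (3 * i);
        code |= ((uint64_t)((y >> i) & 1)) << (3 * i + 1);
        code |= ((uint64_t)((z >> i) & 1)) << (3 * i + 2);
    }
    return code;
}

/* Locational code: Morton code of the octant's lower corner, with the
 * refinement level packed into the low 5 bits.  An octant's key sorts
 * before the keys of every octant it contains, so sorted keys traverse
 * the octree in Z-order. */
static uint64_t locational_code(uint32_t x, uint32_t y, uint32_t z, uint32_t level)
{
    return (morton3d(x, y, z) << 5) | (uint64_t)level;
}

int main(void)
{
    /* Example: a level-3 octant (edge length 2^16 grid units) and a level-4
     * octant contained inside it; the containing octant's key sorts first. */
    uint64_t parent = locational_code(1u << 16, 0, 0, 3);
    uint64_t child  = locational_code(1u << 16, 1u << 15, 0, 4);
    printf("parent key: %llu\nchild key:  %llu\n",
           (unsigned long long)parent, (unsigned long long)child);
    return 0;
}
```

Because containment corresponds to key-range containment in this scheme, a point query can be answered by searching the B-tree for the largest key that does not exceed the query point's key, which is what makes queries at arbitrary, non-aligned points feasible.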
Given this queryable mesh structure, we then need to develop space- and time-efficient techniques for representing and querying the output datasets. The key problem here is to develop techniques for compressing and indexing the output dataset while still allowing for efficient query execution. Solutions we will explore include multi-level indexing schemes for the compressed data, and storage layouts and algorithms that support efficient queries in multiple dimensions by decoupling the layout strategies at different levels of the storage hierarchy.

Broader impacts. The research we are proposing will be helpful to a broad group of scientists who generate and access massive datasets. In particular, earth scientists at the Southern California Earthquake Center (SCEC) in Los Angeles will be our first students and customers. The PI works closely with SCEC through the NSF-funded ITR project Community modeling environment for system-level earthquake research (EAR-01-22464). The ideas developed by the proposed research will feed directly into the SCEC community modeling environment, and will be used by the earth scientists and their students at SCEC institutions. Other scientists can then follow.

Part B  Table of Contents

A  Project Summary
B  Table of Contents
C  Project Description
D  References Cited
E  Biographical Sketches of the PIs

Part C  Project Description

1  Problem Statement

This is an exciting time for scientists and engineers who simulate nature on their computers. Processor speeds continue to increase, with 2 GHz machines now the norm. Disk capacity is exploding and price per bit is plummeting, with hundreds of gigabytes of storage available for a few hundred dollars. At the same time, mature RAID storage technology aggregates both storage and I/O bandwidth to provide terabytes of fast storage for only $5K/terabyte, with throughputs that are comparable to typical main memory throughputs of 100 MB/sec [10]. After years of inflated prices, DRAM prices have plummeted, from more than $30/MB five years ago to less than $0.50/MB today, with main memory sizes of 1 GB now common on the desktop.

These technology trends allow scientists to use their own inexpensive desktop or lab machines to run massive computations that in the past would have required time on scarce and expensive supercomputers. However, massive computations often require massive input and output datasets, and in our experience, the size of these datasets is rapidly outpacing our ability to manipulate and use them. For example, for the past 10 years, the PI has been building computer models that predict the motion of the ground during strong earthquakes in the Los Angeles basin (LAB) [7, 8, 23, 3]. In 1993, the largest LAB simulation code required an input unstructured finite element mesh with only 50K nodes (1.5 MB) and produced a relatively small 500 MB output dataset. By 2003, the largest LAB simulation required a mesh with 1.37B nodes (45 GB) and generated an output dataset that was over one TB. To our knowledge, this is the largest unstructured finite element model ever produced. For generating this mesh and using it to simulate earthquake ground motion, and for related work on inversion, the PI (with other members of the Carnegie Mellon Quake group) received the 2003 Gordon Bell Prize for Special Achievement [3].
The record-setting LAB effort required us to run large parallel simulation codes (solvers) on the Terascale system at Pittsburgh Supercomputing Center. While this activity was nontrivial, it is something that our group and other scientific modelers have become quite adept at over the past decade [7, 8, 36]. Interestingly, the really nasty problems were caused by the massive sizes of the input and output datasets. Here are some of the difficult problems we are confronting: How do we build massive unstructured input meshes? Unstructured meshes are difficult to build because they have irregular topologies that can require complex pointer-based data structures to represent. Thus, conventional mesh generators build a mesh in main memory, where it can be accessed and updated quickly, and then dump the finished product to disk as one or more flat files [44, 43]. Unfortunately, meshes with billions of nodes will not fit in the main memories of most machines. While symmetric multiprocessors with massive shared memories do exist at a few supercomputing centers, such machines are scarce shared resources and their entire address space is not available to applications running on single processors. Recent advances in parallel mesh generation address this issue by aggregating the memories of multiple processors [13, 34]. However, the published size of these meshes is on the order of tens of millions of elements, still several orders of magnitude smaller than our target size. How do we perform efficient queries of compressed spatio-temporal datasets? The output dataset O’Hallaron, Ailamaki, and Ganger Computational Database Systems for Massive Scientific Datasets Page C-2 of C-15 produced by our earthquake solver (hereafter called an earthquake dataset) is a four-dimensional spatio- temporal field that captures the velocities of the mesh nodes over time. Earthquake datasets can be as large as 25 TB, and since 85-95% of the velocities are effectively zero (in the first percentile), they are good candidates for some form of compression. Basic point queries of the earthquake dataset require efficient queries to the input mesh in order to extract the locations of the corners of the enclosing octant. When meshes were small enough to fit entirely in memory, we could just load them up into a memory-resident data structure and query them directly. However, this approach no longer works for massive meshes. We must develop new approaches for querying massive meshes stored on disk. Point queries are aggregated in both space and time. Engineering analyses require time-varying queries where the spatial coordinates are constant and time varies, while visualizations require space-varying queries where the spatial coordinates vary and time is constant. One problem faced by the aggregate queries is that data layouts favoring time-varying queries tend to punish space-varying queries, and vice versa. A key research task is to develop database and storage structures that will allow us to support both types of queries. Another important task is to develop compression techniques that will still allow fast queries over the compressed datasets. In a nutshell, dramatic increases in computing power and storage capacity have allowed scientists and en- gineers to build simulations that model nature in more detail than ever.
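To make the layout tension between time-varying and space-varying queries concrete, the sketch below shows one possible direction: a chunked ("bricked") layout of the space-time velocity field in which each fixed-size brick holds a block of spatially adjacent nodes over a block of consecutive timesteps. With such a layout, a time series at one point touches roughly T/T_CHUNK bricks and a spatial snapshot touches roughly N/N_CHUNK bricks, so neither query class degenerates into a full scan. The chunk sizes, names, and addressing scheme are illustrative assumptions, not the proposal's actual storage format.

```c
/*
 * Sketch: addressing samples in a bricked space-time layout, assuming the
 * field is a dense array of (node, timestep) samples and nodes are ordered
 * spatially (e.g., by locational code) so that nearby nodes get nearby ids.
 */
#include <stdint.h>
#include <stdio.h>

#define N_CHUNK 1024   /* spatially adjacent nodes per brick (illustrative) */
#define T_CHUNK 64     /* consecutive timesteps per brick (illustrative)    */

typedef struct {
    uint64_t brick;    /* which brick holds the sample      */
    uint64_t offset;   /* sample index within that brick    */
} sample_addr;

/* Map a (node, timestep) pair to its brick and in-brick offset.
 * num_nodes is the total node count of the mesh. */
static sample_addr locate(uint64_t node, uint64_t step, uint64_t num_nodes)
{
    uint64_t bricks_per_step_row = (num_nodes + N_CHUNK - 1) / N_CHUNK;
    sample_addr a;
    a.brick  = (step / T_CHUNK) * bricks_per_step_row + node / N_CHUNK;
    a.offset = (step % T_CHUNK) * N_CHUNK + node % N_CHUNK;
    return a;
}

int main(void)
{
    /* Example: the sample for node 5000 at timestep 130 in a 1.37B-node mesh. */
    sample_addr a = locate(5000, 130, 1370000000ULL);
    printf("brick %llu, offset %llu\n",
           (unsigned long long)a.brick, (unsigned long long)a.offset);
    return 0;
}
```

A layout like this also composes naturally with compression and with decoupled layout strategies: each brick can be compressed independently and indexed at the brick level, while the placement of bricks on disk or across a RAID array can be chosen separately from the in-brick ordering.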