Computing Distance Histograms Efficiently in Scientific Databases
Total Page:16
File Type:pdf, Size:1020Kb
IEEE International Conference on Data Engineering Computing Distance Histograms Efficiently in Scientific Databases Yi-Cheng Tu 1, Shaoping Chen 2, and Sagar Pandit 3 1Department of Computer Science and Engineering, University of South Florida 4202 E. Fowler Ave., ENB118, Tampa, FL 33620, U.S.A. [email protected] 2Department of Mathematics, Wuhan University of Technology 122 Luosi Road, Wuhan, Hubei, 430070, P. R. China [email protected] 3Department of Physics, University of South Florida 4202 E. Fowler Ave., PHY114, Tampa, FL 33620, U.S.A. [email protected] Abstract— Particle simulation has become an important re- However, many of such basic analytical queries require super- search tool in many scientific and engineering fields. Data gen- linear processing time if handled in a straightforward way, erated by such simulations impose great challenges to database as they are in current scientific databases. In this paper, we storage and query processing. One of the queries against particle simulation data, the spatial distance histogram (SDH) query, is report our efforts to design efficient algorithms for a type of the building block of many high-level analytics, and requires query that is extremely important in the analysis of particle quadratic time to compute using a straightforward algorithm. simulation data. In this paper, we propose a novel algorithm to compute SDH Particle simulations are computer simulations in which the based on a data structure called density map, which can be easily basic components of large systems (e.g., atoms, molecules, implemented by augmenting a Quad-tree index. We also show the results of rigorous mathematical analysis of the time complexity stars, galaxies ...) are treated as classical entities (i.e., particles) 3 of the proposed algorithm: our algorithm runs on Θ(N 2 ) for that interact for certain duration under postulated empirical 5 two-dimensional data and Θ(N 3 ) for three-dimensional data, forces. For example, molecular simulations (MS) explore rela- respectively. We also propose an approximate SDH processing tionship between molecular structure, movement and function. algorithm whose running time is unrelated to the input size N. These techniques are primarily applicable in the modeling of Experimental results confirm our analysis and show that the complex chemical and biological systems that are beyond the approximate SDH algorithm achieves very high accuracy. scope of theoretical models. MS are most frequently used in material sciences, biomedical sciences, and biophysics, I. INTRODUCTION motivated by a wide range of applications. In astrophysics, the N–body simulations are predominantly used to describe large Many scientific fields have undergone a transition to data scale celestial structure formation [6], [7], [8], [9]. Similar to and computation intensive science, as the result of automated MS in applicability and simulation techniques, the N–body experimental equipments and computer simulations. Recent simulation comes with even larger scales in terms of total years have witnessed significant efforts in building data man- number of particles simulated. agement tools suitable for processing scientific data [1], [2], Results of particle simulations form large datasets of particle [3], [4], [5]. Scientific data imposes great challenges to the configurations. Typically, these configurations store informa- design of database management systems that are tradition- tion about the particle types, their coordinates and velocities ally optimized toward handling business applications. First, - the same type of data we have seen in spatial-temporal scientific data often come in large volumes, thus requires us databases [10]. While snapshots of configurations are interest- to rethink the storage, retrieval, and replication techniques in ing, quantitative structural analysis of inter-atomic structures current DBMSs. Second, user accesses to scientific databases are the mainstream tasks in data analysis. This requires the are focused on complex high-level analytics and reasoning calculation of statistical properties or functions of particle that go beyond simple aggregate queries. While many types coordinates [6]. Of special interest to scientists are those of domain-specific analytical queries are seen in scientific quantities that require coordinates of two particles simultane- databases, the DBMS should be able to support those that are ously. In their brute force form these quantities require O(N 2) frequently used as building blocks for more complex analysis. computations for N particles [7]. In this paper, we focus on one such analytical query: the Spatial Distance Histogram S. Chen is currently a visiting professor in the Department of Computer Science and Engineering at the University of South Florida (USF). His email (SDH) query, which asks for a histogram of the distances of at USF is: [email protected] all pairs of particles in the simulated system. 1084-4627/09 $25.00 © 2009 IEEE 796 DOI 10.1109/ICDE.2009.30 Authorized licensed use limited to: University of South Florida. Downloaded on April 17, 2009 at 11:48 from IEEE Xplore. Restrictions apply. A. Motivation of point-to-point distances that fall into a series of l ranges in R ··· The SDH is a fundamental tool in the validation and analysis the domain: [r0,r1), [r1,r2), [r2,r3), , [rl−1,rl]. A range of particle simulation data. It serves as the main building block [ri,ri+1) in such series is called a bucket, and the span of the − of a series of critical quantities to describe a physical system. range ri+1 ri is called the width of the bucket. In this paper, Specifically, SDH is a direct estimation of a continuous statis- we focus our discussions on the case of standard SDH query tical distribution function called radial distribution functions where all buckets have the same width p and r0 =0,which ··· − (RDF) [6], [11], [12]. The RDF is defined as gives the following series of buckets: [0,p), [p, 2p), , [(l 1)p, lp]. Generally, the boundary of the last bucket lp is set N(r) to be the maximum distance of any pair of points. Although g(r)= 2 (1) 4πr δrρ almost all scientific data analysis only require the computation where N(r) is the number of atoms in the shell between r of standard SDH queries, our solutions can be easily extended and r + δr around any particle, ρ is the average density of to handle histograms with non-uniform bucket width and/or 2 particles in the whole system, and 4πr δr is the volume of arbitrary values of r0 and rl. The only complication of non- the shell. The RDF can be viewed as a normalized SDH. uniform bucket width is that, given a distance value, we need The RDF is of great importance in computation of thermo- O log l time to locate the corresponding bucket instead of dynamic characteristics of the system. Some of the important constant time in processing SDH with equal bucket width. quantities like total pressure, and energy cannot be calculated The SDH is basically a series of non-negative integers h = without g(r). For mono–atomic systems, the RDF can also (h1,h2, ··· ,hl) where hi (0 <i≤ l) is the number of pairs be directly related to the structure factor of the system [13]. of points whose distances are within the bucket [(i − 1)p, ip). In current solutions, we have to calculate distances between In Table I, we list the notations that are used throughout all pairs of particles and put the distances into bins with this paper. Note that symbols defined and referenced in a local a user-specified width, as done in state-of-the-art simulation context are not listed here. data analysis software packages [14], [12]. MS or N–body TABLE I techniques generally consist of large number of particles. For SYMBOLS AND NOTATIONS. example, the Virgo consortium has accomplished a simulation containing 10 billion particles to study the formation of Symbol Definition galaxies and quasars [15]. MS systems also hold up to millions p width of histogram buckets of atoms. Such scale of the simulated systems prohibits the l total number of histogram buckets analysis of large datasets following the brute-force approach. h the histogram with elements hi (0 <i≤ l) From a database viewpoint, it would be desirable to make SDH N total number of particles in data a basic query type with the support of scalable algorithms. i an index symbol for any series DM i B. Contributions and roadmap i the -th level density map d number of dimensions of data We claim the following contributions via this work: δ side length of a cell 1. We propose an innovative algorithm to solve the SDH S area of a region in 2D space problem based on a Quadtree-like data structure we call error bound for the approximate algorithm density map; H total level of density maps, i.e., tree height 2. We accomplish rigorous performance analysis of our 3 algorithm and prove its time complexity to be Θ N 2 5 3 and Θ N for 2D and 3D data, respectively; III. OUR APPROACH 3. Our analytical results on the algorithm give rise to an approximate SDH solution whose time complexity is A. Overview independent to the size of the dataset. In practice, this In processing SDH using the na¨ıve approach, the difficulty algorithm computes SDH with very low error rates. comes from the fact that the distance of any pair of points We continue this paper by formally defining the SDH prob- is calculated to determine which bucket a pair belongs to. lem and listing important notations in Section II; we introduce An important observation here is: a histogram bucket always our SDH processing algorithm in Section III; performance has a non-zero width. Given a pair of points, their bucket analysis of our algorithm is sketched in Section IV; we discuss membership could be determined if we only know a range an approximate SDH solution in Section V; Section VI is that the distance belongs to and this range is contained in a dedicated to experimental evaluation of our algorithms; we histogram bucket.