Scientific Management at the Johns Hopkins Institute for Data Intensive Engineering and Science

Yanif Ahmad, Randal Burns, Michael Kazhdan, Charles Meneveau, Alex Szalay, Andreas Terzis
{yanif,randal,misha,terzis}@cs.jhu.edu, {meneveau,szalay}@jhu.edu
The Johns Hopkins University, Baltimore, MD 21218, USA. http://idies.jhu.edu

1. INTRODUCTION

Scientific computing has long been one of the deep and challenging applications of computer science and data management, from early endeavors in numerical simulation to recent undertakings in the life sciences, such as genome assembly. Complex computational problems abound and their solutions transform our understanding of the physical world. The data management community's interest in scientific applications has grown over the last decade due to the commoditization of parallelism, diminishing system administration costs, and a search for relevance beyond enterprise applications.

Research in scientific computing also raises non-technical challenges, such as overcoming the paucity of resources needed for experimentation, and establishing a collaborative research agenda that fosters a mutual appreciation of problems, results in a concerted effort to develop software tools, and makes all researchers successful in their respective fields. In light of this, we report on a recently formed institute at the Johns Hopkins University that furthers the interaction between computer science, and science and engineering. We describe ongoing projects at the institute and our collaboration experiences.

2. IDIES

The Institute for Data Intensive Engineering and Science (IDIES) was founded in April 2009 to bring together data intensive computing in a variety of science and engineering disciplines. The mission of IDIES is to foster the interdisciplinary development of tools and methods that derive knowledge from the massive datasets that today's instruments, experiments, and simulations generate at exponentially growing rates.

2.1 Research and Academic Model

IDIES includes faculty members drawn from a variety of disciplines, ranging from Physics and Astronomy, Mechanical Engineering, the Sheridan Libraries, and Applied Math, to the School of Medicine, the School of Public Health, and the Human Language Technology Center of Excellence. The authors of this paper make up the Computer Science members of IDIES and two of their long-standing collaborators.

The establishment of a data intensive computing center has numerous benefits in addition to providing an umbrella for data-driven research. It acts as a beacon for the remainder of the Hopkins community and beyond: our members are the first port of call for anyone with such problems, from academia, industry, and government agencies in the vicinity and nationwide, such as the Johns Hopkins Applied Physics Laboratory, a not-for-profit research lab housing 3,000 engineers and scientists.

The Johns Hopkins University provides an extremely conducive environment for these efforts. The demographics of the university lead to an outward-looking faculty within departments. The model of research being conducted through interdisciplinary centers is the norm rather than the exception (e.g., the Human Language Technology Center of Excellence). This facilitates application-driven research, combining scientific domain knowledge with core computer systems research expertise. From an academic standpoint, the interdisciplinary model of IDIES percolates through to the students, engaging them in an application- and dataset-driven research agenda. Our agenda fosters a breadth of knowledge beyond data management and computer science, and encourages a hands-on approach to system development. This results in tangible, usable components that further the underlying scientific problem, as well as promoting an experimental approach to data management challenges.

This research model leads to the development of communities around individual datasets and to the curation of these datasets, both in terms of data cleaning and of the functionality supported on top of the dataset. Most often, this process produces a public resource for widespread academic usage.
2.2 Scientific Data Workflow at IDIES

Traditionally, under the scientific method, basic research involves forming hypotheses and theories, designing experiments for their validation, collecting data by carrying out those experiments, and analyzing the data to guide new insights for further research through iteration of this workflow. All stages of this workflow loop exhibit heavy computational and data management needs, particularly experimentation, simulation and analysis. The emergence of data-driven science has front-loaded much of this iterative workflow: an intense period of experimentation results in a substantially larger data volume for processing, followed by an intense period of exploration and analysis of the data. Both modes must address large dataset challenges.

Many of these challenges appear individually in other domains, but are greatly exacerbated in scientific computing by the diversity of complex, instance-specific semantics. The end result is an application domain with extremely high overheads and barriers to entry, through the combination of limited software reuse and the brittle infrastructures arising from ad-hoc composition of tools that are difficult to parameterize and generalize.

We have assembled a critical mass of researchers at IDIES with expertise spanning many aspects of the scientific research workflow: Randal Burns (storage systems and transaction processing), Andreas Terzis (sensor and wireless networks), Michael Kazhdan (computer graphics, mesh processing and geometric retrieval) and Yanif Ahmad (data management and declarative languages). We are extremely fortunate to work with scientists and engineers who have developed a deep appreciation of data management throughout their careers, including Alex Szalay (physics, astronomy) and Charles Meneveau (mechanical engineering, turbulence).

3. THE SLOAN DIGITAL SKY SURVEY

The focus on data-intensive computing at JHU started when the Department of Physics and Astronomy joined the Sloan Digital Sky Survey (SDSS) project, the astronomy equivalent of the "Human Genome Project." The SDSS created a high resolution, multi-wavelength map of the Northern Sky with 2.5 trillion pixels of imaging. The results of the image segmentation were placed in a relational database that has grown to 12 terabytes over the last ten years and has created a new way to do astronomy. Based upon citation counts, the database of the SDSS has been the most used astronomy facility in the world. A large fraction of the world's astronomy community has learned to use SQL to formulate their research questions and can run observations instantly, rather than waiting for months to get their turn at a telescope.

The original database was built in close collaboration between Alex Szalay's team at JHU and Jim Gray of Microsoft Research. The database is based on SQL Server, contains about 70 tables of data and descriptive metadata, and incorporates a substantial amount of code in the form of User Defined Functions. Among other things, we have built a high precision GIS system for astronomy, based on the Hierarchical Triangular Mesh, implemented as a set of class libraries wrapped into C# functions callable from T-SQL.

Today about 30% of the world's professional astronomy community has a server-side database in this system. The two-stage parallel loading environment and the whole framework are now in use by many groups, both at JHU and all over the world, and have been repurposed from astronomy to many other disciplines, such as environmental science and radiation oncology. Most recently, the Pan-STARRS project, on the way to its first of many petabytes of data, builds upon a scaled-out version of the SDSS design.
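As an illustration of how astronomers use the system, the query below is a cone search in the style of the public SDSS SkyServer: it selects objects within two arcminutes of a sky position using an HTM-backed spatial function and joins them with the photometric catalog. The table and function names follow the published SkyServer schema, but the example is ours and exact signatures may differ across data releases.

```sql
-- Illustrative SkyServer-style cone search (schema names per the public SDSS
-- release; exact signatures may vary). Returns the ten nearest photometric
-- objects within 2 arcminutes of (ra, dec) = (185.0, -0.5) degrees.
SELECT TOP 10
       p.objID, p.ra, p.dec, p.type,
       p.u, p.g, p.r,               -- magnitudes in three of the five SDSS bands
       n.distance                   -- angular distance in arcminutes
FROM   dbo.fGetNearbyObjEq(185.0, -0.5, 2.0) AS n  -- HTM-backed spatial lookup
JOIN   PhotoObj AS p ON p.objID = n.objID
ORDER BY n.distance;
```

Interactive queries of this kind run directly against the catalog; longer-running analyses are submitted as batch jobs through CasJobs, discussed in Section 5.3.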
4. APPLICATIONS AND DATASETS

IDIES research spans a wide variety of applications and datasets. We present a sample here.

4.1 Turbulence

Hydrodynamic turbulence is a formidably difficult and pressing problem. Most critical scientific problems, such as predicting the Earth's climate and developing energy sources, require reliable simulations of turbulent flows. However, direct numerical simulations of turbulence for the flow conditions prevalent in most engineering, geophysics, and astrophysics applications are impractical.

Typically, individual researchers perform large simulations that are analyzed during the computation, with only a small subset of the data stored for subsequent analysis. The majority of the time evolution is discarded. As a result, the same simulations must be repeated after new questions arise that were not initially obvious; most breakthrough concepts cannot be anticipated in advance. Thus, a new paradigm is emerging that creates large and easily accessible databases containing the full space-time history of simulated flows.

In IDIES, we have embarked on building such databases. The JHU public turbulence database houses a 27 TB database that contains the entire time history of a 1024^3 mesh point pseudo-spectral direct numerical simulation of forced isotropic turbulence [15, 12]. 1024 time-steps have been stored, covering a full "large-eddy" turnover time. A Web service fulfills user requests for velocities, pressure, and various space-derivatives and interpolation functions (see [7]). The 1024^4 isotropic turbulence database has been in operation continuously since 2008 and has served more than 50 billion point queries for research and teaching purposes.

Figure 1: Contour and surface plot of the dissipation field in isotropic turbulence as determined from a 2D cut through the 4D data from the JHU public turbulence database.
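Conceptually, users issue point queries against the space-time history: given a time and a position, the service interpolates the stored fields and returns, for example, the three velocity components. The sketch below only illustrates that interface in SQL terms; the function name and signature are hypothetical, and the public interface is the Web service of [7] rather than direct SQL access.

```sql
-- Hypothetical sketch of a turbulence point query; the real interface is the
-- Web service of [7], and this table-valued function does not exist verbatim.
-- Requests the interpolated velocity (ux, uy, uz) at time t = 0.364 and
-- position (x, y, z) = (3.14, 1.57, 0.78), using high-order interpolation.
SELECT ux, uy, uz
FROM dbo.GetVelocity(0.364, 3.14, 1.57, 0.78, 'Lag6');
```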
4.2 Life Under Your Feet

Lack of field measurements, collected over long periods of time and at biologically significant spatial granularity, hinders scientific understanding of the effects of environmental conditions on the soil ecosystem. Wireless Sensor Networks (WSNs) promise to address ecologists' predicament through a fountain of readings from low-cost sensors deployed with minimal disturbance to the monitored site.

In the fall of 2005 we built a proof-of-concept WSN to validate this claim. The end-to-end Life Under Your Feet (LUYF) system includes motes that collect environmental parameters such as soil moisture and temperature; static and mobile gateways that periodically download collected measurements through the Koala reliable transfer protocol [13]; a database that stores collected measurements; access tools for analyzing the data; a Web site that serves the data; and tools to monitor the network. LUYF has been deployed in multiple forests, in networks whose sizes range from ten to fifty nodes, for periods from a couple of weeks to more than two years. The LUYF database contains more than 120 million measurements, data derivatives, and the provenance of stored data.
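The database distinguishes raw readings from derived, calibrated values and records how each derivative was produced. The schema below is a hypothetical, simplified sketch of that organization, not the actual LUYF schema.

```sql
-- Hypothetical, simplified sketch of a LUYF-style schema (the deployed schema
-- differs): raw samples, derived values, and the provenance linking them.
CREATE TABLE measurement (
  mote_id     INT          NOT NULL,  -- sensor node that took the sample
  sensor      VARCHAR(32)  NOT NULL,  -- e.g. 'soil_moisture', 'soil_temperature'
  sample_time DATETIME     NOT NULL,  -- when the sample was taken
  raw_value   FLOAT        NOT NULL,  -- uncalibrated reading reported by the mote
  PRIMARY KEY (mote_id, sensor, sample_time)
);

CREATE TABLE derived_value (
  mote_id     INT          NOT NULL,
  sensor      VARCHAR(32)  NOT NULL,
  sample_time DATETIME     NOT NULL,
  value       FLOAT        NOT NULL,  -- calibrated or gap-filled value
  method      VARCHAR(64)  NOT NULL,  -- provenance: calibration/model that produced it
  FOREIGN KEY (mote_id, sensor, sample_time)
    REFERENCES measurement (mote_id, sensor, sample_time)
);
```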
4.3 DC Genome

The IT industry is one of the fastest growing sectors of the U.S. economy in terms of its energy consumption. According to a 2007 EPA report, U.S. data centers consumed 61 billion kWh in 2006, enough energy to power 5.8 million households. Under conservative estimates, IT energy consumption is projected to double by 2011. Only a fraction of the electricity consumed powers IT equipment. The rest is used by environmental control systems such as air conditioning, (de-)humidifiers, and water chillers, or is lost during delivery.

A key reason for this inefficiency is the lack of visibility into data center operating conditions. Conventional wisdom dictates that IT equipment needs excessive cooling to operate reliably, so the AC systems are set very cool and fan speeds are set high to reduce the danger of hot spots. Also, when servers issue thermal alarms, data centers have limited means to diagnose and remedy the problem other than further decreasing the temperature.

Given the data center's complex airflow and thermodynamics, dense and real-time environmental monitoring is necessary to improve energy efficiency. The data collected can help operators troubleshoot thermal alarms, make intelligent decisions on rack layout and server deployments, and innovate on facility management.

Wireless sensor network technology is an ideal candidate for this monitoring task as it is low-cost, nonintrusive, can provide wide coverage, and can be easily repurposed. The Data Center (DC) Genome project, a collaborative effort with Microsoft Research, aims to understand how energy is consumed in data centers as a function of facility design, cooling supply, server hardware, and workload distribution, using data collected from large-scale sensor networks, and to use this understanding to optimize and control data center resources.

4.4 Data Conservancy

Research projects have finite lifetimes, and the data products they produce need to persist long after the community of researchers has disbanded and funding has ceased. The preservation of scientific data makes scientific discovery available to future researchers and preserves our investment in the sciences. However, the data management community has yet to define self-sustaining preservation data management platforms that are widely available, and many critical data sets are being lost. For example, the completed SDSS sky survey represents 8 years of telescope time and has produced more than 4 TB of curated data. As of now, the project is complete and scientists and staff are moving on.

At JHU, we have undertaken the long-term preservation of scientific data in the Data Conservancy project, an NSF-funded Sustainable Digital Data Preservation and Access Network (DataNet) project. Preservation targets include earth science, biology, and social science data in addition to astronomy and the SDSS. The Data Conservancy will create a preservation environment that retains not only the data, but also the metadata and queries that express how scientists interact with the data. Scientists now and in the future will be able to query SDSS data to repeat and validate previous findings and continue the exploration of the data.

5. RESEARCH PROJECTS

We now survey the data management techniques inspired by the above applications.

5.1 Instrumenting the Real World with Sensor Networks

Sensor networks deployed to collect scientific data (e.g., [14, 16, 19]) have been shown to be plagued with measurement faults. These faults must be detected to prevent pollution of the experiment and waste of network resources. At the same time, networks should autonomously adapt to sensed events, for example by increasing sampling rates or raising alarms. Here, events are measurements that deviate from "normal" data patterns, yet represent features of the underlying phenomenon, such as rain events in the case of soil moisture measurements.

However, detection algorithms tailored to specific types of faults lead to false positives when exposed to multiple types of faults [6]. More importantly, algorithms which classify measurements that deviate from the recent past as faulty tend to misclassify events as faults [6]. This misclassification is particularly undesirable because measurements from episodic events are invaluable data for domain scientists and should be given the highest priority.

In our recent work we unified fault and event detection under a more general anomaly detection framework, in which online algorithms classify measurements that significantly deviate from a learned model of the data as anomalies [5]. We avoid misclassification by including punctuated, yet infrequent, events in the training set, thus allowing the system to distinguish faults from events of interest. Specifically, we developed anomaly detection frameworks based on Echo State Networks as well as Bayesian Networks and implemented these frameworks on a mote-class device. We showed that learning-based techniques are more sensitive to subtler faults and generate fewer false positives than rule-based fault detection techniques.
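As a rough sketch of the Echo State Network variant (the notation and thresholding rule here are ours, simplified from [5]): a fixed random reservoir is driven by the measurement stream, a trained linear readout predicts the next measurement, and a sample is flagged as anomalous when its prediction error exceeds a threshold derived from the training data.

```latex
% Simplified sketch of ESN-based anomaly detection (notation ours; see [5]).
\begin{align*}
  \mathbf{x}_t &= \tanh\!\bigl(\mathbf{W}^{\mathrm{in}}\, u_t + \mathbf{W}\,\mathbf{x}_{t-1}\bigr)
      && \text{reservoir update; } \mathbf{W}^{\mathrm{in}}, \mathbf{W} \text{ fixed and random} \\
  \hat{u}_{t+1} &= \mathbf{W}^{\mathrm{out}}\,\mathbf{x}_t
      && \text{linear readout trained on fault-free data plus known events} \\
  \lvert u_{t+1} - \hat{u}_{t+1} \rvert &> \tau
      && \text{flag } u_{t+1} \text{ as an anomaly; } \tau \text{ set from training error}
\end{align*}
```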
5.2 Incremental and Model-Based Continuous Queries

Data-driven stream processing arises naturally in scientific applications given continuous data acquisition. Here at Johns Hopkins we are studying the incremental foundations of stream processing, and investigating the use of mathematical representations of data in model-based stream engines.

The DBToaster [9] project investigates incremental query processing of large dynamic data workloads. While stream processors answer queries over recent, contiguous windows of data streams, a dynamic data management system has to handle large, arbitrarily long-lived, non-contiguous state. A dynamic data management system is well-suited for streaming analysis that combines streaming data with persistent data, for example to detect stream outliers in scientific applications with the aid of a historical database.

DBToaster transforms continuous queries to be evaluated as incrementally as possible. DBToaster applies the concept of delta queries, as found in incremental view maintenance, recursively, yielding higher-order delta queries much as with higher-order derivatives in calculus. By intelligently materializing and reusing higher-order deltas, we avoid repeated computation of delta queries.
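To make the idea concrete, the sketch below (illustrative T-SQL of our own, not DBToaster's generated code; the tables R(k, a) and S(k, b) and all names are hypothetical) maintains the join aggregate Q = SUM(R.a * S.b) incrementally. The first-order delta for an insertion into R still contains a query over S, so that subquery is materialized as an auxiliary map; the second-order deltas that maintain the maps are constants, so each insertion costs a constant number of lookups rather than a re-evaluation of the join.

```sql
-- Illustrative higher-order incremental maintenance (not DBToaster output) for
--   Q = SELECT SUM(R.a * S.b) FROM R JOIN S ON R.k = S.k
-- over hypothetical tables R(k, a) and S(k, b).
CREATE TABLE mR (k INT PRIMARY KEY, sum_a FLOAT NOT NULL);  -- SUM(R.a) per key
CREATE TABLE mS (k INT PRIMARY KEY, sum_b FLOAT NOT NULL);  -- SUM(S.b) per key
CREATE TABLE q  (total FLOAT NOT NULL);                     -- current answer to Q
INSERT INTO q VALUES (0);
GO
-- Handler for INSERT INTO R VALUES (@k, @a); insertions into S are symmetric.
CREATE PROCEDURE on_insert_R @k INT, @a FLOAT AS
BEGIN
  -- First-order delta of Q, answered from the materialized map mS in O(1).
  UPDATE q SET total = total + @a * COALESCE((SELECT sum_b FROM mS WHERE k = @k), 0);
  -- Maintain mR using its own (constant, second-order) delta.
  UPDATE mR SET sum_a = sum_a + @a WHERE k = @k;
  IF @@ROWCOUNT = 0 INSERT INTO mR VALUES (@k, @a);
END
```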
The Pulse [2] project investigates the use of piecewise polynomials in a traditional stream processing engine. The input data stream is represented as a piecewise polynomial, while queries expressed in a declarative, first-order-logic-based language such as SQL are transformed into a series of systems of equations. Query processing then involves solving these equation systems. Streams represented by mathematical models can interpolate missing data, for example from sensor and network failures, or extrapolate data for predictive query processing and "what-if" queries. Furthermore, polynomial streams are a highly compact data representation and can often be processed extremely efficiently.

5.3 Batch Processing: Data-Driven Exploration and Analysis

Publicly-available data-intensive scientific services experience a tragedy of the commons. User workloads are I/O bound and concurrent workloads interfere with each other, creating congestion. As our scientific data services become more popular, the user experience inevitably degrades. Jim Gray expressed this best in the early days of the Sloan Digital Sky Survey when he said, "The only things that we have to fear are success and failure."

The Sloan Digital Sky Survey uses data replication, workload classification, and server storage to handle long-running, data-intensive queries. The Catalog Archive Server Jobs (CasJobs) [11] system defined a separate instance of the SDSS database for asynchronous queries, allowing unrestricted SQL access to the SDSS and storing query results on server local storage for subsequent analysis. This insulated power-users from casual, interactive use and made long-running queries faster and more reliable: jobs transfer results to local storage and job completion does not rely on client-server connectivity.

Our subsequent efforts focused on increasing the throughput of concurrent jobs based on I/O sharing among queries. I/O cannot be provisioned incrementally to meet increased demand, e.g. by adding servers, nor is caching effective for data-intensive scientific queries that scan large data sets. Instead, we identify queries that share data, e.g. that scan the same relation, and execute those queries concurrently. For declarative astronomy queries in which data may be processed in any order, the LifeRaft scheduler [20] co-schedules queries against an ordering of the data that maximizes data sharing. In LifeRaft, each data region may be accessed a single time to compute the partial results of all queries that use that data. The Job-Aware Workload Scheduler (JAWS) [21] extends LifeRaft to support workflows in which queries exhibit data dependencies. This is typical of queries to the Turbulence Database Cluster, in which the results derived from one timestep of simulation are used as input to the next timestep. Both LifeRaft and JAWS avoid query starvation through adaptive and incremental trade-offs between query throughput and response time. Data-driven batch scheduling typically improves throughput by a factor of four or more.

At present, we are collaborating with Yahoo! to combine LifeRaft and JAWS with their work on shared scans for Hadoop and Pig workloads [1].

5.4 Data-Intensive Architectures

In deploying scientific databases, IDIES has designed and built several data-intensive clusters. The GrayWulf system [18] represents the evolution of the SDSS architecture to a scalable, multi-tenant database cluster. The name pays homage to Jim Gray, who defined this class of computing, and references Beowulf clusters. GrayWulf provides cluster management tools to manage workflows, localize computation to data, monitor system status, and recover from faults. The GrayWulf system won the Supercomputing Storage Challenge in 2008 and, as part of that competition, demonstrated a sustained data rate of 70 GB/s for a parallel SQL workload.

The cost and power consumption of the GrayWulf scale linearly with storage size and will therefore soon hit a power consumption wall as scientific data sets continue to increase in size. To resolve this challenge, we recently proposed an alternative architecture comprising a large number of so-called Amdahl blades that combine energy-efficient CPUs with solid state disks to increase sequential read I/O throughput by an order of magnitude while keeping power consumption constant [17]. The same preliminary study also showed that, while keeping the total cost of ownership constant, Amdahl blades offer five times the throughput of the GrayWulf system.

5.5 Fitting Models to Data: Large Surface Reconstruction

Using state-of-the-art laser range scanners, it has now become possible to acquire 3D data at remarkably high resolution. These advances in scanning technology enable sub-millimeter resolution scans from Stanford's Digital Michelangelo Project [10], allowing art historians to make out the details of individual chisel marks and reason about sculpting techniques without having to go to Florence. Similarly, with the ten centimeter resolution LIDAR fly-by over New York City, environmentalists can plan the city's solar power capacity.

One of the challenges here is transforming the data returned by the 3D scanner into a coherent 3D model; specifically, fitting a water-tight surface to the set of disjoint point-samples returned by the scanner (see Figure 2). Research at Johns Hopkins, in collaboration with Microsoft Research, addresses this challenge by reducing the fitting of a surface to the scan data to solving a 3D Poisson equation [8].

Figure 2: Visualization of the raw data returned by 3D scanners (left) and a surface reconstruction fit to the data (right).
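In this formulation [8], the oriented point samples are treated as samples of a vector field V (the surface normal field of the shape), and one solves for an indicator function chi whose gradient best matches V; the least-squares optimum is characterized by a Poisson equation, and the surface is then extracted as an iso-surface of chi.

```latex
% Poisson surface reconstruction [8]: fit the gradient of an indicator
% function \chi to the (smoothed) normal field \vec{V} of the oriented samples.
\min_{\chi} \;\bigl\lVert \nabla \chi - \vec{V} \bigr\rVert^{2}
\qquad\Longrightarrow\qquad
\Delta \chi \;=\; \nabla \cdot \vec{V}
```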
To provide a tractable solution for large models, we have developed a new multigrid technique that supports the solution of Poisson equations formulated over an octree adapted to the scanned points. This solver retains the benefits derived from solving a global system (providing robustness in the presence of missing data and noise), is capable of reconstructing models at the resolution of the input data, and has a space/time complexity that is linear in the size of the input. We have extended the approach with a partial ordering of octree nodes along an axis to provide both out-of-core [3] and distributed processing [4]. We have reconstructed models of up to one billion points (e.g. the David dataset from [10]) on a distributed cluster in just half a day.

6. SUMMARY

Data-intensive scientific computing has emerged as a theme around which we can engage in outreach for data management. We have enjoyed a rewarding intellectual experience here at IDIES, and champion this mode of research across the broader computer science research community. We encourage any science or engineering students interested in studying, or researchers seeking collaborations on data intensive challenges, to explore the institute's webpage (http://idies.jhu.edu) and the authors' webpages, and to contact an author via email.

Acknowledgements. We thank the Whiting School of Engineering and the Krieger School of Arts and Sciences at the Johns Hopkins University for supporting the mission of IDIES. We also thank Microsoft, the Alfred P. Sloan Foundation, the Gordon and Betty Moore Foundation, NVIDIA and Yahoo!. This material is based upon work supported by the National Science Foundation under Grant Nos. AST-0939767, OCI-1040114, CMMI-0941530, CCF-0937810, CMMI-0923018, OCI-0830876, AST-0428325, IIS-0430848, CNS-0546648, OCI-0937947, AST-0225645, AST-0122449, CNS-0720730, CCF-0746039. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES

[1] P. Agrawal, D. Kifer, and C. Olston. Scheduling shared scans of large data files. In VLDB, 2008.
[2] Y. Ahmad, O. Papaemmanouil, U. Çetintemel, and J. Rogers. Simultaneous equation systems for query processing on continuous-time data streams. In ICDE, 2008.
[3] M. Bolitho, M. Kazhdan, R. Burns, and H. Hoppe. Multilevel streaming for out-of-core surface reconstruction. In Symposium on Geometry Processing, 2007.
[4] M. Bolitho, M. Kazhdan, R. Burns, and H. Hoppe. Parallel Poisson surface reconstruction. In International Symposium on Visual Computing, 2009.
[5] M. Chang, A. Terzis, and P. Bonnet. Mote-based online anomaly detection using Echo State Networks. In DCOSS, 2009.
[6] J. Gupchup, A. Sharma, A. Terzis, R. Burns, and A. Szalay. The perils of detecting measurement faults in environmental monitoring networks. In DCOSS, 2008.
[7] JHU Turbulence Database. http://turbulence.pha.jhu.edu.
[8] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Symposium on Geometry Processing, 2006.
[9] O. Kennedy, Y. Ahmad, and C. Koch. DBToaster: Agile views for a dynamic data management system. In CIDR, 2011.
[10] M. Levoy et al. The Digital Michelangelo Project: 3D scanning of large statues. In SIGGRAPH, 2000.
[11] N. Li and A. Thakar. CasJobs and MyDB: A batch query workbench. Computing in Science and Engineering, 10(1), 2008.
[12] Y. Li, E. Perlman, M. Wang, Y. Yang, C. Meneveau, R. Burns, S. Chen, A. Szalay, and G. Eyink. A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Turbulence, 9(31), 2008.
[13] R. Musaloiu-E., C.-J. Liang, and A. Terzis. Koala: Ultra-low power data retrieval in wireless sensor networks. In IPSN, 2008.
[14] R. Musaloiu-E., A. Terzis, K. Szlavecz, A. Szalay, J. Cogan, and J. Gray. Life Under Your Feet: A WSN for soil ecology. In EmNets Workshop, 2006.
[15] E. A. Perlman, R. C. Burns, Y. Li, and C. Meneveau. Data exploration of turbulence simulations using a database cluster. In Supercomputing, 2007.
[16] L. Selavo, A. Wood, Q. Cao, T. Sookoor, H. Liu, A. Srinivasan, Y. Wu, W. Kang, J. Stankovic, D. Young, and J. Porter. LUSTER: Wireless sensor network for environmental research. In ACM SenSys, 2007.
[17] A. Szalay, G. Bell, H. Huang, A. Terzis, and A. White. Low-power Amdahl-balanced blades for data intensive computing. In Proceedings of the 2nd Workshop on Power Aware Computing and Systems (HotPower '09), 2009.
[18] A. S. Szalay et al. GrayWulf: Scalable clustered architecture for data intensive computing. In HICSS, 2009.
[19] G. Tolle, J. Polastre, R. Szewczyk, N. Turner, K. Tu, P. Buonadonna, S. Burgess, D. Gay, W. Hong, T. Dawson, and D. Culler. A macroscope in the redwoods. In ACM SenSys, 2005.
[20] X. Wang, R. Burns, and T. Malik. LifeRaft: Data-driven, batch processing for the exploration of scientific databases. In CIDR, 2009.
[21] X. Wang, E. Perlman, R. Burns, T. Malik, T. Budavári, C. Meneveau, and A. Szalay. JAWS: Job-aware workload scheduling for the exploration of turbulence simulations. In Supercomputing, 2010.