Scientific Data Management at the Johns Hopkins Institute for Data Intensive Engineering and Science

Yanif Ahmad, Randal Burns, Michael Kazhdan, Charles Meneveau, Alex Szalay, Andreas Terzis
{yanif,randal,misha,terzis}[email protected], {meneveau,szalay}[email protected]
Johns Hopkins University, Baltimore MD 21218, USA.
http://idies.jhu.edu

1. INTRODUCTION

Scientific computing has long been one of the deep and challenging applications of computer science and data management, from early endeavors in numerical simulation to recent undertakings in the life sciences, such as genome assembly. Complex computational problems abound and their solutions transform our understanding of the physical world. The data management community's interest in scientific applications has grown over the last decade due to the commoditization of parallelism, diminishing system administration costs, and a search for relevance beyond enterprise applications.

Research in scientific computing also raises non-technical challenges, such as overcoming the paucity of resources needed for experimentation, and establishing a collaborative research agenda that fosters a mutual appreciation of problems, results in a concerted effort to develop software tools, and makes all researchers successful in their respective fields. In light of this, we report on a recently formed institute at the Johns Hopkins University to further the interaction between computer science, and science and engineering. We describe ongoing projects at the institute and our collaboration experiences.

2. IDIES

The Institute for Data Intensive Engineering and Science (IDIES) was founded in April 2009 to bring together data intensive computing in a variety of science and engineering disciplines. The mission of IDIES is to foster the interdisciplinary development of tools and methods that derive knowledge from the massive datasets that today's instruments, experiments, and simulations generate at exponentially growing rates.

2.1 Research and Academic Model

IDIES includes faculty members drawn from a variety of disciplines ranging from Physics and Astronomy, Mechanical Engineering, the Sheridan Libraries, Applied Math, the School of Medicine, and the School of Public Health to the Human Language Technology Center of Excellence. The authors of this paper make up the Computer Science members of IDIES and two of their long-standing collaborators.

The establishment of a data intensive computing center has numerous benefits in addition to providing an umbrella for data-driven research. It acts as a beacon for the remainder of the Hopkins community and beyond: our members are the first port of call for anyone with big data problems from academia, industry, and government agencies in the vicinity and nationwide, such as the Johns Hopkins Applied Physics Laboratory, a not-for-profit research lab housing 3,000 engineers and scientists.

The Johns Hopkins University provides an extremely conducive environment for these efforts. The demographics of the university lead to an outward-looking faculty within departments. The model of research being conducted through interdisciplinary centers is the norm rather than the exception (e.g., the Human Language Technology Center of Excellence). This facilitates application-driven research, combining scientific domain knowledge with core computer systems research expertise. From an academic standpoint, the interdisciplinary model of IDIES percolates through to the students, engaging them in an application- and dataset-driven research agenda. Our agenda fosters a breadth of knowledge beyond data management and computer science, and encourages a hands-on approach to system development. This results in tangible, usable components that further the underlying scientific problem, as well as promoting an experimental approach to data management challenges.

This research model leads to the development of communities around individual datasets, and to the curation of these datasets in terms of both data cleaning and the functionality supported on top of each dataset. Most often, this process produces a public resource for widespread academic usage.

2.2 Scientific Data Workflow at IDIES

Traditionally, under the scientific method, basic research involves the formation of hypotheses and theories, designing experiments for their validation, collecting data through experimentation, and analyzing the data to guide new insights for further research through iteration of this workflow. All stages of this workflow loop exhibit heavy computational and data management needs, particularly experimentation, simulation, and analysis. The emergence of data-driven science has front-loaded much of this iterative workflow: an intense period of experimentation results in a substantially larger data volume for processing, followed by an intense period of exploration and analysis of the data. Both modes must address large dataset challenges.

Many of these challenges appear individually in other domains, but are greatly exacerbated in scientific computing by the diversity of complex, instance-specific semantics. The end result is an application domain with extremely high overheads and barriers to entry, through the combination of limited software reuse and the brittle infrastructures arising from ad-hoc composition of tools that are difficult to parameterize and generalize.

We have assembled a critical mass of researchers at IDIES with expertise spanning many aspects of the scientific research workflow: Randal Burns (storage systems and transaction processing), Andreas Terzis (sensor and wireless networks), Michael Kazhdan (computer graphics, mesh processing and geometric retrieval), and Yanif Ahmad (data management and declarative languages). We are extremely fortunate to work with scientists and engineers who have developed a deep appreciation of data management throughout their careers, including Alex Szalay (physics, astronomy) and Charles Meneveau (mechanical engineering, turbulence).

3. THE SLOAN DIGITAL SKY SURVEY

The focus on data-intensive computing at JHU started when the Department of Physics and Astronomy joined the Sloan Digital Sky Survey (SDSS) project, the astronomy equivalent of the "Human Genome Project." The SDSS created a high resolution multi-wavelength map of the Northern Sky with 2.5 trillion pixels of imaging. The results of the image segmentation were placed in a relational database that has grown to 12 terabytes over the last ten years and has created a new way to do astronomy. Based upon citation statistics, the database of the SDSS has been the most used astronomy facility in the world. A large fraction of the world's astronomy community has learned to use SQL to formulate their research questions and can run observations instantly, rather than waiting for months to get their turn at a telescope.

The original database was built in close collaboration between Alex Szalay's team at JHU and Jim Gray of Microsoft. The database is based on SQL Server, contains about 70 tables of data and descriptive metadata, and incorporates a substantial amount of astronomy code in the form of User Defined Functions. Among other things, we have built a high precision GIS system for astronomy, based on the Hierarchical Triangular Mesh, implemented as a set of class libraries, wrapped into C# functions callable from T-SQL.

Today about 30% of the world's professional astronomy community has a server-side database in this system. The two-stage parallel loading environment and the whole framework are now in use by many groups, both at JHU and all over the world, and have been repurposed from astronomy to many other disciplines, such as environmental science and radiation oncology. Most recently, the Pan-STARRS project, on the way to its first of many petabytes of data, builds upon a scaled-out version of the SDSS design.

4. APPLICATIONS AND DATASETS

IDIES research spans a wide variety of applications and datasets. We present a sample here.

4.1 Turbulence

Hydrodynamic turbulence is a formidably difficult and pressing problem. Direct numerical simulations of turbulence for the flow conditions prevalent in most engineering, geophysics, and astrophysics applications are impractical, yet most critical scientific problems, such as predicting the Earth's climate and developing energy sources, require reliable simulations of turbulent flows.

Typically, individual researchers perform large simulations that are analyzed during the computation, with only a small subset of the data stored for subsequent analysis. The majority of the time evolution is discarded. As a result, the same simulations must be repeated after new questions arise that were not initially obvious; most breakthrough concepts cannot be anticipated in advance. Thus, a new paradigm is emerging that creates large and easily accessible databases containing the full space-time history of simulated flows.

In IDIES, we have embarked on building such databases. The JHU public turbulence database houses a 27 TB database that contains the entire time history of a 1024^3 mesh point pseudo-spectral simulation.

sectors of the U.S. economy in terms of its energy consumption. According to a 2007 EPA report, U.S. data centers consumed 61 billion kWh in 2006, enough energy to power 5.8 million households. Under conservative estimates, IT energy consumption is projected to double by 2011. Only a fraction of the electricity consumed powers IT equipment. The rest is used by environmental control systems such
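The EPA figures quoted above are easy to sanity-check. A minimal sketch, assuming an average U.S. household uses roughly 10,500 kWh of electricity per year (that per-household figure is our assumption, not a number from the report):

```python
# Back-of-the-envelope check of the 2007 EPA data-center figures.
# Assumption (not from the text): an average U.S. household uses
# roughly 10,500 kWh of electricity per year.
DATA_CENTER_KWH_2006 = 61e9        # 61 billion kWh consumed in 2006
KWH_PER_HOUSEHOLD_YEAR = 10_500    # assumed average annual usage

households = DATA_CENTER_KWH_2006 / KWH_PER_HOUSEHOLD_YEAR
print(f"{households / 1e6:.1f} million households")  # ≈ 5.8 million

# "Projected to double by 2011" under conservative estimates:
projected_2011_kwh = 2 * DATA_CENTER_KWH_2006
print(f"{projected_2011_kwh / 1e9:.0f} billion kWh")  # 122 billion kWh
```

The two quoted numbers (61 billion kWh, 5.8 million households) are mutually consistent under that assumed per-household usage.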
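The turbulence-database paradigm of Section 4.1 — querying a stored space-time history at arbitrary points instead of re-running the simulation — can be sketched with a toy interpolation routine. This is an illustrative sketch only, not the JHU database's actual interface; the grid size, field, and function names are invented for the example:

```python
import numpy as np

def sample_velocity(field, x, y, z):
    """Trilinearly interpolate a periodic velocity field stored on a
    uniform N^3 grid at a continuous point (x, y, z) in grid units.
    `field` has shape (N, N, N, 3): one velocity vector per mesh point."""
    n = field.shape[0]
    i0, j0, k0 = (int(np.floor(c)) % n for c in (x, y, z))
    fx, fy, fz = x - np.floor(x), y - np.floor(y), z - np.floor(z)
    v = np.zeros(3)
    for di in (0, 1):                    # visit the 8 surrounding mesh points
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((fx if di else 1 - fx)
                     * (fy if dj else 1 - fy)
                     * (fz if dk else 1 - fz))
                v += w * field[(i0 + di) % n, (j0 + dj) % n, (k0 + dk) % n]
    return v

# Toy "snapshot": an 8^3 grid whose u-component grows linearly in x.
n = 8
snapshot = np.zeros((n, n, n, 3))
snapshot[..., 0] = np.arange(n, dtype=float)[:, None, None]  # u = x

print(sample_velocity(snapshot, 2.5, 3.0, 4.0))  # -> [2.5, 0.0, 0.0]
```

A production system stores thousands of such snapshots and answers these point queries server-side, so that "virtual sensors" can be placed anywhere in the flow's history.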
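The kind of spatial query that Section 3's HTM-backed UDFs accelerate is a cone search: find all objects within a small angular radius of a sky position. The SQL in the comment below is only indicative of the SkyServer style; the Python computes the underlying angular-distance predicate directly on a tiny invented catalog (the object IDs and coordinates are made up for illustration):

```python
import math

# Astronomers pose SkyServer queries along roughly these lines:
#   SELECT p.objID, p.ra, p.dec
#   FROM PhotoObj AS p
#   JOIN dbo.fGetNearbyObjEq(185.0, -0.5, 1.0) AS n ON p.objID = n.objID
# The HTM index prunes the sky; the predicate it evaluates is the
# great-circle distance computed below.

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance between two (ra, dec) points, in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_d = (math.sin(dec1) * math.sin(dec2)
             + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog entries within radius_deg of (ra, dec)."""
    return [obj for obj in catalog
            if angular_distance_deg(obj["ra"], obj["dec"], ra, dec) <= radius_deg]

# Invented toy catalog, not real SDSS data.
catalog = [
    {"objID": 1, "ra": 185.00, "dec": -0.50},
    {"objID": 2, "ra": 185.01, "dec": -0.49},
    {"objID": 3, "ra": 190.00, "dec": 2.00},
]
print([o["objID"] for o in cone_search(catalog, 185.0, -0.5, 0.1)])  # -> [1, 2]
```

A linear scan like this is O(N) per query; the point of the Hierarchical Triangular Mesh is to index the sphere so the database touches only the trixels overlapping the cone.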