GUEST EDITORS’ INTRODUCTION - Data-Intensive Computing in the 21st Century

Ian Gorton, Pacific Northwest National Laboratory; Paul Greenfield, CSIRO; Alex Szalay, Johns Hopkins University; Roy Williams, Caltech

The deluge of data that future applications must process, in domains ranging from science to business informatics, creates a compelling argument for substantially increased R&D targeted at discovering scalable hardware and software solutions for data-intensive problems.

In 1998, William Johnston delivered a paper at the 7th IEEE Symposium on High-Performance Distributed Computing1 that described the evolution of data-intensive computing over the previous decade. While state of the art at the time, the achievements described in that paper seem modest in comparison to the scale of the problems researchers now routinely tackle in present-day data-intensive computing applications.

More recently, others including Tony Hey and Anne Trefethen,2 Gordon Bell and colleagues,3 and Harvey Newman and colleagues4 have described the magnitude of the data-intensive problems that the e-science community faces today and in the near future. Their descriptions of the data deluge that future applications must process, in domains ranging from science to business informatics, create a compelling argument for R&D to be targeted at discovering scalable hardware and software solutions for data-intensive problems. While petabyte datasets and gigabit data streams are today’s frontiers for data-intensive applications, no doubt 10 years from now we’ll fondly reminisce about problems of this scale and be worrying about the difficulties that the looming exascale applications are posing.

Fundamentally, data-intensive applications face two major challenges:

• managing and processing exponentially growing data volumes, often arriving in time-sensitive streams from arrays of sensors and instruments, or as the outputs from simulations; and
• significantly reducing data analysis cycles so that researchers can make timely decisions.

There is undoubtedly an overlap between data- and compute-intensive problems. Figure 1 shows a simple diagram that can be used to classify the application space between these problems.

Purely data-intensive applications process multi-terabyte to petabyte-sized datasets. This data commonly comes in several different formats and is often distributed across multiple locations. Processing these datasets typically takes place in multiple-step analytical pipelines that include transformation and fusion stages. Processing requirements typically scale near-linearly with data size and are often amenable to straightforward parallelization. Key research issues involve data management, filtering and fusion techniques, and efficient querying and distribution.
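To make the idea of a multiple-step analytical pipeline concrete, the following minimal Python sketch shows a transformation stage applied in parallel over data partitions, followed by a fusion stage. The stages and data here are hypothetical placeholders, not drawn from any system discussed in this issue; the point is only that independent per-partition work parallelizes in the straightforward way such near-linear workloads allow.

from multiprocessing import Pool

def transform(partition):
    """Transformation stage: process the records in one partition."""
    return [record * 2 for record in partition]  # placeholder computation

def fuse(parts):
    """Fusion stage: combine per-partition results into one dataset."""
    fused = []
    for part in parts:
        fused.extend(part)
    return fused

if __name__ == "__main__":
    # Toy partitions standing in for multi-terabyte chunks spread
    # across storage nodes.
    partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

    # Each partition is independent, so the transformation stage
    # parallelizes trivially and scales near-linearly with data size.
    with Pool() as pool:
        transformed = pool.map(transform, partitions)

    print(fuse(transformed))  # [2, 4, 6, 8, 10, 12, 14, 16, 18]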

[Figure 1. Data-intensive computing. Data-intensive computing research encompasses the problems in the upper two quadrants. The horizontal axis plots computational complexity, from simple search through statistical models and decision support to model solvers, simulations, and knowledge generation; the vertical axis plots data complexity (volume/format/distribution), from homogeneous-format, nondistributed, Mbyte-to-Gbyte data up to heterogeneous-format, distributed, petabyte data. Quadrants: current problems (lower left), compute-intensive problems (lower right), data-intensive problems (upper left), and data/compute-intensive problems (upper right).]

Data/compute-intensive problems combine the need to process very large datasets with increased computational complexity. Processing requirements typically scale superlinearly with data size and require complex searches and fusion to produce key insights from the data. Application requirements may also place time bounds on producing useful results. Key research issues include new algorithms, signature generation, and specialized processing platforms such as hardware accelerators.

We view data-intensive computing research as encompassing the problems in the upper two quadrants in Figure 1. The following are some applications that exhibit these characteristics.

Astronomy. The Large Synoptic Survey Telescope (LSST; www.lsst.org) will generate several petabytes of new image and catalog data every year. The Square Kilometer Array (SKA; www.skatelescope.org) will generate about 200 Gbytes of raw data per second that will require petaflops (or possibly exaflops) of processing to produce detailed radio maps of the sky. Processing this volume of data and making it available in a useful form to the scientific community poses highly challenging problems.

Cybersecurity. Anticipating, detecting, and responding to cyberattacks requires intrusion-detection systems to process network packets at gigabit speeds. Ideally, such systems should provide actionable results in seconds to minutes rather than hours so that operators can defend against attacks as they occur.
Social computing. Sites such as the Internet Archive (www.archive.org/) and MySpace (www.myspace.com) store vast amounts of content that must be managed, searched, and delivered to users over the Internet in a matter of seconds. The infrastructure and algorithms required for websites of this scale are challenging, ongoing research problems.

Data-Intensive Computing Challenges

The breakthrough technologies needed to address many of the critical problems in data-intensive computing will come from collaborative efforts involving several disciplines, including computer science, engineering, and mathematics. The following list shows some of the advances that will be needed to solve the problems faced by data-intensive computing applications:

• new algorithms that can scale to search and process massive datasets;
• new metadata management technologies that can scale to handle complex, heterogeneous, and distributed data sources;
• advances in high-performance computing platforms to provide uniform high-speed memory access to multi-terabyte data structures;
• specialized hybrid interconnect architectures to process and filter multigigabyte data streams coming from high-speed networks and scientific instruments and simulations;
• high-performance, high-reliability, petascale distributed file systems;
• new approaches to software mobility, so that algorithms can execute on nodes where the data resides when it is too expensive to move the raw data to another processing site (a minimal sketch of this idea follows the list);
• flexible and high-performance software integration technologies that facilitate the plug-and-play integration of software components running on diverse computing platforms to quickly form analytical pipelines; and
• data signature generation techniques for data reduction and rapid processing.
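As a minimal illustration of the software-mobility item above, the Python sketch below ships a small algorithm to the nodes that hold the data and returns only compact results. The node class, data, and histogram routine are all hypothetical stand-ins, not a real distributed framework; only the lightweight per-node summaries cross the network.

class StorageNode:
    """Stands in for a remote node holding one partition of a dataset."""
    def __init__(self, name, local_data):
        self.name = name
        self.local_data = local_data  # in practice, terabytes on local disk

    def run(self, algorithm):
        # The algorithm executes where the data resides; only its
        # (much smaller) output needs to travel.
        return algorithm(self.local_data)

def histogram(data, bins=4, lo=0, hi=100):
    """A hypothetical mobile algorithm: reduce raw records to bin counts."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in data:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts

nodes = [
    StorageNode("node-a", [3, 17, 42, 65, 99]),
    StorageNode("node-b", [8, 25, 50, 75]),
]

# Ship the algorithm to each node, then fuse the small results centrally.
partials = [node.run(histogram) for node in nodes]
print([sum(col) for col in zip(*partials)])  # [3, 2, 2, 2]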

IN THIS ISSUE

This special issue on data-intensive computing presents five articles that address some of these challenges.

In “Quantitative Retrieval of Geophysical Parameters Using Satellite Data,” Yong Xue and colleagues discuss the remote sensing information service grid node, a tool for processing satellite imagery to deal with climate change.

In “Accelerating Real-Time String Searching with Multicore Processors,” Oreste Villa, Daniele Paolo Scarpazza, and Fabrizio Petrini present an optimization strategy for a popular algorithm that performs exact string matching against large dictionaries and offer solutions to alleviate memory congestion.
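For readers unfamiliar with the problem, a common basis for exact string matching against large dictionaries is the Aho-Corasick automaton. The minimal, textbook Python sketch below (with a toy dictionary, and making no claim to reflect the authors’ multicore-optimized implementation) matches every dictionary entry in a single pass over the input.

from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick goto/fail/output tables for a dictionary."""
    goto = [{}]      # goto[state] maps a character to the next state
    fail = [0]       # fail[state] is the longest proper-suffix state
    out = [set()]    # out[state] holds patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first pass to compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def search(text, goto, fail, out):
    """Yield (position, pattern) for every dictionary hit in text."""
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i - len(pat) + 1, pat

tables = build_automaton(["he", "she", "his", "hers"])  # toy dictionary
print(sorted(search("ushers", *tables)))  # [(1, 'she'), (2, 'he'), (2, 'hers')]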

“Analysis and Semantic Querying in Large Biomedical Image Datasets” by Joel Saltz and colleagues describes a set of techniques for using semantic and spatial information to analyze, process, and query large image datasets.

“Hardware Technologies for High-Performance Data-Intensive Computing” by Maya Gokhale and colleagues offers an investigation into hardware platforms suitable for data-intensive systems.

In “ProDA: An End-to-End Wavelet-Based OLAP System for Massive Datasets,” Cyrus Shahabi, Mehrdad Jahangiri, and Farnoush Banaei-Kashani describe a system that employs wavelets to support exact, approximate, and progressive OLAP queries on large multidimensional datasets, while keeping update costs relatively low.

We hope you will enjoy reading these articles and that this issue will become a catalyst in drawing together the multidisciplinary research teams needed to address our data-intensive future. ■

References

1. W. Johnston, “High-Speed, Wide Area, Data-Intensive Computing: A Ten-Year Retrospective,” Proc. 7th IEEE Symp. High-Performance Distributed Computing, IEEE Press, 1998, pp. 280-291.
2. T. Hey and A. Trefethen, “The Data Deluge: An e-Science Perspective,” www.rcuk.ac.uk/cmsweb/downloads/rcuk/research/esci/datadeluge.pdf.
3. G. Bell, J. Gray, and A. Szalay, “Petascale Computational Systems,” Computer, Jan. 2006, pp. 110-112.
4. H.B. Newman, M.H. Ellisman, and J.A. Orcutt, “Data-Intensive E-Science Frontier Research,” Comm. ACM, Nov. 2003, pp. 68-77.

Ian Gorton is the chief architect for Pacific Northwest National Laboratory’s Data-Intensive Computing Initiative. His research interests include software architectures and middleware technologies. He received a PhD in computer science from Sheffield Hallam University. Gorton is a member of the IEEE Computer Society. Contact him at [email protected].

Paul Greenfield is a research scientist in Australia’s Commonwealth Scientific and Industrial Research Organisation. His research interests are distributed applications, the analysis of genetic sequence data, and computer system performance. He received an MSc in computer science from the University of Sydney. Greenfield is a member of the IEEE and the ACM. Contact him at [email protected].

Alex Szalay is the Alumni Centennial Professor in the Department of Physics and Astronomy at the Johns Hopkins University. His research interests include large spatial databases, pattern recognition and classification problems, theoretical astrophysics, and galaxy formation and evolution. He received a PhD in //what discipline??// from //degree-granting institution//. He is a member of //relevant professional organizations//. Contact him at [email protected].

Roy Williams is a senior scientist at the Center for Advanced Computing Research at Caltech. His research interests are astronomical transients, the virtual observatory, and //third research interest??//. He received a PhD in physics from Caltech. He is a member of the IEEE and the AAS. Contact him at cacr.caltech.edu.
