Data Grid Management Systems
Total Page:16
File Type:pdf, Size:1020Kb
Arun Jagatheesan Reagan Moore San Diego Supercomputer Center (SDSC) University of California, San Diego {arun, moore} @sdsc.edu University of Florida 1 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Storage Resource Broker • Distributed data management technology • Developed at San Diego Supercomputer Center (Univ. of California, San Diego) • 1996 - DARPA Massive Data Analysis • 1998 - DARPA/USPTO Distributed Object Computation Testbed • 2000 to present - NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC • Applications • Data grids - data sharing • Digital libraries - data publication • Persistent archives - data preservation • Used in national and international projects in support of Astronomy, Bio- Informatics, Biology, Earth Systems Science, Ecology, Education, Geology, Government records, High Energy Physics, Seismology University of Florida 2 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Acknowledgement: SDSC SRB Team ¾ Arun Jagatheesan ¾ George Kremenek ¾ Sheau-Yen Chen ¾ Arcot Rajasekar ¾ Reagan Moore ¾ Michael Wan ¾ Roman Olschanowsky ¾ Bing Zhu ¾ Charlie Cowart Not In Picture: ¾ Wayne Schroeder ¾ Tim Warnock(BIRN) ¾ Lucas Gilbert ¾ Marcio Faerman (SCEC) ¾ Antoine De Torcy Students: Emeritus: Xi (Cynthia) Sheng Vicky Rowley (BIRN) Allen Ding Qiao Xin Grace Lin Daniel Moore Jonathan Weinberg Ethan Chen Yufang Hu Reena Mathew Yi Li Erik Vandekieft Ullas Kapadia University of Florida 3 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Tutorial Outline • Introduction • Data Grids • Data Grid Infrastructures • Information Management using Data Grids • Data Grid Transparencies and concepts • Peer-to-peer Federation of Data Grids • Gridflows and Data Grids • Need for Gridflows • Data Grid Language and SDSC Matrix Project • Lets build a Data Grid • Using SDSC SRB Data Grid Management System and its Interfaces University of Florida 4 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Data Grids • Coordinated Cyberinfrastructure • Formed by coordination of multiple autonomous organizations • Preserves local autonomy and provides global consistency • Logical Namespaces (Virtualizations) • Virtualization mechanisms for resources (including storage space, data, metadata, processing pipelines and inter- organizational users) • Location and infrastructure independent logical namespace with persistent identifiers for all resources University of Florida 5 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Data Grid Goals • Automate all aspects of data analysis • Data discovery • Data access • Data transport • Data manipulation • Automate all aspects of data collections • Metadata generation • Metadata organization • Metadata management • Preservation University of Florida 6 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Using a Data Grid – Ask for data in Abstract Data delivered Data Grid •User asks for data from the data grid • University of Florida The data is found and returned •Where & how details are managed by data grid National Partnership•But for access Advanced controls are specified by owner 7 Computational Infrastructure San Diego Superc omputer Center Tutorial Outline • Introduction • Data Grids • Data Grid Infrastructures • Information Management using Data Grids • Data Grid Transparencies and concepts • Peer-to-peer Federation of Data Grids • Gridflows and Data Grids • Need for Gridflows • Data Grid Language and SDSC Matrix Project • Data Grids and You • Open Research Issues and Global Grid Forum Community • Lets build a Data Grid • Using SDSC SRB Data Grid Management System and its Interfaces University of Florida 8 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure SRB Environments • NSF Southern California Earthquake Center digital library • Worldwide Universities Network data grid • NASA Information Power Grid • NASA Goddard Data Management System data grid • DOE BaBar High Energy Physics data grid • NSF National Virtual Observatory data grid • NSF ROADnet real-time sensor collection data grid • NIH Biomedical Informatics Research Network data grid • NARA research prototype persistent archive • NSF National Science Digital Library persistent archive • NHPRC Persistent Archive Testbed University of Florida 9 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Southern California Earthquake Center University of Florida 10 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Southern California Earthquake Center • Build community digital library • Manage simulation and observational data • Anelastic wave propagation output • 10 TBs, 1.5 million files • Provide web-based interface • Support standard services on digital library • Manage data distributed across multiple sites • USC, SDSC, UCSB, SDSU, SIO • Provide standard metadata • Community based descriptive metadata • Administrative metadata • Application specific metadata University of Florida 11 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure SCEC Digital Library Technologies • Portals • Knowledge interface to the library, presenting a coherent view of the services • Knowledge Management Systems • Organize relationships between SCEC concepts and semantic labels • Process management systems • Data processing pipelines to create derived data products • Web services • Uniform capabilities provided across SCEC collections • Data grid • Management of collections of distributed data • Computational grid • Access to distributed compute resources • Persistent archive • Management of technology evolution University of Florida 12 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Metadata Organization (Domain View versus Run View) Provenance Simulation Model Program Computer System Velocity Model Fault Model Domain ... Spatial Temporal Physical Numerical Run Output Domain List Formatting University of Florida 13 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure University of Florida 14 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure University of Florida 15 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure NASA Data Grids • NASA Information Power Grid • NASA Ames, NASA Goddard • Distributed data collection using the SRB • ESIP federation • Led by Joseph JaJa (U Md) • Federation of ESIP data resources using the SRB • NASA Goddard Data Management System • Storage repository virtualization (Unix file system, Unitree archive, DMF archive) using the SRB • NASA EOS Petabyte store • Storage repository virtualization for EMC persistent store using the Nirvana version of SRB University of Florida 16 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Data Assimilation Office HSI has implemented metadata schema in SRB/MCAT Origin: host, path, owner, uid, gid, perm_mask, [times] Ingestion: date, user, user_email, comment Generation: creator (name, uid, user, gid), host (name, arch, OS name & flags), compiler (name, version, flags), library, code (name, version), accounting data Data description: title, version, discipline, project, language, measurements, keywords, sensor, source, prod. status, temporal/spatial coverage, location, resolution, quality Fully compatible with GCMD University of Florida 17 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Data Management System: Software Architecture DODS Web Browser Computational Clients Jobs DODS DMS GUI SRB CM MCAT CM DB Mass Storage Systems Oracle Irix Tru64 ...etc... Unitree Linux University of Florida 18 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure DODS Access Environment Integration User Desktop App Workstation n o User i t a App n o st i t k r NetCDF a o DMS Native st Server App k W FS DODS- r p o o Native t enable DODS/ FS p W sk App NetCDF o t De WWW sk DODS DODS Libraries Browser De SRB SRB Clients Clients SRB Middleware SRB SRB SRB Clients Server Server ne i Eng Native Legacy e Unitree DMF App ut FS p m o C Storage Storage PBS Job Server Server University of Florida 19 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure National Virtual Observatory Data Grid 1. Portals and Workbenches 2.Knowledge & Resource 3. Metadata Data Catalog Bulk Data Management View View Analysis Analysis Concept space Standard APIs and Protocols 4.Grid Information Metadata Data Data 5. Security Discovery delivery Discovery Delivery Caching Standard Metadata format, Data model, Wire format Replication Backup 6. Catalog Mediator Data mediator Scheduling Catalog/Image Specific Access 7. Compute Resources Derived Collections Catalogs Data Archives University of Florida 20 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure National Virtual Observatory Portals, User Interfaces, Tools VOPlot SkyQuery Aladin Topcat OASIS DIS Mirage conVOT ADQL data Bulk XQuery models processing Registry Layer Data Access Layer Computational Services HTTP and SOAP Web Services visualization data mining y r ) e P u D OAI ADS A source image FI V Q C S O y