Data Grid Management Systems

Arun Jagatheesan, Reagan Moore
San Diego Supercomputer Center (SDSC), University of California, San Diego
{arun, moore}@sdsc.edu
University of Florida | San Diego Supercomputer Center | National Partnership for Advanced Computational Infrastructure

Storage Resource Broker

• Distributed data management technology
• Developed at the San Diego Supercomputer Center (University of California, San Diego)
  • 1996: DARPA Massive Data Analysis
  • 1998: DARPA/USPTO Distributed Object Computation Testbed
  • 2000 to present: NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC
• Applications
  • Data grids: data sharing
  • Digital libraries: data publication
  • Persistent archives: data preservation
• Used in national and international projects in support of Astronomy, Bioinformatics, Biology, Earth Systems Science, Ecology, Education, Geology, Government records, High Energy Physics, and Seismology

Acknowledgement: SDSC SRB Team

• Arun Jagatheesan, George Kremenek, Sheau-Yen Chen, Arcot Rajasekar, Reagan Moore, Michael Wan, Roman Olschanowsky, Bing Zhu, Charlie Cowart
• Not in picture: Wayne Schroeder, Tim Warnock (BIRN), Lucas Gilbert, Marcio Faerman (SCEC), Antoine De Torcy
• Students: Xi (Cynthia) Sheng, Allen Ding, Grace Lin, Jonathan Weinberg, Yufang Hu, Yi Li, Ullas Kapadia
• Emeritus: Vicky Rowley (BIRN), Qiao Xin, Daniel Moore, Ethan Chen, Reena Mathew, Erik Vandekieft

Tutorial Outline

• Introduction
  • Data Grids
  • Data Grid Infrastructures
• Information Management using Data Grids
  • Data Grid Transparencies and concepts
  • Peer-to-peer Federation of Data Grids
• Gridflows and Data Grids
  • Need for Gridflows
  • Data Grid Language and SDSC Matrix Project
• Data Grids and You
  • Open Research Issues and the Global Grid Forum Community
• Let's build a Data Grid
  • Using the SDSC SRB Data Grid Management System and its Interfaces

Data Grids

• Coordinated cyberinfrastructure
  • Formed by coordination of multiple autonomous organizations
  • Preserves local autonomy while providing global consistency
• Logical namespaces (virtualizations)
  • Virtualization mechanisms for resources, including storage space, data, metadata, processing pipelines, and inter-organizational users
  • Location- and infrastructure-independent logical namespace with persistent identifiers for all resources

Data Grid Goals

• Automate all aspects of data analysis
  • Data discovery
  • Data access
  • Data transport
  • Data manipulation
• Automate all aspects of data collections
  • Metadata generation
  • Metadata organization
  • Metadata management
  • Preservation

Using a Data Grid

• The user asks for data in the abstract; the data grid finds and returns it
• Where and how the data is stored are details managed by the data grid
• But access controls are specified by the owner

SRB Environments

• NSF Southern California Earthquake Center digital library
• Worldwide Universities Network data grid
• NASA Information Power Grid
• NASA Goddard Data Management System data grid
• DOE BaBar High Energy Physics data grid
• NSF National Virtual Observatory data grid
• NSF ROADnet real-time sensor collection data grid
• NIH Biomedical Informatics Research Network data grid
• NARA research prototype persistent archive
• NSF National Science Digital Library persistent archive
• NHPRC Persistent Archive Testbed

Southern California Earthquake Center

• Build a community digital library
• Manage simulation and observational data
  • Anelastic wave propagation output
  • 10 TB, 1.5 million files
• Provide a web-based interface
• Support standard services on the digital library
• Manage data distributed across multiple sites (USC, SDSC, UCSB, SDSU, SIO)
• Provide standard metadata
  • Community-based descriptive metadata
  • Administrative metadata
  • Application-specific metadata

SCEC Digital Library Technologies

• Portals: knowledge interface to the library, presenting a coherent view of the services
• Knowledge management systems: organize relationships between SCEC concepts and semantic labels
• Process management systems: data processing pipelines to create derived data products
• Web services: uniform capabilities provided across SCEC collections
• Data grid: management of collections of distributed data
• Computational grid: access to distributed compute resources
• Persistent archive: management of technology evolution

Metadata Organization (Domain View versus Run View)

[Figure: metadata hierarchy relating run output to provenance metadata (simulation model, program, computer system, velocity model, fault model) and domain metadata (spatial, temporal, physical, numerical), with domain list and formatting views of the output.]

NASA Data Grids

• NASA Information Power Grid
  • NASA Ames, NASA Goddard
  • Distributed data collection using the SRB
• ESIP Federation
  • Led by Joseph JaJa (University of Maryland)
  • Federation of ESIP data resources using the SRB
• NASA Goddard Data Management System
  • Storage repository virtualization (Unix file system, UniTree archive, DMF archive) using the SRB
• NASA EOS petabyte store
  • Storage repository virtualization for the EMC persistent store using the Nirvana version of the SRB

Data Assimilation Office

HSI has implemented the metadata schema in SRB/MCAT:
• Origin: host, path, owner, uid, gid, perm_mask, [times]
• Ingestion: date, user, user_email, comment
• Generation: creator (name, uid, user, gid), host (name, arch, OS name & flags), compiler (name, version, flags), library, code (name, version), accounting data
• Data description: title, version, discipline, project, language, measurements, keywords, sensor, source, prod. status, temporal/spatial coverage, location, resolution, quality
Fully compatible with GCMD.

Data Management System: Software Architecture

[Figure: architecture diagram in which DODS web browsers, computational jobs, and the DMS GUI act as clients of the SRB, backed by the MCAT catalog in an Oracle database and by mass storage systems (UniTree) on Irix, Tru64, Linux, and other platforms.]

DODS Access Environment

[Figure: integration diagram showing desktop and workstation applications (native file system, NetCDF, DODS-enabled, and web-browser clients) reaching data through DODS servers and SRB clients, with SRB middleware connecting SRB servers to native file systems, legacy applications, UniTree and DMF storage servers, and a PBS-managed compute engine.]

National Virtual Observatory Data Grid

[Figure: layered NVO architecture: (1) portals and workbenches; (2) knowledge and resource management; (3) metadata, data, and bulk-data analysis with catalog and data views and a concept space, above standard APIs and protocols; (4) grid services for information discovery, metadata delivery, data discovery and delivery, caching, replication, backup, and scheduling, over standard metadata formats, data models, and wire formats; (5) security; (6) catalog and data mediators with catalog/image-specific access; (7) compute resources, derived collections, catalogs, and data archives.]

National Virtual Observatory

• Portals, user interfaces, and tools: VOPlot, SkyQuery, Aladin, Topcat, OASIS, DIS, Mirage, conVOT
• Registry layer and data access layer exposed through HTTP and SOAP web services (OAI, ADS, source-image and catalog queries)
• ADQL, data models, bulk XQuery processing
• Computational services: visualization, data mining
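The access pattern on the "Using a Data Grid" slide (the user names data abstractly, the grid resolves where and how it is stored, and the owner's access controls are enforced) can be sketched in a few lines. This is a hypothetical illustration of the logical-namespace idea, not the SRB API; every class, name, and path below is invented.

```python
# Minimal sketch of abstract data access in a data grid: a logical
# name maps to physical replicas, the grid picks a replica, and the
# owner's access controls gate the request. Hypothetical names only.
from dataclasses import dataclass, field

@dataclass
class Replica:
    site: str   # physical location, e.g. an archive or file system
    path: str   # site-specific storage path

@dataclass
class LogicalFile:
    owner: str
    readers: set                     # users granted read access by the owner
    replicas: list = field(default_factory=list)

class DataGrid:
    def __init__(self):
        self.namespace = {}          # logical name -> LogicalFile

    def register(self, name, entry):
        self.namespace[name] = entry

    def get(self, name, user):
        """Resolve a logical name to a replica the user may read."""
        entry = self.namespace[name]
        if user != entry.owner and user not in entry.readers:
            raise PermissionError(f"{user} may not read {name}")
        # Location transparency: any replica satisfies the request.
        return entry.replicas[0]

grid = DataGrid()
grid.register(
    "/scec/run42/velocity.dat",
    LogicalFile(owner="moore", readers={"arun"},
                replicas=[Replica("sdsc-archive", "/unitree/run42/vel.dat"),
                          Replica("usc-disk", "/data/run42/vel.dat")]),
)
print(grid.get("/scec/run42/velocity.dat", "arun").site)  # prints "sdsc-archive"
```

In a real data grid such as the SRB, the namespace mapping lives in a metadata catalog (MCAT) and replica selection can weigh locality and load; the sketch only shows the transparencies the slides describe.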