What Can SDSC Do For You?
Michael L. Norman, Director
Distinguished Professor of Physics
SAN DIEGO SUPERCOMPUTER CENTER

Mission: Transforming Science and Society Through “Cyberinfrastructure”
“The comprehensive infrastructure needed to capitalize on dramatic advances in information technology has been termed cyberinfrastructure.”
D. Atkins, NSF Office of Cyberinfrastructure

What Does SDSC Do?

Gordon – World’s First Flash-based Supercomputer for Data-intensive Apps
>300,000 times as fast as SDSC’s first supercomputer
1,000,000 times as much memory

Industrial Computing Platform: Triton
• Fast
• Flexible
• Economical
• Responsive service
• Being upgraded now with faster CPUs and GPU nodes

First all 10Gig Multi-PB Storage System

High Performance Cloud Storage
Analogous to AWS S3
• Data preservation and sharing
• Low cost
• High reliability
• Web-accessible

Awesome connectivity to the outside world
[Network diagram: a 384-port 10G switch connecting XSEDE, ESnet, UCSD RCI, CENIC, and the commercial Internet, with links labeled 10G, 20G, and 100G; "your link here"; www.yourdatacollection.edu]

Over 100 in-house researchers and technical staff
Core competencies
• Modeling & simulation
• Parallel computing
• Cloud computing
• Energy efficient computing
• Advanced networking
• Software development
• Database systems
• Data mining/BI tools
• Data modeling & integration
• Data management
• Data processing workflows
• Datacenter management

Application Domains
• Fluid dynamics
• Structural engineering
• Biomolecular simulation
• Computational chemistry
• Seismic modeling
• Coastal hydrology
• Geoinformatics
• Neuroinformatics
• Bioinformatics/genomics
• Radiology
• Smart energy grids
• Medicare fraud detection

SDSC is at the nexus of the genomic medicine revolution
Wayne Pfeiffer

bioKepler: Programmable and Scalable Workflows for Distributed Analysis of Large-Scale Biological Data
MapReduce BLAST
Ilkay Altintas
Assemble complex processing easily
Access diverse resources transparently
Incorporate multiple software tools
Assure reproducibility

Community development model
Natasha Balac

Big Data Predictive Analytics for UCSD Smart Grid

Over 70,000 sensor streams from UCSD Smart Grid processed on Gordon

Center for Large Scale Data Systems Research (CLDS)
Chaitan Baru, Jim Short

HPWREN: A Unique Regional Capability for Public-Private Partnerships

SDSC Teaming with CALFIRE and SDG&E to Respond to and Prevent Wildfires

What Can SDSC Do for You?
• Just about anything involving high capability/capacity technical computing, data management, and networking
• Our technical experts are eager to engage on R&D projects and service agreements customized to meet your needs
• There is a spectrum of ways we can interact

How do you begin working with us?
• You have already taken the first step by coming here today
• Join the IPP program to learn more about SDSC expertise and resources
• POC Ron Hawkins ([email protected])
• Enjoy the rest of the program

SDSC Data Initiatives
Chaitan Baru
Associate Director, Data Initiatives
Director, Center for Large-scale Data Systems Research (CLDS)
SDSC, UC San Diego
[email protected]

Outline
• SDSC and Data
• Center for Large-Scale Data Systems Research (CLDS)
• Graduate Student Engagement
• Data Science Education and Training

SDSC IPP Research Review, June 12, 2013

SDSC’s Data DNA
• 25+ year history as a supercomputer center focused on data
• Applied Informatics is what we do
  ▫ At the intersection of science and data and computational science
  ▫ Applied and applications-driven research and development
• Multidisciplinary projects and interdisciplinary collaborations are how we do it
  ▫ They are our strength and the secret sauce in our above-average success rate on highly competitive proposals
• Advancing the state of the art in science and improving the science research process are why we do it
  ▫ Lessons can be applied to business applications as well
  ▫ We believe many science applications are precursors to future business apps

Data: A rapidly evolving set of problems
• Analytics: Real-time and historical trend analysis (data velocity and volume)
• Integration: More comprehensive, holistic analysis (data variety)
• Costs
  ▫ Hardware, energy, software, people
• Skill Sets
  ▫ Need for “cross-trained”, data-savvy individuals
  ▫ Ability to thrive in multidisciplinary, holistic, data-driven environments
  ▫ Break out of narrow academic silos / corporate roles and departments
  ▫ A real shortage
• Competition
  ▫ Global talent
  ▫ Increasingly, local problems
• Privacy

SDSC R&D Activities in Data
• Informatics collaborations in
  ▫ High-energy physics, astrophysics/astronomy, computational chemistry, bioinformatics, biomedical informatics, geoinformatics, ecoinformatics, social science, neurosciences, smart energy grids, anthropology, archaeology, …
• Expertise and Labs in:
  ▫ Benchmarking
  ▫ Machine learning
  ▫ Bioinformatics
  ▫ Performance modeling
  ▫ Computational science
  ▫ Predictive analytics
  ▫ Data warehousing
  ▫ Scientific data management
  ▫ Data and info visualization
  ▫ Spatial data management
  ▫ Large graph and text data
  ▫ Workflow systems
• Centers of Excellence
  ▫ CLDS: Center for Large-scale Data Systems Research, Chaitan Baru, Director
  ▫ PACE: Predictive Analytics Center of Excellence, Natasha Balac, Director
  ▫ CAIDA: Center for Applied Internet Data Analysis, KC Claffy, Director

CLDS: Center for Large-Scale Data Systems Research
• Focus: Technical and technology management aspects related to big data
• Key initiatives
  ▫ Big Data Benchmarking
  ▫ Data Value and How Much Information?
• Principals: Chaitan Baru, James Short

Big Data Benchmarking – 1
• Community activity for development of a system-level big data benchmark, like TPC
  ▫ Coordinated by SDSC, http://clds.sdsc.edu/bdbc
  ▫ [email protected]: Biweekly phone meetings
• A proposed BigData Top100 List, bigdatatop100.org
• Two proposals under discussion
  ▫ BigBench: Extending TPC-DS for big data
  ▫ Data Analytics Pipeline: End-to-end analysis of event stream data
• Discussions with TPC and SPEC

Big Data Benchmarking – 2
• Workshops on Big Data Benchmarking (WBDB)
  ▫ 1st WBDB: May 2012, San Jose
  ▫ 2nd WBDB: December 2012, Pune, India
  ▫ 3rd WBDB: July 2013, Xi’an, China
  ▫ 4th WBDB: October 2013, San Jose

Big Data Reference Datasets
• An initiative of the Cloud Security Alliance-BigData Working Group
  ▫ Sreeranga Rajan, Fujitsu (Chair); Neel Sundaresan, eBay (Co-Chair); Wilco van Ginkel, Verizon (Co-Chair); Arnab Roy, Fujitsu (Crypto co-lead in BDWG/CSA)
• Objective:
  ▫ Make reference datasets available on one or more platforms, for algorithm-level benchmarking
• Hosted by SDSC
  ▫ http://clds.sdsc.edu/bdbc/referencedata

NIST Big Data Working Group
• http://bigdatawg.nist.gov
• Co-chairs: Chaitan Baru; Robert Marcus, CTO, ET-Strategies; Wo Chang, Chris Greer, NIST
• Objective: 1-year time frame
  ▫ Definitions
  ▫ Taxonomies
  ▫ Reference Architectures
  ▫ Technology Roadmap
• First meeting: June 19th
• Open to community

Current CLDS Programs
• Big Data Benchmarking (Pivotal lead)
• Project on Data Value (NetApp lead)
  ▫ Develop definitions, frameworks, assessment methodology, and tools for Data Value
  ▫ Proposed Workshop on Data Value, Jan-Feb 2014
• How Much Information 2013 (Seagate lead)
  ▫ Consumer Information; Enterprise Information
• Data Science Institute (Brocade)
  ▫ SDSC-level program

CLDS Sponsorship
• Current sponsors
  ▫ Seagate, Pivotal, NetApp, Brocade, Intel (soon)
• Goals
  ▫ Small, focused group of core sponsors representing major industry quadrants, non-competitive (6-8 companies)
  ▫ Extended network of members who provide scale and scope and help fund industry events (20-30 companies)
• Sponsor structure:
  ▫ Founding (100K, multi-year)
  ▫ Program (50K, annual)
• Member structure:
  ▫ Continuing (10K+, pay as you go)
  ▫ Member (5K, per workshop event)

Big Data Benchmarking: How you can participate
• BDBC
  ▫ Join BDBC mailing list, ~150 members, ~75 organizations
  ▫ Attend biweekly meetings, every other Thursday
  ▫ Present at biweekly meetings
• WBDB
  ▫ Submit papers to workshops; attend workshops
• Reference Datasets
  ▫ Participate in the Reference Datasets activity; contribute reference datasets
• NIST Big Data Working Group
  ▫ Join and contribute to the NIST Big Data Working Group
• Join CLDS as a sponsor

Data Value; How Much Information? How you can participate