

Creating Abstractions for Piz Daint and its Ecosystem
Sadaf Alam, Chief Technology Officer, Swiss National Supercomputing Centre
Supercomputing 2017, November 16, 2017

CSCS in a Nutshell
§ A unit of the Swiss Federal Institute of Technology in Zurich (ETH Zurich)
  § Founded in 1991 in Manno; relocated to Lugano in 2012
§ Develops and promotes technical and scientific services for the Swiss research community in the field of high-performance computing
§ Enables world-class scientific research by pioneering, operating and supporting leading-edge supercomputing technologies
§ Employs more than 90 people from over 15 nations

Piz Daint and the User Lab
§ Model: Cray XC40/XC50
§ XC50 compute nodes: Intel Xeon E5-2690 v3 @ 2.60 GHz (12 cores, 64 GB RAM) and NVIDIA Tesla P100 (16 GB)
§ XC40 compute nodes: Intel Xeon E5-2695 v4 @ 2.10 GHz (18 cores, 64/128 GB RAM)
§ Interconnect configuration: Aries routing and communications ASIC, Dragonfly network topology
§ Scratch: ~9 + 2.7 PB capacity
§ Further reading:
  § http://www.cscs.ch/uploads/tx_factsheet/FSPizDaint_2017_EN.pdf
  § http://www.cscs.ch/publications/highlights/
  § http://www.cscs.ch/uploads/tx_factsheet/AR2016_Online.pdf

Piz Daint (2013 → 2016)
2013 (Cray XC30):
§ 5,272 hybrid nodes: NVIDIA Tesla K20X (6 GB GDDR5), Intel Xeon E5-2670 (32 GB DDR3)
§ No multi-core partition
§ Cray Aries dragonfly interconnect, ~33 TB/s bisection bandwidth
§ Sonexion Lustre file system: 2.7 PB (Sonexion 1600)
2016 (Cray XC50/XC40):
§ 5,320 hybrid nodes (Cray XC50): NVIDIA Tesla P100 (16 GB HBM2), Intel Xeon E5-2690 v3 (64 GB DDR4)
§ 1,431 multi-core nodes (Cray XC40): 2 × Intel Xeon E5-2695 v4 (64 and 128 GB DDR4)
§ Cray Aries dragonfly interconnect, ~36 TB/s bisection bandwidth; fully provisioned for 28 cabinets
§ Sonexion Lustre file system: ~9 PB (Sonexion 3000) and 2.7 PB (Sonexion 1600)
§ External GPFS on selected nodes

Piz Daint: More Versatile Than Before
§ 2013: computing, visualization
§ 2016 adds: data analysis, pre-/post-processing, data mover, machine learning, deep learning, and the DataWarp burst buffer (a job-script sketch follows below)
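To make the DataWarp role above concrete, here is a minimal sketch of a Slurm batch script that requests a per-job Cray DataWarp burst-buffer allocation. It is illustrative only: the capacity, scratch paths and application name (my_app) are assumptions, not details from the slides.

    #!/bin/bash -l
    # Sketch: Slurm job with a transient Cray DataWarp allocation.
    #SBATCH --job-name=dw_example
    #SBATCH --nodes=4
    #SBATCH --time=00:30:00
    # Request a job-lifetime (scratch) DataWarp allocation, striped
    # across burst-buffer nodes; the size is an illustrative assumption.
    #DW jobdw type=scratch access_mode=striped capacity=100GiB
    # Stage data from the parallel file system into the burst buffer
    # before the job starts, and stage results back out afterwards
    # (paths are hypothetical).
    #DW stage_in source=/scratch/user/input destination=$DW_JOB_STRIPED/input type=directory
    #DW stage_out source=$DW_JOB_STRIPED/output destination=/scratch/user/output type=directory

    # The runtime exposes the striped allocation via $DW_JOB_STRIPED.
    srun ./my_app --input "$DW_JOB_STRIPED/input" --output "$DW_JOB_STRIPED/output"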
Overview of Hardware Infrastructure
[Diagram: SWITCHlan (100 Gbit Ethernet) feeding the CSCS LAN and dedicated platforms, interconnected by the data centre network (InfiniBand, Ethernet).]

LHConCray Collaborative Project
§ Team members from CHIPP (Swiss Institute of Particle Physics) and CSCS
§ Objective: explore the efficiency of LHC workflows in a shared environment (Piz Daint)
§ Goals: full transparency to users while sustaining efficiency metrics and supporting monitoring and accounting tools (complete workflow mapping for multiple experiments)
§ In production since April 2017
§ Publications, presentations and community meetings
  § G. Sciacca, "ATLAS and LHC Computing on Cray", CHEP, 2016
  § L. Benedicic, M. Gila et al., "Opportunities for container environments on Cray XC30 with GPU devices", Proceedings of the Cray User Group meeting, 2015
§ Status (Jan 2017): https://wiki.chipp.ch/twiki/pub/LCGTier2/MeetingLHConCRAY20170127/20170127.CSCS_CHIPP_F2F.pdf
§ Status (Aug 2017): https://wiki.chipp.ch/twiki/pub/LCGTier2/BlogAcceptanceTests2017/LHConCRAY-Run4_CSCS.pdf
§ Community tools
  § https://wiki.chipp.ch/twiki/bin/view/LCGTier2/LHConCRAYMonitoring
  § http://ganglia.lcg.cscs.ch/ganglia/sltop_lhconcray.html

LHConCray Project: WLCG Platform Statistics
§ ~170 sites in 40 countries
§ 350,000+ cores
§ 500+ PB of storage
§ 2+ million jobs per day
§ 10-100 Gb links
§ http://wlcg.web.cern.ch/tools

Status of LHConCray Project
§ Operational since April 2017
§ Monitor status and progress at http://wlcg.web.cern.ch/tools and https://wiki.chipp.ch/twiki/bin/view/LCGTier2/LHConCRAYMonitoring
§ Statistics
  § ~20% of total job submissions (<0.4% of total compute resources)
  § Over 90% of Docker/Shifter image pull requests are for LHC software
§ Open items (tuning and optimization)
  § A data-corruption patch generated an issue for the swap test case
  § Continued investigation into DVS and DataWarp (DWS) optimization and tuning

Bridging the Gap → Creating New Abstractions
§ Lightweight operating system (SLES-based)
  § Possible solution: containers or other virtualization interfaces (see the Shifter sketch after the next slide)
§ Diskless compute nodes
  § Possible solution: exploit burst buffers or tiered storage hierarchies
§ Compute-node connectivity (high-speed Aries interconnect)
  § Possible solution: web-services access with no address-translation overhead

Future Policy and Technical Considerations: Convergence of HPC and Data Science Workflows
§ Resource management systems (job schedulers)
  § Too many jobs (relative to the HPC job mix)
  § Fine-grained control and interactive access (see the Slurm sketch below)
§ Resource specialization (multi-level heterogeneity)
  § Subsets of nodes with special operating conditions, e.g. node sharing
§ Resource access (authentication, authorization and accounting)
  § Delegation of access (service- and user-account mappings)
§ Resource accessibility and interoperability (middleware services)
  § Secure and efficient access through web services
  § Interoperability with multiple storage targets (POSIX and object)
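The container abstraction referenced on the "Bridging the Gap" slide was delivered on Piz Daint through Shifter, the technology behind the Docker/Shifter image pulls counted above. The following is a minimal sketch using the upstream NERSC Shifter command-line interface; the module name and image are illustrative assumptions, and the exact CLI can differ between Shifter versions and sites.

    # Sketch: pulling and running a Docker image with Shifter under Slurm.
    # The module name and the image are assumptions, not from the slides.
    module load shifter-ng                  # site-provided Shifter module (assumed name)
    shifterimg pull docker:ubuntu:16.04     # convert a Docker Hub image for Shifter
    # Launch the containerized application on two compute nodes:
    srun -N 2 shifter --image=docker:ubuntu:16.04 ./my_app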
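For the fine-grained and interactive access called out under resource management, a Slurm-managed Cray system can grant an interactive allocation as sketched below. The constraint name (gpu) follows common Piz Daint convention but is an assumption here, as are the node count and time limit.

    # Sketch: interactive Slurm allocation on a GPU node (names assumed).
    salloc --nodes=1 --time=00:30:00 --constraint=gpu
    # Inside the allocation, start an interactive shell on the compute node:
    srun --pty bash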

Thank you for your attention.
