WestGrid Stakeholder Session & Annual General Meeting

October 10, 2018, Vancouver, BC

WestGrid Update

WestGrid Key Activities 2017-18

ARC Leadership

“I would like to thank the Working Group members who have spent the last nine months with me locked in rooms and on countless phone calls - you have been a highly dedicated group of people who have brought tremendous expertise and commitment to this task. It has been such a pleasure to work with you.”

- Robbin Tourangeau, LCDRI

Partnership: Precision Infection Management (PIM)

“Thank you so much Sergei and Lindsay for your contributions on this project! The serious computational muscle gave this project real legitimacy and I am very grateful to both of you for making this a reality! Thanks Lindsay! You are a rockstar, this wouldn't have been possible without your support!”

- Ian Lewis, LSARP Project Lead

Bioinformatics Helpdesk

● 133 subscribed users
● 22 question posts since launch in January
● 79 total questions in database
● 4 one-on-one sessions
● 6 questions via contact form

Advocacy

Unique Culture

The West is the Best

● Collaborative
● Social
● Likeminded “get it done” attitude
● Altruistic

Summary of WestGrid Value

★ Leadership
★ HQP development
★ Funding diversification
★ Cost savings to members
★ Education, Outreach & Training
★ Engagement & advocacy
★ Addressing user need
★ Improving user experience

Projects & Outreach

Co-op Students

“I am honored to have worked with so many people who were so passionate about their career, as well as the well-being of our industry. It was truly an amazing experience that I will never forget.”
- Steven Bucholtz, Okanagan College Student, Computer Information Services

“This was the perfect opportunity for us to learn how to set up a site from nothing. That everyone is approachable and willing to help without judgement or any hesitation. This opportunity is something I will never forget so again thank you for everything.”
- Christina Hebert, Okanagan College, Computer Information Sciences

Re-developed Status Page

CANHEIT-TECC

● Largest conference ever held at SFU Burnaby campus
● Almost 700 attendees
● 130 total sessions, over 100 hours of programming including the Women in Technology Breakfast
● Over 30 staff from the Compute Canada federation presented in nearly 20 sessions over 3 days!

Survey Numbers: WestGrid User Survey & Accounts Renewals

Erin Trifunov
Manager, Projects and Outreach
WestGrid

WestGrid User Survey - 2017

“The services provided by Westgrid are invaluable and our organization would be substantially less productive without WestGrid's assistance. Some of our research would have been impossible without WestGrid.”
- University of Victoria, Environmental & Earth Sciences Research Staff

GOAL: Awareness of Services

Increase awareness of specialized support services, especially Bioinformatics support.

GOAL: WestGrid User Support

Increase awareness of and offerings of support

23% increase

GOAL: Training & Outreach

Increase awareness of and offerings of support for new users & in-person sessions.

In 2017-18 we delivered:
• 1000s of hours, 42 events
• 800+ RSVPs, 42% new users
• 2 regional summer schools
• 24 Software Carpentry events
• 11 WestGrid Town Halls
• 6 Viz workshops
• SMT visits to all 7 sites

New User Support

2 In-person HPC Summer Schools

New website for training materials - simpler to use

Visualize This!

National competition led by WestGrid’s Visualization and Training Coordinator Alex Razoumov

GOAL: Overall Satisfaction

Provide more targeted user support for RAC.

2017 WestGrid User Survey: Overall Satisfaction with Compute Canada (283 respondents, 28% PIs)

2018 Compute Canada Renewal Survey: Overall Satisfaction with Compute Canada (8675 respondents, 2581 from within WestGrid, 37% PIs)

RAC User Support

● RAC Best Practices presentations from WG Director of Operations (3 hrs+ each)
  ○ 2017: 2 Sites + 1 Virtual
  ○ 2018: 6 Sites + 2 Virtual
● RAC User Guides, Slides and Site Analysis

2018-19 Goals & Opportunities

Build awareness of WestGrid services and value to stakeholders.

User Support

GOAL: Increase awareness of and advocate for more local support presence.

12%, down from 56% in 2016

● 11% DO NOT have adequate local support (21% from UofA, 16% from UVIC & SFU)

● 20% “didn’t know” (25% from UBC, 15% UofA)

● 35% of less experienced users felt they DO HAVE adequate local support, 38% unsure

User Growth vs Staffing

From 2013-2018, CC users from WestGrid based institutions INCREASED 580%

WestGrid FTE levels from 2013-2018 increased by only 66%

Need for Local Support
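The user growth and staffing figures above can be combined into a single number: the change in users per staff FTE. The sketch below is only an illustration, assuming both percentages are measured over the same 2013-2018 period and from comparable baselines; the function names are hypothetical and do not come from the presentation.

```python
# Figures quoted on the slide (2013 to 2018).
USER_GROWTH_PCT = 580  # increase in CC users from WestGrid-based institutions
FTE_GROWTH_PCT = 66    # increase in WestGrid FTE levels


def growth_factor(percent_increase: float) -> float:
    """Convert a percentage increase into a multiplicative factor."""
    return 1.0 + percent_increase / 100.0


def users_per_fte_change(user_pct: float, fte_pct: float) -> float:
    """Factor by which the users-per-FTE ratio changed (hypothetical helper)."""
    return growth_factor(user_pct) / growth_factor(fte_pct)


ratio = users_per_fte_change(USER_GROWTH_PCT, FTE_GROWTH_PCT)
print(f"Users per FTE changed by a factor of about {ratio:.1f}")
```

Under these assumptions, each staff FTE now corresponds to roughly four times as many users as in 2013.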

“I strongly believe that the usefulness of Compute Canada in general, and of Westgrid in particular, is a direct function of the strong local support.”
- Manitoba Principal Investigator

New Opportunities

Explore opportunities to meet resource needs and users outside CC scope.

Resources, Needs, Future needs, Industry Activity

Operations update

Highlights

New Cloud
● Sept. 2016: Arbutus (UVictoria)
● Oct. 2018: Major additional capacity (cores and storage)

New GP systems
● Summer 2017: Cedar (SFU), Graham (Waterloo)
● Autumn 2017-2018: Major Cedar upgrade: ~2x increase in cores! Planned doubling in storage.

New LP system
● April 2018: Niagara (Toronto)

Defunding old systems (2017/2018)
● Grex (UManitoba, Mar. 31, 2018)
● Parallel/Lattice (UCalgary, Mar. 31, 2018)
● Hungabee/Jasper (UofA, Oct. 1, 2017)
● Silo (USask, Jan. 31, 2017)

Migration (2017-2018)
● Migrate users from legacy to new systems (WestGrid led the CC Migration Working Group)

New System support: Software, Documentation and Training (2017-2018)
● Complete new software distribution system (CVMFS).
● Comprehensive new software suite.
● Full re-write of user docs.
● Led by Research Support National Team: extensive WG involvement.

Support Tickets

April-May 2018: transition from WG OTRS ticketing system to CC central ticketing system.
● Statistics are from June 1, 2018 (CC OTRS ticketing system)

Total CC tickets in all queues: 6,800
Total WG tickets in all queues: 2,266 (33%); WG users: ~30%
Average WG tickets per month in all queues: 566 (includes CC accounts and software install tickets)
Average tickets per month in WG queues: 344 (WG OTRS 2017: 397)

WG Tickets per Month

March-April-May:
● WG OTRS to Central CC OTRS

March-April:
● Account renewals

April:
● Allocations implemented

WG Tickets in separate WG OTRS system

Monthly Mean (June 1-Oct. 1):
● 344 (WG OTRS 2017 was 396)
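The ticket totals reported for the CC ticketing system can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming the totals cover the four months from June 1 to Oct. 1, 2018; the variable names are illustrative only.

```python
# Totals quoted in the Support Tickets statistics.
TOTAL_CC_TICKETS = 6800  # all CC tickets, all queues
TOTAL_WG_TICKETS = 2266  # WG tickets, all queues
MONTHS_COVERED = 4       # assumption: June 1 to Oct. 1, 2018

# Share of all CC tickets that came from WG.
wg_share = TOTAL_WG_TICKETS / TOTAL_CC_TICKETS

# Mean WG tickets per month over the assumed reporting window.
wg_monthly_mean = TOTAL_WG_TICKETS / MONTHS_COVERED

print(f"WG share of CC tickets: {wg_share:.1%}")
print(f"WG tickets per month (all queues): {wg_monthly_mean:.1f}")
```

Under the four-month assumption, both results agree with the reported figures of 33% and about 566 tickets per month.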

Top Responders

1. Ali Kerrache, 889
2. Daniel Stubbs, Université de Montréal, 658
3. Maxime Boissonneault, Université de Laval, 518
4. Doug Phillips, 290
5. Roman Baranowski, UBC, 227
6. Erick Giguere, Université de Sherbrooke, 143
7. Belaid Moa, University of Victoria, 138
8. Ross Dickson, Dalhousie University, 133
9. Pier-Luc St-Onge, McGill University, 129
10. Charles Coulombe, Université de Laval, 126

15. Grigory Shamov, Manitoba, 94
18. Martin Siegert, SFU, 84
19. Dmitri Rozmanov, Calgary, 79
30. Ataollah Roudgar, SFU, 51
32. Kamil Marcinkowski, Alberta, 48
34. John Simpson, Alberta, 47
40. Malcolm Petch, UBCO, 36
44. Adam McKenzie, Sask, 31
48. Robert Fridman, Calgary, 24
49. Chris Want, Alberta, 22

Past Events

General, Jan/Feb: Meltdown bug fixes

Arbutus, Feb 16: Bug in OpenStack High Availability network implementation
● HA was removed after about 2 weeks of instability.

Graham, Aug. 21 to Sep. 5 (ongoing recovery): Facility upgrade to dual 3MW transformers.
● Two separate UPS systems backed by their own generator.
● /project filesystem corruption: 2% of files needed to be restored from tape (8M files!). See https://status.computecanada.ca (Sept. 5) for details. Note that restore has only just completed.
● There also seem to be some remaining filesystem issues (/home).

Cedar, Aug 30: Router and switch firmware upgrade. Vancouver link is now 100 GBit! About 60 minutes. Jobs continued to run.

August: New Atlas Tier 1 is in production. Data migration from old to new is complete.

Arbutus, July 22: BCNet network maintenance - short disruptions to cloud network access.

Upgrades and Maintenance

Orcinus, October: 2 day power systems maintenance.

Orcinus, Mar. 31, 2019: Defunding! Last of the WestGrid legacy systems.

Cedar, Oct. 15-19: 1 week major outage
● OS updates (to CentOS 7.5)
● Related Lustre and Omnipath upgrades
● Intel OPA (Omnipath) Cable issues

Cedar, Planning:
1. ownCloud upgrade
2. Nearline (tape) service (ready for RAC 2019)
3. Cloud partition

Graham, Oct 9-10: Electrical work by regional utility with complete outage.

Arbutus cloud, October 3, 2018 (in progress): Major outage for cloud expansion.
● Additional ~1,400 cores and 3.5 PB usable storage.
● DB service with 2 new DBaaS nodes
● New OpenStack version.

WG RAC 2018 Stats

Number of Applications from Members: 149

Number of Applications receiving an allocation: 147 (99% of WG applications received some allocation)

Average science score: 3.33 (lower limit was 2.0; CC average was 3.33)

Number of GPU asks: 21 (451 CY allocated)

Descriptor (Score): Patrick’s Comments
● Exceptional (5): This is almost never given, and is only for the highest quality research and proposal.
● Outstanding (4): Excellent science, excellent technical approaches with high quality proposals.
● Very strong (3): Small impressions can make a difference between “Strong” and “Very Strong”
● Strong (2): RAC 2018 cutoff was 2.0, so “Strong” proposals are insufficient for an allocation!
● Moderate (1): A passable proposal with no significant issues.
● Insufficient (0): Significant issues either scientific or technical.

RAC Resource Shortfall

The major issue by far is the lack of resources. The basic numbers are shortfalls of 2x in computational cores (available is about 50% of the ask), and 5x in GPU cores across Compute Canada. Similar numbers are expected for RAC 2019 after legacy systems are defunded and the new GP4 system is installed.
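To make the shortfall figures concrete, the sketch below converts a shortfall factor into the fraction of the total ask that can be met. It assumes "shortfall of Nx" means the ask is N times the available resource, which is an interpretation rather than a definition given in the presentation; the function name is hypothetical.

```python
def satisfiable_fraction(shortfall_factor: float) -> float:
    """Fraction of the total ask that can be met.

    Assumes shortfall_factor = ask / available (an interpretation).
    """
    if shortfall_factor <= 0:
        raise ValueError("shortfall factor must be positive")
    return 1.0 / shortfall_factor


# Shortfall factors quoted for RAC 2018 across Compute Canada.
shortfalls = {"computational cores": 2.0, "GPU cores": 5.0}

for resource, factor in shortfalls.items():
    share = satisfiable_fraction(factor)
    print(f"{resource}: about {share:.0%} of the ask can be allocated")
```

On this reading, a 5x GPU shortfall corresponds to about one fifth of the GPU ask being met, in line with the 21% of GPU asks allocated in RAC 2018.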

The overwhelming impression from a detailed re-reading of the RAC 2018 WestGrid proposals is that of well-written, well-justified asks from competent and generally experienced teams. The prevalent complaint from the survey, RAC proposals and informal discussion is that of very long queue times and inability to acquire sufficient resources to carry out a research program.

So a strong recommendation is that WestGrid should consider acquiring additional resources.

Note: institutions are re-purposing legacy systems or acquiring new hardware for institutional ARC.

Under Pressure!

RAC 2018 Results: GPUs Really Bad

RAC 2018: 21% of asks allocated!

GPUs and Machine Learning

● Major increase in GPU asks!
  ○ Machine Learning is the major use
  ○ Increasing demand for production training
● Both Cedar and Graham have GPUs (P100s)
  ○ Béluga will have lots of them!
  ○ Current estimate on linear extrapolation: <30% of asks can be satisfied!
● Future Compute Canada planning
  ○ Opportune time as CC is working with ISED and Federal Gov’t for DRI.

Machine Learning asks (columns: # asks, WG, average (median) GY per ask)
● Component: 17, 11%, 23 (96). Projects which use Machine Learning as a component of the proposed methods.
● Research: 4, 3%, 87 (5). Projects which are actively researching machine learning algorithms.

GPU requests will require very strong justifications.

Platforms and Portals

The Research Platforms and Portals (RPP) Competition enables communities to develop research projects that improve access to shared datasets, enhance existing online research tools and facilities, or advance national or international research collaborations.

RPP Development and Management

● NO DIRECT SUPPORT OFFERED BY CC.
● There is a need for such tier 3 support
  ○ Requests for support.
  ○ Tickets for specific sysadmin help.
  ○ Security breaches due to naive sysadmin.
● WG has expertise
  ○ Cloud National Team (WG-centred) provides tier 2 support
  ○ Sysadmin generally
● WG providing some support (.5 FTE for LSARP)
● CC has Middleware national team developing useful services (like single-sign-on)

WG Proposal to hire jr developer/devops

WestGrid RPP 2018, # Applications
● Portals (large): 16 (64%)
● Platforms (smaller): 6 (24%)
● External portal: 3 (12%)
● Total: 25
Up from 0 three years ago!

RAC Best Practices

RAC Best Practices on-site presentations from WG Director of Operations
1. Introduction by site teams (½ hour, site-dependent)
2. WestGrid Introduction by Patrick (½ hour)
3. RAC Best practices by Patrick (1 hour)
Plus Researcher Consultation led by Patrick (1 hour as necessary)

University of Manitoba Monday September 17

University of Victoria Thursday September 20

University of Alberta Monday September 24

University of Calgary Thursday September 27

University of British Columbia Friday October 12

Simon Fraser University Thursday October 11

Online: Compute Canada Public Q&A Thursday October 4

Online recap and office hours: Mid-October

WestGrid Annual General Meeting

Member Confirmation

Institution: Voting Member Representative (Attending)
● UBC: Steve Cundy, Director of ARC (Proxy) (Remote)
● SFU: No member in attendance
● UVic: Wency Lum, CIO (Proxy) (Remote)
● UofC: No member in attendance
● UofA: Walter Dixon, AVP(R) (Remote)
● USask: Dena McMartin, AVP(R) (In person)
● UofM: James Kerr, Director of Technology Services (Remote)

Agenda

1. Call to Order & Member Representative Confirmation
2. Approval of Agenda
3. Approval of Financial Statements
4. Appointment of Public Accountant
5. Approval of Director Terms
6. Other Business
7. Adjournment

Thank you