Reference Architecture: Canonical Charmed Kubernetes on Lenovo ThinkSystem Servers for AI Development

Last update: 15 December 2020
Version 1.0

  • Reference Architecture for deploying and monitoring a Kubernetes environment
  • Validated architecture for Kubernetes-based AI development
  • Describes hardware and software infrastructure components for installation and deployment
  • AI workload deployment with Lenovo Intelligent Computing Orchestration (LiCO) and Kubeflow

Miroslav Hodak, Lenovo
Andrey Grebennikov, Canonical
J.J. Falkanger, Lenovo

Table of Contents

1 Introduction
2 Business problem and business value
  2.1 Business problem
  2.2 Business value
3 Requirements
  3.1 Functional requirements
  3.2 Non-functional requirements
4 Architectural overview
  4.1 Software Architecture
    4.1.1 Kubernetes
    4.1.2 Kubernetes and Canonical
    4.1.3 MAAS (Metal as a Service) physical cloud
    4.1.4 Juju modelling tool
    4.1.5 Why use Juju?
    4.1.6 Software versions
  4.2 Hardware Architecture
    4.2.1 Rack layout
    4.2.2 Server components firmware versions
5 Component model
  5.1 Ubuntu Kubernetes components
    5.1.1 Storage charms
    5.1.2 Kubernetes charms
    5.1.3 Resource charms
    5.1.4 Network space support
    5.1.5 Monitoring and Logging tools
  5.2 Cluster logging tools
    5.2.1 Graylog
    5.2.2 Elasticsearch
    5.2.3 Filebeat
  5.3 Monitoring the cluster
    5.3.1 Prometheus
    5.3.2 Grafana
    5.3.3 Telegraf
6 Operational model
  6.1 The node lifecycle
    6.1.1 New
    6.1.2 Commissioning
    6.1.3 Ready
    6.1.4 Allocated
    6.1.5 Deploying
    6.1.6 Releasing
  6.2 Install MAAS
    6.2.1 Configure Your Hardware
    6.2.2 Install Ubuntu Server
    6.2.3 MAAS Installation
  6.3 MAAS initial configurations
    6.3.1 MAAS Credentials
    6.3.2 Enlist and commission servers
    6.3.3 Set up MAAS KVM pods
  6.4 Juju components
    6.4.1 Juju controller - the heart of Juju
    6.4.2 Charms
    6.4.3 Bundles
    6.4.4 Provision
    6.4.5 Deploy
    6.4.6 Monitor and manage
    6.4.7 Comparing Juju to other configuration management tools
  6.5 Monitoring
    6.5.1 Observability Tools
  6.6 Log Aggregation
7 Deployment considerations
  Machine Learning platforms
  7.1 LiCO as a machine learning/deep learning platform
  7.2 Kubeflow as a machine learning platform
    7.2.1 Kubeflow and charms
  7.3 Server /
