LiCO 5.2.1 Installation Guide (for SLES) Second Edition (January 2019)
© Copyright Lenovo 2018, 2019.

LIMITED AND RESTRICTED RIGHTS NOTICE: If data or software is delivered pursuant to a General Services Administration (GSA) contract, use, reproduction, or disclosure is subject to restrictions set forth in Contract No. GS-35F-05925.

Reading instructions

• To ensure that you get correct command lines using the copy/paste function, open this Guide with Adobe Acrobat Reader, a free PDF viewer. You can download it from the official Web site https://get.adobe.com/reader/.
• Replace values in angle brackets with the actual values. For example, when you see <*_USERNAME> and <*_PASSWORD>, enter your actual username and password.
• In the command lines and in the configuration files, ignore all annotations starting with #.

Contents

Reading instructions . ii

Chapter 1. Overview . 1
  Introduction to LiCO . 1
  Typical cluster deployment . 1
  Operating environment . 2
  Supported servers and chassis models . 3
  Prerequisites . 3

Chapter 2. Deploy the cluster environment . 5
  Install an OS on the management node . 5
  Deploy the OS on other nodes in the cluster . 5
  Configure environment variables . 5
  Create a local repository . 8
  Install Lenovo xCAT . 8
  Prepare OS mirrors for other nodes . 9
  Set xCAT node information . 9
  Add host resolution . 10
  Configure DHCP and DNS services . 10
  Install a node OS through the network . 10
  Create local repository for other nodes . 10
  Configure the memory for other nodes . 11
  Checkpoint A . 11
  Install infrastructure software for nodes . 12
  List of infrastructure software to be installed . 12
  Configure a local Zypper repository for the management node . 12
  Configure a local Zypper repository for login and compute nodes . 12
  Configure LiCO dependencies repositories . 13
  Install Slurm . 13
  Configure NFS . 14
  Configure NTP . 14
  GPU driver installation . 15
  CUDA installation . 16
  Configure Slurm . 17
  Install Ganglia . 19
  Install MPI . 19
  Install Singularity . 20
  Checkpoint B . 21

Chapter 3. Install LiCO dependencies . 23
  Cluster check . 23
  Check environment variables . 23
  Check the LiCO dependencies repository . 23
  Check the OS installation . 23
  Check NFS . 23
  Check Slurm . 24
  Check Ganglia . 24
  Check MPI . 25
  Check Singularity . 25
  Check OpenHPC installation . 25
  List of LiCO dependencies to be installed . 25
  Install RabbitMQ . 26
  Install PostgreSQL . 26
  Install InfluxDB . 26
  Install Confluent . 27
  Configure user authentication . 28
  Install OpenLDAP-server . 28
  Install libuser . 29
  Install OpenLDAP-client . 29
  Install nss-pam-ldapd . 29
  Install Gmond GPU plug-in . 31

Chapter 4. Install LiCO . 33
  List of LiCO components to be installed . 33
  Obtain the LiCO installation package . 34
  Configure the local Zypper repository for LiCO . 34
  Install LiCO on the management node . 34
  Install LiCO on the login node . 35
  Install LiCO on the compute nodes . 36

Chapter 5. Configure LiCO . 37
  Configure the service account . 37
  Configure cluster nodes . 37
  Room information . 37
  Logic group information . 37
  Room row information . 38
  Rack information . 38
  Chassis information . 38
  Node information . 39
  List of cluster services . 41
  Infrastructure configuration . 42
  Database configuration . 42
  Login configuration . 42
  Storage configuration . 42
  Scheduler configuration . 43
  Alert configuration . 43
  Cluster configuration . 43
  Functional configuration . 43
  User configuration . 44
  Configure LiCO components . 45
  lico-vnc-mond . 45
  lico-portal . 45
  lico-ganglia-mond . 46
  lico-icinga-mond . 46
  lico-icinga-plugin . 47
  lico-confluent-proxy . 47
  lico-confluent-mond . 48
  lico-wechat-agent . 48
  lico-tensorboard-proxy . 48
  Initialize the system . 49
  Initialize users . 49

Chapter 6. Start and log in to LiCO . 51
  Start LiCO . 51
  Log in to LiCO . 52
  Configure LiCO services . 52

Chapter 7. Appendix: Important information . 53
  Configure VNC . 53
  Standalone VNC installation . 53
  VNC batch installation . 53
  Configure the Confluent Web console . 53
  LiCO commands . 54
  Set the LDAP administrator password . 54
  Change a user’s role . 54
  Resume a user . 54
  Import a user . 54
  Import AI images . 54
  Security improvement . 54
  Binding settings . 54
  Firewall settings . 58
  SSHD settings . 59
  Import system images . 59
  Image bootstrap files . 60
  Create images . 60
  Import images into LiCO as system-level images . 61
  Slurm issues troubleshooting . 61
  Node status check . 61
  Memory allocation error . 61
  Status setting error . 62
  InfiniBand issues troubleshooting . 62
  Installation issues troubleshooting . 62
  xCAT issues troubleshooting . 63
  Update OS packages . 63
  Edit nodes.csv from xCAT dumping data . 63
  Notices and trademarks . 64

Chapter 1. Overview

Introduction to LiCO

Lenovo Intelligent Computing Orchestration (LiCO) is infrastructure management software for high-performance computing (HPC) and artificial intelligence (AI). It provides features such as cluster management and monitoring, job scheduling and management, cluster user management, account management, and file system management. With LiCO, users can centralize resource allocation in one supercomputing cluster and carry out HPC and AI jobs simultaneously. Users can perform operations by logging in to the management system interface with a browser, or by using command lines after logging in to a cluster login node with a Linux shell.

Typical cluster deployment

This Guide is based on the typical cluster deployment that contains management, login, and compute nodes.

Figure 1. Typical cluster deployment

Elements in the cluster are described in the table below.

Table 1. Description of elements in the typical cluster

Management node: Core of the HPC/AI cluster, undertaking primary functions such as cluster management, monitoring, scheduling, strategy management, and user & account management.
Compute node: Completes computing tasks.
Login node: Connects the cluster to the external network or cluster. Users must use the login node to log in and upload application data, develop compilers, and submit scheduled tasks.
Parallel file system: Provides a shared storage function. It is connected to the cluster nodes through a high-speed network. Parallel file system setup is beyond the scope of this Guide; a simple NFS setup is used instead.
Nodes BMC interface: Used to access the node's BMC system.
Nodes eth interface: Used to manage nodes in the cluster. It can also be used to transfer computing data.
High speed network interface: Optional. Used to support the parallel file system. It can also be used to transfer computing data.

Note: LiCO also supports the cluster deployment that only contains the management and compute nodes.
In this case, all LiCO modules installed on the login node need to be installed on the management node.

Operating environment

Cluster server: Lenovo ThinkSystem servers

Operating system: SUSE Linux Enterprise Server (SLES) 12 SP3

Client requirements:
• Hardware: CPU of 2.0 GHz or above, memory of 8 GB or above
• Browser: Chrome (V 62.0 or higher) or Firefox (V 56.0 or higher) recommended
• Display resolution: 1280 x 800 or above

Supported servers and chassis models

LiCO can be installed on certain servers, as listed in the table below.

Table 2. Supported servers

Product code   Machine type        Product name
sd530          7X21                Lenovo ThinkSystem SD530 (0.5U)
sr630          7X01, 7X02          Lenovo ThinkSystem SR630 (1U)
sr650          7X05, 7X06          Lenovo ThinkSystem SR650 (2U)
sd650          7X58                Lenovo ThinkSystem SD650 (2 nodes per 1U tray)
sr670          7Y36, 7Y37, 7Y38    Lenovo ThinkSystem SR670 (2U)

LiCO can be installed on certain chassis models, as listed in the table below.

Table 3. Supported chassis models

Product code   Machine type        Model name
d2             7X20                D2 Enclosure (2U)
n1200          5456, 5468, 5469    NeXtScale n1200 (6U)

Prerequisites

• Refer to the LiCO best recipe to ensure that the cluster hardware uses proper firmware levels, drivers, and settings: https://support.lenovo.com/us/en/solutions/ht507011.
• Refer to the OS part of the LeSI 18C_SI best recipe to install the OS security patch: https://support.lenovo.com/us/en/solutions/ht507611.
• The installation described in this Guide is based on SLES 12 SP3.
• A SLES-12-SP3-Server or SLES-12-SP3-SDK local repository should be added on the management node.
• Unless otherwise stated in this Guide, all commands are executed on the management node.
• To enable the firewall, modify the firewall rules according to “Firewall settings” on page 58.
• It is important to regularly patch and update components and the OS to prevent security vulnerabilities. For details about how to update OS packages, see “Update OS packages” on page 63. Additionally, it is recommended that known updates available at the time of installation be applied during or immediately after the OS deployment to the managed nodes, and before the remaining LiCO setup steps.

Chapter 2. Deploy the cluster environment

If the cluster environment already exists, you can skip this chapter.

Install an OS on the management node

Install an official version of SLES 12 SP3 on the management node. You can select the minimum installation. Run the following commands to configure the memory:
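The guide's own commands for this step are not reproduced in this excerpt. As a hedged sketch only, and not the documented LiCO procedure, memory configuration on an HPC management node typically means raising the locked-memory (memlock) limits so that MPI and RDMA libraries can pin memory; assuming the limits are set in /etc/security/limits.conf, such a change could look like the following:

# Hedged illustration, not the commands from the full guide: raise the
# locked-memory limits so HPC jobs can pin memory for MPI/RDMA.
cat >> /etc/security/limits.conf << 'EOF'
* soft memlock unlimited
* hard memlock unlimited
EOF

# limits.conf is read at session start, so verify from a new shell.
ulimit -l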