LiCO 6.0.0 Installation Guide (for SLES) Seventh Edition (August 2020)
© Copyright Lenovo 2018, 2020.

LIMITED AND RESTRICTED RIGHTS NOTICE: If data or software is delivered pursuant to a General Services Administration (GSA) contract, use, reproduction, or disclosure is subject to restrictions set forth in Contract No. GS-35F-05925.

Reading instructions

• To ensure that you get correct command lines using the copy/paste function, open this Guide with Adobe Acrobat Reader, a free PDF viewer. You can download it from the official Web site https://get.adobe.com/reader/.
• Replace values in angle brackets with the actual values. For example, when you see <*_USERNAME> and <*_PASSWORD>, enter your actual username and password.
• Between the command lines and in the configuration files, ignore all annotations starting with #.

Contents

Reading instructions

Chapter 1. Overview
    Introduction to LiCO
    Typical cluster deployment
    Operating environment
    Supported servers and chassis models
    Prerequisites

Chapter 2. Deploy the cluster environment
    Install an OS on the management node
    Deploy the OS on other nodes in the cluster
    Configure environment variables
    Create a local repository
    Install Lenovo xCAT
    Prepare OS mirrors for other nodes
    Set xCAT node information
    Add host resolution
    Configure DHCP and DNS services
    Install a node OS through the network
    Create local repository for other nodes
    Configure the memory for other nodes
    Checkpoint A
    Install infrastructure software for nodes
    List of infrastructure software to be installed
    Configure a local Zypper repository for the management node
    Configure a local Zypper repository for login and compute nodes
    Configure LiCO dependencies repositories
    Obtain the LiCO installation package
    Configure the local repository for LiCO
    Configure the xCAT local repository
    Install Slurm
    Configure NFS
    Configure Chrony
    GPU driver installation
    Configure Slurm
    Install Icinga2
    Install MPI
    Install Singularity
    Checkpoint B

Chapter 3. Install LiCO dependencies
    Cluster check
    Check environment variables
    Check the LiCO dependencies repository
    Check the LiCO repository
    Check the OS installation
    Check NFS
    Check Slurm
    Check MPI and Singularity
    Check OpenHPC installation
    List of LiCO dependencies to be installed
    Install RabbitMQ
    Install MariaDB
    Install InfluxDB
    Install Confluent
    Configure user authentication
    Install OpenLDAP-server
    Install libuser
    Install OpenLDAP-client
    Install nss-pam-ldapd

Chapter 4. Install LiCO
    List of LiCO components to be installed
    Install LiCO on the management node
    Install LiCO on the login node
    Install LiCO on the compute nodes
    Configure the LiCO internal key

Chapter 5. Configure LiCO
    Configure the service account
    Configure cluster nodes
    Room information
    Logic group information
    Room row information
    Rack information
    Chassis information
    Node information
    Configure generic resources
    Gres information
    List of cluster services
    Configure LiCO components
    lico-vnc-mond
    lico-portal
    Initialize the system
    Initialize users
    Import system images

Chapter 6. Start and log in to LiCO
    Start LiCO
    Log in to LiCO
    Configure LiCO services

Chapter 7. Appendix: Important information
    Configure VNC
    Standalone VNC installation
    VNC batch installation
    Configure the Confluent Web console
    LiCO commands
    Change a user’s role
    Resume a user
    Delete a user
    Import a user
    Import AI images
    Generate nodes.csv in confluent
    Firewall settings
    Set firewall on the management node
    Set firewall on the login node
    SSHD settings
    Improve SSHD security
    Slurm issues troubleshooting
    Node status check
    Memory allocation error
    Status setting error
    InfiniBand issues troubleshooting
    Installation issues troubleshooting
    xCAT issues troubleshooting
    MPI issues troubleshooting
    Edit nodes.csv from xCAT dumping data
    Notices and trademarks

Chapter 1. Overview

Introduction to LiCO

Lenovo Intelligent Computing Orchestration (LiCO) is infrastructure management software for high-performance computing (HPC) and artificial intelligence (AI). It provides features like cluster management and monitoring, job scheduling and management, cluster user management, account management, and file system management.

With LiCO, users can centralize resource allocation in one supercomputing cluster and carry out HPC and AI jobs simultaneously. Users can perform operations by logging in to the management system interface with a browser, or by using command lines after logging in to a cluster login node with another Linux shell.

Typical cluster deployment

This Guide is based on the typical cluster deployment that contains management, login, and compute nodes.

[Figure 1. Typical cluster deployment. Labeled elements: public network, login node, management node, compute node, parallel file system, nodes BMC interface, nodes eth interface, high speed network interface. Link types: high speed network, TCP networking.]

Elements in the cluster are described in the table below.

Table 1. Description of elements in the typical cluster

Element                        Description
Management node                Core of the HPC/AI cluster, undertaking primary functions such as cluster management, monitoring, scheduling, strategy management, and user & account management.
Compute node                   Completes computing tasks.
Login node                     Connects the cluster to the external network or cluster. Users must use the login node to log in and upload application data, develop compilers, and submit scheduled tasks.
Parallel file system           Provides a shared storage function. It is connected to the cluster nodes through a high-speed network. Parallel file system setup is beyond the scope of this Guide. A simple NFS setup is used instead.
Nodes BMC interface            Used to access the node’s BMC system.
Nodes eth interface            Used to manage nodes in the cluster. It can also be used to transfer computing data.
High speed network interface   Optional. Used to support the parallel file system. It can also be used to transfer computing data.

Note: LiCO also supports the cluster deployment that only contains the management and compute nodes. In this case, all LiCO modules installed on the login node need to be installed on the management node.

Operating environment

Cluster server: Lenovo ThinkSystem servers

Operating system: SUSE Linux Enterprise Server (SLES) 15 SP1

Client requirements:
• Hardware: CPU of 2.0 GHz or above, memory of 8 GB or above
• Browser: Chrome (V 62.0 or higher) or Firefox (V 56.0 or higher) recommended
• Display resolution: 1280 x 800 or above

Supported servers and chassis models

LiCO can be installed on certain servers, as listed in the table below.

Table 2. Supported servers

Product code   Machine type       Product name
sd530          7X21               Lenovo ThinkSystem SD530 (0.5U)
sd650          7X58               Lenovo ThinkSystem SD650 (2 nodes per 1U tray)
sr630          7X01, 7X02         Lenovo ThinkSystem SR630 (1U)
sr645          7D2X, 7D2Y         Lenovo ThinkSystem SR645 (1U)
sr650          7X05, 7X06         Lenovo ThinkSystem SR650 (2U)
sr655          7Y00, 7Z01         Lenovo ThinkSystem SR655 (2U)
sr665          7D2V, 7D2W         Lenovo ThinkSystem SR665 (2U)
sr670          7Y36, 7Y37, 7Y38   Lenovo ThinkSystem SR670 (2U)
sr850          7X18, 7X19         Lenovo ThinkSystem SR850 (2U)
sr850p         7D2F, 7D2G, 7D2H   Lenovo ThinkSystem SR850P (2U)
sr950          7X11, 7X12, 7X13   Lenovo ThinkSystem SR950 (4U)

LiCO can be installed on certain chassis models, as listed in the table below.

Table 3. Supported chassis models

Product code   Machine type       Model name
d2             7X20               D2 Enclosure (2U)
n1200          5456, 5468, 5469   NeXtScale n1200 (6U)

Prerequisites

• Refer to LiCO best recipe to ensure that the cluster hardware uses proper firmware levels, drivers, and settings: https://support.lenovo.com/us/en/solutions/ht507011.
• Refer to the OS part of LeSI 20A_SI best recipe to install the OS security patch: https://support.lenovo.com/us/en/solutions/HT510293.
• The installation described in this Guide is based on SLES 15 SP1.
• A SLE-15-SP1-Installer or SLE-15-SP1-Packages local repository should be added on the management node.
• Unless otherwise stated in this Guide, all commands are executed on the management node.
• To enable the firewall, modify the firewall rules according to “Firewall settings” in Chapter 7.
• It is important to regularly patch and update components and the OS to prevent security vulnerabilities. Additionally, it is recommended that known updates at the time of installation be applied during or immediately after the OS deployment to the managed nodes and prior to the rest of the LiCO setup steps.

Chapter 2. Deploy the cluster environment

If the cluster environment already exists, you can skip this chapter.

Install an OS on the management node

Install an official version of SLES 15 SP1 on the management node. You can select the minimum installation.

Run the following commands to configure the memory and restart the OS:

    echo '* soft memlock unlimited' >> /etc/security/limits.conf
    echo '* hard memlock unlimited' >> /etc/security/limits.conf
    reboot
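The two echo commands above append to limits.conf every time they run, so repeating this step leaves duplicate entries. The sketch below is one possible way to make the step repeatable. It is not part of the LiCO procedure: the helper name add_limit is hypothetical, and the demonstration writes to a scratch file rather than to /etc/security/limits.conf.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from this Guide):
# append a line to a file only if an identical line is not already there.
add_limit() {
    line="$1"
    file="$2"
    # -x matches the whole line, -F treats the pattern as a fixed string,
    # so the leading '*' is not interpreted as a regular expression.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file, so nothing under /etc is modified.
scratch=$(mktemp)
add_limit '* soft memlock unlimited' "$scratch"
add_limit '* hard memlock unlimited' "$scratch"
add_limit '* hard memlock unlimited' "$scratch"   # repeated call adds nothing
cat "$scratch"
rm -f "$scratch"
```

On the management node, the same helper could be pointed at /etc/security/limits.conf as root in place of the echo commands. After the reboot, running ulimit -l in a new login session should report unlimited if the new limits were applied.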