XC™ Series Boot Troubleshooting Guide (CLE 6.0.UP01) Contents
About the XC™ Series Boot Troubleshooting Guide
Introduction to Troubleshooting a Boot of an XC™ Series System
SMW and CLE Hardware Configuration and Cabling Concepts
SMW Daemons, Processes, and Logs
    Daemons on a Stand-alone SMW
    Daemons on an SMW HA System
    SMW Log File Locations
    Time Synchronization Among XC™ Series System Components
Anatomy of an XC System Boot with xtbootsys
The Booting Process from the CLE Node View
    Booting with PXE Boot for Boot and SDB Nodes
    Booting tmpfs Method with bnd
    Booting Netroot Method with bnd
    cray-ansible and Ansible Logs on a CLE Node
Commands Helpful in Troubleshooting a Boot
    Check RSMS Daemons
    Check diod daemon
    Check cray-cfgset-cache Daemon
    Check DHCP or TFTP Daemons
    Check Console Messages
    Log In to a Node
    Check Daemons Using xtalive
    Check STONITH on Blade Controller
    Check for Cabling Issues
    Check Hardware Inventory
    Check Boot Configuration
    Enable or Disable a Component
    Check Status of Nodes
    Change Node Role Between Service and Compute
    Check NIMS Map
    Check Which Boot Images Have Been Assigned
    Check Node NIMS Group, Boot Image, and Kernel Parameter Assignment
    Check Whether Node is Using Netroot or tmpfs
    Check Which Boot Images Exist on the System
    Check Which Image Roots Exist on the System
    Observe Network Traffic on SMW Network Interfaces
    Check Firewall
    Search a Config Set
    List the Ansible Playbooks in a Config Set and Image Root
    Search the Ansible Playbooks in a Config Set and Image Root
    Search Ansible Plays on a Node
    Check for Warnings, Alerts, and Reservations
    Check for Locks
    Check for PCIe Link Errors
    Check for Hardware Errors
    Check for LCB and Router Errors
    Check Time on a Node
Techniques for Troubleshooting a Failed Boot
    xtcli status Fails
    xtbootsys Fails with xtbounce Error
    xtbootsys Fails with rtr Error
    xtbootsys Fails with xtcablecheck Error
    Boot or SDB Node Fails to PXE Boot
    Possible Problem with Boot Image Assignment
    xtbootsys Exits After Failure to Boot the Boot and SDB Nodes
    xtbootsys Exits After Timeout While Booting the Boot and SDB Nodes
    xtbootsys Waits for Input After Timeout While Booting the Boot and SDB Nodes
    xtbootsys Never Begins to Boot Service Nodes
    xtbootsys Never Begins to Boot Compute Nodes
    cray-ansible Fails in Init Phase on any Node
    cray-ansible Fails in Booted Phase on Any Node
    Node Fails to Mount Local Storage
    Node Fails to Mount NFS File System
    Node Fails to Mount Direct-attached Lustre (DAL)
    Node Fails to Mount External Lustre File System
    Node Fails to Mount DVS-projected File System