XC™ Series Boot Troubleshooting Guide
Total Page:16
File Type:pdf, Size:1020Kb
XC™ Series Boot Troubleshooting Guide (CLE 6.0.UP07) S-2565 Contents Contents 1 About the XC™ Series Boot Troubleshooting Guide..............................................................................................5 2 Introduction to Troubleshooting a Boot of an XC™ Series System........................................................................ 8 2.1 An Overview of the Boot Process..............................................................................................................8 3 SMW and CLE Hardware Configuration and Cabling Concepts...........................................................................11 4 SMW Daemons, Processes, and Logs................................................................................................................. 15 4.1 Daemons on a Stand-alone SMW........................................................................................................... 15 4.2 Daemons on an SMW HA System.......................................................................................................... 19 4.3 SMW Log File Locations..........................................................................................................................22 4.4 Time Synchronization Among XC™ Series System Components...........................................................23 5 Use the Bootlog Profiler Tool to Analyze Boot Events.......................................................................................... 29 6 About Cray Scalable Services.............................................................................................................................. 32 7 Anatomy of an XC System Boot with xtbootsys....................................................................................................34 7.1 About Boot Automation Files................................................................................................................... 50 8 The Booting Process from the CLE Node View....................................................................................................52 8.1 Booting with PXE Boot for Boot and SDB Nodes.................................................................................... 53 8.2 Booting tmpfs Method with bnd............................................................................................................... 55 8.3 Booting Netroot Method with bnd............................................................................................................ 56 8.4 cray-ansible and Ansible Logs on a CLE Node....................................................................................... 58 9 Commands Helpful in Troubleshooting a Boot..................................................................................................... 60 9.1 Check RSMS Daemons...........................................................................................................................60 9.2 Check diod daemon.................................................................................................................................60 9.3 Check cray-cfgset-cache Daemon.......................................................................................................... 61 9.4 Check DHCP or TFTP Daemons.............................................................................................................61 9.5 Check Console Messages.......................................................................................................................62 9.6 Log In to a Node...................................................................................................................................... 62 9.7 Check Daemons Using xtalive.................................................................................................................63 9.8 Check STONITH on Blade Controller......................................................................................................64 9.9 Check for Cabling Issues.........................................................................................................................64 9.10 Check Hardware Inventory.................................................................................................................... 64 9.11 Check Boot Configuration......................................................................................................................65 9.12 Enable or Disable a Component............................................................................................................65 9.13 Check Status of Nodes..........................................................................................................................66 9.14 Change Node Role Between Service and Compute............................................................................. 66 9.15 Check NIMS Map.................................................................................................................................. 66 9.16 Check Which Boot Images Have Been Assigned..................................................................................67 S2565 2 Contents 9.17 Check Node NIMS Group, Boot Image, and Kernel Parameter Assignment........................................ 67 9.18 Check Whether Node is Using Netroot or tmpfs....................................................................................68 9.19 Check Which Boot Images Exist on the System................................................................................... 68 9.20 Check Which Image Roots Exist on the System................................................................................... 69 9.21 Observe Network Traffic on SMW Network Interfaces.......................................................................... 69 9.22 Check Firewall....................................................................................................................................... 70 9.23 Search a Config Set.............................................................................................................................. 70 9.24 List the Ansible Playbooks in a Config Set and Image Root................................................................. 71 9.25 Search the Ansible Playbooks in a Config Set and Image Root............................................................71 9.26 Search Ansible Plays on a Node........................................................................................................... 72 9.27 Check for Warnings, Alerts, and Reservations......................................................................................73 9.28 Check for Locks.....................................................................................................................................74 9.29 Check for PCIe Link Errors....................................................................................................................74 9.30 Check for Hardware Errors....................................................................................................................75 9.31 Check for LCB and Router Errors..........................................................................................................75 9.32 Check Time on a Node..........................................................................................................................76 10 Techniques for Troubleshooting a Failed Boot....................................................................................................77 10.1 xtcli status Fails..................................................................................................................................... 77 10.2 xtbootsys Fails with xtbounce Error.......................................................................................................78 10.3 xtbootsys Fails with rtr Error.................................................................................................................. 79 10.4 xtbootsys Fails with xtcablecheck Error.................................................................................................79 10.5 Boot or SDB Node Fails to PXE Boot....................................................................................................80 10.6 DHCP Fails to Provide an IP Address to the PXE Booting Node or TFTP Fails to Transfer Any Files to the Node................................................................................................................................. 82 10.7 Possible Problem with Boot Image Assignment.................................................................................... 83 10.8 xtbootsys Exits After Failure to Boot the Boot and SDB Nodes............................................................ 83 10.9 xtbootsys Exits After Timeout While Booting the Boot and SDB Nodes................................................84 10.10 xtbootsys Waits for Input After Timeout While Booting the Boot and SDB Nodes...............................85 10.11 xtbootsys Never Begins to Boot Service Nodes.................................................................................. 86 10.12 xtbootsys Never Begins to Boot Compute Nodes............................................................................... 88 10.13 cray-ansible Fails in Init Phase on any Node...................................................................................... 89 10.14 cray-ansible Fails in Booted Phase on Any Node..............................................................................