XC™ Series Boot Troubleshooting Guide (CLE 6.0.UP01) Contents

About the XC™ Series Boot Troubleshooting Guide
Introduction to Troubleshooting a Boot of an XC™ Series System
SMW and CLE Hardware Configuration and Cabling Concepts
SMW Daemons, Processes, and Logs
  Daemons on a Stand-alone SMW
  Daemons on an SMW HA System
  SMW Log File Locations
  Time Synchronization Among XC™ Series System Components
Anatomy of an XC System Boot with xtbootsys
The Booting Process from the CLE Node View
  Booting with PXE Boot for Boot and SDB Nodes
  Booting tmpfs Method with bnd
  Booting Netroot Method with bnd
  cray-ansible and Ansible Logs on a CLE Node
Commands Helpful in Troubleshooting a Boot
  Check RSMS Daemons
  Check diod daemon
  Check cray-cfgset-cache Daemon
  Check DHCP or TFTP Daemons
  Check Console Messages
  Log In to a Node
  Check Daemons Using xtalive
  Check STONITH on Blade Controller
  Check for Cabling Issues
  Check Hardware Inventory
  Check Boot Configuration
  Enable or Disable a Component
  Check Status of Nodes
  Change Node Role Between Service and Compute
  Check NIMS Map
  Check Which Boot Images Have Been Assigned
  Check Node NIMS Group, Boot Image, and Kernel Parameter Assignment
  Check Whether Node is Using Netroot or tmpfs
  Check Which Boot Images Exist on the System
  Check Which Image Roots Exist on the System
  Observe Network Traffic on SMW Network Interfaces
  Check Firewall
  Search a Config Set
  List the Ansible Playbooks in a Config Set and Image Root
  Search the Ansible Playbooks in a Config Set and Image Root
  Search Ansible Plays on a Node
  Check for Warnings, Alerts, and Reservations
  Check for Locks
  Check for PCIe Link Errors
  Check for Hardware Errors
  Check for LCB and Router Errors
  Check Time on a Node
Techniques for Troubleshooting a Failed Boot
  xtcli status Fails
  xtbootsys Fails with xtbounce Error
  xtbootsys Fails with rtr Error
  xtbootsys Fails with xtcablecheck Error
  Boot or SDB Node Fails to PXE Boot
  Possible Problem with Boot Image Assignment
  xtbootsys Exits After Failure to Boot the Boot and SDB Nodes
  xtbootsys Exits After Timeout While Booting the Boot and SDB Nodes
  xtbootsys Waits for Input After Timeout While Booting the Boot and SDB Nodes
  xtbootsys Never Begins to Boot Service Nodes
  xtbootsys Never Begins to Boot Compute Nodes
  cray-ansible Fails in Init Phase on any Node
  cray-ansible Fails in Booted Phase on Any Node
  Node Fails to Mount Local Storage
  Node Fails to Mount NFS File System
  Node Fails to Mount Direct-attached Lustre (DAL)
  Node Fails to Mount External Lustre File System
  Node Fails to Mount DVS-projected File System

Publication number: S2565
