A Reliable Booting System for Zynq Ultrascale+ MPSoC Devices
A reliable booting system for Zynq Ultrascale+ MPSoC devices

An embedded solution that provides fallbacks in different parts of the Zynq MPSoC booting process, to assure successful booting into a Linux operating system.

A thesis presented for the Bachelor of Science in Electrical Engineering at the University of Applied Sciences Utrecht

Name: Nekija Džemaili
Student ID: 1702168
University supervisor: Corné Duiser
CERN supervisors: Marc Dobson & Petr Žejdl
Field of study: Electrical Engineering (embedded systems)

CERN-THESIS-2021-031 17/03/2021
February 15th, 2021
Geneva, Switzerland

Disclaimer

The board of the foundation HU University of Applied Sciences in Utrecht does not accept any form of liability for damage resulting from usage of data, resources, methods, or procedures as described in this report. Duplication without consent of the author or the college is not permitted. If the graduation assignment is executed within a company, explicit consent of the company is necessary for duplication or copying of text from this report.

Preface

This thesis was written for the BSc Electrical Engineering degree of the HU University of Applied Sciences Utrecht, the Netherlands. During the degree, I specialized in embedded systems and found myself in an environment that allowed me to excel as an engineer.
I'd like to thank my professor Corné Duiser for being my mentor throughout my studies, for helping me through the thesis as my examiner, and for answering my many questions. I also thank my other professors, in particular Dr. Franc van der Bent, Hubert Schuit, and Bart Bozon, for their interesting courses on embedded systems and their fun company in the lab.

The thesis was carried out over a period of 14 months in the CMS Data acquisition & trigger group at CERN. The CMS DAQ group granted me the opportunity to work on a challenging and interesting project involving the Zynq Ultrascale+ MPSoC. The thesis is written for engineers who want to learn about the Zynq Ultrascale+ MPSoC and its development.

I'd like to thank Dr. Petr Žejdl for mentoring me during the project. His guidance and kindness are tremendously appreciated. He supported me not only during working hours, but also in his free time. His encouragement and faith motivated me to excel during the project. My countless questions were all answered thanks to Petr's expertise in the field of embedded systems.

I'd like to thank Dr. Marc Dobson for being my supervisor and supporting me during my time at CERN. His constructive criticism and keenness helped me many times during the thesis writing and the SoC meetings. His knowledge of the CMS experiment and the data acquisition system helped me understand what I have been working towards.

Lastly, I'd like to thank Dominique Gigi, Dr. Emilio Meschi, Dr. Atilla Racz, and Dr. Frans Meijers, along with the rest of the CMS DAQ team, for their help and kindness during my time at CERN. They provided me with a friendly working environment in a time of global pandemic and uncertainty.

Nekija Džemaili
Geneva, Switzerland
15th of February, 2021

Abstract

CERN is working on the High-Luminosity LHC upgrade, which will be installed in 2025.
As a result, the CMS experiment and its data acquisition (DAQ) system will also be upgraded. The upgrade of the CMS DAQ system involves the installation of new electronics that will also host the Zynq Ultrascale+ MPSoC (Multiprocessor System-on-Chip) from Xilinx. The Zynq Ultrascale+ MPSoC will run control and monitoring software on a Linux operating system (OS). Booting a Linux OS on the Zynq MPSoC involves a complex multi-stage booting process. The complexity of the booting process introduces possible failures that can prevent the Zynq MPSoC from booting correctly. This thesis presents the research, design, implementation, and testing of a reliable booting system that recovers the Zynq MPSoC from boot failures, upgrade failures, and running failures. The reliable booting system consists of five fallbacks in different parts of the Zynq MPSoC booting process, to account for a wide range of failures. The fallbacks have been designed to bring the Zynq MPSoC to a well-known booted state after a failure. The booting system can also boot through the network and perform automatic firmware upgrades with a rollback on failure. Users of the hardware are automatically notified when a failure has been detected and a fallback has been triggered in the system. The booting system is automatically built and packaged by a continuous integration build system. It has been made portable to new hardware by integrating the system in an easy-to-use board support package. Research on the possible failures in the Zynq MPSoC has been carried out. The test results show that the fallbacks successfully recover the Zynq MPSoC from all the researched failures. The results also highlight a few areas that can be researched in a follow-up project to further improve the reliable booting system.

Table of contents

Introduction
1 The CERN laboratory
  1.1 Introduction to CERN
  1.2 CERN's accelerator complex
  1.3 The CMS experiment
    1.3.1 CMS sub-detectors
    1.3.2 CMS DAQ system
2 Project description
  2.1 Background
  2.2 Objective
  2.3 Requirements & Preconditions
  2.4 Reliability requirement
  2.5 Final products
3 Background research
  3.1 Zynq MPSoC workings and internals
    3.1.1 Zynq MPSoC booting overview
    3.1.2 Zynq MPSoC hardware overview
    3.1.3 The Application Processing Unit (APU)
    3.1.4 I/O peripherals and interfaces
    3.1.5 The Platform Management Unit (PMU)
    3.1.6 The Configuration Security Unit (CSU)
  3.2 Zynq MPSoC booting process
    3.2.1 The PMU BootROM
    3.2.2 The CSU BootROM
    3.2.3 The first-stage bootloader (FSBL)
    3.2.4 The ARM trusted firmware (ATF)
    3.2.5 The second-stage bootloader (U-Boot)
    3.2.6 Kernel booting
    3.2.7 Booting process summary
  3.3 Zynq MPSoC watchdog timers
    3.3.1 Watchdog timer workings
    3.3.2 Watchdog timer heartbeat daemon in Linux
  3.4 The Linux crashkernel
    3.4.1 Crashkernel workings
    3.4.2 Early kdump support in CentOS 8
    3.4.3 Crashkernel user notifications
    3.4.4 Crashkernel support for Zynq MPSoC
4 Research and analysis
  4.1 Failure requirements
    4.1.1 Booting failures
    4.1.2 Upgrade failures
    4.1.3 Running failures
  4.2 Failure categorization
    4.2.1 Pre-boot failures
    4.2.2 FSBL failures
    4.2.3 U-Boot stage failures
    4.2.4 Linux kernel boot failures
    4.2.5 Other failures
  4.3 Follow-up on boot and upgrade failures
    4.3.1 Backup-boots, boot counting and firmware upgrades
    4.3.2 Existing boot counting feature in U-Boot
  4.4 Summary and discussion of failures and fallbacks
    4.4.1 Tradeoff between fallbacks
    4.4.2 SD-card backup boot device
5 High-level design
  5.1 Reliable booting system
  5.2 RELBOOT & RELUP mechanisms
    5.2.1 RELBOOT & RELUP script
    5.2.2 RELBOOT & RELUP Linux daemon
6 Implementation
  6.1 Golden image search mechanism
    6.1.1 Boot image preparation
    6.1.2 Enabling FSBL debug info
  6.2 RELBOOT & RELUP mechanisms
    6.2.1 Firmware structure on TFTP server
    6.2.2 RELBOOT & RELUP script low-level design
    6.2.3 Script integration in boot image
    6.2.4 RELBOOT & RELUP Linux daemon implementation
  6.3 Crashkernel mechanism
    6.3.1 Kernel configuration
    6.3.2 Memory reservation
    6.3.3 Device-tree modifications
    6.3.4 Enabling and starting kdump
    6.3.5 Crashkernel workarounds
  6.4 Watchdog timer
    6.4.1 PMU firmware configuration
    6.4.2 Kernel configuration
    6.4.3 Device-tree modifications