NOISE REDUCTION IN SOLID STATE DRIVE (SSD) SYSTEM VALIDATION

A Project

Presented to the faculty of the Department of Electrical and Electronic Engineering

California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

Electrical and Electronic Engineering

by

Srishti Gupta

FALL 2020

NOISE REDUCTION IN SOLID STATE DRIVE (SSD) SYSTEM VALIDATION

A Project

by

Srishti Gupta

Approved by:

______, Committee Chair Dr. Praveen Meduri

______, Second Reader Dr. Preetham Kumar

______Date


Student: Srishti Gupta

I certify that this student has met the requirements for format contained in the University format manual, and this project is suitable for electronic submission to the library and credit is to be awarded for the project.

______, Graduate Coordinator ______Dr. Preetham B. Kumar Date

Department of Electrical and Electronic Engineering


Abstract

of

NOISE REDUCTION IN SOLID STATE DRIVE (SSD) SYSTEM VALIDATION

by

Srishti Gupta

SSD development includes several stages from design to release. Once the hardware and software for the drive have been implemented, complete testing on the actual system is critical to verify the functionality and performance of the SSD in the real world. Validation systems involve a platform of components such as datacenter servers, the PCIe bus, Quarch modules, network switches, and software tools. The major challenge with this kind of validation is that the many environmental components can introduce significant noise. Since the system validation stage is one of the last phases before the release of a product, there is a strong emphasis on debugging issues at a high velocity. Hence, considering the complexity of the system and the fixed timelines to deliver with quality, noise reduction is crucial. This project analyzes the various noise parameters at the system level, the tools to evaluate the noise, and methods to minimize it.

______, Committee Chair Dr. Praveen Meduri

______Date


TABLE OF CONTENTS

Page

List of Tables ...... vii

List of Figures ...... viii

Chapter

1. INTRODUCTION ...... 1

2. SYSTEM DESIGN AND COMPONENTS ...... 2

2.1 Overall System Design ...... 2

2.2 Hardware Components ...... 3

2.2.1 Torridon System...... 3

2.2.2 Server Platform ...... 6

2.2.3 PCIe Switches ...... 9

2.2.4 Network Infrastructure ...... 10

2.3 Software Components ...... 12

2.3.1 Operating Systems ...... 12

2.3.2 Test Framework ...... 14

2.3.3 Flexible I/O Tester ...... 16

2.3.4 Medusa Labs Test Tools Suite ...... 18

2.3.5 NVMe CLI ...... 21

2.3.6 PCIMEM ...... 23

2.3.7 Link Training Status and State Machine ...... 25


3. DATA COLLECTION AND ANALYSIS ...... 27

3.1 JIRA Tools for Data Collection ...... 27

3.2 Noise Categorization ...... 28

4. FACTORS CONTRIBUTING TO NOISE ...... 29

4.1 Key Hot Plug and Hot Swap Issues ...... 29

4.2 Linux Kernel Crash Events ...... 34

4.3 Windows Blue Screen of Death ...... 39

4.4 Unexpected Shutdown Due to Network Instability ...... 45

5. CONCLUSION ...... 48

References ...... 50


LIST OF TABLES

Tables Page

1. Differences Between Pain and Maim ...... 19

2. NVMe CLI Commands ...... 22

3. PCIe Error Messages ...... 26


LIST OF FIGURES

Figures Page

1. 28-Port Quarch Controller ...... 4

2. Flex Cable and Quarch...... 4

3. Quarch Connection to the Drive ...... 4

4. Complete Testing Set Up Using the Torridon System...... 5

5. Intel Server Board S2600WF Components...... 6

6. M.2 SSD Connectors on the Server Board ...... 7

7. Onboard OcuLink Connectors ...... 8

8. NVMe Error Handling Using VMD ...... 8

9. Broadcom Gen 4 Switch Topology...... 9

10. Data Center Network Topology ...... 10

11. cat/proc/cpuinfo Output ...... 13

12. /proc/iomem Output ...... 14

13. PCIe Bus Connects to Controller and to the Namespaces ...... 15

14. Accessing NVMe by Creating PCIe and Controller Object ...... 16

15. Admin Command Format ...... 16

16. FIO Output ...... 18

17. Medusa Logging Output ...... 20


18. NVMe CLI Smart Log Output ...... 23

19. PCIMEM Command Syntax ...... 24

20. LTSSM States ...... 25

21. Hot Plug Scope Capture ...... 31

22. Effect of Pin Bounce During a Pull Event ...... 31

23. Effect of Pin Bounce During Hot Plug ...... 32

24. Interposer Layout ...... 33

25. Linux Crash Utility ...... 37

26. Log Command to Display Message Buffer ...... 38

27. Windows BSOD with Bug Check Code ...... 39

28. Bug Check Code Reference ...... 41

29. Bug Check Code Parameter Details ...... 42

30. Bug Analysis in Debug Mode ...... 42

31. SFC Command to Check for File System Errors ...... 45

32. Redundancy Network Group ...... 47



CHAPTER 1

INTRODUCTION

With the advent of modern computing, which involves processing large quantities of data for technologies such as big data and cloud computing, there is a high demand for reliable storage techniques. SSDs provide a more scalable solution to the storage demand as compared to Hard Disk Drives (HDDs). With fewer moving parts such as magnetic disks compared to HDDs, SSDs are more durable, faster and more compact. However, SSDs have a limited number of program and erase cycles over their lifetime, after which the performance begins to degrade. Therefore, engineering SSD hardware and firmware in a way that maximizes the lifetime is critical. There are several techniques, such as Wear Levelling and Garbage Collection, which help in maximizing performance. [14] Once these features have been implemented by the design and development teams, validation of the implementation is crucial in order to ensure that the product meets the high industry standards of reliability and performance. For complete testing of SSDs, there are many different test suites and tools that are used. The tools enable executing various workloads such as stress, power, performance and functional level testing. However, the many tools and platform variables involved in the testing introduce a significant amount of noise. This needs to be minimized in order to improve validation efficiency, so that the focus can be on finding SSD related issues. The following sections discuss the different variables used in testing which could potentially lead to noise, noise categorization and analysis, factors contributing to noise, and methods to minimize it.


CHAPTER 2

SYSTEM DESIGN AND COMPONENTS

2.1 Overall System Design

Data centers are facilities comprising racks of servers, power modules, network infrastructure, storage, cooling fans and software components. Data centers, along with the several software and hardware components, create the environment for SSD testing.

Rows of servers connected over the network form the basic building blocks of a data center. Server rack topologies can be subdivided into two categories:

• Top of Rack (ToR) Topology: With this topology, 1 or 2 Rack Unit (RU) access layer switches and switching devices are installed at the top of the rack. These provide the connections within the server rack. Using this topology, a smaller number of cables is required between the rack and end switches, but a larger number of switches is required to extend the connections to other racks. [21]

• End of Row Topology: The switching devices are at the end of the row in this case, as compared to the top of the rack in the ToR topology. The main benefit of this is that it reduces the number of switches required, but it increases the cabling connections between the servers on a rack. [21]

The servers are interconnected using the network subnets infrastructure. The power modules provide the supply and circuit breaker connections. SSD testing also requires

components such as Quarch modules, software tests and tools in order to execute the various workloads needed for validation.

2.2 Hardware Components

The hardware components comprise the equipment required to enable testing, such as Torridon Quarch systems for hot plug cycles, server boards for the platform, PCIe switches for Gen 4 speed support, and network infrastructure for connectivity.

2.2.1 Torridon System

The Torridon System provides hot plug capabilities for the drive. Without human intervention, drives can be removed and inserted back into the server using the Torridon

System. This allows running many cycles of hot plug and hot swap tests, without the requirement of manual monitoring. [1] There are several applications of the Torridon

System in the storage industry. During the product development stage, bench testing involving cycles of drive insertion and removal followed by checking for drive enumeration can be performed easily. Boundary test conditions, hot swap performance tests and hardware fault tests can be run in an automated environment. The ability to automate hot plug is beneficial during the testing phase, as test engineers can stress the drive with back-to-back hot plug tests without any human intervention. On average, about ten thousand hot plug cycles can be performed within the span of a few days. [1] Using the automated system, several other tests such as


Format, IO, Sanitize and Resets can be combined to run concurrently with hot plug and hot swap testing. The Torridon System consists of a controller and a control module.

The controller provides an interface to the control module. The simplest controller can run a single module, but an array controller can run up to 28 modules. Figure 1 shows a 28-Port Quarch controller with 24 ports in the front and 4 at the back.

Figure 1: 28-Port Quarch Controller [1]

Figure 2: Flex Cable and Quarch [1]

Figure 3: Quarch Connection to the Drive [1]

The drive is connected using a control module and flex cable to the controller. Figure 2 illustrates the Quarch flex cable and module components. The drive is connected to the


Quarch module as shown in Figure 3. The other end of the module can be connected to the server platform. [1] An array controller can connect to 28 modules and can be installed on the server for an easy connection to the drives. A single interface can be used to control all the modules installed on the array controller. [1] Figure 4 shows the complete system set up using the 28-port array controller. The drive is connected to the back plane of the server via the Quarch control module. Flex cables then connect the module to the 28-port controller, which is further connected to a 12V power supply. The system is connected through the network or via RS232 cable and hot plug commands can be sent via test scripts.

Figure 4: Complete Testing Set Up Using the Torridon System [1]


The Torridon System supports several commands which a script can send to perform different operations. The run power up and run power down commands are the primary commands used to hot insert and remove the drive, respectively. Some other supported commands are reset, identify device, assign signals to control, and measure voltage.
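As an illustration of how a test script can drive these operations, the sketch below sends power down and power up commands to an array controller over a TCP connection and loops them for a hot plug stress run. The controller address, port number, module addressing, and exact command strings are assumptions based on the description above; the Torridon documentation (and the vendor's quarchpy Python package, where available) defines the authoritative syntax.

import socket
import time

# Hypothetical controller address and port; check the Torridon documentation
# for the network interface actually exposed by the array controller.
CONTROLLER_ADDR = ("192.168.1.50", 9760)

def send_command(sock, command):
    """Send one text command to the controller and return its reply."""
    sock.sendall((command + "\r\n").encode())
    return sock.recv(4096).decode().strip()

def hot_plug_cycle(sock, module=1, off_time=5, on_time=30):
    """One hot plug cycle: pull the drive, wait, re-insert, wait for enumeration."""
    send_command(sock, f"run power down module {module}")  # assumed command syntax
    time.sleep(off_time)
    send_command(sock, f"run power up module {module}")    # assumed command syntax
    time.sleep(on_time)  # give the host time to re-enumerate the drive

if __name__ == "__main__":
    with socket.create_connection(CONTROLLER_ADDR) as sock:
        for cycle in range(10000):  # back-to-back hot plug stress
            hot_plug_cycle(sock)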

2.2.2 Server Platform

The server platform is one of the primary components of the data center environment and provides the platform for connecting the storage drives. The Intel Server Board S2600WF is a high-performance server board that uses first and second generation Intel Xeon processors and the Intel C624 chipset. [2] The main components of the server are the processor, riser slots for connecting modules, fan connectors, power connectors, NIC ports and

RAID/VROC module as shown in figure 5.

Figure 5: Intel Server Board S2600WF Components [2]


Two sockets are provided to support the connection of the 1st and 2nd generation Xeon processors, with a maximum Thermal Design Power (TDP) of 205W. The memory support involves

24 DIMM slots and DDR4 configuration. The onboard PCIe and NVMe modules comprise 4 OCuLink connectors, 2 M.2 connectors and Intel VMD support. The integrated Baseboard Management Controller (BMC) provides advanced server management capabilities via the Intel RMM4 Lite. There are six system fans supported in two different connector forms to provide the cooling mechanism for the system. [2]

The M.2 SSD form factors can be connected to the two M.2 connectors as shown in

Figure 6. Each M.2 connector can support 80mm form factor PCIe or SATA modules.

Figure 6: M.2 SSD Connectors on the Server Board [2]

The onboard OcuLink connectors provide further options for connecting SSDs using

PCIe to the backplane of the server, as shown in figure 7. Connections for OCuLink SSD0 and SSD1 are routed from CPU 1, and those for SSD2 and SSD3 are routed from CPU 2.


Figure 7: Onboard OcuLink Connectors [2]

The Intel Volume Management Device (VMD) is hardware logic for managing PCIe NVMe SSDs. It provides robust support for functionalities such as hot plug by handling system crashes and hangs while SSDs are inserted and removed. [2] Figure 8 shows the extra hardware involved in the error handling mechanism.

Figure 8: NVMe Error Handling Using VMD [2]


2.2.3 PCIe Switches

PCIe Gen 4 provides about double the speed as compared to PCIe Gen 3. While Gen 3 speed gives the ability for the SSD to operate at 8GT/s, Gen 4 adds the capability for the

SSD to function at 16 GT/s. The higher data transfer rate at lower latencies leads to Gen 4 being a beneficial alternative. [3] However, not all server boards have inbuilt or native support for Gen 4, hence a PCIe switch is required. The SSD can then be connected to the server port via the switch and operate at Gen 4 speed. Microsemi provides several options for Gen 4 switches via the Switchtec technology. Switchtec provides options from 96 to

24 lanes, supporting features such as port bifurcation, Advanced Error Reporting and debug capabilities. [3] Alternatively, the PEX88000 series of switches is provided by

Broadcom for adding Gen 4 functionalities. The PEX88000 series offer up to 48 DMA channels associated with each x2 PCIe port, allowing data transfer between host and the

SSD. The switches include an ARM Cortex-R4 CPU, timers and internal RAM, which can be programmed for I/O and hot plug capabilities. The embedded CPU provides functionalities such as chassis management, LED control, hot add and so on. [4] Figure 9 shows the topology for supporting up to 32 dual-port x2 or 16 single-port x4 NVMe drives.

Figure 9: Broadcom Gen 4 Switch Topology [4]


2.2.4 Network Infrastructure

The network infrastructure provides the communication system that connects the servers, services and the external users. The network topology, routing and protocols (security, ethernet, IP) characterize the network infrastructure of the data center. Figure 10 shows the network topology of a conventional network for a data center. The Top of Rack has the switch that provides the connectivity to the servers on the rack. The aggregation or distribution layer has the aggregation switch (AS) that forwards the traffic from the ToR layer to the core through multiple connections. The core layer is responsible for providing the secure connection between the AS and routers in the core. [5]

Figure 10: Data Center Network Topology [5]


A VLAN (virtual local area network) provides subnetworks to group devices that may reside on different physical LANs. VLANs allow network administrators to provide security and partition traffic based on requirements, enabling systems to be divided into logical groups. Data centers using this design can implement it using switches and hypervisors. Implementing security protocols is important in the case of virtual networks, as any malicious attack on the network can possibly affect all the systems on the subnet.

The subnet mask divides the IP address of a server into a network part and a host address part. [5] Further, the subnets are connected to a Thin Client that provides a secure means to store applications and programs. Using a Thin Client is an effective way to provide centralized computing abilities. It allows functionalities such as software upgrades and application installations to be carried out easily in the data center environment. The applications being installed on the systems can be limited, implementing secure installation of software.

The testing framework and tools can be backed up on the thin client and installed on the servers over the network. [6] This is advantageous if similar configurations need to be loaded on multiple systems, as in the case of the data center servers used for testing

SSDs. Thin Clients usually involve processors with low processing power, since their purpose is mainly to provide centralized application access securely. The computations and intensive workloads are executed by the servers. [6] A firewall forms a layer of security by monitoring incoming and outgoing traffic on the network. The network infrastructure not only connects the servers with end users, but also with several other applications such as remote power control, which provides the ability to remotely turn the systems on and off, essential for testing functionalities that require power cycling the drive.
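As a small illustration of the network/host split described above, the following sketch uses Python's standard ipaddress module with a hypothetical server address and a /24 subnet mask:

import ipaddress

# Hypothetical server address with a 255.255.255.0 (/24) subnet mask.
iface = ipaddress.ip_interface("10.20.30.42/24")

print(iface.network)          # 10.20.30.0/24 -> network portion of the address
print(iface.network.netmask)  # 255.255.255.0 -> the subnet mask itself

# Host portion of the address (the last octet here, i.e. 42).
print(int(iface.ip) & ~int(iface.network.netmask) & 0xFFFFFFFF)

# Two servers on the same /24 subnet can reach each other without routing.
print(ipaddress.ip_address("10.20.30.77") in iface.network)  # True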


2.3 Software Components

Software components such as Python test scripts along with tools such as FIO, Medusa,

NVMe CLI and PCIMEM enable issuing different IO workloads and Admin commands such as format, sanitize and identify to the drive. Linux and Windows Operating Systems

(OS) enable testing in both environments, ensuring the drive is compatible with the two popularly used OS.

2.3.1 Operating Systems

Red Hat Enterprise Linux (RHEL) distributions are popularly used as data center operating systems. RHEL 7 was released in June 2014, followed by RHEL 8 in May

2019. Fedora is the upstream distribution from which the RHEL releases are forked.

RHEL 7 is based on Fedora 19 and RHEL 8 on Fedora 28. Both RHEL and Fedora are open source, giving the user the ability to modify and innovate. [7] RHEL file systems follow the Filesystem Hierarchy Standard (FHS), where the /boot/ directory stores the files that are required during system startup. The /dev/ directory lists the devices physically and virtually attached to the system. This is where the NVMe drive would be listed, along with any RAID volumes. The /etc/ directory stores the system-specific configuration files, and the /opt/ directory is where optional software packages are stored. The /var/ directory is useful for debug purposes as it stores the system logs, such as messages indicating the state of the system and the last log indicating the time the last reboot happened. [8] Journaling file systems such as ext3 provide options to choose from different levels of data protection that help in preventing loss of data during

an unsafe shutdown event. After an unexpected event such as a crash, the ext3 file system should be verified using the e2fsck application. The ext4 file system is an extension to the ext3 file system, which improves efficiency when working with large data by reordering writes issued to the drive. The debug4fs command can be used to find issues with the file system, and e4fsck can be utilized to repair it. [8] These commands are useful when debugging issues such as a kernel crash that can possibly add noise in the validation environment. The /proc/ directory is another useful debug tool which can be used to find the current kernel state and the system view, including what processes are currently running. [8] For example, the cat /proc/cpuinfo command provides information about the

CPU in detail as shown in figure 11. /proc/iomem in figure 12 shows the system memory for the physical devices at a current point in time. This can be used along with the system logs to map the device id with the system memory map.

Figure 11: cat/proc/cpuinfo output [8]


Figure 12: /proc/iomem Output [8]

The Linux distributions have long term and stable kernel versions, which are frequently updated with bug fixes. The latest kernel version and bug fix reports are downloaded from Linux kernel archives. The uname -r command can be used to find the kernel version installed on a system.

Microsoft provides a series of server specific OS as well such as Windows 2012,

Windows 2016 and Windows 2019. Other distributions used for datacenter OS are SUSE,

CentOS and Ubuntu Server.

2.3.2 Test Framework

Python is popularly used in the testing environment as it provides many powerful APIs that can be utilized for extensive testing. As an example, the Pynvme open-source test driver provides APIs that test developers can use to create SSD-specific test code. Pynvme is based on the Storage Performance Development Kit (SPDK), developed by Intel to provide solutions for developing storage applications. Pynvme supports many features for testing functionality and performance.


NVMe devices use the PCIe bus. The PCIe bus connects to the NVMe controller and then to the different namespaces, as shown in figure 13. Hence, a PCIe object is created first and then a controller object to access the NVMe device. [9] This can be done as shown in figure 14.

Admin commands can be sent using the Pynvme APIs. Figure 15 shows an example of the format command being sent after setting the timeout. Several other testing tools such as Flexible I/O Tester, Medusa, NVMe CLI, PCIMEM and Link Training Status State

Machine can be combined into the testing framework built with Python. These tools can then be used to exercise different workloads on the drive such as I/O, back-to-back formats, sanitize, resets and power cycles. The test scenarios can also be combined to run in concurrent threads using the concurrent.futures module of Python (see the sketch after Figure 15). The test tools combined with Python scripts together form the testing framework. The following sections discuss each tool of the framework in detail.

Figure 13: PCIe Bus Connects to Controller and to the Namespaces [14]


Figure 14: Accessing NVMe by Creating PCIe and Controller Object [9]

Figure 15: Admin Command Format [9]
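To illustrate how the framework combines tools into concurrent workloads, the sketch below uses Python's concurrent.futures module to run an I/O job and an admin-command loop in parallel threads. The helper functions and device paths are illustrative placeholders rather than the project's actual test code; the FIO and NVMe CLI invocations they wrap are described in the following sections.

import concurrent.futures
import subprocess
import time

DEVICE = "/dev/nvme0n1"  # hypothetical device node of the drive under test

def io_workload():
    """Issue a short random-write workload with FIO (see section 2.3.3)."""
    return subprocess.run(
        ["fio", "--name=noise_check", f"--filename={DEVICE}",
         "--rw=randwrite", "--ioengine=libaio", "--iodepth=2",
         "--bs=16k", "--runtime=60", "--time_based"],
        capture_output=True, text=True).returncode

def admin_loop(cycles=5):
    """Poll the drive health with NVMe CLI while the I/O job runs (see section 2.3.5)."""
    for _ in range(cycles):
        subprocess.run(["nvme", "smart-log", "/dev/nvme0"], capture_output=True)
        time.sleep(10)
    return 0

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(io_workload), pool.submit(admin_loop)]
    print("workload return codes:", [f.result() for f in futures])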

2.3.3 Flexible I/O Tester

The Flexible I/O Tester (FIO) gives the ability to issue I/O workloads to the drive without the need to write extensive test functions, defining each workload. FIO was written by

Jens Axboe and is available as open source from git. It works by spawning multiple threads that perform a specific IO pattern defined by the user. FIO commands are straightforward to run via the command line or terminal. The format of the command is:

$ fio [options] [job file 1] [Job file 2] …

The above command starts the jobs specified in the job files. To run consecutive

IO workloads, multiple job files can be listed with the command, which are then scheduled in series by the FIO engine. The job file is thus an important parameter to define in order to run workloads. [10] The main parameters to define in the job file are as follows.


• I/O type - Specifies whether the workload issued to the drive is sequential or random. This parameter also specifies if reads and writes are mixed and if the I/O is buffered or direct.

• Block Size - Specifies the size of the chunks in which the I/O is issued. The default block size is 4096. Other sizes can be specified using bs, for example: bs=256k.

• I/O Size - Specifies the amount of data to be read or written.

• I/O Engine - Specifies whether the I/O is memory mapped, spliced, asynchronous or SCSI. Libaio, for example, is the I/O engine for Linux asynchronous workloads.

• I/O Depth - The queue depth for cases when the I/O engine is asynchronous.

• Target file/device - The number of files or devices the workload is issued to.

• Threads/Processes - The number of threads or processes used for the workload. [10]

Apart from the above parameters, FIO also provides several command line options, such as --debug to enable logging of FIO workloads and --output to direct output to a file. An example of running an FIO command is:

$ fio --rw=randwrite --ioengine=libaio --iodepth=2 --bs=16k --direct=0 --numjobs=2

The above command issues an asynchronous random write workload using the libaio engine with an IO depth of 2. The number of jobs specified is 2, which results in spawning 2 identical jobs. After running the FIO workload on the drive, any failures can be investigated and debugged by referring to the output file, an example of which is shown in figure 16. The first line prints the job name with the id of the group, number of jobs and error id if any found. Next the type of I/O is listed, which is a write with

Bandwidth 623 KiB in this case. Slat stands for the submission latency, that is, the time it takes to submit the I/O. Clat is the completion latency and specifies the time for the completion of the I/O. Lat is the total latency, i.e., the total time from I/O issue to completion. CPU usage specifies the details about the system time and number of page faults. [10]

Figure 16: FIO Output [10]
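Because scraping the human-readable output in Figure 16 is error-prone inside a test framework, FIO can be asked for machine-readable output instead. The sketch below runs a similar random-write job with --output-format=json and extracts bandwidth and completion latency; the JSON key names (for example clat_ns) can differ between FIO versions, so treat them as assumptions to verify against the installed release.

import json
import subprocess

def run_fio_json(filename="/dev/nvme0n1"):
    """Run a short random-write job and return FIO's parsed JSON result."""
    cmd = ["fio", "--name=json_demo", f"--filename={filename}",
           "--rw=randwrite", "--ioengine=libaio", "--iodepth=2",
           "--bs=16k", "--size=64m", "--output-format=json"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

result = run_fio_json()
write = result["jobs"][0]["write"]
print("bandwidth (KiB/s):", write["bw"])            # corresponds to bw in Figure 16
print("IOPS:", write["iops"])
print("mean clat (ns):", write["clat_ns"]["mean"])  # completion latency (clat)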

2.3.4 Medusa Labs Test Tools Suite

The Medusa Labs Test Tools Suite provided by Viavi Solutions gives an efficient way of running data integrity workloads on the drive. Medusa workloads follow a host-target interaction design, where the I/O is initiated by the host and targets the SSD. The workloads can be run using the command line, the Medusa GUI, or by embedding the

commands in the Python scripts. The main tools for I/O used by Medusa are Pain and

Maim. Pain is used for scheduling synchronous I/O per thread, while Maim issues multiple asynchronous I/Os per thread. Table 1 shows the differences between Pain and

Maim. Depending on these attributes, the two can be used for different use cases. [11]

Table 1: Differences Between Pain and Maim [11]

The FindLBA utility of Medusa provides the ability to debug issues such as data corruption. It is possible that the tool logging shows an address that does not match with the exact physical address. Using the FindLBA utility along with an analyzer, the exact memory areas can be determined to see which LBA the data corruption corresponds to.

To start a Medusa workload, a license needs to be checked out from the server. The

GetKey utility provides the ability to check out a license and is especially useful for checking out licenses in offsite testing. The Medusa Agent is a service that uses the TCP protocol to provide functionalities such as license client, system discovery and remote execution.

Running Medusa from the command line requires injecting the test parameters based on the drive size and queue depth to test. As an example, in the following Medusa command a synchronous workload is issued using Pain on physical drive 1. The buffer size is 512k, the thread count is 6, and 125 is the data pattern used. [11]


pain -f\\.\physicaldrive1 -b512k -t6 -125

Figure 17: Medusa Logging Output [11]

The buffer size corresponds to the block size on the drive. The -Q option can be used to specify the queue depth, i.e., the number of outstanding commands to be issued. The

outputs generated after running a Medusa workload are stored in a log file, which can be referenced to debug any errors generated during the testing. There is also an error log file generated in case the workload fails. This is usually under the name thread.bad and has information about the thread that failed due to data corruption. The generation of extensive logs makes Medusa an efficient tool for debugging data integrity issues related to

SSDs. [11] Figure 17 shows the output generated by Medusa along with a description of each log entry. An error with code 13 is logged in this case for a Read workload on

Physical drive 1. The LBA on the drive where the data corruption occurred is also specified.
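The Pain invocation above can also be launched from the Python framework so that the return code and error logs are captured automatically. The following is a minimal wrapper around the exact command shown earlier; the executable path, the assumed -Q8 queue depth, and the thread.bad log naming follow the description in this section and should be confirmed against the installed Medusa Labs Test Tools version.

import pathlib
import subprocess

# Same synchronous Pain workload as shown above: physical drive 1, 512k buffer,
# 6 threads, data pattern 125, plus an assumed queue depth of 8 via -Q.
PAIN_CMD = ["pain", r"-f\\.\physicaldrive1", "-b512k", "-t6", "-125", "-Q8"]

def run_pain_workload(workdir="."):
    """Run the Pain workload and report whether an error log (thread.bad) appeared."""
    proc = subprocess.run(PAIN_CMD, cwd=workdir, capture_output=True, text=True)
    bad_logs = list(pathlib.Path(workdir).glob("*thread.bad*"))  # assumed log name
    if proc.returncode != 0 or bad_logs:
        print("Data integrity failure; inspect:", [str(p) for p in bad_logs])
    else:
        print("Pain workload completed without data corruption errors.")
    return proc.returncode

run_pain_workload()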

2.3.5 NVMe CLI

NVMe CLI is a tool specifically designed for NVMe SSDs to provide functionalities such as drive health monitoring, endurance checks, and issuing admin commands such as format and sanitize. The NVMe CLI commands are available on Linux as open source and directly match the NVMe specifications. This makes the tool readily available and easy to use. For example, the specs define the identify data structure to find device information such as the model number. NVMe CLI has an equivalent command called nvme id-ctrl that corresponds to the identify data structure. [12] On RHEL distributions, the

NVMe CLI package can be installed using the sudo yum install nvme-cli command.

Some useful NVMe CLI commands are shown in Table 2.


Table 2: NVMe CLI Commands [12]

These commands help in testing the basic functionality of the SSD and the output can be used to obtain important drive debug information. Figure 18 shows the output for the smart log command run with nvme CLI. This is useful in obtaining important debug information such as power cycle counts, temperature and critical warnings on the drive.

In cases where the drive temperature exceeds the allowed safe limit of the drive, this command can be run to find out the exact temperature of the drive at a given time. Similarly, to determine whether any unsafe shutdown events happened on the drive, the logging from this command can be used to get information about the unsafe and safe shutdown counts. Media errors during I/O are also indicated in the logs.


Figure 18: NVMe CLI Smart Log Output [12]
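When the smart log is collected inside the test framework, the JSON output mode of NVMe CLI avoids scraping the text shown in Figure 18. The sketch below is a minimal health check; the field names (critical_warning, unsafe_shutdowns, media_errors, power_cycles) follow current nvme-cli JSON output but should be verified against the installed version, and the temperature field is reported by the drive in Kelvin.

import json
import subprocess

def smart_log(dev="/dev/nvme0"):
    """Collect the SMART / health information log for an NVMe controller as JSON."""
    out = subprocess.run(["nvme", "smart-log", dev, "--output-format=json"],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

log = smart_log()
print("critical warning :", log["critical_warning"])
print("temperature (C)  :", log["temperature"] - 273)  # drive reports Kelvin
print("unsafe shutdowns :", log["unsafe_shutdowns"])
print("media errors     :", log["media_errors"])
print("power cycles     :", log["power_cycles"])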

2.3.6 PCIMEM

Since NVMe SSDs are connected over the PCIe bus, debugging issues often requires accessing and checking PCIe register status. Setting PCIe and controller register values to inject different states is another use case in validating SSDs. PCIMEM gives the ability to read and write the PCIe registers and is available as a git repository. [13] The NVMe specs define several system bus and controller registers. The controller registers are in the PCIe

Bar 0 and Bar 1 mapped to the in-order access space. The Controller Configuration (CC)

register is useful in the validation environment as it gives the ability to exercise control over the drive, such as controller enable and disable. Bit 0 of the CC register corresponds to CC.EN. When this bit is set to 1, the controller processes commands based on the entries in the Submission Queue. When this bit is cleared, the controller is not allowed to process commands or add entries to the Completion Queue. The controller sets the CSTS.RDY bit to 0 in response to the CC.EN bit being cleared. A controller reset occurs when the CC.EN bit goes from 1 to 0, causing deletion of Queues and resetting of the Admin

Queues, causing the drive to go into idle mode. A reset does not affect PCIe level registers such as the MMIO and MSI registers. Setting the CC.EN bit when the CSTS.RDY bit is cleared is not a defined action. [14] The above functionality and register actions can be validated using PCIMEM, as it gives access to these registers. The CC.EN bit can be cleared to 0 using PCIMEM. Once this is done, various checks on the controller can be made to see if it is following the required protocols as per the NVMe specs. One of these would be to check that the CSTS.RDY bit is set to 0 by the controller in response to this action, and within the required amount of time. The syntax for PCIMEM commands is shown in figure 19.

Figure 19: PCIMEM Command Syntax [13]
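Putting the CC and CSTS discussion above into practice, the sketch below drives PCIMEM from Python to clear CC.EN and then polls CSTS.RDY. The register offsets (CC at 0x14, CSTS at 0x1C, bit 0 in each) come from the NVMe specification; the pcimem binary path, the resource0 BAR file for a hypothetical device address, the 'w' word access type, and the way the tool prints its result are assumptions to adapt to the local build (see Figure 19 for the command syntax).

import re
import subprocess
import time

PCIMEM = "./pcimem"                                   # assumed path to the pcimem binary
BAR0 = "/sys/bus/pci/devices/0000:3d:00.0/resource0"  # hypothetical NVMe device BDF

CC_OFFSET, CSTS_OFFSET = 0x14, 0x1C  # NVMe controller register offsets in BAR0
EN_BIT = RDY_BIT = 0x1               # CC.EN and CSTS.RDY are both bit 0

def read32(offset):
    """Read a 32-bit register through pcimem and parse the hex value it prints."""
    out = subprocess.run([PCIMEM, BAR0, hex(offset), "w"],
                         capture_output=True, text=True, check=True).stdout
    return int(re.findall(r"0x[0-9a-fA-F]+", out)[-1], 16)

def write32(offset, value):
    subprocess.run([PCIMEM, BAR0, hex(offset), "w", hex(value)], check=True)

# Clear CC.EN (1 -> 0) to request a controller reset.
write32(CC_OFFSET, read32(CC_OFFSET) & ~EN_BIT)

# Per the spec, the controller must then clear CSTS.RDY within its advertised timeout.
deadline = time.time() + 10
while read32(CSTS_OFFSET) & RDY_BIT:
    if time.time() > deadline:
        raise TimeoutError("CSTS.RDY did not clear after CC.EN was cleared")
    time.sleep(0.1)
print("Controller reported not-ready as expected after the disable.")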


2.3.7 Link Training Status and State Machine

PCIe bus connects the host to the endpoint devices through three different layers:

Transaction Layer, Data Link Layer and Physical Layer. The physical layer configures the link and initializes the connected endpoint devices to enumerate over the PCIe bus.

During the initialization process the link transitions into different states specified by the

Link Training Status and State Machine (LTSSM) illustrated in figure 20. [15]

Figure 20: LTSSM States [15]


During the Detect state the presence of an endpoint device is discovered. The training ordered sets are transmitted during the Polling state. The Configuration state is when the host and drive send and receive packets at a specific data rate and the SSD drive is configured. The Recovery state allows the ability to change the configured data rate. L0 is the normal state where packets are transmitted and received as per the protocols. L0s,

L1 and L2 are increasing levels of power-saving states. The Loopback state is used in case of a fault, and the Disabled state is entered when the device enters electrical idle. [15] The different LTSSM states can be exercised and tested by issuing events such as Hot Reset, Link Disable/Enable, Hot Plug and Link Equalization. The test results are monitored to see if any errors happen during these workloads. PCIe errors can be categorized as Correctable or Uncorrectable errors. Correctable errors can be recovered by the hardware and data is not lost. Uncorrectable errors, on the other hand, are more fatal as they affect the device functionality and cannot be recovered by the hardware. [15] The different error messages indicating the various severities of reported errors are shown in Table 3.

Table 3: PCIe Error Messages [15]
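On Linux hosts, one low-overhead way to watch for the correctable and uncorrectable errors described above between test steps is to read the AER counters the kernel exposes in sysfs. The sketch below assumes a kernel recent enough to provide the aer_dev_* attributes and a hypothetical device address; on older kernels the same information has to be pulled from lspci or the kernel log instead.

from pathlib import Path

# Hypothetical NVMe device; substitute the BDF of the drive under test.
DEV = Path("/sys/bus/pci/devices/0000:3d:00.0")

def read_aer_counters(dev=DEV):
    """Return {counter_name: count} from the AER attributes exposed in sysfs."""
    counters = {}
    for name in ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal"):
        path = dev / name
        if not path.exists():  # attribute absent on older kernels
            continue
        for line in path.read_text().splitlines():
            key, _, value = line.rpartition(" ")
            counters[f"{name}:{key}"] = int(value)
    return counters

before = read_aer_counters()
# ... run a hot plug / link training workload here ...
after = read_aer_counters()
deltas = {k: after[k] - before.get(k, 0) for k in after if after[k] != before.get(k, 0)}
print("new PCIe errors during the workload:", deltas or "none")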


CHAPTER 3

DATA COLLECTION AND ANALYSIS

3.1 JIRA Tools for Data Collection

Jira is a popular software tool used widely in the industry for issue tracking and sprint management. Developed by Atlassian for agile product management, it is used by around 75,000 customers around the world. Initially, Jira was designed for the purpose of bug tracking. [16] However, it has evolved over time to have several applications during the requirements, development and testing phases of products. Use cases of Jira include scrum, task management and bug tracking.

The issue tracking feature has an important application in the validation environment.

Issues arising out of the testing are tracked using this feature. Jira filters provide the ability to list issues related to certain features and products. When a Jira ticket is opened, the software allows several customizable data entry options such as product, firmware version, created date, updated date and test. JIRA dashboards provide options to list issues in a personalized way using filters based on these data entries. Further, the filtered issues can be analyzed by adding pie charts, column views and line graphs. The

GUI makes analyzing issues and large data sets efficient. [16] The Jira tool helped in analyzing issues related to noise in the validation environment. Many issues opened from the start of the year to mid-year could be filtered out and analyzed. The created date, organization and noise fields of Jira could be used to list all such issues on a Jira dashboard. The issues could further be broken down into a pie chart based on the hardware or

software component that led to the issues. Using this, it could be determined which components were resulting in what percentage of the total issues filed.

3.2 Noise Categorization

Using the Jira charting tools, an estimate could be made of the main components resulting in the noise. These categories were then further bucketized into larger categories to group issues with a smaller contribution to the pie chart percentages. Categorizing the issues based on the different system components causing the noise helped narrow down the focus for the project. Accordingly, different strategies could be built around how to manage the various categories of noise. The priorities were set for the categories that were contributing the highest percentage of noise. Jira also provides the ability to download the issues from the GUI to a spreadsheet. [16] This was another method used to extract and categorize the issues. The issues from the Jira dashboard could be downloaded into a spreadsheet and then clubbed together into the larger buckets. For example, FIO, Medusa Labs Test Suite, NVMe CLI, PCIMEM and LTSSM issues could be grouped as software tool issues. Issues due to the Python test scripts could be listed as test issues. PCIe switch and Torridon System issues were categorized as hardware issues. Network and server issues were bucketized as infrastructure issues. Adding a column in the spreadsheet according to the new larger buckets enabled filtering issues based on these categories. Pivot tables in the spreadsheet further allowed creating pie charts based on these new buckets. Pie charts generated using the spreadsheet pivot tools indicated which of the categories was leading to what percentage of the issues.


CHAPTER 4

FACTORS CONTRIBUTING TO NOISE

4.1 Key Hot Plug and Hot Swap Issues

Many factors are involved in keeping the validation system and services running.

Torridon Systems execute hot plug and hot swap tests on the drives. The Quarch module is connected to the drive before it is plugged into the server to perform drive unsafe shutdown test scenarios. This way hot plug cycles can be carried out without the need to manually remove the drive. [1] However, any glitches in the

Quarch set up or functioning can result in unexpected power down events during testing.

Such events are undesired in the validation environment and are a source of noise. The ability to do hot swap forms an important testing aspect for SSDs. During a hot swap, the drive is replaced with another drive or switched with a drive in another slot. Hot plug, on the other hand, refers to just the insertion of the SSD into the system. The variability introduced by human and mechanical factors makes hot swap a complex process. In the last few years, SSDs have become faster and more dependent on the host. The tight coupling with the host has resulted in increased failure rates. [17] One of the key issues is the pin connection sequence during hot plug. The pins in the connectors can have differences in length, resulting in not all pins mating at the same time, which can cause unexpected behavior during operation. A planned sequence for the pins to mate is a possible solution to reduce this effect. The connectors in this case have

multiple pin lengths such that more critical pins can be mated at the required time. For example, ground pins are usually connected first to create a system ground as early as possible. [17] The issue of pin bounce occurs when pins do not mate cleanly and bounce in the connection repeatedly before finally connecting successfully. Since pins bounce at an indeterminate rate, the connection each time is unique. This is the reason why hot plug may not always fail but instead exhibits a failure rate such as 1 out of 100 cycles.

Some pins, for example reset and mode select, remain idle until the drive detection is complete. Such pins are less likely to suffer from pin bounce failures as they get sufficient time to successfully connect. Pin bounce can lead to issues such as causing

SMBus transactions to be corrupted. [17] The host status at the time when the hot swap takes place is another key parameter to consider. When the system is in the idle state, there are fewer variables and the hot swap operation is straightforward. However, if the system is busy with an intensive workload, then it involves handling the different transactions, pending workloads, OS tasks and bus transactions. An issue with the hot swap may leave the memory in a corrupted or invalid state, even if this is not visible at that specific point in time. Figure 21 shows the scope capture with Ground, pre-Charge and Power signals. Due to pins mating at different points in time, the signals change state at various time intervals. Figure 22 shows a zoomed-in view of a pull-out event, where pin bounce causes spikes in the signal levels. Figure 23 shows the scope capture during a hot plug event and the effect of pin bounce during the drive insertion operation.


Figure 21: Hot Plug Scope Capture (Gold- Ground, Pink- PreCharge and Blue- Power)[17]

Figure 22: Effect of Pin Bounce During a Pull Event [17]


Figure 23: Effect of Pin Bounce During Hot Plug [17]

The type of host system also affects the hot plug operation. Systems that have more isolation between drive and host provide a more stable environment. This makes some systems simpler than others when it comes to hot plug. On SATA or SAS SSDs, failures are represented by an error at the controller end and managed by the OS. This is because the controller separates the drive and the host. [17] NVMe SSDs, on the other hand, may be connected with no additional controller logic, hence coupling the drive to the host more closely. This way better performance can be achieved, but it comes at the cost of increased dependency on host behavior. The close coupling with the host can even allow a failure in the drive to cause a system crash. Based on the aforementioned key issues with hot plug and hot swap, the following are some important considerations to take into account during testing.


• Scenarios need to be created that are precise and repeatable irrespective of pin mating irregularities.

• Pin bounce cases need to be repeatable in case of a failure, so that they can be debugged and root caused.

• Host timing needs to be simulated and characterized.

• The host type needs to be analyzed and its failure points listed.

Automated testing gives control over the hot swap process but not over the drive and host interaction. An interposer in the bus which does not interfere or retime is required so that the drive and host communication is not altered in the operation. Figure 24 shows an interposer layout with individual power and sideband connectors. Some of these signals require isolation and driving to simulate backplane timings. Conducting interface research enables characterizing the extent of pin bounce involved in a specific scenario.

This can be done using a Protocol Analyzer or by doing a scope capture on the signals. In the case of pin bounce, there is a large set of possible scenarios; therefore, by characterizing a limited set of points and the extreme conditions, the intermediate levels can be assumed to operate as desired as well. [17]

Figure 24: Interposer Layout [17]


4.2 Linux Kernel Crash Events

Drive failures can result in system crash events. However, a kernel crash can have other underlying causes such as faulty drivers, host characteristics, OS installation issues and kernel bugs. Separating the drive failures from other causes of kernel crash is critical. When the

OS detects a fatal error that occurred internally in the system from which it cannot recover safely, it goes into a kernel panic state. The kernel panic state in Linux causes the system to stop functioning, preventing any loss of data that could happen if the system continued to run. [18] Kernel panic can occur due to hardware failures or software bugs. In most cases, the OS can handle the error and continue to run. However, if the system is unstable, leading to a possible security breach, the OS goes into a freeze state as an action to prevent any damage. One common cause of kernel panic is when the kernel is incorrectly configured or installed, resulting in a kernel panic state during boot up after the kernel binary image is recompiled. Faulty devices or RAM installed on the system can also result in kernel panic issues. A panic state can be reached if the drivers are incompatible with the OS or if a root file system cannot be found on the system. Failure to spawn init or the init process terminating can trigger a kernel panic during the final stage of the initialization sequence. [18] Due to the various factors resulting in a kernel panic, several debug tools are used to find the root cause of the crash. In Linux systems, logs such as dmesg and the logs under /var/log store messages about the kernel state and can be used to find more information on the events that may have caused the kernel crash. However, the logging in these may not be sufficient to find the root cause of the crash, and a complete system memory dump may be required for analysis. Kdump is a utility that dumps the system

memory at the time of a crash and saves it so that it can be analyzed once the system has recovered from the crash. It uses kexec, which allows booting a Linux kernel from another kernel's context, bypassing the BIOS. This way the first kernel's memory can be saved. When a crash occurs, kexec is used by kdump to boot into a second kernel that is loaded into a reserved system memory area and cannot be accessed by the first kernel. [8] This kernel captures the system memory of the first kernel in the event of a crash and saves it in the form of a crash dump. This is the only information available when the crash happens, making it a critical source of information for analyzing failures. It is recommended to update the kexec tools during the kernel update cycles. For the crash dump to be saved, it is important for kdump to reserve a part of the system memory, inaccessible to the main kernel, to capture the crash events. The amount of system memory required changes based on the host type. The command uname -m can be used to find out the machine architecture name. On most systems kdump can automatically calculate the amount of memory that will be required and reserve it for storing the crash dump. In cases where the system has less available memory than that typically used by the crash dump engine, the memory can be set aside manually. Kdump is installed by default with the OS install for RHEL 7 in many cases. A kdump configuration screen is provided by the Anaconda installer, which provides an interactive interface with the options for kdump. [8] This provides the options to enable kdump and configure the amount of system memory that will be reserved. The kickstart installation does not provide this option. To add kdump in these cases, the following command needs to be executed: yum install kexec-tools. To check if kdump is installed, the following command is used: rpm -q kexec-tools. RHEL 7.4 and onwards supports the


Intel IOMMU driver. For versions before this, it is recommended to disable the Intel IOMMU when enabling kdump. [8] The system memory reserved for kdump is configured during boot and is specified with the system boot loader. The crashkernel option can be used to specify the value to be assigned. When the value auto is assigned to this option, the configuration of the system memory is done automatically by kdump. The crash dump can be stored as a file local to the system or sent over the network. To send the file via the network, the Network File System (NFS) or Secure Shell (SSH) can be used. By default, it is stored in the crash directory under /var. To analyze the crash, the crash utility can be installed. This gives the ability to analyze the crash dump created by the kdump mechanism. The "yum install crash" command can be used to install this utility via the shell. The debug info package also needs to be installed along with this utility to provide the appropriate packages for the server. [8] After installation, the crash utility can be run with the command shown in figure 25. Various commands can be used with the crash utility for analysis. The log command in figure 26 is used to display the kernel message buffer. The most critical system crash information is stored in the kernel message buffer. This is the first information that is dumped in case of a crash into the dmesg file. It is specifically of use when the vmcore file cannot be accessed due to an issue such as the lack of memory available at the target. The default location for the vmcore is the crash directory. The backtrace (bt) command can be used to display the backtrace for the crash events and to find the address that led to the crash event. The sym command along with this address loads the name of the module at that address. The ps command is used to display the status of the processes in the system. The

vm command is used to display the virtual memory of the processes in the system. The files command is used to display the files that are opened by a specific process. To quit the crash utility, the exit command can be used. [8]

Figure 25: Linux Crash Utility [8]


Figure 26: Log Command to Display Message Buffer [8]
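Because a missing or misconfigured kdump means a crash leaves no vmcore to analyze, it is worth verifying the setup before long test runs. The sketch below performs the checks described above from Python: that kexec-tools is installed, that a crashkernel reservation is present on the kernel command line, and that the kdump service is active. The rpm and systemctl invocations are standard RHEL tooling; treat the overall flow as an illustrative pre-test check rather than the project's actual script.

import subprocess
from pathlib import Path

def kdump_ready():
    """Return True if the host looks ready to capture a crash dump."""
    checks = {}

    # 1. kexec-tools package installed (equivalent to: rpm -q kexec-tools)
    rpm = subprocess.run(["rpm", "-q", "kexec-tools"], capture_output=True, text=True)
    checks["kexec-tools installed"] = rpm.returncode == 0

    # 2. crashkernel memory reserved on the kernel command line
    cmdline = Path("/proc/cmdline").read_text()
    checks["crashkernel reserved"] = "crashkernel=" in cmdline

    # 3. kdump service active
    svc = subprocess.run(["systemctl", "is-active", "--quiet", "kdump"])
    checks["kdump service active"] = svc.returncode == 0

    for name, ok in checks.items():
        print(f"{name:25s}: {'OK' if ok else 'MISSING'}")
    return all(checks.values())

if not kdump_ready():
    print("Fix the kdump configuration before starting long stress runs.")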

Linux provides options for installing different kernel releases available from the Linux kernel archives to be downloaded and installed. The Mainline kernel released every couple of months is where new features are added and is maintained by Linux creator,

Linus Torvalds. Stable kernel versions are the release versions of the Mainline kernel with bug fixes from the mainline release tree. Long Term kernel versions backport fixes from kernel trees that are older and have critical bug fixes. These versions are based on

Stable releases over a longer period of time. [8] Bug fixes are frequently reported and resolved with new releases. The details on specific fixes are available in the kernel archives. Using the Long-Term kernel version can help in avoiding noise as these have bug fixes that have been tested over a longer period of time as compared to the other kernel version types.


4.3 Windows Blue Screen of Death

The equivalent term for kernel panic in Windows is Stop error or Blue Screen of Death

(BSOD) where the bug check code with information on what possibly caused the system crash is displayed on a blue background. Bug check or system crash or stop error all refer to the condition that indicates that safe operation is compromised, and the OS needs to stop the running state. Continuing operation in such conditions can be deemed to be unsafe and compromise the system or data integrity. In case of such an event a crash dump file is saved when crash dump capture is enabled in the system settings.

Alternatively, when a live debugger is attached to the system, the system view switches to the debug mode for further investigation. A blue screen with a bug check code appears when the live debugger is not attached. The bug check or stop code indicates the source of the error such as Page Fault. [19]

Figure 27: Windows BSOD with Bug Check Code [19]


Figure 27 shows the Blue Screen with the Bug Check code in one such scenario.

Sometimes the name of the module may also be available and is displayed. A corresponding hex number is associated with each code, which can be matched with the bug check code reference as shown in figure 28. This contains four parameters that describe the stop code and provides useful information on the events leading to the crash.

Clicking on the specific bug check shown in figure 28 on the online Bug Check reference page leads to the details of that stop code, with a description of each parameter. Figure 29 shows the details of each parameter for the APC Index Mismatch bug check code as an example. Further, the reference also contains insights on possible causes for the bug and debugging techniques. This information can be critical for analyzing a crash; hence there are several ways provided to obtain the code:

• The Event Viewer logs available on Windows systems have event properties that list the stop code parameters.

• The crash dump file can be opened using WinDbg, and the !analyze command can be used to obtain the 4 parameters.

• A live kernel debugger can be attached, which will receive the stop code parameters when a crash happens. [19]

When a live kernel debugger is attached to the system, the system with the crash moves into the debug mode. The Blue Screen does not appear in such a case and the crash details are sent to the debugger interface. Figure 30 shows the debug mode and the

!analyze command executed with the debugger. The Driver Power State Failure bug check code in this case is associated with the hex code 9F. This code can be looked up in

the bug check reference to find information on its parameters, possible causes and recommendations for further debugging of the issue. The kernel debug option is useful for debugging a repetitive issue or when the crash dump does not generate enough data to find the root cause of the crash. The events causing the issue can be recorded, and a workaround can be created once the information on the crash is found from the analysis. Breakpoints are a useful technique that can be utilized to step into and debug code sections. [19]

Figure 28: Bug Check code reference [19]


Figure 29: Bug Check Code Parameter Details [19]

Figure 30: Bug Analysis in Debug Mode [19]

On average, it is observed that the majority of BSOD issues are caused by faulty drivers. To debug driver related bugs, Driver Verifier is a useful utility for analyzing driver faults. Using Driver Verifier, memory resources like memory pools can be examined in real time. When an error is found in the code, an exception for that part is created. This code can then be examined in detail by debugging line by line.


The Driver Verifier utility is built into Windows systems and can be started via the command prompt. Typing verifier opens the utility, which can be further configured to specify which drivers need to be examined. Specifying a limited number of suspected drivers reduces the overhead of loading a large amount of code. Driver Verifier works by stressing drivers, such as kernel and graphics drivers, to test them and expose any illegal behavior or memory corruption issues. The ability to configure which tests and workloads the driver under test is subjected to helps in targeting specific scenarios. Multiple drivers or individual drivers can be tested at the same time by selecting the names of the drivers from a list. Driver Verifier is used with WinDbg to debug driver issues live. [19]

In order to reduce crash issues due to factors not related to the drive, the following are some measures that can be taken:

• Update the chipset driver to the latest version on the system and ensure no yellow exclamation marks are seen in the device manager for the driver.

• Track changes and upgrades made to drivers and system services. If a new bug is found, the deltas in the testing can then be determined and the recently added service can be investigated for bugs.

• If any exclamation marks are seen in the device manager, the event logs can be checked against the related driver properties to determine if there is a faulty driver. [19]

• Monitor the Event logs for any errors by checking the critical errors section. During a BSOD event, check the timestamp to determine which critical error coincides with the time of the BSOD.

• Run hardware diagnosis reports for the different hardware installed on the system.

• Use the Memory Diagnostics tool provided by Windows to check for memory violations. This tool can be accessed from the control panel on the system. The Event logs track the results from the diagnosis in a separate section called Memory Diagnostics results. [19]

• The OS and system applications require free memory for swap files. Ensure that there is sufficient space for these operations; typically 10% to 15% of the drive space should remain available for such OS level operations.

• Use the Safe Mode option when removing hardware. This enables loading only the installed services on boot up. Safe Mode can be accessed from the security update settings. The maintenance mode can be booted into from the recovery startup option, and Safe Mode can be set during the system boot.

• Check compatibility of the installed hardware with the Windows specifications.

• Confirm file system errors are not present by running a file system scan.

• Repair corrupted system files using the System File Checker (SFC) tool. The SFC scan command shown in figure 31 can be used from the command prompt in Windows to run a diagnostic on the files. [19]

• Enable the crash dump option on the system and install debug tools, especially to investigate issues that may not re-occur.


Figure 31: SFC command to check for File System Errors [19]

4.4 Unexpected Shutdown due to Network Instability

Shutdown cycles form an important part of SSD testing. There are several different types of shutdowns defined by the NVMe specs. These include safe shutdown, unsafe shutdown and hot plug. Various drive behaviors can be validated by scheduling the different shutdown events. For example, the SMART counter for unsafe shutdowns should increment when the system suddenly loses power, and the count should not increment during a safe shutdown event or during normal testing. [14] An unexpected loss of system power can result in undesired behavior such as incrementing drive counters and downtime while testing. Functional tests are also affected by these events. For example, if there is a workload issued to the drive, an unexpected shutdown can cause unexpected command aborts. Such events can be difficult to track with the many components involved in the testing. Several debug techniques and logs can enable debugging such events. The main factors resulting in unexpected shutdown events include network glitches, power module failures and Quarch issues. The power module can be used to remotely schedule a shutdown event, and the Quarch can be used for remote hot plug. The network connects the test scripts, power module and Quarch to the system. Network failures can disrupt the traffic on the network, causing, for example, power cycle events to be scheduled at incorrect points in time. Data center failures can be bucketized into link failures or device failures.


• Link Failures: The data center infrastructure is connected by a network of links and switches, and buses such as PCIe connect the various devices. When a link experiences a fault, the failure is categorized as a link failure.

• Device Failures: When a device stops routing traffic in the required way, the failure is categorized as a device failure. This can have different causes, such as hardware faults and downtime.

Analysis of the two types of failures indicated that link issues are variable in occurrence and dependent on the protocol. [20] A closer look at link failures showed that many of them also originate from underlying device failures, and one significant cause of device failures is maintenance updates. It was observed that Top of Rack switches had the lowest number of failures per day, suggesting that low-cost commodity switches were not a cause of the noise. Load balancer links, on the other hand, were found to have a high number of failures, causing traffic associated with load balancers to fail frequently. The links at the higher levels of the network topology were the next most likely failure points compared with the lower links, which had an average failure rate of 5%. The Time to Repair plays an important role when gauging the extent of the failures. [20] The Time to Repair is the downtime from when the failure occurred to when it was resolved, and it indicates the impact of the failure on the service. It was calculated that load balancer issues can be resolved on average within 10 minutes, while link failures on the buses connecting servers on different racks took between 20 and 30 minutes to resolve. Since hardware errors can require replacement of the device, they can take longer to repair than software failures; software issues have downtime associated with upgrades, bug fixes and patching.

Redundancy methods can be employed to lower the total number of network-related failures. Redundant groups of network devices and interconnects can mask failures in the network. [20]
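As a small worked example of the Time to Repair metric (the timestamps and component names below are invented for illustration and are not data from [20]), the downtime per failure can be computed directly from the logged outage start and end times:

# Sketch: compute Time to Repair (TTR) per failure from outage timestamps.
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"
failures = [
    {"component": "load_balancer",   "down": "2020-10-05 10:00", "up": "2020-10-05 10:09"},
    {"component": "inter_rack_link", "down": "2020-10-05 11:30", "up": "2020-10-05 11:55"},
]

for f in failures:
    # TTR is the downtime from when the failure occurred to when it was resolved.
    ttr = datetime.strptime(f["up"], FMT) - datetime.strptime(f["down"], FMT)
    print(f"{f['component']}: time to repair = {ttr.total_seconds() / 60:.0f} minutes")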

The ratio of the average traffic during the failure to the traffic before the failure was computed on a link-by-link basis. To calculate the effectiveness of redundancy, this comparison can be made for different links in the network. If the failure is completely masked, the ratio is expected to be 1, as this indicates that the traffic during the failure is the same as when the link was functioning normally prior to the failure. [20]
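A minimal sketch of this normalization is shown below. The per-link traffic samples are invented for illustration, and the use of the mean as the averaging function is an assumption that follows the description above rather than the exact methodology of [20].

# Sketch: estimate how well redundancy masks a failure by computing, per link,
# the ratio of average traffic during the failure to average traffic before it.
# A ratio close to 1 suggests the failure was fully masked.
from statistics import mean

def masking_ratio(traffic_before, traffic_during):
    return mean(traffic_during) / mean(traffic_before)

# Hypothetical per-link byte counts sampled before and during a failure.
links = {
    "failed_primary_link":   ([980, 1010, 995],  [40, 35, 50]),
    "redundant_backup_link": ([960, 990, 1005],  [930, 970, 1015]),
}

for name, (before, during) in links.items():
    print(f"{name}: ratio = {masking_ratio(before, during):.2f}")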

Figure 32 shows a redundancy group example, where the primary aggregation switch (AggS-P) is connected to the primary access router (AccR-P) as well as to the backup aggregation switch (AggS-B) and the backup access router (AccR-B) to implement redundancy. The backup devices take over routing in case of failures, providing a separate route for traffic. However, implementing redundancy may not always be feasible for all devices due to limitations on cost and space. It may therefore be most useful to establish redundancy for the core devices involved in critical functionality, to reduce the impact of failures.

Figure 32: Redundancy Network Group [20]


CHAPTER 5

CONCLUSION

SSD validation involves several hardware and software components. Hardware components such as server platforms and network infrastructure provide the environment for validation, and Torridon Systems implement the hardware required to perform hot plug and hot swap testing. [1] The OS and test tools are used to issue workloads such as sequential and random I/O. While the tools and the platform allow SSDs to be tested efficiently before release, bugs related to the environment can be a significant source of noise in validation and can mask actual SSD related issues. JIRA dashboards are useful for filtering issues from a specific period of time; using JIRA tools and spreadsheet pivot charts, the main sources of noise could be analyzed, and the factors frequently leading to noise could then be targeted for debugging with a higher priority.

Pin bounce issues during hot plug testing can result in corrupted SMBus transactions. Further, the dependence of hot swap on the host and Quarch related bugs cause hot plug and hot swap to add a significant amount of noise. To improve testing efficiency, solutions for hot plug testing include creating scenarios for precise and repeatable pin mating irregularities and characterizing host timing using the protocol analyzer. [17] To prevent the drive and host communication from being altered, an interposer that does not retime the transactions can be used on the bus.

Linux kernel crashes can have several underlying causes such as faulty drivers, host characteristics, OS installation issues and kernel bugs. When a fatal error occurs on a Linux system, the OS goes into a kernel panic state, where certain functionalities are not allowed in order to prevent any security breaches. [18] The equivalent of this on a Windows system is the Blue Screen of Death (BSOD). Many tools are available for debugging the underlying cause of a kernel panic or BSOD. Kdump is a RHEL utility that dumps and saves the system memory at the time of a crash so that it can be analyzed once the system has recovered. [8] The WinDbg utility on Windows can be used to find the bug check code and parameters leading to the crash. Dmesg logs in Linux and the event logs in Windows also provide more details on the devices and events leading to the crash state of the OS. Using a long-term support kernel version can help in avoiding noise, as its bug fixes have been tested over a longer period of time than the other kernel version types. The majority of BSOD issues are observed to be caused by faulty drivers. Measures that can be taken to avoid crash issues include updating the chipset driver, monitoring for faulty drivers in the Device Manager, using the Memory Diagnostics tool provided by Windows to check for memory violations, and running the SFC scan command from the Windows command prompt to run a diagnostic on the system files. [19]

Network failures can disrupt traffic on the network, for example by causing power cycle events to be scheduled at incorrect points in time. Data center failures can be bucketized into link failures and device failures, and the Time to Repair plays an important role when gauging the extent of the failures. Redundancy methods can be employed to lower the total number of network-related failures. [20] Given the limitations on cost and space, it may be most useful to implement redundancy for the core devices.


REFERENCES

[1] “Quarch Technology Ltd Torridon System User Manual”, Quarch Technology. [Online]. Available: https://quarch.com/downloads/manual/. [Accessed Oct 1, 2020]

[2] “Technical Specifications for the Intel® Server Board and Intel® Server System Based on Intel® Server Board S2600WF Family”, Intel. [Online]. Available: https://www.intel.com/content/www/us/en/support/articles/000023750/server-products/server-boards.html. [Accessed Oct 3, 2020]

[3] “PCI Express® Solutions Field-Proven, Interoperable and Standards-Compliant Portfolio”, Microchip. [Online]. Available: http://ww1.microchip.com/downloads/en/DeviceDoc/00003074A.pdf. [Accessed Oct 6, 2020]

[4] “ExpressFabric PCIe Gen 4.0 and Gen 3.0 Switch and Retimer Solutions”, Broadcom. [Online]. Available: https://docs.broadcom.com/doc/BC-0484EN. [Accessed Oct 15, 2020]

[5] M. F. Bari et al., "Data Center Network Virtualization: A Survey," IEEE Communications Surveys & Tutorials, vol. 15, 2013.

[6] J. Saetent, N. Vejkanchana and S. Chittayasothorn, "A thin client application development using OCL and conceptual schema," 2011 International Conference for Internet Technology and Secured Transactions, Abu Dhabi, 2011.

[7] “Red Hat Enterprise Linux”, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux. [Accessed Oct 3, 2020]

[8] “Deployment, configuration and administration of Red Hat Enterprise Linux 5”, Red Hat. [Online]. Available: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/deployment_guide/index. [Accessed Oct 9, 2020]

[9] “Pynvme Docs”, Pynvme. [Online]. Available: https://pynvme.readthedocs.io/features.html. [Accessed Oct 14, 2020]

[10] “fio - Flexible I/O tester rev. 3.23”, FIO. [Online]. Available: https://fio.readthedocs.io/en/latest/fio_doc.html. [Accessed Oct 21, 2020]

[11] “Medusa Labs Test Tools Suite Version 7.4”, Viavi Solutions. [Online]. Available: https://www.viavisolutions.com/en-us/literature/medusa-labs-test-tools-suite-users-guide-version-74-manual-user-guide-en.pdf. [Accessed Oct 15, 2020]

[12] “Open Source NVMe™ Management Utility – NVMe Command Line Interface (NVMe-CLI)”, NVM Express. [Online]. Available: https://nvmexpress.org/open-source-nvme-management-utility-nvme-command-line-interface-nvme-cli. [Accessed Sep 30, 2020]

[13] B. Farrow, “PCIMEM”. [Online]. Available: https://github.com/billfarrow/pcimem. [Accessed Oct 6, 2020]

[14] “NVM Express Base Specification, Revision 1.4”, NVM Express, June 10, 2019. [Online]. Available: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf.

[15] “PCI Express® Base Specification Revision 5.0 Version 1.0”, PCI-SIG, May 22, 2019. [Online]. Available: https://pcisig.com/specifications.

[16] “Jira use cases”, Atlassian. [Online]. Available: https://www.atlassian.com/software/jira/guides/use-cases. [Accessed Oct 3, 2020]

[17] “Testing hot-swap on storage devices”, Quarch Technology. [Online]. Available: https://quarch.com/downloads/manual/. [Accessed Oct 19, 2020]

[18] “Kernel panic”, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Kernel_panic#Causes. [Accessed Oct 12, 2020]

[19] “Debugging Tools for Windows”, Microsoft. [Online]. Available: https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger. [Accessed Oct 21, 2020]

[20] P. Gill, N. Jain, and N. Nagappan, "Understanding network failures in data centers: measurement, analysis, and implications," SIGCOMM Comput. Commun. Rev., vol. 41, 2011.

[21] S. R. Smoot and N. K. Tan, "Private Cloud Computing", MK Publishers, 2012.