Investigating the Potential of Custom Instruction Set Extensions for SHA-3 Candidates on a 16-Bit Microcontroller Architecture
Total Page:16
File Type:pdf, Size:1020Kb
Investigating the Potential of Custom Instruction Set Extensions for SHA-3 Candidates on a 16-bit Microcontroller Architecture Jeremy Constantin1, Andreas Burg1, and Frank K. Gürkaynak2 1 Telecommunications Circuits Laboratory, EPFL, Switzerland {jeremy.constantin,andreas.burg}@epfl.ch 2 Microelectronics Designs Center, ETH Zurich, Switzerland [email protected] Abstract. In this paper, we investigate the benefit of instruction set extensions for software implementations of all five SHA-3 candidates. To this end, we start from optimized assembly code for a common 16-bit microcontroller instruction set architecture. By themselves, these implementations provide reference for complexity of the algo- rithms on 16-bit architectures, commonly used in embedded systems. For each algorithm, we then propose suitable instruction set extensions and implement the modified processor core. We assess the gains in throughput, mem- ory consumption, and the area overhead. Our results show that with less than 10% additional area, it is possible to increase the execution speed on average by almost 40%, while reducing memory requirements on average by more than 40%. In particular, the Grøstl algorithm, which was one of the slowest algorithms in previous reference implementations, ends up being the fastest implementation by some margin, once minor (but dedicated) instruction set extensions are taken into account. Key words: SHA-3 Final Round Candidates, Software Implementation, Assembler, 16-bit Microcontroller, In- struction Set Extensions, ISA Exploration 1 Introduction In 2007, the U.S. National Institute of Standards and Technology (NIST) started a public competition aiming at the selection of a new standard for cryptographic hashing [13]. The cryptographic community was asked to propose new hash functions and to evaluate the security level of other candidates. In 2008, 51 functions were accepted to the first round, the second round (2009) reduced this number to 14, and for the final, third round (2010) the field has been further reduced to five candidates: BLAKE, Grøstl, JH, Keccak, and Skein. Following the third SHA-3 candidate conference, a winner algorithm is expected to be announced in 2012. Potential applications of SHA-3 standard range from multi-gigabit data transmission protocols with high per- formance requirements to radio-frequency identification (RFID) tags, which typically have to operate with severely constrained resources. As a result, the organizers are not only interested in the cryptographic strength of the candidates, but also in the evaluation of the performance of the algorithm implemented on different platforms. Following the success of the public AES selection process, the SHA-3 evaluation has attracted many contributions comparing the performance of candidate algorithms on different platforms. Extensive results have been published for software [2] [14], and hardware [8] [5] [9] implementations of SHA-3 candidates. In recent years the performance of microprocessors has continued its exponential growth. Modern microprocessors have become increasingly complex structures, consisting of several levels of on-chip cache memory and in most cases multiple parallel cores and execution units. Some recent CPUs even include instructions to execute AES round opera- tions on 128-bit wide registers. At the same time, small and simple microcontroller units (MCU) have become popular for embedded applications. MCUs are regularly integrated together with other components to form Systems-on-Chips (SoC) that combine the flexibility of programmable components with the cost benefits of increased integration densi- ties and the performance due to modern manufacturing technology nodes below 100nm feature size. Despite the use of advanced process technologies, MCU implementations in embedded systems are still severely constrained by resources such as total code size and required data memory. Furthermore, achieving a given throughput with the least possible complexity (number of instructions) is critical due to constraints on execution speed and energy consumption. A study by Wenzel-Benner and Gräf has compared the performance of SHA-3 candidates on several MCU platforms [18]. Thomas Pornin [14] has also published a library (sphlib) containing implementations of SHA-3 candidates geared towards MCUs with constrained resources. Unfortunately, many of the available resources provide only little reference on the performance of carefully hand-crafted assembly implementations on 16-bit MCU, which are most frequently used in many embedded applications. An additional important point about embedded MCUs is that, in many applications, the MCU would be integrated in a custom SoC, where it can be customized with instruction set extensions to the application. The basic idea behind such extensions is to enhance the datapath of an MCU so that an application can be executed with fewer number of instructions increasing operation speed, reducing energy consumption, and saving resources (RAM and ROM). Such new additions could add some overhead (increased circuit size, or reduced maximum operating speed due to additions to the datapath), but in many cases could still offer significant advantages. Most IP vendors already offer a wide variety of options for their embedded MCUs. Once an algorithm has been selected as a standard, processors and MCUs with specifically tailored instruction set extensions are expected to appear as IP blocks or stand-alone components, similar to the case of AES in modern CPUs. Furthermore, many companies offer solutions [16] that facilitate design of application specific processors and the customization of commonly used MCUs. Starting from a high-level description, such solutions allow changes to be made to a processor architecture, and not only produce the corresponding hardware description for integration into a SoC, but also to generate the tool chain (compiler, assembler, linker) that will support this enhanced processor, allowing it to be used with relative ease. Contributions In this paper, we compare the five remaining SHA-3 candidates with respect to their suitability for software implementation on present and future MCUs and embedded SoCs. To this end, we start from a small, general purpose 16-bit MCU, based on the PIC24 series of microcontrollers, and evaluate the performance of all five SHA-3 candidates. To match the requirements of resource constrained embedded systems, we consider only hand-crafted as- sembly implementations of the corresponding kernels, which are expected to become available as library components for the different MCUs once a winner has been selected. Then, for each candidate we examine this reference imple- mentation and identify promising instruction set extensions. We report the performance gain, the memory reduction, and the overhead associated with these extensions. In this way, we aim to uncover if some of the SHA-3 candidates could achieve a better performance when instruction set extensions are used. Outline The paper is organized as follows: In section 2 we describe the reference MCU used throughout this evalua- tion. Section 3 defines the performance metrics used in this study. The design and evaluation flow used to determine the instruction set extension for each candidate algorithm, how they were added to the base MCU is explained in section 4. The following section 5 contains details of the implementation for each candidate algorithm. The combined results ob- tained from this study are presented and compared with results from other publications in section 6 and the sources of errors in this work are briefly described in section 7. Finally section 8 provides concluding remarks. Appendix A contains the tables with the detailed implementation results of the five individual candidates. 2 The Embedded Microcontroller, PIC24 In this work we use our custom implementation of the Microchip PIC24 16-bit architecture [12], as reference imple- mentation. This microcontroller was selected because of its common features for 16-bit MCU architectures, and due to the fact that a functionally verified description was available through an earlier unrelated project. 2.1 Instruction Set Architecture Summary The chosen reference MCU is a 16-bit Harvard architecture with a total of 87 base instructions encoded in a 24-bit instruction word [11]. The majority of the instructions allow for a variety of different addressing schemes and operand modes, resulting in 190 different effective instructions. All operands and destinations can be either addressed in byte or word mode, for most instructions. Word mode represents native 16 bit data addressing, while byte mode only operates on the least significant byte of the corresponding data word. This feature is not necessarily supported by all 16-bit MCU architectures, and can be an important factor for the overall performance of an algorithm implementation (cf. Grøstl). The ALU uses a 16-entry general purpose register array named W0-W15, with a stack pointer assigned to register W15. In addition there are several status and control registers, and a 16-bit repeat loop counter used in conjunction with the REPEAT instruction. This command repeats the following instruction as often as specified, allowing for very simple hardware loops, thus reducing the program size. 2.2 Micro Architecture Summary The micro architecture of our PIC24 implementation is inspired by and hence very similar to that of the original PIC24. It comprises three pipeline stages and executes almost all instructions in a single cycle.