Exploring Voltage Scaling Techniques in Embedded Processors Hardware Monitors

Arman Pouraghily, Padmaja Duggisetty, Thiago Teixeira
VLSI Design Principles Final Report - ECE 658 Fall 2014
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA, USA
Email: {apouraghily,pduggisetty,tteixeira}@umass.edu

Abstract—The Internet is a very important communication infrastructure in modern life, with applications ranging from banking transactions and the transfer of copyrighted material and companies' assets to the simplest web surfing. The role of the Internet is expected to grow exponentially with cloud computing and the Internet of Things. In this context, the router will continue to be the equipment carrying most of the Internet traffic, with an increasing number of routers' packet forwarding applications being deployed on programmable network processors. Many security algorithms have been developed for network security as a whole in order to increase the reliability of communications. The security of the network infrastructure has received less attention, even though attacks in the data plane that can trigger changes in the network processor have been successfully demonstrated. Hardware monitors are the proposed solution for securing the network infrastructure: they compare the correct instructions of the network processor with the instructions actually running, and reset and recover the processor if an instruction deviates from the monitoring graph. Hardware monitors have to execute instructions at a very high speed. Our challenge in this work is to look at techniques that reduce the voltage level, which makes the hardware monitor slower, while maintaining its functionality and hence consuming less power.

Index Terms—hardware monitor, voltage scaling.

I. INTRODUCTION

With the ubiquitous presence of the Internet in today's society, ensuring trustworthy communication is key. Financial transactions, private company data flowing from one office to another, and private user data are a few examples of where the Internet is required to work correctly.

An important part of the Internet infrastructure is the router. An increasing number of routers are shipped with programmable packet processing, used by vendors to extend system functionality. The packet processing applications are implemented on a network processor (NP).

Furthermore, [1] showed that the network processor can be exploited using an integer overflow attack. Hardware monitors are the standard solution for preventing attacks on network processors. [2] presented a solution that uses a single memory read per instruction to compare the correct code with the code being executed. If the monitor detects a malicious pattern (an invalid state, for instance), the packet is dropped.

The hardware monitor is extra hardware that has to be accommodated on the chip. We want it to be small and to consume little power. Hence, in this work, power consumption is taken into account. Our goal is to estimate the power consumption overhead imposed by the introduction of the hardware monitor and to minimize this overhead by applying voltage scaling techniques.

The basic idea behind our approach is the simple fact that the logical complexity of the modules used in the hardware monitor is far lower than that of the modules used in the main CPU, which means that our hardware monitor could run at a much higher frequency. However, running the hardware monitor at a higher frequency than the network processor wastes resources. The hardware monitor must track and monitor the behaviour of the main processor, so running it at a higher frequency means that it would overtake the main processor, which defeats the whole purpose of monitoring.

This simple fact gives rise to an added advantage: being able to run at a higher frequency means having much more slack time within a predefined time budget (which is imposed by the critical path delay of the main processor). The main objective of our work is to make good use of this slack time. We also know that the delay of a digital circuit is inversely related to the supply voltage, which means that decreasing the supply voltage increases the delay of the circuit. However, the power consumption of a circuit scales quadratically with the supply voltage, meaning that a small change in supply voltage has a significant impact on the power consumption of the circuit. This work deals predominantly with the library characterization of digital CMOS logic gates at different voltage levels in order to predict delay and power consumption. The power consumption of the hardware monitor is reduced because it operates at a voltage level lower than the nominal voltage.

Our results show that voltage scaling has a large impact on power savings and that the added circuit still satisfies the timing budget.

II. LITERATURE REVIEW

This work comprises two main topics: the operation of the hardware monitor and voltage scaling. The first concerns device security, while the second concerns reducing the power consumption as much as possible.

A. Hardware Monitor

Modern high-performance routers no longer use application-specific integrated circuits (ASICs); instead, they use programmable network processors [3]. Network processors are multi-core high-performance embedded systems that implement packet forwarding and other network functions programmed in software. While programmable network processors offer router vendors and network providers the flexibility to remotely reprogram the equipment, they also expose the Internet infrastructure to potential risks. Defence mechanisms against data plane attacks on network processors have been proposed. Specifically, hardware monitors operate in parallel with network processors, monitoring the processor core and comparing its behaviour with a monitoring graph. If the behaviour deviates from the monitoring graph, the processor core is reset (e.g. the packet is dropped) and recovered.

An effective network processor monitoring system needs to verify every instruction that is executed by the processor. For this reason the monitors need to run at very high speeds to keep up with the processor. This instruction-based monitoring can be viewed as a finite automaton with a fixed number of acceptable paths. A deterministic finite automaton (DFA) has been used to perform instruction-level monitoring, as opposed to a non-deterministic finite automaton (NFA). Using a DFA reduces the memory bandwidth requirement compared to an NFA used for monitoring. All the previous work in embedded security had been done on a Von Neumann processor architecture, but network processors use a Harvard architecture. So an example attack was presented to prove the existence of attacks on the Harvard architecture and to show how the processor is protected from such an attack. These two key problems were addressed in the paper.

In [2], the authors developed a high-performance hardware monitor that takes a single memory read per instruction, operating at speeds sufficient to maintain the network data transfer rate. A deterministic monitoring graph is implemented in the form of a state machine derived from the packet processing code. For each instruction executed on the processor core, a hash value of the executed operation is reported to the monitor. The monitor uses the comparison logic to compare the reported hash value to the information that is stored in the monitoring graph. The monitoring graph used by the monitor is a state machine, where each state represents a specific processor instruction. The monitoring system uses a 4-bit hash of the next instruction to label edges in the monitoring graph. The monitor verifies each instruction with the comparison logic. An attack changes the operation of the processor core, leading to a deviation. This deviation produces hash values that do not match the monitoring graph. Upon detecting a deviation reported by the comparison logic, the monitor resets the processor core.

In order to match the speed of the processor core, the comparison logic needs to be able to retrieve the information about next-state transitions for every instruction in no more than one memory access per instruction. To achieve this, the solution is to represent the DFA states with varying numbers of outgoing edges by encoding all the necessary information in a single table entry, and to group states by the number of outgoing edges and by the same previous state. The memory contains tuples and is logically divided into groups. The base addresses for each group are stored in a register file with 16 entries. This makes accessing the memory for the hash code of the next instruction faster. An increase in speed over previous approaches was observed with this memory organization.

Code injection attacks are feasible on a Harvard architecture processor using a return-oriented programming technique. In such attacks the attacker takes control of return instructions in the stack to chain attack code from an existing function. Since the code is already in the executable memory, the attack cannot be prevented. One such attack, made possible by an integer overflow vulnerability, is presented in the paper. An attacker sends a UDP packet with a maximum size, i.e., 65534. This passes the maximum packet size check because, in 16-bit arithmetic, 65534 + 12 = 10 due to integer overflow. The packet payload is crafted so that the return address is overwritten and all the ports are flooded with the attack packets. As a result the system crashes. As soon as the control flow changes, the hash values reported by the processor no longer match the monitoring graph information and the system is reset. Attacks of this kind are hence detected by the developed hardware monitor, since there is no valid edge between the states. All the above work was implemented in fixed logic and prototyped on a Stratix IV GX230 FPGA located on an Altera DE4 board.

In [4] the authors extended the work to multi-core network processors, implemented on a field-programmable gate array (FPGA) platform.

B. Voltage Scaling Technique

Moore's law has been driving the advances of the semiconductor industry since 1965. However, with the miniaturization of the transistor and the consequent increase in transistor count, power consumption has become a barrier to the advance of Moore's law. Multi-core processors were the solution found by industry to increase computation power.

The most advanced systems on chip (SoCs) on the market can easily reach billions of transistors. For instance, Apple's latest A8 processor has a dual-core CPU and approximately two billion transistors, fabricated in a 20 nm process [5]. With 14 nm and smaller processes expected to reach the market very soon, these dense chips cannot turn on all their transistors at the same time. This phenomenon is known as dark silicon, referring to power constraints such as leakage and dynamic power.

The authors in [6] explored near-threshold computing (NTC), where the supply voltage is approximately equal to the threshold voltage of the transistors. By reducing the supply voltage from the super-threshold voltage to the near-threshold voltage, the authors observed a gain on the order of 10X in power consumption, with a performance degradation of 10X.

Reducing the supply voltage reduces the power consumption, but there are trade-offs. When the supply voltage drops, delay increases exponentially, reducing overall performance. Therefore, the optimum point should be where we gain in power consumption without compromising the circuit delay. Another trade-off is performance variation, as the dependencies of drive current on Vth, Vdd, and temperature approach exponential. Hence, when the supply voltage decreases, the performance uncertainty increases. Last, NTC increases functional failures of devices due to variations in process, temperature, and voltage. Current research to overcome these barriers is presented in [6].

III. IMPLEMENTATION

The following subsections describe the flow used to apply the voltage scaling techniques. We started by designing the schematics, then drew the layouts, customized the technology library, synthesized and validated the design, and evaluated power and timing. The results are shown in the next section.

A. Gate Characterization and Library Creation

Our primary goal is to investigate the impact of supply voltage scaling on critical path delay and power consumption. To accomplish that, we need a synthesis library characterized at each voltage level. The 45 nm NangateOpenCellLibrary [7] is a free low-power library; therefore, we used it in our project. The library defines the behavior and characterization of its cells under a 1.1 V supply voltage, which is nominal for this technology.

Both the delay and the power consumption of a library are characterized and stored in an industry-format (Liberty) file. For delay, both pin-to-pin delays and the corresponding output slopes are typically characterized for identified timing arcs as a function of load and/or input slope. In general, this allows slews to propagate during delay and timing analysis and to be used to characterize and analyze power consumption.

For power, both static and dynamic sources of power are characterized. Dynamic power is made up of internal power and switching power. The former is dissipated by the cell in the absence of a load capacitance and the latter is the component that is dissipated while charging or discharging a load capacitance. Dynamic power is measured per timing arc (as with delay). Static dissipation is due to leakage currents through OFF transistors and can be significant when the circuit is in the idle state (there is no switching activity).

Since we did not have access to Liberty NCX [8], which was mentioned in our proposal, we had to do the characterization manually, using HSpice simulation and updating the liberty file by hand. As the library contains more than 100 primitive cells and the characterization procedure is very time consuming, we needed to narrow down our cell library. We chose the 2-input NAND, 2-input NOR, inverter, and D-type flip-flop as our library, which is sufficient to implement any combinational or sequential circuit. Since our goal is to compare the behavior of the synthesized circuit using these libraries, the exact delay or power consumption is not relevant, but the ratio between them is important. Hence, our library encompasses only the minimum-sized gates, which makes our synthesized circuit very slow; however, the absolute value of the delay is not our main concern.

Since we did not have access to the HSpice models of the NangateOpenCellLibrary, our first step was to obtain these models by drawing the layout for these gates using Cadence Virtuoso [9] and extracting the HSpice model for them.

Figures 4-11 show the schematics and layouts for the inverter, NAND, NOR and D flip-flop. After clearing the DRC and LVS checks, we extracted the netlist from the layout for each of the gates. The next step was extracting the characteristics of the cells. By characteristics, we mean power and delay behavior. As we know, the power and delay behavior of a gate depends on two factors: the capacitive load being driven by it, and the slew of the inputs when they change.

By looking at the liberty file of the NangateOpenCellLibrary, we could see that each gate has a list of ports with a defined capacitive load. We assumed that the port definition is the same in our library and left it untouched. After the port definition, we are required to define the leakage power behavior of the cell. To do so, we applied every possible input vector to the inputs of the cell, let the output settle and then, without changing the inputs, measured the power consumed by the circuit in the steady state.

After characterizing the leakage power, we need to characterize the timing behavior of the cell. Timing behavior includes the rise/fall time of the output and the tphl/tplh propagation delays. Since the delay depends on both the slew of the input and the capacitive load of the output, the delay behavior of the cell is summarized in a 2-dimensional table. The row index is the slew of the input and the column index is the capacitive load of the output. Each dimension has seven different index values. If the slew and the capacitive load match one of those entries, the tabulated delay value is used; otherwise the output value is estimated by interpolating between the values in the closest entries. We have the same tables for the rising dynamic power and the falling dynamic power.

Since the circuit behaves differently depending on which input changes, we have another set of those tables for the other input too. As we are going to compare the impact of voltage scaling, we made four copies of the liberty file with values extracted from HSpice simulation for four different voltage values (1.1 V, 0.9 V, 0.8 V, and 0.7 V). After filling all the tables, our characterization procedure is complete. In the next step, we compile the liberty file and build the synthesis library (.db file). This process was done using Design Compiler, and four different libraries were created, corresponding to the four different voltage levels.

B. Synthesis and Validation

RTL synthesis is one of the main parts of an ASIC design flow. It is a technology-dependent process that performs translation, optimization, and mapping of the design files. These files comprise a set of constraints, the high-level hardware design, the technology library files, and the RTL source. In our work we used Synopsys Design Compiler as the RTL synthesizer, adapting the set of constraints from [10].

The high-level hardware design files were written in Verilog. They comprise a top-level hardware monitor module (HWMonitor.v) and six lower-level modules (base_addr.v, controller.v, CurrentPointer.v, GID_Table.v, my_register.v, PID2Addr.v) that are used for synthesis. Another important set of files are the tool command language scripts (.tcl). TCL is used to drive the Synopsys tools. Among the TCL files, a few are worth mentioning.

The Setup.tcl script is used to set various parameters such as the clock name, top module, RTL directory, and Gate directory. The Read.tcl script is used to read the Verilog files corresponding to each module in the hardware monitor. Constraints to optimize the power are included in the Constraints.tcl file. CompileAnalyze.tcl is used to synthesize the design based on the constraints. At the end of these four steps we extracted various reports, which include the critical delay path, leakage power, and dynamic power.

Technology libraries are used by the design compiler for the different voltage levels. The following commands compile the .lib, which is the technology library source file, into a .db file, the Synopsys database format.

    $read_lib \$PATH\NangateOpenCellLibrary
    $write_lib \$PATH\NangateOpenCellLibrary

Synthesis was performed using the four different custom libraries, one for each voltage level. For each voltage level, we extracted timing reports for the typical process corner, as well as power reports.

Customizing the Liberty file (.lib) is a time-consuming task (a two-input gate has approximately 400 entries). We extracted the values from several HSpice simulations, transferred them to a spreadsheet, and finally entered them into the Liberty file. Since Spice simulates by solving differential equations, the results we obtained are accurate. We took care to adapt the measurement points for every different slope, capacitance, and voltage level.

IV. RESULTS

The simulation results of our synthesis using Synopsys Design Compiler with our customized library are compared with a typical ARM9 embedded processor [11].

ARM processors are the predominant processors in today's embedded systems. A typical ARM9 processor runs at a frequency of 275 MHz. Since we do not have access to the HDL code of the processor, we make a few assumptions to enable the comparison.

The synthesis using the 45 nm NangateOpenCellLibrary yielded a critical path of 2.25 ns. We assume that the results of synthesis using the TSMC library and the NangateOpenCellLibrary are close (in fact, TSMC is by far the fastest). As mentioned earlier, we assume that the processor runs at 275 MHz, which means a critical path delay of 3.6 ns.

Table 1 - Data arrival time at each voltage level.

Voltage level (V)         0.7      0.8      0.9      1.1
Data arrival time (ns)    8.8926   6.4291   5.5986   3.9002

Table 1 shows the critical path delay for each voltage level, simulated using our customized version of the NangateOpenCellLibrary, with the design compiler aiming at delay and power optimization. As seen in figure 1, reducing the Vdd level increases the critical path delay, which is expected. However, the question is how far we can reduce the voltage level while the circuit remains functional.

Fig. 1. Data arrival time for different power levels

The synthesis of our circuit using the customized library at a Vdd of 1.1 V shows a delay of 3.9 ns, and we assume that the delay of the processor would be scaled by the same ratio (3.9/2.25). With this assumption, the critical path delay of the processor becomes 6.24 ns. Since the hardware monitor is supposed to work synchronized with the processor, its clock frequency would be the same and it would have a delay slack of about 2.34 ns (6.24 ns - 3.9 ns). In order to save power, we can reduce the supply voltage as long as the timing requirement is met.

According to table 1, if we reduce the voltage to 0.8 V, the delay of our hardware monitor would be 6.43 ns, which violates the timing requirement. Given how the delay increases, the lowest usable voltage should be slightly higher than 0.8 V. By decreasing the supply voltage of the hardware monitor, we could save 57 uW according to figure 2, which translates to a 50 percent reduction of its total power consumption at 1.1 V.

Fig. 2. Static and dynamic power measures

We ran several simulations in order to evaluate the minimum critical path for each case. Figure 3 shows the extreme cases, where leakage and dynamic power are optimized while the input voltage is kept constant. In both cases, the clock was optimized by the design compiler. With the input voltage kept constant, dynamic power decreases at a slower pace than if the voltage level is also scaled.

Fig. 3. A comparison between optimized and non-optimized power

V. FUTURE WORK

The effectiveness of scaling the voltage of the hardware monitor is demonstrated by the reduced power consumption: the hardware monitor consumes less power than before and still maintains its functionality. We encountered many difficulties due to limited access to the tools proposed in the previous report. These tools, such as PrimeTime, CACTI, and ModelSim, were not accessible, making the work much more challenging, with several hours of Spice simulations and library customization. We made use of Synopsys Design Compiler for most of our work. In the future, we would like to integrate the processor and the monitor and measure the power consumption using the mentioned tools to obtain more accurate results.

REFERENCES

[1] D. Chasaki and T. Wolf, “Attacks and defenses in the data plane of networks,” Dependable and Secure Computing, IEEE Transactions on, vol. 9, no. 6, pp. 798–810, Nov 2012.
[2] H. Chandrikakutty, D. Unnikrishnan, R. Tessier, and T. Wolf, “High-performance hardware monitors to protect network processors from data plane attacks,” in Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE, May 2013, pp. 1–6.
[3] W. Eatherton, “The push of network processing to the top of the pyramid,” in keynote address at Symposium on Architectures for Networking and Communications Systems, 2005, pp. 26–28.
[4] K. Hu, H. Chandrikakutty, R. Tessier, and T. Wolf, “Scalable hardware monitors to protect network processors from data plane attacks,” in Communications and Network Security (CNS), 2013 IEEE Conference on, Oct 2013, pp. 314–322.
[5] Chipworks, “Inside the iphone 6 and iphone 6 plus (part 2),” http://www.chipworks.com/en/technical-competitive-analysis/resources/blog/inside-the-iphone-6-and-iphone-6-plus-part-2/?lang=en&Itemid=815, accessed: 2014-11-17.
[6] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-threshold computing: Reclaiming moore’s law through energy efficient integrated circuits,” Proceedings of the IEEE, vol. 98, no. 2, pp. 253–266, Feb 2010.
[7] Nangate, “Nangate freepdk45 open cell library,” http://www.nangate.com/?page_id=2325, accessed: 2014-12-19.
[8] Synopsys, “Tutorial:liberty ncx,” http://www.synopsys.com/Tools/Implementation/SignOff/Pages/LibertyNCX.aspx, accessed: 2014-12-19.
[9] Cadence, “Tutorial:custom ic design,” http://www.cadence.com/products/cic/Pages/default.aspx, accessed: 2014-12-19.
[10] N. C. S. University, “Tutorial:place & route tutorials,” http://www.eda.ncsu.edu/wiki/Tutorial:Place_%26_Route_Tutorials, accessed: 2014-12-19.
[11] ARM, “Tutorial:arm926 processor,” http://www.arm.com/products/processors/classic/arm9/arm926.php, accessed: 2014-12-19.

VI. APPENDIX

Please refer to this section for the list of figures used in the text.

Fig. 4. Inverter schematic

Fig. 5. Inverter layout

Fig. 6. NAND schematic

Fig. 7. NAND layout

Fig. 8. NOR schematic

Fig. 9. NOR layout

Fig. 10. Flip-Flop schematic

Fig. 11. Flip-Flop layout
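
As a supplementary illustration of the timing-budget reasoning in Section IV, the following Python sketch recomputes the scaled processor budget from the quoted delays and checks which characterized voltage levels from Table 1 still meet it. It is a minimal sketch under stated assumptions: the numeric values are those quoted in the text, while every constant and function name (scaled_budget_ns, lowest_feasible_vdd, dynamic_power_ratio) is hypothetical and not part of the original tool flow. The quadratic power estimate is only the first-order rule mentioned in the Introduction, not the measured saving reported in Fig. 2.

```python
# Illustrative sketch only: the numbers are those quoted in Section IV and
# Table 1; every name below is hypothetical and not part of the original flow.

PROCESSOR_DELAY_NS = 3.6        # critical path quoted for a 275 MHz ARM9
MONITOR_DELAY_STOCK_NS = 2.25   # monitor critical path, stock Nangate library

# Table 1: monitor data arrival time (ns) per supply voltage (V),
# obtained with the customized minimum-size library.
ARRIVAL_TIME_NS = {0.7: 8.8926, 0.8: 6.4291, 0.9: 5.5986, 1.1: 3.9002}
NOMINAL_VDD = 1.1


def scaled_budget_ns():
    """Scale the processor delay by the custom/stock library delay ratio."""
    ratio = ARRIVAL_TIME_NS[NOMINAL_VDD] / MONITOR_DELAY_STOCK_NS
    return PROCESSOR_DELAY_NS * ratio


def lowest_feasible_vdd(budget_ns):
    """Return the lowest characterized voltage whose delay fits the budget."""
    feasible = [v for v, d in ARRIVAL_TIME_NS.items() if d <= budget_ns]
    return min(feasible) if feasible else None


def dynamic_power_ratio(vdd, vdd_ref=NOMINAL_VDD):
    """First-order estimate: dynamic power scales with the square of Vdd."""
    return (vdd / vdd_ref) ** 2


if __name__ == "__main__":
    budget = scaled_budget_ns()
    print(f"timing budget: {budget:.2f} ns")
    for vdd in sorted(ARRIVAL_TIME_NS):
        slack = budget - ARRIVAL_TIME_NS[vdd]
        status = "meets" if slack >= 0 else "violates"
        print(f"Vdd = {vdd:.1f} V: slack {slack:+.2f} ns ({status} budget)")
    best = lowest_feasible_vdd(budget)
    print(f"lowest characterized Vdd meeting the budget: {best} V")
    print(f"first-order dynamic power ratio: {dynamic_power_ratio(best):.2f}")
```

Among the four characterized levels, only 0.9 V and 1.1 V fit the scaled budget, which is consistent with the observation that the lowest usable voltage lies somewhat above 0.8 V. A finer answer would require characterizing additional voltage levels rather than interpolating between these four points.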