
ECE 545

Digital System Design with VHDL

Fall 2014

Kris Gaj

Research and teaching interests:
• computer arithmetic
• cryptography
• network security

Contact: The Engineering Building, room 3225
[email protected]

Office hours: Thursday, 7:30-8:30 PM, Tuesday, 6:00-7:00 PM, and by appointment

Course Web Page

ECE web page → Courses →

Digital System Design with VHDL

(or “Kris Gaj”)

ECE 545

Part of:

MS in Computer Engineering
• One of five core courses (must be passed with B or better)
• Fundamental course for the specialization areas: Digital Systems Design, Digital Signal Processing
• Elective course in the remaining specialization areas

MS in Electrical Engineering
• Elective

ECE 545

Part of:

PhD in Electrical and Computer Engineering
• Knowledge tested at the Technical Qualifying Exam (TQE), Topic 2: Digital Design and Computer Organization

[Figure: chart leading from “I am interested in…” through “I want to specialize primarily in…” to “Recommended program & specialization”. Labels: CAD Tools & Design Automation; VLSI; Digital Systems Design; Hardware Description Languages; FPGAs & Reconfigurable Computing; ASICs & FPGAs; Computer Arithmetic; VHDL; Front-end ASIC Design (algorithmic down to gate level); CAD Tools; Back-end ASIC Design (circuit and mask layout levels); Reconfigurable Computing; Analog & Digital Circuit Design; Microelectronics; VLSI Fabrication; Nanoelectronics; Semiconductor Devices. Recommended programs: MS CpE, Digital Systems Design; MS EE, Microelectronics/Nanoelectronics]

[Figure: courses by design level. Design levels: algorithmic, register-transfer, gate, transistor, layout, devices. Courses: ECE 545 Digital System Design with VHDL, ECE 645 Computer Arithmetic, ECE 681 VLSI Design for ASICs, ECE 682 VLSI Test Concepts, ECE 586 Digital Integrated Circuits, ECE 680 Physical VLSI Design, ECE 584 Semiconductor Device Fundamentals, ECE 684 MOS Device Electronics]

CpE: Digital Systems Design
Pre-Approved Electives:
• ECE 545 Digital System Design with VHDL
• ECE 586 Digital Integrated Circuits
• ECE 645 Computer Arithmetic
• ECE 681 VLSI Design for ASICs
• ECE 682 VLSI Test Concepts
• ECE 699 DSP HW Architectures
Suggested Electives:
• ECE 584, 684, … (technology)
• ECE 511, 611, … (microprocessors)
• ECE 537, 646, 746, … (applications)
• ECE 548 (sequential mach. theory)
Professors: K. Gaj, H. Homayoun, J-P. Kaps, T. Storey, A. Cohen

CpE: Microprocessors and Embedded Systems
Pre-Approved Electives:
• ECE 510 Real-Time Concepts
• ECE 511 Microprocessors
• ECE 611 Advanced Microprocessors
• ECE 612 Real-Time Embedded Systems
• ECE 641 Computer System Architecture
Suggested Electives:
• CS 540, 583 (languages, algorithms)
• CS 635 (parallel machines)
• ECE 542, 642, 742 (networks)
• ECE 645, 681 (digital design)
Professors: H. Homayoun, J. Kaps, P. Pachowicz, C. Sabzevari

DIGITAL SYSTEMS DESIGN

Concentration advisors: Kris Gaj, Houman Homayoun, Jens-Peter Kaps

1. ECE 545 Digital System Design with VHDL – K. Gaj, project, FPGA design with VHDL

2. ECE 645 Computer Arithmetic – K. Gaj, project, FPGA design with VHDL

3. ECE 681 VLSI Design for ASICs – H. Homayoun, project/lab, front-end and back-end ASIC design with tools

4. ECE 586 Digital Integrated Circuits – D. Ioannou, R. Mulpuri

5a. ECE 682 VLSI Test Concepts – T. Storey

5b. ECE 699 Digital Signals Processing Hardware Architectures – A. Cohen, project, FPGA design with VHDL and Matlab/Simulink

DIGITAL SIGNAL PROCESSING

Concentration advisors: Aaron Cohen, Kris Gaj, Ken Hintz, Jill Nelson, Kathleen Wage

1. ECE 535 Digital Signal Processing – L. Griffiths, J. Nelson, Matlab

2. ECE 545 Digital System Design with VHDL – K. Gaj, project, FPGA design with VHDL

3. ECE 645 Computer Arithmetic – K. Gaj, project, FPGA design with VHDL

4. ECE 699 Digital Signals Processing Hardware Architectures – A. Cohen, project, FPGA design with VHDL and Matlab/Simulink

5a. ECE 537 Introduction to Digital Image Processing – K. Hintz

5b. ECE 738 Advanced Digital Signal Processing – K. Wage

Two New Classes in Spring 2015

ECE 699 Software/Hardware Codesign

Instructor: Kris Gaj
TA: Umar Sharif
Prerequisites: ECE 511 and ECE 545
Recommended for students specializing in:
• Microprocessors and Embedded Systems
• Digital System Design

ECE 699 Green Computing and Heterogeneous Architectures

Instructor: Houman Homayoun
Prerequisites: ECE 511
Recommended for students specializing in:
• Microprocessors and Embedded Systems

TA

Malik Umar Sharif
PhD Student in the Cryptographic Engineering Research Group (CERG)

• help with the installation and configuration of CAD tools

• help with understanding of tutorials and the operation of tools

• help with VHDL and tool-oriented homework assignments

• limited help with debugging your project codes

Getting Help Outside of Office Hours

• System for asking questions 24/7
• Answers can be given by students and instructors
• Student answers endorsed (or corrected) by instructors
• Average response time in Spring 2013 = 2.1 hours
• You can submit your questions anonymously
• You can ask private questions visible only to the instructors

Grading Scheme

• Homework - 15%

• Project - 35%

• Midterm Exam - 20%

• Final Exam - 30%

• Class Activity - Bonus 5%

Bonus Points for Class Activity

• Based on class exercises during lecture
• “Small” points earned each week posted on BlackBoard
• Up to 5 “big” bonus points
• Scaled based on the performance of the best student

For example:

                 Small points   Big points
 1. Alice             40            5
 2. Bob               36            4.5
 …                    …             …
28. Charlie            8            1

Midterm exam 1

✓ 2 hours 40 minutes

✓ in class

✓ design-oriented

✓ open-books, open-notes

✓ practice exams available on the web

Tentative date: Last week of October

Final exam

✓ 2 hours 45 minutes

✓ in class

✓ design-oriented

✓ open-books, open-notes

✓ practice exams available on the web

Date: Thursday, December 11, 4:30-7:15pm

Textbooks

Required Textbook

Pong P. Chu, RTL Hardware Design Using VHDL, Wiley-Interscience, 2006.

THE SKILLS AND GUIDANCE NEEDED TO MASTER RTL HARDWARE DESIGN

This book teaches readers how to systematically design efficient, portable, and scalable Register Transfer Level (RTL) digital circuits using the VHDL hardware description language and synthesis software. Focusing on the module-level design, which is composed of functional units, routing circuit, and storage, the book illustrates the relationship between the VHDL constructs and the underlying hardware components, and shows how to develop codes that faithfully reflect the module-level design and can be synthesized into efficient gate-level implementation.

Several unique features distinguish the book:
• Coding style that shows a clear relationship between VHDL constructs and hardware components
• Conceptual diagrams that illustrate the realization of VHDL codes
• … procedures, and techniques
• Two chapters on realizing sequential algorithms in hardware
• Two chapters on scalable and parameterized designs and coding
• One chapter covering the synchronization and interface between multiple clock domains

Although the focus of the book is RTL synthesis, it also examines the synthesis task from the perspective of the overall development process. Readers learn good design practices and guidelines to ensure that an RTL design can accommodate future simulation, verification, and testing needs, and can be easily incorporated into a larger system or reused. Discussion is independent of technology and can be applied to both ASIC and FPGA devices.

With a balanced presentation of fundamentals and practical examples, this is an excellent textbook for upper-level undergraduate or graduate courses in advanced digital logic.

… devices should also refer to this book.

PONG P. CHU, PhD, is Associate Professor in the Department of …

Supplementary Textbook – Basics Refresher
Stephen Brown and Zvonko Vranesic, Fundamentals of Digital Logic with VHDL Design, McGraw-Hill, 3rd or 2nd Edition

Supplementary Textbook – Advanced
Hubert Kaeslin, Digital Integrated Circuit Design: From VLSI Architectures to CMOS Fabrication, Cambridge University Press; 1st Edition, 2008.

Technology & Tools

What is an FPGA?

[Figure: FPGA floorplan with I/O Blocks, Block RAMs, and Configurable Logic Blocks (CLB) / Adaptive Logic Modules (ALM)]

Modern FPGA

[Figure: modern FPGA with RAM blocks, Multipliers/DSP units, and Logic resources (CLBs or ALMs)]

(#Logic resources, #Multipliers/DSP units, #RAM_blocks)

Graphics based on The Design Warrior’s Guide to FPGAs Devices, Tools, and Flows. ISBN 0750676043 Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)

General structure of an FPGA

Programmable interconnect

Programmable logic blocks

The Design Warrior’s Guide to FPGAs Devices, Tools, and Flows. ISBN 0750676043 Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)

ECE 448 – FPGA and ASIC Design with VHDL

4-input LUT (Look-Up Table)
(used in earlier families of FPGAs)

• Look-Up tables are primary elements for logic implementation
• Each LUT can implement any function of 4 inputs

[Figure: LUT with inputs x1, x2, x3, x4 and output y, shown with two example truth tables; the first is also drawn as a gate with inputs x1, x2 and output y]

x1 x2 x3 x4 |  y (first example) |  y (second example)
 0  0  0  0 |  1                 |  0
 0  0  0  1 |  1                 |  1
 0  0  1  0 |  1                 |  0
 0  0  1  1 |  1                 |  0
 0  1  0  0 |  1                 |  0
 0  1  0  1 |  1                 |  1
 0  1  1  0 |  1                 |  0
 0  1  1  1 |  1                 |  1
 1  0  0  0 |  1                 |  0
 1  0  0  1 |  1                 |  1
 1  0  1  0 |  1                 |  0
 1  0  1  1 |  1                 |  0
 1  1  0  0 |  0                 |  1
 1  1  0  1 |  0                 |  1
 1  1  1  0 |  0                 |  0
 1  1  1  1 |  0                 |  0
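The LUT idea can be sketched in software. The Python model below treats a 4-input LUT as a 16-entry table addressed by the input bits. The helper names (make_lut, lut_read) and the two example functions are assumptions chosen for illustration; nothing here comes from an FPGA vendor tool.

```python
from itertools import product


def make_lut(func, n_inputs=4):
    """Precompute the LUT contents: one output bit per input combination.

    Rows are ordered 0000, 0001, ..., 1111 with x1 as the most significant bit.
    """
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]


def lut_read(table, *bits):
    """Look up the output: the input bits form the address into the table."""
    address = 0
    for bit in bits:
        address = (address << 1) | bit
    return table[address]


# Hypothetical example function: y = NOT (x1 AND x2); x3 and x4 are ignored.
nand_x1_x2 = make_lut(lambda x1, x2, x3, x4: 1 - (x1 & x2))

# Any other 4-input function fits in the same 16 bits, e.g. 4-input parity.
parity4 = make_lut(lambda x1, x2, x3, x4: x1 ^ x2 ^ x3 ^ x4)

if __name__ == "__main__":
    print(nand_x1_x2)
    print(lut_read(parity4, 1, 0, 1, 1))
```

Changing the implemented function changes only the 16 stored bits, not the structure around them, which is the sense in which a LUT is programmable.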

6-Input LUT of Spartan-6

Two competing implementation approaches

ASIC (Application Specific Integrated Circuit)
• designed all the way from behavioral description to physical layout
• designs must be sent for expensive and time consuming fabrication in semiconductor foundry

FPGA (Field Programmable Gate Array)
• no physical layout design; design ends with a bitstream used to configure a device
• bought off the shelf and reconfigured by designers themselves

FPGAs vs. ASICs

ASICs
• High performance
• Low power
• Low cost (but only in high volumes)

FPGAs
• Off-the-shelf
• Low development costs
• Short time to the market
• Reconfigurability

Major FPGA Vendors

SRAM-based FPGAs
• Xilinx, Inc. ~ 51% of the market
• Altera Corp. ~ 34% of the market
  (together ~ 85%)
• Tabula

Flash & antifuse FPGAs
• Microsemi SoC Products Group (formerly Actel Corp.)
• Quick Logic Corp.

Xilinx FPGA Families

Technology    Low-cost                   High-performance
220 nm                                   Virtex
180 nm        Spartan-II, Spartan-IIE
120/150 nm                               Virtex-II, Virtex-II Pro
90 nm         Spartan-3                  Virtex-4
65 nm                                    Virtex-5
45 nm         Spartan-6
40 nm                                    Virtex-6
28 nm         Artix-7                    Virtex-7

Altera FPGA Devices

Technology    Low-cost       Mid-range    High-performance
130 nm        Cyclone                     Stratix
90 nm         Cyclone II                  Stratix II
65 nm         Cyclone III    Arria I      Stratix III
40 nm         Cyclone IV     Arria II     Stratix IV
28 nm         Cyclone V      Arria V      Stratix V

Spartan-6 FPGA Family

FPGA Design process (1)

Specification / Pseudocode

Design and implement a simple unit permitting to speed up encryption with RC5-similar cipher with fixed key set on 8031 microcontroller. Unlike in the experiment 5, this time your unit has to be able to perform an encryption algorithm by itself, executing 32 rounds…..

On-paper hardware design (Block diagram & ASM chart)

VHDL description (Your Source Files)

library IEEE;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity RC5_core is
    port(
        clock, reset, encr_decr: in std_logic;
        data_input: in std_logic_vector(31 downto 0);
        data_output: out std_logic_vector(31 downto 0);
        out_full: in std_logic;
        key_input: in std_logic_vector(31 downto 0);
        key_read: out std_logic
    );
end RC5_core;

Functional simulation

Synthesis
Post-synthesis simulation

FPGA Design process (2)

Implementation
Timing simulation

Results

Configuration
On chip testing

Levels of design description

• Algorithmic level
• Register Transfer Level (level of description most suitable for synthesis)
• Logic (gate) level
• Circuit (transistor) level
• Physical (layout) level

[Figure annotation: Levels supported by HDL]

Register Transfer Level (RTL) Design Description

[Figure: Registers and two Combinational Logic blocks]

Synthesis

George Mason University

Logic Synthesis

VHDL description

architecture MLU_DATAFLOW of MLU is
    signal A1: STD_LOGIC;
    signal B1: STD_LOGIC;
    signal Y1: STD_LOGIC;
    signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
begin
    A1 <= A when (NEG_A='0') else not A;
    B1 <= B when (NEG_B='0') else not B;
    Y  <= Y1 when (NEG_Y='0') else not Y1;

    MUX_0 <= A1 and B1;
    MUX_1 <= A1 or B1;
    MUX_2 <= A1 xor B1;
    MUX_3 <= A1 xnor B1;

    with (L1 & L0) select
        Y1 <= MUX_0 when "00",
              MUX_1 when "01",
              MUX_2 when "10",
              MUX_3 when others;
end MLU_DATAFLOW;

[Figure: Circuit netlist]
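A software reference model can help when checking a dataflow description such as MLU_DATAFLOW in simulation. The Python sketch below mirrors its concurrent assignments for single-bit inputs. The function name mlu and the use of integers 0 and 1 for the STD_LOGIC values '0' and '1' are assumptions; other STD_LOGIC values such as 'X' or 'Z' are not modeled.

```python
def mlu(a, b, neg_a, neg_b, neg_y, l1, l0):
    """Bit-level reference model of the MLU_DATAFLOW architecture."""
    # A1 <= A when (NEG_A='0') else not A;  B1 likewise
    a1 = a if neg_a == 0 else 1 - a
    b1 = b if neg_b == 0 else 1 - b

    # MUX_0..MUX_3: and, or, xor, xnor of the (possibly inverted) inputs
    mux = [a1 & b1, a1 | b1, a1 ^ b1, 1 - (a1 ^ b1)]

    # with (L1 & L0) select ... : L1 is the more significant select bit
    y1 = mux[(l1 << 1) | l0]

    # Y <= Y1 when (NEG_Y='0') else not Y1;
    return y1 if neg_y == 0 else 1 - y1


if __name__ == "__main__":
    # Print the AND row of the truth table (L1 L0 = 00, no inversions).
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, mlu(a, b, 0, 0, 0, 0, 0))
```

Expected values produced this way could be written into a test vector file and compared against the simulator output of the VHDL entity.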

Circuit netlist (RTL view)

Implementation

Mapping

[Figure: netlist mapped onto LUT0, LUT1, LUT2 and flip-flops FF1, FF2]

Placing

[Figure: FPGA with CLB SLICES]

Routing

[Figure: FPGA with Programmable Connections]

Configuration

• Once a design is implemented, you must create a file that the FPGA can understand • This file is called a bit stream: a BIT file (.bit extension)

• The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information

Simulation Tools

FPGA Synthesis Tools

XST

Logic Synthesis

FPGA Implementation

• After synthesis the entire implementation process is performed by FPGA vendor tools

Design Process control from Active-HDL

Xilinx FPGA Tools: ECE Labs

Aldec Active-HDL Design Flow
• simulation: Aldec Active-HDL (IDE)
• synthesis: Xilinx XST or Synopsys Synplify Premier
• implementation: Xilinx ISE Design Suite

Xilinx ISE Design Flow
• simulation: ISim or ModelSim
• synthesis: Xilinx XST or Synopsys Synplify Premier
• implementation: Xilinx ISE Design Suite (IDE)

Xilinx FPGA Tools: Home

Aldec Active-HDL Design Flow
• simulation: Aldec Active-HDL Student Edition (IDE)
• synthesis: Xilinx XST (restricted)
• implementation: Xilinx ISE WebPACK (restricted)

Xilinx ISE Design Flow
• simulation: ISim
• synthesis: Xilinx XST (restricted)
• implementation: Xilinx ISE WebPACK (IDE) (restricted)

Altera FPGA Tools: ECE Labs

Altera Design Flow
• simulation: Mentor Graphics ModelSim-Altera
• synthesis & implementation: Altera Quartus II Subscription Edition

Altera FPGA Tools: Home

Altera Design Flow
• simulation: Mentor Graphics ModelSim-Altera Starter (restricted)
• synthesis & implementation: Altera Quartus II Web Edition (restricted)

Lab Access Rules and Behavior Code

Please refer to

ECE Labs website and in particular to

Access rules & behavior code

ATHENa – Automated Tool for Hardware EvaluatioN

Supported in part by the National Institute of Standards & Technology (NIST)

GMU ATHENa Team

[Figure: team photos. Names: Venkata “Vinny”, Ekawat “Ice”, Marcin, John, Rajesh, Michal. Captions: MS CpE student (two), PhD CpE student, PhD ECE student (two), PhD exchange student from Slovakia]

ATHENa – Automated Tool for Hardware EvaluatioN
http://cryptography.gmu.edu/athena

Benchmarking open-source tool, written in Perl, aimed at an AUTOMATED generation of OPTIMIZED results for MULTIPLE hardware platforms

Currently under development at George Mason University.

Why Athena?

"The Greek goddess Athena was frequently called upon to settle disputes between the gods or various mortals. Athena Goddess of Wisdom was known for her superb logic and intellect. Her decisions were usually well-considered, highly ethical, and seldom motivated by self-interest.”

from "Athena, Greek Goddess of Wisdom and Craftsmanship"

Basic Dataflow of ATHENa

[Figure: numbered dataflow (steps 0 to 8) among Designer, User, ATHENa Server, and the Database of designs. Labels: Interfaces + Testbenches; Download scripts and configuration files; HDL + scripts + configuration files; HDL + FPGA Tools; FPGA Synthesis and Implementation; Result Summary + Database Entries; Database query; Ranking of designs]

[Figure: inputs: configuration files, constraint files, testbench, synthesizable source files. Outputs: database entries (machine-friendly), result summary (user-friendly)]

ATHENa Major Features (1)

• synthesis, implementation, and timing analysis in batch mode
• support for devices and tools of multiple FPGA vendors
• generation of results for multiple families of FPGAs of a given vendor
• automated choice of a best-matching device within a given family

ATHENa Major Features (2)

• automated verification of designs through simulation in batch mode
• support for multi-core processing
• automated extraction and tabulation of results
• several optimization strategies aimed at finding
  – optimum options of tools
  – best target clock frequency
  – best starting point of placement

Generation of Results Facilitated by ATHENa

• batch mode of FPGA tools

vs.

• ease of extraction and tabulation of results
  – Text Reports, Excel, CSV (Comma-Separated Values)
• optimized choice of tool options
  – GMU_optimization_1 strategy

Relative Improvement of Results from Using ATHENa
Virtex 5, 256-bit Variants of Hash Functions

[Figure: bar chart of Area, Thr, and Thr/Area (vertical axis 0 to 2.5). Ratios of results obtained using ATHENa suggested options vs. default options of FPGA tools]

Other (Somewhat) Similar Tools

ExploreAhead (part of PlanAhead)

Design Space Explorer (DSE)

Boldport Flow

EDAx10 Cloud Platform

Distinguishing Features of ATHENa

• Support for multiple tools from multiple vendors

• Optimization strategies aimed at the best possible performance rather than design closure

• Extraction and presentation of results

• Seamless integration with the ATHENa database of results

Benchmarking Goals Facilitated by ATHENa

Comparing multiple:
1. cryptographic algorithms
2. hardware architectures or implementations of the same cryptographic algorithm
3. hardware platforms from the point of view of their suitability for the implementation of a given algorithm (e.g., choice of an FPGA device or FPGA board)
4. tools and languages in terms of quality of results they generate (e.g., Verilog vs. VHDL, Synplicity Synplify Premier vs. Xilinx XST, ISE v. 13.1 vs. ISE v. 12.3)

Project

Cryptography Project

✓ related to the research project conducted by Cryptographic Engineering Research Group (CERG) at GMU

✓ supporting NIST (National Institute of Standards and Technology) in the evaluation of candidates for a new cryptographic standard

Cryptography Project

✓ RTL VHDL implementation of an authenticated cipher based on the
  • algorithm specification
  • reference implementation in C
  • interface specification

✓ a different cipher for each student

✓ two students working on similar ciphers can work closely together and exchange source code

✓ each student graded based on the deliverables for his/her own cipher

Combining Projects from Two Different Courses

• ECE 545 & ECE 646
  – ECE 545 project can be extended into an ECE 646 hardware project by adding additional ciphers, architectures, key sizes, modes of operation, etc.
  – ECE 646 students must write a final report and submit deliverables (one submission per group); ECE 545 students submit only deliverables (separate for each member of a group)
  – Students forming a two-member group in ECE 646 will receive the same score for the ECE 646 project, but possibly different scores for their respective ECE 545 projects
• ECE 545 & ECE 797/798/799/998
  – ECE 545 project can be extended into a Scholarly Paper, Research Project, Master’s Thesis, PhD Thesis

Project Organization

• Project divided into phases

• Deliverables for each phase submitted using Blackboard at selected checkpoints and evaluated by the instructor and/or TA

• Feedback provided to the students on a best-effort basis

• Periodical individual/group meetings devoted to the discussion of each phase deliverables and encountered difficulties

• Final deliverables submitted using Blackboard at the end of the semester

• Final project score based only on the final deliverables

Honor Code Rules

• All students are expected to write and debug their project codes individually or in groups of two
• All homework assignments should be done individually
• Students are encouraged to help and support each other in all problems related to the
  - operation of the CAD tools
  - understanding of an investigated algorithm and existing implementations
  - understanding of the project tasks

ECE 545 Questionnaire

Project Background

Crypto 101

Cryptography is Everywhere

• Buying a book on-line
• Withdrawing cash from ATM
• Teleconferencing over Intranets
• Backing up files on remote server

[Figure: the message “Alice: I love you!” sent to Bob]

Basic Security Services (1)

1. Confidentiality
2. Message integrity
3. Message authentication

[Figures: each service illustrated with Alice, Bob, and Charlie]

Confidentiality

Ciphers

[Figure: Alice and Bob. One Cipher block with key KAB takes IV and Message and produces IV and Ciphertext; the other Cipher block with key KAB takes IV and Ciphertext and produces Message]

KAB - Secret key of Alice and Bob
IV – Initialization Vector

Authentication

Message Authentication Code - MAC

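[Figure: Alice and Bob. MAC Generate with key KAB takes IV and Message and produces Tag; IV, Message, and Tag are transmitted. MAC Verify with key KAB recomputes Tag’ from IV and Message and compares it with the received Tag (Tag’ = Tag), giving valid/invalid]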
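The generate/verify flow of a MAC can be sketched with Python's standard library. This is a minimal illustration, not the scheme used in the course: HMAC-SHA-256 stands in for the unspecified MAC algorithm, the IV is simply prepended to the message before tagging (an assumption that is only unambiguous when the IV has a fixed length), and the key, IV, and function names are hypothetical.

```python
import hashlib
import hmac


def mac_generate(key: bytes, iv: bytes, message: bytes) -> bytes:
    """Sender side: compute Tag over IV and Message with the shared key."""
    # Assumption: fixed-length IV, so iv + message is an unambiguous encoding.
    return hmac.new(key, iv + message, hashlib.sha256).digest()


def mac_verify(key: bytes, iv: bytes, message: bytes, tag: bytes) -> bool:
    """Receiver side: recompute Tag' and compare it with the received Tag."""
    expected_tag = mac_generate(key, iv, message)
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected_tag, tag)


if __name__ == "__main__":
    k_ab = bytes(range(16))      # hypothetical shared secret key
    iv = bytes(12)               # hypothetical 96-bit IV
    tag = mac_generate(k_ab, iv, b"I love you!")
    print("valid" if mac_verify(k_ab, iv, b"I love you!", tag) else "invalid")
    print("valid" if mac_verify(k_ab, iv, b"I love you?", tag) else "invalid")
```

A MAC of this kind provides integrity and authentication only; the message itself still travels in the clear.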

KAB - Secret key of Alice and Bob
IV – Initialization Vector

Confidentiality & Authentication

Authenticated Ciphers

[Figure: Alice and Bob. Authenticated Cipher Encryption with key KAB takes IV and Message and produces IV, Ciphertext, and Tag. Authenticated Cipher Decryption with key KAB takes IV, Ciphertext, and Tag and produces Message and valid/invalid]

KAB - Secret key of Alice and Bob
IV – Initialization Vector

Confidentiality & Authentication

Authenticated Ciphers with Associated Data

[Figure: Alice and Bob. Authenticated Cipher Encryption with key KAB takes IV, AD, and Message and produces IV, AD, Ciphertext, and Tag. Authenticated Cipher Decryption with key KAB takes IV, AD, Ciphertext, and Tag and produces AD, Message, and valid/invalid]

KAB - Secret key of Alice and Bob
IV – Initialization Vector, AD – Associated Data

Cryptographic Transformations Most Often Implemented in Practice

Secret-Key Ciphers (Block Ciphers, Stream Ciphers)
• encryption

Hash Functions
• message & user authentication

Public-Key Cryptosystems
• digital signatures
• key agreement
• key exchange

Hash Function

[Figure: message m of arbitrary length → hash function h → hash value h(m) of fixed length]

Collision Resistance: It is computationally infeasible to find such m and m’ that h(m)=h(m’)

Hash Functions in Digital Signature Schemes

[Figure: Alice and Bob each hold Message and Signature. Alice: Message → Hash function → Hash value → Public key cipher with Alice’s private key → Signature. Bob: Message → Hash function → Hash value 1; Signature → Public key cipher with Alice’s public key → Hash value 2; the two hash values are compared (yes/no)]

Cryptographic Standards Before 1997

[Figure: timeline, 1970 to 2010]

Secret-Key Block Ciphers
• DES – Data Encryption Standard (IBM & NSA) and Triple DES; marked years: 1977, 1999, 2005

Hash Functions
• SHA, SHA-1 – Secure Hash Algorithm, and SHA-2 (NSA); marked years: 1993, 1995, 2003

Why a Contest for a Cryptographic Standard?

• Avoid back-door theories
• Speed up the acceptance of the standard
• Stimulate non-classified research on methods of designing a specific cryptographic transformation
• Focus the effort of a relatively small cryptographic community

Cryptographic Standard Contests

Contest     Period                 Candidates → outcome
AES         IX.1997 to X.2000      15 block ciphers → 1 winner
NESSIE      I.2000 to XII.2002
CRYPTREC
eSTREAM     XI.2004 to V.2008      34 stream ciphers → 4 HW winners + 4 SW winners
SHA-3       XI.2007 to X.2012      51 hash functions → 1 winner
CAESAR      IV.2013 to XII.2017    56 authenticated ciphers → multiple winners

(time axis: 1997 to 2017)

Cryptographic Contests - Evaluation Criteria

Security

Software Efficiency Hardware Efficiency

µProcessors µControllers FPGAs ASICs

Flexibility Simplicity Licensing

Specific Challenges of Evaluations in Cryptographic Contests

• Very wide range of possible applications, and as a result performance and cost targets
  throughput: single Mbits/s to hundreds of Gbits/s
  cost: single cents to thousands of dollars
• Winner in use for the next 20-30 years, implemented using technologies not in existence today
• Large number of candidates
• Limited time for evaluation
• Only one winner and the results are final

Mitigating Circumstances

• Security is a primary criterion
• Performance of competing algorithms tends to vary significantly (sometimes as much as 500 times)
• Only relatively large differences in performance matter (typically at least 20%)
• Multiple groups independently implement the same algorithms (catching mistakes, comparing best results, etc.)
• Second best may be good enough

AES Contest 1997-2000

Rules of the Contest

Each team submits
• Detailed cipher specification
• Justification of design decisions
• Tentative results of cryptanalysis
• Source code in C
• Source code in Java
• Test vectors

AES: Candidate Algorithms

[Figure: world map with region counts 2, 8, 4, 1]

Canada: CAST-256, Deal
USA: Mars, RC6, Twofish, Safer+, HPC
Costa Rica: Frog
Germany: Magenta
Belgium: Rijndael
France: DFC
Israel, UK, Norway: Serpent
Korea: Crypton
Japan: E2
Australia: LOKI97

AES Contest Timeline

June 1998: Round 1, 15 Candidates
  CAST-256, Crypton, Deal, DFC, E2, Frog, HPC, LOKI97, Magenta, Mars, RC6, Rijndael, Safer+, Serpent, Twofish
  Security, Software efficiency

August 1999: Round 2, 5 final candidates
  Mars, RC6, Twofish (USA); Rijndael, Serpent (Europe)
  Security, Software efficiency, Hardware efficiency

October 2000: 1 winner: Rijndael (Belgium)

NIST Report: Security & Simplicity

[Figure: Security (Adequate to High) vs. Simplicity (Complex to Simple). High security: MARS, Serpent, Twofish. Adequate security: Rijndael, RC6]

Efficiency in software: NIST-specified platform
200 MHz Pentium Pro, Borland C++

[Figure: bar chart of Throughput [Mbits/s] (axis 0 to 30) for 128-bit, 192-bit, and 256-bit keys; ciphers: Rijndael, RC6, Twofish, Mars, Serpent]

NIST Report: Software Efficiency
Encryption and Decryption Speed

          32-bit processors          64-bit processors    DSPs
high      RC6                        Rijndael, Twofish    Rijndael, Twofish
medium    Rijndael, Mars, Twofish    Mars, RC6            Mars, RC6
low       Serpent                    Serpent              Serpent

Efficiency in FPGAs: Speed
Xilinx Virtex XCV-1000

[Figure: bar chart of Throughput [Mbit/s] (axis 0 to 500) reported by George Mason University, University of Southern California, and Worcester Polytechnic Institute for Serpent x8, Rijndael, Twofish, Serpent x1, RC6, Mars. Bar labels as extracted, not matched to bars: 431, 444, 414, 353, 294, 177, 173, 149, 143, 104, 112, 102, 88, 62, 61]

Efficiency in ASICs: Speed
MOSIS 0.5µm, NSA Group

Throughput [Mbit/s] (axis 0 to 700)

              128-bit key scheduling    3-in-1 (128, 192, 256 bit) key scheduling
Rijndael      606                       443
Serpent x1    202                       202
Twofish       105                       105
RC6           103                       104
Mars          57                        57

Lessons Learned

Results for ASICs matched the results for FPGAs very well, and both were very different from the software results

[Figure: FPGA chart (GMU+USC, Xilinx Virtex XCV-1000; Serpent x8 and x1) next to ASIC chart (NSA Team, ASIC, 0.5µm MOSIS; Serpent x1)]

Serpent fastest in hardware, slowest in software

Lessons Learned

Hardware results matter!

Final round of the AES Contest, 2000

[Figure: Speed in FPGAs (GMU results) compared with Votes at the AES 3 conference]

Limitations of the AES Evaluation

• Optimization for maximum throughput

• Single high-speed architecture per candidate

• No use of embedded resources of FPGAs (Block RAMs, dedicated multipliers)

• Single FPGA family from a single vendor: Xilinx Virtex

eSTREAM Contest 2004-2008

Hardware Efficiency in FPGAs
Xilinx Spartan 3, GMU SASC 2007

[Figure: Throughput [Mbit/s] (0 to 12000) vs. Area [CLB slices] (0 to 1400) for Trivium, Grain, Mickey-128, and AES-CTR; points labeled x1, x16, x16, x32, x64]

Lessons Learned

Very large differences among 8 leading candidates
• ~30 x in terms of area
• ~500 x in terms of the throughput to area ratio

SHA-3 Contest 2007-2012

NIST SHA-3 Contest - Timeline

Oct. 2008: 51 candidates, Round 1
July 2009: 14 candidates, Round 2
Dec. 2010: 5 candidates, Round 3
Oct. 2012: 1 winner

SHA-3 Round 2

Throughput vs. Area Normalized to Results for SHA-256 and Averaged over 11 FPGA Families – 256-bit variants

Throughput vs. Area Normalized to Results for SHA-512 and Averaged over 11 FPGA Families – 512-bit variants

Performance Metrics

Primary Secondary

1. Throughput 2. Area

3. Throughput / Area 4. Hash Time for Short Messages (up to 1000 bits)

[Table: Thr/Area, Thr, Area, and Short msg. columns for the 256-bit and 512-bit variants of BLAKE, BMW, CubeHash, ECHO, Fugue, Groestl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, Skein; cell entries not recovered]

SHA-3 Round 3

SHA-3 Contest Finalists

New in Round 3

• Multiple Hardware Architectures

• Effect of the Use of Embedded Resources (Block RAMs, DSP units)

• Low-Area Implementations

BLAKE-256 in Virtex 5

x1 – basic iterative architecture
xk – unrolling by a factor of k
/k(h) – horizontal folding by a factor of k
/k(v) – vertical folding by a factor of k
xk-PPLn – unrolling by a factor of k with n pipeline stages

256-bit variants in Virtex 5

512-bit variants in Virtex 5

256-bit variants in 4 high-performance FPGA families

512-bit variants in 4 high-performance FPGA families

FPGA Evaluations

                                AES           eSTREAM                  SHA-3
Multiple FPGA families          No            No                       Yes
Multiple architectures          No            Yes                      Yes
Use of embedded resources       No            No                       Yes
Primary optimization target     Throughput    Area, Throughput/Area    Throughput/Area
Experimental results            No            No                       Yes
Availability of source codes    No            No                       Yes
Specialized tools               No            No                       Yes

CAESAR Contest 2013-2017

Contest Timeline

• 2014.03.15: Deadline for first-round submissions
• 2014.04.15: Deadline for first-round software
• 2015.01.15: Announcement of second-round candidates
• 2015.04.15: Deadline for second-round Verilog/VHDL
• 2015.12.15: Announcement of third-round candidates
• 2016.12.15: Announcement of finalists
• 2017.12.15: Announcement of final portfolio

Cryptographic Standard Contests

Contest     Period                Candidates
AES         IX.1997 to X.2000     15 block ciphers
NESSIE      I.2000 to XII.2002
CRYPTREC
eSTREAM     XI.2004 to V.2008     34 stream ciphers
SHA-3       X.2007 to X.2012      51 hash functions
CAESAR                            56 authenticated ciphers

(time axis: 1997 to 2016)

Difficulties of Hardware Benchmarking

• Growing number of candidates
• Long time necessary to develop and verify RTL (Register Transfer Level) VHDL or Verilog code
• Multiple variants of algorithms (e.g., 3 different key sizes in the AES Contest, 4 different output sizes in the SHA-3 Contest)
• Multiple hardware architectures (based on folding, unrolling, pipelining, etc.)
• Dependence on skills of the designers

Ekawat Homsirikamol a.k.a “Ice”

Working on the PhD Thesis entitled “A New Approach to the Development of Cryptographic Standards Based on the Use of High-Level Synthesis Tools”

Potential Solution: High-Level Synthesis (HLS)

High Level Language (e.g. C, C++, Matlab, Cryptol)

High-Level Synthesis

Hardware Description Language (e.g., VHDL or Verilog)

Short History of High-Level Synthesis

Generation 1 (1980s-early 1990s): research period

Generation 2 (mid 1990s-early 2000s):
• Commercial tools from Synopsys, Cadence, Mentor Graphics, etc.
• Input languages: behavioral HDLs
• Target: ASIC
Outcome: Commercial failure

Generation 3 (from early 2000s):
• Domain oriented commercial tools: in particular for DSP
• Input languages: C, C++, C-like languages (Impulse C, Handel C, etc.), Matlab + Simulink, Bluespec
• Target: FPGA, ASIC, or both
Outcome: First success stories

Cinderella Story

AutoESL Design Technologies, Inc. (25 employees)
Flagship product: AutoPilot, translating C/C++/System C to VHDL or Verilog
• Acquired by the biggest FPGA company, Xilinx Inc., in 2011
• AutoPilot integrated into the primary Xilinx toolset, Vivado, as Vivado HLS, released in 2012

“High-Level Synthesis for the Masses”

Our Hypothesis

• Ranking of candidate algorithms in cryptographic contests in terms of their performance in modern FPGAs will remain the same regardless of whether the HDL implementations are developed manually or generated automatically using High-Level Synthesis tools

• The development time will be reduced by at least an order of magnitude

Traditional Development and Benchmarking Flow

[Figure: flow with boxes Informal Specification, Test Vectors, Manual Design, HDL Code, Functional Verification, Manual Optimization, FPGA Tools, Netlist, Post Place & Route Results, Timing Verification]

Extended Traditional Development and Benchmarking Flow

[Figure: flow with boxes Informal Specification, Test Vectors, Manual Design, HDL Code, Functional Verification, Option Optimization (ATHENa), FPGA Tools, Netlist, Post Place & Route Results, Timing Verification]

HLS-Based Development and Benchmarking Flow

[Figure: flow with boxes Reference Implementation in C, Manual Modifications (pragmas, tweaks), HLS-ready C code, Test Vectors, High-Level Synthesis, HDL Code, Functional Verification, Option Optimization (ATHENa), FPGA Tools, Netlist, Post Place & Route Results, Timing Verification]

Example of Source Code Modifications

for (i = 0; i < 4; i++)
#pragma HLS UNROLL
    for (j = 0; j < 4; j++)
#pragma HLS UNROLL
        b[i][j] = s[i][j];
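For reference, the fragment above can be wrapped into a self-contained function. This is only a sketch: the 4x4 byte state, the constant N, and the function name copy_state are assumptions made for illustration and do not come from the original source code. The pragmas are placed inside the loop bodies, which is where Vivado HLS expects them. An ordinary C compiler ignores them, so the same function can also be compiled and tested in software.

```c
#include <stdint.h>

#define N 4  /* assumed state dimension, matching the 4x4 loops above */

/* Hypothetical helper: copies the state s into b.
 * With UNROLL applied to both loops, an HLS tool is free to implement
 * the 16 assignments in parallel instead of as a sequential loop. */
void copy_state(uint8_t b[N][N], uint8_t s[N][N])
{
    int i, j;

    for (i = 0; i < N; i++) {
#pragma HLS UNROLL
        for (j = 0; j < N; j++) {
#pragma HLS UNROLL
            b[i][j] = s[i][j];
        }
    }
}
```

The software behavior is identical with or without the pragmas; they only guide the hardware that the HLS tool generates.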

Our Test Case

• 5 final SHA-3 candidates
• Most efficient sequential architectures
• GMU RTL VHDL codes developed during SHA-3 contest
• Reference software implementations in C included in the submission packages

Hypotheses:
• Ranking of candidates will remain the same
• Performance ratios RTL/HLS similar across candidates
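The two hypotheses can be phrased as simple checks over per-candidate results. The sketch below is illustrative only: the function names same_ranking and ratio_range are hypothetical, and the arrays are assumed to hold one metric (for example, throughput) per candidate, in the same candidate order for RTL and HLS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothesis I: returns 1 if every pair of candidates is ordered the same
 * way by both result sets, 0 otherwise. A tie in one set must also be a
 * tie in the other. */
int same_ranking(const double *rtl, const double *hls, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int a = (rtl[i] > rtl[j]) - (rtl[i] < rtl[j]);
            int b = (hls[i] > hls[j]) - (hls[i] < hls[j]);
            if (a != b)
                return 0;
        }
    }
    return 1;
}

/* Hypothesis II: computes the smallest and largest RTL/HLS ratio across
 * candidates. A narrow range means the ratios are similar.
 * Requires n > 0 and nonzero HLS values. */
void ratio_range(const double *rtl, const double *hls, size_t n,
                 double *min_ratio, double *max_ratio)
{
    assert(n > 0);
    *min_ratio = *max_ratio = rtl[0] / hls[0];
    for (size_t i = 1; i < n; i++) {
        double r = rtl[i] / hls[i];
        if (r < *min_ratio)
            *min_ratio = r;
        if (r > *max_ratio)
            *max_ratio = r;
    }
}
```

The pairwise comparison is quadratic in the number of candidates, which is not a concern for the handful of algorithms in a contest round.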

Manual RTL vs. HLS-based Results: Altera Stratix III

[Figure: RTL vs. HLS results]

Manual RTL vs. HLS-based Results: Altera Stratix IV

[Figure: RTL vs. HLS results]

Ratios of Major Results RTL/HLS for Altera Stratix III

Ratios of Major Results RTL/HLS for Altera Stratix IV

Lack of Correlation for Xilinx Virtex 6

[Figure: RTL vs. HLS results]

Datapath vs. Control Unit

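[Block diagram: the Datapath receives Data Inputs and produces Data Outputs; the Control Unit receives Control Inputs and produces Control Outputs. The Control Unit sends Control Signals to the Datapath, and the Datapath returns Status Signals to the Control Unit.]

The Datapath determines:
• Area
• Clock Frequency

The Control Unit determines:
• Number of clock cycles

Encountered Problems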

Datapath inferred correctly
• Frequency and area within 30% of manual designs

Control Unit suboptimal
• Difficulty in inferring an overlap between completing the last round and reading the next input block
• One additional clock cycle used for initialization of the state at the beginning of each round
• The formulas for throughput:

RTL: Throughput = Block_size / (#Rounds * T_CLK)
HLS: Throughput = Block_size / ((#Rounds + 2) * T_CLK)
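The two formulas can be written out as a short sketch to see how the two extra clock cycles affect throughput. The function and parameter names below are hypothetical, and the units (bits and seconds) are an assumption chosen for illustration.

```c
#include <assert.h>

/* Throughput in bits per second for a block of block_size_bits processed
 * in `cycles` clock cycles of period t_clk_s (in seconds). */
static double throughput(double block_size_bits, unsigned cycles,
                         double t_clk_s)
{
    assert(cycles > 0 && t_clk_s > 0.0);
    return block_size_bits / (cycles * t_clk_s);
}

/* Manual RTL design: one clock cycle per round. */
double throughput_rtl(double block_size_bits, unsigned rounds,
                      double t_clk_s)
{
    return throughput(block_size_bits, rounds, t_clk_s);
}

/* HLS-generated design: two extra clock cycles per block, as in the
 * HLS formula above. */
double throughput_hls(double block_size_bits, unsigned rounds,
                      double t_clk_s)
{
    return throughput(block_size_bits, rounds + 2, t_clk_s);
}
```

At an equal clock period, the RTL/HLS throughput ratio reduces to (#Rounds + 2) / #Rounds. For a hypothetical 24-round algorithm this is 26/24, or about 1.08, so the penalty is larger for algorithms with fewer rounds.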

Hypothesis Check

Hypothesis I:
• Ranking of candidates in terms of throughput, area, and throughput/area ratio will remain the same
TRUE for Altera Stratix III and Stratix IV
FALSE for Xilinx Virtex 5 and Virtex 6

Hypothesis II:
• Performance ratios RTL/HLS similar across candidates

                   Stratix III    Stratix IV
Frequency          0.99-1.30      0.98-1.19
Area               0.71-1.01      0.68-1.02
Throughput         1.10-1.33      1.09-1.27
Throughput/Area    1.14-1.55      1.17-1.59

Correlation Between Altera FPGA Results and ASICs

[Figure: Stratix III FPGA results vs. ASIC results]

Proposed Interface for Authenticated Ciphers

Cipher Core ports:
• Clock and reset: clk, rst
• Public Data Input (PDI) ports: pdi (w bits), pdi_ready, pdi_read
• Secret Data Input (SDI) ports: sdi (w bits), sdi_ready, sdi_read
• Data Output (DO) ports: do (w bits), do_ready, do_write
• Error notification ports: error, ecode (8 bits)

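Typical External Circuit

[Block diagram: the Cipher Core surrounded by three FIFOs, all sharing clk and rst]
• PDI FIFO: external input epdi, internal output ipdi (both w bits), connected to the pdi port; signals pfifo_full, pfifo_empty, pfifo_write, pfifoin_read
• SDI FIFO: external input esdi, internal output isdi (both w bits), connected to the sdi port; signals sfifo_full, sfifo_empty, sfifo_write, sfifo_read
• DO FIFO: internal input ido, external output edo (both w bits), connected to the do port; signals ofifo_full, ofifo_empty, ofifo_write, ofifo_read
• The Cipher Core additionally drives error and ecode (8 bits)

Format of Secret Data Input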

Sequence of w-bit words:
• instruction
• seg_0_header
• seg_0 = Key

Format of Public Data Input: Encryption

Sequence of w-bit words:
• instruction
• seg_0_header
• seg_0 = IV
• seg_1_header
• seg_1 = AD
• seg_2_header
• seg_2 = Message

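Format of Segment Header

[Diagram: a w-bit header word, bits w-1 down to 0. Field widths in the order shown on the slide: 8, 4, 2, 1, 1, and w-16 bits. Positions without a label are marked "–" on the slide.]

Labeled fields:
• Input ID [0..255]
• Segment Type
• LS
• Segment Length [0..2^(w-16)-1 bytes]

Segment Type:
0000 – Reserved
0001 – Initialization Vector
0010 – Associated Data
0011 – Message
0100 – Ciphertext
0101 – Tag
0110 – Key

LS = 1 if the last segment of input, 0 otherwise

ATHENa Database of Results for Authenticated Ciphers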

• Already available at http://cryptography.gmu.edu/athena

• Similar to the database of results for hash functions, filled with ~1600 results during the SHA-3 contest

• Results can be entered by designers themselves. If you would like to do that, please contact me regarding an account.

• The ATHENa Option Optimization Tool supports automatic generation of results suitable for uploading to the database

Ordered Listing with a Single-Best (Unique) Result per Each Algorithm

Implementation of CAESAR Round 1 Candidates

• 30+ Round 1 CAESAR candidates to be implemented manually in VHDL as a part of ECE 545 in Fall 2014. One cipher per student.

• One PhD student, Ice, will implement the same 30+ ciphers in parallel using HLS.

• Preliminary results in mid-December 2014, about a month before the announcement of Round 2 candidates.

• Deadline for second-round Verilog/VHDL: April 15, 2015.

Most Promising Methodology & Toolset

[Flow diagram; boxes: Reference Implementation in C, Manual Modifications, HLS-ready C code, High-Level Synthesis, HDL Code, Option Optimization (GMU ATHENa), FPGA Tools (Altera Quartus II), Results]

With HLS, Frequency & Throughput decrease and Area increases by no more than 30% compared to manual RTL.

Expected by the end of Fall 2014

• 30+ RTL results, generated by 30+ ECE 545 students
• 30+ HLS results, generated by “Ice” alone

Questions? Suggestions?

ATHENa: http://cryptography.gmu.edu/athena
CERG: http://cryptography.gmu.edu