
ECE 448 Lecture 7

FPGA Devices & FPGA Design Flow

George Mason University

Required reading

• P. Chu, RTL Hardware Design using VHDL

Chapter 1, Introduction to Digital System Design

• Spartan-6 FPGA CLB, User Guide

§ CLB Overview § Slice Description

Two competing implementation approaches

ASIC (Application Specific Integrated Circuit)
• designed all the way from behavioral description to physical layout
• designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry

FPGA (Field Programmable Gate Array)
• no physical layout design; design ends with a bitstream used to configure a device
• bought off the shelf and reconfigured by designers themselves

Which Way to Go?

ASICs
• High performance
• Low power
• Low cost in high volumes

FPGAs
• Off-the-shelf
• Low development cost
• Short time to market
• Reconfigurability

What is an FPGA?

[Figure: FPGA floorplan showing Configurable Logic Blocks, I/O Blocks, and Block RAMs]

Modern FPGA

[Figure: FPGA fabric with columns of RAM blocks, multipliers/DSP units, and logic resources]

(#Logic resources, #Multipliers/DSP units, #RAM_blocks)

Graphics based on The Design Warrior’s Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)

Major FPGA Vendors

SRAM-based FPGAs
• Xilinx, Inc. ~ 51% of the market
• Altera Corp. ~ 34% of the market
  (together ~ 85% of the market)

Flash & antifuse FPGAs
• Actel Corp. (Microsemi SoC Products Group)
• Quick Logic Corp.

Xilinx

• Primary products: FPGAs and the associated CAD software
  - Programmable Logic Devices
  - ISE Alliance and Foundation Series Design Software
• Main headquarters in San Jose, CA
• Fabless* Semiconductor and Software Company
  - UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
  - Seiko Epson (Japan)
  - TSMC (Taiwan)
  - Samsung (Korea)

Xilinx FPGA Families

Technology    Low-cost                   High-performance
220 nm                                   Virtex
180 nm        Spartan II, Spartan IIE
120/150 nm                               Virtex II, Virtex II Pro
90 nm         Spartan 3                  Virtex 4
65 nm                                    Virtex 5
45 nm         Spartan 6
40 nm                                    Virtex 6
28 nm         Artix 7                    Virtex 7

Altera FPGA Families

Technology    Low-cost       Mid-range    High-performance
130 nm        Cyclone                     Stratix
90 nm         Cyclone II                  Stratix II
65 nm         Cyclone III    Arria I      Stratix III
40 nm         Cyclone IV     Arria II     Stratix IV
28 nm         Cyclone V      Arria V      Stratix V

Spartan 6 FPGA Family

CLB Structure

General structure of an FPGA

[Figure: array of programmable logic blocks embedded in programmable interconnect]

Xilinx Spartan 6 CLB

Row & Column Relationship Between CLBs & Slices

Three Different Types of Slices

SLICEX: 50%, SLICEL: 25%, SLICEM: 25%

SLICEX

SLICEL

Xilinx Multipurpose LUT (MLUT)

An MLUT can be configured as:
• 64 x 1 ROM (logic)
• 64 x 1 RAM
• 32-bit SR

4-input LUT (Look-Up Table) in the Basic ROM Mode

• Look-Up tables are primary elements for logic implementation
• Each LUT can implement any function of 4 inputs

Two example functions y of the inputs x1, x2, x3, x4:

x1 x2 x3 x4 | y (example 1) | y (example 2)
 0  0  0  0 |       1       |       0
 0  0  0  1 |       1       |       1
 0  0  1  0 |       1       |       0
 0  0  1  1 |       1       |       0
 0  1  0  0 |       1       |       0
 0  1  0  1 |       1       |       1
 0  1  1  0 |       1       |       0
 0  1  1  1 |       1       |       1
 1  0  0  0 |       1       |       0
 1  0  0  1 |       1       |       1
 1  0  1  0 |       1       |       0
 1  0  1  1 |       1       |       0
 1  1  0  0 |       0       |       1
 1  1  0  1 |       0       |       1
 1  1  1  0 |       0       |       0
 1  1  1  1 |       0       |       0

[Figure: LUT block with inputs x1 to x4 and output y, and equivalent gate-level circuits]
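The ROM-mode idea can be sketched in VHDL as a 16-entry constant indexed by the four inputs. This is a minimal illustration, not code from the lecture: the entity name lut4_rom, the bit ordering (x(3) = x1 down to x(0) = x4), and the INIT constant are assumptions. The chosen INIT value gives y = 0 only when x1 = x2 = 1.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 4-input LUT modeled as a 16 x 1 ROM.
entity lut4_rom is
  port (
    x : in  std_logic_vector(3 downto 0);  -- assumed order: x(3)=x1 ... x(0)=x4
    y : out std_logic
  );
end entity lut4_rom;

architecture rom of lut4_rom is
  -- INIT(i) stores the output for input combination i (0 to 15).
  -- Entries 0..11 are '1', entries 12..15 (x1 = x2 = '1') are '0'.
  constant INIT : std_logic_vector(0 to 15) := "1111111111110000";
begin
  -- The inputs act as the ROM address; the stored bit is the function value.
  y <= INIT(to_integer(unsigned(x)));
end architecture rom;
```

Changing only the INIT constant yields any other function of the same four inputs, which is the sense in which a LUT can implement any 4-input function.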

6-Input LUT of Spartan 6

Reset and Set Configurations

• No set or reset
• Synchronous set
• Synchronous reset
• Asynchronous set (preset)
• Asynchronous reset (clear)
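The difference between the synchronous and asynchronous configurations can be illustrated with two flip-flop processes. This is a sketch under assumed names (ff_reset_styles, rst, q_sync, q_async), not code from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example contrasting two reset configurations.
entity ff_reset_styles is
  port (
    clk, rst, d : in  std_logic;
    q_sync      : out std_logic;
    q_async     : out std_logic
  );
end entity ff_reset_styles;

architecture rtl of ff_reset_styles is
begin
  -- Synchronous reset: rst is examined only on the rising clock edge.
  sync_ff : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        q_sync <= '0';
      else
        q_sync <= d;
      end if;
    end if;
  end process sync_ff;

  -- Asynchronous reset (clear): rst takes effect independently of clk.
  async_ff : process (clk, rst)
  begin
    if rst = '1' then
      q_async <= '0';
    elsif rising_edge(clk) then
      q_async <= d;
    end if;
  end process async_ff;
end architecture rtl;
```

Assigning '1' instead of '0' in the reset branch describes the corresponding set (preset) variants; omitting the branch describes a flip-flop with no set or reset.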

MLUT as a 32-bit Shift Register (SRL32)
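A shift register of this kind is usually described behaviorally and left to the synthesis tool to map. The sketch below uses hypothetical names (shift_reg32, ce, addr); whether a given tool actually maps it to an MLUT in shift-register mode depends on the tool and its settings, so treat that as an assumption to check in the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32-bit shift register with a selectable tap.
entity shift_reg32 is
  port (
    clk  : in  std_logic;
    ce   : in  std_logic;                     -- clock enable
    din  : in  std_logic;                     -- serial input
    addr : in  std_logic_vector(4 downto 0);  -- tap select, 0 to 31
    dout : out std_logic
  );
end entity shift_reg32;

architecture rtl of shift_reg32 is
  signal sr : std_logic_vector(31 downto 0) := (others => '0');
begin
  -- No set/reset is described, only a clock enable.
  process (clk)
  begin
    if rising_edge(clk) then
      if ce = '1' then
        sr <= sr(30 downto 0) & din;  -- shift toward the higher index
      end if;
    end if;
  end process;

  -- sr(0) is the most recent bit, sr(31) the oldest.
  dout <= sr(to_integer(unsigned(addr)));
end architecture rtl;
```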

Fast Carry Logic

• Each CLB contains separate logic and routing for the fast generation of sum & carry signals
  - Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
• Carry logic is independent of normal logic and routing resources

[Figure: carry logic routing running from LSB to MSB]

Accessing Carry Logic

• All major synthesis tools can infer carry logic for arithmetic functions
  - Addition (SUM <= A + B)
  - Subtraction (DIFF <= A - B)
  - Comparators (if A < B then…)
  - Counters (count <= count + 1)
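The four arithmetic patterns listed above can be written out as one small design. This is an illustrative sketch: the entity name carry_examples, the 8-bit width, and the use of ieee.numeric_std are assumptions, and whether carry logic is actually inferred should be confirmed in the tool's reports.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical design collecting the arithmetic patterns in one place.
entity carry_examples is
  port (
    clk    : in  std_logic;
    a, b   : in  std_logic_vector(7 downto 0);
    sum    : out std_logic_vector(7 downto 0);
    diff   : out std_logic_vector(7 downto 0);
    a_lt_b : out std_logic;
    count  : out std_logic_vector(7 downto 0)
  );
end entity carry_examples;

architecture rtl of carry_examples is
  signal cnt : unsigned(7 downto 0) := (others => '0');
begin
  -- Addition and subtraction (results wrap around modulo 256)
  sum  <= std_logic_vector(unsigned(a) + unsigned(b));
  diff <= std_logic_vector(unsigned(a) - unsigned(b));

  -- Comparator
  a_lt_b <= '1' when unsigned(a) < unsigned(b) else '0';

  -- Free-running counter
  process (clk)
  begin
    if rising_edge(clk) then
      cnt <= cnt + 1;
    end if;
  end process;

  count <= std_logic_vector(cnt);
end architecture rtl;
```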

Full-adder

[Figure: full-adder block FA with inputs x, y, cin and outputs cout, s]

$x + y + c_{in} = (c_{out}\ s)_2$

x y cin | cout s
0 0  0  |  0   0
0 0  1  |  0   1
0 1  0  |  0   1
0 1  1  |  1   0
1 0  0  |  0   1
1 0  1  |  1   0
1 1  0  |  1   0
1 1  1  |  1   1

Carry & Control Logic in Xilinx FPGAs

x y | COUT
0 0 | y
0 1 | CIN
1 0 | CIN
1 1 | y
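The COUT selection rule (pass CIN when x and y differ, otherwise pass y) can be sketched as a full adder built around a carry multiplexer. The entity name fa_carry_mux and the internal signal p are hypothetical names introduced for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical full adder using a carry multiplexer.
entity fa_carry_mux is
  port (
    x, y, cin : in  std_logic;
    s, cout   : out std_logic
  );
end entity fa_carry_mux;

architecture dataflow of fa_carry_mux is
  signal p : std_logic;  -- '1' when x and y differ
begin
  p    <= x xor y;
  s    <= p xor cin;                 -- sum = x xor y xor cin
  cout <= cin when p = '1' else y;   -- x /= y: pass cin; x = y: cout equals y
end architecture dataflow;
```

When x = y the carry out does not depend on CIN (it equals y, which is also x), so only the two-input XOR needs general logic while the multiplexer and the final XOR can live in dedicated carry hardware.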

Propagate = x ⊕ y
Generate = y
Sum = Propagate ⊕ CIN = x ⊕ y ⊕ CIN

Carry & Control Logic in Spartan 6 FPGAs

[Figure: inputs x and y feed a LUT, followed by hardwired (fast) logic]

Examples:

Determine the amount of Spartan 6 resources needed to implement a given circuit

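Circuit 1: Top level

[Figure: registers R0 to R15 clocked by clk, a 2-to-1 multiplexer (inputs 0 and 1), signals m, run, and w, and block F with inputs a, b, c, d and output y]

Circuit 1: F – function

[Figure: block F built from a 2-to-4 decoder with enable (inputs w1, w0, En; outputs y3 to y0), a multiplexer with inputs 0 to 7, a rotate-left-by-3 block (<<<3; inputs x3 to x0, outputs y3 to y0), and a full adder (x, y, cin, s, cout); internal signals e, f, g, h]

Circuit 2: Top level

[Figure: registers R0 to R15 clocked by clk, a 2-to-1 multiplexer (inputs 0 and 1), signals run and z, and block F with inputs a, b, c, d, e and output y]

Circuit 2: F – function

[Figure: block F built from a priority encoder (inputs w3 to w0; outputs y1, y0, z), a multiplexer with inputs 0 to 7, a shift-right-by-2 block (>>2; inputs x3 to x0, outputs y3 to y0), and a half adder (x, y, s, cout); internal signals f, g, h, i]

Circuit 3: Top level

Input/Output Blocks (IOBs)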

Basic I/O Block Structure

[Figure: IOB with three flip-flops (D, Q, EC, SR) sharing Clock and Set/Reset:
 - three-state control path: Three-State, Three-State FF Enable
 - output path: Output, Output FF Enable
 - input path: Direct Input, Registered Input, Input FF Enable]

IOB Functionality

• IOB provides interface between the package pins and CLBs
• Each IOB can work as uni- or bi-directional I/O
• Outputs can be forced into High Impedance
• Inputs and outputs can be registered
  - advised for high-performance I/O
• Inputs can be delayed
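A bidirectional pin with registered input, output, and three-state control can be described as below. This is a sketch with hypothetical names (bidir_pin, out_en, pad); whether the registers end up inside the IOB rather than in CLB slices depends on tool options and constraints, which are not shown here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical bidirectional pin with registered paths.
entity bidir_pin is
  port (
    clk      : in    std_logic;
    out_en   : in    std_logic;  -- '1' drives the pin, '0' releases it
    data_out : in    std_logic;  -- value to drive onto the pin
    data_in  : out   std_logic;  -- registered value read from the pin
    pad      : inout std_logic   -- package pin
  );
end entity bidir_pin;

architecture rtl of bidir_pin is
  signal out_reg, en_reg, in_reg : std_logic := '0';
begin
  -- Register the output data, the three-state control, and the input.
  process (clk)
  begin
    if rising_edge(clk) then
      out_reg <= data_out;
      en_reg  <= out_en;
      in_reg  <= pad;
    end if;
  end process;

  -- 'Z' puts the output driver into high impedance.
  pad     <= out_reg when en_reg = '1' else 'Z';
  data_in <= in_reg;
end architecture rtl;
```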

Clock Management

A simple clock tree

[Figure: clock signal from the outside world enters through a special clock pin and pad and is distributed by a clock tree to the flip-flops]

Clock Manager

[Figure: clock signal from the outside world enters through a special clock pin and pad into the clock manager, which generates daughter clocks used to drive internal clock trees or output pins, etc.]

Jitter

[Figure: ideal clock signal (cycles 1 to 4) compared with a real clock signal with jitter; superimposing cycles 1 to 4 shows the spread of the clock edges]

Removing Jitter

[Figure: clock signal with jitter from the outside world enters through a special clock pin and pad into the clock manager, which generates “clean” daughter clocks used to drive internal clock trees or output pins, etc.]

Frequency Synthesis

[Figure: three waveforms]
• 1.0 x original clock frequency
• 2.0 x original clock frequency
• 0.5 x original clock frequency

Phase shifting

[Figure 4-20 of The Design Warrior’s Guide to FPGAs: four waveforms]
• 0° phase shifted
• 90° phase shifted
• 180° phase shifted
• 270° phase shifted

Clock Management Tiles

• DCM: Digital Clock Manager
• PLL: Phase Locked Loop

Spartan-6 Family Attributes

Spartan-6 FPGA Family Members

FPGA device present on the Digilent Nexys 3 board

XC6SLX16-CSG324C

• XC6S: Spartan 6 family
• LX: Logic Optimized
• 16: size
• CSG: package type (chip-scale ball grid array)
• 324: 324 pins
• C: commercial temperature range, 0 °C to 85 °C

FPGA Design Flow

FPGA Design process (1)

Specification / Pseudocode

    Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in experiment 5, this time your unit has to be able to perform the encryption algorithm by itself, executing 32 rounds…..

On-paper hardware design (Block diagram & ASM chart)

VHDL description (Your Source Files)

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_unsigned.all;

    entity RC5_core is
        port(
            clock, reset, encr_decr: in  std_logic;
            data_input:              in  std_logic_vector(31 downto 0);
            data_output:             out std_logic_vector(31 downto 0);
            out_full:                in  std_logic;
            key_input:               in  std_logic_vector(31 downto 0);
            key_read:                out std_logic  -- last port: no trailing semicolon
        );
    end RC5_core;

Functional simulation

Synthesis

Post-synthesis simulation

FPGA Design process (2)

Implementation

Timing simulation

Configuration

On chip testing

Tools used in FPGA Design Flow

• Design: produces functionally verified VHDL code
• Synthesis (Xilinx XST, Synplify Premier): VHDL code to netlist
• Implementation (Xilinx ISE): netlist to bitstream

Synthesis

Synthesis Tools

• Xilinx XST
• Synplify Premier
• … and others

Logic Synthesis

VHDL description (synthesized into a circuit netlist):

    architecture MLU_DATAFLOW of MLU is
        signal A1: STD_LOGIC;
        signal B1: STD_LOGIC;
        signal Y1: STD_LOGIC;
        signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
    begin
        A1 <= A when (NEG_A='0') else not A;
        B1 <= B when (NEG_B='0') else not B;
        Y  <= Y1 when (NEG_Y='0') else not Y1;

        MUX_0 <= A1 and B1;
        MUX_1 <= A1 or B1;
        MUX_2 <= A1 xor B1;
        MUX_3 <= A1 xnor B1;

        with (L1 & L0) select
            Y1 <= MUX_0 when "00",
                  MUX_1 when "01",
                  MUX_2 when "10",
                  MUX_3 when others;
    end MLU_DATAFLOW;

Circuit netlist (RTL view)

Mapping

[Figure: netlist logic mapped onto LUT0 to LUT5 and flip-flops FF1 and FF2]

Xilinx XST Inputs/Outputs

Xilinx XST Inputs

• RTL VHDL and/or Verilog files
• Constraints (XCF): Xilinx constraints file in which you can specify synthesis, timing, and specific implementation constraints that can be propagated to the NGC file.
• Core files: These files can be in either NGC or EDIF format. XST does not modify cores. It uses them to inform area and timing optimization surrounding the cores.

Xilinx XST Outputs

• NGC: Netlist file with constraint information
• NGR: This is a schematic representation of the pre-optimized design shown at the Register Transfer Level (RTL). This representation is in terms of generic symbols, such as adders, multipliers, counters, AND gates, and OR gates, and is generated after the HDL synthesis phase of the synthesis process.
• LOG: This report contains the results from the synthesis run, including area and timing estimation.

RTL view in Synplify Premier

General logic structures can be recognized in RTL view

[Figure: RTL view with a comparator, an incrementer, and a MUX highlighted]

Crossprobing between RTL view and code

Each port, net, or block can be selected by mouse click from the browser or directly from the RTL View. Double-clicking an element shows its source code.

Reverse crossprobing is also possible: if a section of code is selected, the corresponding element of the RTL View is highlighted too.

Technology View in Synplify Pro

Technology view is a mapped RTL view. It can be opened by pressing the toolbar button or by double-clicking the “.srm” file. As in the case of “RTL View”, the toolbar buttons can be used here.

Two additional buttons are enabled:
- show critical path
- open timing analyst

Pay attention: technology view is usually large and presented on a number of sheets.

Technology view is presented using device primitives.

[Figure: technology view window with the ports, nets and blocks browser]

Viewing critical path

Critical path can be viewed by pressing the “show critical path” button.

Delay values are written near each component of the path.

Implementation

• After synthesis, the entire implementation process is performed by FPGA vendor tools

Translation

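[Figure: translation flow]
• Synthesis produces the circuit netlist
• The Constraint Editor or a text editor produces timing constraints in a UCF (User Constraint File)
• Translation takes the netlist and the UCF and produces an NGD (Native Generic Database) file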

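Mapping

[Figure: netlist logic mapped onto LUT0 to LUT5 and flip-flops FF1 and FF2]

Placing

[Figure: mapped logic placed into CLB slices of the FPGA]

Routing

[Figure: placed logic connected through programmable connections in the FPGA]

Configuration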

• Once a design is implemented, you must create a file that the FPGA can understand • This file is called a bit stream: a BIT file (.bit extension)

• The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information

Two main stages of the FPGA Design Flow

Synthesis

RTL Synthesis (technology independent)
- Code analysis
- Derivation of main logic constructions
- Technology independent optimization
- Creation of “RTL View”

Map (technology dependent)
- Mapping of extracted logic structures to device primitives
- Technology dependent optimization
- Application of “synthesis constraints”
- Netlist generation
- Creation of “Technology View”

Implementation

Place & Route (technology dependent)
- Placement of generated netlist onto the device
- Choosing best interconnect structure for the placed design
- Application of “physical constraints”

Configure (technology dependent)
- Bitstream generation
- Burning device

Synthesis Report Example – Resource Utilization (1)

Device utilization summary:
------

Selected Device : 6slx4tqg144-3

Slice Logic Utilization:
  Number of Slice Registers:              53 out of 4800    1%
  Number of Slice LUTs:                  163 out of 2400    6%
    Number used as Logic:                163 out of 2400    6%

Slice Logic Distribution:
  Number of LUT Flip Flop pairs used:    198
    Number with an unused Flip Flop:     145 out of 198    73%
    Number with an unused LUT:            35 out of 198    17%
    Number of fully used LUT-FF pairs:    18 out of 198     9%
    Number of unique control sets:         7

Synthesis Report Example – Resource Utilization (2)

IO Utilization:
  Number of IOs:                          43
  Number of bonded IOBs:                  43 out of 102    42%

Specific Feature Utilization:
  Number of BUFG/BUFGCTRLs:                1 out of 16      6%
  Number of DSP48A1s:                      5 out of 8      62%

Synthesis Report Example – Timing

Timing Summary:
------
Speed Grade: -3

  Minimum period: 6.031ns (Maximum Frequency: 165.817MHz)
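The two numbers in the timing summary are reciprocals of each other. As a quick check using only the values printed in the report:

```latex
\[
f_{\max} = \frac{1}{T_{\min}} = \frac{1}{6.031\ \text{ns}} \approx 165.8\ \text{MHz}
\]
```

The report prints 165.817 MHz; the small difference in the last digits is presumably because the tool computes the frequency from an unrounded period, which is an assumption and not something the report states.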

Map Report Example – Resource Utilization (1)

Design Summary
------
Slice Logic Utilization:
  Number of Slice Registers:                    54 out of 4,800    1%
    Number used as Flip Flops:                  53
    Number used as Latches:                      0
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:                1
  Number of Slice LUTs:                        149 out of 2,400    6%
    Number used as logic:                      148 out of 2,400    6%
      Number using O6 output only:             133
      Number using O5 output only:               0
      Number using O5 and O6:                   15
      Number used as ROM:                        0
    Number used as Memory:                       0 out of 1,200    0%
    Number used exclusively as route-thrus:      1

Map Report Example – Resource Utilization (2)

Slice Logic Distribution:
  Number of occupied Slices:                    58 out of 600      9%
  Number of MUXCYs used:                        32 out of 1,200    2%
  Number of LUT Flip Flop pairs used:          162
    Number with an unused Flip Flop:           109 out of 162     67%
    Number with an unused LUT:                  13 out of 162      8%
    Number of fully used LUT-FF pairs:          40 out of 162     24%
    Number of unique control sets:               7
    Number of slice register sites lost
      to control set restrictions:              35 out of 4,800    1%

IO Utilization:
  Number of bonded IOBs:                        43 out of 102     42%

Map Report Example – Resource Utilization (3)

Specific Feature Utilization:
  Number of RAMB16BWERs:                         0 out of 12       0%
  Number of RAMB8BWERs:                          0 out of 24       0%
  …….
  Number of DSP48A1s:                            5 out of 8       62%
  …….

Post-PAR Static Timing Report

Clock to Setup on destination clock clk_i
---------------+---------+---------+---------+---------+
               | Src:Rise| Src:Fall| Src:Rise| Src:Fall|
Source Clock   |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall|
---------------+---------+---------+---------+---------+
clk_i          |    7.530|         |         |         |
---------------+---------+---------+---------+---------+

PAR Report

Constraint                     | Check | Worst Case Slack | Best Case Achievable | Timing Errors | Timing Score
-------------------------------+-------+------------------+----------------------+---------------+-------------
Autotimespec constraint for    | SETUP | N/A              | 7.530ns              | N/A           | 0
clock net clk_i_BUFGP          | HOLD  | 0.457ns          |                      | 0             | 0

Timing Report (1)

Timing constraint: Default period analysis for net "clk_i_BUFGP"
  3354 paths analyzed, 309 endpoints analyzed, 0 failing endpoints
  0 timing errors detected. (0 setup errors, 0 hold errors)
  Minimum period is 7.530ns.
------
Delay (setup path):   7.530ns (data path - clock path skew + uncertainty)
  Source:             a_register/q_o_4 (FF)
  Destination:        x_reg_inst/q_o_3 (FF)
  Data Path Delay:    7.453ns (Levels of Logic = 2)
  Clock Path Skew:    -0.042ns (0.513 - 0.555)
  Source Clock:       clk_i_BUFGP rising
  Destination Clock:  clk_i_BUFGP rising
  Clock Uncertainty:  0.035ns
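The setup-path delay follows directly from the formula printed in the report (data path minus clock path skew plus uncertainty). A worked check with the report's own numbers; the symbol names are introduced here only for the calculation:

```latex
\[
T_{\text{skew}} = 0.513 - 0.555 = -0.042\ \text{ns}
\]
\[
T_{\text{setup path}} = T_{\text{data}} - T_{\text{skew}} + T_{\text{unc}}
                      = 7.453 - (-0.042) + 0.035 = 7.530\ \text{ns}
\]
```

Because this is the slowest path reported, it matches the minimum period of 7.530 ns.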

Timing Report (2)

Maximum Data Path at Slow Process Corner: a_register/q_o_4 to x_reg_inst/q_o_3

Location          Delay type        Delay(ns)  Physical Resource
                                               Logical Resource(s)
------
SLICE_X4Y36.AQ    Tcko              0.447      a_register/q_o<4>
                                               a_register/q_o_4
DSP48_X0Y3.B4     net (fanout=21)   1.194      a_register/q_o<4>
DSP48_X0Y3.M3     Tdspdo_B_M        3.364      Mmult_mult_unsigned
                                               Mmult_mult_unsigned
SLICE_X8Y39.C4    net (fanout=1)    2.050      mult_unsigned<3>
SLICE_X8Y39.CLK   Tas               0.398      x_reg_inst/q_o<3>
                                               Mmux_x_57
                                               Mmux_x_4_f7_2
                                               Mmux_x_2_f8_2
                                               x_reg_inst/q_o_3
------
Total                               7.453ns (4.209ns logic, 3.244ns route)
                                            (56.5% logic, 43.5% route)
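The totals at the bottom of the data path listing can be reproduced by adding the component delays (Tcko, Tdspdo_B_M, and Tas) and the net delays separately. The subscripted symbols below are shorthand introduced for this check:

```latex
\[
T_{\text{logic}} = 0.447 + 3.364 + 0.398 = 4.209\ \text{ns}, \qquad
T_{\text{route}} = 1.194 + 2.050 = 3.244\ \text{ns}
\]
\[
T_{\text{data}} = T_{\text{logic}} + T_{\text{route}} = 7.453\ \text{ns}, \qquad
\frac{T_{\text{logic}}}{T_{\text{data}}} = \frac{4.209}{7.453} \approx 56.5\%, \qquad
\frac{T_{\text{route}}}{T_{\text{data}}} = \frac{3.244}{7.453} \approx 43.5\%
\]
```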

Timing Report (3)

------
Delay (setup path):   7.484ns (data path - clock path skew + uncertainty)
  Source:             a_register/q_o_7_1 (FF)
  Destination:        x_reg_inst/q_o_3 (FF)
  Data Path Delay:    7.391ns (Levels of Logic = 2)
  Clock Path Skew:    -0.058ns (0.513 - 0.571)
  Source Clock:       clk_i_BUFGP rising
  Destination Clock:  clk_i_BUFGP rising
  Clock Uncertainty:  0.035ns

  Clock Uncertainty:          0.035ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.000ns
    Phase Error (PE):           0.000ns
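Substituting the listed jitter values into the uncertainty formula from the report gives the printed 0.035 ns:

```latex
\[
T_{\text{unc}} = \frac{\sqrt{TSJ^2 + TIJ^2} + DJ}{2} + PE
               = \frac{\sqrt{0.070^2 + 0.000^2} + 0.000}{2} + 0.000
               = 0.035\ \text{ns}
\]
```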

Timing Report (4)

Maximum Data Path at Slow Process Corner: a_register/q_o_7_1 to x_reg_inst/q_o_3

Location          Delay type        Delay(ns)  Physical Resource
                                               Logical Resource(s)
------
SLICE_X2Y33.AQ    Tcko              0.447      a_register/q_o_7_2
                                               a_register/q_o_7_1
DSP48_X0Y3.B7     net (fanout=13)   1.132      a_register/q_o_7_1
DSP48_X0Y3.M3     Tdspdo_B_M        3.364      Mmult_mult_unsigned
                                               Mmult_mult_unsigned
SLICE_X8Y39.C4    net (fanout=1)    2.050      mult_unsigned<3>
SLICE_X8Y39.CLK   Tas               0.398      x_reg_inst/q_o<3>
                                               Mmux_x_57
                                               Mmux_x_4_f7_2
                                               Mmux_x_2_f8_2
                                               x_reg_inst/q_o_3
------
Total                               7.391ns (4.209ns logic, 3.182ns route)
                                            (56.9% logic, 43.1% route)

Xilinx FPGA Memories

Recommended reading

• Spartan-6 FPGA Block RAM Resources: User Guide (Google search: UG383)
• Spartan-6 FPGA Configurable Logic Block: User Guide (Google search: UG384)
• Xilinx FPGA Embedded Memory Advantages: White Paper (Google search: WP360)
• ISE In-Depth Tutorial, Section: Creating a CORE Generator Tool Module (Google search: ISE In-Depth Tutorial)

Memory Types

• Memory: RAM or ROM
• Memory: single port or dual port
• Memory: with asynchronous read or with synchronous read

Memory Types specific to Xilinx FPGAs

• Memory: distributed (MLUT-based) or block RAM-based (BRAM-based)
• Memory: inferred or instantiated
  - instantiated manually or using CORE Generator

CORE Generator

FPGA Distributed Memory

Location of Distributed RAM

[Figure: FPGA fabric with columns of logic resources (CLB slices), RAM blocks, and multipliers/DSP units; the logic resources (CLB slices) are labeled]

(#Logic resources, #Multipliers/DSP units, #RAM_blocks)

Three Different Types of Slices

SLICEX: 50%, SLICEL: 25%, SLICEM: 25%

Spartan-6 Multipurpose LUT (MLUT)

An MLUT can be configured as:
• 64 x 1 ROM (logic)
• 64 x 1 RAM
• 32-bit SR

Single-port 64 x 1-bit RAM
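A single-port 64 x 1-bit RAM with asynchronous read can be described behaviorally as follows. The entity name ram64x1s_inferred and the port names are hypothetical, and the expectation that a synthesis tool maps this description to distributed (MLUT-based) RAM is an assumption that depends on the tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical inferred single-port 64 x 1-bit RAM.
entity ram64x1s_inferred is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;                     -- write enable
    addr : in  std_logic_vector(5 downto 0);  -- 64 locations
    din  : in  std_logic;
    dout : out std_logic
  );
end entity ram64x1s_inferred;

architecture rtl of ram64x1s_inferred is
  type ram_t is array (0 to 63) of std_logic;
  signal ram : ram_t := (others => '0');
begin
  -- Synchronous write
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(addr))) <= din;
      end if;
    end if;
  end process;

  -- Asynchronous read: dout follows addr without waiting for a clock edge.
  dout <= ram(to_integer(unsigned(addr)));
end architecture rtl;
```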

Memories Built of Neighboring MLUTs

Memories built of 2 MLUTs:
• Single-port 128 x 1-bit RAM: RAM128x1S
• Dual-port 64 x 1-bit RAM: RAM64x1D

Memories built of 4 MLUTs:
• Single-port 256 x 1-bit RAM: RAM256x1S
• Dual-port 128 x 1-bit RAM: RAM128x1D
• Quad-port 64 x 1-bit RAM: RAM64x1Q
• Simple-dual-port 64 x 3-bit RAM: RAM64x3SDP (one address for read, one address for write)

Dual-port 64 x 1 RAM

• Dual-port 64 x 1-bit RAM: 64x1D
• Single-port 128 x 1-bit RAM: 128x1S

ECE 448 – FPGA and ASIC Design with VHDL

Total Size of Distributed RAM

FPGA Block RAM

Location of Block RAMs

[Figure: FPGA fabric with columns of logic resources (CLB slices), RAM blocks, and multipliers/DSP units; the RAM blocks are labeled]

(#Logic resources, #Multipliers/DSP units, #RAM_blocks)

Spartan-6 Block RAM Amounts

Block RAM can have various configurations (port aspect ratios)

Configuration     Address range
16k x 1           0 to 16,383
8k x 2            0 to 8,191
4k x 4            0 to 4,095
2k x (8+1)        0 to 2,047
1024 x (16+2)     0 to 1,023
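Multiplying depth by width for each configuration shows that they are different views of the same storage. The narrow configurations use 16 Kbit; the wide ones add one extra bit per byte (the "+1" and "+2"), giving 18 Kbit:

```latex
\[
16\text{k} \times 1 = 8\text{k} \times 2 = 4\text{k} \times 4 = 16{,}384\ \text{bits}
\]
\[
2\text{k} \times (8+1) = 2048 \times 9 = 18{,}432\ \text{bits}, \qquad
1024 \times (16+2) = 1024 \times 18 = 18{,}432\ \text{bits}
\]
```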

Block RAM Port Aspect Ratios

Block RAM Interface

Block RAM Ports

Block RAM with synchronous read in Read-First Mode
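Read-first behavior means that a write to an address returns the old content of that address on the output in the same clock cycle. A behavioral sketch, assuming a 1024 x 18 organization and hypothetical names (bram_read_first, en, we); mapping to an actual block RAM is up to the synthesis tool:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 1024 x 18 RAM with synchronous read in read-first mode.
entity bram_read_first is
  port (
    clk  : in  std_logic;
    en   : in  std_logic;                      -- port enable
    we   : in  std_logic;                      -- write enable
    addr : in  std_logic_vector(9 downto 0);   -- 1024 locations
    din  : in  std_logic_vector(17 downto 0);
    dout : out std_logic_vector(17 downto 0)
  );
end entity bram_read_first;

architecture rtl of bram_read_first is
  type ram_t is array (0 to 1023) of std_logic_vector(17 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if en = '1' then
        -- Signal assignments take effect after the process suspends,
        -- so dout receives the content stored before this write.
        dout <= ram(to_integer(unsigned(addr)));
        if we = '1' then
          ram(to_integer(unsigned(addr))) <= din;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

The registered output is what makes the read synchronous, in contrast to the asynchronous read typical of distributed RAM.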

Features of Block RAMs in Spartan-6 FPGAs