L2: FPGA HARDWARE
18-545: ADVANCED DIGITAL DESIGN PROJECT FALL 2011 BILL NACE
Administrivia
Team assignments are done
Lab 1 is due Monday
Project Proposals happen on Monday
Reading Assignment #1 due today
  13/14 students submitted it to Blackboard on time
  David's attempt didn't get saved (?)
  Submit a PDF, don't fill in the web form
Game Plan
Overview
Why use FPGAs?
FPGA Internals
Caveat: I will use Xilinx-specific terminology since that’s the FPGA company you will be using. Beware that other companies use different terms.
FPGA Overview
Field Programmable Gate Array
  Array of generic logic gates
    Gates where logic function can be programmed
    Programmable interconnection between gates
  Fielded systems can be programmed
    i.e. post-fabrication
Xilinx Virtex-5 FPGA
Design Platform
Virtex-5 Development System
  Xilinx XC5VLX110T FPGA
    17280 slices of CLB goodness
  256MB DDR2 (SODIMM)
  DVI Video port
    VGA port is for input
  10/100/1000 Ethernet port
  Audio Codec (AC97)
  USB2 port
  16x2 LCD, RS-232
  Compact Flash card slot
  Expansion connectors
Game Plan
Overview
Why use FPGAs?
FPGA Internals
Why use FPGAs?
System designers have a Goldilocks problem
  Off-the-shelf parts are not efficient enough
  Custom ASICs cost too much
  Need a “just right” solution
ASIC Design
Difficult to design
  Large and complex
  Issues in advanced processes
    Interconnect delay
    Device leakage
    Power density constraints
Expensive to design / fabricate
  Mask set costs
  Non-recurring engineering costs
Need a high-volume, high-profit market to justify costs!
Efficiency View
An efficiency gap exists between ASICs and CPUs
[Figure: energy efficiency (MOPS/mW) and area efficiency (MOPS/mm2) on a log scale from 0.01 to 10000, comparing microprocessors, DSPs, and ASICs]
N. Zhang, et al., “The Cost of Flexibility in Systems on a Chip Design for Signal Processing Applications”
[Figure: Development Cost + Device Cost versus Total Units, with an FPGA Trend line and an ASIC Trend line. One side of the crossover point is labeled "FPGA solution has a lower total cost", the other "ASIC solution has a lower total cost".]
Decreasing FPGA unit cost is pushing the crossover point to the right
Additional ASIC costs:
•Increasing NRE charge
•58% are late to market (impacts total volumes shipped)
•ASIC cycle longer than some market windows
•Over 50% need to be respun
(Courtesy Xilinx, Inc.)
Economic View
FPGAs: High package costs ($300+), low NRE costs
ASICs: Low package costs (pennies), high NRE costs ($600K+)
FPGA Advantages
Faster than CPU solution
Lower power than CPU solution (usually)
Low NRE costs
Off-the-shelf part designed by FPGA vendor
You are sharing NRE costs with all other customers
Fast design time
Low time-to-market
Fast re-design / re-fabrication time
Easy to correct an error, add functionality, or respond to a spec change
Can even change product after deployment
FPGA Disadvantages
High per-part costs
  Good for low to middle volume applications
  High volume applications should consider ASICs
  Perhaps use FPGA for prototyping
Lower performance than ASIC
Higher power than ASIC
More specialized design skills than CPU
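The volume tradeoff between the two cost structures can be made concrete with a little arithmetic. The sketch below uses the round numbers from the Economic View slide ($300 per FPGA, $600K ASIC NRE); the FPGA NRE of zero and the ASIC unit cost of $0.05 are assumptions chosen only to stand in for "low" and "pennies", and the function names are made up for this illustration.

```python
def total_cost(nre, unit_cost, units):
    """Total cost = one-time development (NRE) cost + per-device cost * volume."""
    return nre + unit_cost * units


def crossover_units(fpga_nre, fpga_unit, asic_nre, asic_unit):
    """Volume at which the FPGA and ASIC total costs are equal."""
    return (asic_nre - fpga_nre) / (fpga_unit - asic_unit)


# Illustrative figures only. The $300 and $600K values come from the slide;
# the zero FPGA NRE and the $0.05 ASIC unit cost are assumptions.
FPGA_NRE, FPGA_UNIT = 0.0, 300.0
ASIC_NRE, ASIC_UNIT = 600_000.0, 0.05

breakeven = crossover_units(FPGA_NRE, FPGA_UNIT, ASIC_NRE, ASIC_UNIT)
print(f"Crossover at roughly {breakeven:.0f} units")

for units in (1_000, 10_000):
    fpga = total_cost(FPGA_NRE, FPGA_UNIT, units)
    asic = total_cost(ASIC_NRE, ASIC_UNIT, units)
    cheaper = "FPGA" if fpga < asic else "ASIC"
    print(f"{units} units: FPGA ${fpga:,.0f}, ASIC ${asic:,.0f}, cheaper: {cheaper}")
```

With these assumed numbers the crossover lands near 2000 units. A lower FPGA unit cost or a higher ASIC NRE moves it to larger volumes.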
Example uses of FPGAs
Rapid Prototyping
  Emulation of ASIC design
  Design exploration
  Verification
Shipping product
  Networking
  Military
Reconfigurable Computing
Game Plan
Overview
Why use FPGAs?
FPGA Internals
FPGA Breakdown
3 Basic components
  Configurable Logic Blocks
  General purpose interconnect
  I/O Blocks
Advanced components
  Hard macros
    CPUs
    Block RAM
    Multipliers
  Specialized components
Virtex-II Pro
Xilinx XC3020
[Figure: XC3020 layout showing I/O blocks (64 total), CLBs (64 total), general purpose interconnect, and switch matrices. IOBs have direct interconnect access to adjacent CLBs. (Courtesy Xilinx, Inc.)]
Routing
[Figure: zoomed-in view of the CLB matrix of the FPGA, followed by an even more zoomed-in view. Specific ingress and egress connection options (black dots) are available. (Courtesy Xilinx, Inc.)]
Routing: The Switch Matrix
Each matrix has 5 connections per side
Only certain connection patterns are possible
(Courtesy Xilinx, Inc.)
Hierarchical Routing
Spartan-2 and more recent families have connections of different lengths between switch matrices
  Local roads, limited access roads, interstate highways
  Routes across entire chip don’t burn lots of short connections
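The road analogy can be illustrated with a toy count of how many wire segments a route consumes. This is only a sketch: the segment lengths (1, 2, and 6 CLBs) are hypothetical values picked for the example rather than figures from the slides, and a real router must also respect which connection patterns each switch matrix allows.

```python
def wires_used(distance, lengths):
    """Greedily cover `distance` CLBs using the longest available wires first.

    Returns the number of wire segments consumed by the route.
    """
    count = 0
    for length in sorted(lengths, reverse=True):
        n, distance = divmod(distance, length)
        count += n
    if distance:
        raise ValueError("route cannot be completed with these wire lengths")
    return count


SHORT_ONLY = (1,)          # only neighbor-to-neighbor wires ("local roads")
HIERARCHICAL = (1, 2, 6)   # hypothetical mix of short and longer wires

# Crossing 24 CLBs: 24 short segments versus 4 long ones.
print(wires_used(24, SHORT_ONLY), wires_used(24, HIERARCHICAL))
```

Every segment boundary is also a pass through a switch matrix, so fewer segments generally means less routing delay as well as fewer wires consumed.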
Detailed Routing (Spartan 2)
Configurable Logic Blocks
CLBs get more and more stuff crammed in them over time
XC3K family had
  LUT (5 variable inputs, 2 FF values, 2 outputs), 2 FFs, clock enable, FF reset (direct / global) and 9 muxes
  ~51 bits of configuration SRAM per CLB
(Courtesy Xilinx, Inc.)
Look Up Table
. Direct implementation of logic truth table using a memory
  ♦ Data inputs = memory address
Another View of LUTs
. Can view LUT as 16:1 mux
. Inputs are mux select
. Config sets mux data inputs
. Logically same as 16x1 memory
. Can compact logic if you can route inputs to mux data inputs
5-Input Functions
. Slice can implement any function of 5 inputs
. Logic function is partitioned between two LUTs
. F5 multiplexer selects LUT
5-Input Functions (cont.)
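A minimal software model may help tie the LUT views together. It treats a 4-input LUT as a 16-entry memory addressed by its data inputs, then builds a 5-input function by partitioning it between two such LUTs on the fifth input and selecting between them with a 2:1 mux, as the F5 multiplexer does. All function names here are invented for the sketch; this models the logic only, not the actual slice wiring.

```python
def make_lut4(func):
    """Compute the 16 configuration bits of a 4-input LUT from a Boolean function."""
    return [
        func((addr >> 3) & 1, (addr >> 2) & 1, (addr >> 1) & 1, addr & 1) & 1
        for addr in range(16)
    ]


def lut4(config, i3, i2, i1, i0):
    """Read the LUT: the data inputs form the memory address."""
    return config[(i3 << 3) | (i2 << 2) | (i1 << 1) | i0]


def make_func5(func):
    """Partition a 5-input function between two LUTs, one per value of `sel`."""
    lut_sel0 = make_lut4(lambda a, b, c, d: func(0, a, b, c, d))
    lut_sel1 = make_lut4(lambda a, b, c, d: func(1, a, b, c, d))
    return lut_sel0, lut_sel1


def eval_func5(luts, sel, a, b, c, d):
    """Both LUTs see the same four inputs; the fifth input drives the mux select."""
    lut_sel0, lut_sel1 = luts
    out0 = lut4(lut_sel0, a, b, c, d)
    out1 = lut4(lut_sel1, a, b, c, d)
    return out1 if sel else out0  # the role played by the F5 multiplexer


def parity5(sel, a, b, c, d):
    """Example 5-input function: odd parity."""
    return sel ^ a ^ b ^ c ^ d


parity_luts = make_func5(parity5)
print(eval_func5(parity_luts, 1, 0, 1, 1, 0))
```

Because the two LUT configurations are just the two halves of the 32-row truth table, the same construction works for any 5-input function, not only parity.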
[Figure: two LUTs feeding a multiplexer that drives OUT]
Fast Carry Logic
[Figure: slice diagram. Inputs G1-G4 and F1-F4 each feed a look-up table followed by carry and control logic and a D flip-flop. Other labeled signals: CIN, COUT, CLK, CE, SR, BY, F5IN, and outputs X, XB, Y, YB.]
Look Up Table Additional Functionality
. Can be configured as:
  ♦ Shift register (16 regs)
  ♦ Small memory (16 bits)
    • “Distributed RAM”
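Both modes reuse the same sixteen storage cells that normally hold the truth table. The sketch below is a simplified behavioral model, assuming a made-up class name and ignoring clocking and write-enable details: in memory mode a write stores one bit at the addressed cell, and in shift-register mode each clock pushes a new bit in while the four LUT inputs pick which tap to read.

```python
class Lut16:
    """Sixteen 1-bit storage cells addressed by the four LUT inputs."""

    def __init__(self, init=0):
        # Bit i of `init` becomes the initial content of cell i.
        self.cells = [(init >> i) & 1 for i in range(16)]

    def read(self, addr):
        """Ordinary LUT read, used by every mode."""
        return self.cells[addr & 0xF]

    def write(self, addr, bit):
        """Memory mode: store one bit at the addressed cell."""
        self.cells[addr & 0xF] = bit & 1

    def shift_in(self, bit):
        """Shift-register mode: the new bit enters cell 0, the oldest bit falls off."""
        self.cells = [bit & 1] + self.cells[:-1]


# Memory mode: a 16x1 RAM.
ram = Lut16()
ram.write(5, 1)

# Shift-register mode: read(n) returns the bit shifted in n+1 clocks ago.
srl = Lut16()
for bit in [1, 0, 1, 1]:
    srl.shift_in(bit)
print([srl.read(tap) for tap in range(4)])
```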
. Some other FPGAs use muxes instead of memories to implement the core combinational logic
Spartan-2 CLB
Spartan-2 has 2 LUTs (4 inputs each) feeding a 3rd LUT, 2 FFs (with Preset/Reset, Enable, posedge or negedge clocks) and 16 muxes
12 inputs (plus clock), 4 outputs
(Courtesy Xilinx, Inc.)
Spartan-3
CLBs are composed of 4 slices
  Organized as 2 pairs, one of which is optimized for memory access
Each slice has 2 FFs and 2 LUTs
(Courtesy Xilinx, Inc.)
FPGA Families extend Architecture
❏ Devices are built, with more capability, but around the same basic architecture
❏ Some additional capabilities
  ◆ Low voltage versions
  ◆ Faster clock rates
  ◆ Different packaging options
(Courtesy Xilinx, Inc.)
The need for more stuff
❏ CompEs cannot design on logic, routing, I/O alone
❏ Extreme case from early 90s
  ◆ 16 port ATM switch, designed on a single board
      FPGAs (XC3Ks)
      FIFO memory chips
  ◆ Design is limited by I/O to memory chips, so bring them on-chip
Other “Stuff”
❏ Clock managers
  ◆ Global clock buffering, distribution
  ◆ DCM: eliminate skew, phase shifts, multiply or divide clock
❏ Memory
  ◆ Block RAM
  ◆ Distributed RAM (repurposed LUTs)
❏ Shift Registers
❏ Dedicated Multiplexers
❏ Carry Look-Ahead Generators
❏ I/O Blocks
  ◆ SelectIO supports 18 standards (single-ended, differential, various voltage levels, ...)
❏ Embedded Multipliers
Hard Macros
. Hard macros
  ♦ Block RAMs
  ♦ Multipliers
  ♦ CPUs
. Firm macros
  ♦ HDL
  ♦ Placed
  ♦ Routed
. Soft macros
  ♦ HDL
Block RAMs
. Distributed RAM
  ♦ Use LUTs as memories
  ♦ Low density
  ♦ Poor performance
. Block RAM
  ♦ Large-ish dedicated memory blocks
    • Xilinx BRAMs = 18Kb
  ♦ Some configurability
    • Dual-port
    • Data width / depth
    • FIFO, CAM, etc.
Multipliers
18x18 signed 2’s-complement multiplier
. Two 18b inputs
. One 36b output
. 18b enough for many DSP applications
. Can gang multiple units together for wider data
. Faster and lower power than multiplier from CLBs
CPUs – PowerPC 405
XC2VP30 has 2 Embedded PowerPC 405 cores
. Embedded L1 I and D caches
. No FPU
CPU Connectivity: PLB and OPB
IBM CoreConnect
. Processor Local Bus (PLB)
. On-Chip Peripheral Bus (OPB)
. Device Control Register bus (DCR)
CPU Connectivity: PLB and OPB (cont.)
CPU Connectivity: OCM
On-Chip Memory controller
. CPU block RAM
. 2 OCMs (I and D)
. Direct, fast interface
. Can use dual-port BRAMs for producer-consumer link to FPGA fabric
CPU Links
A lot more details on the embedded CPU
. http://www.xilinx.com/bvdocs/userguides/ppc_ref_guide.pdf
. http://direct.xilinx.com/bvdocs/userguides/ug018.pdf
. http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/CoreConnect_Bus_Architecture
Configuration Storage
Lots of configuration bits
  LUTs, routing, I/O configuration
  Xilinx XC2VP30 has >11Mb
Configuration storage technologies
  Volatile
    SRAM cells
  Non-volatile
    FLASH, EEPROM
    Anti-fuse
[Figures: 6T SRAM cell with word line WL and bit lines bit and bit_b; Actel anti-fuse]
Configuration
How to load (scan) configuration bits (bitstream)
  Connect all configuration registers into single long shift register
  Serially clock in configuration bits
  Most designs use standard scan interface (JTAG) developed for test
Bitstream source
  Non-volatile memory
    On-board FLASH, EEPROM, serial memory
    External media (CF card)
  Attached workstation
Can encrypt bitstream to conceal configuration
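The scan-style loading described above, with every configuration register chained into one long shift register, can be sketched in a few lines. This is a toy model with an invented function name and an 8-bit chain; real devices frame the bitstream and add commands and checks that are not modeled here.

```python
def load_bitstream(chain_length, bitstream):
    """Serially clock bits into a chain of 1-bit configuration registers.

    On each clock, every register takes the value of its upstream neighbor
    and the first register takes the next bit from the bitstream.
    """
    chain = [0] * chain_length
    for bit in bitstream:
        chain = [bit & 1] + chain[:-1]
    return chain


# A toy 8-bit configuration chain.
bitstream = [1, 0, 1, 1, 0, 0, 1, 0]
chain = load_bitstream(len(bitstream), bitstream)
print(chain)
```

The first bit clocked in ends up in the register farthest from the input, so the bitstream must be ordered with the far end of the chain first.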
Major FPGA Vendors
SRAM-based FPGAs
  Xilinx
  Altera
    (Xilinx and Altera together share over 60% of the market)
  Atmel
  Lattice Semiconductor
Flash & antifuse FPGAs
  Actel Corp.
  QuickLogic Corp.
  Lattice Semiconductor
  Xilinx (system-in-a-package solution)
FPGA Information Links
Virtex-II Pro Datasheet
  http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Virtex-II Pro User’s Guide
  www.xilinx.com/bvdocs/userguides/ug012.pdf
XUP Board Guide
  www.xilinx.com/univ/XUPV2P/Documentation/XUPV2P_User_Guide.pdf