Computer Design
Computer Laboratory
Part Ib in Computer Science
Copyright c Simon Moore, University of Cambridge, Computer Laboratory, 2011 Contents:
• 17 lectures of the Computer Design course
• Paper on Chuck Thacker’s Tiny Computer 3
• White paper on networks-on-chip for FPGAs
• Exercises for supervisions and personal study
• SystemVerilog “cheat” sheet which briefly covers much of what was learnt using the Cam- bridge SystemVerilog web tutor
Historic note: this course has had many guises. Until last year the course was in two parts: ECAD and Computer Design. Past paper questions on ECAD are still mostly relevant though there has been a steady move toward SystemVerilog rather than Verilog 2001. This year the ECAD+Arch labs have been completely rewritten for the new tPad hardware and making use of Chuck Thacker’s Tiny Computer 3 (TTC). The TTC is being used instead of our Tiger MIPS core since it is simple enough to be easily altered/upgraded as a lab. exercise. As a conse- quence some of the MIPS material has been removed in favour of a broader introduction to RISC processors; and the Manchester Baby Machine in SystemVerilog has been replaced by an introduction to the TTC and its SystemVerilog implementation. A new lecture on system’s design on FPGA using Altera’s Qsys tool has been added. Lecture 1— Introduction and Motivation 1
Computer Design — Lecture 1 Revising State Machinesy 5
Introduction and Motivationy 1 Then do Boolean minimisation, e.g. using K-maps Overview of the course How to build a computer:
4 × lectures introducing Electronic Computer Aided Design (ECAD) and the Verilog/SystemVerilog language
Cambridge SystemVerilog web tutor (home work + 1 lab. session equivalent to 4 lectures worth of material) ′ ′ 7 × ECAD+Architecture Labs n0 = n0 ⊕ go n1 = go n1 + n0 n1 + n0 n1 go n1′ = go (n0 ⊕ n1) + go n1 14 × lectures on computer architecture and implementation Contents of this lecture Revising State Machinesy 6 Aims and Objectives
Implementation technologies And the final circuit diagram
Technology trends
The hardware design gap
Recommended books and web materialy 2
Recommended book (both ECAD and computer architecture):
D.M. Harris & S.L. Harris. Digital Design and Computer Architecture, Morgan Kaufmann 2007
General Computer Architecture:
J.L. Hennessey and D.A. Patterson, “Computer Architecture — A Quantitative Approach”, Morgan Kaufmann, 2002 (1996 edition also good, 1990 edition is only slightly out of date, 2006 edition good but compacted down) D.A. Patterson and J.L. Hennessey, “Computer Organization & now implement... Design — The Hardware/Software Interface”, Morgan Kaufmann, 1998 (1994 version is also good, 2007 version now available) Revising PLAsy 7 Web:
http://www.cl.cam.ac.uk/Teaching/current/CompDesign/ PLA = programmable logic array advantages — cheap (cost per chip) and simple to use Revising State Machinesy 3 disadvantages — medium to low density integrated devices (i.e. not many gates per chip) so cost per gate is high Start with a state transition graph
i.e. it is a 2-bit counter with enable (go) input Field Programmable Gate Arraysy 8 Revising State Machinesy 4 a sea of logic elements made from small SRAM blocks
Then produce the state transition table e.g. a 16×1 block of SRAM to provide any boolean function of 4 current next variables input state state go n1 n0 n1’ n0’ often a D-latch per logic element 0 0 0 0 0 programmable interconnect 0 0 1 0 1 0 1 0 1 0 program bits enable/disable tristate buffers and transmission gates 0 1 1 1 1 1 0 0 0 1 advantages: rapid prototyping, cost effective in low to medium volume 1 0 1 1 0 disadvantages: 15× to 25× bigger and slower than full custom CMOS 1 1 0 1 1 ASICs 1 1 1 0 0 2 Computer Design
CMOS ASICsy 9 Zooming in on CMOS chip (2)y 12
CMOS = complementary metal oxide semiconductor
ASIC = application specific integrated circuit
full control over the chip design
some blocks might be full custom
i.e. laid out by hand
other blocks might be standard cell
cells (gates, flip-flops, etc) are designed by hand but with inputs and outputs in a standard configuration designs are synthesised to a collection of standard cells...... then a place & route tool arranges the cells and wires them up
blocks may also be generated by a macro
typically used to produce regular structures which need to be packed together carefully, e.g. a memory block
this material is covered in far more detail in the Part II VLSI course
Example CMOS chipy 10 Trends — Moore’s law & transistor densityy 13
In 1965 Gordon Moore (Intel) identified that transistor density was increasing exponentially, doubling every 18 to 24 months
Predicted > 65,000 transistors by 1975!
Moore’s law more recentlyy 14
1,000,000,000
100,000,000 Pentium 4 very small and simple test chip from mid 1990s Pentium III 10,000,000 Pentium II
T Pentium Pro r
a Pentium
see interactive version at: n Intel486
s 1,000,000
i
s t http://www.cl.cam.ac.uk/users/swm11/testchip/ o Intel386
r 80286 s 100,000
8086
Zooming in on CMOS chip (1)y 11 10,000 8080 8008 4004 1,000
1970 1975 1980 1985 1990 1995 2000 Year Lecture 1— Introduction and Motivation 3
Moore’s law and clock speedy 15 Moore’s law and transistor costy 18
10,000
1,000 4004
8008
Clock Speed (MHz) Speed Clock 8080
100 8086
80286
Intel386
10 Intel486
Pentium
Pentium Pro/II/III
1 Pentium 4
1970 1975 1980 1985 1990 1995 2000 2005 Amazing — faster, lower power, and cheaper at an exponential rate! Year
Moore’s Law in Perspective 1y 19 ITRS — gate lengthy 16
ITRS = International Technology Roadmap for Semiconductors
http://public.itrs.net/
2003 ITRS Technology Trends - Gate Length
1000
MPU Hi-Performance Gate Length - Printed
MPU Hi-Performance Gate Length - Physical
100 )mn( htgneL etaG htgneL )mn(
2-year Node Cycle
10
3-year Node Cycle
Nano-technology (<100nm) Era Begins - 1999 1 1995 2000 2005 2010 2015 2020 Year 2003 ITRS Period: Near-term: 2003-2009; Long-term: 2010-2018 Moore’s Law in Perspective 2y 20 Figure 8 2003 ITRS—Gate Length Trends
ITRS — delaysy 17
100
Gate Delay (Fan out 4)
Metal 1 (Scaled)
10 Global with Repeaters (Scaled Die Edge) yaleD evita yaleD
Global w/o Repeaters (Scaled Die Edge) l eR
1
Where is the end point?y 21 0.1 250 180 130 90 65 45 32
Process Technology Node (nm) what will limit advancement? Figure 54 Delay for Metal 1 and Global Wiring versus Feature Size fabrication cost (cost per transistor remaining static)? From ITRS 2003 feature sizes hit atomic levels? power density/cooling limits? design and verification challenge?
will silicon be replaced, e.g. by graphine? 4 Computer Design
Danger of Predictionsy 22
“I think there’s a world market for about five computers.”, Thomas J Watson, Chairman of the Board, IBM
“Man will never reach the moon regardless of all future scientific advances.” Dr. Lee De Forest, inventor of the vacuum tube and father of television.
“Computers in the future may weigh no more than 1.5 tons”, Popular Mechanics, 1949
ITRS — Design Costy 23
$ 1 00,000,000,000 sloot noitatnemelpmI CI noitatnemelpmI sloot esueR kcolB egraL yreV egraL kcolB esueR reenignE nihT llaT nihT reenignE ygolodohteM leveL SE leveL ygolodohteM esueR kcolB egraL kcolB esueR esueR kcolB llamS kcolB esueR hcnebtseT tnegilletnI hcnebtseT R&P esuoh nI esuoh R&P
$ 10,000,000,000
$1,000,000,000 tsoC ngiseD latoT ngiseD tsoC
629,769,273
$ 100 , 0 0 0 , 0 0 0 20 ,1 52 , 6 17
R TL M e t h o dology O n ly W ith A ll Future Improvements $10 , 0 0 0 , 0 0 0 1985 1990 1995 2000 2005 2010 2015 2 020 Y ear
Figure 13 Impact of Design Technology on SOC LP-PDA Implementation Cost From ITRS 2003 - Design
Comments on Design Costy 24
design complexity
exponential increase in design size transistors and wires are becoming harder to analyse and becoming less predictable
engineering efficiency
the number of engineers in the world is not going up exponentially! but numerous ways to improve matters, e.g. better design reuse parallels with the “software crisis”
verification and test
verifying functional correctness and identifying manufacturing defects is getting harder
Improving productivity: Verilogy 25
the Verilog hardware description language (HDL) improves productivity
example — counter with enable from earlier:
if(go) n <= n+1; // do the conditional counting
wrapped up into a module + clock assignment + declarations:
module counter2b(clk, go, n); input clk, go; // inputs: clock and go (count enable) signal) output [1:0] n; // output: 2-bit counter value reg [1:0] n; // n is stored in two D flip-flops
always @(posedge clk) // when the clock ticks, do the following... if(go) n <= n+1; // do the conditional counting endmodule
implement via automated synthesis (compilation) and download to FPGA
now digital electronics is about writing parallel algorithms so Computer Scientists are needed to design hardware Lecture 2 — Logic modelling, simulation and synthesis 5
Computer Design — Lecture 2 Verilog casex and casez statementsy 5
Logic Modeling, Simulation and Synthesisy 1 casez — z values are considered to be “don’t care” Lectures so far casex — z and x values are considered to be “don’t care” the last lecture looked at technology and design challenges example:
Overview of this lecture reg [3:0] r; this lecture introduces: casex(r) modeling 4’b000x : statement1; 4’b10x1 : statement2; simulation techniques 4’b1x11 : statement3; endcase synthesis techniques for logic minimisation and finite state machine (FSM) optimisation some values of r the outcome 4’b0001 matches statement 1 4’b0000 matches statement 1 Four valued logicy 2 4’b000x matches statement 1 4’b000z matches statement 1 value meaning 4’b1001 matches statement 2 0 false 4’b1111 matches statement 3 1 true x undefined z high impedance Problems when modeling a tristate busy 6
AND 0 1 x z OR 0 1 x z 0 0 0 0 0 0 0 1 x x 1 0 1 x x 1 1 1 1 1 x 0 x x x x x 1 x x z 0 x x x z x 1 x x
BUFT enable NOT output data 0 1 x z 0 1 0 z 0 x x 1 0 1 z 1 x x x x x z x x x z x z z x x x
Modeling a tri-state buffery 3 scenario:
start: dataA=dataB=enableB=0 and enableA=1 module BUFT( step 1: enableA- output reg out, step 2: enableB+ input in, input enable); ? what is that state of the bus between steps 1 and 2? 0 or z?
// behavioural use of always which activates whenever ? should the output ever go to x and what might the implications be if it // enable or in changes (i.e. both positive and negative edges) does? always @(enable or in) if(enable) Further logic levelsy 7 out = in; else out = 1’bz; // assign high-impedance to simulate charge holding elements (capacitors) we could introduce two endmodule additional logic levels:
Such models are useful for simulation purposes but cannot usually be weak low — indicates that the capacitor has been discharged to low synthesised. Rather, such a model would have to be replaced by a but is not being driven by anything predefined cell/component. weak high — indicates that the capacitor has been charged high but is not being driven by anything
Verilog and logic levelsy 4 further logic levels can be added to account for larger capacitances, rise and fall times, etc. ’z’ can be used to indicate tri-state (see previous example) the standard VHDL (another HDL) logic library uses a 46 state logic called ’===’ can be used to compare wires to see if they are tri-state t logic
similarly, !== is to === what != is to ==
these operators are only useful for simulation (not supported for synthesis)
during simulation x and z are treated as false for conditional expressions
== 0 1 x z === 0 1 x z 0 1 0 x x 0 1 0 0 0 1 0 1 x x 1 0 1 0 0 x x x x x x 0 0 1 0 z x x x x z 0 0 0 1 6 Computer Design
Modeling delaysy 8 Introduction to discrete event simulationy 11
adding delays to behavioural Verilog (typically for simulation only), e.g. to sometimes called Delta simulation delay by 10 simulator time units: only changes in state cause gates to be reevaluated
assign #10 delayed_value = none_delayed_value; data structures:
gates are modeled as objects (including current state information) pure delay — signals are delayed by some time constant state changes passed as (virtual) timed events or messages pending events are inserted into a time ordered event queue
simulation loops around:
pick event with least virtual time off the event queue inertial delay — models capacitive delay pass event to appropriate gate gate evaluates and produces an event if its output has changed
issues:
cancelling runt pulses (event removal) modeling wire delays (add wires as simple gates)
SPICEy 12 Obtaining delay informationy 9 SPICE = Simulation Program with Integrated Circuit Emphasis
a design can be synthesised to gates and approximations of the gate used for detailed analog transistor level simulation delays can be used models nonlinear components as a set of differential equations gates in the netlist can be placed and wires routed so that wire lengths representing the circuit voltages, currents and resistances and capacitances can be determined simulation is highly accurate (accuracy dependent upon models provided) models of gates can vary but simulation is computationally expensive and only practical for small e.g. input to output delay depends upon which input is changing circuits (even on trivial gates like a 2 input NOR) and this variation may be modeled but often some simplifying assumption is made like taking sometimes SPICE is used to analyse standard cells in order to determine the average case typical delays, etc, to enable reasonably accurate digital simulation to be undertaken back annotation there are various versions of commercial and free SPICE which vary in delay information can be back annotated, e.g. adding hidden performance and accuracy depending on implementation details information to a schematic diagram delay information and netlist can be converted back into a low level Synthesis — Design metricsy 13 structural Verilog model
many digital logic circuits achieve the same function, but vary in the: Na¨ıve simulationy 10 number of gates
simplifying assumptions for our na¨ıve simulation: total wiring length
each gate has unit delay (1 virtual time unit) timing properties through various paths netlist held as a simple list of gates with enumerated wires to indicate power consumption interconnect an array of wires holds the state Karnaugh maps and Boolean n-cubesy 14 simulation takes the current state (wire information) and evaluates each gate to in turn to produce a new set of state. This process is then repeated.
problems:
many parts of the circuit are likely to be inactive at a given instant in time, so having to reevaluate each gate for every simulation cycle is expensive delays have to be implemented as long strings of buffers which is likely to slow things down
implicants are sub n-cubes Lecture 2 — Logic modelling, simulation and synthesis 7
Quine-McCluskey minimisation — step 1y 15 QM example cont...y 19
find all prime implicants using x y + x y¯ = x repeatedly Find terms which differ by one bit Code Notes first produce a truth table for the function and extract the minterms ABCD X required (i.e. rows in the truth table where the output is 1) 5 0101 10 1010 X exhaustively compare pairs of terms which differ by 1 bit to produce a 11 1011 X new term where the 1 bit difference is marked by a don’t care X 13 1101 X tick those terms which have been selected since they are covered by 14 1110 X the new term 15 1111 X New terms: repeat with new set of terms (X must match X) until no more terms A X101 combining 5,13 can be produced B 101X combining10,11 terms which are unticked are the prime implicants C 1X10 combining 10,14 D 1X11 combining 11,15 Quine-McCluskey minimisation — step 2y 16 E 11X1 combining 13,15 F 111X combining14,15
select smallest set of prime implicants to cover function: where X indicates that a term is covered
prepare prime-implicants chart QM example cont...y 20 select essential prime implicants for which one or more of its minterms are unique (only once in the column) obtain a new reduced PI chart for remaining prime-implicants and the Find terms which differ by one bit and have X’s in the same remaining minterms place select one or more of remaining prime implicants which will cover all Code ABCD Notes the remaining minterms A X101 B 101X X computationally very expensive for large equations C 1X10 X D 1X11 X tools like Espresso use heuristics to improve performance but at the E 11X1 expense of not being exact F 111X X there are better but far more complex algorithms for exact Boolean New terms: minimisation G 1X1X combining B,F H 1X1X combiningC,D—duplicatesoremove QM exampley 17 QM example cont...y 21 An example truth table Code ABCD Term Output The prime implicant chart 0 0000 ABCD 0 active minterms terms 1 0001 ABCD 0 Code ABCD A: X101 E: 11X1 G: 1X1X 2 0010 ABCD 0 5 0101 X 3 0011 ABCD 0 10 1010 X 4 0100 ABCD 0 11 1011 X 5 0101 ABCD 1 13 1101 XX 6 0110 ABCD 0 14 1110 X 7 0111 ABCD 0 15 1111 XX 8 1000 ABCD 0 9 1001 ABCD 0 where X indicates that a term covers a minterm 10 1010 ABCD 1 so terms A and G cover all minterms (i.e. they are essential terms) and 11 1011 ABCD 1 term E is not required 12 1100 ABCD 0 13 1101 ABCD 1 therefore the minimised equation is B C D + A C 14 1110 ABCD 1 15 1111 ABCD 1 Further comments on logic minimisationy 22
QM example cont...y 18 if optimising over multiple equations (i.e. multiple outputs) with shared terms then the Putnam and Davis algorithm can be used (but not much The active minterms more sophisticated than QM) Code ABCD sum of products form may not be simplest logic structure: multi-level logic 5 0101 structures are often more compact (ongoing research in this area) 10 1010 11 1011 sometimes simplification in terms of XOR gates (rather than AND and OR 13 1101 gates) is more appropriate — called Read-Muller logic 14 1110 e.g. see “Hug an XOR gate today: an introduction to Read-Muller logic”: 15 1111 http://www.reed-electronics.com/ednmag/archives/1996/030196/05df4.htm
don’t-cares are very important for efficient logic minimisation so when writing Verilog it is important to indicate undefined values, e.g.:
wire [31:0] my_alu = (opcode==’add) ? a + b : (opcode==’not) ? ~a : (opcode==’sub) ? a - b : 32’bx; 8 Computer Design
Adding redundant logicy 23
adding redundant logic can reduce the amount of wiring required
Wire minimisationy 24
wire minimisation is largely the job of the place & route tool and not the synthesis tool
the synthesis tools may include:
redundant logic placement hints to guide place & route
manual floor planning may also be performed to force the hand of the place & route tools
increasingly important as chips get larger
Finite state machine minimisationy 25
state minimisation
remove duplicate states remove redundant/unreachable states
state assignment
assign a unique binary code to each state the logic structure depends on the assignment, thus this needs to be done optimally (e.g. algorithms: NOVA, JEDI)
BUT much Verilog code is explicit about register usage so little optimisation possible
higher level behavioral Verilog introduces implicit state machines which the synthesis tool is in a better position to optimise
Retiming and D-latch migrationy 26
if a path through some logic is too long then it is sometimes possible to move the flip-flops to compensate without altering the functional behaviour
similarly, it is sometimes possible to add extra D-latches to make a pipeline longer Lecture 3 — Chip, board and system testing 9
Computer Design — Lecture 3 Fault reductionsy 5
Testingy 1 checkpoints Overview of this lecture points where faults could occur Introduce: fault equivalence production test (fault models, test metrics, test pattern generation, scan remove test for least significant fault path testing) fault dominance functional test (in simulation and on FPGA)
if every test for fault f1 detects f2 then f1 dominates f2 introduction to ECAD lab sessions ⇒ only have to generate test for f1 Objectives of production testingy 2
check that there are no manufacturing defects
NOT to check whether you have designed the device correctly
economics: cost of detecting a faulty component is lowest before it is packaged and embedded in a system Test patterns and path sensitisationy 6 for consumer products:
faulty goods cost a lot to replace so require low defect rate (e.g. less test patterns are sequences of (input values, expected result) pairs that 0.1%) but testing costs money so don’t exhaustively test — 98% fault path sensitisation coverage probably acceptable inputs required in order to make a fault visible on the outputs for medical, aerospace and military (i.e. safety-critical) products:
must have 100% coverage
in some instances a design will contain redundant units (e.g. on DRAM) which can be selected, thereby improving yield
Fault modelsy 3
logical faults
stuck-at (most common) CMOS stuck-open Test effectivenessy 7 CMOS stuck-on bridging faults undetectable fault — no test exists for fault
parametric faults redundant fault — undetectable fault whose occurrence does not affect circuit operation low/high voltage/current levels testability = number of detectable faults gate or path delay-faults number of faults
testing methods: effective faults = number of faults − redundant faults fault coverage number of detectable faults parametric (electrical) tests also detect stuck-on faults = number of effective faults logical tests detect stuck-at faults test set size = number of test patterns transition tests detect stuck-open faults goal is 100% fault coverage (not 100% testability) with a minimum sized timed transition tests detect delay faults test set
Testabilityy 4
controllability
the ability to set and clear internal signals it is particularly useful to be able to change the state in registers inside a circuit
observability
the ability to detect internal signals 10 Computer Design
Automatic Test Pattern Generation (ATPG)y 8 Functional testing in simulationy 12
Many algorithms developed to do ATPG, e.g. D-Algorithm, PODEM, advantages: PODEM-X, etc. General idea: allows test benches to be written to check many cases gives good visibility of state generate a sequence of test vectors to identify faults is quick to do some tests since no place & route required (unless you test vector = input vector, correct output vector are simulating a post-layout design to check timing)
input and output vectors are typically 3 valued: (0, 1,X) disadvantages: simple combinatoric circuits (no feedback) can be supplied with test slow if simulations need to be for billions of clock cycles (e.g. 10s or vectors in any order more of real-time) difficult to test if complex input/output behaviour is required (e.g. if circuits with memory (latches/combinatoric feedback) require sequences your test bench had to look like a VGA monitor, mouse or Ethernet of test vectors in order to modify internal state, e.g. consider having to test: link)
Verilog test benchesy 13
Test benches can be written in Verilog (models in other languages, e.g. C, are also possible). Example showing Verilog test bench constructs: module testsim ( output reg reg clk , r s t ; // clock and reset signals output reg [3:0] counter); / / counter
Scan path testingy 9 initial begin // start of sequence of initialisation steps c l k = 0; // initialise registers in sequence using blocking assigment r s t = 1; #100 // wait 100 simulation cycles, then... make all the D flip-flops (DFFs) in a circuit “scannable” r s t = 0; // release the reset signal #1000 // wait 1000 simulation cycles, then... i.e. add functionality to every DFF to enable data to be shifted in and out of $stop ( ) ; // ...stop simulation the circuit end
always #5 clk = !clk; // make clock oscillate at 10 simulation step rate
// Now illustrate two approaches to monitoring signals
always @( counter ) // when ever counter changes $display ( ” at time %d: counter changed to = %d” ,$time ,counter );
always @( negedge c l k ) // after clock edge $display ( ” at time %d: counter updated to %d” ,$time ,counter );
// design under test (usually instatitated module) always ff @( posedge c l k or posedge r s t ) simple scan flip-flop i f (rst) counter <= 0; else counter <= counter+1;
this is a significant aid to ATPG since testing latches is now easy and the endmodule flip-flops have sliced the circuit into small combinatoric blocks which are usually nice and simple to test
boundary scan — just have a scan path around I/O pads of chip or Functional testing on FPGAy 14 macrocell advantages: JTAG standardy 10 fast implementation connected to real I/O the IEEE 1149 (JTAG) international standard defines 4 wires for boundary scan: disadvantages:
tms = test mode select (high for boundary scan) lack of visibility on signals tdi = test data input (serial data in) partly solved with SignalTap — see later tck = test clock (clock serial data) difficlut to test side cases tdo = test data output (read back old data whilst new is shifted in) have to wait for complete place & route between changes also used for other things, e.g.:
in-circuit programming of FPGAs (e.g. Altera) single step debugging of embedded microprocessors
Functional testingy 11
objective: ensure that the design is functionally correct
simulation provides great visibility and testability
implementation on FPGA allows rapid prototyping and testing of I/O Lecture 3 — Chip, board and system testing 11
Key debouncer exampley 15 Modelsim simulation outputy 18
module debounce ( input clk , // clock at 50MHz # 0: bouncy = 0 clean = x numbounces = x input r s t , / / reset # 0: rst = 1 input bouncy , // bouncy signal # 0: bouncy = 0 clean = 0 numbounces = 0 output reg clean , // clean signal # 100: rst = 0 output reg [15:0] numbounces); reg prev syncbouncy ; # 10100: bouncy = 1 clean = 0 numbounces = 0 reg [20:0] counter; # 10125: bouncy = 1 clean = 0 numbounces = 1 wire counterAtMax = &counter; // N.B. vector AND of the bits # 20100: bouncy = 0 clean = 0 numbounces = 1 wire syncbouncy ; # 20125: bouncy = 0 clean = 0 numbounces = 2 synchroniser dosync(. clk(clk ) ,.async(bouncy) ,.sync(syncbouncy )); always ff @( posedge c l k or posedge r s t ) # 120100: bouncy = 1 clean = 0 numbounces = 2 i f ( r s t ) begin # 120125: bouncy = 1 clean = 0 numbounces = 3 counter <= 0; # 220100: bouncy = 0 clean = 0 numbounces = 3 numbounces <= 0; # 220125: bouncy = 0 clean = 0 numbounces = 4 prev syncbouncy <= 0; clean <= 0; # 1220100: bouncy = 1 clean = 0 numbounces = 4 end else begin # 1220125: bouncy = 1 clean = 0 numbounces = 5 prev syncbouncy <= syncbouncy; # 2220100: bouncy = 0 clean = 0 numbounces = 5 i f (syncbouncy != prev syncbouncy) begin // detect change # 2220125: bouncy = 0 clean = 0 numbounces = 6 < counter = 0; # 12220100: bouncy = 1 clean = 0 numbounces = 6 numbounces <= numbounces+1; end else i f (! counterAtMax) // no bouncing, so keep counting # 12220125: bouncy = 1 clean = 0 numbounces = 7 counter <= counter+1; # 33191645: bouncy = 1 clean = 1 numbounces = 7 else // output clean signal since input stable for # 42220100: bouncy = 0 clean = 1 numbounces = 7 // 2ˆ21 clock cycles (approx. 42ms) # 42220125: bouncy = 0 clean = 1 numbounces = 8 clean <= syncbouncy; end # 63191645: bouncy = 0 clean = 0 numbounces = 8 endmodule # Break in Module testbench at ...
On FPGA testy 19 Synchronisery 16
module keybounce ( module synchroniser( input CLOCK 50 , input clk , input [ 1 7 : 0 ] SW, input async , input [ 3 : 0 ] KEY, output reg sync ) ; output [17:0] LEDR, output [7:0] LEDG); reg metastable ; always @( posedge c l k ) begin (∗ preserve ,noprune ∗) reg [31:0] timer; metastable <= async ; (∗ preserve ,noprune ∗) reg [2:0] delayed triggercondition; sync <= metastable; (∗ preserve ,noprune ∗) reg bouncy ; end (∗ keep ,noprune ∗) wire r s t , clean ; endmodule (∗ keep ,noprune ∗) wire triggercondition = SW[0] ˆ clean; (∗ keep ,noprune ∗) wire [15:0] numbounces; always @( posedge CLOCK 50) begin bouncy <= SW[ 0 ] ; Test bench for key debouncery 17 d e l a y e d triggercondition <= { d e l a y e d triggercondition[1:0] , triggercondition } ; end always @( posedge CLOCK 50 or posedge r s t ) i f ( r s t ) timer <=0; else timer <=timer +1; module testbench ( synchroniser syncrst (. clk(CLOCK 50) ,.async(!KEY[0]) ,.sync(rst )); output reg clk , debounce DUT(. clk (CLOCK 50) ,. rst(rst ) ,.bouncy(bouncy) ,.clean(clean), output reg r s t , .numbounces(numbounces ) ) ; output reg bouncy , assign LEDG[1:0] = clean ? 2’b11 : 2’b00; output clean ) ; assign LEDG[3:2] = SW[0] ? 2’b11 : 2’b00; assign LEDG[ 7 : 4 ] = { d e l a y e d triggercondition[2] ,bouncy,triggercondition ,timer[3 1 ] } ; wire [15:0] numbounces; assign LEDR[17:0] = {2’b00,numbounces[15:0] } ; endmodule debounce DUT(. clk(clk ) ,. rst(rst ) ,.bouncy(bouncy) ,.clean(clean), .numbounces(numbounces ) ) ;
initial begin 20 c l k =0; SignalTapy r s t =1; bouncy =0; #100 r s t =0; motivation: provide analysis of state of running system #10000 bouncy=1; / / 1e3 t i c k s #10000 bouncy=0; approach: automatically add extra hardware to capture signals of interest #100000 bouncy=1; / / 1e4 t i c k s #100000 bouncy=0; SignalTap: Altera’s tool to make this easier #1000000 bouncy=1; / / 1e5 t i c k s #1000000 bouncy=0; for further details see the Lab. web pages #10000000 bouncy=1; / / 1e6 t i c k s #30000000 bouncy=0; // 3e6 ticks (should have gone high) #30000000 $stop; end SignalTap exampley 21
always #5 clk = !clk; probe the debouncer degin always @( r s t ) $display ( ”%09d : r s t = %d” ,$time, rst ); difficulty: bounce events happen infrequently and there is not enough always @( bouncy or clean or numbounces[15:0]) memory on the FPGA to store a long trace of events every clock cycle $display ( ”%09d : bouncy = %d clean = %d numbounces = %d” , $time ,bouncy , clean ,numbounces); solution: start and stop signal capture to observe the interesting bits (trigger signals added to the design to achieve this) endmodule 12 Computer Design
SignalTap example setupy 22
SignalTap example tracey 23 Lecture 4 — Verilog systems design 13
Computer Design — Lecture 4 Specifying pin assignmentsy 5
Design for FPGAsy 1 Import a qsf file (tPad_pin_assignments.qsf) which tells the Quartus tools an assignment of useful names to physical pins. Overview of this lecture this lecture: set_location_assignment PIN_Y2 -to CLOCK_50 set_location_assignment PIN_H22 -to HEX0[6] introduces the tPad FPGA board set_location_assignment PIN_J22 -to HEX0[5] gives some more SystemVerilog design examples ... set_location_assignment PIN_AC27 -to SW[2] presents some design techniques set_location_assignment PIN_AC28 -to SW[1] set_location_assignment PIN_AB28 -to SW[0] explains some design pitfalls ... set_location_assignment PIN_E19 -to LEDR[2] ECAD+Architecture lab. teaching boards (tPads)y 2 set_location_assignment PIN_F19 -to LEDR[1] set_location_assignment PIN_G19 -to LEDR[0] ... set_location_assignment PIN_R6 -to DRAM_ADDR[0] set_location_assignment PIN_V8 -to DRAM_ADDR[1] set_location_assignment PIN_U8 -to DRAM_ADDR[2] ...
Example top level module: lightsy 6
module lights( input CLOCK_50, output [17:0] LEDR, input [17:0] SW);
logic [17:0] lights; assign LEDR = lights;
// do things on the rising edge of the clock always_ff @(posedge CLOCK_50) begin lights <= SW[17:0]; end Front of the tPady 3 endmodule
Advantages of simulationy 7
simulation is main stay of debugging on FPGA and even more so for ASIC
new ECAD labs emphasise simulation
better design practise — test driven development better for you — avoid lots of long synthesis runs
old ECAD labs relied on generating video which is difficult to get helpful results from in simulation
Problems with simulationy 8
simulation is of an abstracted world which may hide horrible side cases!
emulation of external input/output devices can be tricky and time FPGA design flowy 4 consuming e.g. video devices or an Ethernet controller Lab 1 shows you how to use the following key components of Quartus: semantics of SystemVerilog are not well defined Text editor results in discrepancies between simulation and synthesis Simulator simulation and synthesis tools typically implement different a subset Compiler of the language Timing analyser but even with these weaknesses, simulation is still very powerful Programmer simulation can even model things that the real hardware will not The web pages guide you through some quite complex commercial-grade model, e.g. D flip-flops that are not reset properly will start in ’x’ state tools in simulation so can easily be found vs. real-world where some “random” state will appear 14 Computer Design
The danger of asynchronous inputs and bouncing Resetting your circuitsy 12 buttonsy 9 the Altera PLDs default to all registers being zero after programming asynchronous inputs cause problems if they change state near a clock edge but in other designs you may need to add a reset signal and it is often handy to have your FPGA design resettable from a button metastability can arise most Verilog synthesis tools support the following to perform an sampling multiple inputs particularly hazardous asynchronous reset: where possible, sample inputs at least twice to allow metastability to resolve (resynchronisation) always @(posedge clk or posedge reset) never use an asynchronous input as part of a conditional without if(reset) resynchronising it first begin // registers which need to be assigned a reset value go here end Two flop synchronisery 10 else begin // usual clocked behaviour goes here end
note that distribution of reset is an issue handled by the tools like clock distribution
Signalling protocolsy 13
often need a go or done or new data signal
pulse
mean time between failure (MTBF) can be calculated: pulse high for one clock cycle to send an event
t problem: the receiving circuit might be a multicycle state machine eτ and may miss the event (e.g. traffic light controller which stops only at MTBF = red if sent a “stop” event) fdfcTw 2-phase
where: every edge indicates an event can send an event every clock cycle t is the time allowed for first DFF to resolve, so allowing just a small amount of extra time to resolve makes a big difference since it is an 4-phase exponential term level sensitive — rising and falling edges may have a different τ=gain of DFF meaning f and f are the frequencies of the data and clock respectively d c or falling edges might be ignored T is the metastability time window w can’t send an event every clock cycle
Example: Two DFF Synchronisery 11 Example: 2-phase signallingy 14 module synchroniser( input clk, module delay( req, ack, clk ); input asyncIn, output reg syncOut); input req; // input request output ack; // output acknowledge reg metastableFF; input clk; // clock
always @(posedge clk) begin reg [15:0] dly; metastableFF <= asyncIn; reg prev_req; syncOut <= metastableFF; reg ack; end endmodule always @(posedge clk) begin prev_req <= req; Quartus understands this construct and can optimise placement of the DFFs — see Chapter 11 of the Quartus manual if(prev_req != req) dly <= -1; // set delay to maximum value else if(dly != 0) dly <= dly-1; // dly>0 so count down
if(dly == 1) ack <= !ack; end endmodule Lecture 4 — Verilog systems design 15
Example: 4-phase signallingy 15 Example: FIFO with combination reverse control pathy 18 module loadable_timer(count_from, load_count, busy, clk); input [15:0] count_from; module f i f o o n e B( input load_count; // clock & reset input clk , output busy; input r s t , input clk; // input side reg busy; input logic [ 7 : 0 ] din , input logic d i n v a l i d , reg [15:0] counter; output logic din ready , always @(posedge clk) // output side if(counter!=0) output logic [ 7 : 0 ] dout , output logic d o u t v a l i d , counter <= counter - 1; input logic dout ready ) ; else begin logic f u l l ; always comb begin busy <= load_count; // N.B. wait for both edges of load_count f u l l = d o u t v a l i d ; if(load_count && !busy) din ready = !full | | dout ready ; counter <= count_from; end end always ff @( posedge c l k or posedge r s t ) i f ( r s t ) begin endmodule d o u t v a l i d <= 1 ’ b0 ; dout <= 8 ’ hxx ; end else i f (full && dout ready && !din v a l i d ) Combination control pathsy 16 d o u t v a l i d <= 1 ’ b0 ; else i f ( din ready && d i n v a l i d ) begin dout <= din ; d o u t v a l i d <= 1 ’ b1 ; typically data-paths have banks of DFFs inserted to pipeline the design end endmodule sometimes it can be slow to add DFFs in the control path, especially flow-control signals
examples: fifo one A which latches all control signals and fifo one B which Example: FIFO test benchy 19 has a combinational backward flow-control path
design B is faster except when lots of FIFO elements are joined together module t e s t f i f o o n e A(); logic clk , r s t ; and the combinational path becomes long initial begin c l k = 1; r s t = 1; Example: FIFO with latched control signalsy 17 #15 r s t = 0; end always #5 clk = !clk; module f i f o o n e A( logic [7:0] din, dmiddle, dout; // clock & reset logic d i n v a l i d , din ready ; input clk , logic dmiddle valid , dmiddle ready ; input r s t , logic d o u t v a l i d , dout ready ; // input side f i f o o n e A stage0(.clk, .rst, .din, .din v a l i d , . din ready , input logic [ 7 : 0 ] din , .dout(dmiddle), .dout valid(dmiddle v a l i d ) , input logic d i n v a l i d , . dout ready(dmiddle ready ) ) ; output logic din ready , f i f o o n e A stage1(.clk, .rst, // output side .din(dmiddle), .din valid(dmiddle v a l i d ) , output logic [ 7 : 0 ] dout , . din ready(dmiddle ready ) , output logic d o u t v a l i d , .dout, .dout valid , .dout ready ) ; input logic dout ready ) ; logic [7:0] state; logic f u l l ; always ff @( posedge c l k or posedge r s t ) always comb begin i f ( r s t ) begin f u l l = d o u t v a l i d ; s t a t e <= 8 ’ d0 ; din ready = !full; d i n v a l i d <= 1 ’ b0 ; end dout ready <= 1 ’ b1 ; end else begin always ff @( posedge c l k or posedge r s t ) i f ( d o u t valid && dout ready ) i f ( r s t ) begin $display ( ”%05t : f i f o output = %1d”, $time, dout); < d o u t v a l i d = 1 ’ b0 ; i f ( din ready ) begin < dout = 8 ’ hxx ; din <= ( s t a t e +1)∗3; end else i f (full && dout ready ) d i n v a l i d <= 1 ’ b1 ; < d o u t v a l i d = 1 ’ b0 ; s t a t e <= s t a t e +1; else i f ( din ready & d i n v a l i d ) begin i f ( state >8) $stop ; < dout = din ; end else < d o u t v a l i d = 1 ’ b1 ; d i n v a l i d <= 1 ’ b0 ; end end // else: !if(rst) endmodule endmodule 16 Computer Design
Example: one hot state encodingy 20 FPGA architecturey 23
states can just have an integer encoding (see last lecture) FPGAs are made up of several different reconfigurable components:
but this can result in quite complex if expressions, e.g.: LUTs — look-up tables (typically 4 inputs, one output) — can implement any 4-input logic function if(current_state==‘red) //...stuff to do when in state ‘red LABs — LUT + DFF + muxes if(current_state==‘amber) programmable wiring (N.B. significant delay in interconnect) //...stuff to do when in state ‘amber memory blocks (e.g. 9-kbits supporting different data widths, one or an alternative is to use one register per state vis: two ports, etc.) DSP — digital signal processing blocks, e.g. hard (i.e. very fast) reg [3:0] four_states; multipliers wire red = four_states[0] || four_states[1]; I/O — input/output blocks for normal signals, high speed serial (e.g. wire amber = four_states[1] || four_states[3]; Ethernet, SATA, PCIe, ...), etc. wire green = four_states[2];
always @(posedge clk or posedge reset) FPGA blocks — LABsy 24 if(reset) four_states <= 4’b0001;
else Register Chain Register Bypass four_states <= {four_states[2:0],four_states[3]}; // rotate bits Routing from LAB-Wide previous LE Synchronous LAB-Wide Programmable Load Synchronous Register LE Carry-In Clear SystemVerilog pitfalls 1: passing buses aroundy 21 data 1 Row, Column, data 2 Synchronous And Direct Link Look-Up Table Carry data 3 Load and D Q Routing (LUT) Chain automatic bus resizing Clear Logic data 4 ENA buses of wrong width get truncated or padded with 0’s CLRN Row, Column, And Direct Link Routing wire [14:0] addr; // oops got width wrong, but no error generated labclr1 wire [15:0] data; labclr2 Asynchronous Chip-Wide Local CPU the_cpu(addr,data,rw,ce,clk); Clear Logic Reset Routing MEM the_memory(addr,data,rw,ce,clk); Register Feedback (DEV_CLRn)
Clock & Register Chain wires that get defined by default Clock Enable Output Select if you don’t declare a wire in a module instance then it will be a single LE Carry-Out labclk1 bit wire, e.g.: labclk2 wire [15:0] addr; labclkena1 labclkena2 // oops, no data bus declared so just 1 bit wide CPU the_cpu(addr,data,rw,ce,clk); MEM the_memory(addr,data,rw,ce,clk); FPGA blocks — interconnecty 25 you pass too few parameters to a module but no error occurs!
SystemVerilog pitfalls 2: naming and parameter or- deringy 22
modules have a flat name space
can be difficult to combine modules from different projects because A BCD they may share some identical module names with different functions
parameter ordering is easy to get wrong
but you can specify parameter mapping vis: Direct link Direct link interconnect interconnect from adjacent loadable_timer lt1(.clk(clk), .count_from(timerval), from adjacent block .load_count(start), .busy(busy)); block which is identical to:
loadable_timer lt2(timerval, start, busy, clk); Direct link Direct link interconnect interconnect provided nobody has changed loadable_counter since you last to adjacent to adjacent block looked! block
in SystemVerilog there is a short hand for variables passing with identical E E FB names, e.g. .clk(clk) = .clk Lecture 4 — Verilog systems design 17
FPGA blocks — multipliersy 26 Final words on ECADy 30
signa signb Programmable hardware is here to stay, and it likely to become even more aclr widespread in products (i.e. not just for prototyping) clock ena SystemVerilog is an improvement over Verilog, but a long way to go
active research being undertaken into higher level HDLs
Data A DQ e.g. more recent languages like Bluespec add channel ENA communication mechanisms Data Out DQ CLRN ENA Hope you enjoy the Lab’s and learning about hardware/software codesign
CLRN Data B DQ
ENA BC A CLRN A Embedded Multiplier Block
FPGA blocks — simplified single-port memoryy 27
Data Out DQ DQ EN ENA AB CLRN