
Lecture 8: More Pipelining
David Black-Schaffer
[email protected]
EE183 Spring 2003

Overview

- Getting Started with Lab 2
    - Just get a single pixel calculating at one time
    - Then look into filling your pipeline
- Multipliers
    - Different options for pipelining: what do you need?
    - 3 multipliers, or put x*x, y*y, and x*y through sequentially?
- Pipelining
    - If it won't fit in one clock cycle you have to divide it up so each stage will fit
    - The control logic must be designed with this in mind
    - Make sure you need it

EE183 Lecture 8 - Slide 2

Public Service Announcement

- Xilinx Programmable World
    - Tuesday, May 6th
    - http://www.xilinx.com/events/pw2003/index.htm
- Guest Lectures
    - Monday, April 28th: Ryan Donohue on Metastability and Synchronization
    - Wednesday, May 7th: Gary Spivey on ASIC & FPGA Design for Speed
- The content of these lectures will be on the Quiz
    - SHOW UP! (please)

Logistics

- Lab 2 Prelab due Friday by 5pm
- Guest lecture next Monday: Synchronization and Metastability
    - These are critical for high-speed systems and anything where you'll be connecting across clock domains.


Easier FSMs

    always @(button or current_state)
    begin
        // Defaults: do this if nothing else is specified.
        write_en   <= 0;
        out        <= 1;    // "output" is a reserved word in Verilog
        next_state <= `START_STATE;
        case (current_state)
            `START_STATE:
                begin
                    write_en <= 1;
                    // Note that the else is not specified.
                    // What is the next state?
                    if (button) next_state <= `WAIT_STATE;
                end
            `WAIT_STATE:
                begin
                    out <= 0;
                end
        endcase
    end

Be careful! An unspecified else is an easy way to infer latches!

Data Path

[Figure: data path. Inputs Xn and Yn; Mandelbrot/Julia select muxes on X and Y; three multipliers (*); a row of -, <<1 and +; a row of +, + and "If > 4, output 1"; outputs Xn+1 and Yn+1, fed back for the next iteration.]

What do we do if the whole data path doesn't fit in one clock cycle?
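A rough software model can make the data path easier to reason about before building it. The sketch below assumes the figure implements the standard Mandelbrot/Julia iteration, x' = x*x - y*y + cx and y' = 2*x*y + cy, with the escape test x*x + y*y > 4; the slide shows the operators but not the equations, so that mapping is an assumption. The names `datapath_step` and `iterations_to_escape` are hypothetical, and the 64-iteration limit is a guess based on the Lab 2 hint later in the lecture.

```python
def datapath_step(x, y, cx, cy):
    """One pass through the data path: three multiplies, then two rows of adds."""
    xx = x * x                       # multiplier 1
    yy = y * y                       # multiplier 2
    xy = x * y                       # multiplier 3
    escaped = (xx + yy) > 4.0        # "+" then "If > 4, output 1"
    x_next = (xx - yy) + cx          # "-" then "+"
    y_next = (2.0 * xy) + cy         # "<<1" (multiply by 2) then "+"
    return x_next, y_next, escaped


def iterations_to_escape(cx, cy, max_iter=64):
    """Feed the outputs back as the next inputs until the point escapes.

    Starts from x = y = 0 (Mandelbrot-style); max_iter is an assumed limit.
    """
    x, y = 0.0, 0.0
    for n in range(max_iter):
        x, y, escaped = datapath_step(x, y, cx, cy)
        if escaped:
            return n
    return max_iter
```

Calling `iterations_to_escape(0.0, 0.0)` never escapes and returns the iteration limit, which illustrates why different pixels need different numbers of passes through the feedback loop.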

Pipelining Example 1

[Figure: the data path annotated with delays. Mux 2ns; each multiplier 12ns; - 6ns; <<1 1ns; + 6ns; second-row + 6ns; If > 4 3ns. Critical path: 24ns, plus the feedback wire for the next iteration. How long does this wire take?]

Pipelining Example 2

[Figure: the same data path with pipeline registers after the multipliers. Critical paths: 12ns in the first stage and 12ns + ??ns in the second, where ??ns is the feedback wire for the next iteration. How long does this wire take?]
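The effect of where the pipeline registers go can be checked with a few lines of arithmetic. This is a minimal sketch using the operator delays annotated on the slides; like the slides' own totals, it appears to leave out the mux and comparator delays, and it ignores register setup and clock-to-output time. The function names and the `wire_ns` parameter (standing in for the unknown "??ns" feedback wire) are hypothetical.

```python
# Operator delays in ns, as annotated on the slides.
MULT_NS = 12
SUB_NS = 6
ADD_NS = 6


def clock_period_ns(stages, wire_ns=0.0):
    """Minimum clock period: the slowest stage sets the clock.

    `stages` is a list of stages, each a list of operator delays in series.
    `wire_ns` models the unknown feedback-wire delay and is charged to the
    last stage, which drives the feedback for the next iteration.
    """
    totals = [float(sum(stage)) for stage in stages]
    totals[-1] += wire_ns
    return max(totals)


def max_clock_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


EXAMPLE_1 = [[MULT_NS, SUB_NS, ADD_NS]]        # no pipeline registers
EXAMPLE_2 = [[MULT_NS], [SUB_NS, ADD_NS]]      # registers after the multipliers
EXAMPLE_3 = [[MULT_NS], [SUB_NS], [ADD_NS]]    # registers after each row
```

In this model Examples 2 and 3 both come out at 12ns when the wire delay is small, because the 12ns multiplier stays the slowest stage. That is one way to see why the multipliers themselves are the next thing to pipeline.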

Latency and Throughput

Not Pipelined: Input -> Comb. Logic -> Output (35ns critical path, 1 stage)

    Time      Input   Output
    t=0       A1
    t=35ns    A2      A1
    t=70ns    A3      A2
    t=105ns           A3

Pipelined: Input -> Pipeline Registers -> Output (<20ns critical path, 3 stages)

    Time      Input   S1    Output
    t=0       A1
    t=20ns    A2      A1
    t=40ns    A3      A2    A1
    t=60ns            A3    A2
    t=80ns                  A3

First output takes longer because it has to go through all the stages, but subsequent results can come out every clock cycle.

Pipelining Example 3

[Figure: the data path with pipeline registers after the multipliers and again after the -, <<1, + row. Critical paths: 12ns, 6ns, and 6ns + ??ns, where ??ns is the feedback wire for the next iteration.]
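The two latency and throughput tables above can be reproduced with one small function. This is a sketch, not a timing model: `output_times` and its `hops` parameter are hypothetical names, where `hops` is the number of clock periods between applying an input and seeing it in the Output column (1 for the unpipelined table, 2 for the pipelined table as drawn with its Input, S1, and Output columns).

```python
def output_times(n_items, period_ns, hops):
    """Time at which each item appears at the output.

    Item i is applied at i * period_ns and shows up `hops` periods later.
    """
    return [(i + hops) * period_ns for i in range(n_items)]


not_pipelined = output_times(3, period_ns=35, hops=1)   # A1, A2, A3
pipelined = output_times(3, period_ns=20, hops=2)       # A1, A2, A3

# Latency is the time to the first result; throughput is set by the
# spacing between consecutive results.
latency_not_pipelined = not_pipelined[0]
latency_pipelined = pipelined[0]
spacing_not_pipelined = not_pipelined[1] - not_pipelined[0]
spacing_pipelined = pipelined[1] - pipelined[0]
```

The pipelined version has the longer first-result latency (40ns against 35ns) but delivers a result every 20ns instead of every 35ns, so all three items finish sooner.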

Key points on Pipelining

- Increases utilization for operators
    - You can do multiple calculations at once so you can use everything maximally (ideally)
    - This is the point!
- Store the results from smaller calculations to make the overall calculation faster.
    - Put the next data item into the datapath before the previous one has finished
    - The pipe registers keep the computation separate
    - You will have a lot of pipe registers if you're doing a lot of calculations (i.e., Lab 3!)
- What is the effect of the algorithm feeding back on itself?
    - Do all points have the same number of iterations? (control)
    - Is the data dependent between pipeline stages? (hazards)

Multipliers

- CoreGen gives you several pipelining options
    - Which is best?
    - Depends on your design
    - Data size will determine speed and pipelining
- Design is an iterative process so you won't be able to choose the best approach at first (i.e., get started early!)


[Figure: the data path with each multiplier split into three stages. Inputs Xn and Yn; Mandelbrot/Julia select muxes; three rows of * * *, each followed by pipeline registers; a row of -, <<1, + followed by pipeline registers; a row of +, +, "If > 4, output 1"; outputs Xn+1 and Yn+1.]

Multiplier Issues

- Multipliers are BIG
    - How can we get away with fewer multipliers?
- Multipliers may be SLOW
    - How can we utilize them maximally?


[Figure: the same five-stage data path with its units labeled. Multiplier stages M1a, M2a, M3a; pipeline registers; M1b, M2b, M3b; pipeline registers; M1c, M2c, M3c; pipeline registers; Add1 (-, <<1, +); pipeline registers; Add2 (+, +, "If > 4, output 1"); outputs Xn+1 and Yn+1.]

Now we have…

- With a 3-stage multiplier you've now got 5 pipeline stages
- How can you keep the pipeline full?
    - How many things do you need to calculate at once?
    - What is full? Will you ever get 100% utilization? What is good enough?


Pipeline Performance Analysis

- With the bad data path (3 three-stage multipliers and 2 stages after that; one pixel at a time)

    Clk  M1a M2a M3a  M1b M2b M3b  M1c M2c M3c  Add1 Add2
    1     •   •   •
    2                  •   •   •
    3                               •   •   •
    4                                            •
    5                                                 •

- In 5 cycles we used 11 units out of 55 available: 20% average utilization

Pipeline Performance Analysis

- With the bad data path (3 three-stage multipliers and 2 stages after that; multiple pixels at a time; P1 to P5 are successive pixels)

    Clk  M1a M2a M3a  M1b M2b M3b  M1c M2c M3c  Add1 Add2
    1    P1  P1  P1
    2    P2  P2  P2   P1  P1  P1
    3    P3  P3  P3   P2  P2  P2   P1  P1  P1
    4    P4  P4  P4   P3  P3  P3   P2  P2  P2   P1
    5    P5  P5  P5   P4  P4  P4   P3  P3  P3   P2   P1

- We approach 100% utilization if there are no stalls or dependencies and we can keep getting new data
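The utilization figures in these two tables can be counted mechanically. The sketch below is an idealized model that assumes a new, independent pixel enters the pipeline on every clock with no stalls; `STAGE_UNITS` and `utilization` are hypothetical names.

```python
# Functional units busy in each of the 5 pipeline stages:
# three multiplier stages of 3 units each, then Add1 and Add2.
STAGE_UNITS = [3, 3, 3, 1, 1]


def utilization(n_pixels, n_cycles):
    """Fraction of unit-cycles doing useful work over n_cycles.

    Pixel k enters stage 0 on clock k and moves one stage per clock,
    so on clock `clk` stage `s` holds pixel clk - s (if it exists).
    """
    busy = 0
    for clk in range(n_cycles):
        for stage, units in enumerate(STAGE_UNITS):
            pixel = clk - stage
            if 0 <= pixel < n_pixels:
                busy += units
    return busy / (sum(STAGE_UNITS) * n_cycles)


one_pixel = utilization(n_pixels=1, n_cycles=5)          # 11 of 55
five_pixels = utilization(n_pixels=5, n_cycles=5)        # 39 of 55 while filling
long_run = utilization(n_pixels=1000, n_cycles=1000)     # close to 1.0
```

The five-cycle window in the second table is still dominated by pipeline fill, which is why the count there is 39 of 55; the figure only approaches 100% over a long run of back-to-back pixels.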


What performance is required?

- Replication and Pipelining are not trivial to implement: make sure you need them
- Is either needed for Lab #2?
    - How would you tell?
    - Hint: each Julia image takes at most (64*64*64*7*1/50e6) ≈ 0.037s to create.
    - Is this "real-time" enough for an animation?
    - Other issues? Need to meet timing for the VGA.

What do we expect

- The previous data path is terribly inefficient if you only put one pixel through at a time, but doing multiple pixels at once is very complicated
- As an alternative you can use one multiplier and put your x*x, y*y, and x*y through it in a pipelined manner.
    - What's the efficiency? Is it a good tradeoff for area/speed?
- This analysis is critical!
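The hint's arithmetic is worth working through once. The sketch below assumes the factors mean a 64 x 64 pixel image, at most 64 iterations per pixel, 7 clock cycles per iteration, and a 50 MHz clock; the slide gives only the expression, so those labels are an interpretation.

```python
WIDTH = 64              # pixels per row (assumed meaning of the first 64)
HEIGHT = 64             # rows (assumed meaning of the second 64)
MAX_ITERATIONS = 64     # iteration limit per pixel (assumed)
CYCLES_PER_ITER = 7     # clock cycles per iteration (assumed)
CLOCK_HZ = 50e6         # 50 MHz clock


def worst_case_frame_seconds():
    """Time to compute one image if every pixel runs to the iteration limit."""
    cycles = WIDTH * HEIGHT * MAX_ITERATIONS * CYCLES_PER_ITER
    return cycles / CLOCK_HZ


frame_s = worst_case_frame_seconds()
frames_per_second = 1.0 / frame_s
```

Under these assumptions the worst case is roughly 37ms per image, or about 27 images per second, which gives a concrete number to weigh against the question of whether extra pipelining or replication is needed.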


Pipeline Performance Analysis

- Single multiplier: put x² through first, then y², then x·y.

    Clk  Ma    Mb    Mc    Add1  Add2
    1    x²
    2    y²    x²
    3    x·y   y²    x²
    4          x·y   y²
    5                x·y
    6                      •
    7                            •

- In 7 cycles we use 11/35 functional units = 31%
- But we only have 1 multiplier
- How much space do we save? What is most important?

Pipeline Performance Conclusions

- You need to know your algorithm and what tradeoffs you are making
- What do you care about?
    - Speed?
    - Area?
    - Both. (Power is a function of speed and number of transistors, i.e., area.)
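The single-multiplier schedule above can be generated and counted in code, which makes it easy to compare against the three-multiplier design. This is an illustrative sketch; the function and constant names are hypothetical, and it models one iteration of one pixel only.

```python
MULT_STAGES = ["Ma", "Mb", "Mc"]
UNITS = MULT_STAGES + ["Add1", "Add2"]
PRODUCTS = ["x*x", "y*y", "x*y"]


def single_multiplier_schedule():
    """Return {(clk, unit): operation} for one iteration, clk starting at 1."""
    busy = {}
    # Product i enters Ma on clock 1 + i and advances one stage per clock.
    for i, product in enumerate(PRODUCTS):
        for depth, unit in enumerate(MULT_STAGES):
            busy[(1 + i + depth, unit)] = product
    # The adds can only start once the last product has left Mc.
    last_mult_clk = len(PRODUCTS) + len(MULT_STAGES) - 1
    busy[(last_mult_clk + 1, "Add1")] = "add1"
    busy[(last_mult_clk + 2, "Add2")] = "add2"
    return busy


def utilization(busy):
    """Busy unit-cycles divided by all unit-cycles in the schedule."""
    n_cycles = max(clk for clk, _unit in busy)
    return len(busy) / (n_cycles * len(UNITS))


schedule = single_multiplier_schedule()
single_util = utilization(schedule)
```

With one pixel at a time, this design uses 5 units for 7 cycles where the three-multiplier design uses 11 units for 5 cycles, so the tradeoff is two extra cycles per iteration in exchange for fewer than half the functional units.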

Zooming

- An arbitrary portion of the screen can be described in many ways. Here are two:
    - Xmin, Xmax, Ymin, Ymax
        - Requires dividing by the number of pixels
    - Xorigin, Yorigin, scale
        - Requires a fixed number of pixels
- Hints for Zooming:
    - Have registers with the X, Y origins and increment/decrement them with the up/down, left/right buttons
    - Have a scale register which goes up/down with the zooming in/out
    - When converting from 0..63 to -2.00 to 2.00 use the scale and origin to calculate the new value
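The origin/scale scheme can be sketched in a few lines. The model below makes several assumptions not stated on the slide: the origin is the coordinate of pixel 0 (the corner rather than the center), the scale is the coordinate step per pixel, and zooming halves or doubles that step so that in hardware it could be a shift. The `View` class and its method names are hypothetical.

```python
from dataclasses import dataclass

PIXELS = 64  # fixed screen size required by the origin/scale scheme


@dataclass
class View:
    x_origin: float = -2.0        # coordinate of pixel column 0
    y_origin: float = -2.0        # coordinate of pixel row 0
    scale: float = 4.0 / PIXELS   # coordinate step per pixel

    def to_coord(self, px, py):
        """Convert a pixel index in 0..63 to a coordinate in the plane."""
        return (self.x_origin + px * self.scale,
                self.y_origin + py * self.scale)

    def pan(self, dx_pixels, dy_pixels):
        """Move the origin by a whole number of pixels (button presses)."""
        self.x_origin += dx_pixels * self.scale
        self.y_origin += dy_pixels * self.scale

    def zoom_in(self):
        """Halve the step per pixel, keeping the center of the view fixed."""
        half = PIXELS // 2
        self.x_origin += half * self.scale / 2
        self.y_origin += half * self.scale / 2
        self.scale /= 2
```

For example, a default `View()` maps pixel (0, 0) to (-2.0, -2.0) and pixel (32, 32) to (0.0, 0.0); after one `zoom_in()` the center pixel still maps to (0.0, 0.0) while pixel (0, 0) maps to (-1.0, -1.0). No division by the pixel count is needed at any point, which is the advantage of this representation over Xmin/Xmax.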


Lecture 8 Key Points

- Pipelining increases the clock speed but decreases the amount of work per clock
- Parallelism is easy except for resource conflicts
- Logistics
    - Lab 2 Prelab due Friday by 5pm; email the URL to Joel
    - Visiting lecturer next Monday; the contents will be on the quiz
