Lecture 8: More Pipelining

David Black-Schaffer <a href="mailto:[email protected]" target="_blank">[email protected] </a>EE183 Spring 2003

Overview
<ul>
<li>Getting Started with Lab 2
<ul>
<li>Just get a single pixel calculating at one time</li>
<li>Then look into filling your pipeline</li>
</ul></li>
<li>Multipliers
<ul>
<li>Different options for pipelining: what do you need?</li>
<li>3 multipliers, or put x*x, y*y, and x*y through sequentially?</li>
</ul></li>
<li>Pipelining
<ul>
<li>If it won’t fit in one clock cycle you have to divide it up so each stage will fit</li>
<li>The control logic must be designed with this in mind</li>
<li>Make sure you need it</li>
</ul></li>
</ul>
EE183 Lecture 8 - Slide 2

Logistics
<ul>
<li>Lab 2 Prelab due Friday by 5pm</li>
<li>Guest lecture next Monday: Synchronization and Metastability
<ul>
<li>These are critical for high-speed systems and anything where you’ll be connecting across clock domains.</li>
</ul></li>
</ul>

Public Service Announcement
<ul>
<li>Xilinx Programmable World
<ul>
<li>Tuesday, May 6th</li>
<li><a href="/goto?url=http://www.xilinx.com/events/pw2003/index.htm" target="_blank">http://www.xilinx.com/events/pw2003/index.htm </a></li>
</ul></li>
<li>Guest Lectures
<ul>
<li>Monday, April 28th: Ryan Donohue on Metastability and Synchronization</li>
<li>Wednesday, May 7th: Gary Spivey on ASIC &amp; FPGA Design for Speed</li>
<li>SHOW UP! (please)</li>
</ul></li>
<li>The content of these lectures will be on the Quiz</li>
</ul>

Easier FSMs
<pre>
always @(button or current_state)
begin
  // Defaults: do this if nothing else is specified.
  // (Blocking assignments, since this block is combinational.)
  write_en = 0;
  out = 1;                    // "output" on the slide; renamed because output is a Verilog keyword
  next_state = `START_STATE;

  case (current_state)
    `START_STATE: begin
      write_en = 1;
      // Note that the else is not specified. What is the next state?
      // (The default above applies: `START_STATE.)
      if (button) next_state = `WAIT_STATE;
    end
    `WAIT_STATE: begin
      out = 0;
    end
  endcase
end
</pre>
Be careful! A signal left unassigned on some path through a combinational block is an easy way to infer latches; the defaults at the top avoid that.

Data Path
[Figure: the data path. Xn and Yn, together with the Mandelbrot/Julia inputs, pass through two muxes into three multipliers, then a subtractor, a shift left by 1, and adders that produce Xn+1 and Yn+1. A comparator outputs 1 if its input is greater than 4. Xn+1 and Yn+1 feed back for the next iteration.]
What do we do if the whole data path doesn’t fit in one clock cycle?

Pipelining Example 1 and Pipelining Example 2
[Figures: the data path annotated with unit delays (muxes 2ns, multipliers 12ns, subtractor and adders 6ns, shift 1ns, compare 3ns) and with pipeline registers inserted. Critical-path labels on the figures: "12ns 2ns", "24ns" (Example 1), and "12ns + ??ns" (Example 2). Both figures ask of the feedback path for the next iteration: "How long does this wire take?"]

Pipelining Example 3
[Figure: the data path split by two sets of pipeline registers. Critical-path labels, in order: "12ns 2ns", "6ns", and "6ns + ??ns" on the stage that includes the feedback for the next iteration.]

Latency and Throughput
Not pipelined: Input, Comb. Logic, Output. 35ns critical path, 1 stage.
<table>
<tr><th>Time</th><th>Input</th><th>Output</th></tr>
<tr><td>t=0</td><td>A1</td><td></td></tr>
<tr><td>t=35ns</td><td>A2</td><td>A1</td></tr>
<tr><td>t=70ns</td><td>A3</td><td>A2</td></tr>
<tr><td>t=105ns</td><td></td><td>A3</td></tr>
</table>
Pipelined: Input, pipeline registers, S1, pipeline registers, Output. Under 20ns critical path, 3 stages.
<table>
<tr><th>Time</th><th>Input</th><th>S1</th><th>Output</th></tr>
<tr><td>t=0</td><td>A1</td><td></td><td></td></tr>
<tr><td>t=20ns</td><td>A2</td><td>A1</td><td></td></tr>
<tr><td>t=40ns</td><td>A3</td><td>A2</td><td>A1</td></tr>
<tr><td>t=60ns</td><td></td><td>A3</td><td>A2</td></tr>
<tr><td>t=80ns</td><td></td><td></td><td>A3</td></tr>
</table>
The first output takes longer because it has to go through all the stages, but subsequent results can come out every clock cycle.

Key points on Pipelining
<ul>
<li>Increases utilization for operators
<ul>
<li>You can do multiple calculations at once so you can use everything maximally (ideally)</li>
<li>This is the point!</li>
</ul></li>
<li>Store the results from smaller calculations to make the overall calculation faster.
<ul>
<li>Insert the next data item into the datapath before the previous one has finished</li>
<li>The pipe registers keep the computation separate</li>
<li>You will have a lot of pipe registers if you’re doing a lot of calculations (i.e., Lab 3!)</li>
</ul></li>
<li>What is the effect of the algorithm feeding back on itself?
<ul>
<li>Do all points have the same number of iterations? (control)</li>
<li>Is the data dependent between pipeline stages? (hazards)</li>
</ul></li>
</ul>

Multipliers
<ul>
<li>CoreGen gives you several pipelining options</li>
<li>Which is best?
<ul>
<li>Depends on your design</li>
<li>Data size will determine speed and pipelining</li>
</ul></li>
<li>Design is an iterative process so you won’t be able to choose the best approach at first (i.e., get started early!)</li>
</ul>

[Figure: the data path with pipelined multipliers. Pipeline registers follow the muxes, each multiplier stage, and the subtract/shift/add stage.]

Multiplier Issues
<ul>
<li>Multipliers are BIG
<ul>
<li>How can we get away with fewer multipliers?</li>
</ul></li>
<li>Multipliers may be SLOW
<ul>
<li>How can we utilize them maximally?</li>
</ul></li>
</ul>
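One possible answer to "how can we get away with fewer multipliers?" is to time-share a single multiplier between the three products. The block below is a minimal sketch under stated assumptions, not the lab's required design: the module name `shared_mult`, the port names, the `WIDTH` parameter, and the 2-bit `phase` counter are all hypothetical, and the multiplier here is a plain combinational `*` rather than a pipelined CoreGen core. It also assumes `x` and `y` are held stable for the three cycles of one pass.

```verilog
// Hypothetical sketch: one multiplier shared by x*x, y*y and x*y.
// Names, WIDTH and the phase counter are assumptions for illustration.
// Assumes x and y stay stable for the three cycles of one pass.
module shared_mult #(parameter WIDTH = 16) (
    input  wire                      clk,
    input  wire                      rst,   // synchronous reset
    input  wire signed [WIDTH-1:0]   x,
    input  wire signed [WIDTH-1:0]   y,
    output reg  signed [2*WIDTH-1:0] xx,    // x*x, captured in phase 0
    output reg  signed [2*WIDTH-1:0] yy,    // y*y, captured in phase 1
    output reg  signed [2*WIDTH-1:0] xy,    // x*y, captured in phase 2
    output reg                       done   // high for one cycle once all three are captured
);
    reg [1:0] phase;                         // 0: x*x, 1: y*y, 2: x*y

    // Operand muxes: choose what the single multiplier sees this cycle.
    reg signed [WIDTH-1:0] a, b;
    always @(*) begin
        case (phase)
            2'd0:    begin a = x; b = x; end
            2'd1:    begin a = y; b = y; end
            default: begin a = x; b = y; end
        endcase
    end

    // The only multiplier in the design.
    wire signed [2*WIDTH-1:0] product = a * b;

    // Capture one product per clock and step to the next phase.
    always @(posedge clk) begin
        done <= 1'b0;
        if (rst) begin
            phase <= 2'd0;
        end else begin
            case (phase)
                2'd0:    begin xx <= product; phase <= 2'd1; end
                2'd1:    begin yy <= product; phase <= 2'd2; end
                default: begin xy <= product; phase <= 2'd0; done <= 1'b1; end
            endcase
        end
    end
endmodule
```

The tradeoff is the one the slide asks about: roughly a third of the multiplier area, in exchange for three clocks per set of products instead of one.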
[Figure: the data path with each of the three multipliers split into three pipeline stages (M1a, M2a, M3a; M1b, M2b, M3b; M1c, M2c, M3c), followed by two more stages (Add1 and Add2), with pipeline registers between stages.]

Now we have…
<ul>
<li>With a 3-stage multiplier you’ve now got 5 pipeline stages</li>
<li>How can you keep the pipeline full?</li>
<li>How many things do you need to calculate at once?</li>
<li>What is full? Will you ever get 100% utilization? What is good enough?</li>
</ul>
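A rough way to reason about "how many things at once" is sketched below. It is a simplified model, not something taken from the slides: the symbols S (pipeline stages), U (functional units, each used once per pass), and k (independent pixels interleaved in the pipeline) are assumptions introduced here. Because each pixel's next iteration depends on its previous result, a pixel cannot re-enter until its pass has finished, so at most k of every S cycles start a new pass.

```latex
\[
\text{utilization}(k) \;=\; \frac{k \cdot U}{S \cdot U} \;=\; \frac{k}{S},
\qquad 1 \le k \le S
\]
\[
S = 5,\; U = 11:\qquad
\text{utilization}(1) = \frac{11}{55} = 20\%,\qquad
\text{utilization}(5) = \frac{55}{55} = 100\%
\]
```

Under this model the pipeline is only full with 5 independent pixels in flight. Stalls, or pixels that finish after different numbers of iterations, pull the real figure below that.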
Pipeline Performance Analysis
<ul>
<li>With the bad data path (three 3-stage multipliers and 2 stages after that; one pixel at a time)</li>
</ul>
[Table: which of the 11 units (M1a, M2a, M3a, M1b, M2b, M3b, M1c, M2c, M3c, Add1, Add2) are busy in clock cycles 1 to 5 for a single pixel.]
<ul>
<li>In 5 cycles we used 11 units out of 55 available: 20% average utilization</li>
</ul>

Pipeline Performance Analysis
<ul>
<li>With the bad data path (three 3-stage multipliers and 2 stages after that; multiple pixels at a time)</li>
</ul>
[Table: which of the 11 units are busy in clock cycles 1 to 5 when several pixels are in the pipeline at once.]
<ul>
<li>We approach 100% utilization if there are no stalls or dependencies and we can keep getting new data</li>
</ul>

What performance is required?
<ul>
<li>The previous data path is terribly inefficient if you only put one pixel through at a time, but doing multiple pixels at once is very complicated</li>
<li>As an alternative you can use one multiplier and put your x*x, y*y, and x*y through it in a pipelined manner.
<ul>
<li>What’s the efficiency? Is it a good tradeoff for area/speed?</li>
</ul></li>
</ul>

What do we expect
<ul>
<li>Replication and pipelining are not trivial to implement; make sure you need them</li>
<li>Is either needed for Lab #2?
<ul>
<li>How would you tell?</li>
<li>Hint: each Julia image takes at most (64*64*64*7*1/50e6) ≈ 0.037s to create.</li>
<li>Is this “real-time” enough for an animation?</li>
</ul></li>
<li>This analysis is critical!</li>
<li>Other issues? Need to meet timing for the VGA.</li>
</ul>

Pipeline Performance Analysis
<ul>
<li>Single multiplier: put x*x through first, then y*y, then x*y.</li>
</ul>
[Table: which of the 5 units (Ma, Mb, Mc, Add1, Add2) are busy in clock cycles 1 to 7 as the three products pass through the multiplier one after another.]
<ul>
<li>In 7 cycles we use 11/35 functional units = 31%</li>
<li>But we only have 1 multiplier</li>
<li>How much space do we save?</li>
</ul>

Pipeline Performance Conclusions
<ul>
<li>You need to know your algorithm and what tradeoffs you are making</li>
<li>What do you care about?
<ul>
<li>Speed?</li>
<li>Area?</li>
<li>Both.</li>
</ul></li>
<li>What is most important?</li>
<li>(Power is a function of speed and number of transistors, i.e., area.)</li>
</ul>

Zooming
<ul>
<li>An arbitrary portion of the screen can be described in many ways. Here are two:
<ul>
<li>Xmin, Xmax, Ymin, Ymax: requires dividing by the number of pixels</li>
<li>Xorigin, Yorigin, scale: requires a fixed number of pixels</li>
</ul></li>
<li>Hints for Zooming:
<ul>
<li>Have registers with the X, Y origins and increment/decrement them with the up/down, left/right buttons</li>
<li>Have a scale register which goes up/down with the zooming in/out</li>
<li>When converting from 0..63 to -2.00 to 2.00, use the scale and origin to calculate the new value</li>
</ul></li>
</ul>

Lecture 8 Key Points
<ul>
<li>Pipelining increases the clock speed but decreases the amount of work per clock</li>
<li>Parallelism is easy except for resource conflicts</li>
<li>Logistics
<ul>
<li>Lab 2 Prelab due Friday by 5pm; email URL to Joel</li>
<li>Visiting lecturer next Monday; contents will be on the quiz</li>
</ul></li>
</ul>
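The zooming hints above (origin registers, a scale register, and converting a 0..63 pixel index to a coordinate) could be sketched for one axis as follows. This is an illustration under assumptions, not the lab solution: the module name `zoom_coord`, the port names, the Q4.12 fixed-point format (4 integer bits, 12 fraction bits), the one-pulse-per-step button inputs, and the policy of zooming by halving or doubling the per-pixel step are all hypothetical choices. With this policy the view zooms toward pixel 0 rather than toward the center.

```verilog
// Hypothetical sketch: map a pixel index (0..63) to a fixed-point coordinate
// using an origin register and a scale register. Names, widths and the
// Q4.12 format are assumptions for illustration.
module zoom_coord (
    input  wire               clk,
    input  wire               rst,       // synchronous reset to the full -2.0..2.0 view
    input  wire               move_neg,  // e.g. "left" button, one pulse per step
    input  wire               move_pos,  // e.g. "right" button, one pulse per step
    input  wire               zoom_in,   // one pulse per zoom step
    input  wire               zoom_out,
    input  wire        [5:0]  pixel,     // 0..63
    output wire signed [15:0] coord      // Q4.12 coordinate of this pixel
);
    reg signed [15:0] origin;  // coordinate of pixel 0
    reg signed [15:0] scale;   // coordinate step per pixel

    always @(posedge clk) begin
        if (rst) begin
            origin <= -16'sd8192;  // -2.0 in Q4.12
            scale  <=  16'sd256;   // 4.0 / 64 pixels = 0.0625 in Q4.12
        end else begin
            // Pan by one pixel's worth of the current scale.
            if (move_pos)      origin <= origin + scale;
            else if (move_neg) origin <= origin - scale;
            // Zoom by halving or doubling the per-pixel step, within limits.
            if (zoom_in && scale > 16'sd1)         scale <= scale >>> 1;
            else if (zoom_out && scale < 16'sd256) scale <= scale <<< 1;
        end
    end

    // coord = origin + pixel * scale; zero-extending pixel keeps the math signed.
    assign coord = origin + $signed({1'b0, pixel}) * scale;
endmodule
```

Because the number of pixels is fixed at 64, no divider is needed, which is the advantage of the origin-and-scale description over the min/max one. One instance per axis would be used, driven by the up/down and left/right buttons respectively.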
