Understanding performance numbers in Integrated Circuit Design Oprecomp summer school 2019, Perugia Italy 5 September 2019
Frank K. Gürkaynak [email protected]
Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 2/74 Who Am I?
Born in Istanbul, Turkey
Studied and worked at: Istanbul Technical University, Istanbul, Turkey; EPFL, Lausanne, Switzerland; Worcester Polytechnic Institute, Worcester MA, USA
Since 2008: Integrated Systems Laboratory, ETH Zurich (Director, Microelectronics Design Center; Senior Scientist, group of Prof. Luca Benini)
Interests: Digital Integrated Circuits, Cryptographic Hardware Design, Design Flows for Digital Design, Processor Design, Open Source Hardware
What Will We Discuss Today?
Introduction Cost Structure of Integrated Circuits (ICs)
Measuring performance of ICs Why is it difficult? EDA tools should give us a number
Area How do people report area? Is that fair?
Speed How fast does my circuit actually work?
Power These days much more important, but also much harder to get right
System Design Requirements
System Requirements: Functionality, Performance, Cost
Functionality determines what the system will do
The performance establishes the solution space
Finally, the cost sets a limit to what is possible
The Focus of this Talk is on Digital Systems
Custom Integrated Circuits, ASICs: Best performance, cheap for mass market. Difficult to design, long manufacturing time.
Programmable Logic, FPGAs: Very flexible, simpler design flow than ASICs, many IPs. High unit cost, not as fast as ASICs.
Processors
Microcontrollers: Relatively easy to program, includes peripherals, cheap. Poor performance.
General Purpose Processors: Simple to program, no need to know hardware design. Sequential processing, needs many peripherals, high cost.
Parallel/Graphics Processors: Allow parallel processing. High cost, no peripherals, not so easy to program.
Digital Solutions are Cost Efficient
For example:
Main motivation to move from VHS to DVD was cost!
1 Introduction
2 Cost: Cost Structure, Practical Limits, The Need for Test, Moore’s Law
3 Design Flow
4 Area
5 Speed
6 Area/Speed Trade-offs
7 Power
8 Conclusions

How Much Does a Chip Cost?
Non-recurring costs
Engineering: Designers are not cheap (approx. 100 k$/yr)
IP costs: Some of the IP is bought
EDA tools + Infrastructure: EDA tools are expensive, e.g. 100 k$/yr/seat
Masks for production: Chips may need 20-50 masks. Technology dependent, may cost 50 k$-5,000 k$
Recurring costs (per chip)
Silicon cost: Cost of wafer + processing
Licensing cost: Some of the IP requires a licensing fee per chip sold
Packaging: The plastic package that holds the chip + costs of bonding
Testing: Each manufactured chip needs to be tested. Costs both money and time
A Simplified Example
A 4 mm × 4 mm chip in 90 nm technology, 300 mm wafers
Silicon cost: About 4,000 dies per wafer. At 85 % yield, around 3,400 working dies. About 2,500 $ per wafer (including processing), so around 0.75 $ per working die.
Packaging + Licensing + Testing: Design dependent, from a few cents to several dollars (1.00 $ in this example)
Non-recurring costs: 200 man-months of engineering, around 2 M$. Production mask set around 1 M$. IPs + EDA tools + administration + infrastructure another 1 M$. Volume dependent, e.g. for 5,000,000 chips sold, around 0.80 $ per chip.
Adapted from H. Kaeslin, "Digital Integrated Circuit Design", Cambridge University Press, 2008
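The silicon cost figure in the example above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name `cost_per_working_die` is made up for illustration. With the slide's inputs the result comes out slightly below the rounded 0.75 $ quoted in the example.

```python
def cost_per_working_die(wafer_cost, dies_per_wafer, yield_fraction):
    """Silicon cost of one working die.

    The whole wafer has to be paid for, but only the working dies
    can be sold, so the wafer cost is shared among the good dies only.
    """
    working_dies = dies_per_wafer * yield_fraction
    return wafer_cost / working_dies


# Inputs taken from the slide: 2,500 $ per wafer, about 4,000 dies, 85 % yield.
silicon_cost = cost_per_working_die(
    wafer_cost=2500.0, dies_per_wafer=4000, yield_fraction=0.85
)
print(f"silicon cost per working die: {silicon_cost:.2f} $")
```

Lowering the yield or the number of dies per wafer (a larger die) directly raises the cost per working die, which is why die area and yield dominate the recurring silicon cost.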
The Key to IC Design Success is Volume
Previous example, cost of one chip
If we sell 5,000,000 chips: 4 M$ non-recurring costs, 0.80 $ per chip. Total chip cost 2.55 $ (31 % non-recurring)
If we sell 5,000 chips: 4 M$ non-recurring costs, 800 $ per chip. Total chip cost 801.75 $ (99.78 % non-recurring)
If we sell 500,000,000 chips: 4 M$ non-recurring costs, 0.008 $ per chip. Total chip cost 1.76 $ (0.5 % non-recurring)
Adapted from H. Kaeslin, "Digital Integrated Circuit Design", Cambridge University Press, 2008
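The three volume scenarios above follow from one formula: the non-recurring costs are divided by the number of chips sold and added to the recurring cost per chip. The sketch below assumes the 4 M$ non-recurring total and the 1.75 $ recurring cost (0.75 $ silicon + 1.00 $ packaging, licensing and testing) from the simplified example; the function name `cost_per_chip` is hypothetical.

```python
def cost_per_chip(volume, nre_total=4_000_000.0, recurring_per_chip=1.75):
    """Return (total cost per chip, non-recurring share of that cost).

    nre_total:          non-recurring costs in $ (engineering, masks, IP, tools)
    recurring_per_chip: silicon + packaging + licensing + testing in $
    """
    nre_per_chip = nre_total / volume
    total = nre_per_chip + recurring_per_chip
    return total, nre_per_chip / total


for volume in (5_000, 5_000_000, 500_000_000):
    total, nre_share = cost_per_chip(volume)
    print(f"{volume:>11,d} chips: {total:8.2f} $ per chip, "
          f"{nre_share:6.2%} non-recurring")
```

At low volume the chip price is almost entirely non-recurring cost, while at very high volume it approaches the recurring cost per chip.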
How Can We Make Chips Cheaper?
Non-recurring costs: Sell more chips!
Engineering: Buy IP (increases IP costs)
IP costs: Redesign it yourself (increases engineering costs)
EDA tools + Infrastructure: Outsource to design houses
Masks for production: Use fewer masks. Use older (cheaper) technologies
Recurring costs (per chip)
Silicon cost: Reduce die area
Licensing cost: Do not buy IP (increases engineering cost)
Packaging: Simpler packages, reduce number of I/Os, thermal requirements
Testing: Reduce testing time, improve DFT
How are Individual Dies Cut From a Wafer?
Dicing is a mechanical process. There needs to be some space (typically 50-200 µm) between dies on a wafer so that we can reliably cut them.
The sawline sets a lower limit for the practical die size. If the sawline is 100 µm and the die is 1 mm × 1 mm in size, the sawlines add more than 20 % to the silicon area needed per die.
Images taken from http://wonderfulworldofwafers.wordpress.com/
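The sawline overhead can be estimated with a small calculation. The sketch below assumes that exactly one sawline width separates neighboring dies, so each die effectively occupies a (width + sawline) × (height + sawline) rectangle on the wafer; the function name `sawline_fraction` is made up for illustration. Note that the result depends on how the waste is counted: for a 1 mm × 1 mm die and a 100 µm sawline the overhead is about 21 % relative to the die area, or about 17 % of the total wafer area.

```python
def sawline_fraction(die_width_mm, die_height_mm, sawline_mm):
    """Fraction of the wafer area lost to sawlines.

    Assumes one sawline width between adjacent dies, so every die
    claims a rectangle of (width + sawline) x (height + sawline).
    """
    pitch_area = (die_width_mm + sawline_mm) * (die_height_mm + sawline_mm)
    die_area = die_width_mm * die_height_mm
    return 1.0 - die_area / pitch_area


for side_mm in (0.5, 1.0, 4.0):
    lost = sawline_fraction(side_mm, side_mm, sawline_mm=0.1)
    print(f"{side_mm} mm die: {lost:.1%} of the wafer area goes to sawlines")
```

The loss shrinks quickly as the die grows, which is why sawlines only become a real problem for very small dies.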
The Story of Yield
Wafer: Single wafer during production
Manufacturing defects: Inevitably there will be some defects on the wafer
If a large die is used, only a few dies can be manufactured per wafer, and the area around them is wasted
In this example 4 out of 5 dies are defective. Yield = 1/5 = 20 %
Smaller dies improve yield: Yield = 7/12 = 58 %
The smaller the dies, the better the yield: Yield = 55/60 = 92 %
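The effect shown in the yield figures can be mimicked with a toy model: a fixed set of point defects on a square region, cut into dies of different sizes. This is only an illustrative sketch; the square "wafer", the defect coordinates and the function name `yield_for_die_size` are all assumptions and not data from the slides.

```python
def yield_for_die_size(defects, wafer_side, die_side):
    """Fraction of defect-free dies on a square region cut into square dies.

    defects:    list of (x, y) defect positions, 0 <= x, y < wafer_side
    wafer_side: side length of the (simplified, square) wafer region
    die_side:   side length of one die; must divide wafer_side evenly
    """
    dies_per_side = wafer_side // die_side
    # A die is lost as soon as at least one defect falls inside it.
    bad_dies = {(int(x // die_side), int(y // die_side)) for x, y in defects}
    total_dies = dies_per_side * dies_per_side
    return (total_dies - len(bad_dies)) / total_dies


# Hypothetical defect positions on a 12 x 12 region.
DEFECTS = [(1.0, 1.0), (7.5, 2.0), (4.2, 9.1), (10.3, 10.8)]

for die_side in (6, 3, 1):
    y = yield_for_die_size(DEFECTS, wafer_side=12, die_side=die_side)
    print(f"die side {die_side}: yield {y:.0%}")
```

The same four defects ruin every die when the dies are large, but only a small fraction of them when the dies are small.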
You Need To Test Every Chip You Manufacture
A certain number of chips will always be defective
Accepting this keeps production costs down: 100 % yield is not economically feasible
Yield typically improves over time for a specific process
Cutting-edge technologies may have yields as low as 10 %
Established technologies have 90 % or more yield
The smaller the dies, the higher the yield
Testing can be expensive and time consuming
A typical tester costs more than 1 M$
Test costs are usually expressed per pin and per unit of time
Make sure you spend as little time as possible on the tester
Test at speed within the chip (Built-in Self-Test)
Most testing is outsourced to dedicated test houses
Moore’s Law: 2x Transistors Every 18 Months
[Figure: number of transistors (logarithmic axis, 1k to 10G) versus year (1970 to 2015) for Intel processors, plotted against the Moore’s Law trend line. Source: http://www.intel.com/content/www/us/en/history/history-intel-chips-timeline-poster.html]
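The "2x every 18 months" rule in the slide title corresponds to simple exponential growth. The following sketch only illustrates that formula; the starting count of 1,000 transistors is a hypothetical value and not a data point from the Intel plot, and the function name `projected_transistors` is made up.

```python
def projected_transistors(initial_count, years_elapsed, doubling_period_years=1.5):
    """Transistor count after years_elapsed if it doubles every doubling period."""
    return initial_count * 2 ** (years_elapsed / doubling_period_years)


# Hypothetical starting point of 1,000 transistors.
for years in (3, 15, 30):
    count = projected_transistors(1_000, years)
    print(f"after {years:2d} years: about {count:,.0f} transistors")
```

Changing `doubling_period_years` shows how sensitive long-term projections are to the assumed doubling period.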
What Moore’s Law Doesn’t Say
Chips are not necessarily getting smaller
We can put more transistors per unit area
But very small (and very large) chips are not feasible
We need to find a use for more transistors
How long will it hold up?
Has been predicted to end multiple times (even by Moore himself)
Self-fulfilling prophecy, sets a goal for the industry
Process options become more complex and more expensive
Physical limits (single atom) are approaching fast
Chips are not necessarily getting faster
Side effect of scaling: smaller transistors switch faster
Power density issues: voltage is reduced
Reduced voltage reduces operating speed
Trade-off between speed and power
Die Area from a Business Perspective
Type          Approximate Size      Notes
very small    < 1 mm²               Sawlines are a problem
small         1 mm² - 10 mm²        Low die cost
large         10 mm² - 100 mm²      More design effort required
very large    > 100 mm²             Yield an issue
If dies are very small, silicon area is wasted on sawlines
If dies are very large, yield decreases and die cost increases
Sweet spot for die area is between 1 mm² and 100 mm²
Not all chips will profit equally from Moore’s Law
If your chip is very small, making it even smaller will not reduce costs
1 Introduction
2 Cost
3 Design Flow: Academic vs Industrial Goals
4 Area
5 Speed
6 Area/Speed Trade-offs
7 Power
8 Conclusions

The Ultimate Goal of System Designers/Managers
Powerpoint to Chip (GDSII)
Why doesn’t this work yet? Perhaps there is a conspiracy? Maybe we are lazy! It could be that we want to keep our jobs.
.. or it is actually not so easy.
From Idea to ASIC: The Design Flow
Front-end Design Flow
Design: Coming up with an architecture
Design Entry: Description in HDL
Verification: Making sure it does what it is supposed to
Synthesis: Mapping to target technology
Scan Insertion: Adding structures for testing
Back-end Design Flow
Placement: Putting all logic gates / macros on chip
Clock Tree Insertion: Distributing clock signal on the chip
Routing: Establishing connections on the chip
Timing Verification: Making sure the setup/hold times are met
Chip Finishing: Pads, logos, seal ring
DRC and LVS: Making sure that the physical layout is correct
Confidence in the Results Increases with Flow
[Figure: design flow from Specifications through Architecture (GMU, ETH Zurich), RTL Description (VHDL), Synthesis with constraints and with a wireload model (Synopsys DC), and Place and Route (Cadence EDI) to the ASIC (UMC65nm), shown alongside an "Accuracy of Results" scale running between Low and High.]
Design Specifications
A List of Requirements
Function: What exactly is expected from the chip?
Performance: How fast, what area, how much power?
I/O requirements: How will the ASIC fit together with the rest of the system?
Synthesize HDL into Gates
Acquiring Physical Properties
Technology: Decided by choosing the target library
Area: Gates occupy area
Speed: Propagation through gates requires time
Power: Gates consume power during switching and when idle
Floorplanning
How will the chip look?
Determine area/geometry of the chip
Place macro and I/O cells
Design the power connections
Leave room for routing optimizations
Iterative process: the perfect floorplan cannot be determined from the beginning
[Figure: example chip floorplan showing the ring of I/O, VDD and GND pads, power routing, two 64x32b RAM macros, a reference design and interface, student design blocks with local clock generators, a routing channel and glue logic for the top-level design, with block dimensions given in µm.]
Placement is an NP-Hard Problem
Routing Determines the Parasitic Load
What Are Performance Parameters?
Area: Total area occupied by circuit
Cost: Total silicon cost
Throughput: Number of data items processed per unit time
Latency: Amount of time to process a given data item
Power: Peak power for operation
Energy: Total energy to finish task
Time to Market: How long the design will take
Man Months to Design: How much engineers cost
These are related, but they are not the same
Important to know which one we are really interested in!
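A small numeric sketch can show why the paired parameters above are related but not the same. All numbers, design names and function names below are hypothetical and chosen only for illustration: two designs with identical latency but different throughput, and two operating points where the lower-power option needs more energy for the task.

```python
def latency_s(cycles_per_item, clock_hz):
    """Time from the start of one data item until its result is ready."""
    return cycles_per_item / clock_hz


def throughput_items_per_s(cycles_between_items, clock_hz):
    """Number of data items completed per second."""
    return clock_hz / cycles_between_items


def energy_j(power_w, task_time_s):
    """Energy needed to finish a task at a constant power draw."""
    return power_w * task_time_s


CLOCK_HZ = 100e6  # hypothetical 100 MHz clock

# Iterative design: one item occupies the circuit for 4 cycles.
iterative = (latency_s(4, CLOCK_HZ), throughput_items_per_s(4, CLOCK_HZ))
# 4-stage pipeline: same 4-cycle latency, but a new item starts every cycle.
pipelined = (latency_s(4, CLOCK_HZ), throughput_items_per_s(1, CLOCK_HZ))

# Lower power does not imply lower energy if the task takes longer.
fast_energy = energy_j(power_w=0.100, task_time_s=1.0)
slow_energy = energy_j(power_w=0.050, task_time_s=3.0)

print(f"iterative: latency {iterative[0] * 1e9:.0f} ns, {iterative[1] / 1e6:.0f} M items/s")
print(f"pipelined: latency {pipelined[0] * 1e9:.0f} ns, {pipelined[1] / 1e6:.0f} M items/s")
print(f"fast: {fast_energy:.3f} J, slow: {slow_energy:.3f} J")
```

Reporting only one number of each pair (for example throughput without latency, or power without energy) can therefore make very different designs look equivalent.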
How to Improve Performance
Use a more advanced technology
Easy way
Not always feasible
Costs will increase
Additional problems may arise (adapting, lack of IPs)
Make a better implementation
Harder, needs smart (expensive) engineers
Not always possible, there is a limit to what can be done
Academia should concentrate on this and show what the limit is
Academic vs Industrial IC Design
Both of them: have similar design flows, use the same tools, define performance similarly
..but in Industry
The chips need to work within specifications
Goal is to sell a product
Cost and time to design are important
Manufacturability is paramount
Mostly conservative design practices are used
EDA Tools are Designed for Industry Requirements
For the industry:
Circuit has to function to specification even in worst conditions
Constraints are used to define specifications
EDA tools make sure circuit performance meets specifications
Not designed to get peak performance
In academia:
Interested only in the limits (fastest, smallest etc.)
Performance numbers needed to show how we compare against others
It is not easy to get fair numbers from these tools
Constraints are misused to explore design space
Real-World Problems will Affect Performance
Parasitic RLC effects on timing
Clock distribution (skew, jitter)
Cross talk, signal integrity
I/O speed limitations
IR drop (more area for power routing)
Maximum current density limits (more area for power routing)
PVT variations (die-to-die, intra-die), reliability
Thermal issues, temperature variation during operation
All of these problems have solutions... but they come at a cost!
Academic Reporting Neglects Some Problems
Overhead of solutions not always included. For example:
Large I/O bandwidth (serial I/O, pads, buffers)
Very fast clocks (PLLs, costly clock tree)
Dynamic Voltage/Frequency Scaling (synchronizers, level shifters)
Additional verification effort. Circuits with many operational modes or interfaces to different clock domains can be costly in terms of verification regardless of the circuit size
Testing overhead. Some circuits are difficult to test (e.g. asynchronous circuits), or require extensive test modes
Quality of Results Depends on Design State
Different levels, different accuracy
Synthesis:
  First level with physical properties
  Routing overhead unknown, estimated by wireload models
  Timing and power models are mean values, not that accurate
Post-layout:
  Routing overhead known, more accurate timing
  Not all post-layout designs include proper provisions for power, signal integrity, testing
Actual measurement:
  Real proof of concept
  Performance affected by practical problems, usually worse than expected
  Requires test infrastructure, costly
Throughput/Area: Post-Layout vs. Measurement
[Figure: bar chart, normalized throughput/area of all SHA-3 candidates, post-layout results (typical case, VDD = 1.2 V). GMU and ETHZ implementations of SHA-2, BLAKE, Groestl, JH, Keccak and Skein; throughput/area normalized to GMU SHA-2; bar labels in kbits/GE, ranging from 42 to 770.]
[Figure: the same chart for measurement results (average of 5 ASICs, VDD = 1.2 V); bar labels in kbits/GE, ranging from 41 to 685.]
Let Us Talk About Area
How much silicon area will be used for the circuit?
Why do we care?
  Silicon cost
  Feasibility (will it fit?)
Should be easy to determine
  Does not change after/during manufacturing
  Reliable and accurate post-layout numbers
Units
  mm²: correct unit, technology dependent
  GE: commonly used, not accurate at all
The Story of Gate Equivalents
Area of 2-input NAND gate
  Historical reasons
  Was used to compare circuits in different styles: CMOS, TTL, ECL, ...
  For CMOS, 4 transistors is 1 GE
Not easy to standardize
  Everyone can interpret it differently
Many 4-transistor gates
  Buffer, 2-input NAND, NOR, ...
  Different drive strengths
Library dependent
  Width, style, height can differ
Same Designs, Different Technologies

Technology        Library          Design A          Design B
Foundry  Node     Vendor  Tracks   [mm²]    [kGE]    [mm²]    [kGE]
C        28 nm    I       12       0.014    28.1     0.129    263.6
C        45 nm    I       9        0.016    23.2     0.168    247.3
C        45 nm    I       14       0.019    18.3     0.220    208.3
C        65 nm    I       13       0.035    17.0     0.407    195.5
A        65 nm    IV      9        0.030    21.0     0.311    216.0
D        65 nm    VI      9        0.029    20.1     0.305    211.7
A        90 nm    V       10       0.060    19.3     0.672    214.2
A        130 nm   V       8        0.104    20.3     1.057    206.5
E        130 nm   II      9        0.109    24.1     1.179    259.7
C        130 nm   I       12       0.115    19.0     1.299    214.6
B        150 nm   III     9        0.214    23.7     2.070    229.0
A        180 nm   V       9        0.211    22.5     2.151    229.5
A        180 nm   VII     9        0.198    20.5     2.056    212.5
GE Is a Very Coarse Measure for Area
Area in gate equivalents will vary at least ±10% between different technologies
In other words:
  Do not use overly precise area descriptions: 8,421 GE, 235,481 GE
  Round your results to not more than 2 significant digits: 8 kGE, 250 kGE
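As a rough check of the ±10% statement, the sketch below takes the Design A gate counts from the technology table (slide "Same Designs, Different Technologies") and reports how far the smallest and largest values lie from the mean. It then rounds a gate count to two significant digits. The helper name `round_sig` is an illustrative assumption, not part of any EDA tool.

```python
import math

# Design A gate counts in kGE, one per technology/library row of the table.
design_a_kge = [28.1, 23.2, 18.3, 17.0, 21.0, 20.1, 19.3,
                20.3, 24.1, 19.0, 23.7, 22.5, 20.5]

mean_kge = sum(design_a_kge) / len(design_a_kge)
low_pct = (min(design_a_kge) / mean_kge - 1.0) * 100.0
high_pct = (max(design_a_kge) / mean_kge - 1.0) * 100.0


def round_sig(value, digits=2):
    """Round a number to the given number of significant digits."""
    if value == 0:
        return 0
    exponent = math.floor(math.log10(abs(value)))
    factor = 10 ** (exponent - digits + 1)
    return round(value / factor) * factor


print(f"mean {mean_kge:.1f} kGE, spread {low_pct:+.0f}% / {high_pct:+.0f}%")
print(round_sig(8421), round_sig(235481))
```

For this one design the spread is well beyond ±10%, which supports the advice against reporting five or six digits. Strict two-digit rounding gives 8.4 kGE and 240 kGE; the slide rounds even more coarsely.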
What Exactly Is the Circuit Area?
The number reported by the synthesizer?
  Does not include routing, power, clock overhead
Design With and Without Power Routing
[Figure: two layouts of the same design]
Small design: 69,042 GE, area 0.719 mm². Would NOT work at speed: more than 20% IR drop
Same design, larger: 70,906 GE, area 1.183 mm². Would work fine at the specified frequency
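The two layouts can be compared with a few lines of arithmetic. The sketch below only uses the numbers printed on the slide; the variable names are illustrative.

```python
# Numbers from the two layouts on the slide.
small = {"ge": 69_042, "area_mm2": 0.719}   # without proper power routing
large = {"ge": 70_906, "area_mm2": 1.183}   # with power routing

ge_increase_pct = (large["ge"] / small["ge"] - 1.0) * 100.0
area_increase_pct = (large["area_mm2"] / small["area_mm2"] - 1.0) * 100.0

print(f"gate count:  +{ge_increase_pct:.1f}%")
print(f"layout area: +{area_increase_pct:.1f}%")
```

The gate count barely changes while the layout area grows by roughly two thirds, so a GE number alone would hide most of the cost of making the design work.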
What Exactly Is the Circuit Area?
The number reported by the synthesizer?
  Does not include routing, power, clock overhead
The post-layout gate count?
  The gate count is just the area occupied by the gates
  You need extra space for routing, signal integrity, and power routing
How Cells Are Actually Placed on Chip
[Figure: close-up of a placed and routed design]
[Figure: partially stripped layout to show cells and power routing]
[Figure: standard cells, power routing and signal routing separated]
[Figure: cell placement showing standard cells, body tie cells, buffers]
What Exactly Is the Circuit Area?
The number reported by the synthesizer?
  Does not include routing, power, clock overhead
The post-layout gate count?
  The gate count is just the area occupied by the gates
  You need extra space for routing, signal integrity, and power routing
The total die area of the manufactured chip?
  You may have some additional structures for testing in the chip
  Practical limits may require specific die sizes (Europractice mini@sic)
  What if the chip contains more than one design?
What Is the Real Area of this Block?
[Figure: one block within a larger ASIC]
[Figure: area of smallest rectangle into which the block fits completely?]
[Figure: area covering the maximum reach of the cells?]
[Figure: close-up of the same block]
[Figure: block of interest shares area with a second block (in red)]
[Figure: closer look shows cells of two blocks with empty area in between]
Area Numbers Are Approximate
Determining the area is not so easy
  Synthesis numbers do not tell the whole story
  Routing, power overhead is design dependent: can be only 10% (LFSR), can be 500% (LDPC decoder)
Post-layout numbers are more reliable
  Can still be misleading
  There are many factors (IR drop, crosstalk) that need to be taken care of
  The higher the clock rate, the more additional problems you get
The total chip area is easy to determine
  Not often the case that the whole chip is of interest
  Difficult to determine the exact area of one block within a chip
What About Speed?
How fast is my circuit?
Throughput?
  How many data items can we process per unit time
  Do more per unit time (parallelization)
  Divide the process into multiple shorter steps (pipelining)
  Do each processing step faster (increase clock frequency)
Latency?
  Amount of time required to process one data item
  Cannot be made faster by parallelization or pipelining
  Important for feedback systems
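The difference between throughput and latency can be sketched with an idealized model. The function below is an assumption made for illustration: it splits a fixed logic delay into equal pipeline stages and ignores register overhead and stage imbalance, so real circuits will do worse.

```python
def pipeline_metrics(total_delay_ns, stages, parallel_units=1):
    """Idealized latency and throughput of a pipelined, replicated datapath.

    Assumes the logic splits into equal stages with no register overhead.
    Returns (latency_ns, throughput_items_per_ns).
    """
    clock_period_ns = total_delay_ns / stages
    latency_ns = clock_period_ns * stages          # unchanged by pipelining
    throughput = parallel_units / clock_period_ns  # items completed per ns
    return latency_ns, throughput


for stages, units in [(1, 1), (4, 1), (4, 2)]:
    latency, rate = pipeline_metrics(8.0, stages, units)
    print(f"{stages} stage(s), {units} unit(s): "
          f"latency {latency:.1f} ns, throughput {rate:.3f} items/ns")
```

In this model pipelining and parallelization multiply the throughput, while the time to process a single item stays at 8 ns.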
Units That Tell Us the Speed
Clock frequency [MHz]
  Determined by the critical path
  All other things being equal, would actually compare the speed
  How much is done in one clock cycle can differ from circuit to circuit
Clock period [ns]
  1/clock frequency
  Is more physical, relates to the sum of gate delays in the critical path
Throughput [bits/s]
  How many bits are processed per unit time
  Good for streaming architectures, bad for processors
(FP) operations per second (GFLOPS) for processors
  Could be given as peak, minimum or mean values
  Depends on the complexity of the operation (e.g. NOP)
Clocks per instruction (CPI) for processors
  Reflects the parallelization of the instructions
  Needs the clock frequency to determine actual performance
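A short sketch of how these units relate. The two processors and their numbers are hypothetical and chosen only to show that clock frequency and CPI must be read together.

```python
def clock_period_ns(freq_mhz):
    """Clock period in ns for a clock frequency in MHz."""
    return 1000.0 / freq_mhz


def instructions_per_second(freq_mhz, cpi):
    """Average instruction rate from clock frequency and clocks per instruction."""
    return freq_mhz * 1e6 / cpi


# Hypothetical processors: (clock in MHz, average CPI).
processors = {"P1": (500.0, 1.0), "P2": (800.0, 2.0)}

for name, (freq, cpi) in processors.items():
    print(f"{name}: period {clock_period_ns(freq):.2f} ns, "
          f"{instructions_per_second(freq, cpi) / 1e6:.0f} M instructions/s")
```

Here the processor with the lower clock frequency executes more instructions per second, because it needs fewer cycles per instruction.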
Timing Analysis Investigates All Paths

Des/Clust/Port     Wire Load Model     Library
------------------------------------------------
sbox               enG5K               generic_lib

Point                                       Incr      Path
--------------------------------------------------------------
InpxDI[0] (in)                              0.0987    0.0987 f
i_s_lut_0100/AddrxDI[0] (sub_lut_0100)      0.0000    0.0987 f
i_s_lut_0100/U1/O (INV1S)                   0.4191    0.5178 r
i_s_lut_0100/U53/O (NR2)                    0.3683    0.8861 f
i_s_lut_0100/U83/O (NR2)                    0.4559    1.3420 r
i_s_lut_0100/U25/O (INV1S)                  0.4845    1.8265 f
i_s_lut_0100/U140/O (AOI22S)                0.2411    2.0676 r
i_s_lut_0100/U19/O (ND3S)                   0.0821    2.1497 f
i_s_lut_0100/U225/O (MOAI1S)                0.1802    2.3299 f
i_s_lut_0100/U226/O (AOI112S)               0.2397    2.5696 r
i_s_lut_0100/U227/O (OAI222S)               0.1254    2.6951 f
i_s_lut_0100/DataxDO[7] (sub_lut_0100)      0.0000    2.6951 f
U16/O (MUX2)                                0.2020    2.8971 f
OupxDO[7] (out)                             0.0000    2.8971 f
data arrival time                                     2.8971
--------------------------------------------------------------
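The Path column of the report is simply the running sum of the Incr column. The sketch below recomputes it from the incremental delays copied from the report and picks out the slowest cell. The last digit can differ from the tool output because the report prints rounded values.

```python
# (point, incremental delay in ns), copied from the timing report.
path = [
    ("InpxDI[0] (in)", 0.0987),
    ("i_s_lut_0100/AddrxDI[0] (sub_lut_0100)", 0.0000),
    ("i_s_lut_0100/U1/O (INV1S)", 0.4191),
    ("i_s_lut_0100/U53/O (NR2)", 0.3683),
    ("i_s_lut_0100/U83/O (NR2)", 0.4559),
    ("i_s_lut_0100/U25/O (INV1S)", 0.4845),
    ("i_s_lut_0100/U140/O (AOI22S)", 0.2411),
    ("i_s_lut_0100/U19/O (ND3S)", 0.0821),
    ("i_s_lut_0100/U225/O (MOAI1S)", 0.1802),
    ("i_s_lut_0100/U226/O (AOI112S)", 0.2397),
    ("i_s_lut_0100/U227/O (OAI222S)", 0.1254),
    ("i_s_lut_0100/DataxDO[7] (sub_lut_0100)", 0.0000),
    ("U16/O (MUX2)", 0.2020),
    ("OupxDO[7] (out)", 0.0000),
]

arrival = 0.0
for point, incr in path:
    arrival += incr  # accumulate delay along the path
    print(f"{point:42s} {incr:7.4f} {arrival:7.4f}")

slowest = max(path, key=lambda entry: entry[1])
print(f"data arrival time: {arrival:.4f} ns, slowest cell: {slowest[0]}")
```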
CMOS Timing Depends on Capacitive Loading
[Figure: gate-level schematic with inputs A, B and output Z] Simple example: 2-input NAND gate driving an inverter
[Figure: transistor-level schematic between VDD and GND] The gates when implemented using CMOS logic
[Figure: the same schematic with A=1, B=0] For example: when A=1, B=0, one pMOS charges up the input of the inverter
[Figure: current source Id ~ W/L charging Cg,pmos = W·L and Cg,nmos = W·L] In the ideal case, it is a current source driving two capacitors. Faster circuit = larger current source or smaller capacitance
[Figure: current source Id ~ W/L charging Cinterconnect and Cgate ~ W·L] Parasitic interconnect capacitance is a major contributor to delay. It is not possible to determine the speed if Cinterconnect is not known
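The ideal picture of a constant current source charging a capacitor gives t = C·ΔV / I. The sketch below applies this formula; all component values are hypothetical and only meant to show how the interconnect capacitance enters the delay.

```python
def charge_time_ns(c_gate_ff, c_wire_ff, swing_v, drive_ua):
    """Time for a constant current to charge the total load by swing_v.

    Units: capacitance in fF, voltage in V, current in uA.
    fF * V / uA gives ns, so no further scaling is needed.
    """
    return (c_gate_ff + c_wire_ff) * swing_v / drive_ua


# Hypothetical values: 10 fF gate load, 0.6 V swing, 100 uA drive current.
no_wire = charge_time_ns(10.0, 0.0, 0.6, 100.0)
with_wire = charge_time_ns(10.0, 10.0, 0.6, 100.0)    # wire as large as the gate load
stronger = charge_time_ns(10.0, 10.0, 0.6, 200.0)     # twice the drive current

print(f"gate load only:      {no_wire * 1000:.0f} ps")
print(f"with interconnect:   {with_wire * 1000:.0f} ps")
print(f"with stronger drive: {stronger * 1000:.0f} ps")
```

A wire capacitance equal to the gate capacitance doubles the delay in this model, which is why a delay estimate without Cinterconnect says little about the real speed.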
How Do I Know What the Parasitic Load Is?
Ignore parasitic loads
  Naive approach
  Especially in modern processes (<90 nm) parasitics are dominant
Wireload models for synthesis
  Statistical lookup tables to estimate the load
  Capacitive load as a function of the output fanout
  Circuit dependent, for best results needs to be extracted for each circuit
Parasitic extraction
  Only after placement and routing has been done
  All interconnections are known, capacitance can be extracted
  Different accuracy levels: 2D, 2.5D, full 3D
  Results are used by the timing analysis engine to calculate delay
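The idea of a wireload model can be sketched as a lookup from fanout to estimated capacitance. The table values and the extrapolation slope below are invented for illustration; a real model comes from the library vendor or is extracted from a routed version of the design, and its format differs from this simplified dictionary.

```python
# Hypothetical wireload table: fanout -> estimated wire capacitance in fF.
WIRELOAD_FF = {1: 1.5, 2: 3.2, 3: 5.1, 4: 7.3}
SLOPE_FF_PER_FANOUT = 2.4  # assumed slope for fanouts beyond the table


def estimate_wire_cap_ff(fanout):
    """Estimate the wire capacitance of a net from its fanout alone."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    if fanout in WIRELOAD_FF:
        return WIRELOAD_FF[fanout]
    largest = max(WIRELOAD_FF)
    # Extrapolate linearly past the last table entry.
    return WIRELOAD_FF[largest] + SLOPE_FF_PER_FANOUT * (fanout - largest)


for fanout in (1, 3, 6):
    print(f"fanout {fanout}: about {estimate_wire_cap_ff(fanout):.1f} fF")
```

The estimate depends only on fanout, not on where the cells end up, which is why such a table can be badly off for a particular circuit.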
Impact of Process, Voltage, Temperature Corners

                      Worst Case     Typical Case   Best Case
Voltage               1.08 V         1.2 V          1.32 V
Temperature           125 °C         27 °C          -40 °C
Critical Path         3.49 ns        2.24 ns        1.59 ns
Throughput            13.75 Gbit/s   21.42 Gbit/s   30.19 Gbit/s
Relative Performance  64.2 %         100 %          140.9 %

A crypto-algorithm implemented in the 90 nm UMC process
This is exactly the same circuit; each time a different PVT corner is loaded and the timing report is repeated
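The relative performance row follows directly from the critical path values, since throughput scales with 1/critical path. A small sketch using only the numbers from the table:

```python
# Critical path per PVT corner in ns, from the table.
critical_path_ns = {"worst": 3.49, "typical": 2.24, "best": 1.59}

relative_pct = {
    corner: round(critical_path_ns["typical"] / delay * 100.0, 1)
    for corner, delay in critical_path_ns.items()
}

for corner, pct in relative_pct.items():
    print(f"{corner:8s} {critical_path_ns[corner]:.2f} ns -> {pct:.1f} %")
```

The same netlist therefore spans more than a factor of two in speed between the worst and the best corner, so a reported frequency means little without the corner it was obtained for.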
Every Technology Has a Speed Range
Example numbers for a 180 nm process

Speed       Num. of Gates   Frequency Range   Notes
Slow        > 120           < 50 MHz          No problem
Normal      120-50          50-150 MHz        Standard design
Fast        50-20           150-300 MHz       Needs attention
Very fast   20-10           300-600 MHz       Involved design
Ultra fast  < 10            > 600 MHz         Full custom design

For every process there is a speed range that can be achieved easily
Slower circuits do not take advantage of the process capabilities
Faster circuits need disproportionately more effort for design
Accurate Timing Is Difficult to Determine
Industry is interested in corner performance
  Even in the worst case no setup violations
  Even in the best case no hold violations
  Statistical models, large variation between best and worst case
Wiring capacitance needs to be considered
  Synthesis tools rely on wireload models, which need to be customized
  Reliable estimates only after final routing
Short periods/faster clock rates are more problematic
  f = 1/T: small changes in critical path give large changes in frequency
  Problems with clock distribution (skew), IR drop and thermal issues all increase with faster clocks
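The f = 1/T remark can be made concrete with a few numbers. The sketch below adds the same 0.1 ns of extra delay to critical paths of different lengths; the periods are arbitrary examples.

```python
def freq_mhz(period_ns):
    """Clock frequency in MHz for a clock period in ns (f = 1/T)."""
    return 1000.0 / period_ns


def freq_loss_mhz(period_ns, extra_delay_ns):
    """Frequency lost when the critical path grows by extra_delay_ns."""
    return freq_mhz(period_ns) - freq_mhz(period_ns + extra_delay_ns)


for period in (10.0, 2.0, 1.0):
    loss = freq_loss_mhz(period, 0.1)
    print(f"T = {period:4.1f} ns: {freq_mhz(period):6.1f} MHz, "
          f"0.1 ns extra delay costs {loss:5.1f} MHz")
```

At a 10 ns period the extra 0.1 ns costs about 1 MHz, at a 1 ns period it costs about 91 MHz, so the same modeling error matters far more for fast designs.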
Trade-off between Area and Speed
Generally you can always make a circuit faster by sacrificing area
We say generally because:
  It does not scale infinitely, there are upper and lower limits
  Not all points are equally efficient
  AT graphs visualize this trade-off
AT Graphs Explained
[Figure: AT graph. Vertical axis: increasing area (A to 5A). Horizontal axis: increasing critical path / decreasing operating frequency (T to 8T). Curves of constant AT product are drawn; curves closer to the origin correspond to a more efficient implementation. Each point is one circuit implementation, and a set of points shows implementations with different constraints. Annotations: large variation of results, typically +/- 10%; overconstrained for speed => too large; overconstrained for area => too slow; efficient implementations lie in between.]
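Comparing points on an AT graph amounts to comparing their area-time products. The sketch below uses three hypothetical implementations of the same circuit, expressed in the graph's units of A and T.

```python
# Hypothetical implementations: name -> (area in units of A, critical path in units of T).
implementations = {
    "overconstrained for speed": (5.0, 1.0),
    "balanced": (2.0, 2.0),
    "overconstrained for area": (1.0, 7.0),
}

at_product = {name: area * time for name, (area, time) in implementations.items()}
most_efficient = min(at_product, key=at_product.get)

for name, product in at_product.items():
    print(f"{name:26s} AT = {product:.1f}")
print(f"most efficient: {most_efficient}")
```

Points with the same AT product lie on one of the constant AT curves; the implementation on the lowest curve gets the most speed per unit of area.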
Synthesis Results Depend on the HDL Code
[Figure: AES SubBytes implementations in 180 nm. Area [µm²] (0 to 25k) versus critical path [ns] (0 to 3.5). Successive builds add results for different descriptions of the same 8-bit to 8-bit function: a single LUT; multiplicative inverse plus affine transform; GF4 map and GF4 inverse; two 128-entry LUTs; four 64-entry LUTs; sixteen 16-entry LUTs. Legend: Two stage, Lookup Table, Arithmetic.]
As Well As the Version of Synthesis Software
[Figure: AES SubBytes implementations in 180 nm. Area [µm²] (0 to 25k) versus critical path [ns] (0 to 3.5).]
Synthesis Results Depend on the Parasitics
[Figure: area [kGate eq] (0 to 180) versus throughput [Gbit/s] (10, 5.0, 3.3, 2.5, 2.0, 1.6), equivalently time per bit [ns/bit] (0.1 to 0.6), with lines of constant efficiency at 25, 50, 100, 200, 500 and 1000 kbit/s/gate. Arrows mark the faster, smaller and more efficient directions. Curves for gmu SHA-2, BLAKE, Groestl, JH, Keccak and Skein. Successive builds show: synthesis run results with different timing constraints; synthesis runs with extracted wireload for different timing constraints and the resulting change in performance; and the sequence from initial synthesis to synthesis with extracted wireload to the final postlayout result.]
Example: Adding a Crypto Block
Evolved EDGE Receiver System
  Digital front end of a receiver system
  Accepts data from the RF module
  Decodes data at a rate of up to 600 kbit/s
  Want to add a crypto module that optionally generates a digest of the received messages using a hash algorithm
A Suitable SHA-3 Module
  Optimized for small area (10 kGE)
  Runs up to 350 MHz
  Is able to generate digests for messages at a rate of 24 Mbit/s
Example: Adding a Crypto Block
Digital Front End for Evolved EDGE, Receiver Chain
[Figure: simplified block diagram of the Evolved EDGE receiver system, with blocks labeled RF Interface, Decision Feedback Equalizer (DFE), Burst Memory, Detector and Decoder. Our goal: to add a SHA-3 crypto module to the end.]
Physical Properties
  Relative sizes of individual blocks, numbers in gate equivalents: 80k, 50k, 25k, 240k, 300k, 10k
  Clock rates at which each block is designed to operate: 26 MHz, 52 MHz, 104 MHz, 350 MHz
[Figure: possible placement of the modules (RF Interface, Detector, Mem, DFE, Decoder, SHA) on an actual die.]
Academic Performance not Always Relevant
In the previous Evolved EDGE example:
SHA-3 block was way too fast
  Already as small as possible, no speed/area trade-off
  Could reduce voltage to save power, but too much overhead
Performance uncritical
  SHA-3 block is a very small part of the overall system (<2%)
  Gains in area/speed/power will not improve overall performance much
Should not cause too much extra work
  Goal: reduce the engineering overhead to add it to the system
  Needs to adapt to the rest of the system as much as possible
  Use the same voltages and clock frequencies as the rest
  The interface and DFT solutions are more of a problem than anything else
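A quick back-of-the-envelope check of the two claims above (very small area share, far too fast), using only the figures quoted on the slides. This is a sketch: the helper names are illustrative, and it assumes that the 10k entry in the size list is the SHA-3 module and the other five entries are the existing receiver blocks.

```python
# Figures quoted on the slides; helper names are illustrative only.
# Assumption: the 10k entry is the SHA-3 module, the rest are existing blocks.
existing_blocks_ge = [80_000, 50_000, 25_000, 240_000, 300_000]
sha3_size_ge = 10_000        # SHA-3 module, 10 kGE
required_rate_bps = 600e3    # receiver decodes at up to 600 kbit/s
sha3_rate_bps = 24e6         # SHA-3 module digests 24 Mbit/s


def area_share(block_ge, other_blocks_ge):
    """Fraction of the total gate count taken by one block."""
    return block_ge / (block_ge + sum(other_blocks_ge))


def speed_margin(available_bps, required_bps):
    """How many times faster the block is than the system needs."""
    return available_bps / required_bps


share = area_share(sha3_size_ge, existing_blocks_ge)
margin = speed_margin(sha3_rate_bps, required_rate_bps)

print(f"SHA-3 share of total area: {share:.1%}")
print(f"SHA-3 throughput margin:   {margin:.0f}x")
```

Under these assumptions the module takes well under 2 % of the gates and is tens of times faster than the data rate requires, so further optimizing it changes little at system level.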
Energy and/or Power
How green is my circuit?
Power [W]
  Could be given as peak, minimum or mean values
  Important to determine: power supply, rail thicknesses, number of power pins and distribution
  Power density [W/cm²] is the limiting factor in large chips
  Does not tell us how much is done in unit time, favors serialization
  Low power does not necessarily mean your battery will last longer
Energy [J]
  Does not tell us how long the operation will take
  Usually favors parallelism
  Low energy means the battery will last longer
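The difference between the two measures can be sketched with the relation energy = power × time. The two designs and all their numbers below are hypothetical, chosen only to show that the lower-power design is not automatically the lower-energy one.

```python
def energy_joules(power_watts, duration_seconds):
    """Energy used by one operation at constant power: E = P * t."""
    return power_watts * duration_seconds


# Hypothetical designs performing the same operation (numbers invented).
serial_energy = energy_joules(power_watts=1e-3, duration_seconds=10e-3)
parallel_energy = energy_joules(power_watts=4e-3, duration_seconds=2e-3)

print(f"serial:   1 mW for 10 ms -> {serial_energy * 1e6:.1f} uJ")
print(f"parallel: 4 mW for  2 ms -> {parallel_energy * 1e6:.1f} uJ")
```

In this made-up case the serial design wins on power, while the parallel design wins on energy per operation, which is the quantity that decides battery life.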
Two Separate Energy/Power Related Problems
Increase efficiency of constrained devices
  Circuit power supplied by battery, radiated (RFID), or harvested
  Energy determines operation period for battery
  Harvesting, radiation efficiency determines peak power
  Usual goal: more functionality with same energy/power
  Not so much: increasing battery life, or reducing power
Ensure safe operation of high performance systems
  Moore’s Law gives us more transistors per unit area
  Power density increases: 100 W/cm² or even more
  Most of this power is dissipated as heat
  If not removed in time, chip will melt
  Solutions: more efficient cooling, reducing power consumption
Power Consumption Determines Power Routing
Maximum current density
  1-2 mA per 1 µm of wire width
Skin effect
  High-frequency current flows on the outside. Extremely wide wires are not helpful
Resistance of wires
  Voltage drop on the power rails: not all cells see the same VDD, and those with a lower VDD work slower
Via current density
  Vias connect different metal layers, have a fixed geometry, and are narrower than the metal
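As a rough sketch of how the current density limit turns a power figure into a routing requirement: the supply current is I = P / VDD, and dividing it by the allowed current per µm gives a lower bound on the total rail width. The 100 mW and 1.2 V values are hypothetical, the function name is illustrative, and the estimate deliberately ignores voltage drop, skin effect and via limits.

```python
def min_rail_width_um(power_w, vdd_v, max_current_ma_per_um):
    """Lower bound on total power rail width from the current density limit."""
    current_ma = power_w / vdd_v * 1e3   # supply current in mA
    return current_ma / max_current_ma_per_um


# Hypothetical block: 100 mW at 1.2 V, checked at both ends of the 1-2 mA/um range.
width_conservative = min_rail_width_um(0.1, 1.2, max_current_ma_per_um=1.0)
width_optimistic = min_rail_width_um(0.1, 1.2, max_current_ma_per_um=2.0)

print(f"at 1 mA/um: {width_conservative:.1f} um of rail width")
print(f"at 2 mA/um: {width_optimistic:.1f} um of rail width")
```

The result is a total width that in practice would be split over several narrower rails and metal layers.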
Two Components of Power/Energy Consumption
Dynamic
  Directly caused by activity in circuit (application dependent)
  Depends also on total capacitance switched by the activity (area)
  The rate of the changing activity (frequency)
  And the square of the supply voltage
  Reducing voltage is most promising, however not much available range
Static
  Mainly caused by leakage in transistors (area dependent)
  Increases as transistors get faster (smaller technology, lower VT)
  Consumed even if circuit is not doing anything
  Many technology options to counter: body biasing, high VT transistors
Modern low power/energy designs try to
  Balance dynamic and static consumption at around 50 % each
How to Determine Power
$P_{total} = P_{static} + P_{dynamic}$, with $P_{dynamic} \sim \alpha \cdot C \cdot f \cdot V_{DD}^2$
$P_{static}$ is relatively simple
  As soon as you have the netlist, you can estimate it
$f$ and $V_{DD}^2$ are known
  Both are design parameters that would be known after synthesis
For $C$ you need the placed and routed design
  This is problematic: only available at the end of the design flow. Circuit needs to be simulated with back-annotated netlist, which is time consuming
Activity $\alpha$ depends on the vectors
  Power directly depends on the activity. Finding correct vectors is not trivial
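A minimal numeric sketch of the dynamic power relation, mainly to show the quadratic dependence on the supply voltage. All parameter values are hypothetical, and the proportionality constant is taken as 1 here, which is an assumption (some conventions include a factor of 1/2).

```python
def dynamic_power_w(activity, capacitance_f, frequency_hz, vdd_v):
    """Dynamic power estimate: alpha * C * f * VDD^2 (constant factor assumed 1)."""
    return activity * capacitance_f * frequency_hz * vdd_v ** 2


# Hypothetical design: 10 % activity, 1 nF total switched capacitance, 100 MHz.
p_nominal = dynamic_power_w(0.1, 1e-9, 100e6, vdd_v=1.2)
p_reduced = dynamic_power_w(0.1, 1e-9, 100e6, vdd_v=1.0)

print(f"VDD = 1.2 V: {p_nominal * 1e3:.1f} mW")
print(f"VDD = 1.0 V: {p_reduced * 1e3:.1f} mW")
print(f"ratio: {p_reduced / p_nominal:.3f}")
```

Lowering VDD by about 17 % cuts the estimate by about 31 %, while the same relative error in the assumed activity or capacitance would shift the result only linearly.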
It is Easy to Make a Mistake Estimating Power
Make sure that you:
Specify voltage and clock frequency
  Any power figure without these is useless
Explain how circuit activity was obtained
  Assumed a global activity value
  Derived from input activity
  Actual functional simulation vectors
Disclose how the power was measured
  Estimated during synthesis
  Back-annotated delays with post-layout netlist
  Actual measurement on manufactured chip
For power estimations (not actual measurements)
  State operating conditions as well: best, typical, worst case
Actual measurements not easy
  Difficult to measure current for fast switching circuits
What About Compound Performance Measures?
Quite common technique to compare different architectures
Throughput per area
  Used commonly in cryptographic hardware papers, shows trade-off between area and speed.
Energy delay product
  Popular with computer architectures, shows the trade-off between energy and speed. A few variants of $E^n \cdot D^m$, with different n and m, exist.
Meaningless, unless the circuit has been designed for this measure
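One way to see why such a measure only means something for circuits designed against it: the ranking of two designs can flip depending on the exponents chosen. The sketch below uses two hypothetical designs with invented energy and delay values; the function name is illustrative.

```python
def energy_delay_metric(energy, delay, n=1, m=1):
    """Generalized energy-delay metric E^n * D^m (lower is better)."""
    return energy ** n * delay ** m


# Hypothetical designs: (energy in nJ, delay in ns), numbers invented.
design_a = (1.0, 10.0)   # frugal but slow
design_b = (2.0, 6.0)    # faster but hungrier

edp_a = energy_delay_metric(*design_a)          # E * D
edp_b = energy_delay_metric(*design_b)
ed2p_a = energy_delay_metric(*design_a, m=2)    # E * D^2
ed2p_b = energy_delay_metric(*design_b, m=2)

print(f"E*D:   A = {edp_a:.0f}, B = {edp_b:.0f}")
print(f"E*D^2: A = {ed2p_a:.0f}, B = {ed2p_b:.0f}")
```

With these numbers design A wins on E·D and design B wins on E·D², so the choice of metric decides the comparison.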
Example: Throughput Per Area in 65 nm
Keccak, TpA: 395 kbit/s/GE
  From TpA alone, Keccak looks almost twice as fast. Is it the same area, but double the speed?
  Area: 80 kGE, more than 3 times the area
  Speed: 32 Gbit/s, and 6 times the speed
SHA-2, TpA: 215 kbit/s/GE
  From TpA alone, SHA-2 looks like a slower/larger version. Is it twice as large as Keccak, at the same speed?
  Area: 25 kGE, in reality SHA-2 is small
  Speed: 5.4 Gbit/s, but much slower
Compound Performance Metrics can be Misleading
  TpA shows neither the speed nor the area of each design
  It implies that throughput can be traded off against area without limitation. This is not true.
  Cannot get a 1 Gbit/s Keccak using only 2.5 kGE
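The numbers on this slide can be reproduced with a few lines. The sketch below recomputes TpA from the quoted area and speed, then shows the naive extrapolation that the metric invites. The function name is illustrative; the recomputed values differ slightly from the slide (400 and 216 instead of 395 and 215), presumably because the quoted area and speed are rounded.

```python
def throughput_per_area(throughput_bps, area_ge):
    """Throughput per area in bit/s per gate equivalent."""
    return throughput_bps / area_ge


keccak_tpa = throughput_per_area(32e9, 80e3)    # 32 Gbit/s in 80 kGE
sha2_tpa = throughput_per_area(5.4e9, 25e3)     # 5.4 Gbit/s in 25 kGE

# Naive (and wrong) extrapolation: assume TpA stays constant at any size.
naive_area_ge = 1e9 / 395e3   # area "needed" for a 1 Gbit/s Keccak

print(f"Keccak TpA: {keccak_tpa / 1e3:.0f} kbit/s/GE")
print(f"SHA-2 TpA:  {sha2_tpa / 1e3:.0f} kbit/s/GE")
print(f"Naive 1 Gbit/s Keccak: {naive_area_ge / 1e3:.2f} kGE")
```

The last figure is what the metric suggests, not something that can be built: the trade-off between throughput and area does not extend linearly down to such a small design.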
Comparing FPGAs and ASICs
Started as programmable logic
  Alternative for glue logic, replacement for 74xx families
Became something more
  Added block RAMs, programmable I/Os, PLLs, DSP slices
Newest generation is even more powerful
  E.g. the Xilinx Zynq contains a dual-core ARM Cortex-A9.
Resource allocation problem
  FPGAs have a collection of resources
  If you can make use of them, good; if not, you don’t save anything
  In some cases, you can use a smaller (cheaper) FPGA, which is a cost benefit
FPGAs have a Fast Design Flow
Typically a single tool for the entire design flow
Good post-layout results
  All resources on FPGA characterized, good speed/power estimations
Fast implementation
  The complete design can be measured/tested immediately
Estimating Area Usage on FPGAs Problematic
FPGAs contain a set of resources
  Not all resources are used by the same amount
  Design uses x % of LUTs, y % of BRAMs and z % of DSPs
FPGA families differ in their resources
  The type and particular mix of resources in an FPGA varies
In practice most resources cannot be used 100 %
  In most cases you will not be able to put 5 designs that each use 20 % of the LUTs into the same FPGA
Resources are there no matter what
  Pipelining FFs are in every LUT (ALM). If you do not use them they will be bypassed. A circuit with 20 pipeline stages is not larger than one with only one stage.
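A small sketch of why a single "area" number does not work for FPGAs: how many copies of a design fit is set by the scarcest resource, not by the LUT count alone. The utilization figures are hypothetical and the function name is illustrative. The result is an optimistic upper bound, since it assumes every resource could be filled to 100 %, which is rarely possible in practice.

```python
import math


def max_copies(utilization_percent):
    """Upper bound on copies that fit, limited by the scarcest resource."""
    return min(
        math.floor(100 / used)
        for used in utilization_percent.values()
        if used > 0
    )


# Hypothetical utilization report for one instance of a design.
design = {"LUT": 20, "BRAM": 45, "DSP": 10}

copies = max_copies(design)
print(f"LUTs alone suggest {math.floor(100 / design['LUT'])} copies")
print(f"All resources allow at most {copies} copies")
```

Here the LUT figure alone suggests five copies, while the block RAM usage limits the device to two.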
Last Words
What Did We Discuss?
Cost structure of IC design
Area and speed of ICs
  What are the common pitfalls
Academic vs industrial use of EDA tools
  Different goals in both approaches, but EDA tools are designed for industrial goals
Energy/Power
  Why it is the more difficult parameter to determine
If you review a paper
  Now you know which ones you can reject.