<<

Understanding performance numbers in Oprecomp summer school 2019, Perugia Italy 5 September 2019

Frank K. G¨urkaynak [email protected]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 2/74 Who Am I?

Born in Istanbul, Turkey Studied and worked at: Istanbul Technical University, Istanbul, Turkey EPFL, Lausanne, Switzerland Worcester Polytechnic Institute, Worcester MA, USA

Since 2008: Integrated Systems Laboratory, ETH Zurich Director, Design Center Senior Scientist, group of Prof. Luca Benini

Interests: Digital Integrated Circuits Cryptographic Hardware Design Design Flows for Digital Design Design Open Source Hardware

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 3/74 What Will We Discuss Today?

Introduction Cost Structure of Integrated Circuits (ICs)

Measuring performance of ICs Why is it difficult? EDA should give us a number

Area How do people report area? Is that fair?

Speed How fast does my circuit actually work?

Power These days much more important, but also much harder to get right

Integrated Systems Laboratory The performance establishes the solution Finally the cost sets a limit to what is possible

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 4/74 System Design Requirements

System Requirements

Functionality

Functionality determines what the system will do

Integrated Systems Laboratory Finally the cost sets a limit to what is possible

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 4/74 System Design Requirements

System Requirements

Functionality Performance

Functionality determines what the system will do The performance establishes the solution space

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 4/74 System Design Requirements

System Requirements

Cost Functionality Performance

Functionality determines what the system will do The performance establishes the solution space Finally the cost sets a limit to what is possible

Integrated Systems Laboratory  Relatively easy to program, includes , cheap  Poor performance General Purpose Processors  Simple to program, no need to know hardware design  Sequential processing, needs many peripherals, high cost Parallel/Graphics Processors  Allow parallel processing  High cost, no peripherals, not so easy to program

Programmable Logic, FPGAs  Very flexible, simpler design flow than ASICs, many IPs  High unit cost, not as fast as ASICs Processors

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 5/74 The Focus of this Talk is on Digital Systems Custom Integrated Circuits, ASICs  Best performance, cheap for mass market  Difficult to design, long time

Integrated Systems Laboratory Microcontrollers  Relatively easy to program, includes peripherals, cheap  Poor performance General Purpose Processors  Simple to program, no need to know hardware design  Sequential processing, needs many peripherals, high cost Parallel/Graphics Processors  Allow parallel processing  High cost, no peripherals, not so easy to program

Processors

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 5/74 The Focus of this Talk is on Digital Systems Custom Integrated Circuits, ASICs  Best performance, cheap for mass market  Difficult to design, long manufacturing time Programmable Logic, FPGAs  Very flexible, simpler design flow than ASICs, many IPs  High unit cost, not as fast as ASICs

Integrated Systems Laboratory General Purpose Processors  Simple to program, no need to know hardware design  Sequential processing, needs many peripherals, high cost Parallel/Graphics Processors  Allow parallel processing  High cost, no peripherals, not so easy to program

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 5/74 The Focus of this Talk is on Digital Systems Custom Integrated Circuits, ASICs  Best performance, cheap for mass market  Difficult to design, long manufacturing time Programmable Logic, FPGAs  Very flexible, simpler design flow than ASICs, many IPs  High unit cost, not as fast as ASICs Processors Microcontrollers  Relatively easy to program, includes peripherals, cheap  Poor performance

Integrated Systems Laboratory Parallel/Graphics Processors  Allow parallel processing  High cost, no peripherals, not so easy to program

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 5/74 The Focus of this Talk is on Digital Systems Custom Integrated Circuits, ASICs  Best performance, cheap for mass market  Difficult to design, long manufacturing time Programmable Logic, FPGAs  Very flexible, simpler design flow than ASICs, many IPs  High unit cost, not as fast as ASICs Processors Microcontrollers  Relatively easy to program, includes peripherals, cheap  Poor performance General Purpose Processors  Simple to program, no need to know hardware design  Sequential processing, needs many peripherals, high cost

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 5/74 The Focus of this Talk is on Digital Systems Custom Integrated Circuits, ASICs  Best performance, cheap for mass market  Difficult to design, long manufacturing time Programmable Logic, FPGAs  Very flexible, simpler design flow than ASICs, many IPs  High unit cost, not as fast as ASICs Processors Microcontrollers  Relatively easy to program, includes peripherals, cheap  Poor performance General Purpose Processors  Simple to program, no need to know hardware design  Sequential processing, needs many peripherals, high cost Parallel/Graphics Processors  Allow parallel processing  High cost, no peripherals, not so easy to program

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 6/74 Digital Solutions are Cost Efficient

For example:

Main motivation to move from VHS to DVD was cost!

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 7/74

1 Introduction

2 Cost Cost Structure Practical Limits The Need for Test Moore’s Law

3 Design Flow

4 Area

5 Speed

6 Area/Speed Trade-offs

7 Power Integrated Systems Laboratory

8 Conclusions Recurring costs (per chip) cost Cost of + processing Licensing cost Some of the IP require a licensing cost per chip sold Packaging The package that holds the chip + costs of bonding Testing Each manufactured chip needs to be tested. Costs both money and time

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 8/74 How Much Does a Chip Cost?

Non-recurring costs are not cheap (app. 100 k$/yr) IP costs Some of the IP are bought EDA tools + Infrastructure EDA tools are expensive, i.e. 100 k$/yr/seat Masks for production Chips may need 20-50 masks. dependent, may cost 50 k$-5,000 k$

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 8/74 How Much Does a Chip Cost?

Non-recurring costs Recurring costs (per chip) Engineering Silicon cost Designers are not cheap Cost of wafer + processing (app. 100 k$/yr) Licensing cost IP costs Some of the IP require a Some of the IP are bought licensing cost per chip sold EDA tools + Infrastructure Packaging EDA tools are expensive, i.e. The plastic package that holds 100 k$/yr/seat the chip + costs of bonding Masks for production Testing Chips may need 20-50 masks. Each manufactured chip needs Technology dependent, may to be tested. Costs both cost 50 k$-5,000 k$ money and time

Integrated Systems Laboratory Packaging + Licensing + Testing Design dependent from a few cents, to several of dollars (1.00 $)

Non-recurring costs 200 MM of engineering, around 2 M$ Production mask set around 1 M$ IPs + EDA tools + administration + infrastructure another 1 M$ Volume dependent. i.e for 5,000,000 chips sold, around 0.80 $ per chip

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 9/74 A Simplified Example

A 4 mm × 4 mm chip in 90 nm technology, 300 mm wafers Silicon Cost About 4,000 dies per wafer At 85 % yield, around 3400 working dies About 2,500 $ per wafer (including processing) around 0.75 $ per working

Adapted from H. Kaeslin ”Digital Integrated ”, Cambridge University Press, 2008

Integrated Systems Laboratory Non-recurring costs 200 MM of engineering, around 2 M$ Production mask set around 1 M$ IPs + EDA tools + administration + infrastructure another 1 M$ Volume dependent. i.e for 5,000,000 chips sold, around 0.80 $ per chip

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 9/74 A Simplified Example

A 4 mm × 4 mm chip in 90 nm technology, 300 mm wafers Silicon Cost About 4,000 dies per wafer At 85 % yield, around 3400 working dies About 2,500 $ per wafer (including processing) around 0.75 $ per working die

Packaging + Licensing + Testing Design dependent from a few cents, to several of dollars (1.00 $)

Adapted from H. Kaeslin ”Digital Integrated Circuit Design”, Cambridge University Press, 2008

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 9/74 A Simplified Example

A 4 mm × 4 mm chip in 90 nm technology, 300 mm wafers Silicon Cost About 4,000 dies per wafer At 85 % yield, around 3400 working dies About 2,500 $ per wafer (including processing) around 0.75 $ per working die

Packaging + Licensing + Testing Design dependent from a few cents, to several of dollars (1.00 $)

Non-recurring costs 200 MM of engineering, around 2 M$ Production mask set around 1 M$ IPs + EDA tools + administration + infrastructure another 1 M$ Volume dependent. i.e for 5,000,000 chips sold, around 0.80 $ per chip

Adapted from H. Kaeslin ”Digital Integrated Circuit Design”, Cambridge University Press, 2008

Integrated Systems Laboratory If we sell 5’000 chips 4 M$ non-recurring costs, 800 $ per chip Total chip costs 801.75 $ (99.78% non-recurring)

If we sell 500’000’000 chips 4 M$ non-recurring costs, 0.008 $ per chip Total chip costs 1.76 $ (0.5% non-recurring)

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 10/74 The key to IC Design success is volume

Previous example, cost of one chip If we sell 5,000,000 chips 4 M$ non-recurring costs, 0.80 $ per chip Total chip costs 2.55 $ (31% non-recurring)

Adapted from H. Kaeslin ”Digital Integrated Circuit Design”, Cambridge University Press, 2008

Integrated Systems Laboratory If we sell 500’000’000 chips 4 M$ non-recurring costs, 0.008 $ per chip Total chip costs 1.76 $ (0.5% non-recurring)

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 10/74 The key to IC Design success is volume

Previous example, cost of one chip If we sell 5,000,000 chips 4 M$ non-recurring costs, 0.80 $ per chip Total chip costs 2.55 $ (31% non-recurring)

If we sell 5’000 chips 4 M$ non-recurring costs, 800 $ per chip Total chip costs 801.75 $ (99.78% non-recurring)

Adapted from H. Kaeslin ”Digital Integrated Circuit Design”, Cambridge University Press, 2008

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 10/74 The key to IC Design success is volume

Previous example, cost of one chip If we sell 5,000,000 chips 4 M$ non-recurring costs, 0.80 $ per chip Total chip costs 2.55 $ (31% non-recurring)

If we sell 5’000 chips 4 M$ non-recurring costs, 800 $ per chip Total chip costs 801.75 $ (99.78% non-recurring)

If we sell 500’000’000 chips 4 M$ non-recurring costs, 0.008 $ per chip Total chip costs 1.76 $ (0.5% non-recurring)

Adapted from H. Kaeslin ”Digital Integrated Circuit Design”, Cambridge University Press, 2008

Integrated Systems Laboratory Recurring costs (per chip) Silicon cost Reduce die area Licensing cost Do not buy IP (increases engineering cost) Packaging Simpler packages, reduce number of I/Os, thermal requirements Testing Reduce testing time, improve DFT.

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 11/74 How Can We Make Chips Cheaper

Non-recurring costs Sell more chips! Engineering Buy IP (increases IP costs) IP costs Redesign it yourself (increases engineering costs) EDA tools + Infrastructure Outsource to design houses Masks for production Use fewer masks. Use older (cheaper)

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 11/74 How Can We Make Chips Cheaper

Non-recurring costs Recurring costs (per chip) Sell more chips! Silicon cost Engineering Reduce die area Buy IP (increases IP costs) Licensing cost IP costs Do not buy IP (increases Redesign it yourself (increases engineering cost) engineering costs) Packaging EDA tools + Infrastructure Simpler packages, reduce Outsource to design houses number of I/Os, thermal Masks for production requirements Use fewer masks. Use older Testing (cheaper) technologies Reduce testing time, improve DFT.

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 12/74 How are Individual Dies Cut From a Wafer?

Dicing is a mechanical There needs to be some space (typically 50-200 µm) between dies on a wafer so that we can reliably cut them.

Images taken from http://wonderfulworldofwafers.wordpress.com/ Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 12/74 How are Individual Dies Cut From a Wafer?

The sawline sets a lower limit for the practical die size If the sawline is 100 µm, and the die is 1 mm×1 mm in size: more than 20 % of the wafer will be wasted for sawlines

Images taken from http://wonderfulworldofwafers.wordpress.com/ Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 13/74 The Story of Yield

Wafer

Single wafer during production

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 13/74 The Story of Yield

Manufacturing Defects

Inevitably there will be some defects on the wafer

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 13/74 The Story of Yield

Wasted Die Wasted Area Area

Die Die Die

Wasted Wasted Area Area Die

If a large die is used, only few dies can be manufactured per each wafer

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 13/74 The Story of Yield

Defect Die

Defect Good Defect Die Die Die

Defect Die

In this example 4 out of 5 dies are defect. Yield = 1/5 = 20 %

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 13/74 The Story of Yield

Smaller dies improve yield: Yield = 7/12 = 58 %

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 13/74 The Story of Yield

The smaller the dies, the better the yield: Yield = 5/60 = 92 %

Integrated Systems Laboratory Testing can be expensive and time consuming Typical tester costs more than 1 M$ Test costs expressed usually per pin and per time Make sure you spend as little time as possible on tester Test at speed within the chip (Built-in Self-Test) Most testing is outsourced to dedicated test houses

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 14/74 You Need To Test Every Chip You Manufacture

A certain amount of chips will always be defective Keeps production costs down. 100 % yield economically not feasible Yield typically improves over time for a specific process Cutting edge technologies may have as low as 10 % yield Established technologies have 90 % or more yield The smaller the dies, the higher the yield

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 14/74 You Need To Test Every Chip You Manufacture

A certain amount of chips will always be defective Keeps production costs down. 100 % yield economically not feasible Yield typically improves over time for a specific process Cutting edge technologies may have as low as 10 % yield Established technologies have 90 % or more yield The smaller the dies, the higher the yield

Testing can be expensive and time consuming Typical tester costs more than 1 M$ Test costs expressed usually per pin and per time Make sure you spend as little time as possible on tester Test at speed within the chip (Built-in Self-Test) Most testing is outsourced to dedicated test houses

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 15/74 Moore’s Law: 2x Every 18 Months

Intel Processors

10G

1G

100M

Moore’s Law 10M

1M

Number of Transistors 100k

10k

1k 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 Year http://www.intel.com/content/www/us/en/history/history-intel-chips-timeline-poster.html

Integrated Systems Laboratory How long will it hold up? Has been predicted to end multiple times (even by Moore himself) Self fullfilling prophecy, sets goal for industry Process options become more complex and more expensive Physical limits (single atom) approaching fast

Chips are not necessarily getting faster Side effect of scaling: smaller transistors faster Power density issues: voltage is reduced Reduced voltage: reduces operating speed Trade-off between speed/power

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 16/74 What Moore’s Law Doesn’t Say

Chips are not necessarily getting smaller We can put more transistors per unit area But very small (and very large chips) are not feasible We need to find use for more transistors

Integrated Systems Laboratory Chips are not necessarily getting faster Side effect of scaling: smaller transistors switch faster Power density issues: voltage is reduced Reduced voltage: reduces operating speed Trade-off between speed/power

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 16/74 What Moore’s Law Doesn’t Say

Chips are not necessarily getting smaller We can put more transistors per unit area But very small (and very large chips) are not feasible We need to find use for more transistors

How long will it hold up? Has been predicted to end multiple times (even by Moore himself) Self fullfilling prophecy, sets goal for industry Process options become more complex and more expensive Physical limits (single atom) approaching fast

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 16/74 What Moore’s Law Doesn’t Say

Chips are not necessarily getting smaller We can put more transistors per unit area But very small (and very large chips) are not feasible We need to find use for more transistors

How long will it hold up? Has been predicted to end multiple times (even by Moore himself) Self fullfilling prophecy, sets goal for industry Process options become more complex and more expensive Physical limits (single atom) approaching fast

Chips are not necessarily getting faster Side effect of scaling: smaller transistors switch faster Power density issues: voltage is reduced Reduced voltage: reduces operating speed Trade-off between speed/power

Integrated Systems Laboratory Not all chips will profit equally from Moore’s Law If your chip is very small, making it even smaller will not reduce costs

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 17/74 Die Area from a Business Perspective

Type Approximate Size Notes very small < 1 mm2 Sawlines are a problem small 1 mm2 - 10 mm2 Low die cost large 10 mm2 - 100 mm2 More design effort required very large > 100 mm2 Yield an issue

If dies are very small, silicon area is wasted on sawlines If dies are very large, yield decreases, die cost increases Sweetspot for die area between 1 mm2 - 100 mm2

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 17/74 Die Area from a Business Perspective

Type Approximate Size Notes very small < 1 mm2 Sawlines are a problem small 1 mm2 - 10 mm2 Low die cost large 10 mm2 - 100 mm2 More design effort required very large > 100 mm2 Yield an issue

If dies are very small, silicon area is wasted on sawlines If dies are very large, yield decreases, die cost increases Sweetspot for die area between 1 mm2 - 100 mm2

Not all chips will profit equally from Moore’s Law If your chip is very small, making it even smaller will not reduce costs

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 18/74

1 Introduction

2 Cost

3 Design Flow Academic vs Industrial Goals

4 Area

5 Speed

6 Area/Speed Trade-offs

7 Power

8 Conclusions Integrated Systems Laboratory .. or it is actually not so easy.

Why doesn’t this work yet? Perhaps there is a conspiracy? Maybe we are lazy! It could be that we want to keep our jobs.

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 19/74 The Ultimate Goal of System Designers/Managers

Powerpoint to Chip (GDSII)

Integrated Systems Laboratory .. or it is actually not so easy.

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 19/74 The Ultimate Goal of System Designers/Managers

Powerpoint to Chip (GDSII)

Why doesn’t this work yet? Perhaps there is a conspiracy? Maybe we are lazy! It could be that we want to keep our jobs.

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 19/74 The Ultimate Goal of System Designers/Managers

Powerpoint to Chip (GDSII)

Why doesn’t this work yet? Perhaps there is a conspiracy? Maybe we are lazy! It could be that we want to keep our jobs. .. or it is actually not so easy.

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 20/74 From Idea to ASIC: The Design Flow

Front-end Design Flow Design: Coming up with an Design Entry: Description in HDL Verification: Making sure it does what it is supposed to Synthesis: Mapping to technology Scan Insertion: Adding structures for testing

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 21/74 From Idea to ASIC: The Design Flow

Back-end Design Flow Placement: Putting all logic gates / macros on chip Tree Insertion: Distributing on the chip Routing: Establishing connections on the chip Timing Verification: Making sure the setup/hold times are met Chip Finishing: Pads, logos, seal ring DRC and LVS: Making sure that the physical layout is correct

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 22/74 Confidence in the Results Increase with Flow

Specifications

Architecture Architecture (GMU) (ETH Zurich)

RTL Description (VHDL)

Synthesis Constraints ( DC)

Place and Route (Cadence EDI)

Synthesis Wireload Model (Synopsys DC)

Place and Route (Cadence EDI)

ASIC (UMC65nm) High Low Accuracy of Results

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 23/74 Design Specifications

A List of Requirements Function What exactly is expected from the chip? Performance How fast, what area, how much power? I/O requirements How will the ASIC fit together with the rest of the system?

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 24/74 Synthesize HDL into Gates

Acquiring Physical Properties Technology Decided by choosing the target Area Gates occupy area Speed Propagation through gates requires time Power Gates consume power during switching and when idle

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 25/74 Floorplanning I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O VDD VDD GND GND GND VDD GND How will the chip look I/O I/O Power Routing 64x32b I/O RAM I/O I/O I/O Determine area/geometry 600µ I/O 770µ Reference Design I/O I/O and Interface I/O of the chip 960µ I/O I/O I/O I/O Place macro and I/O cells g2s I/O 64x32b I/O GND Student Design: Local RAM VDD Clock Gen Design the power VDD Baby and VDD 1,800µ Pampers VDD 700µ Goliath GND connections I/O I/O I/O g2d g2d I/O Leave room for routing I/O Routing Channel I/O I/O I/O optimizations I/O d2g d2g I/O Local I/O Local I/O Clock David David Clock 470µ Iterative process, can not I/O Gen. I/O for Top- Design Gen. 450µ I/O I/O determine the perfect GND VDD floorplan from beginning. I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O I/O VDD VDD GND GND GND

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 26/74 Placement is a NP-Hard Problem

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 27/74 Routing Determines the Parasitic Load

Integrated Systems Laboratory Cost Total silicon cost Amount of time to process a given data item Energy Total energy to finish task Man Months to Design How much engineers cost

These are related, but they are not the same Important to know, which one we are really interested in!

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 28/74 What Are Performance Parameters?

Area Total area occupied by circuit Throughput Number of data items processed per unit time Power Peak power for operation Time to Market How long the design will take

Integrated Systems Laboratory These are related, but they are not the same Important to know, which one we are really interested in!

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 28/74 What Are Performance Parameters?

Area Cost Total area occupied by circuit Total silicon cost Throughput Latency Number of data items Amount of time to process a processed per unit time given data item Power Energy Peak power for operation Total energy to finish task Time to Market Man Months to Design How long the design will take How much engineers cost

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 28/74 What Are Performance Parameters?

Area Cost Total area occupied by circuit Total silicon cost Throughput Latency Number of data items Amount of time to process a processed per unit time given data item Power Energy Peak power for operation Total energy to finish task Time to Market Man Months to Design How long the design will take How much engineers cost

These are related, but they are not the same Important to know, which one we are really interested in!

Integrated Systems Laboratory Use a more advanced technology Easy way Not always feasible Costs will increase Additional problems may arise (adapting, lack of IPs)

Make a better implementation Harder, needs smart (expensive) engineers Not always possible, there is a limit to what can be done Academia should concentrate on this, show what is the limit

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 29/74 How to Improve Performance

Integrated Systems Laboratory Make a better implementation Harder, needs smart (expensive) engineers Not always possible, there is a limit to what can be done Academia should concentrate on this, show what is the limit

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 29/74 How to Improve Performance

Use a more advanced technology Easy way Not always feasible Costs will increase Additional problems may arise (adapting, lack of IPs)

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 29/74 How to Improve Performance

Use a more advanced technology Easy way Not always feasible Costs will increase Additional problems may arise (adapting, lack of IPs)

Make a better implementation Harder, needs smart (expensive) engineers Not always possible, there is a limit to what can be done Academia should concentrate on this, show what is the limit

Integrated Systems Laboratory ..but in Industry The chips need to work within specifications. Goal is to sell a product Cost and time to design important Manufacturability is paramount Mostly conservative design practices are used

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 30/74 Academic vs Industrial IC Design

Both of them have similar design flows use the same tools define performance similarly

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 30/74 Academic vs Industrial IC Design

Both of them have similar design flows use the same tools define performance similarly

..but in Industry The chips need to work within specifications. Goal is to sell a product Cost and time to design important Manufacturability is paramount Mostly conservative design practices are used

Integrated Systems Laboratory In academia: Interested only in the limits (fastest, smallest etc.) Performance numbers needed to show how we compare against others It is not easy to get fair numbers from these tools Constraints are misused to explore design space

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 31/74 EDA tools are Designed for Industry Requirements

For the industry: Circuit has to function to specification even in worst conditions Constraints are used for to define specifications EDA tools make sure circuit performance meets specifications Not designed to get peak performance

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 31/74 EDA tools are Designed for Industry Requirements

For the industry: Circuit has to function to specification even in worst conditions Constraints are used for to define specifications EDA tools make sure circuit performance meets specifications Not designed to get peak performance

In academia: Interested only in the limits (fastest, smallest etc.) Performance numbers needed to show how we compare against others It is not easy to get fair numbers from these tools Constraints are misused to explore design space

Integrated Systems Laboratory All of these problems have solutions...... but they come at a cost!

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 32/74 Real-World Problems will Effect Performance

Parasitic RLC effects on timing Clock distribution (skew, jitter) Cross talk, I/O speed limitations IR drop (more area for power routing) Maximum limits (more area for power routing) PVT variations (die-to-die, inter-die), reliability Thermal issues, temperature variation during operation

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 32/74 Real-World Problems will Effect Performance

Parasitic RLC effects on timing Clock distribution (skew, jitter) Cross talk, signal integrity I/O speed limitations IR drop (more area for power routing) Maximum current density limits (more area for power routing) PVT variations (die-to-die, inter-die), reliability Thermal issues, temperature variation during operation

All of these problems have solutions...... but they come at a cost!

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 33/74 Academic Reporting Neglects Some Problems

Overhead of Solutions not Always Included For example: Large I/O (serial I/O, pads, buffers) Very fast (PLLs, costly clock tree) Dynamic Voltage/Frequency Scaling (synchronizers, level shifters) Additional Verification Effort Circuits with: Many operational modes Interfaces to different clock domains Can be costly in terms of verification regardless of the circuit size Testing Overhead Some circuits are difficult to test (i.e. asynchronous circuits), or require extensive test modes

Integrated Systems Laboratory Post-Layout Routing overhead known, more accurate timing Not all post-layout include proper provisions for: power, signal integrity, testing

Actual Measurement Real proof of concept Performance affected by practical problems, usually worse than expected Requires test infrastructure, costly

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 34/74 Quality of Results Depend on Design

Different Levels, Different Accuracy Synthesis First level with physical properties Routing overhead unknown, estimated by wireload models Timing and power models are mean values, not that accurate

Integrated Systems Laboratory Actual Measurement Real proof of concept Performance affected by practical problems, usually worse than expected Requires test infrastructure, costly

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 34/74 Quality of Results Depend on Design State

Different Levels, Different Accuracy Synthesis First level with physical properties Routing overhead unknown, estimated by wireload models Timing and power models are mean values, not that accurate

Post-Layout Routing overhead known, more accurate timing Not all post-layout designs include proper provisions for: power, signal integrity, testing

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 34/74 Quality of Results Depend on Design State

Different Levels, Different Accuracy Synthesis First level with physical properties Routing overhead unknown, estimated by wireload models Timing and power models are mean values, not that accurate

Post-Layout Routing overhead known, more accurate timing Not all post-layout designs include proper provisions for: power, signal integrity, testing

Actual Measurement Real proof of concept Performance affected by practical problems, usually worse than expected Requires test infrastructure, costly

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 35/74 Throughput/Area: Post-Layout ¨ Measurement

Normalized Throughput/Area of ALL SHA-3 Candidates 3 Postlayout Results GMU Typical Case ETHZ 770 2.5 VDD=1.2V Numbers in kbits/GE

2 518

1.5

1 273 208 Throughput/Area normalized to GMU SHA-2

0.5 179 162 146 115 117 77 42 44 0 SHA-2 BLAKE Groestl JH Keccak Skein

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 35/74 Throughput/Area: Post-Layout ¨ Measurement

Normalized Throughput/Area of ALL SHA-3 Candidates 3 Measurement Results GMU Average of 5 ASICs ETHZ 2.5 VDD=1.2V Numbers in kbits/GE 685

2

1.5 394

1 215 Throughput/Area normalized to GMU SHA-2

0.5 174 167 154 139 121 85 86 41 46 0 SHA-2 BLAKE Groestl JH Keccak Skein Algorithms

Integrated Systems Laboratory Why do we care? Silicon Cost Feasibility (will it fit?)

Should be easy to determine Does not change after/during manufacturing Reliable and accurate post-layout numbers

Units mm 2 correct unit, technology dependent GE commonly used, not accurate at all

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 36/74 Let Us Talk About Area

How much silicon area will be used for the circuit?

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 36/74 Let Us Talk About Area

How much silicon area will be used for the circuit?

Why do we care? Silicon Cost Feasibility (will it fit?)

Should be easy to determine Does not change after/during manufacturing Reliable and accurate post-layout numbers

Units mm 2 correct unit, technology dependent GE commonly used, not accurate at all

Integrated Systems Laboratory Library dependent Width, style, height can differ

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 37/74 The Story of Gate Equivalents

Area of 2-input NAND gate Historical reasons Was used to compare circuits in different styles: CMOS, TTL, ECL.. For CMOS 4 transistors is 1 GE

Not easy to standardize Everyone can interpret it differently

Many 4 gates Buffer, 2-input NAND, NOR.. Different drive strengths

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 37/74 The Story of Gate Equivalents

Area of 2-input NAND gate Historical reasons Was used to compare circuits in different styles: CMOS, TTL, ECL.. For CMOS 4 transistors is 1 GE

Not easy to standardize Everyone can interpret it differently

Many 4 transistor gates Buffer, 2-input NAND, NOR.. Different drive strengths

Library dependent Width, style, height can differ

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 38/74 Same Designs, Different Technologies

Technology Library Design A Design B Foundry Node Vendor Tracks [mm2] [kGE] [mm2] [kGE] C 28 nm I 12 0.014 28.1 0.129 263.6 C 45 nm I 9 0.016 23.2 0.168 247.3 C 45 nm I 14 0.019 18.3 0.220 208.3 C 65 nm I 13 0.035 17.0 0.407 195.5 A 65 nm IV 9 0.030 21.0 0.311 216.0 D 65 nm VI 9 0.029 20.1 0.305 211.7 A 90 nm V 10 0.060 19.3 0.672 214.2 A 130 nm V 8 0.104 20.3 1.057 206.5 E 130 nm II 9 0.109 24.1 1.179 259.7 C 130 nm I 12 0.115 19.0 1.299 214.6 B 150 nm III 9 0.214 23.7 2.070 229.0 A 180 nm V 9 0.211 22.5 2.151 229.5 A 180 nm VII 9 0.198 20.5 2.056 212.5

Integrated Systems Laboratory Round your results to not more than 2 significant digits! 8 kGE 250 kGE

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 39/74 GE is a very Coarse Measure for Area

Area in gate equivalents will vary at least ± 10% between different technologies

In other words Do not use overly precise area descriptions: 8,421 GE 235,481 GE

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 39/74 GE is a very Coarse Measure for Area

Area in gate equivalents will vary at least ± 10% between different technologies

In other words Do not use overly precise area descriptions: 8,421 GE 235,481 GE Round your results to not more than 2 significant digits! 8 kGE 250 kGE

Integrated Systems Laboratory The post-layout gate count? The gate count is just the area occupied by the gates You need extra space for routing, signal integrity, and power routing

The total die area of the manufactured chip? You may have some additional structures for testing in the chip Practical limits may require specific die sizes (Europractice mini@sic) What if the chip contains more than one design?

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 40/74 What Exactly Is the Circuit Area?

The number reported by the synthesizer? Does not include routing, power, clock overhead

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 41/74 Design With and Without Power Routing

Small design Same design, larger GE: 69’042, GE: 70’906, area: 0.719 mm2 area: 1.183 mm2

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 41/74 Design With and Without Power Routing

Small design Same design, larger Would NOT work at speed Would work fine at the specified more than 20% IR frequency

Integrated Systems Laboratory The total die area of the manufactured chip? You may have some additional structures for testing in the chip Practical limits may require specific die sizes (Europractice mini@sic) What if the chip contains more than one design?

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 42/74 What Exactly is the Circuit Area?

The number reported by the synthesizer? Does not include routing, power, clock overhead

The post-layout gate count? The gate count is just the area occupied by the gates You need extra space for routing, signal integrity, and power routing

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 43/74 How Cells are Actually Placed on Chip

Close-up of a placed and routed design

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 43/74 How Cells are Actually Placed on Chip

Partially stripped layout to show cells and power routing

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 43/74 How Cells are Actually Placed on Chip

Standard cells, power routing and signal routing separated

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 43/74 How Cells are Actually Placed on Chip

Cell placement showing standard cells, body tie cells, buffers

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 44/74 What Exactly Is the Circuit Area?

The number reported by the synthesizer? Does not include routing, power, clock overhead

The post-layout gate count? The gate count is just the area occupied by the gates You need extra space for routing, signal integrity, and power routing

The total die area of the manufactured chip? You may have some additional structures for testing in the chip Practical limits may require specific die sizes (Europractice mini@sic) What if the chip contains more than one design?

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 45/74 What is the Real Area of this Block?

One block within a larger ASIC

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 45/74 What is the Real Area of this Block?

Area of smallest rectangle into which the block fits completely?

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 45/74 What is the Real Area of this Block?

Area covering the maximum reach of the cells?

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 45/74 What is the Real Area of this Block?

Close-up of the same block

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 45/74 What is the Real Area of this Block?

Block of interest shares area with a second block (in red)

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 45/74 What is the Real Area of this Block?

Closer look shows cells of two blocks with empty area in between

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 46/74 Area Numbers are Approximate

Determining the area is not so easy Synthesis numbers do not tell the whole story Routing, power overhead is design dependent. Can be only 10 % (LFSR) Can be 500 % (LDPC decoder)

Postlayout numbers are more reliable Can still be misleading There are many factors (IR drop, ) that need to be taken care of The higher the , the more additional problems you get

The total chip area is easy to determine Not often the case that the whole chip is of interest Difficult to determine the exact area of one block within a chip

Integrated Systems Laboratory Throughput? How many data items can we process per unit time Do more per unit time (parallelization) Divide the process into multiple shorter steps (pipelining) Do each processing step faster (increase clock frequency)

Latency? Amount of time required to process one data item Can not be made faster by parallelization, and pipelining Necessary for feedback systems

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 47/74 What About Speed?

How fast is my circuit?

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 47/74 What About Speed?

How fast is my circuit?

Throughput? How many data items can we process per unit time Do more per unit time (parallelization) Divide the process into multiple shorter steps (pipelining) Do each processing step faster (increase clock frequency)

Latency? Amount of time required to process one data item Can not be made faster by parallelization, and pipelining Necessary for feedback systems

Integrated Systems Laboratory Clock [ns] 1/clock frequency Is more physical, relates to the sum of gate delays in the critical path Throughput [bits/s] How many bits are processed per unit time Good for streaming , bad for processors (FP) Operations per second (GFLOPS) for processors Could be given as peak, minimum or mean values Depends on the complexity of the operation (i.e. NOP) Clocks per instruction (CPI) for processors Reflects the parallelization of the instructions. Needs the clock frequency to determine actual performance

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 48/74 Units that tell us the Speed:

Clock frequency [MHz] Determined by the critical path All other things being equal, would compare actually the speed How much is done in one clock cycle can differ from circuit to circuit

Integrated Systems Laboratory Throughput [bits/s] How many bits are processed per unit time Good for streaming architectures, bad for processors (FP) Operations per second (GFLOPS) for processors Could be given as peak, minimum or mean values Depends on the complexity of the operation (i.e. NOP) Clocks per instruction (CPI) for processors Reflects the parallelization of the instructions. Needs the clock frequency to determine actual performance

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 48/74 Units that tell us the Speed:

Clock frequency [MHz] Determined by the critical path All other things being equal, would compare actually the speed How much is done in one clock cycle can differ from circuit to circuit Clock period [ns] 1/clock frequency Is more physical, relates to the sum of gate delays in the critical path

Integrated Systems Laboratory (FP) Operations per second (GFLOPS) for processors Could be given as peak, minimum or mean values Depends on the complexity of the operation (i.e. NOP) Clocks per instruction (CPI) for processors Reflects the parallelization of the instructions. Needs the clock frequency to determine actual performance

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 48/74 Units that tell us the Speed:

Clock frequency [MHz] Determined by the critical path All other things being equal, would compare actually the speed How much is done in one clock cycle can differ from circuit to circuit Clock period [ns] 1/clock frequency Is more physical, relates to the sum of gate delays in the critical path Throughput [bits/s] How many bits are processed per unit time Good for streaming architectures, bad for processors

Integrated Systems Laboratory Clocks per instruction (CPI) for processors Reflects the parallelization of the instructions. Needs the clock frequency to determine actual performance

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 48/74 Units that tell us the Speed:

Clock frequency [MHz] Determined by the critical path All other things being equal, would compare actually the speed How much is done in one clock cycle can differ from circuit to circuit Clock period [ns] 1/clock frequency Is more physical, relates to the sum of gate delays in the critical path Throughput [bits/s] How many bits are processed per unit time Good for streaming architectures, bad for processors (FP) Operations per second (GFLOPS) for processors Could be given as peak, minimum or mean values Depends on the complexity of the operation (i.e. NOP)

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 48/74 Units that tell us the Speed:

Clock frequency [MHz] Determined by the critical path All other things being equal, would compare actually the speed How much is done in one clock cycle can differ from circuit to circuit Clock period [ns] 1/clock frequency Is more physical, relates to the sum of gate delays in the critical path Throughput [bits/s] How many bits are processed per unit time Good for streaming architectures, bad for processors (FP) Operations per second (GFLOPS) for processors Could be given as peak, minimum or mean values Depends on the complexity of the operation (i.e. NOP) Clocks per instruction (CPI) for processors Reflects the parallelization of the instructions. Needs the clock frequency to determine actual performance

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 49/74 Timing Analysis Investigates All Paths

Des/Clust/Port Load Model Library ------sbox enG5K generic_lib

Point Incr Path ------InpxDI[0] (in) 0.0987 0.0987 f i_s_lut_0100/AddrxDI[0] (sub_lut_0100) 0.0000 0.0987 f i_s_lut_0100/U1/O (INV1S) 0.4191 0.5178 r i_s_lut_0100/U53/O (NR2) 0.3683 0.8861 f i_s_lut_0100/U83/O (NR2) 0.4559 1.3420 r i_s_lut_0100/U25/O (INV1S) 0.4845 1.8265 f i_s_lut_0100/U140/O (AOI22S) 0.2411 2.0676 r i_s_lut_0100/U19/O (ND3S) 0.0821 2.1497 f i_s_lut_0100/U225/O (MOAI1S) 0.1802 2.3299 f i_s_lut_0100/U226/O (AOI112S) 0.2397 2.5696 r i_s_lut_0100/U227/O (OAI222S) 0.1254 2.6951 f i_s_lut_0100/DataxDO[7] (sub_lut_0100) 0.0000 2.6951 f U16/O (MUX2) 0.2020 2.8971 f OupxDO[7] (out) 0.0000 2.8971 f data arrival time 2.8971 ------

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 50/74 CMOS Timing Depends on Capacitive Loading

A Z

B

Simple Example: 2-input NAND gate driving an

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 50/74 CMOS Timing Depends on Capacitive Loading

VDD VDD

A Z

B

GND GND

Transistor level schematic: The gates when implemented using CMOS logic

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 50/74 CMOS Timing Depends on Capacitive Loading

VDD VDD

1 ?

0

GND GND

For example: When A=1, B=0, one pMOS charges up the input of the inverter

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 50/74 CMOS Timing Depends on Capacitive Loading

VDD

Cg,pmos=W.L Id~W/L

Cg,nmos=W.L

GND

In ideal case, it is a driving two Faster circuit = larger current source or smaller capacitance

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 50/74 CMOS Timing Depends on Capacitive Loading

VDD

Id~W/L

Cinterconnect Cgate~W.L

GND GND

Parasitic interconnect capacitance is major contributor to delay Not possible to determine the speed if Cinterconnect is not known

Integrated Systems Laboratory Wireload models for synthesis Statistical lookup tables to estimate the load Capacitive load as a function of the output fanout Circuit dependent, for best results needs to be extracted for each circuit

Parasitic extraction Only after placement and routing has been done All interconnections are known, capacitance can be extracted Different accuracy levels: 2D, 2.5D, full 3D Results are used by the timing analysis to calculate delay

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 51/74 How Do I Know What the Parasitic Load Is?

Ignore parasitic loads Naive approach Especially in modern processes (<90 nm) parasitics are dominant

Integrated Systems Laboratory Parasitic extraction Only after placement and routing has been done All interconnections are known, capacitance can be extracted Different accuracy levels: 2D, 2.5D, full 3D Results are used by the timing analysis engine to calculate delay

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 51/74 How Do I Know What the Parasitic Load Is?

Ignore parasitic loads Naive approach Especially in modern processes (<90 nm) parasitics are dominant

Wireload models for synthesis Statistical lookup tables to estimate the load Capacitive load as a function of the output fanout Circuit dependent, for best results needs to be extracted for each circuit

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 51/74 How Do I Know What the Parasitic Load Is?

Ignore parasitic loads Naive approach Especially in modern processes (<90 nm) parasitics are dominant

Wireload models for synthesis Statistical lookup tables to estimate the load Capacitive load as a function of the output fanout Circuit dependent, for best results needs to be extracted for each circuit

Parasitic extraction Only after placement and routing has been done All interconnections are known, capacitance can be extracted Different accuracy levels: 2D, 2.5D, full 3D Results are used by the timing analysis engine to calculate delay

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 52/74 Impact of Process Voltage Temperature Corners

Worst Case Typical Case Best Case Voltage 1.08 V 1.2 V 1.32 V Temperature 125 ℃ 27 ℃ -40 ℃ Critical Path 3.49 ns 2.24 ns 1.59 ns Throughput 13.75 Gbit/s 21.42 Gbit/s 30.19 Gbit/s Relative Performance 64.2 % 100 % 140.9 %

A crypto- implemented in the 90 nm UMC process This is exactly the same circuit, everytime a different PVT corner is loaded and the timing report is repeated

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 53/74 Every Technology has a Speed Range

Example numbers for a 180 nm process

Speed Num. of Gates Frequency Range Notes Slow > 120 < 50 MHz No problem Normal 120-50 50-150 MHz Standard design Fast 50-20 150-300 MHz Needs attention Very Fast 20-10 300-600 MHz Involved design Ultra fast < 10 > 600 MHz Full custom design

For every process there is a speed range that can achieved easily Slower circuits do not take advantage of the process capabilities Faster circuits need disproportionately more effort for design

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 54/74 Accurate Timing is Difficult to Determine

Industry is interested in corner performance Even in the worst case no setup violations Even in the best case no hold violations Statistical models. Large variation between best and worst case

Wiring capacitance needs to be considered Synthesis tools rely on wireload models. Needs to be customized Reliable estimates only after final routing

Short periods/Faster clock rates more problematic f=1/T, small changes in critical path, large changes in frequency Problems with Clock distribution (skew), IR drop, thermal problems all increase with faster clocks

Integrated Systems Laboratory It does not scale infinitely, there are upper and lower limits Not all points are equally efficient AT graphs visualize this trade-off

We say generally because

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 55/74 Trade-off between Area and Speed

Generally you can always make a circuit faster by sacrificing area

Integrated Systems Laboratory Not all points are equally efficient AT graphs visualize this trade-off

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 55/74 Trade-off between Area and Speed

Generally you can always make a circuit faster by sacrificing area

We say generally because It does not scale infinitely, there are upper and lower limits

Integrated Systems Laboratory AT graphs visualize this trade-off

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 55/74 Trade-off between Area and Speed

Generally you can always make a circuit faster by sacrificing area

We say generally because It does not scale infinitely, there are upper and lower limits Not all points are equally efficient

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 55/74 Trade-off between Area and Speed

Generally you can always make a circuit faster by sacrificing area

We say generally because It does not scale infinitely, there are upper and lower limits Not all points are equally efficient AT graphs visualize this trade-off

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 56/74 AT Graphs Explained

5A

4A

3A

2A Increasing Area

A More Efficient Implementation Increasing Critical Path / Decreasing Operating Frequency 0 0 T 2T 3T 4T 5T 6T 7T 8T

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 56/74 AT Graphs Explained

5A

4A

Constant AT Product 3A Circuit Implementation

2A Increasing Area

A More Efficient Implementation Increasing Critical Path / Decreasing Operating Frequency 0 0 T 2T 3T 4T 5T 6T 7T 8T

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 56/74 AT Graphs Explained

5A

4A

Constant AT Product Implementations 3A with different constraints

2A Increasing Area

A More Efficient Implementation Increasing Critical Path / Decreasing Operating Frequency 0 0 T 2T 3T 4T 5T 6T 7T 8T

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 56/74 AT Graphs Explained

5A

4A Different constant AT lines

3A

2A Increasing Area

A More Efficient Implementation Increasing Critical Path / Decreasing Operating Frequency 0 0 T 2T 3T 4T 5T 6T 7T 8T

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 56/74 AT Graphs Explained

5A

4A Large variation of results typically 3A +/- 10%

2A Increasing Area

A More Efficient Implementation Increasing Critical Path / Decreasing Operating Frequency 0 0 T 2T 3T 4T 5T 6T 7T 8T

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 56/74 AT Graphs Explained

5A Overconstrained for Speed => Too large 4A

Efficient 3A Implementations

2A

Increasing Area Overconstrained for Area => Too slow A More Efficient Implementation Increasing Critical Path / Decreasing Operating Frequency 0 0 T 2T 3T 4T 5T 6T 7T 8T

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

20k

] 15k 2 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

20k 8 LUT

] 15k 2 8 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

20k 8 Affine Inverse Transform Multiplicative

] 15k 2 8 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

GF4 20k 8 GF4 Inverse Map

] 15k 2 8 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

7 20k LUT LUT 128 Entry 128 Entry

1

] 15k 2 8 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

64 Entry 64 Entry LUT LUT 6 20k 64 Entry 64 Entry LUT LUT

2

] 15k 2 8 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

16 16 16 16 LUT LUT LUT LUT 16 16 16 16 4 LUT LUT LUT LUT 20k 16 16 16 16 LUT LUT LUT LUT 16 16 16 16 LUT LUT LUT LUT 4

] 15k 2 8 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

3 20k

5

] 15k 2 8 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 57/74 Synthesis Results Depend on The HDL Code

AES SubBytes Implementations in 180 nm 25k

20k

] 15k Two stage 2 m µ

Area [ 10k 5k Arithmetic

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 58/74 As Well As the Version of Synthesis

AES SubBytes Implementations in 180 nm 25k

20k

] 15k 2 m µ

Area [ 10k

5k

0 0 0.5 1 1.5 2 2.5 3 3.5 Critical Path [ns]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 59/74 Synthesis Results Depend on the Parasitics 180 Faster

160

More 140 Efficient Smaller

25 120 kbit/s/gate

100

80

Area [kGate eq] 50 kbit/s/gate 60

40 100 kbit/s/gate

20 200 kbit/s/gate 1000 500 kbit/s/gate kbit/s/gate

0 Gbit/s 10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s 0 ns/bit 0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit Throughput [Gbit/s] Time per bit [ns/bit]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 59/74 Synthesis Results Depend on the Parasitics 180 Faster gmu SHA-2 160 BLAKE More Grostl 140 Efficient Smaller JH

Synthesis Run Results 25 Keccak 120 kbit/s/gate with Different Skein Timing Constraints 100

80

Area [kGate eq] Grostl 50 kbit/s/gate 60 Skein 40 JH 100 kbit/s/gate Keccak 20 200 kbit/s/gate 1000 kbit/s/gate 500 kbit/s/gate SHA-2 0 Gbit/s 10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s 0 ns/bit 0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit Throughput [Gbit/s] Time per bit [ns/bit]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 59/74 Synthesis Results Depend on the Parasitics 180 Faster gmu SHA-2 160 BLAKE More Grostl 140 Synthesis Run Efficient with Extracted Wireload Smaller JH Results for Different 25 Keccak 120 Timing Constraints kbit/s/gate Skein

100 Change in Performance

80 Area [kGate eq] 50 kbit/s/gate 60

40 100 kbit/s/gate

20 200 kbit/s/gate 1000 kbit/s/gate 500 kbit/s/gate

0 Gbit/s 10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s 0 ns/bit 0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit Throughput [Gbit/s] Time per bit [ns/bit]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 59/74 Synthesis Results Depend on the Parasitics 180 Faster gmu SHA-2 160 BLAKE More Grostl 140 Efficient Smaller JH Initial 25 Keccak 120 Synthesis kbit/s/gate Synthesis Skein with 100 Extracted Wireload

80 Area [kGate eq] 50 kbit/s/gate 60

40 100 kbit/s/gate

20 200 kbit/s/gate 1000 kbit/s/gate 500 kbit/s/gate

0 Gbit/s 10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s 0 ns/bit 0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit Throughput [Gbit/s] Time per bit [ns/bit]

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 59/74 Synthesis Results Depend on the Parasitics 180 Faster gmu SHA-2 160 Final BLAKE More Postlayout Grostl 140 Efficient Result Smaller JH Initial 25 Keccak 120 Synthesis kbit/s/gate Synthesis Skein with 100 Extracted Wireload

80 Area [kGate eq] 50 kbit/s/gate 60

40 100 kbit/s/gate

20 200 kbit/s/gate 1000 kbit/s/gate 500 kbit/s/gate

0 Gbit/s 10 Gbit/s 5.0 Gbit/s 3.3 Gbit/s 2.5 Gbit/s 2.0 Gbit/s 1.6 Gbit/s 0 ns/bit 0.1 ns/bit 0.2 ns/bit 0.3 ns/bit 0.4 ns/bit 0.5 ns/bit 0.6 ns/bit Throughput [Gbit/s] Time per bit [ns/bit]

Integrated Systems Laboratory A Suitable SHA-3 Module Optimized for small area (10 kGE) Runs up to 350 MHz Is able to generate digests for messages at a rate of 24 Mbit/s

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 60/74 Example: Adding a Crypto Block

Evolved EDGE Receiver System Digital front end of a receiver system Accepts data from the RF module Decodes data at a rate of up to 600 kbit/s Want to add a crypto module that optionally generates a digest of the received messages using a hash algorithm

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 60/74 Example: Adding a Crypto Block

Evolved EDGE Receiver System Digital front end of a receiver system Accepts data from the RF module Decodes data at a rate of up to 600 kbit/s Want to add a crypto module that optionally generates a digest of the received messages using a hash algorithm

A Suitable SHA-3 Module Optimized for small area (10 kGE) Runs up to 350 MHz Is able to generate digests for messages at a rate of 24 Mbit/s

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 61/74 Example: Adding a Crypto Block

DFE RF Decision Burst Decoder Interface Feedback Memory Equalizer

Digital Front End for Evolved Edge, Receiver Chain

Evolved EDGE receiver system Simplified block diagram of the receiver system

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 61/74 Example: Adding a Crypto Block

DFE Crypto RF Decision Burst Detector Decoder Module Interface Feedback Memory SHA-3 Equalizer

Digital Front End for Evolved Edge, Receiver Chain

Evolved EDGE receiver system Our goal: to add a SHA-3 module to the end

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 61/74 Example: Adding a Crypto Block

80k 50k 25k 240k 300k 10k

Physical Properties

Evolved EDGE receiver system Relative sizes of individual blocks, numbers in gate equivalents

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 61/74 Example: Adding a Crypto Block

80k 50k 25k 240k 300k 10k

Physical Properties

26 MHz 52 MHz 104 MHz 350 MHz

Evolved EDGE receiver system Clock rates at which each block is designed to operate

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 61/74 Example: Adding a Crypto Block

SHA

Decoder DFE

RF Interface Detector Mem

Evolved EDGE receiver system Possible placement of the modules on an actual die

Integrated Systems Laboratory Performance uncritical SHA-3 block very small part of overall system (<2%)˙ Gains in area/speed/power will not improve overall performance much

Should not cause too much extra work Goal: reduce the engineering overhead to add it to system Needs to adapt to the rest of the system as much as possible Use the same voltages and clock frequencies as the rest The interface, dft solutions more of a problem than anything else

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 62/74 Academic Performance not Always Relevant

In the previous Evolved EDGE example: SHA-3 block was way too fast Already as small as possible, no speed/area trade-off Could reduce voltage to save power, too much overhead

Integrated Systems Laboratory Should not cause too much extra work Goal: reduce the engineering overhead to add it to system Needs to adapt to the rest of the system as much as possible Use the same voltages and clock frequencies as the rest The interface, dft solutions more of a problem than anything else

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 62/74 Academic Performance not Always Relevant

In the previous Evolved EDGE example: SHA-3 block was way too fast Already as small as possible, no speed/area trade-off Could reduce voltage to save power, too much overhead

Performance uncritical SHA-3 block very small part of overall system (<2%)˙ Gains in area/speed/power will not improve overall performance much

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 62/74 Academic Performance not Always Relevant

In the previous Evolved EDGE example: SHA-3 block was way too fast Already as small as possible, no speed/area trade-off Could reduce voltage to save power, too much overhead

Performance uncritical SHA-3 block very small part of overall system (<2%)˙ Gains in area/speed/power will not improve overall performance much

Should not cause too much extra work Goal: reduce the engineering overhead to add it to system Needs to adapt to the rest of the system as much as possible Use the same voltages and clock frequencies as the rest The interface, dft solutions more of a problem than anything else

Integrated Systems Laboratory Power [W] Could be given as peak, minimum or mean values Important to determine: power supply, rail thicknesses number of power pins and distribution Power density [W/cm2] limiting factor in large chips Does not tell us how much is done in unit time, favors serialization Low-power does not necessarily mean your battery will last longer

Energy [J] Does not tell us how long the operation will take Usually favors parallelism Low-energy means the battery will last longer

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 63/74 Energy and/or Power

How green is my circuit?

Integrated Systems Laboratory Energy [J] Does not tell us how long the operation will take Usually favors parallelism Low-energy means the battery will last longer

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 63/74 Energy and/or Power

How green is my circuit?

Power [W] Could be given as peak, minimum or mean values Important to determine: power supply, rail thicknesses number of power pins and distribution Power density [W/cm2] limiting factor in large chips Does not tell us how much is done in unit time, favors serialization Low-power does not necessarily mean your battery will last longer

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 63/74 Energy and/or Power

How green is my circuit?

Power [W] Could be given as peak, minimum or mean values Important to determine: power supply, rail thicknesses number of power pins and distribution Power density [W/cm2] limiting factor in large chips Does not tell us how much is done in unit time, favors serialization Low-power does not necessarily mean your battery will last longer

Energy [J] Does not tell us how long the operation will take Usually favors parallelism Low-energy means the battery will last longer

Integrated Systems Laboratory Ensure safe operation of high performance systems Moore’s Law gives us more transistors per unit area Power density increases: 100 W/cm2 or even more Most of this power is dissipated as heat If not removed in time, chip will melt Solutions: More efficient cooling, reducing power consumption

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 64/74 Two Separate Energy/Power Related Problems

Increase efficiency of constrained devices Circuit power supplied by battery, radiated (RFID), or harvested Energy determines operation period for battery Harvesting, radiation efficiency determines peak power Usual goal: more functionality with same energy/power Not so much: increasing battery life, or reducing power

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 64/74 Two Separate Energy/Power Related Problems

Increase efficiency of constrained devices Circuit power supplied by battery, radiated (RFID), or harvested Energy determines operation period for battery Harvesting, radiation efficiency determines peak power Usual goal: more functionality with same energy/power Not so much: increasing battery life, or reducing power

Ensure safe operation of high performance systems Moore’s Law gives us more transistors per unit area Power density increases: 100 W/cm2 or even more Most of this power is dissipated as heat If not removed in time, chip will melt Solutions: More efficient cooling, reducing power consumption

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 65/74 Power Consumption Determines Power Routing

Maximum current density 1-2 mA per 1 µm of wire Skin effect High frequency current flows on the outside. Extreme wide are not helpful Resistance of wires Voltage drop on the power rails, not all cells have the same VDD, work slower Via current density Vias connect different metal layers, fixed geometry, narrower than the metal

Integrated Systems Laboratory Static Mainly caused by leakage in transistors (area dependent) Increases as transistors get faster (smaller technology, lower VT) Consumed even if circuit is not doing anything Many technology options to , body biasing, high VT transistors

Modern low power/energy designs try to Balance dynamic and static consumption at around 50 %

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 66/74 Two Components of Power/

Dynamic Directly caused by activity in circuit (application dependent) Depends also on total capacitance switched by the activity (area) The rate of the changing activity (frequency) And the square of the supply voltage Reducing voltage is most promising, however not much available range

Integrated Systems Laboratory Modern low power/energy designs try to Balance dynamic and static consumption at around 50 %

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 66/74 Two Components of Power/Energy Consumption

Dynamic Directly caused by activity in circuit (application dependent) Depends also on total capacitance switched by the activity (area) The rate of the changing activity (frequency) And the square of the supply voltage Reducing voltage is most promising, however not much available range

Static Mainly caused by leakage in transistors (area dependent) Increases as transistors get faster (smaller technology, lower VT) Consumed even if circuit is not doing anything Many technology options to counter, body biasing, high VT transistors

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 66/74 Two Components of Power/Energy Consumption

Dynamic Directly caused by activity in circuit (application dependent) Depends also on total capacitance switched by the activity (area) The rate of the changing activity (frequency) And the square of the supply voltage Reducing voltage is most promising, however not much available range

Static Mainly caused by leakage in transistors (area dependent) Increases as transistors get faster (smaller technology, lower VT) Consumed even if circuit is not doing anything Many technology options to counter, body biasing, high VT transistors

Modern low power/energy designs try to Balance dynamic and static consumption at around 50 %

Integrated Systems Laboratory For C you need the placed and routed design This is problematic, only available at the end of design flow. Circuit needs to be simulated with back-annotated , time consuming α activity depends on the vectors Power directly depends on the activity. Finding correct vectors not trivial

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 67/74 How to Determine Power

2 Ptotal = Pstatic + Pdynamic ∼ α · C · f · VDD

Pstatic is relatively simple As soon as you have the netlist, you can estimate it 2 f and VDD are known Both are design parameters, that would be known after synthesis

Integrated Systems Laboratory α activity depends on the vectors Power directly depends on the activity. Finding correct vectors not trivial

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 67/74 How to Determine Power

2 Ptotal = Pstatic + Pdynamic ∼ α · C · f · VDD

Pstatic is relatively simple As soon as you have the netlist, you can estimate it 2 f and VDD are known Both are design parameters, that would be known after synthesis For C you need the placed and routed design This is problematic, only available at the end of design flow. Circuit needs to be simulated with back-annotated netlist, time consuming

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 67/74 How to Determine Power

2 Ptotal = Pstatic + Pdynamic ∼ α · C · f · VDD

Pstatic is relatively simple As soon as you have the netlist, you can estimate it 2 f and VDD are known Both are design parameters, that would be known after synthesis For C you need the placed and routed design This is problematic, only available at the end of design flow. Circuit needs to be simulated with back-annotated netlist, time consuming α activity depends on the vectors Power directly depends on the activity. Finding correct vectors not trivial

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 68/74 It is Easy to Make a Mistake Estimating Power Make sure that you: Specify voltage and clock frequency Any power figure without these is useless Explain how circuit activity was obtained Assumed a global activity value Derived from input activity Actual functional simulation vectors Disclose how the power was measured Estimated during synthesis Back-annotated delays with post-layout netlist Actual measurement on manufactured chip For power estimations (not actual measurements) State operating conditions as well: best, typical, worst case Actual measurements not easy Difficult to measure current for fast switching circuits

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 69/74 What About Compound Performance Measures?

Quite common technique to compare different architectures Throughput per area Used commonly in cryptographic hardware , shows trade-off between area and speed.

Energy delay product Popular with architectures, shows the trade-off between energy and speed. Few variants of En · Dm, with different n and m exist.

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 69/74 What About Compound Performance Measures?

Quite common technique to compare different architectures Throughput per area Used commonly in cryptographic hardware papers, shows trade-off between area and speed.

Energy delay product Popular with computer architectures, shows the trade-off between energy and speed. Few variants of En · Dm, with different n and m exist.

Meaningless, unless the circuit has been designed for this measure

Integrated Systems Laboratory Can not get a 1 Gbit/s Keccak using only 2.5 kGEs

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 70/74 Example: Throughput Per Area in 65 nm

Keccak SHA-2 TpA: 395 kbit/s·GE TpA: 215 kbit/s·GE Keccak is almost twice as fast SHA-2 is slower/larger version Area: ? Area: ? Is it the same area... Is it twice as large as Keccak... Speed: ? Speed: ? .. but double the speed ? ... at the same speed ?

Compound Performance Metrics can be Misleading TpA shows neither the speed nor the area of each design It implies that Throughput can be traded-off against Area without limitation. This is not true.

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 70/74 Example: Throughput Per Area in 65 nm

Keccak SHA-2 TpA: 395 kbit/s·GE TpA: 215 kbit/s·GE Keccak is almost twice as fast SHA-2 is slower/larger version Area: 80 kGE Area: 25 kGE More than 3 times the area In reality SHA-2 is small Speed: 32 Gbit/s Speed: 5.4 Gbit/s and 6 times the speed but much slower Compound Performance Metrics can be Misleading TpA shows neither the speed nor the area of each design It implies that Throughput can be traded-off against Area without limitation. This is not true. Can not get a 1 Gbit/s Keccak using only 2.5 kGEs

Integrated Systems Laboratory Became something more Added block RAMs, programmable I/Os, PLLs, DSP slices Newest generation is even more powerful I.e. the Zynq contains a dual core ARM A9.

Resource allocation problem FPGAs have a collection of resources If you can make use of them good, if not you don’t save anything In some cases, you can use a smaller (cheaper FPGA), cost benefit

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 71/74 Comparing FPGAs and ASICs

Started as programmable logic Alternative for glue logic, replacement for 74xx families

Integrated Systems Laboratory Newest generation is even more powerful I.e. the Xilinx Zynq contains a dual core ARM A9.

Resource allocation problem FPGAs have a collection of resources If you can make use of them good, if not you don’t save anything In some cases, you can use a smaller (cheaper FPGA), cost benefit

Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 71/74 Comparing FPGAs and ASICs

Started as programmable logic Alternative for glue logic, replacement for 74xx families Became something more Added block RAMs, programmable I/Os, PLLs, DSP slices

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 71/74 Comparing FPGAs and ASICs

Started as programmable logic Alternative for glue logic, replacement for 74xx families Became something more Added block RAMs, programmable I/Os, PLLs, DSP slices Newest generation is even more powerful I.e. the Xilinx Zynq contains a dual core ARM A9.

Resource allocation problem FPGAs have a collection of resources If you can make use of them good, if not you don’t save anything In some cases, you can use a smaller (cheaper FPGA), cost benefit

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 72/74 FPGAs have a Fast Design Flow

Typically a single for the entire design flow

Good post-layout results All resources on FPGA characterized, good speed/power estimations Fast implementation The complete design can be measured/tested immediately

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 73/74 Estimating Area Usage on FPGAs Problematic

FPGAs contain a set of resources Not all resources used by the same amount Design uses x % of LUTs, and y % of BRAMs and z % of DSPs

FPGA families differ in their resources The type and particular mix of resources in an FPGA varies

In practice most resources can not be used 100 % In most cases you will not be able to put 5 designs that use 20 % of the LUTs in to the same FPGA

Resources are there no matter what Pipelining FFs are in every LUT (ALM). If you do not use them they will be by-passed. A circuit with 20 stages is not larger than one with only one stage.

Integrated Systems Laboratory Introduction Cost Design Flow Area Speed Area/Speed Trade-offs Power Conclusions 74/74 Last Words

What Did We Discuss? Cost structure of IC design

Area and Speed of IC circuits What are the common pitfalls

Academic vs Industrial use of EDA tools Different goals in both approaches, but EDA tools designed for industrial goals

Energy/Power Why is it the more difficult parameter to determine

If you review a Now you know which ones you can reject.

Integrated Systems Laboratory