ECE473 Architecture and Organization

Technology Trends

Lecturer: Prof. Yifeng Zhu

Fall, 2014

Portions of these slides are derived from: Dave Patterson © UCB

Course Website: http://arch.eece.maine.edu/ece473

ECE473 Lec 1.1

Author of the Text Book

Communications of the ACM, Volume 49, No. 4, April 2006, Page 31

Outline

• Technology Trends
• Introduction to Computer Architecture

What If Your Salary?

• Parameters
  – $16 base
  – 59% growth/year
  – 40 years

• Initially $16 → buy book
• 3rd year’s $64 → buy computer game
• 16th year’s $27,000 → buy car
• 22nd year’s $430,000 → buy house
• 40th year’s > billion dollars → buy a lot

You have to find fundamental new ways to spend money!
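The figures on this slide follow from plain compound growth. The sketch below assumes the salary is multiplied by 1.59 once per year starting from the $16 base; the function name `salary_after` is illustrative and does not come from the slides.

```python
def salary_after(years, base=16.0, growth=0.59):
    """Salary after compounding `growth` per year for `years` years."""
    return base * (1.0 + growth) ** years


if __name__ == "__main__":
    # Years quoted on the slide: book, computer game, car, house, "a lot".
    for year in (0, 3, 16, 22, 40):
        print(f"year {year:2d}: ${salary_after(year):,.0f}")
```

Under these assumptions the 3rd, 16th and 22nd years land close to the rounded $64, $27,000 and $430,000 quoted above, and the 40th year exceeds a billion dollars.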

Birth of the Revolution -- The Intel 4004

First microprocessor in 1971

• Intel 4004
• 2300 transistors
• Barely a processor
• Could access 300 bytes of memory

@intel
Introduced November 15, 1971
108 KHz, 50 KIPs, 2300 10µm transistors

2002 – Intel® Pentium® 4 Processor

November 14, 2002

@3.06 GHz, 533 MT/s

1099 SPECint_base2000*
1077 SPECfp_base2000*

55 Million transistors, 130 nm process

@intel

Source: http://www.specbench.org/cpu2000/results/

2002 - Intel Itanium 2 Processor for Servers

• 64-bit processors
• .18µm bulk, 6 layer Al process
• 8 stage, fully stalled in-order pipeline
• Symmetric six integer-issue design
• IA32 execution engine integrated
• 3 levels of cache on-die totaling 3.3MB
• 221 Million transistors
• 130W @1GHz, 1.5V
• 421 mm2 die
• 142 mm2 CPU core

[Die photo, 21.6 mm x 19.5 mm; labeled blocks: Branch Unit, Floating Point Unit, IA32 Pipeline Control, L1I cache, L1D cache, ALAT, Integer Datapath, Int RF, Multi-Media unit, CLK, HPW, DTLB, L2D Array and Control, L3 Tag, Bus Logic, L3 Cache]

2006 - Intel Core 2 Duo Processors for Desktop

PERFORMANCE 40%

POWER 40% …relative to Intel® Pentium® D 960

When compared to the Intel® Pentium® D processor 960. Performance measured using SPECint* rate base2000. Actual performance may vary. Energy efficiency based on Thermal Design Power (TDP) measurement. See http://www.intel.com/performance for more information.

2008-2014 - Intel Core i7 64-bit x86-64

a 5-stage pipelined processor (group project, two members)

• Successor to the family
• 32 nm CMOS process
• Adding GPU into the processor
• Intel Core i7 uses simultaneous multi-threading (SMT)
• Scales up number of threads supported
• 4 SMT cores, each supporting 4 threads, appears as 16 cores

Intel 4th Generation Core Processor: “Haswell”

• More than 90% of processors shipping today include a GPU on die • Low energy use is a key design goal

4-core GT2 Desktop: 35 W package 2-core GT2 Ultrabook: 11.5 W package

Integrating GPU into CPU Chip: AMD Fusion

Integrating GPU into CPU Chip: AMD Fusion, 2011

Integrating GPU into CPU Chip: AMD Trinity, May 2012

Embedded Processors: Intel

Technology constantly on the move!

• Num of transistors not limiting factor
  – Currently ~ 1 billion transistors/chip
  – Problems:
    » Too much Power, Heat, Latency
    » Not enough Parallelism
• 3-dimensional chip technology?
  – Sandwiches of silicon
  – “Through-Vias” for communication
• On-chip optical connections?
  – Power savings for large packets
• The Intel® Core™ i7 microprocessor (“Nehalem”)
  – 4 cores/chip
  – 45 nm, Hafnium hi-k dielectric
  – 731M Transistors
  – Shared L3 Cache - 8MB
  – L2 Cache - 1MB (256K x 4)

Amazing Underlying Technology Change

• In 1965, Gordon Moore (co-founder of Intel) sketched out his prediction of the pace of silicon technology.

• Moore's Law: The number of transistors incorporated in a chip will approximately double every 24 months.

• Decades later, Moore's Law remains true.

From Intel

Technology Trends: Moore’s Law

• Gordon Moore (co-founder of Intel) observed in 1965 that the number of transistors on a chip doubles about every 24 months.
• In fact, the number of transistors on a chip doubles about every 18 months.
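A doubling period can be checked against the transistor counts quoted earlier in these slides (2300 for the Intel 4004 in 1971, 55 million for the 2002 processor), and it can be converted into a yearly growth rate. The sketch below assumes smooth exponential growth; both function names are illustrative.

```python
import math


def doubling_period_months(count_start, count_end, years):
    """Months per doubling implied by growth from count_start to count_end."""
    doublings = math.log2(count_end / count_start)
    return years * 12.0 / doublings


def annual_growth(doubling_months):
    """Fractional growth per year for a given doubling period in months."""
    return 2.0 ** (12.0 / doubling_months) - 1.0


if __name__ == "__main__":
    # Transistor counts and years as quoted on the earlier slides.
    months = doubling_period_months(2300, 55e6, 2002 - 1971)
    print(f"1971 -> 2002: one doubling every {months:.1f} months")
    for period in (18, 24):
        print(f"doubling every {period} months = {annual_growth(period):.0%} per year")
```

With these inputs the implied doubling period is a little over two years, and an 18-month doubling corresponds to roughly 59% growth per year, the same rate used in the salary example.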

From Intel

How did we do so far?

Moore's Law applied to the travel industry
• A flight from New York to Paris

IBM Power4 Dual Processor on a Chip

Two cores (~30M transistors each)

Large Shared L2: Multi-ported: 3 independent slices

L3 & Mem Controller: L3 tags on-die for full-speed coherency checks

Chip-to-Chip & MCM-to-MCM Fabric: Glueless SMP

@IBM

*Other names and brands may be claimed as the property of others

AMD64 Dual Core Processor

• Two AMD Opteron™ CPU cores on one single die, each with 1MB L2 cache
• 90nm, ~205 million transistors*
  – Approximately same die size as 130nm single-core AMD Opteron processor*
• 95 watt power envelope fits into 90nm power infrastructure
• Dual-core processors for client market are expected to follow

[Block diagram: Core 0 and Core 1, each with a 1-MB L2, and the Northbridge]

*Based on current revisions of the design

Niagara: Multithreaded SPARC Processor

Niagara Architecture

Cell Overview

Cell Prototype Die (Pham et al, ISSCC 2005)
[Die photo with labeled blocks: SPU (x8), PPU, MIC, MIB, RRAC, BIC]

• IBM/Toshiba/Sony joint project - 4-5 years, 400 designers
  – 234 million transistors, ~80 watts at 4+ GHz
  – 256 Gflops (billions of floating point operations per second)
  – Used in Sony PlayStation 3

Cell Overview - Main Processor

• One 64-bit PowerPC processor
  – 4+ GHz, dual issue, two threads
  – 512 kB of second-level cache

Cell Overview - SPE

• Eight Synergistic Processor Elements
  – Or “Streaming Processor Elements”
  – Co-processors with dedicated 256kB of memory (not cache)

Cell Overview - Memory and I/O

• Dual Rambus XDR memory controllers (on chip)
  – 25.6 GB/sec of memory bandwidth
• 76.8 GB/s chip-to-chip bandwidth (to off-chip GPU)

What else except desktop and server processors?

Embedded Processors

Slides from ECE692 of Jie Hu@NJIT

World Embedded Systems Market, 2003, 2004 and 2009

                       2003     2004     2009   AAGR% 2004-2009
Embedded Software     1,401    1,641    3,448   16.0
Embedded IC          34,681   40,539   78,746   14.2
Embedded Boards       3,401    3,693    5,950   10.0
Total                39,483   45,873   88,144   14.0

from “High Growth Expected in the Worldwide Market in the Next Five Years”, 04/28/2005

Intel® Atom™

Slides from ECE692 of Jie Hu@NJIT

[Die photo, 13x14mm; labeled blocks: ADDR IO, DATA IO, FUSE, BUS, L2, CORE, PLL]

• Intel® Atom, 45 nm CMOS
• Used mostly for netbooks
• Ultra-Low Power, Small Form Factor, Embedded Applications

Tear-down of iPhone 5

photo from ifixit.com

• Apple’s A6 combines an ARM based dual-core CPU with a triple-core GPU

Tear-down of iPhone 4

• Apple’s A4 combines an ARM based CPU with a PowerVR GPU with an emphasis on power efficiency
• ARM Cortex-A8 core is used in A4

photo from ifixit.com

Tear-Down 2nd Generation Nexus 7

• Qualcomm APQ8064 Snapdragon S4 Pro Quad-Core CPU (includes the Adreno 320 GPU)
• Elpida J4216EFBG 512 MB DDR3L SDRAM (four ICs for 2 GB total)
• Analogix ANX7808 SlimPort Transmitter
• BQ51013B Inductive Charging Controller
• Qualcomm Atheros WCN3660 WLAN a/b/g/n, Bluetooth 4.0 (BR/EDR+BLE), and FM Radio Module
• SK Hynix H26M51003EQR 16 GB eMMC NAND Flash
• Qualcomm PM8921 Quick Charge Battery Management IC

New Challenge: Slowdown in Joy’s law of Performance

[Figure: Performance (vs. VAX-11/780), log scale from 1 to 10000, for 1978 to 2006; annotated with 25%/year, 52%/year, ??%/year and a 3X gap. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006]

• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present

⇒ Sea change in chip design: multiple “cores” or processors per chip

Power Density

[Figure: power density in Watts/cm2 (log scale from 1 to 1000) versus process generation (1.5µm, 1µm, 0.7µm, 0.5µm, 0.35µm, 0.25µm, 0.18µm, 0.13µm, 0.1µm, 0.07µm). Processors plotted: Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4. Reference levels: Hot plate, Nuclear Reactor, Rocket Nozzle, Sun's Surface]

CPU determines performance?

In terms of speed, CPU performance has increased dramatically, but memory and disk speeds have increased only a little. This has led to dramatic changes in architecture, operating systems, and programming practices.
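To get a feel for how quickly such a gap opens, the sketch below compounds two growth rates side by side. It assumes the 52%/year processor figure and the ~10%/year memory figure quoted elsewhere in these slides, applied over the 16 years from 1986 to 2002; the function name `speed_gap` is illustrative.

```python
def speed_gap(cpu_rate, mem_rate, years):
    """Ratio of CPU speedup to memory speedup after `years` of compounding."""
    return ((1.0 + cpu_rate) / (1.0 + mem_rate)) ** years


if __name__ == "__main__":
    # Rates are the per-year figures quoted in the slides (assumed constant).
    for years in (1, 5, 10, 16):
        gap = speed_gap(0.52, 0.10, years)
        print(f"after {years:2d} years the CPU is {gap:,.1f}x further ahead")
```

Under these assumptions the relative gap grows by about 38% per year, which compounds to a factor of well over one hundred across the 16-year span.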

Memory Technology

• DDR: Double Data Rate SDRAM
• Bandwidth of a memory module

$SB_{max} = SB_{bus} \times f_{bus} \times 2$, where

– $SB_{max}$: max. memory bandwidth

– $SB_{bus}$: width of the memory bus (64 bits = 8 bytes)

– $f_{bus}$: frequency of the memory bus

– 2: DDR transfers data on both edges of the bus clock

http://www.kingston.com/newtech
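The bandwidth formula above is a single multiplication. The sketch below evaluates it for a 64-bit (8-byte) bus; the 200 MHz bus clock is an assumed example value, not a figure from the slides, and the function name is illustrative.

```python
def ddr_peak_bandwidth(bus_width_bytes, bus_freq_hz):
    """Peak bandwidth in bytes/s: bus width x bus frequency x 2 transfers per clock."""
    return bus_width_bytes * bus_freq_hz * 2


if __name__ == "__main__":
    # 8-byte bus as on the slide; 200 MHz is an assumed example clock.
    bandwidth = ddr_peak_bandwidth(8, 200_000_000)
    print(f"peak bandwidth: {bandwidth / 1e9:.1f} GB/s")
```

This is a peak figure only; sustained bandwidth on a real module is lower.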

Memory Technology

Memory speed improves ~10% per year.

http://en.wikipedia.org/wiki/DDR_SDRAM

Disk Technology

Photo of Disk Head, Arm, Actuator

[Photo labels: Spindle, Arm, Head, Actuator, Platters (12)]

Disk Technology

2007: Hitachi releases the 1 TB Hitachi Deskstar 7K1000 (1 Terabyte (TB) = 1024 Gigabytes (GB))

Disk capacity improves about 60% per year.

Disk Device Terminology

[Figure labels: Platter, Outer Track, Inner Track, Sector, Head, Arm, Actuator]

Disk Latency = Seek Time + Rotation Time + Transfer Time

Order-of-magnitude times for 4K transfers:
• Seek: 8 ms or less
• Rotate: 4.2 ms @ 7200 rpm
• Transfer: 1 ms @ 7200 rpm
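The rotation figure can be derived from the spindle speed. The sketch below assumes that "Rotation Time" means the average rotational latency, taken as half of one revolution, and plugs in the seek and transfer values quoted above; the function names are illustrative.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency in ms, assumed to be half of one revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def disk_latency_ms(seek_ms, rpm, transfer_ms):
    """Disk latency = seek time + rotation time + transfer time (all in ms)."""
    return seek_ms + rotational_latency_ms(rpm) + transfer_ms


if __name__ == "__main__":
    print(f"rotation at 7200 rpm: {rotational_latency_ms(7200):.1f} ms")
    print(f"total for a 4K transfer: {disk_latency_ms(8.0, 7200, 1.0):.1f} ms")
```

Half a revolution at 7200 rpm is about 4.2 ms, matching the slide, so a 4K access costs on the order of 13 ms when the worst-case 8 ms seek applies.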

Technology → dramatic change

• Processor
  – Transistor number in a chip: about 59% per year
  – : about 20% per year
• Memory
  – DRAM capacity: about 60% per year (4x every 3 years)
  – Memory speed: about 10% per year
  – Cost per bit: improves about 25% per year
• Disk
  – capacity: about 60% per year
  – Total use of data: 100% per 9 months!
• Network Bandwidth
  – 10 years: 10Mb → 100Mb
  – 5 years: 100Mb → 1 Gb
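Several items in this list are stated as "a factor of N over some period" rather than as a yearly rate. The sketch below converts them to a common per-year basis, assuming smooth compound growth; the function name `annual_rate` and the dictionary labels are illustrative.

```python
def annual_rate(factor, years):
    """Equivalent compound growth per year for a `factor` increase over `years`."""
    return factor ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # (growth factor, period in years), taken from the list above.
    trends = {
        "DRAM capacity, 4x every 3 years": (4, 3),
        "Total use of data, 2x every 9 months": (2, 0.75),
        "Network, 10Mb to 100Mb in 10 years": (10, 10),
        "Network, 100Mb to 1Gb in 5 years": (10, 5),
    }
    for name, (factor, years) in trends.items():
        print(f"{name}: {annual_rate(factor, years):.0%} per year")
```

On this basis "4x every 3 years" is indeed close to 60% per year, while doubling every 9 months corresponds to roughly 150% per year, far faster than disk capacity grows.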

Methodology

[Diagram of the design cycle: Evaluate Existing Systems for Bottlenecks; Simulate New Designs and Organizations; Implement Next Generation System. Inputs: Benchmarks, Technology Trends, Workloads, Implementation Complexity]

Architecture design is an iterative process: searching the space of possible designs at all levels of computer systems

What is Architecture?

• Original sense: – Taking a range of building materials, putting together in desirable ways to achieve a building suited to its purpose

• In Computer Engineering: – Similar: how parts are put together to achieve some overall goal – Examples: the architecture of a chip, of the Internet, of an enterprise database system, an email system, a cable TV distribution system

Adapted from David Clark’s, What is “Architecture”?

What is “Computer Architecture”?

• Instruction Set Architecture (ISA)
  – Visible to the programmer
  – E.g., IA-32, IA-64, SPARC, ARM,…
• Organization
  – High-level detail of the system
    » Does it have a cache, full FP support, etc?
• Hardware
  – Specifics
    » E.g., at 3GHz vs. Core Duo at 2 GHz

The Rest of this Course

• How are modern ISAs arranged?
• How do you organize these millions/billions of transistors to implement the ISA?
  – data-processing (workers)
  – control-logic (managers)
  – memory (warehouse)
  – parallel systems (multiple worksites)
• How to bridge the performance gap between CPU and memory?
  – Cache
  – Redundant Array of Inexpensive Disks (RAID)

How Fast is it?

A beam of light travels less than a tenth of an inch during the time it takes a 45nm transistor to switch on and off.

Saturating Performance

• From 1960-1980, computer design was only about performance
• This is the era of Power
• Next will be era of Reliability

Summary

1. Moore’s Law: The number of transistors incorporated in a chip will approximately double every 18 months.

2. CPU speed increases dramatically, but the speed of memory, disk and network increases slowly.

3. Architecture design is an iterative process. Measure performance: Benchmarks
