The University of Texas at Arlington

Lecture-2: Evolution of the High-Performance Microprocessor (An Intel Perspective)

CSE 4342/5343 – 001 Embedded Systems II – Real Time Data Acquisition and Control Systems
Based heavily on material provided by Dr. Roger Walker
Some pictures are from Wikipedia

Reading Assignment

• Reading assignment for next class: read Chapter 1 from textbook.

Microprocessors of the 1970’s

• First microprocessor, the Intel 4004: November 15, 1971, 10 years before the IBM PC. The 4004 was a 4-bit processor with a clock speed of 108,000 cycles per second (0.108 MHz); it contained 2,300 transistors and was built on a 10-micron process.
• April 1972: Intel released the 8008 (8-bit processor, 0.2 MHz, 16 KB address space).

Microprocessors of the 1970’s (contd.)

• April 1974: Intel released the 8080 (8-bit processor, 2 MHz, 64 KB address space). The Altair 8800 (first personal computer) used the 8080.
• 1976: the 8085 processor, an upgraded 8080, was primarily used in embedded controllers.
• Other processors developed in the late 1970’s were used in some of the first personal and home computers. These include the Zilog Z80 (1976) and the MOS Technology 6502 (and later the 6510, used in the Commodore 64).

Microprocessors of the Late 1970’s

• The Apple I, a single-board computer, was introduced in 1976 and used the MOS 6502.
• Other popular PCs were the Apple II (1977), the Commodore PET, and Radio Shack’s TRS-80.
• In 1978 Intel released the 8086 (the original x86 instruction set). The 8086 was a 16-bit processor with 16-bit internal and external data buses, 29,000 transistors, 20-bit addressing (a 1 MB address space), and a 5 MHz clock speed.
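
The address-space figures quoted on these slides follow directly from the address width: n address bits reach 2^n byte locations. The short sketch below checks that arithmetic. Only the 8086's 20-bit width is stated on the slides; the 14-bit and 16-bit widths are inferred here from the 16 KB and 64 KB figures, and the function names are made up for illustration.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits


def human_readable(size_bytes: int) -> str:
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    if size_bytes >= 1024 * 1024:
        return f"{size_bytes // (1024 * 1024)} MB"
    if size_bytes >= 1024:
        return f"{size_bytes // 1024} KB"
    return f"{size_bytes} bytes"


# 20 bits is stated for the 8086; 14 and 16 bits are inferred from the
# 16 KB (8008) and 64 KB (8080) address spaces given on the slides.
for name, bits in [("8008", 14), ("8080", 16), ("8086", 20)]:
    print(f"{name}: {bits} address bits -> {human_readable(address_space_bytes(bits))}")
```

Each extra address bit doubles the reachable memory, which is why going from 16 to 20 bits takes the address space from 64 KB to 1 MB.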


Late 1970’s to Early 1980’s

• Intel also introduced the 8088. The 8088 used the same internal core as the 8086 but had an 8-bit external data bus and was thus less expensive to design into systems.
• Because of the success of the 8086/8088, Intel has attempted to maintain compatibility with this processor in all future x86 processors.
• IBM used the 8088 in the first IBM PC, in August of 1981.

Intel Microprocessor Generations

• IBM introduced the XT PC in 1983, a slightly enhanced version of the PC, replacing the cassette port with the floppy drive (8-bit ISA bus).
• The Intel 8088/8086 with the 8087 co-processor are often referred to as the First Generation of processor chips from Intel. The 8086 had both a 20-bit address bus and a 16-bit data bus.
• The 2nd Generation processors began with the 80286 (and the 80287 co-processor). This processor, introduced in the early 1980’s (typically known as the 286), was used in the IBM AT PC.

3rd and 4th Intel Microprocessor Generations

• In 1985 Intel released the 32-bit 386 processors. The 386, along with the 80387 co-processor (which ran at the same speed as the CPU), was known as the 3rd Generation of processors (the SX variant still used a 16-bit external data bus). 12-40 MHz clocks.
• 4th Generation (486): the 486 processors made GUIs in Windows and OS/2 a realistic option for everyday use. The 486 included a built-in co-processor (except for SX models) and an internal level-1 cache. 16-100 MHz clocks. 32-bit data and address buses.

The Intel (IA-32 and IA-64) Architecture

• The Intel 386 architecture was called IA-32 (or Intel Architecture).
• In 1994 Intel introduced the 64-bit (IA-64) architecture in the Itanium processors for servers and desktop PCs.
• It took 10 years after the 386 was introduced before Microsoft made full use of IA-32, in Windows 95 (1995).
• IA-64 is not an extension of the IA-32 architecture.
• AMD developed a 64-bit extension to IA-32 called AMD64.
• Intel followed the next year with EM64T, which is compatible with AMD64.


Instructions per Cycle

• The 8086/8088 took an average of 12 cycles per instruction.
• The 286/386 took an average of 4.5 cycles per instruction.
• The 486 took an average of 2 cycles per instruction.
• The 5th generation processors (Pentium) were the beginning of multiple instructions per cycle. Making processors faster was no longer just a matter of increasing clock rates.
• Intel calls the capability to execute more than one instruction at a time superscalar technology.
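
To see what the cycles-per-instruction (CPI) figures above mean in practice, a rough throughput estimate is clock rate divided by CPI. The sketch below applies that to clock rates taken from earlier slides (5 MHz for the 8086, and the upper ends of the 12-40 MHz and 16-100 MHz ranges for the 386 and 486). Pairing those particular clocks with the average CPI values is an assumption made for illustration, so treat the results as ballpark numbers only.

```python
def mips(clock_mhz: float, cycles_per_instruction: float) -> float:
    """Rough throughput in millions of instructions per second."""
    return clock_mhz / cycles_per_instruction


# (name, clock in MHz, average CPI). Clocks and CPI values come from the
# slides; combining them this way is an illustrative assumption.
examples = [
    ("8086 @ 5 MHz", 5.0, 12.0),
    ("386 @ 40 MHz", 40.0, 4.5),
    ("486 @ 100 MHz", 100.0, 2.0),
]

for name, clock_mhz, cpi in examples:
    print(f"{name}: about {mips(clock_mhz, cpi):.2f} MIPS")
```

Note that lowering CPI from 12 to 2 gives a 6x gain at the same clock rate, independent of any clock-speed increase.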

Intel Pentium Processors

• 5th Generation (586), Pentium processors (October 1992, 1st generation of Pentium processors):
– had many new features, including superscalar capability (two instruction pipelines, one with an ALU for both integer and floating-point operations and the other integer only)
– used clock-multiplier circuits (up to 3x bus speeds: 66-133-166).
• Pentium, with the two instruction pipelines and other improvements, executed an average of 1 to 2 instructions per cycle.
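
The clock-multiplier idea above is simple arithmetic: the core clock is the bus clock times a multiplier. The sketch below assumes the nominal "66 MHz" bus is really 200/3 MHz (about 66.67 MHz) and that the 133 and 166 figures correspond to 2x and 2.5x multipliers; those specific multipliers are inferred, not stated on the slide.

```python
from fractions import Fraction

# Assumption: the nominal "66 MHz" bus runs at 200/3 MHz (about 66.67 MHz).
BUS_MHZ = Fraction(200, 3)


def core_clock_mhz(bus_mhz: Fraction, multiplier: str) -> Fraction:
    """Core clock = bus clock x multiplier (multiplier given as '2', '5/2', ...)."""
    return bus_mhz * Fraction(multiplier)


# 2x and 2.5x are inferred from the 133 and 166 figures; 3x is the stated maximum.
for multiplier in ("2", "5/2", "3"):
    core = core_clock_mhz(BUS_MHZ, multiplier)
    print(f"{multiplier} x bus -> about {int(core)} MHz core clock")
```

Exact fractions are used so that truncating to whole MHz reproduces the familiar 133 and 166 labels without floating-point rounding surprises.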


Superscalar Processor


• The Pentium Pro is considered a 6th generation processor (i686): 3 or more instructions per cycle, with L1 and L2 cache, Nov 1995.
• Pentium name was catchy…
• Dynamic Execution: greater parallel execution using:
– Multiple branch prediction: predicts program flow through several branches
– Data flow analysis: out-of-order execution
– Speculative execution: executing future instructions that are probably needed
• 3.2 V supply; 100, 150, and 200 MHz processor options; and MMX in 1996 (MMX improves video compression/decompression, image processing, etc.)
• Dual Independent Bus (DIB): one bus for the system (motherboard) and the other for the cache
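
To give a feel for what "branch prediction" means, here is a minimal sketch of a 2-bit saturating-counter predictor, a common textbook scheme. It is not a description of the Pentium Pro's actual predictor, which the slides do not detail; the class and function names are made up for illustration.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 2):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes, predictor=None) -> int:
    """Run the predictor over a sequence of branch outcomes and count the hits."""
    if predictor is None:
        predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


# A loop-closing branch: taken 9 times, then not taken once, run twice.
loop_outcomes = ([True] * 9 + [False]) * 2
print(count_correct(loop_outcomes), "of", len(loop_outcomes), "predicted correctly")
```

With this pattern the predictor misses only the two loop exits, because a single not-taken outcome is not enough to flip its prediction. Speculative execution then runs instructions along the predicted path before the branch is resolved.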

MMX Instruction Set

• MMX technology (MultiMedia eXtensions, MatrixMath eXtensions): a Single Instruction Multiple Data (SIMD) instruction set.
• Eight 64-bit MMX registers for SIMD integer operations (physically implemented by chopping 16 bits from the 80-bit floating-point registers)
• Supports SIMD operations on packed byte, word, and double-word integers (the same register can hold, e.g., 8 packed bytes)
• Useful for multimedia and communications software

Intel Pentium II and III (i686)

• Pentium II:
– In 1997 the second sub-generation of i686 processors was released.
– Pentium II processors, among other things, went to a different board mounting method (small PCB and plastic cartridge holder), added MMX instructions, and increased cache size.

• Pentium III:
– The third sub-generation of i686 processors was released in February 1999: the Pentium III.
– Same core as the Pentium II, with SSE (Streaming SIMD [Single Instruction Multiple Data] Extension) instructions: 70 new instructions supporting imaging, 3D, audio, and video applications.
– Larger integrated on-die L2 cache.
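
The SIMD idea behind the MMX and SSE instructions mentioned above is that one instruction operates on several packed elements at once. The sketch below simulates a 64-bit register holding eight packed bytes and adds two such registers lane by lane, once with wraparound and once with unsigned saturation (in the spirit of the MMX PADDB and PADDUSB instructions). It is plain Python with made-up helper names, meant only to show the semantics, not how the hardware does it.

```python
def pack_bytes(values):
    """Pack eight integers (0-255) into one 64-bit integer, element 0 in the low byte."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 64-bit integer back into its eight byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add_wrap(a, b):
    """Add eight byte lanes independently; each lane wraps modulo 256."""
    lanes = [(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


def packed_add_saturate(a, b):
    """Add eight byte lanes independently; each lane clamps at 255."""
    lanes = [min(x + y, 255) for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


pixels = pack_bytes([250, 1, 2, 3, 4, 5, 6, 7])
brighten = pack_bytes([10] * 8)
print(unpack_bytes(packed_add_wrap(pixels, brighten)))
print(unpack_bytes(packed_add_saturate(pixels, brighten)))
```

Saturating arithmetic matters for image data: brightening a nearly white pixel should clamp at white rather than wrap around to dark.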


Streaming SIMD Extension - SSE

• Intel introduced an update to MMX, called SSE, with the Pentium III in 1999
• MMX and SSE are architectural extensions that help increase speed!
• 8 new 128-bit XMM vector registers
• An XMM register can hold/pack four floats or two doubles (in addition to integers)
• Introduced data pre-fetch instructions
• Useful for 3D geometry, 3D rendering, and video encoding/decoding

Intel Pentium 4

• The Pentium 4 appeared in November of 2000, becoming the 7th Generation (i786) Intel processor
• Speeds up to 3.8 GHz, with >=800 MHz Front Side Bus
• Hyper-Threading: allows simultaneous multitasking on a single processor core (it looks like two independent processors to the software). Hyper-Threading is Intel’s name for simultaneous multithreading.
• NetBurst architecture: hyper-pipelined technology (20-stage or 31-stage pipeline) enables much higher clock rates
• Xeon: the Xeon processor name was added with the Pentium II, representing high-end versions of the processors, primarily for servers and workstations. The Xeon brand has been maintained over several generations of Intel processors, including multi-processor configurations.
• In 2003 the Pentium 4 Extreme Edition (P4EE) incorporated a Level 3 cache. (The Athlon 64 was launched shortly after, giving AMD an edge. The EE was expensive and was sometimes referred to as the Emergency Edition, Expensive Edition, or Extremely Expensive.)

SSE2

• SSE2 was introduced in the Intel Pentium 4 and Intel Xeon processors (in 2001)
• Removes the need to use MMX registers when using XMM operations.
• Has 144 new instructions for data support (no new registers)
• Adds support for cacheability and memory ordering operations
• Useful for 3D graphics, video encoding/decoding, and encryption

SSE3 and on

• SSE3 was introduced in Intel Pentium 4 and Intel Xeon processors in 2004
• Arithmetic operations inside the same register (horizontal versus vertical)
• Useful for quick DSP applications, 3D graphics, video encoding/decoding, and encryption
• SSSE3: supplemental instructions (mostly horizontal operations)
• SSE4 in 2007 (more instructions, more refined horizontal operations). (AMD started giving similar names to their vector extensions [SSE5, XOP, etc.])
• AVX: Advanced Vector Extensions, coming in 2010. 256-bit registers, 3-operand instructions.
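
The "horizontal versus vertical" distinction above is easiest to see on two four-float vectors, the size that fits in a 128-bit XMM register. A vertical add combines matching lanes of two registers; a horizontal add sums neighboring lanes inside each register. The sketch below models both with plain Python lists. The horizontal version follows the lane layout of the SSE3 HADDPS instruction, and the function names are illustrative only.

```python
def vertical_add(a, b):
    """Lane-wise add: result[i] = a[i] + b[i] (the classic SSE-style packed add)."""
    return [x + y for x, y in zip(a, b)]


def horizontal_add(a, b):
    """Pairwise sums within each input, laid out as in SSE3 HADDPS."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

print("vertical:  ", vertical_add(a, b))
print("horizontal:", horizontal_add(a, b))
```

Horizontal operations are handy when the values to be combined already sit in one register, for example when summing the partial products of a dot product.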

Making Computers Faster

• There is a huge consumer need to make computers faster (more multitasking)
• Moore’s law (number of transistors)
• The race towards higher processor clock rates was over by the middle of the first decade of the new millennium.
• More parallelism was quickly emerging as the way to speed things up (see vector extensions, Hyper-Threading, and pipelining).
• Improving instructions per clock cycle became the new way to advance.

Intel Pentium D

• In 2005, Intel introduced the Pentium D and the (redefined) Extreme Edition (90/65 nm, up to 3.73 GHz) for desktop applications
• Two dies inside a single CPU package (each based on the NetBurst microarchitecture)
• Each core does Hyper-Threading (in high-end processors)
• Hardware starts to match the needs of multithreaded applications, turning concurrency into parallelism.
• Software thus needs to be parallelized to make sure hardware speed capabilities are passed on to the user.
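
As a small illustration of what "parallelizing software" involves, the sketch below splits one computation into independent chunks, hands each chunk to a worker, and combines the partial results. The function names are made up for this example. One caveat: in the standard CPython interpreter, threads do not execute Python bytecode on several cores at once, so this shows the decomposition pattern rather than a real speedup; a process pool or a compiled language would be needed for that.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, (len(data) + n_chunks - 1) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, workers=2):
    """Compute sum(x*x) by giving each worker one independent chunk."""
    chunks = chunked(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum(x * x for x in chunk), chunks)
        return sum(partials)


data = list(range(1000))
print(parallel_sum_of_squares(data, workers=2))
```

The key point is that the chunks share no state, so a dual-core processor could in principle work on both at the same time; code written as one long sequential loop gives the second core nothing to do.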


Intel Microarchitectures after NetBurst

[Figure: Intel microarchitecture roadmap, from Wikipedia; labels: "Since 2008", "Coming in 2011"]

Intel Core (Core 2, Core i7, Core i9)

• Because of the high power consumption of the NetBurst architecture, Intel changed to the Intel Core microarchitecture for its new processors (65 nm to 45 nm).
• Intel introduced the Core 2 series processors in July of 2006, beginning with the Core 2 Duo. (Core 2 is a brand name for processors based on Intel’s Core microarchitecture; the 2 does not refer to the number of cores.)
• Two cores are on a single die. (Two dies for quad core.)
• The Core 2 Quad processors were added to the series in late 2007.


Intel Core (Core 2, Core i7, Core i9)

[Figure, from Wikipedia]

Intel Nehalem

• Two, four, six, or eight cores
• 731 million transistors for the quad-core variant (45 nm)
• Integrated memory controller for DDR3 SDRAM or FB-DIMM
• Integrated graphics processor (IGP) (separate die)
• New type of front side bus: Intel QuickPath Interconnect
• Northbridge replacement with integrated PCI Express controller
• A new point-to-point processor interconnect, the Intel QuickPath Interconnect, in high-end models
• All eight cores can be on a single die.
• Improved L1, L2, and L3 caches.
• Two levels of branch predictor

New Features

• Intel Wide Dynamic Execution: efficient 14-stage pipeline
• Wider execution path
• Advanced branch prediction
• Macro-Ops fusion
– Roughly 15% of all instructions are conditional branches, thus
– Macro-Ops fusion fuses a comparison and conditional jump into a single micro-op to reduce the number of micro-ops running down the pipeline
• 64-bit support: Merom, Conroe, and Woodcrest support EM64T (catching up with AMD on that front)
• 1 clock-cycle execution for most SSE instructions (used to be two)
• Low power consumption
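
The macro-ops fusion bullet can be made concrete with a toy decoder. The sketch below walks a list of instruction mnemonics and fuses each compare that is immediately followed by a conditional jump into one micro-op. It assumes, purely for illustration, that every other instruction decodes to exactly one micro-op; the mnemonic sets and function name are made up for this example and do not model the real decoder rules.

```python
# Toy model: these sets are illustrative, not the hardware's actual fusion rules.
COMPARES = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def fuse_macro_ops(instructions):
    """Return micro-ops, fusing each compare directly followed by a conditional jump."""
    micro_ops = []
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op in COMPARES and nxt in CONDITIONAL_JUMPS:
            micro_ops.append(f"{op}+{nxt}")  # one fused micro-op for the pair
            i += 2
        else:
            micro_ops.append(op)  # assumption: one micro-op per instruction
            i += 1
    return micro_ops


program = ["mov", "add", "cmp", "jne", "mov", "test", "je", "cmp", "mov"]
fused = fuse_macro_ops(program)
print(len(program), "instructions ->", len(fused), "micro-ops:", fused)
```

In this toy program two compare/jump pairs are fused, so nine instructions become seven micro-ops; the lone `cmp` near the end stays unfused because no conditional jump follows it.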

Intel Nehalem

[Figures, from Wikipedia]