Quick viewing(Text Mode)

COPE-09 Architecture.Indd

COPE-09 Architecture.Indd

Computer Organisation & Program Execution 2021

Architecture9 Uwe R. Zimmer - The Australian National University Architecture

References for this chapter [Patterson17] David A. Patterson & John L. Hennessy Organization and Design – The Hardware/Software Interface Chapter 4 “The ”, Chapter 6 “Parallel Processors from Client to Cloud” ARM edition, Morgan Kaufmann 2017

© 2021 Uwe R. Zimmer, The Australian National University page 468 of 490 (chapter 9: “Architecture” up to page 487) Architecture

© 2021 Uwe R. Zimmer, The Australian National University page 469 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Defi nition: Processor Hardware origins 18th century machines

L’Ecrivain 1770 Programmable, yet not a computer in today’s defi nition (not Turing complete)

L’Ecrivain (1770) Pierre Jaquet-Droz, Henri-Louis Jaquet-Droz & Jean-Frédéric Lescho

© 2021 Uwe R. Zimmer, The Australian National University page 470 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Defi nition: Processor Digital

Konrad Zuse with Z1, © Dr. Horst Zuse Hardware origins (replica of the 1937 computer) • Patents by Konrad Zuse (Germany), 1936. ENIAC 1946, • First digital computer: Z1 (Germany), 1937: Re- Glen Beck (background), Betty Jennings (foreground) lays, programmable via punch tape, clock: 1 Hz, 64 words memory à 22-bit, 2 registers, fl oating point unit, weight: 1 t. • First freely programmable (Turing complete) relays computer: (Germany), 1941: 5.3 Hz • Atanasoff Berry Computer (US) 1942: Vacuum tubes, (not Turing complete). • Colossus Mark 1 (UK) 1944: Vacuum tubes (not Turing complete). • “First Draft of a Report on the EDVAC” (Electronic Discrete Variable Automatic Computer) by (US), 1945: Infl uential article about core elements of a computer: Arithmetic unit, (Sequencer), memory (holding and program), and I/O. • First high level : Plankalkül (“Plan Calculus”) by Konrad Zuse, 1945. • ENIAC (Electronic Numerical Integrator And Computer) (US) 1946: programed by plugboard, First Turing complete vacuum tubes based computer, clock: 100 kHz, weight: 27 t on 167 m2. © 2021 Uwe R. Zimmer, The Australian National University page 471 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Computer Architectures

• Control unit Instructions Concurrently addresses program and data Control unit memory and fetches next instruction. Controls next ALU operations and Program memoryProgram Address determines the next instruction (based on ALU status). • (ALU) Control Status Fetches data from memory. Executes arithmetic/logic operation.

Address Writes data to memory. • Input/Output Arithmetic Logic Unit • Program memory Data memory

Input/Output • Data memory Data

© 2021 Uwe R. Zimmer, The Australian National University page 472 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Computer Architectures von Neumann Architecture

• Control unit Instructions Sequentially addresses program and data Control unit memory and fetches next instruction. Controls next ALU operations and Address determines the next instruction (based on ALU status). • Arithmetic Logic Unit (ALU) Control Status Memory Fetches data from memory. Executes arithmetic/logic operation. Writes data to memory. • Input/Output Arithmetic Logic Unit • Memory Program and data is not distinguished Input/Output

Data G Programs can change themselves.

© 2021 Uwe R. Zimmer, The Australian National University page 473 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Computer Architectures A simple processor (CPU) Code management

• Decoder/Sequencer Decoder Can be a machine in itself which breaks CPU Sequencer instructions into concurrent micro code. • / Arithmetic-Logic-Unit (ALU) IP A collection of transformational logic. Flags Registers • Memory SP

Memory • Registers Instruction pointer, stack pointer, general purpose and specialized registers. ALU • Flags Indicating the states of the latest calculations. Data management • Code/Data management Fetching, Caching, Storing.

© 2021 Uwe R. Zimmer, The Australian National University page 474 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Processor Architectures CodeCode management managementent Some CPU actions are naturally sequential Decoder (e.g. instructions need to be fi rst loaded, then Int. SequencerSe Sequencer decoded before they can be executed). More fi ne grained sequences can IP be introduced by breaking CPU Registers Flags instructions into micro code. SP

Memory G Overlapping those sequences in time will lead to the concept of pipelines. ALU G Same latency, yet higher . G (Conditional) branches might break the pipelines G Branch predictors become essential. DataData management management

© 2021 Uwe R. Zimmer, The Australian National University page 475 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Processor Architectures Parallel pipelines CodeCode management managementt Filling parallel pipelines Decoder (by alternating incoming commands between Int. SequencerSe Sequencererr pipelines) may employ multiple ALU’s.

G (Conditional) branches might IP again break the pipelines. Registers FlagsFla G Interdependencies might limit SP

Memory the degree of concurrency. G Same latency, yet even higher throughput.

AALU ALULU G need to be aware of the options.

DataData management management

© 2021 Uwe R. Zimmer, The Australian National University page 476 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Processor Architectures Pipeline hazards CodeCode management managementent Structural hazard Decoder G Lack of hardware to run Int. SequencerSe Sequencer operations in parallel, … e.g. load an new instruction and

IP load new data in parallel. Registers Flags Control hazard

SP G A decision depends on the Memory previous instruction. … e.g. a conditional branch based on the ALU fl ags from the previous instruction. Data hazard G Needed data is not yet available DataData management management … e.g. the result of an arithmetic operation is needed in the next instruction.

© 2021 Uwe R. Zimmer, The Australian National University page 477 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Processor Architectures Out of order execution CodeCode management managemente Breaking the sequence inside each pipe- Decoder line leads to ‘out of order’ CPU designs. Int. SequencerSe Sequencer G Replace pipelines with hardware scheduler.

IP G Results need to be Registers FlagsFla “re-sequentialized” or possibly discarded.

SP G “Conditional branch prediction” executes

Memory the most likely branch or multiple branches. G Works better if the presented code AALU ALULU sequence has more independent instructions and fewer conditional branches. G This hardware will require (extensive) code optimization to be fully utilized. DataData management management

© 2021 Uwe R. Zimmer, The Australian National University page 478 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Processor Architectures SIMD ALU units Code management Provides the facility to apply the same in- Decoder struction to multiple data concurrently. Int. Sequencer Also referred to as “vector units”. Examples: Altivec, MMX, SSE[2|3|4], … IP Registers Flags G Requires specialized compilers SP or programming languages with Memory implicit concurrency.

AALALULU ALU ALU ALU GPU processing

Graphics processor as a vector unit. Data managementgeemmenmeent G Unifying architecture languages are used (OpenCL, CUDA, GPGPU).

© 2021 Uwe R. Zimmer, The Australian National University page 479 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Processor Architectures Hyper-threading CodeCod management Emulates multiple virtual CPU cores DecoderDecoder by means of replication of: Int. SequencerSequencer • Register sets • Sequencer IPIPIP • Flags RegistersRRegistersegistersegisegi FlagsFFlagslaggss • logic SPSSPP

Memory while keeping the “expensive” resources like the ALU central yet accessible by multiple hyper-threads concurrently. ALU G Requires programming languages with implicit or explicit concurrency.

DataData management Examples: Intel Pentium 4, Core i5/i7, Xeon, Atom, Sun UltraSPARC T2 (8 threads per core)

© 2021 Uwe R. Zimmer, The Australian National University page 480 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Processor Architectures Multi-core CPUs CodeCod management Full replication of multiple CPU cores DecoderDecoder on the same chip package. Int. SequencerSequencer • Often combined with hyper-- ing and/or multiple other means (as IPIP IP introduced above) on each core. RegistersRRegistersegistersegisgi FlagsFFlagslaggss • Cleanest and most explicit implementation SPSSPP

Memory of concurrency on the CPU level.

G Requires synchronized atomic operations. ALU G Requires programming languages with implicit or explicit concurrency.

Historically the introduction of multi-core DataData management CPUs ended the “GHz race” in the early 2000’s.

© 2021 Uwe R. Zimmer, The Australian National University page 481 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Processor Architectures Code management Translates logical memory addresses Decoder into physical memory addresses Int. Sequencer and provides features.

IP • Does not introduce concurrency by itself. Registers Flags G Is still essential for concurrent programming as hardware memory protection SP

Memory guarantees memory integrity for

Virtual memory individual processes / threads. Physical memory Physical ALU

Data management

© 2021 Uwe R. Zimmer, The Australian National University page 482 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Alternative Processor Architectures: Parallax Propeller

© 2021 Uwe R. Zimmer, The Australian National University page 483 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Alternative Processor Architectures: Parallax Propeller (2006)

8 cores with 2 kB local memory

Low cost 32 bit processor ($8) 40 kB

8 semaphores No !

© 2021 Uwe R. Zimmer, The Australian National University page 484 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Alternative Processor Architectures: IBM Cell processor (2001)

8 cores for specialized high- bandwidth fl oating point operations and 128 bit registers

theoretical 25.6 GFLOPS at 3.2 GHz

Multiple interconnect topologies

64 bit PowerPC core

© 2021 Uwe R. Zimmer, The Australian National University page 485 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Multi-CPU systems

Scaling up: • Multi-CPU on the same memory multiple CPUs on same motherboard and memory , e.g. servers, workstations

• Multi-CPU with high-speed interconnects various supercomputer architectures, e.g. Cray XE6: • 12-core AMD Opteron, up to 192 per cabinet (2304 cores) • 3D torus interconnect (160 GB/sec capacity, 48 ports per node)

• Cluster computer (Multi-CPU over network) multiple computers connected by network interface, e.g. Sun Constellation Cluster at ANU: • 1492 nodes, each: 2x Quad core Intel Nehalem, 24 GB RAM • QDR Infi niband network, 2.6 GB/sec

© 2021 Uwe R. Zimmer, The Australian National University page 486 of 490 (chapter 9: “Architecture” up to page 487) Architecture

Summary Architecture • History • Architectures • Pipelines • Parallel pipelines • Out of order execution • Vector machines • Multi-core CPUs • Virtual memory

© 2021 Uwe R. Zimmer, The Australian National University page 487 of 490 (chapter 9: “Architecture” up to page 487)