COPE-09 Architecture.Indd
Computer Organisation & Program Execution 2021
Architecture9 Uwe R. Zimmer - The Australian National University Architecture
References for this chapter [Patterson17] David A. Patterson & John L. Hennessy Computer Organization and Design – The Hardware/Software Interface Chapter 4 “The Processor”, Chapter 6 “Parallel Processors from Client to Cloud” ARM edition, Morgan Kaufmann 2017
© 2021 Uwe R. Zimmer, The Australian National University page 468 of 490 (chapter 9: “Architecture” up to page 487) Architecture
© 2021 Uwe R. Zimmer, The Australian National University page 469 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Defi nition: Processor Hardware origins 18th century machines
L’Ecrivain 1770 Programmable, yet not a computer in today’s defi nition (not Turing complete)
L’Ecrivain (1770) Pierre Jaquet-Droz, Henri-Louis Jaquet-Droz & Jean-Frédéric Lescho
© 2021 Uwe R. Zimmer, The Australian National University page 470 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Defi nition: Processor Digital Computers
Konrad Zuse with Z1, © Dr. Horst Zuse Hardware origins (replica of the 1937 computer) • Patents by Konrad Zuse (Germany), 1936. ENIAC 1946, • First digital computer: Z1 (Germany), 1937: Re- Glen Beck (background), Betty Jennings (foreground) lays, programmable via punch tape, clock: 1 Hz, 64 words memory à 22-bit, 2 registers, fl oating point unit, weight: 1 t. • First freely programmable (Turing complete) relays computer: Z3 (Germany), 1941: 5.3 Hz • Atanasoff Berry Computer (US) 1942: Vacuum tubes, (not Turing complete). • Colossus Mark 1 (UK) 1944: Vacuum tubes (not Turing complete). • “First Draft of a Report on the EDVAC” (Electronic Discrete Variable Automatic Computer) by John von Neumann (US), 1945: Infl uential article about core elements of a computer: Arithmetic unit, control unit (Sequencer), memory (holding data and program), and I/O. • First high level programming language: Plankalkül (“Plan Calculus”) by Konrad Zuse, 1945. • ENIAC (Electronic Numerical Integrator And Computer) (US) 1946: programed by plugboard, First Turing complete vacuum tubes based computer, clock: 100 kHz, weight: 27 t on 167 m2. © 2021 Uwe R. Zimmer, The Australian National University page 471 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Computer Architectures Harvard Architecture
• Control unit Instructions Concurrently addresses program and data Control unit memory and fetches next instruction. Controls next ALU operations and Program memoryProgram Address determines the next instruction (based on ALU status). • Arithmetic Logic Unit (ALU) Control Status Fetches data from memory. Executes arithmetic/logic operation.
Address Writes data to memory. • Input/Output Arithmetic Logic Unit • Program memory Data memory
Input/Output • Data memory Data
© 2021 Uwe R. Zimmer, The Australian National University page 472 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Computer Architectures von Neumann Architecture
• Control unit Instructions Sequentially addresses program and data Control unit memory and fetches next instruction. Controls next ALU operations and Address determines the next instruction (based on ALU status). • Arithmetic Logic Unit (ALU) Control Status Memory Fetches data from memory. Executes arithmetic/logic operation. Writes data to memory. • Input/Output Arithmetic Logic Unit • Memory Program and data is not distinguished Input/Output
Data G Programs can change themselves.
© 2021 Uwe R. Zimmer, The Australian National University page 473 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Computer Architectures A simple processor (CPU) Code management
• Decoder/Sequencer Decoder Can be a machine in itself which breaks CPU Sequencer instructions into concurrent micro code. • Execution Unit / Arithmetic-Logic-Unit (ALU) IP A collection of transformational logic. Flags Registers • Memory SP
Memory • Registers Instruction pointer, stack pointer, general purpose and specialized registers. ALU • Flags Indicating the states of the latest calculations. Data management • Code/Data management Fetching, Caching, Storing.
© 2021 Uwe R. Zimmer, The Australian National University page 474 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Processor Architectures Pipeline CodeCode management managementent Some CPU actions are naturally sequential Decoder (e.g. instructions need to be fi rst loaded, then Int. SequencerSe Sequencer decoded before they can be executed). More fi ne grained sequences can IP be introduced by breaking CPU Registers Flags instructions into micro code. SP
Memory G Overlapping those sequences in time will lead to the concept of pipelines. ALU G Same latency, yet higher throughput. G (Conditional) branches might break the pipelines G Branch predictors become essential. DataData management management
© 2021 Uwe R. Zimmer, The Australian National University page 475 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Processor Architectures Parallel pipelines CodeCode management managementt Filling parallel pipelines Decoder (by alternating incoming commands between Int. SequencerSe Sequencererr pipelines) may employ multiple ALU’s.
G (Conditional) branches might IP again break the pipelines. Registers FlagsFla G Interdependencies might limit SP
Memory the degree of concurrency. G Same latency, yet even higher throughput.
AALU ALULU G Compilers need to be aware of the options.
DataData management management
© 2021 Uwe R. Zimmer, The Australian National University page 476 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Processor Architectures Pipeline hazards CodeCode management managementent Structural hazard Decoder G Lack of hardware to run Int. SequencerSe Sequencer operations in parallel, … e.g. load an new instruction and
IP load new data in parallel. Registers Flags Control hazard
SP G A decision depends on the Memory previous instruction. … e.g. a conditional branch based on the ALU fl ags from the previous instruction. Data hazard G Needed data is not yet available DataData management management … e.g. the result of an arithmetic operation is needed in the next instruction.
© 2021 Uwe R. Zimmer, The Australian National University page 477 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Processor Architectures Out of order execution CodeCode management managemente Breaking the sequence inside each pipe- Decoder line leads to ‘out of order’ CPU designs. Int. SequencerSe Sequencer G Replace pipelines with hardware scheduler.
IP G Results need to be Registers FlagsFla “re-sequentialized” or possibly discarded.
SP G “Conditional branch prediction” executes
Memory the most likely branch or multiple branches. G Works better if the presented code AALU ALULU sequence has more independent instructions and fewer conditional branches. G This hardware will require (extensive) code optimization to be fully utilized. DataData management management
© 2021 Uwe R. Zimmer, The Australian National University page 478 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Processor Architectures SIMD ALU units Code management Provides the facility to apply the same in- Decoder struction to multiple data concurrently. Int. Sequencer Also referred to as “vector units”. Examples: Altivec, MMX, SSE[2|3|4], … IP Registers Flags G Requires specialized compilers SP or programming languages with Memory implicit concurrency.
AALALULU ALU ALU ALU GPU processing
Graphics processor as a vector unit. Data managementgeemmenmeent G Unifying architecture languages are used (OpenCL, CUDA, GPGPU).
© 2021 Uwe R. Zimmer, The Australian National University page 479 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Processor Architectures Hyper-threading CodeCod management Emulates multiple virtual CPU cores DecoderDecoder by means of replication of: Int. SequencerSequencer • Register sets • Sequencer IPIPIP • Flags RegistersRRegistersegistersegisegi FlagsFFlagslaggss • Interrupt logic SPSSPP
Memory while keeping the “expensive” resources like the ALU central yet accessible by multiple hyper-threads concurrently. ALU G Requires programming languages with implicit or explicit concurrency.
DataData management Examples: Intel Pentium 4, Core i5/i7, Xeon, Atom, Sun UltraSPARC T2 (8 threads per core)
© 2021 Uwe R. Zimmer, The Australian National University page 480 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Processor Architectures Multi-core CPUs CodeCod management Full replication of multiple CPU cores DecoderDecoder on the same chip package. Int. SequencerSequencer • Often combined with hyper-thread- ing and/or multiple other means (as IPIP IP introduced above) on each core. RegistersRRegistersegistersegisgi FlagsFFlagslaggss • Cleanest and most explicit implementation SPSSPP
Memory of concurrency on the CPU level.
G Requires synchronized atomic operations. ALU G Requires programming languages with implicit or explicit concurrency.
Historically the introduction of multi-core DataData management CPUs ended the “GHz race” in the early 2000’s.
© 2021 Uwe R. Zimmer, The Australian National University page 481 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Processor Architectures Virtual memory Code management Translates logical memory addresses Decoder into physical memory addresses Int. Sequencer and provides memory protection features.
IP • Does not introduce concurrency by itself. Registers Flags G Is still essential for concurrent programming as hardware memory protection SP
Memory guarantees memory integrity for
Virtual memory individual processes / threads. Physical memory Physical ALU
Data management
© 2021 Uwe R. Zimmer, The Australian National University page 482 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Alternative Processor Architectures: Parallax Propeller
© 2021 Uwe R. Zimmer, The Australian National University page 483 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Alternative Processor Architectures: Parallax Propeller (2006)
8 cores with 2 kB local memory
Low cost 32 bit processor ($8) 40 kB shared memory
8 semaphores No interrupts!
© 2021 Uwe R. Zimmer, The Australian National University page 484 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Alternative Processor Architectures: IBM Cell processor (2001)
8 cores for specialized high- bandwidth fl oating point operations and 128 bit registers
theoretical 25.6 GFLOPS at 3.2 GHz Cache
Multiple interconnect topologies
64 bit PowerPC core
© 2021 Uwe R. Zimmer, The Australian National University page 485 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Multi-CPU systems
Scaling up: • Multi-CPU on the same memory multiple CPUs on same motherboard and memory bus, e.g. servers, workstations
• Multi-CPU with high-speed interconnects various supercomputer architectures, e.g. Cray XE6: • 12-core AMD Opteron, up to 192 per cabinet (2304 cores) • 3D torus interconnect (160 GB/sec capacity, 48 ports per node)
• Cluster computer (Multi-CPU over network) multiple computers connected by network interface, e.g. Sun Constellation Cluster at ANU: • 1492 nodes, each: 2x Quad core Intel Nehalem, 24 GB RAM • QDR Infi niband network, 2.6 GB/sec
© 2021 Uwe R. Zimmer, The Australian National University page 486 of 490 (chapter 9: “Architecture” up to page 487) Architecture
Summary Architecture • History • Architectures • Pipelines • Parallel pipelines • Out of order execution • Vector machines • Multi-core CPUs • Virtual memory
© 2021 Uwe R. Zimmer, The Australian National University page 487 of 490 (chapter 9: “Architecture” up to page 487)