
Jaguar

Alex Avery, Cody Smith

Agenda

● AMD Processors
● Jaguar Overview
● Example Hardware
● Core Pipeline
● Instruction Fetch and Cache
● Instruction Decoding
● Scheduling
● Integer & FP Execution
● Memory
● Cache

What is a Microarchitecture?

Microarchitecture is the Organization

Microarchitecture + Instruction Set Architecture =

A microarchitecture describes the electrical circuitry of the device; it is how the ISA is implemented.

AMD Processors

● Bobcat (2011)
● Piledriver (2012)
● Jaguar (2013)
● Steamroller (2014)
● (2014)
● Excavator (2015)

Jaguar Overview

● Targets 2-25 W devices
● Low cost
● 28 nm technology
● Up to 4 cores
● Split L1 cache: 32 KiB instruction and 32 KiB data per core
● Unified L2 cache: 1-2 MiB, 16-way
● Out-of-order and
● Integrated
● Two-way integer execution
● Two-way 128-bit floating-point execution

Example Hardware

● Gaming Consoles
  ○ Xbox One
  ○ PS4
● Desktop Processors
  ○ Athlon 5350
  ○ Sempron 3850
● Laptops/Mini PCs
  ○ A6-5200
  ○ E2-3000
● Tablets
  ○ A6-1450
● Embedded Processors
  ○ GX-420CA

Jaguar Core Pipeline

Instruction Fetch and Cache

● 6 stages
● 32 KB 2-way set-associative L1 cache
● Pseudo-least-recently-used (pseudo-LRU) replacement algorithm
● 32 B instruction fetch window
● Branch predictors exploit characteristics of both direct and indirect branches as well as branch density

Instruction Decoding
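As a rough illustration of the Instruction Fetch and Cache figures (32 KB, 2-way set associative, 32 B fetch window), the sketch below works out how such a cache would be divided into sets and how an address would split into tag, index, and offset. The 64-byte line size is an assumption, since the slides do not state it, and the function names are hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return (num_sets, offset_bits, index_bits) for a set-associative cache.

    line_bytes=64 is an assumption; the slides do not give the line size.
    """
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("size must be a multiple of ways * line_bytes")
    num_sets = size_bytes // (ways * line_bytes)
    # Index/offset extraction by masking only works for power-of-two sizes.
    if num_sets & (num_sets - 1) or line_bytes & (line_bytes - 1):
        raise ValueError("num_sets and line_bytes must be powers of two")
    offset_bits = line_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    return num_sets, offset_bits, index_bits


def split_address(addr, size_bytes, ways, line_bytes=64):
    """Split an address into (tag, set_index, line_offset)."""
    num_sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# L1 instruction cache from the slide: 32 KB, 2-way.
print(cache_geometry(32 * 1024, 2))          # (256, 6, 8)
print(split_address(0x12345, 32 * 1024, 2))  # (4, 141, 5)
# With an assumed 64 B line, one line holds two 32 B fetch windows.
print(64 // 32)                              # 2
```

The same helper applies to any other power-of-two cache configuration by changing `size_bytes` and `ways`.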

● Can decode two
● Variable-length x86 instructions are decoded into complex micro-operations (COPs)
● Can handle 128-bit vector units as well as x86 Advanced Vector Extensions (AVX)

Scheduling

● Out-of-order execution
● After instructions are decoded into COPs, they are dispatched
● Each COP allocates a Retire Control Unit (RCU) entry

Integer Execution

● Separate integer and floating-point units
● 2 symmetrical integer pipelines
● Integer addition/subtraction takes 3 cycles
  ○ Read operands
  ○ Execute
  ○ Write back
● 6-cycle multiplication
● Separate hardware divider

Floating Point Execution

● Designed for 128-bit wide execution
● Targets SSE and AVX vector extensions
● 2 asymmetrical FP pipelines
● 4-7 cycles per addition/subtraction
  ○ Read operands (2 cycles)
  ○ Execute (1-4 cycles)
  ○ Write back (1 cycle)
● Co- architecture
  ○ Dedicated decode, rename, out-of-order scheduler and retire queue

Memory
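The 4-7 cycle figure on the Floating Point Execution slide is simply the sum of its per-stage cycle counts. The short sketch below checks that arithmetic and sets it beside the 3-cycle integer add from the Integer Execution slide. Treating the integer add as one cycle per listed stage is an assumption; the slide gives only the total and the stage names.

```python
# (min_cycles, max_cycles) per stage, taken from the Floating Point Execution slide.
FP_ADD_STAGES = {
    "read_operands": (2, 2),
    "execute": (1, 4),
    "write_back": (1, 1),
}

# Assumption: the 3-cycle integer add spends one cycle in each listed stage.
INT_ADD_STAGES = {
    "read_operands": (1, 1),
    "execute": (1, 1),
    "write_back": (1, 1),
}


def latency_range(stages):
    """Sum the per-stage minimum and maximum cycle counts."""
    lo = sum(low for low, _ in stages.values())
    hi = sum(high for _, high in stages.values())
    return lo, hi


print(latency_range(FP_ADD_STAGES))   # (4, 7)
print(latency_range(INT_ADD_STAGES))  # (3, 3)
```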

● Separate load and store pipelines
● Aggressive re-ordering
  ○ Loads can occur out-of-order
  ○ Loads can be moved ahead of stores before the target address is resolved
● Memory Ordering Queue and Store Queue handle memory ordering

L1 Data Cache
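The Memory slide notes that loads can be moved ahead of stores whose target address is not yet resolved. The toy model below sketches the general idea behind that kind of speculation: a load runs early, and once an older store's address becomes known, any younger load that already read the same address has to be replayed. This is a generic illustration, not a description of Jaguar's actual Memory Ordering Queue; the class and method names are hypothetical, and addresses are compared for exact equality only (no partial overlaps).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    seq: int                    # program-order sequence number
    addr: Optional[int] = None  # None until the address is resolved


@dataclass
class Load:
    seq: int
    addr: int


class ToySpeculativeLoadTracker:
    """Hypothetical tracker for loads executed ahead of unresolved stores."""

    def __init__(self):
        self.stores = {}  # seq -> Store
        self.loads = []   # loads that have already executed

    def add_store(self, seq):
        """Record a store whose address is not known yet."""
        self.stores[seq] = Store(seq)

    def execute_load(self, seq, addr):
        """Execute a load speculatively, ignoring unresolved older stores."""
        self.loads.append(Load(seq, addr))

    def resolve_store(self, seq, addr):
        """Resolve a store address; return younger loads that must replay."""
        self.stores[seq].addr = addr
        return [ld.seq for ld in self.loads if ld.seq > seq and ld.addr == addr]


tracker = ToySpeculativeLoadTracker()
tracker.add_store(seq=1)                 # older store, address unknown
tracker.execute_load(seq=2, addr=0x100)  # younger loads run ahead of it
tracker.execute_load(seq=3, addr=0x200)
print(tracker.resolve_store(seq=1, addr=0x100))  # [2]
```

Only the load to `0x100` is replayed; the load to `0x200` did not conflict with the store, so its early execution stands.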

● 32 KB
● 8-way associative
● Parity-protected write-back cache
● Pseudo-LRU replacement algorithm
● Can handle a 128-bit read and a 128-bit write each cycle
● Average latency of 3 cycles for an L1 hit

L2 Cache
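The L1 Data Cache slide names a pseudo-LRU replacement algorithm for its 8-way sets but does not say which variant is used. One common variant is tree-based pseudo-LRU, which keeps 7 bits per 8-way set instead of a full ordering of the ways. The sketch below shows that variant as an assumption, not as Jaguar's documented scheme; the class name is hypothetical.

```python
class TreePLRU:
    """Tree pseudo-LRU state for one cache set (ways must be a power of two)."""

    def __init__(self, ways=8):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two >= 2")
        self.ways = ways
        # One bit per internal tree node, stored heap-style.
        # 0 means the next victim is in the left half, 1 means the right half.
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """Record an access: point every node on the path away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1  # accessed left half, so evict from right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0  # accessed right half, so evict from left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root to the way that should be replaced."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


plru = TreePLRU(ways=8)
for way in range(8):
    plru.touch(way)
print(plru.victim())  # 0, the same choice true LRU would make here
plru.touch(0)
print(plru.victim())  # 4, whereas true LRU would pick way 1
```

The second result shows why the scheme is only "pseudo" LRU: it tracks which half of each subtree was used last rather than the full access order, trading accuracy for much less state per set.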

● 1-2 MB (depending on application)
● 16-way set associative
● Unified, shared by 2 to 4 cores
● ECC Memory (Error Correcting Code) for tag and data arrays
● Forms an EDC/ECC cache structure
● Minimum of 25 cycles per hit

Jaguar Benchmarks
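The hit latencies quoted on the L1 Data Cache and L2 Cache slides (about 3 cycles and at least 25 cycles) can be combined into a rough average load latency. The sketch below is a back-of-the-envelope model under loud assumptions: every L1 miss hits in L2 at the minimum latency, the 25-cycle figure is treated as the total load-to-use time for an L2 hit, and the miss rates are made-up inputs rather than measured Jaguar data.

```python
def average_load_latency(l1_miss_rate, l1_hit_cycles=3, l2_hit_cycles=25):
    """Weighted average of L1-hit and L2-hit latency, in cycles.

    Assumes every L1 miss is served by L2 at l2_hit_cycles (no DRAM accesses).
    """
    if not 0.0 <= l1_miss_rate <= 1.0:
        raise ValueError("l1_miss_rate must be between 0 and 1")
    return (1.0 - l1_miss_rate) * l1_hit_cycles + l1_miss_rate * l2_hit_cycles


# Hypothetical miss rates, chosen only to show the trend.
for miss_rate in (0.0, 0.05, 0.10):
    print(f"{miss_rate:.2f} -> {average_load_latency(miss_rate):.2f} cycles")
```

Even a small L1 miss rate moves the average noticeably because an L2 hit costs roughly eight times as much as an L1 hit.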

● Athlon 5350
● Athlon 5150
● Sempron 3850

Athlon 5350 vs. Core i3 3220 vs. J1900

Athlon 5350 vs. i7 5930K

The Athlon 5350 delivers much lower performance. However, it offers:

● Much better efficiency
● Much lower cost
● Better
● Better performance per dollar

● Entirely new core design
● New design family 'Summit Ridge'
● Simultaneous multithreading
● New cache system
● FinFET manufacturing

Resources

http://www.anandtech.com/show/6976/amds-jaguar-architecture-the-cpu-powering-xbox-one-playstation-4-kabini-temash
http://www.realworldtech.com/jaguar/
http://www.tomshardware.com/reviews/microsoft-xbox-one-console-review,3681-3.html
https://nathanlamont91.wordpress.com/2015/03/22/my-report-on-the-amd-jaguar-quad-core-cpu/
https://www.deepdyve.com/lp/institute-of-electrical-and-electronics-engineers/the-floating-point-unit-of-the-jaguar-x86-core-1TVYueOORA
http://www.xbitlabs.com/news/cpu/display/20120904201534_AMD_Discloses_Peculiarities_of_Next_Generation_Jaguar_Micro_Architecture.html