Torrenza and the Pareto Distribution
CoE HT Symposium, February 16, 2007
Douglas O'Flaherty

The Heterogeneous Processing Imperative

[Timeline graphic: software complexity and diversity vs. performance, power/performance, and diversity.
– Through 1981: single-core, 16-bit x86; spreadsheets, word processing
– 1990s: single-core, 32-bit; e-mail, GUIs, PowerPoint, web browsers
– 2000s: single-core and dual-core 64-bit AMD64 (Opteron); Java, XML, web services; 3D, digital media; homogeneous multi-CPU; HD, DRM
– 2010s: heterogeneous CPU+xPU, platform co-processing; the end of "one size fits all" computing
Caption: By the end of the decade, homogeneous multi-core becomes increasingly inadequate.]

Industry Landscape: The Insight Gap

Accelerators: devices optimized to enhance performance on a particular function.

[Graphic: over time, the demands of data-driven computing outgrow x86 performance; the widening difference is "The Insight Gap".]

Torrenza and the Pareto Distribution

• New features are introduced in niche markets
  – Some features will never reach broad market appeal, but will have stable niche markets over time
  – The sum of those niche market opportunities is itself a considerable market opportunity
• Features with broad market appeal are quickly moved up the value chain
  – FPGA to custom logic on an add-in board
  – Custom logic integrated into the chipset
  – For very high value features, chipset logic moves into the processor
• Our goals for Torrenza
  – Enable new markets by changing system economics with a standard platform
  – Evaluate new features for their potential in the mass market

[Graphic: % of target market vs. feature specialization; new features enter the market in niches, and value migrates from specialized features in niche markets toward general features in the mass market.]

Early vs. Late Market Entry: Managing Risk and Cost

[Graphic: business risk and cost of entry vs. market maturity and size; risk falls and cost rises as a market matures; perfect timing lies where the market is starting to buy the capability; venture capital markets and start-ups live at the early, high-risk end.]

• New capabilities carry inherent business risk
• Managing risk is a business philosophy
  – Venture capital accepts high risk for potentially high returns
  – Most mature enterprises only accept low risk, but end up paying top dollar for what are essentially new divisions
• Torrenza gives the industry an early assessment of emerging capabilities

Torrenza Platform Challenges

• End customers: barriers to application performance
  – Heterogeneous computing & application software
  – Synchronization & communication overhead
• OEM & channel: barriers to support
  – Proliferation of SKUs
  – Economic challenges to differentiate
• Innovators: barriers to entry
  – Inflexibility of legacy platforms
  – Need an innovation roadmap from concept to deployment

Vision: Enrich the IT industry with AMD64 technology as the OPEN, STANDARDS BASED innovation platform for our customers and end users.
• Freedom of Choice
• Longevity and Stability
• Differentiated and Upgradeable

Acceleration Adoption Model

[Graphic: an AMD Opteron™ platform with two Opteron processors, a chipset northbridge, and HTX and PCI-E connections. Accelerators can attach through the AMD Opteron™ socket, through the extensible system bus, or as a highly integrated solution; packaging is an economic option.]

Continuum of Solutions

• Fusion – AMD's code name for Accelerated Processors (integrated acceleration)
• Torrenza – AMD's code name for slot- or socket-based acceleration
• Stream – a specific example of a GPGPU accelerator under Torrenza

[Graphic: a continuum from silicon-level integration (Accelerated Processors), to package-level integration (MCM), to a socket-compatible accelerator in an Opteron socket, to HTX and chipset accelerators, to PCIe add-in accelerators; a general-purpose GPU sits on the slot/socket side.]

Torrenza Initiative Elements

• Innovation Socket – propagating freedom of choice through technology sharing
• Platform Support – a standards platform with support for socket accelerators
• Silicon Enablement – an ecosystem of silicon development and support for accelerators
• AMD Technology – enhancing the platform & protocols

Torrenza 2006 Timeline

[Timeline graphic, from the June 2006 Torrenza launch onward. Recoverable items: AMD announcements (Torrenza Innovation Socket, Fusion, CTM "Close to the Metal", AMD Stream Processor); OEM announcements (IBM 2-way & 4-way servers with HTX slots, Cray XMT with Threadstorm™ processors and Adaptive Supercomputing, DARPA Phase 3, Sun x64 servers, IBM Cell™ Blades, embedded designs); research and ecosystem (Tokyo Institute of Technology research COE, SC06 demos of HyperTransport™ Technology, 5 HTX boards & 2 socket fillers (HT & cHT) from Altera, Sun, U. Mannheim, Xtreme Data, and Clearspeed™; OpenFPGA program cornerstone sponsor).]

Advancing AMD's Torrenza Initiative

[Graphic: a cloud of candidate acceleration domains; network processing, enterprise (SOA, Java, XML, SMP), content, storage, security offload, FPGA, IPTV, media processing, transcoding, math, enablement, I/O, VoIP, IMS, telco.]

Reconfigurable Computing

• Commercial implementations
• Center of Excellence
• HT, cHT & HTX reference designs
• Research program
• Platform & programming support

On Acceleration Protocols

This presentation is an exploration, not a roadmap of a committed plan.

Offload Model: Asynchronous (survey of current accelerators)

• The majority of the current installed base – e.g. GPUs, Ethernet NICs, appliances
• Have well-worked-out latency-hiding mechanisms
  – Large on-card memory enables buffering for large block transfers
  – Highly managed buffer & context techniques (e.g., HW multithreading)
• DMA is a key function
  – Includes one or more system memory copies
• Driver techniques to address the issue in the market, e.g. Chimney
• Explicitly coded communication channels
  – Using available I/O functions

Co-Processing Model: Synchronous (survey of accelerators)

• A significant number of new entrants & evolving solutions
  – Computational offload: HPC, image processing, analytics, algorithm-in-chip
  – Reliable delivery at wire speed: messaging, classification, security, virtualization
• Applicability of an accelerator is dependent upon problem size & intensity
  – Acceleration must overcome start-up time (transfer latency & synchronization)
  – More synchronous communication between the system thread and device processing
• Compute efficiency requires moving data more often
  – Accelerators are taking less time to process and need to increase concurrency
  – Larger, more complex contexts increase the memory footprint
  – Overlapping transfers hides latency and increases accelerator throughput
• More random access to system memory
  – Increasing number of cores, threads & virtual machines within a system
  – Many problems are not regularly ordered
• Latency hiding is most effective on large, or highly predictable, work units

Goal: auto-generated offload requires support for smaller work-unit granularity. We need improved communication channels.

Accelerator Communication/Synchronization Models

• Traditional approaches to I/O acceleration have typically been based on asynchronous implementations
  – The I/O device performs DMA for both inbound and outbound data transfers
  – Notification from I/O to host is by interrupt
  – Notification from host to I/O is typically via a kernel-mode driver
• These approaches are appropriate for many I/O-related functions, but are too restrictive for computational acceleration
  – An efficient synchronous mode of operation makes fine-grain offloading much easier and more effective
  – The host processor "baby-sits" the accelerator, performing command/control and some computation
  – Polling is appropriate: the accelerator is doing most of the work, so overlap is a bonus,
not a requirement
  – Latency, overhead, and concurrency are critical attributes of the interconnect

Sync Latency: Orders of Magnitude

| Type of connection                                  | Latency (µs)                  | Latency (CPU cycles) |
| IP over WAN                                         | ~100,000                      | 240,000,000          |
| IP over LAN                                         | ~1,000                        | 2,400,000            |
| Unix pipes in SMP (SUSE 9.1 on 2-socket Opteron)    | >10                           | 24,000               |
| Unix sockets in SMP (SUSE 9.1 on 2-socket Opteron)  | ~20 (TCP) / ~15 (UDP)         | 48,000 / 36,000      |
| User-mode sockets in SMP (Scali)                    | ~2.5                          | 6,000                |
| Simple lock-free shared-memory                      | 0.21–0.25 (4-socket Opteron)  | 500–600              |
|   producer/consumer handoff                         | 0.15–0.18 (2-socket Opteron)  | 350–430              |
| Cache-to-cache intervention                         | 0.095–0.135 (4-socket)        | 225–325              |
|                                                     | 0.065–0.095 (2-socket)        | 150–225              |
| One-way HT probe latency (per hop)                  | <0.025                        | <60                  |
| On-chip core-to-core latency                        | ~0.0025                       | ~6                   |

Annotation on the slide: "100x slower than hardware!" Even the best software handoff mechanisms are roughly two orders of magnitude slower than the hardware probe and core-to-core latencies below them.

Why Are Current Architectures So Slow?