Fact Sheet: Intel Unveils Biggest Architectural Shifts in a Generation
Intel Unveils Biggest Architectural Shifts in a Generation for CPUs, GPUs and IPUs

Intel powers the next era of computing for data center, edge and client for the workloads and computing challenges of tomorrow.

Aug. 19, 2021 – At Intel’s Architecture Day 2021, Raja Koduri and Intel architects provided details on two new x86 core architectures; Intel’s first performance hybrid architecture, code-named “Alder Lake,” with the intelligent Intel® Thread Director workload scheduler; “Sapphire Rapids,” the next-generation Intel® Xeon® Scalable processor for the data center; new infrastructure processing units; and upcoming graphics architectures, including the Xe HPG and Xe HPC microarchitectures, and the Alchemist and Ponte Vecchio SoCs. These new architectures will power upcoming high-performance products and establish the foundations for the next era of Intel innovation aimed at meeting the world’s ever-growing demand for more computing power.

Raja Koduri addressed the importance of architectural advancement to meet this demand, saying: “Architecture is alchemy of hardware and software. It blends the best transistors for a given engine, connects them through advanced packaging, integrates high-bandwidth, low-power caches, and equips them with high-capacity, high-bandwidth memories and low-latency scalable interconnects for hybrid computing clusters in a package, while also ensuring that all software accelerates seamlessly. … The breakthroughs we disclosed today demonstrate how architecture will satisfy the crushing demand for more compute performance as workloads from the desktop to the data center become larger, more complex and more diverse than ever.”

x86 Cores

Efficient-core

Intel’s new Efficient-core microarchitecture, previously code-named “Gracemont,” is designed for throughput efficiency, enabling scalable multithreaded performance for modern multitasking. It is Intel’s most efficient x86 microarchitecture, with an aggressive silicon area target so that multicore workloads can scale out with the number of cores. It also delivers a wide frequency range: the microarchitecture and focused design effort allow the Efficient-core to run at low voltage to reduce overall power consumption, while creating the power headroom to operate at higher frequencies. This allows the Efficient-core to ramp up performance for more demanding workloads.
The Efficient-core utilizes a variety of technical advancements to prioritize workloads without wasting processing power and to directly enhance performance with features that improve instructions per cycle (IPC), including:

• A 5,000-entry branch target cache that results in more accurate branch prediction
• A 64-kilobyte instruction cache to keep useful instructions close without expending memory subsystem power
• Intel’s first on-demand instruction length decoder that generates pre-decode information
• Intel’s clustered out-of-order decoder that enables decoding up to six instructions per cycle while maintaining energy efficiency
• A wide back end with five-wide allocation and eight-wide retire, a 256-entry out-of-order window and 17 execution ports
• Intel® Control-flow Enforcement Technology and Intel® Virtualization Technology Redirection Protection
• Implementation of the AVX ISA, along with new extensions to support integer artificial intelligence (AI) operations

Compared with the Skylake core, Intel’s most prolific central processing unit (CPU) microarchitecture, the Efficient-core achieves 40% more single-threaded performance at the same power, or delivers the same performance while consuming less than 40% of the power.¹ For throughput performance, four Efficient-cores offer 80% more performance than two Skylake cores running four threads while still consuming less power, or deliver the same throughput while consuming 80% less power.¹

Performance-core

Intel’s new Performance-core microarchitecture, previously code-named “Golden Cove,” is designed for speed and pushes the limits of low latency and single-threaded application performance. Workloads are growing in their code footprint and demand more execution capabilities, and datasets are growing massively along with data bandwidth requirements. Intel’s new Performance-core microarchitecture provides a significant boost in general-purpose performance and better support for large-code-footprint applications.

The Performance-core features a wider, deeper and smarter architecture:

• Wider: six decoders (up from four); eight-wide µop cache (up from six); six-wide allocation (up from five); 12 execution ports (up from 10)
• Deeper: bigger physical register files; deeper reorder buffer with 512 entries
• Smarter: improved branch prediction accuracy; reduced effective L1 latency; full write predictive bandwidth optimizations in L2

The Performance-core is the highest-performing CPU core Intel has ever built and pushes the limits of low latency and single-threaded application performance with:

• A geomean improvement of ~19% across a wide range of workloads over the current 11th Gen Intel® Core™ processor architecture (Cypress Cove) at ISO frequency for general-purpose performance¹
• Exposure of more parallelism and an increase in execution parallelism
• Intel® Advanced Matrix Extensions (Intel® AMX), the next-generation built-in AI acceleration advancement for deep learning inference and training performance; it includes dedicated hardware and new instruction set architecture to perform matrix multiplication operations significantly faster (see the sketch after this list)
• Reduced latency and increased support for large-data and large-code-footprint applications
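As a rough illustration of the tile-based matrix multiplication that Intel AMX exposes, the minimal sketch below multiplies two small int8 tiles using the AMX intrinsics in immintrin.h. It assumes an AMX-capable processor (for example, a Sapphire Rapids-based Xeon), a Linux kernel that grants tile state through arch_prctl, and a compiler invoked with -mamx-tile -mamx-int8; the tile sizes, the tile_config layout shown, and the data values are illustrative rather than taken from Intel’s materials.

```c
/* Minimal AMX sketch: C += A x B on int8 tiles, accumulating into int32.
   Illustrative only; build with: gcc -O2 -mamx-tile -mamx-int8 amx_demo.c */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   /* request use of an XSAVE feature     */
#define XFEATURE_XTILEDATA  18       /* the AMX tile data state component   */

/* 64-byte tile configuration consumed by _tile_loadconfig() (LDTILECFG). */
struct tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];              /* bytes per row for each tile register */
    uint8_t  rows[16];               /* rows for each tile register          */
};

int main(void)
{
    /* Linux requires a one-time permission request before touching tiles. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        perror("AMX tile data not available");
        return 1;
    }

    /* Configure tmm0 as a 16x16 int32 accumulator, tmm1/tmm2 as 16x64 int8. */
    struct tile_config cfg = {0};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 16 * sizeof(int32_t);
    cfg.rows[1] = 16; cfg.colsb[1] = 64;
    cfg.rows[2] = 16; cfg.colsb[2] = 64;
    _tile_loadconfig(&cfg);

    int8_t  a[16][64], b[16][64];    /* B is expected in the packed 4-byte layout */
    int32_t c[16][16];
    memset(a, 1, sizeof a);
    memset(b, 2, sizeof b);
    memset(c, 0, sizeof c);

    _tile_loadd(1, a, 64);           /* load A into tmm1 (row stride = 64 bytes) */
    _tile_loadd(2, b, 64);           /* load B into tmm2                         */
    _tile_loadd(0, c, 64);           /* load the zeroed accumulator into tmm0    */
    _tile_dpbssd(0, 1, 2);           /* tmm0 += signed-int8 dot products         */
    _tile_stored(0, c, 64);          /* write the int32 results back to memory   */
    _tile_release();                 /* return the tile state to the OS          */

    /* Every element is a 64-term dot product of 1*2, i.e. 128. */
    printf("c[0][0] = %d\n", c[0][0]);
    return 0;
}
```

On hardware or kernels without AMX support, the permission request fails and the program exits early; everything from that request onward is specific to the Linux x86-64 environment assumed here.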
Client

Alder Lake Client SoC

Intel’s next-generation client architecture, code-named Alder Lake, is Intel’s first performance hybrid architecture, which for the first time integrates two core types – Performance-core and Efficient-core – for significant performance across all workload types. Alder Lake is built on the Intel 7 process and supports the latest memory and fastest I/O.

Alder Lake will deliver performance that scales to support all client segments, from ultra-portable laptops to enthusiast and commercial desktops, by leveraging a single, highly scalable system-on-chip (SoC) architecture with three key design points:

• A maximum-performance, two-chip, socketed desktop with leadership performance, power efficiency, memory and I/O
• A high-performance mobile BGA package that adds imaging, larger Xe graphics and Thunderbolt 4 connectivity
• A thin, lower-power, high-density package with optimized I/O and power delivery

The challenge in building such a highly scalable architecture is meeting the incredible bandwidth demands of the compute and I/O agents without compromising power. To solve this challenge, Intel has designed three independent fabrics, each with real-time, demand-based heuristics:

• The compute fabric can support up to 1,000 gigabytes per second (GBps), which is 100 GBps per core or per cluster, and connects the cores and graphics through the last-level cache to the memory
  o Features a high dynamic frequency range and is capable of dynamically selecting the data path for latency versus bandwidth optimization based on actual fabric loads
  o Dynamically adjusts the last-level cache policy – inclusive or non-inclusive – based on utilization
• The I/O fabric supports up to 64 GBps, connecting the different types of I/O as well as internal devices, and can change speed seamlessly without interfering with a device’s normal operation, selecting the fabric speed to match the required amount of data transfer
• The memory fabric can deliver up to 204 GBps of data and dynamically scale its bus width and speed to support multiple operating points for high bandwidth, low latency or low power

Intel Thread Director

For Performance-cores and Efficient-cores to work seamlessly with the operating system, Intel has developed an improved scheduling technology called Intel Thread Director. Built directly into the hardware, Thread Director provides low-level telemetry on the state of the core and the instruction mix of the thread, empowering the operating system to place the right thread on the right core at the right time. Thread Director is dynamic and adaptive – adjusting scheduling decisions to real-time compute needs – rather than a simple, static, rules-based approach.

Traditionally, the operating system made scheduling decisions based on limited available statistics, such as whether a task was in the foreground or background. Thread Director adds a new dimension by:

• Using hardware telemetry to direct threads that require higher performance to the right Performance-core at that moment
• Monitoring the instruction mix, the state of the core and other relevant microarchitecture telemetry at a granular level, which helps the operating system make more intelligent scheduling decisions
• Optimizing Thread Director for the best performance on Windows 11 through collaboration with Microsoft
• Extending the PowerThrottling API, which allows developers to explicitly specify quality-of-service attributes for their threads
• Applying a new EcoQoS classification that informs the scheduler when a thread prefers power efficiency; such threads are scheduled on Efficient-cores (see the sketch after this list)
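As a concrete illustration of those quality-of-service attributes, here is a minimal sketch, assuming the Windows Power Throttling API (SetThreadInformation with the ThreadPowerThrottling information class), of how a developer might opt a thread into the EcoQoS class so the scheduler can favor Efficient-cores for it. The helper name prefer_power_efficiency is ours, and actual core placement remains up to the operating system and Thread Director.

```c
/* Minimal EcoQoS sketch: mark the current thread as preferring power
   efficiency via the Windows Power Throttling API (Windows 10 1709+ SDK). */
#include <windows.h>

/* Hypothetical helper: request EcoQoS-style scheduling for a thread. */
static void prefer_power_efficiency(HANDLE thread)
{
    THREAD_POWER_THROTTLING_STATE state;
    ZeroMemory(&state, sizeof(state));
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    /* Setting the same bit in StateMask enables execution-speed throttling,
       which the OS surfaces as the EcoQoS quality-of-service class. */
    state.StateMask   = THREAD_POWER_THROTTLING_EXECUTION_SPEED;

    if (!SetThreadInformation(thread, ThreadPowerThrottling,
                              &state, sizeof(state))) {
        /* Older Windows builds may not support this information class;
           the thread then keeps its default QoS. */
    }
}

int main(void)
{
    prefer_power_efficiency(GetCurrentThread());
    /* ... long-running background or batch work goes here ... */
    return 0;
}
```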
Xe HPG Microarchitecture and Alchemist SoCs

Xe HPG is a new discrete graphics microarchitecture designed to scale to enthusiast-class performance for gaming and creation workloads. The Xe HPG microarchitecture powers the Alchemist family of SoCs, and the first related products are coming to market in the first quarter of 2022 under the Intel® Arc™ brand. The Xe HPG microarchitecture