Compute Express Link™ (CXL™): Memory and Cache Protocols


Robert Blankenship, Intel Corporation
2020 Storage Developer Conference. © CXL™ Consortium. All Rights Reserved.

Topics
• What is CXL?
• Caching 101
• CXL Caching Protocol
• CXL Memory Protocol
• Merging Memory and Caching for Accelerators

What is CXL?
• CXL runs across the standard PCIe physical layer with new protocols optimized for cache and memory
• CXL uses a flexible processor port that can auto-negotiate to either the standard PCIe transaction protocol or the alternate CXL transaction protocols
• First-generation CXL aligns to the 32 GT/s PCIe 5.0 specification
• CXL usages are expected to be a key driver for an aggressive timeline to the PCIe 6.0 architecture
[Figure: an x16 PCIe card and an x16 CXL card both attach to the processor through the same x16 connector and PCIe-channel SERDES]

CXL Uses 3 Protocols
• IO (CXL.io) → same as PCI Express®
• Memory (CXL.mem) → memory access from CPU host to device
• Cache (CXL.cache) → coherent access from device to CPU host

Representative CXL Usages
[Figure: representative CXL usage models]

Caching 101

Caching Overview
Caching temporarily brings data closer to the consumer. It improves latency and bandwidth through prefetching and/or locality:
• Prefetching: loading data into the cache before it is required
• Spatial locality (locality in space): access address X, then X+n
• Temporal locality (locality in time): multiple accesses to the same data
[Figure: an accelerator reading data from host memory (access latency ~200 ns, shared bandwidth 100+ GB/s) versus from a local data cache (access latency ~10 ns, dedicated bandwidth 100+ GB/s)]

CPU Cache/Memory Hierarchy with CXL
• Modern CPUs have 2 or more levels of coherent cache
• Lower levels (L1) are smaller in capacity, with the lowest latency and highest bandwidth per source
• Higher levels (L3) have less bandwidth per source but much higher capacity and support more sources
• Device caches are expected to be up to 1 MB
[Figure: two CPU sockets joined by coherent CPU-to-CPU symmetric links; each socket has ~50 KB L1 and ~500 KB L2 caches per CPU, a ~10 MB shared L3 (LLC), a Home Agent, ~10 GB of directly connected memory (DDR) plus ~10 GB of CXL.mem-attached memory, and CXL.cache/CXL.io/PCIe links to devices with ~1 MB caches]
Note: cache/memory capacities are examples and not aligned to a specific product.

Cache Consistency
How do we make sure updates in a cache are visible to other agents?
• Invalidate all peer caches prior to the update
• Can be managed with software or hardware; CXL uses hardware coherence
• Define a point of "Global Observation" (aka GO) at which new data from writes is visible
• Tracking granularity is a "cacheline" of data: 64 bytes for CXL
• All addresses are assumed to be Host Physical Addresses (HPA) in the CXL cache and memory protocols; translations use the existing Address Translation Services (ATS)
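The locality concepts from the Caching Overview above are easy to demonstrate from software. The short C program below is an illustrative sketch (the array dimension N and the element type are arbitrary choices, not taken from the slides): the row-major loop walks addresses X, X+4, X+8, ... within each 64-byte line (spatial locality), while the column-major loop strides a whole row apart on every access and touches a different cacheline almost every time.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 4096   /* arbitrary array dimension chosen for illustration */

/* Good spatial locality: consecutive int elements share a 64-byte cacheline,
 * so one fill from memory services several subsequent accesses. */
long sum_row_major(int (*a)[N])
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];              /* address X, then X+4, X+8, ... */
    return sum;
}

/* Poor spatial locality: each access is N*sizeof(int) bytes from the last,
 * so nearly every access lands on a different cacheline. */
long sum_column_major(int (*a)[N])
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    int (*a)[N] = calloc(N, sizeof *a);  /* N x N ints, zero-initialized */
    if (!a)
        return 1;
    printf("row-major sum:    %ld\n", sum_row_major(a));
    printf("column-major sum: %ld\n", sum_column_major(a));
    free(a);
    return 0;
}
```

Once the array exceeds the cache capacities shown on the hierarchy slide, the column-major pass is typically several times slower than the row-major pass; repeated passes over a small working set are what temporal locality rescues.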
Cache Coherence Protocol
• Modern CPU caches and CXL are built on the M,E,S,I protocol/states:
  • Modified – only in one cache; can be read or written; data is NOT up to date in memory
  • Exclusive – only in one cache; can be read or written; data IS up to date in memory
  • Shared – can be in many caches; can only be read; data IS up to date in memory
  • Invalid – not in cache
• M,E,S,I is tracked for each cacheline address in each cache
• The cacheline address in CXL is Addr[51:6]
• Notes:
  • Each level of the CPU cache hierarchy follows MESI, and the layers above it must be consistent
  • Other extended states and flows are possible but are not covered in the context of CXL

How are Peer Caches Managed?
• All peer caches are managed by the "Home Agent" within the cache level; this is hidden from the CXL device
• A "Snoop" is the term for the Home Agent checking a cache's state; it may cause cache state changes
• CXL snoops:
  • Snoop Invalidate (SnpInv): the cache must degrade to I-state and must return any Modified data
  • Snoop Data (SnpData): the cache must degrade to S-state and must return any Modified data
  • Snoop Current (SnpCurr): the cache state does not change, but the cache must return any Modified data

CXL Cache Protocol

Cache Protocol Summary
• Simple set of 15 cacheable reads and writes from the device to host memory
• Keeps the complexity of global coherence management in the host

Cache Protocol Channels
• 3 channels in each direction: Device-to-Host (D2H) and Host-to-Device (H2D)
• The Data and RSP channels are pre-allocated
• D2H requests come from the device; H2D requests are snoops from the host
• Ordering: an H2D Req (snoop) pushes H2D RSP

Read Flow
• Flow diagrams show the message exchanges in time
• X-axis: agents
• Y-axis: time
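Before mapping the flow back onto the CPU hierarchy, here is a minimal C sketch that makes the MESI states and the three snoop types concrete from the device-cache side. Only the behavior comes from the slides (SnpInv degrades to I, SnpData degrades to S, SnpCurr leaves the state unchanged, Modified data is always returned, and the cacheline address is Addr[51:6]); the type names and function signatures are hypothetical and are not the CXL specification's own structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* MESI state tracked per cacheline address. */
typedef enum { LINE_M, LINE_E, LINE_S, LINE_I } mesi_t;

/* The three CXL snoop types sent by the host. */
typedef enum { SNP_INV, SNP_DATA, SNP_CURR } snoop_t;

/* CXL cacheline address: bits [51:6] of the host physical address. */
uint64_t cxl_cacheline_addr(uint64_t hpa)
{
    return (hpa >> 6) & ((UINT64_C(1) << 46) - 1);   /* 51 - 6 + 1 = 46 bits */
}

/* Apply a snoop to one line: returns the new state and reports whether
 * Modified data must be returned to the Home Agent. */
mesi_t apply_snoop(mesi_t state, snoop_t snp, bool *return_modified_data)
{
    *return_modified_data = (state == LINE_M);

    switch (snp) {
    case SNP_INV:  return LINE_I;   /* SnpInv: degrade to I-state  */
    case SNP_DATA: return LINE_S;   /* SnpData: degrade to S-state */
    case SNP_CURR: return state;    /* SnpCurr: state unchanged    */
    }
    return state;
}
```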
Mapping Flow Back to CPU Hierarchy
• The Peer Cache in the flow can be:
  • a peer CXL device with a cache
  • a CPU cache in the local socket
  • a CPU cache in the remote socket
• The Memory Controller in the flow can be:
  • native DDR on the local socket
  • native DDR on the remote socket
  • CXL.mem on a peer device
[Figure: the CPU cache/memory hierarchy slide from earlier, with the peer-cache and memory-controller roles highlighted at each of these locations]

Example #2: Write
• For cache writes there are three phases (sketched in code at the end of this section):
  • Ownership: peer caches are invalidated (S → I) and the device cache gains ownership (I → E)
  • Silent Write: the device writes the line in its own cache (E → M) with no traffic on the link
  • Cache Eviction: the Modified line is evicted (M → I) and the data is written back through the Home to the memory controller
[Flow diagrams across four slides: CXL Device Cache, Peer Cache, Home, and Memory Controller; the legend shows the Modified/Exclusive/Shared/Invalid cache states and the points where the Home's tracker is allocated and deallocated]

Example #3: Streaming Write
• Direct write to host: ownership + write in a single flow
• Rely on the completion to indicate ordering
• May see reduced bandwidth for ordered traffic
• The host may install the data into the LLC instead of writing it to memory

15 Requests in CXL
• Reads: RdShared, RdCurr, RdOwn, RdAny
• Read-0: RdOwnNoData, CLFlush, CacheFlushed
• Writes: DirtyEvict, CleanEvict, CleanEvictNoData
• Streaming Writes: ItoMWr, MemWr, WOWrInv, WrInv(F)

CXL Memory Protocol
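As a recap of the cache-write example above, the sketch below walks one cacheline through the three phases of Example #2 (Ownership, Silent Write, Cache Eviction). It is a conceptual model only: the request names RdOwn and DirtyEvict are taken from the request list above, but every type and function here, including the placeholder send_d2h_request() comments, is hypothetical; the real protocol is implemented in hardware, not as a C API.

```c
#include <stdint.h>
#include <string.h>

typedef enum { LINE_M, LINE_E, LINE_S, LINE_I } mesi_t;

/* Hypothetical model of one line in the CXL device cache. */
typedef struct {
    uint64_t hpa;          /* host physical address of the line */
    mesi_t   state;
    uint8_t  data[64];     /* one 64-byte CXL cacheline */
} dev_line_t;

/* Phase 1 - Ownership: a D2H RdOwn request asks the Home for ownership;
 * peer caches are invalidated (S -> I) and this cache becomes Exclusive. */
void phase_ownership(dev_line_t *line, uint64_t hpa)
{
    /* send_d2h_request(RdOwn, hpa);   placeholder for the hardware request */
    line->hpa   = hpa;
    line->state = LINE_E;
}

/* Phase 2 - Silent Write: the device writes its own copy, E -> M.
 * No messages cross the CXL link for this step. */
void phase_silent_write(dev_line_t *line, const uint8_t src[64])
{
    memcpy(line->data, src, sizeof line->data);
    line->state = LINE_M;
}

/* Phase 3 - Cache Eviction: the Modified data is written back with a D2H
 * DirtyEvict, flowing through the Home to the memory controller, and the
 * local copy becomes Invalid. */
void phase_eviction(dev_line_t *line)
{
    /* send_d2h_request(DirtyEvict, line->hpa, line->data);   placeholder */
    line->state = LINE_I;
}

int main(void)
{
    uint8_t payload[64] = { 0xAB };     /* example data */
    dev_line_t line = { .state = LINE_I };

    phase_ownership(&line, 0x1000ULL);  /* hypothetical HPA */
    phase_silent_write(&line, payload);
    phase_eviction(&line);
    return 0;
}
```

A streaming write (Example #3, e.g., ItoMWr or WOWrInv) collapses the first two phases into a single request, which is why the slide relies on the completion rather than ownership to indicate ordering.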