Compute Express Link™ (CXL™): Memory and Cache Protocols
Robert Blankenship, Intel Corporation
2020 Storage Developer Conference. © CXL™ Consortium. All Rights Reserved.

Topics
• What is CXL?
• Caching 101
• CXL Caching Protocol
• CXL Memory Protocol
• Merging Memory and Caching for Accelerators
What is CXL?
• CXL runs across the standard PCIe physical layer with new protocols optimized for cache and memory.
• CXL uses a flexible processor port that can auto-negotiate to either the standard PCIe transaction protocol or the alternate CXL transaction protocols.
• First-generation CXL aligns to the 32 GT/s PCIe 5.0 specification.
• CXL usages are expected to be a key driver for an aggressive timeline to the PCIe 6.0 architecture.
(Figure: an x16 PCIe card and an x16 CXL card sharing the same x16 connector, SERDES, and channel to the processor.)
CXL Uses 3 Protocols
• IO (CXL.io) → Same as PCI Express®
• Memory (CXL.mem) → Memory Access from CPU Host to Device
• Cache (CXL.cache) → Coherent Access from Device to CPU Host
Representative CXL Usages
(Figure: example CXL usage models.)
Caching 101
Caching Overview
Caching temporarily brings data closer to the consumer, improving latency and bandwidth through prefetching and/or locality:
• Prefetching: loading data into the cache before it is required.
• Spatial locality (locality in space): after accessing address X, access X+n.
• Temporal locality (locality in time): multiple accesses to the same data.
(Figure: an accelerator reading data from host memory at ~200 ns latency over 100+ GB/s shared bandwidth, versus from a local cache at ~10 ns latency with 100+ GB/s dedicated bandwidth.)
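A minimal sketch of these locality ideas: a tiny direct-mapped cache in front of a backing store, counting hits and misses. All sizes and names here are illustrative, not taken from CXL.

```python
# Sketch: why locality matters. A direct-mapped cache in front of a
# "slow" backing store; hits model the fast local path, misses the
# slow host-memory path. Sizes are illustrative only.

CACHE_LINES = 8
LINE_BYTES = 64

class TinyCache:
    def __init__(self, backing):
        self.backing = backing
        self.tags = [None] * CACHE_LINES
        self.hits = self.misses = 0

    def read(self, addr):
        line = addr // LINE_BYTES        # spatial locality: one fill
        index = line % CACHE_LINES       # covers 64 neighboring bytes
        if self.tags[index] == line:     # temporal locality: repeated
            self.hits += 1               # accesses hit the cache
        else:
            self.misses += 1
            self.tags[index] = line      # fill from the backing store
        return self.backing[addr]

mem = {a: a & 0xFF for a in range(1024)}
c = TinyCache(mem)
for _ in range(2):                  # temporal locality: second pass hits
    for a in range(0, 256, 8):      # spatial locality: 8 reads per line
        c.read(a)
assert c.misses == 4    # only 256/64 = 4 line fills
assert c.hits == 60     # the other 60 of 64 reads hit
```

The 64-byte line size matches the cacheline granularity CXL uses for coherence tracking.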
CPU Cache/Memory Hierarchy with CXL
• Modern CPUs have 2 or more levels of coherent cache.
• Lower levels (L1) are smaller in capacity, with the lowest latency and highest bandwidth per source.
• Higher levels (L3) have less bandwidth per source but much higher capacity and support more sources.
• Device caches are expected to be up to 1 MB.
(Figure: two CPU sockets, each with per-core L1 (~50 KB) and L2 (~500 KB) caches and a shared ~10 MB L3/LLC; a Home Agent fronting ~10 GB of directly connected memory (aka DDR) plus CXL.mem memory; coherent CPU-to-CPU symmetric links between sockets; and CXL devices with ~1 MB caches attached via CXL.cache and CXL.io over PCIe.)
Note: cache/memory capacities are examples and not aligned to a specific product.
Cache Consistency
How do we make sure updates in a cache are visible to other agents?
• Invalidate all peer caches prior to the update.
• Can be managed with software or hardware; CXL uses hardware coherence.
• Define a point of "Global Observation" (aka GO) at which new data from writes is visible.
• Tracking granularity is a "cacheline" of data: 64 bytes for CXL.
• All addresses are assumed to be Host Physical Addresses (HPA) in the CXL cache and memory protocols; translations use existing Address Translation Services (ATS).
Cache Coherence Protocol
• Modern CPU caches and CXL are built on the M,E,S,I protocol/states:
  • Modified: only in one cache; can be read or written; data NOT up to date in memory.
  • Exclusive: only in one cache; can be read or written; data IS up to date in memory.
  • Shared: can be in many caches; can only be read; data IS up to date in memory.
  • Invalid: not in cache.
• M,E,S,I is tracked for each cacheline address in each cache.
• The cacheline address in CXL is Addr[51:6].
• Notes:
  • Each level of the CPU cache hierarchy follows MESI, and layers above must be consistent.
  • Other extended states and flows are possible but not covered in the context of CXL.
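The Addr[51:6] convention follows directly from the 64-byte cacheline: the low 6 bits are the byte offset within the line, and the remaining bits identify the line for coherence tracking. A small sketch:

```python
# Sketch: extracting the CXL cacheline address (Addr[51:6]) from a
# host physical address. 64-byte cachelines mean the low 6 bits are
# the byte offset within the line.

CACHELINE_BYTES = 64          # CXL coherence tracking granularity
OFFSET_BITS = 6               # log2(64)
HPA_BITS = 52                 # Addr[51:0]

def cacheline_address(hpa: int) -> int:
    """Return Addr[51:6]: the cacheline index used for coherence tracking."""
    assert 0 <= hpa < (1 << HPA_BITS)
    return hpa >> OFFSET_BITS

def byte_offset(hpa: int) -> int:
    """Return Addr[5:0]: the byte offset within the 64-byte line."""
    return hpa & (CACHELINE_BYTES - 1)

# Two addresses in the same 64-byte line map to one coherence entry.
assert cacheline_address(0x1000) == cacheline_address(0x103F)
assert cacheline_address(0x1000) != cacheline_address(0x1040)
```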
How are Peer Caches Managed?
• All peer caches are managed by the "Home Agent" within the cache level; this is hidden from the CXL device.
• A "Snoop" is the term for the Home Agent checking cache state; a snoop may cause cache state changes.
• CXL snoops:
  • Snoop Invalidate (SnpInv): the cache must degrade to I-state and must return any Modified data.
  • Snoop Data (SnpData): the cache must degrade to S-state and must return any Modified data.
  • Snoop Current (SnpCurr): the cache state does not change, but the cache must return any Modified data.
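The three snoop types can be summarized as a small state-transition function. This is an illustrative sketch of the rules above, not the CXL wire protocol:

```python
# Sketch (illustrative, not the spec's message format): how a cache
# responds to the three CXL snoop types. States follow MESI.

from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

def handle_snoop(state: State, snoop: str):
    """Return (new_state, data_returned) for one cacheline."""
    returns_data = state is State.M    # Modified data must always be returned
    if snoop == "SnpInv":              # requester wants exclusive ownership
        return State.I, returns_data
    if snoop == "SnpData":             # requester wants a shared readable copy
        return State.S, returns_data
    if snoop == "SnpCurr":             # requester wants current data only
        return state, returns_data
    raise ValueError(f"unknown snoop {snoop}")

assert handle_snoop(State.M, "SnpInv") == (State.I, True)
assert handle_snoop(State.E, "SnpData") == (State.S, False)
assert handle_snoop(State.M, "SnpCurr") == (State.M, True)
```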
CXL Cache Protocol
Cache Protocol Summary
• A simple set of 15 cacheable reads and writes from the device to host memory.
• Keeps the complexity of global coherence management in the host.
Cache Protocol Channels
• 3 channels in each direction (D2H vs. H2D): Request, Response, and Data.
• The Data and RSP channels are pre-allocated.
• D2H Requests come from the device; H2D Requests are snoops from the host.
• Ordering: H2D Req (Snoop) pushes H2D RSP.
Read Flow
• Diagram showing message flows in time: the X-axis is agents, the Y-axis is time.
Mapping Flow Back to CPU Hierarchy
(Figure: the read flow overlaid on the CPU cache/memory hierarchy, repeated across three build slides.)
• The Peer Cache can be:
  • A peer CXL device with a cache
  • A CPU cache in the local socket
  • A CPU cache in the remote socket
• The Memory Controller can be:
  • Native DDR on the local socket
  • Native DDR on the remote socket
  • CXL.mem on a peer device
Example #2: Write
• For cache writes there are three phases:
  • Ownership
  • Silent Write
  • Cache Eviction
(Flow diagram: CXL Device, Peer Cache, Home, and Memory Controller.)
Legend: cache states Modified / Exclusive / Shared / Invalid; markers for Allocate Tracker and Deallocate Tracker.
Example #2: Write (continued)
• Ownership: the device cache requests ownership and moves I→E; the peer cache is invalidated, S→I.
• Silent Write: the device writes into its own cache, E→M, with no CXL traffic.
• Cache Eviction: the device evicts the line, M→I, returning the modified data to the memory controller.
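The three phases can be sketched from the device's point of view. The class and method names here are illustrative abstractions, not CXL message names:

```python
# Sketch (simplified): the three phases of a CXL.cache write as seen
# by the device. The Host stands in for the Home Agent; snoops of
# peer caches are hidden from the device, as in CXL.

class Host:
    def __init__(self):
        self.memory = None

    def invalidate_peers(self):
        pass                            # snoop-invalidates peer caches (S -> I)

    def writeback(self, data):
        self.memory = data              # data reaches the memory controller

class DeviceCache:
    def __init__(self):
        self.state, self.data = "I", None

    def ownership(self, host):
        """Phase 1: request ownership; host invalidates peer caches."""
        host.invalidate_peers()
        self.state = "E"                # device cache: I -> E

    def silent_write(self, data):
        """Phase 2: the write hits the local cache only; no CXL traffic."""
        assert self.state in ("E", "M")
        self.state, self.data = "M", data

    def evict(self, host):
        """Phase 3: eviction returns the modified data to the host."""
        if self.state == "M":
            host.writeback(self.data)
        self.state, self.data = "I", None

host, cache = Host(), DeviceCache()
cache.ownership(host)
cache.silent_write("new value")
cache.evict(host)
assert host.memory == "new value"
```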
Example #3: Streaming Write
• Direct write to host: ownership + write in a single flow.
• Rely on the completion to indicate ordering.
• May see reduced bandwidth for ordered traffic.
• The host may install the data into the LLC instead of writing it to memory.
(Flow diagram: device cache stays Invalid; peer cache S→I; data flows to the Home.)
15 Requests in CXL
• Reads: RdShared, RdCurr, RdOwn, RdAny
• Read-0: RdOwnNoData, CLFlush, CacheFlushed
• Writes: DirtyEvict, CleanEvict, CleanEvictNoData
• Streaming Writes: ItoMWr, MemWr, WOWrInv, WrInv(F)
CXL Memory Protocol
Memory Protocol Summary
• Simple reads and writes from host to memory.
• Memory technology independent:
  • HBM, DDR, etc.
  • Architected hooks to manage persistence.
• Includes 2 bits of "meta-state" per cacheline:
  • Memory-only device: up to the host to define usage.
  • For accelerators: the host encodes the required cache state.
Memory Protocol Channels
• 2 channels in each direction:
  • M2S: Request (Req) and Request with Data (RwD)
  • S2M: Non-Data Response (NDR) and Data Response (DRS), which are pre-allocated
• No ordering, except a special accelerator flow on the Req channel.
(Diagram: Master / Home Agent (Host) ↔ Subordinate / Memory Controller (Device) over the M2S Req, M2S RwD, S2M NDR, and S2M DRS channels.)
Example #1: Write (CXL Memory-Only Device)
• ECC is handled by the device.
• The write carries a new MetaValue.
• The completion is sent when the data is visible to future reads.
(Flow diagram: Host → CXL Device Memory Controller → Memory Media.)
Example #2: Read (Memory-Only Device)
• A MetaValue change requires the device to write the new MetaValue to media.
• The response returns the old MetaValue.
Example #3: Read, No Meta (Memory-Only Device)
• The host may indicate that no meta-state update is required on reads.
Example #4: MemInv (Memory-Only Device)
• Used to read/update the meta-state without reading the data itself.
• The request carries a new MetaValue; the response returns the old MetaValue.
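The four memory-protocol examples can be condensed into one illustrative sketch. The class and method names here are hypothetical stand-ins, not the spec's opcodes or wire format:

```python
# Sketch (illustrative): a CXL.mem subordinate handling the four example
# flows. Each cacheline stores data plus a 2-bit meta-state whose
# meaning the host defines.

class MemDevice:
    def __init__(self):
        self.media = {}                 # addr -> (data, meta)

    def mem_wr(self, addr, data, meta):
        """Example #1: write data + new meta; complete once visible."""
        self.media[addr] = (data, meta & 0b11)
        return "Cmp"

    def mem_rd(self, addr, new_meta=None):
        """Examples #2/#3: read data; an optional meta change forces
        the device to write the new value to media."""
        data, old_meta = self.media[addr]
        if new_meta is not None:
            self.media[addr] = (data, new_meta & 0b11)
        return data, old_meta           # old meta goes back to the host

    def mem_inv(self, addr, new_meta):
        """Example #4: update meta without transferring the data."""
        data, old_meta = self.media[addr]
        self.media[addr] = (data, new_meta & 0b11)
        return old_meta

dev = MemDevice()
dev.mem_wr(0x40, b"payload", 0b10)
data, old = dev.mem_rd(0x40, new_meta=0b01)   # returns old meta, stores new
assert (data, old) == (b"payload", 0b10)
assert dev.mem_inv(0x40, 0b00) == 0b01
```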
CXL Accelerators Using Caching and Memory Protocols
Mixing Protocols for Accelerators
• Device memory is coherent, and the host manages coherence.
• Can the device directly read its own "Device-Attached Memory" without violating coherence?
Bias Table
• The table holds a 1-bit state indicating whether the host has a cached copy:
  • Device Bias: no host caching, allowing direct reads.
  • Host Bias: the host may have a cached copy, so the read goes through the host.
• The host tracks which peer caches have copies.
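A minimal sketch of the bias check on a device read. The page-granularity table and all helper names are assumptions for illustration; the deck does not specify the table's granularity or format:

```python
# Sketch (hypothetical names): routing a device read of its own
# Device-Attached Memory based on a 1-bit bias table. Page granularity
# (4 KB) is an assumption for illustration.

DEVICE_BIAS, HOST_BIAS = 0, 1
PAGE_SHIFT = 12

class BiasTable:
    def __init__(self):
        self.bias = {}                  # page -> bias bit (default: device)

    def lookup(self, addr):
        return self.bias.get(addr >> PAGE_SHIFT, DEVICE_BIAS)

def device_read(table: BiasTable, addr: int) -> str:
    """Direct read in device bias; otherwise resolve through the host."""
    if table.lookup(addr) == DEVICE_BIAS:
        return "read local media directly"     # no CXL traffic needed
    return "issue CXL.cache read via host"     # host resolves coherence

t = BiasTable()
assert device_read(t, 0x1000) == "read local media directly"
t.bias[0x1000 >> PAGE_SHIFT] = HOST_BIAS       # host now holds a copy
assert device_read(t, 0x1234) == "issue CXL.cache read via host"
```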
Other Differences
• A snoop indication is added to memory requests.
• The host does not track device caching of the device's own "Device-Attached Memory".
• Reads/writes can return a "Forward" indication from the host, avoiding ping-pong of data.
Host Bias Read
• The MemRdFwd message is sent after coherence is resolved.
(Flow diagram: CXL Device, Peer Cache, Home, and internal Memory Controller; peer cache S→I, device cache I→E, bias transitions from Host to Device.)
Device Bias Read
• No messages on the CXL interface; the device reads its internal memory directly.
(Flow diagram: device cache I→E, Bias=Device.)
Device Cache Evictions
• E/M in the device cache implies Bias=Device, so there is no indication to the host.
(Flow diagram: device cache M→I, writeback to the internal memory controller.)
Host Bias Streaming Write
• The MemRdFwd message is sent after coherence is resolved.
(Flow diagram: peer cache S→I, bias transitions from Host to Device.)
Device Bias Streaming Write
• No message to the host; the device writes its internal memory directly.
Thank You
Join Today!
www.computeexpresslink.org/join
Follow Us on Social Media
@ComputeExLink
www.linkedin.com/company/cxl-consortium/
CXL Consortium Channel
Audience Q&A