Compute Express Link™ (CXL™): Memory and Protocols

Robert Blankenship, Intel Corporation

Topics

• What is CXL?
• Caching 101
• CXL Caching Protocol
• CXL Memory Protocol
• Merging Memory and Caching for Accelerators

What is CXL?

• CXL runs across the standard PCIe physical layer, with new protocols optimized for cache and memory

• CXL uses a flexible port that can auto-negotiate to either the standard PCIe transaction protocol or the alternate CXL transaction protocols

• First generation CXL aligns to the 32 GT/s PCIe 5.0 specification
• CXL usages are expected to be a key driver for an aggressive timeline to the PCIe 6.0 architecture

[Figure: an x16 CXL card and an x16 PCIe card each connecting to the processor through a standard x16 connector, PCIe channel, and SERDES.]

CXL Uses 3 Protocols

• IO (CXL.io) → Same as PCI Express®

• Memory (CXL.mem) → Memory Access from CPU Host to Device

• Cache (CXL.cache) → Coherent Access from Device to CPU Host

Representative CXL Usages

Caching 101

Caching Overview

Caching temporarily brings data closer to the consumer, improving latency and bandwidth through prefetching and/or locality:

• Prefetching: loading data into the cache before it is required
• Spatial locality (locality in space): access address X, then X+n
• Temporal locality (locality in time): multiple accesses to the same data

[Figure: an accelerator reading data from host memory (access latency ~200 ns, shared bandwidth 100+ GB/s) versus from a local cache (access latency ~10 ns, dedicated bandwidth 100+ GB/s).]
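To make the locality ideas concrete, here is a toy direct-mapped cache in C++; every name and size below is an illustrative assumption, not anything defined by CXL:

```cpp
// Toy direct-mapped cache model illustrating spatial and temporal
// locality. All names and sizes here are illustrative, not CXL-defined.
#include <array>
#include <cstdint>
#include <cstdio>

constexpr uint64_t kLineBytes = 64;    // 64B cacheline, as in CXL
constexpr uint64_t kNumLines  = 512;   // toy cache: 512 * 64B = 32 KB

struct ToyCache {
    std::array<uint64_t, kNumLines> tags{};
    std::array<bool, kNumLines>     valid{};
    uint64_t hits = 0, misses = 0;

    void access(uint64_t addr) {
        uint64_t line = addr / kLineBytes;   // which 64B line
        uint64_t idx  = line % kNumLines;    // direct-mapped index
        if (valid[idx] && tags[idx] == line) { ++hits; }
        else { ++misses; valid[idx] = true; tags[idx] = line; }
    }
};

int main() {
    ToyCache c;
    // Spatial locality: sequential bytes share a 64B line -> mostly hits.
    for (uint64_t a = 0; a < 64 * 1024; ++a) c.access(a);
    // Temporal locality: re-reading one line -> hits after the first miss.
    for (int i = 0; i < 1000; ++i) c.access(0x1000);
    std::printf("hits=%llu misses=%llu\n",
                (unsigned long long)c.hits, (unsigned long long)c.misses);
}
```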

CPU Cache/Memory with CXL

• Modern CPUs have 2 or more levels of coherent cache
• Lower levels (L1) are smaller in capacity, with the lowest latency and highest bandwidth per source
• Higher levels (L3) have less bandwidth per source but much higher capacity and support more sources
• Device caches are expected to be up to 1 MB

[Figure: two CPU sockets joined by coherent CPU-to-CPU symmetric links. Each socket has ~10 GB of directly connected memory (aka DDR) behind a Home Agent, a ~10 MB L3 (aka LLC), ~500 KB L2 caches, and ~50 KB L1 caches per CPU. CXL devices with ~1 MB caches attach over CXL.cache/CXL.mem/CXL.io alongside PCIe; additional ~10 GB memories attach via CXL.mem.]

Note: Cache/memory capacities are examples and not aligned to a specific product.

Cache Consistency

How do we make sure updates in a cache are visible to other agents?
• Invalidate all peer caches prior to the update
• Can be managed with software or hardware → CXL uses hardware coherence

Define a point of “Global Observation” (aka GO) at which new data from writes becomes visible

Tracking granularity is a “cacheline” of data → 64 bytes for CXL

All addresses are assumed to be Host Physical Addresses (HPA) in the CXL cache and memory protocols → translations use the existing Address Translation Services (ATS)

MESI Protocol

• Modern CPU caches and CXL are built on the M,E,S,I protocol/states

• Modified – only in one cache; can be read or written; data NOT up-to-date in memory
• Exclusive – only in one cache; can be read or written; data IS up-to-date in memory
• Shared – can be in many caches; can only be read; data IS up-to-date in memory
• Invalid – not in cache

• M,E,S,I is tracked for each cacheline address in each cache
• The cacheline address in CXL is Addr[51:6] (sketched below)

• Notes:
  • Each level of the CPU cache hierarchy follows MESI, and the layers above must be consistent
  • Other extended states and flows are possible but not covered in the context of CXL
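A minimal sketch of per-cacheline MESI tracking keyed by Addr[51:6], assuming a simple map-based directory (the container is an illustrative choice, not a spec structure):

```cpp
// Minimal sketch of per-cacheline MESI tracking keyed by CXL's
// Addr[51:6]. The directory container is an illustrative assumption.
#include <cstdint>
#include <unordered_map>

enum class Mesi : uint8_t { I, S, E, M };

// Addr[51:6]: drop the 6 offset bits of a 64B line, keep 46 tag bits.
inline uint64_t cacheline_id(uint64_t hpa) {
    return (hpa >> 6) & ((1ULL << 46) - 1);
}

struct CacheDirectory {
    std::unordered_map<uint64_t, Mesi> lines;   // absent => Invalid

    Mesi lookup(uint64_t hpa) const {
        auto it = lines.find(cacheline_id(hpa));
        return it == lines.end() ? Mesi::I : it->second;
    }
    void set(uint64_t hpa, Mesi s) {
        if (s == Mesi::I) lines.erase(cacheline_id(hpa));
        else              lines[cacheline_id(hpa)] = s;
    }
};
```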

How are Peer Caches Managed?

• All peer caches are managed by the “Home Agent” within the cache hierarchy → hidden from the CXL device

• A “Snoop” is the term for the Home Agent checking a cache; it may cause cache state changes

• CXL snoops:
  • Snoop Invalidate (SnpInv): cache must degrade to I-state and must return any Modified data
  • Snoop Data (SnpData): cache must degrade to S-state and must return any Modified data
  • Snoop Current (SnpCurr): cache state does not change, but any Modified data must be returned
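A minimal sketch of how a caching agent could react to each snoop type; the SnoopReply struct is a simplification for illustration, not the actual CXL message format:

```cpp
// Sketch of a caching agent reacting to the three CXL snoop types.
// The reply struct is a simplification of the real response messages.
#include <cstdint>

enum class Mesi : uint8_t { I, S, E, M };
enum class Snoop { SnpInv, SnpData, SnpCurr };

struct SnoopReply {
    Mesi new_state;
    bool modified_data_returned;  // Modified data must go back to Home
};

// Apply one snoop to a line currently held in state `cur`.
SnoopReply handle_snoop(Mesi cur, Snoop snp) {
    bool dirty = (cur == Mesi::M);
    switch (snp) {
    case Snoop::SnpInv:  return {Mesi::I, dirty};  // degrade to I-state
    case Snoop::SnpData: return {Mesi::S, dirty};  // degrade to S-state
    case Snoop::SnpCurr:
    default:             return {cur, dirty};      // state unchanged
    }
}
```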

CXL Cache Protocol

Cache Protocol Summary

• A simple set of 15 cacheable reads and writes from the device to host memory
• Keeps the complexity of global coherence management in the host

Cache Protocol Channels

• 3 channels in each direction: D2H vs H2D
• Data and RSP channels are pre-allocated
• D2H Requests come from the device
• H2D Requests are snoops from the host
• Ordering: H2D Req (Snoop) pushes H2D RSP

Read Flow

• Diagram shows message flows in time
• X-axis: agents
• Y-axis: time


Mapping Flow Back to CPU Hierarchy

[Figure: the read flow mapped onto the two-socket CPU hierarchy figure shown earlier.]

• The Peer Cache can be:
  • A peer CXL device with a cache
  • A CPU cache in the local socket
  • A CPU cache in the remote socket

• The Memory Controller can be:
  • Native DDR on the local socket
  • Native DDR on the remote socket
  • CXL.mem on a peer device

Example #2: Write

• For cache writes there are three phases:
  • Ownership: the device requests ownership; the host snoop-invalidates the peer cache (S → I) and the device cache moves I → E
  • Silent Write: the device writes the line in its own cache (E → M) with no CXL messages
  • Cache Eviction: the device evicts the Modified line, returning the data to the memory controller (M → I)

[Flow diagram: CXL Device Cache, Home, Peer Cache, and Memory Controller over time; legend: cache states Modified/Exclusive/Shared/Invalid, with tracker allocate/deallocate at the Home.]
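A device-side sketch of the three phases, assuming a hypothetical HostPort abstraction; RdOwn and DirtyEvict are request names from the list two slides below, but the API shape is invented for illustration:

```cpp
// Device-side sketch of the three write phases above. RdOwn and
// DirtyEvict are CXL request names; the HostPort abstraction is not.
#include <array>
#include <cstdint>

enum class Mesi : uint8_t { I, S, E, M };
using CachelineData = std::array<uint8_t, 64>;

struct HostPort {                               // hypothetical stand-in
    void rd_own(uint64_t) { /* host snoop-invalidates peers, grants E */ }
    void dirty_evict(uint64_t, const CachelineData&) { /* data to memory */ }
};

struct DeviceLine { Mesi state = Mesi::I; CachelineData data{}; };

void write_line(HostPort& host, DeviceLine& line, uint64_t hpa,
                const CachelineData& new_data) {
    host.rd_own(hpa);                  // Phase 1: Ownership     (I -> E)
    line.state = Mesi::E;
    line.data  = new_data;             // Phase 2: Silent Write  (E -> M),
    line.state = Mesi::M;              //          no CXL messages needed
    host.dirty_evict(hpa, line.data);  // Phase 3: Cache Eviction (M -> I)
    line.state = Mesi::I;
}
```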

Example #3: Streaming Write

• Direct write to the host: ownership + write in a single flow
• Rely on the completion to indicate ordering
  • May see reduced bandwidth for ordered traffic
• The host may install the data into the LLC instead of writing to memory

[Flow diagram: CXL Device, Home, Peer Cache (S → I), and Memory Controller; the data moves with the request and the host returns a completion.]
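A sketch of the completion-based ordering this slide describes, assuming a hypothetical future-based port; ItoMWr is an opcode from the request list below:

```cpp
// Sketch of the streaming-write ordering rule: a dependent write waits
// on the prior completion, not on issue order. ItoMWr is an opcode from
// the request list below; the future-based port API is an assumption.
#include <array>
#include <cstdint>
#include <future>

using CachelineData = std::array<uint8_t, 64>;

struct StreamPort {                             // hypothetical stand-in
    // The future resolves at the host's completion, i.e. when the write
    // is globally observed. Stubbed to complete immediately here.
    std::future<void> ito_m_wr(uint64_t, const CachelineData&) {
        std::promise<void> p;
        p.set_value();
        return p.get_future();
    }
};

void ordered_stream(StreamPort& port, uint64_t base,
                    const CachelineData& d0, const CachelineData& d1) {
    auto done = port.ito_m_wr(base, d0);
    done.wait();                    // ordering comes from the completion,
    port.ito_m_wr(base + 64, d1);   // hence reduced ordered bandwidth
}
```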

15 Requests in CXL

• Reads: RdShared, RdCurr, RdOwn, RdAny
• Read-0: RdOwnNoData, CLFlush, CacheFlushed
• Writes: DirtyEvict, CleanEvict, CleanEvictNoData
• Streaming Writes: ItoMWr, MemWr, WOWrInv, WrInv(F)
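For reference, the 15 request names as an enum, grouped as on the slide. The slide's “WrInv(F)” is expanded here as WOWrInvF plus WrInv, following the CXL specification's naming; treat that expansion as an interpretation:

```cpp
// The 15 D2H request opcodes, grouped as on the slide. Encodings are
// omitted; the (F) variant is expanded per the CXL spec's WOWrInvF.
enum class D2HRequest {
    // Reads
    RdShared, RdCurr, RdOwn, RdAny,
    // Read-0 (no data is transferred to the device)
    RdOwnNoData, CLFlush, CacheFlushed,
    // Writes
    DirtyEvict, CleanEvict, CleanEvictNoData,
    // Streaming writes
    ItoMWr, MemWr, WOWrInv, WOWrInvF, WrInv,
};
```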

CXL Memory Protocol

Memory Protocol Summary

• Simple reads and writes from host to memory
• Memory technology independent
  • HBM, DDR, …
  • Architected hooks to manage persistence
• Includes 2 bits of “meta-state” per cacheline
  • Memory-only device: up to the host to define usage
  • For accelerators: the host encodes the required cache state
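The 2-bit meta-state has to live somewhere on the device; here is a minimal sketch of one way to pack it, assuming a simple side-array (the protocol only defines the 2-bit value carried on the wire, not this storage):

```cpp
// Minimal sketch of storing 2 bits of meta-state per 64B cacheline.
// The packing and this class are assumptions; the protocol only
// defines the 2-bit value carried on the wire.
#include <cstddef>
#include <cstdint>
#include <vector>

class MetaStore {
    std::vector<uint8_t> bits_;   // 2 bits per line, 4 lines per byte
public:
    explicit MetaStore(std::size_t num_lines)
        : bits_((num_lines + 3) / 4, 0) {}

    uint8_t get(std::size_t line) const {
        return (bits_[line / 4] >> ((line % 4) * 2)) & 0x3;
    }
    // Store a new value and return the old one, as the memory-protocol
    // responses in the examples below do.
    uint8_t exchange(std::size_t line, uint8_t new_meta) {
        uint8_t  old   = get(line);
        unsigned shift = static_cast<unsigned>((line % 4) * 2);
        bits_[line / 4] = static_cast<uint8_t>(
            (bits_[line / 4] & ~(0x3u << shift)) |
            ((new_meta & 0x3u) << shift));
        return old;
    }
};
```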

Memory Protocol Channels

• 2 channels in each direction
  • M2S: Request (Req) and Request with Data (RwD)
  • S2M: Non-Data Response (NDR) and Data Response (DRS), which are pre-allocated
• No ordering, except a special accelerator flow on the Req channel

[Figure: the Master (Home Agent, Host) sends M2S Req and M2S RwD to the Subordinate (Memory Controller, Device), which returns S2M NDR and S2M DRS.]

Example #1: Write (Memory-Only Device)

• ECC is handled by the device
• The write carries a new MetaValue
• The completion is sent when the data is visible to future reads

[Flow diagram: Host → CXL Device Memory Controller → Memory Media.]

Example #2: Read (Memory-Only Device)

• A MetaValue change requires the device to write
• The request carries the new MetaValue; the response returns the old MetaValue

[Flow diagram: Host → CXL Device Memory Controller → Memory Media.]

Example #3: Read, No Meta Update (Memory-Only Device)

• The host may indicate that no meta-state update is required on reads

[Flow diagram: Host → CXL Device Memory Controller → Memory Media.]

Example #4: MemInv (Memory-Only Device)

• Used to read/update the meta-state without reading the data itself
• The request carries the new MetaValue; the response returns the old MetaValue

[Flow diagram: Host → CXL Device Memory Controller → Memory Media.]
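Pulling the four examples together, a sketch of a memory-only device servicing them; MemRd/MemInv come from the slides, while the Response struct and array-backed media are assumptions for illustration:

```cpp
// Sketch of a memory-only device servicing the flows in examples #1-#4.
// MemRd/MemInv are message names from the slides; the Response struct
// is a hypothetical simplification of the NDR/DRS responses.
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

using CachelineData = std::array<uint8_t, 64>;

struct Response {
    std::optional<CachelineData> data;  // present on DRS (data response)
    uint8_t old_meta;                   // the old meta-state is returned
};

struct MemOnlyDevice {
    std::array<CachelineData, 1024> media{};  // toy memory media
    std::array<uint8_t, 1024>       meta{};   // 2-bit meta per line

    // Read; a present new_meta models examples #1/#2 (a meta change,
    // which on real media may force a device write), absent models #3.
    Response mem_rd(std::size_t line, std::optional<uint8_t> new_meta) {
        Response r{media[line], meta[line]};
        if (new_meta) meta[line] = *new_meta & 0x3;
        return r;
    }
    // MemInv (example #4): read/update meta without touching the data.
    Response mem_inv(std::size_t line, uint8_t new_meta) {
        Response r{std::nullopt, meta[line]};
        meta[line] = new_meta & 0x3;
        return r;
    }
};
```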

CXL Accelerators using Caching and Memory Protocols

Mixing Protocols for Accelerators

Device memory is coherent, and the host manages coherence.

Can the device directly read its own “Device-Attached Memory” without violating coherence?

Bias Table

The table holds a 1-bit state indicating whether the host has a cached copy:
• Device Bias: no host caching, allowing direct reads
• Host Bias: the host may have a cached copy, so reads go through the host

Host tracks which peer caches have copies
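A sketch of how a device might implement the bias table, assuming a 4 KB tracking granule (the deck does not specify the granularity or layout):

```cpp
// Device-side sketch of the 1-bit bias table. The 4 KB granule and the
// API are assumptions for illustration; the deck does not specify the
// tracking granularity or layout.
#include <cstdint>
#include <vector>

enum class Bias { Device, Host };

class BiasTable {
    static constexpr uint64_t kGranule = 4096;  // assumed granule size
    std::vector<uint64_t> bits_;                // 1 = Host Bias
public:
    explicit BiasTable(uint64_t mem_bytes)
        : bits_((mem_bytes / kGranule + 63) / 64, 0) {}

    Bias get(uint64_t dev_addr) const {
        uint64_t g = dev_addr / kGranule;
        return (bits_[g / 64] >> (g % 64)) & 1 ? Bias::Host : Bias::Device;
    }
    void set(uint64_t dev_addr, Bias b) {
        uint64_t g = dev_addr / kGranule;
        if (b == Bias::Host) bits_[g / 64] |=  (1ULL << (g % 64));
        else                 bits_[g / 64] &= ~(1ULL << (g % 64));
    }
    // Device Bias reads may go straight to device media; Host Bias
    // reads must first resolve coherence through the host.
    bool can_read_directly(uint64_t dev_addr) const {
        return get(dev_addr) == Bias::Device;
    }
};
```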

Other Differences

A snoop indication is added to memory requests → the host does not track device caching of the device's own “Device-Attached Memory”

Reads/writes can return a “Forward” indication from the host, avoiding a ping-pong of data

Host Bias Read

• A MemRdFwd message is sent after coherence is resolved
• The peer cache degrades S → I, the device cache moves I → E, and the bias flips from Host to Device

[Flow diagram: CXL Device with internal Device Cache and Memory Controller, Home, and Peer Cache; legend: cache states Modified/Exclusive/Shared/Invalid, with tracker allocate/deallocate.]

Device Bias Read

• No messages on the CXL interface
• The device cache moves I → E directly from its internal memory controller; the bias stays Device

[Flow diagram: CXL Device with internal Device Cache and Memory Controller, Home, and Peer Cache.]

Device Cache Evictions

• E/M in the device cache implies Bias=Device, so no indication is sent to the host
• The device cache evicts M → I to its internal memory controller

[Flow diagram: CXL Device with internal Device Cache and Memory Controller, Home, and Peer Cache.]

Host Bias Streaming Write

• A MemWrFwd message is sent after coherence is resolved
• The peer cache degrades S → I and the bias flips from Host to Device

[Flow diagram: CXL Device with internal Device Cache and Memory Controller, Home, and Peer Cache.]

Device Bias Streaming Write

• No message to the host; the device writes its internal memory directly

[Flow diagram: CXL Device with internal Device Cache and Memory Controller, Home, and Peer Cache.]

Thank You

Join Today!

www.computeexpresslink.org/join

Follow Us on Social Media

• @ComputeExLink
• www.linkedin.com/company/cxl-consortium/
• CXL Consortium Channel

Audience Q&A
