A 24-port 10G Switch

(with asynchronous circuitry)

Andrew Lines

1 Agenda

Product Information Technical Details Photos

2 Tahoe: First FocalPoint Family Member

The lowest-latency feature-rich 10GE switch chip

• 10G Ethernet switch
  - 24 Ports
• Line rate performance
  - 240Gb/s bandwidth
  - 360M frames/s
  - Full-speed multicast
• Fully-integrated single chip
  - 1MB frame memory
  - 16K MAC addresses
• Lowest latency Ethernet
  - 200ns with copper cables
• Rich Feature Set
  - Extensive layer 2 features
• Flexible SERDES interfaces
  - 10G XAUI (CX-4)
  - 1G SGMII

[Figure: Tahoe block diagram. Labels: SPI, CPU, JTAG, LED; Frame Processor (Scheduler); Nexus®; RapidArray™ (packet storage); XAUI (CX-4) ports. Legend: Asynchronous Blocks.]
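The line-rate figures above can be sanity-checked with a little arithmetic. The sketch below assumes standard minimum-size Ethernet frames (64 bytes plus 8 bytes of preamble and a 12-byte inter-frame gap); those framing constants come from the Ethernet standard, not from the slide, and the function names are illustrative only.

```python
# Back-of-the-envelope check of the line-rate figures quoted for Tahoe.
PORTS = 24
PORT_RATE_BPS = 10_000_000_000   # 10G Ethernet per port

# Standard Ethernet minimum frame on the wire (assumed; not on the slide).
MIN_FRAME_BYTES = 64             # minimum frame, including FCS
PREAMBLE_BYTES = 8               # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12        # minimum inter-frame gap


def aggregate_bandwidth_bps(ports=PORTS, rate=PORT_RATE_BPS):
    """Total bandwidth when every port runs at line rate."""
    return ports * rate


def max_frames_per_second(ports=PORTS, rate=PORT_RATE_BPS):
    """Worst-case frame rate: every port sends back-to-back minimum frames."""
    bits_per_frame = 8 * (MIN_FRAME_BYTES + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES)
    return ports * rate / bits_per_frame


if __name__ == "__main__":
    print(f"{aggregate_bandwidth_bps() / 1e9:.0f} Gb/s")
    print(f"{max_frames_per_second() / 1e6:.1f} M frames/s")
```

Under these assumptions the result is 240 Gb/s and roughly 357M frames/s, which is consistent with the rounded 360M frames/s quoted on the slide.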

3 Tahoe Hardware Architecture

Modular architecture, centralized control

[Figure: Tahoe hardware block diagram.
  Management: SPI Interface, CPU Interface, JTAG Interface, LED Interface
  Frame Control: Lookup, Handler, Stats, LCI
  Scheduler
  RX Port Logic: SerDes, PCS, MAC (per port)
  TX Port Logic: MAC, PCS, SerDes (per port)
  Switch Element Data Path: Nexus® crossbars on either side of RapidArray™ (1MB Shared Memory)]

4 Tahoe Chip Plot

Fabricated in TSMC 0.13um

[Figure: chip plot with the following regions labeled]

Ethernet Port Logic
  - SerDes
  - PCS
  - MAC

RapidArray Memory
  - 1MB shared

Nexus Crossbars
  - 1.5Tb/s total
  - 3ns latency

Scheduler
  - Highly optimized
  - High event rate

MAC Table
  - 16K addresses

Management
  - CPU interface
  - JTAG
  - EEPROM interface
  - LEDs

Frame Control
  - Frame handler
  - Lookup
  - Statistics

5 Bridge Features

Robust set of layer-2 features

• General Bridge Features
  - 16K MAC entries
  - STP: multiple, rapid, standard
  - Learning and Ageing
  - Multicast GMRP and IGMPv3
• VLAN Tag (IEEE 802.1Q-2003)
  - Add / Remove tags
  - Per port association default
  - 4K-entry VLAN-ID table
  - Per VLAN, per-port STP
• Scheduling, Pause, Congestion
  - 16 traffic classes for WRED
  - 4 queues per port scheduling
  - WRR or strict priority
  - Pause support
• Security
  - 802.1x; MAC Address Security
• Monitoring
  - Rich monitoring terms
    • logical combination of terms
    • Src Port, Dst Port, VLAN, Traffic Type, Priority, Src MA, Dst MA, etc.
  - Monitoring action
    • Drop, Mirror, Redirect, Count, Change Priority
  - 16 rules per frame
• Statistics
  - RFC 2819 compliant
  - All counters are 64 bits
  - 13 counter groups
    • RMON and SMON
    • Fulcrum extensions
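To make the "WRR or strict priority" choice concrete, here is a minimal software sketch of the two textbook disciplines over four queues per port. It is not a description of the Tahoe scheduler: the frame-count weights, the queue ordering (index 0 is highest priority), and all names are assumptions made for illustration.

```python
from collections import deque


def strict_priority(queues):
    """Always serve the first non-empty queue; index 0 is highest priority."""
    while any(queues):
        for q in queues:
            if q:
                yield q.popleft()
                break


def weighted_round_robin(queues, weights):
    """Visit queues in order; queue i may send up to weights[i] frames per round.

    Weights are counted in frames and must be >= 1 (an assumption of this
    sketch; real schedulers often weight by bytes instead).
    """
    while any(queues):
        for q, w in zip(queues, weights):
            for _ in range(w):
                if not q:
                    break
                yield q.popleft()


def make_queues():
    """Four queues per port, each holding three made-up frame names."""
    return [deque(f"q{i}f{j}" for j in range(3)) for i in range(4)]


if __name__ == "__main__":
    print(list(strict_priority(make_queues())))
    print(list(weighted_round_robin(make_queues(), [2, 1, 1, 1])))
```

Strict priority drains queue 0 completely before touching queue 1, so low-priority traffic can starve; weighted round robin gives every queue a turn each round while still favoring the heavier-weighted ones.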

6 Link Aggregation and Fat Tree Support

• True IEEE-compliant Link Aggregation used to group links between line and fabric switches
• Symmetric hashing guarantees a conversation resolves to the same fabric switch chip
• Ingress to fabric hop uses Link Aggregation hardware to load balance

Link Aggregation features

• Configuration
  - 12 trunk groups
  - Any ports in a group
  - Up to 12 members
• Hash: Ethernet CRC
  - Programmable Input
  - SA, DA, Type, VLAN-ID, Priority, Source port
  - SA-DA hash symmetry forcing
  - Group renumbering
• Other HW hooks
  - Slow protocol traps

[Figure: fat tree. A row of Fabric Chips above a row of Line Chips, connected by Intra-switch Links (ISL); end stations MAC A and MAC B attach to line chips.]
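The effect of symmetric hashing can be illustrated in software. The sketch below hashes only the source and destination MAC addresses with CRC-32 (the Ethernet CRC polynomial, via Python's `zlib.crc32`) and forces symmetry by ordering the two addresses before hashing. The ordering trick, the restriction to SA and DA, and the function name `trunk_member` are assumptions for illustration; the slides do not describe how the hardware builds its hash key.

```python
import zlib


def trunk_member(src_mac: bytes, dst_mac: bytes, n_members: int,
                 symmetric: bool = True) -> int:
    """Pick a link-aggregation group member for a frame.

    With symmetric=True the two addresses are put in a canonical order first,
    so A->B and B->A always select the same member (and therefore the same
    fabric chip in a fat tree).
    """
    if symmetric:
        first, second = sorted((src_mac, dst_mac))
    else:
        first, second = src_mac, dst_mac
    return zlib.crc32(first + second) % n_members


if __name__ == "__main__":
    mac_a = bytes.fromhex("001122334455")   # made-up addresses
    mac_b = bytes.fromhex("66778899aabb")
    print(trunk_member(mac_a, mac_b, 12), trunk_member(mac_b, mac_a, 12))
```

With symmetry forced, both directions of a conversation between MAC A and MAC B resolve to the same group member; without it, the two directions are hashed independently and may take different fabric chips.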

7 Two Versions Sampling in Q1 2006

Announced pricing at SC|05
First company to break through $20/port for 10GE

• FM2224
  - 24 10GE Interfaces
  - 1433-ball BGA
  - 40mm
  - $450

• FM2112
  - 8 10GE Interfaces and 16 1-2.5GE Interfaces
  - 897-ball BGA
  - 32mm
  - $265

8 24-Port Reference Design (Now Shipping)

Evaluation Platform

[Photo: reference design front panel. Ports 1-12 on the lower row and 13-24 on the upper row, plus connectors labeled CSL and ETH.]

9 Agenda

Product Information Technical Details Photos

10 Tahoe Hardware Features

• Multiple Frequency Requirements
  - 3.125GHz serial links (licensed from RAMBUS)
  - 312.5MHz 32-bit datapaths (sync and async)
  - 750MHz MAC Table, Scheduler, Main Memory, Statistics, cross-chip interconnect (async)
  - 360MHz Frame Processing (sync)
  - 66MHz Management (sync)
• Mixed design styles
  - 3 synchronous blocks: synthesize, place, and route
  - Many custom async blocks (most of the transistors)
  - Licensed cores: SERDES, PLL, TTL pads, fusebox
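Two of these frequencies follow directly from the 10 Gb/s port rate, which a short calculation shows. The sketch assumes the 3.125GHz serial links are XAUI lanes (four lanes per port with 8b/10b coding, per the XAUI standard); the lane count and coding are not stated on this slide, and the function names are illustrative.

```python
def datapath_bandwidth_bps(clock_hz: int, width_bits: int) -> int:
    """Bandwidth of a parallel datapath moving one word per clock."""
    return clock_hz * width_bits


def xaui_payload_bps(lanes: int = 4, baud: int = 3_125_000_000) -> int:
    """Payload rate of a XAUI port: 8b/10b coding carries 8 data bits per 10 line bits."""
    return lanes * baud * 8 // 10


if __name__ == "__main__":
    print(datapath_bandwidth_bps(312_500_000, 32))  # 32-bit datapath at 312.5MHz
    print(xaui_payload_bps())                       # 4 lanes at 3.125 Gbaud
```

Both expressions come to 10 Gb/s, which is why a 32-bit datapath at 312.5MHz matches one 10GE port.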

11 Tahoe Chip Statistics

• TSMC 0.13um LVOD FSG 1.2V
• 105M transistors
• Over 3000 unique cells
• 1.5MB total SRAM (all asynchronous)
• 0.5-1.5W per port depending on activity (36W peak)
• Flip-chip BGA package

12 Sync and Async together?

• Use existing 3rd party IP cores for synchronous I/O, such as high-speed SERDES from RAMBUS.
• Use standard synchronous synthesis, place, and route flow to implement logically complex units with lower speed requirements.
• Use async flow only where it has the biggest advantages: SRAMs, crossbars, chip-wide interconnect, FIFOs, and high-speed blocks.
• Must partition the problem in Architecture.
• Some day everything will be Async, but not yet!

13 Simple Sync-to-Async Conversion

• Synchronous Request / Grant FIFO protocol

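[Figure: two converter blocks.
  S2A: Synchronous Datapath in, Asynchronous Datapath out; Request and Grant signals and clock on the synchronous side.
  A2S: Asynchronous Datapath in, Synchronous Datapath out; Request and Grant signals and clock on the synchronous side.]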

Seamlessly Bridges Different Clock Domains
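A cycle-level model helps show what a synchronous request/grant FIFO protocol looks like from the clocked side. The sketch below assumes a word transfers on any clock edge where request and grant are both asserted, and that grant simply means "the buffer has room"; the slides do not spell out these semantics. It models only the synchronous interface, not the asynchronous handshake or any synchronization circuitry inside a real S2A converter, and the class and function names are hypothetical.

```python
from collections import deque


class SyncFifo:
    """Clocked side of a request/grant FIFO (assumed semantics, see lead-in)."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def grant(self):
        # Grant is high whenever there is room for one more word.
        return len(self.items) < self.depth

    def clock(self, request, data=None):
        """One clock edge: the word transfers only if request AND grant are high."""
        if request and self.grant():
            self.items.append(data)
            return True
        return False

    def drain(self):
        """Stand-in for the asynchronous side removing one word, if any."""
        return self.items.popleft() if self.items else None


def simulate():
    fifo = SyncFifo(depth=2)
    to_send = deque([10, 11, 12, 13])
    received = []
    cycle = 0
    while to_send or fifo.items:
        # The sender raises request whenever it has data and holds it until granted.
        if to_send and fifo.clock(request=True, data=to_send[0]):
            to_send.popleft()
        # The consumer drains on odd cycles only (an arbitrary rate, to force stalls).
        if cycle % 2 == 1:
            word = fifo.drain()
            if word is not None:
                received.append(word)
        cycle += 1
    return received


if __name__ == "__main__":
    print(simulate())
```

Because the sender keeps its word until the transfer is granted, words arrive in order and none are lost even when the consumer is slower than the producer.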

14 Digital Verification

• Often overlooked in academia, but crucial in industry!
• There are nearly as many engineers in verification as there are in design.
• Use the industry-standard approach of a full-chip simulation with test bench, test suite, and regression engine.
• Try to get full line and conjunct coverage.
• Convert CSP/PRS for chip-level simulation combined with synchronous blocks.
• Also use simple closed-environment self-tests to check that different levels of async decomposition match, but this is not sufficient.

15 Design For Test

• Must be able to check for manufacturing defects in async blocks.
• Introduce special "scan-buffers" which integrate a serial shift register into an async buffer.
• Connect the scan-buffers into 16 serial scan-chains.
• Can issue an inject, drain, or skip command to each scan-buffer on a scan-chain.
• External clocked interface to standard testers.
• Commercial fault-grading tool (ZOIX).

16 Async SRAM in FocalPoint

• Use TSMC 6T state bit layout
• Multi-bank design connected with async crossbars and busses
• Supports up to 32 write ports and 32 read ports in parallel
• Bank runs at 600MHz, but interconnect sustains 750MHz

17 SRAM Test and Repair

• Scan-buffers integrated into most SRAM banks.
• On-chip accelerated testing for largest SRAM.
• Tester produces a defect map.
• Burn fusebox to use spare addresses to repair bit or address-line errors.
• In many SRAMs, can simply remove a block of bad "segments" of storage from the free memory pool. This can repair many more types of errors.
• Yield looks quite good so far, as expected.
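The free-pool repair idea can be sketched in a few lines: if frame storage is handed out in segments from a free list, a defective segment is "repaired" simply by never putting it on that list. The segment numbering, the defect-map format (a list of bad segment indices), and the class name below are assumptions for illustration, not details from the slides.

```python
def build_free_pool(total_segments, defect_map):
    """Return the usable segment indices, leaving out every segment in the defect map."""
    bad = set(defect_map)
    return [seg for seg in range(total_segments) if seg not in bad]


class SegmentAllocator:
    """Hands out storage segments; defective ones are never in the pool."""

    def __init__(self, total_segments, defect_map=()):
        self.free = build_free_pool(total_segments, defect_map)

    def alloc(self):
        """Take a free segment, or None if the pool is exhausted."""
        return self.free.pop() if self.free else None

    def release(self, seg):
        """Return a previously allocated segment to the pool."""
        self.free.append(seg)


if __name__ == "__main__":
    allocator = SegmentAllocator(total_segments=8, defect_map=[2, 5])
    print(sorted(allocator.free))
```

The cost of this style of repair is a slightly smaller usable memory, but it tolerates any defect confined to a segment, whatever its cause, which matches the slide's point that it can repair many more types of errors than spare addresses alone.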

18 Agenda

Product Information Technical Details Photos

19 FocalPoint Test Platform

20 FocalPoint EP Board

21 FocalPoint EP Rack

22 Wishlist

• CSP vs CSP formal verification
• CSP vs PRS formal verification
• ATPG tools for async circuits
• Static timing for async circuits
• Async synthesis from CSP
• 65nm advice

If you're working on any of these, talk to me!
