EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH

CERN/DRDC 93-20
RD24 Status Report
5 May 1993

RD24 Status Report
Application of the Scalable Coherent Interface to Data Acquisition at LHC

A. Bogaerts¹, J. Buytaert, R. Divia, H. Müller¹, C. Parkman, P. Ponting, D. Samyn CERN, Geneva, Switzerland

B. Skaali, G. Midttun, D. Wormald, J. Wikne University of Oslo, Physics Department, Norway

F. Cesaroni, S. Falciano, G. Medici INFN Sezione di Roma and University of Rome, La Sapienza, Italy

P. Creti, M. Panareo INFN Sezione di Lecce, Italy

A. Ekimov IHEP, Protvino, Russia

K. Lochsen, E. Rongved, B. Solberg Dolphin SCI Technology A.S., Oslo, Norway

A. Mistry, A. Guglielmi, A. Pastore Digital Equipment Corporation (DEC), Joint Project at CERN

F-H. Worm, J. Bovier Creative Electronic Systems (CES), Geneva, Switzerland

S. Robison, D. North, G. Stone Apple Computer, Inc. Cupertino USA

Abstract

The RD24 project has designed SCI node interface hardware and software to test the applicability of SCI to Data Acquisition Systems. The results demonstrate working SCI protocols, very high bandwidth and scalability. First samples of the SCI NodeChip™ ASIC were delivered to CERN in March 1993. Interconnected via SCI cables, two nodes were used to form a simple SCI ringlet. First SCI traffic, at 500 MB/s link speed, was observed on 16.4.93 at CERN. Using a diagnostic hardware and software set for SCI, we have verified the correctness of SCI packets and protocols. RD24 has designed and completed a RISC processor node with FIFO interface and a combined memory/DMA node. Work continues on a Futurebus+ interface, a TURBOchannel interface, an Apple Quadra interface, an SCI to Fastbus interface and a bridge to transmit SCI over coaxial cables or optical fibers. SCI-based architectures have been simulated. The results predict the feasibility of a scalable SCI event builder, based on SCI switches, with simulated throughput up to 40 GBytes/s.

1. Joint spokesmen

Chapter 1. Goals and Motivation

1.1 Scope of SCI for Data Acquisition

A uniform SCI network consisting of up to 64 K nodes between all the components of a DAQ system, such as front-end memories (after 1st level trigger and data compression) and processor farms, can be achieved provided that all required components and interfaces become available. In its first phase, RD24 started the design and test of single-ringlet components in close collaboration with industry. In its second phase, RD24 plans to extend to multiple ringlets and, ultimately, to large SCI systems. The scaling properties of SCI, a subject under study via architecture simulation within RD24, are particularly important to allow building large SCI systems from basic SCI constituents such as nodes, ringlets and bridges.

Several physical SCI implementations are possible. The 1 GByte/s version requires a GaAs NodeChip™ [2] transmitting packets over 16 ECL signal pairs at 500 MHz. A functionally equivalent, low cost CMOS chip will become available in 4/93. Consuming less power, it is initially targeted at 100-150 MByte/s. These chips roughly match the 1.4 Gigabit/s speed of already existing GiGa chips [9] for serial encoding of 16/17 bit data over coaxial cables or optical fibers.

SCI may be used as an I/O system and simultaneously implement a distributed shared memory over a 64 bit wide, physically distributed address space. Both DAQ and control applications can use SCI. The I/O approach optimizes speed for block transfers as required by data driven systems. The shared-memory approach optimizes access to scattered data and simplifies software. Use of caches, a novelty for DAQ systems, improves access from trigger processors or event displays to SCI buffers containing the event. These buffers may reside anywhere in an SCI system and therefore make data-copying redundant. The optional cache coherency improves those data accesses, preserves consistency of cached data, and provides synchronization for data driven systems. A system-wide real time clock provides timeslot identification of events. Broadcasting can be used for initialisation. SCI is seen by RD24 as a uniform system for DAQ: it can uniformly cover all key areas between front-end memories and event logging computers (figure 4).

1.2 Bandwidth and Scalability

With a raw transmission speed of 1 GByte/s, the theoretically highest data rate is obtained using the dmove64 transaction, a pipelined, non-coherent 64 byte write operation.

Figure 2: Raw packet transmission (SCI carries a flag plus 16 data signals; 16 bit data every 2 ns = 1 GByte/s; packets contain 0, 16, 64 or 256 bytes of data plus a 64 bit address and command)

The packet overhead reduces the net bandwidth to approx. 60%. We have not been able to measure this yet, but the figure has been confirmed by other point-to-point based systems and by our simulations. The theoretical maximum usable bandwidth is therefore 600 MByte/s per node.

We have successfully operated the first SCI chips at 1/2 of the nominal 1 GByte/s link bandwidth. The best samples have already been operated at 700 MByte/s. RD24 has used 1/2 speed in order to stay in safe operating areas and also to ease interface requirements. One of our interface designs, a DMA node between a dual port memory and SCI, has achieved data transfers in excess of 100 MByte/s. In practice, with further improvements in this design (which depends on memory speed and the hardware state machines) we hope to achieve 250 MByte/s later this year. The aggregate bandwidth of an SCI ringlet, assuming random traffic between nodes, corresponds to roughly twice the nominal bandwidth of the link. Using the DMA node as data producer in a data-driven front end SCI ringlet, the merging of many data sources becomes possible as a replacement of bus-based data merging stages. Using the above argument, a 100 MByte/s CMOS node ringlet can merge data sources up to approx. 200 MByte/s.
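The arithmetic behind these figures is simple and can be reproduced with a few lines of C. In the sketch below, the link speed, the ~60% protocol efficiency and the factor-of-two aggregate ringlet bandwidth are the numbers quoted above; the 20 MByte/s per data source is an arbitrary assumption used only for illustration.

```c
/*
 * Back-of-envelope estimate of SCI ringlet capacity, based only on the
 * figures quoted above (1 GByte/s raw link speed, ~60% usable bandwidth,
 * aggregate ringlet bandwidth ~2x the link speed for random traffic).
 * Illustrative sketch, not a measurement.
 */
#include <stdio.h>

int main(void)
{
    double link_raw     = 1000.0;             /* nominal GaAs link speed, MByte/s */
    double net_fraction = 0.60;               /* usable fraction after overhead   */
    double net_per_node = link_raw * net_fraction;       /* ~600 MByte/s          */

    double cmos_link    = 100.0;              /* CMOS NodeChip target, MByte/s    */
    double ringlet_aggr = 2.0 * cmos_link;    /* random traffic: ~2x link speed   */
    double source_rate  = 20.0;               /* assumed rate of one data source  */

    printf("usable bandwidth per GaAs node : %.0f MByte/s\n", net_per_node);
    printf("CMOS ringlet aggregate         : %.0f MByte/s\n", ringlet_aggr);
    printf("data sources of %.0f MByte/s   : %.0f per CMOS ringlet\n",
           source_rate, ringlet_aggr / source_rate);
    return 0;
}
```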

Tens of these CMOS ringlets can be connected via optical SCI-fibre over hundreds of meters into a merger ringlet of 2 GByte/s total bandwidth using the GaAs node chips. This approach requires bridges from CMOS nodes to SCI-fibre and from SCI-fibre to GaAs. Both bridges could be built from a combination of NodeChips and the existing GiGa chip from Hewlett Packard [9].

Figure 3: Data merging using SCI (200 MByte/s CMOS ringlets coupled through a CMOS to SCI-FI bridge and optical SCI ringlets to a 2 GByte/s SCI-FI to GaAs ringlet bridge feeding the event builder)

An application of SCI for larger experiments would necessarily require a system consisting of multiple ringlets. The throughput scaling from single to multiple ringlets is the main topic for the second phase of RD24. These architectures will require SCI-SCI bridges to interconnect ringlets, and ultimately fast SCI switch components as building blocks for the GByte/s "superswitches" needed for LHC [4,5]. RD24 has started work on the simulation of SCI bridges and switches in collaboration with industry projects which aim to implement such components in silicon.

1.3 A preview of a uniform SCI system

High rate data acquisition systems can be composed of different bus standards and links [18]. RD24 is convinced that a uniform SCI system has particular advantages in simplicity, performance, maintenance and cost, if SCI covers a large area between farm processors and the data buffers after the 1st level trigger. A detailed view of such a system is shown in figure 4. Data access and merging over low cost CMOS ringlets and optical ringlets to a crossover network should be feasible and applicable in 1993/94. DAQ systems with transfer rates in the order of 100 MByte/s could use a self-routing SCI crossover network as event builder. To build such a network, one requires SCI-to-SCI bridges, for which a commercial solution is expected by the end of 1993.

Figure 4: A uniform SCI DAQ system (detectors 1..N feed pipelines and the 1st level trigger; compressed data collecting ringlets and data mergers are connected through optical SCI ringlets to one of two possible event builders, an SCI superswitch (option 2) or an SCI crossover network (option 1), serving the CPU farm, a global access SCI ringlet, event display and monitoring, and remote networks)

Very high rate DAQ systems however, such as LHC experiments, must make use of a superswitch with scalable performance up to tens of GByte/s. Such an event builder can practically only be based on VLSI switch components. Availability of such switches (RD24 is active in their specification) is not expected before 1995. Multi-CPU farms need to be interfaced to SCI via bus bridges. The currently tested interfaces are a first approach, allowing for many enhancements. Workstation and Personal Computer interfaces are already being developed now. SCI nodes with integrated event memories and caches require chip support which is not expected before 1994. Data logging can be solved in SCI via a global access ringlet for all event memories and caches (see "Cache Coherency" on page 25) which transfers data streams in the order of 100 MByte/s to mass storage and remote networks.

Chapter 2. Individual progress and activity reports

Compared to the plans reviewed in our proposal [1], many activities of RD24 have been delayed by the late availability of the NodeChip™. The activities around the Futurebus+ to SCI bridge also report large delays due to the unavailability of hardware. The interface to the RD12 fast dual port memory had to be postponed due to manpower problems and incompatibility with SCI-like interface buses.

2.1 Shared-memory interface to an R3000 RISC processor (CERN)

This node design [6] is based on firmware to control packet handling and still offers good I/O performance (up to 20 MByte/s) due to firmware optimization, processor speed and use of its cache. The interface is very versatile and other "bus X-to-SCI" interfaces can easily be derived from it by providing a "bus X-to-R3000" interface and adapting the specific parts of the firmware which handle this communication channel. In this way the firmware is transparent to the processor on bus X, making it possible for that processor to have shared-memory access to SCI's address space. Such derivatives are the VME-to-SCI and TURBOchannel-to-SCI interfaces.

2.1.1 RIO-SCI VMEbus module (CERN and CES)

This is a commercial derivative of CERN's RISC node. The RIO is a popular R3000-based VMEbus module for HEP and avionics applications. The combination of the RIO with the CERN firmware node interface allows existing RIO systems to be upgraded to use SCI via add-on cards. The RIO in this respect is a flexible SCI starter module with competitively good performance.

2.1.2 TURBOchannel-SCI interface (DEC Joint Project and CERN)

2.1.3 Project Description

The TURBOchannel/SCI (TC/SCI) interface was originally designed to implement a simple bus bridge between the TURBOchannel bus (used by DEC in most of the MIPS based and Alpha AXP based workstations) and SCI. The actual implementation of the TC/SCI is composed of two special boards which connect the TURBOchannel bus to a specialized I/O card hosted by the RIO card. The RIO I/O card (RIO/SCI) acts as an SCI node and allows the execution of TURBOchannel to SCI transactions. The TC/SCI interface can perform (using the TURBOchannel terminology) both I/O operations (started on the system hosting the TURBOchannel bus) and DMA operations (started by the SCI ring), thus implementing a bidirectional bridge. The initially selected environment, a VAXstation 4000 running VMS, has subsequently been replaced by a DECstation 5000/200 with Ultrix (the DEC Unix implementation). For this environment we have planned to develop three software libraries: firmware residing on the TC/SCI interface, which is mandatory to comply with the TURBOchannel specifications and is intended to perform initialization of the hardware and some basic testing; a specialized device driver required by the Ultrix operating system, used to manipulate the TC/SCI board; and a user level library, intended to significantly simplify the writing of user applications by providing routines which make possible the mapping of TURBOchannel addressing spaces to SCI addressing spaces and vice versa. This user library communicates with the Ultrix device driver, which in turn communicates with the TC/SCI board to create the appropriate mapping tables.

A typical application developed for the TC/SCI board will initially call the library routine to create a mapping between the TURBOchannel and the SCI addressing spaces. The library routine returns an address in the application addressing space which can be used to directly write and/or read memory locations in the SCI addressing space. This can be done as simply as declaring a "variable" (i.e. a pointer, using the C syntax) which is, thanks to the library routines, directly mapped into the SCI space (a minimal sketch of such an access is given after section 2.2.1 below). The application can also call the library routine to perform the inverse mapping, that is from SCI to TURBOchannel. Any transaction (e.g. a memory location write) performed by any SCI node in the SCI addressing space mapped into the TURBOchannel is then performed by writing directly into the application memory space, thus implementing a shared memory system. However, a few precautions have to be taken to prevent optimizing compilers from stopping the application program from correctly reading the modified data.

The announcement of the new DEC Alpha AXP based workstations has introduced an additional target for the project. Our current plan is to move the TC/SCI board to AXP based systems running OSF/1 (the DEC Unix for the AXP platform). On the software side we are confident that the similarities between Ultrix and OSF/1 will make the software porting relatively easy.

2.1.4 Project Status

Unfortunately, due to the very late delivery of the SCI node chip, we were unable to complete the design of the hardware interface in time for this status report. On the software side, however, we have developed a prototype of the device driver and we have written all the routines which are part of the user level library. Some testing has also been performed, thanks to the fact that the software was based on a previous development made at the DEC Joint Project for the TURBOchannel/VME interface bridge.

2.2 I/O-like DMA interface to a 68040 CISC processor (CERN, Protvino and Dolphin)

A 68040-to-SCI node design was initially started as an application exercise at CERN, with a scientific associate sponsored by the Norwegian Research Council and Dolphin. A VMEbus Radstone Technology¹ 68040 single board computer was selected. This work resulted in a complete Verilog description [24] of this interface, which was never built but served as a valuable learning exercise for RD24. A new approach to a 68040 interface was started at CERN with an engineer from IHEP Protvino in Russia. We chose a fast dual-port memory approach, aiming to achieve data rates in excess of 100 MByte/s from dual ported memories in DAQ front end buffers, as required for data driven architectures. Such an interface is under the control of an MC68040 bus. This is not a disadvantage as Motorola aim their 68K CISC line at embedded control: a final application in DAQ could be a data driven front-end memory using a Motorola 32 bit micro-controller. A replacement of the 68040 interface by another 32 bit processor bus can be envisaged if necessary.

2.2.1 Memory node (CERN and IHEP Protvino)

This processor-less interface combines a NodeChip™ with a dual ported application memory. The design has been combined with the physical implementation of the CERN DMA node and allows SCI to read data from a memory; it has, however, no provision for caching. A cache controller chip is expected to become available in 1994.
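The following minimal C sketch illustrates the kind of user-level access described in section 2.1.3. The routine name tc_sci_map and its argument list are hypothetical placeholders for the real user-level library; only the idea of a directly mapped pointer and the need to guard against compiler optimization are taken from the text.

```c
/*
 * Minimal sketch of the user-level access described in section 2.1.3.
 * tc_sci_map() and its arguments are hypothetical names standing in for the
 * user-level library; the real TC/SCI library interface may differ.
 * The 'volatile' qualifier is the precaution mentioned in the text: it keeps
 * an optimizing compiler from caching reads of memory that remote SCI nodes
 * may modify behind the program's back.
 */
#include <stdint.h>
#include <stddef.h>

/* hypothetical library routine: map 'length' bytes of SCI space starting at
 * (node, offset) into the caller's address space and return a local pointer */
extern void *tc_sci_map(uint16_t sci_node, uint64_t sci_offset, size_t length);

int example(void)
{
    /* map 4 KBytes of a remote node's memory into our address space */
    volatile uint32_t *remote = tc_sci_map(0x0042, 0x100000, 4096);
    if (remote == NULL)
        return -1;

    remote[0] = 0xCAFEF00D;        /* plain store: becomes an SCI write  */
    uint32_t status = remote[1];   /* plain load: becomes an SCI read,   */
                                   /* re-read on every access (volatile) */
    return (int)status;
}
```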

1. Radstone Technology plc, Towcester, Northants, UK

2.2.2 Design report "Dual ported DMA SCI node" (CERN and IHEP Protvino)

A dual port interface [25] to the 68040 bus (the FIC from CES, and potentially a DMA enhancement for the Apple Quadra interface) was designed to allow the generation of fast data block movements in excess of 100 MByte/s under the control of a 68040 processor.

Figure 5: DMA memory node interface (an MC68040 system bus and the SCI NodeChip are coupled through the CBus by an SCI DMA controller; registers hold the block start command, the DPM address, the high and low parts of the SCI address and the byte count; the data path is a 16 KByte dual-port memory)
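To make the register interface of figure 5 concrete, the sketch below shows how a 68040 host might program such a DMA node. The register names, their layout and the status bits are illustrative assumptions inferred from the fields shown in the figure (block start, DPM address, high/low SCI address, byte count); they are not taken from the design report [25].

```c
/*
 * Hypothetical programming sequence for the DMA memory node of figure 5.
 * Register names, offsets and bit positions are illustrative only.
 */
#include <stdint.h>

struct dma_node_regs {
    volatile uint32_t dpm_addr;     /* source address in the 16 KByte DPM   */
    volatile uint32_t sci_addr_hi;  /* upper 32 bits of SCI target address  */
    volatile uint32_t sci_addr_lo;  /* lower 32 bits of SCI target address  */
    volatile uint32_t byte_count;   /* length of the block move             */
    volatile uint32_t control;      /* bit 0: start block DMA               */
    volatile uint32_t status;       /* bit 0: done, bit 1: error            */
};

/* start a block move from dual-port memory to a remote SCI address and
 * busy-wait for completion; returns 0 on success */
int dma_block_move(struct dma_node_regs *r, uint32_t dpm_offset,
                   uint64_t sci_addr, uint32_t nbytes)
{
    r->dpm_addr    = dpm_offset;
    r->sci_addr_hi = (uint32_t)(sci_addr >> 32);
    r->sci_addr_lo = (uint32_t)(sci_addr & 0xFFFFFFFFu);
    r->byte_count  = nbytes;
    r->control     = 1u;                   /* start block DMA */

    while ((r->status & 0x1u) == 0)        /* poll for completion */
        ;
    return (r->status & 0x2u) ? -1 : 0;
}
```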

The DMA uses a sequence of fast dmove64 packet protocols with wsb (write selected byte) synchronization protocols at the start and end of 64 byte data boundaries. The 16 KByte deep dual port memory can also be addressed via SCI as a simple memory node, mapped into the 68040 address space. The CADENCE/Verilog design used the Logic Automation behavioral model of the 68040 and the Dolphin Cbus/Verilog model of the SCI node chip.

2.2.3 FIC VMEbus processor DMA/DPM node interface (CERN and CES)

This is a commercial derivative of CERN's combined memory/DMA node for the dual 68040 processor FIC 8234 [19] module from CES. This is a VMEbus twin 68040 processor, typically running user applications under the OS/9™ or Lynx™ operating systems. The addition of a transparent interface part to the DMA part is desirable to allow a more general use of this module as an SCI-embedded trigger processor, or as a host processor for SCI-attached systems.

2.3 Apple Macintosh Quadra interface (ATG group Apple and CERN)

The Advanced Technology Group of Apple Computer, Inc. is developing a research prototype SCI interface for the Apple Macintosh Quadra series of personal computers. This interface will reside in the internal processor direct slot (PDS) of 68040-based systems and provide a transparent bus-bridge capability, mapping selected 68040 bus transactions to and from the appropriate SCI transactions as implemented by the first generation SCI node controller chip developed by Dolphin SCI Technology AS [2, 3]. The initial design will support direct 1-16 byte 68040-initiated transfers to other SCI node controllers. An outgoing address map table provides the capability of accessing 256 separate 1 MByte segments within the 64 bit SCI address space. Incoming SCI transactions are stripped to the low 32 bit address field, and the appropriate 1-16 byte data transfer is performed on the 68040 processor bus, returning data as required to the initiator. A simple upgrade of the interface control chip will add support for 64 byte transfers and implement a subset of the SCI compare & swap transaction to provide atomic transaction capability between systems. Support for the CERN-developed 68040 DMA interface model will also be integrated. The SCI interface itself comprises a standard-sized Quadra PDS '040 module (4"H x 11"L) with an attached daughter board. Early versions will connect to the Dolphin GaAs-based SCI evaluation VME module via flat cable. Later this year, integration of the CMOS-based SCI controller chip will be a simple process of replacing the daughtercard/cable assembly with a self contained CMOS SCI daughtercard module. Currently the initial design is undergoing simulation and verification in both Verilog and Synopsys environments. Several test place and route passes to the target Crosspoint FPGA have been completed; our performance goals (33 MHz 68040) and size restriction (4K gates) look achievable. We expect to integrate our interface module with the Dolphin SCI VME evaluation board when it arrives at the end of May, and to spend June in debug and test (see figure 6).

2.4 Design of a two node SCI ringlet demonstrator (CERN and Dolphin)

A major milestone has been achieved. A two-node SCI ringlet was successfully operated with CERN designed hardware and software to achieve node-to-node communication over a 4 meter, 500 MByte/s ringlet.
First samples¹ of the GaAs NodeChips™, designed by Dolphin SCI Technology and produced by Vitesse, were connected to CERN's R3000 (RIO) interface and Dolphin's VME bridge prototype. Operating via software on both nodes, we have successfully tested the execution of non-cached SCI protocols. We now possess the basic building blocks to build larger SCI systems and to investigate SCI's multi-ringlet scalability. A 350 K GaAs ASIC from Vitesse was chosen by Dolphin to fit the SCI packet protocols, buffers and cache coherence support into a single chip with direct connection to 16 bit SCI input and output links. The cables consisted of twin-coaxial flat cables from Gore². RD24 used first samples of node chips which were tested at half of their nominal speed (250 MHz instead of 500 MHz). For SCI link diagnostics we use a specially designed LinkProbe™ from Dolphin in combination with a Tektronix DAS9200 to monitor SCI packets and protocols at SCI speed. The first packets between Dolphin's VME bridge board and the CERN R3000 node were observed on 16.4.93.

2.4.1 Design environment

CERN's CADENCE/Verilog design environment was used for both the R3000 and the 68040 node interfaces. For logical design, a sequence of schematic capture and full simulation including processor bus cycles and nodechip Cbus behavior was used. This included the use of Dolphin's Verilog model of the node chip's 64 bit application bus, the Cbus [16].

2.4.2 Firmware-driven R3000 SCI node

A FIFO based interface between an R3000 RISC processor bus (the RIO from CES [19]) and the SCI NodeChip™ was designed and successfully used to generate SCI packets. The RISC firmware allows dispatching of data between SCI and memory or other I/O channels, i.e. VMEbus and TURBOchannel. The firmware handles SCI packet protocol assembly and disassembly in a tight loop to allow transparent use of SCI via software on a host (VME or DECstation).
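The structure of such a firmware dispatch loop is outlined in the C sketch below. All register and function names are hypothetical; the actual RIO firmware is considerably more involved, but the idea of assembling and disassembling SCI packets through FIFOs in a tight loop is the one described above.

```c
/*
 * Schematic outline of a firmware dispatch loop of the kind described for
 * the R3000 node. All helper names and packet field positions are
 * hypothetical placeholders.
 */
#include <stdint.h>

/* hypothetical FIFO access and local-memory helpers */
extern int      sci_in_fifo_empty(void);
extern void     sci_read_packet(uint32_t *buf);        /* pops one packet   */
extern void     sci_write_packet(const uint32_t *buf); /* pushes a response */
extern uint32_t local_read(uint64_t addr);
extern void     local_write(uint64_t addr, uint32_t data);
extern int      host_request_pending(void);
extern void     host_build_request(uint32_t *buf);     /* e.g. from VMEbus  */

#define CMD_READ   0x1u
#define CMD_WRITE  0x2u

void firmware_loop(void)
{
    uint32_t pkt[32];

    for (;;) {
        /* 1. serve incoming SCI requests against local memory */
        if (!sci_in_fifo_empty()) {
            sci_read_packet(pkt);
            uint64_t addr = ((uint64_t)pkt[1] << 32) | pkt[2];
            if (pkt[0] == CMD_READ) {
                pkt[3] = local_read(addr);      /* data for the response */
                sci_write_packet(pkt);          /* send response packet  */
            } else if (pkt[0] == CMD_WRITE) {
                local_write(addr, pkt[3]);
                sci_write_packet(pkt);          /* acknowledge           */
            }
        }
        /* 2. forward outgoing requests from the host (VME, TURBOchannel) */
        if (host_request_pending()) {
            host_build_request(pkt);
            sci_write_packet(pkt);
        }
    }
}
```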

1. Serial numbers 8 and 10
2. W.L. Gore & Associates (UK) Ltd.

Figure 6: Apple's SCI-Quadra Interface (Macintosh 68040 PDS to CBus SCI interface with FIFOs between the 68040 BCLK, CBus CCLK and SCI clock domains; a 256-entry 68040-to-SCI address map table translates the 68040 physical address into NodeId(15:0), Offset(15:0) and Page(11:0) fields of the outgoing SCI address)
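One plausible reading of the outgoing address map of figure 6 is sketched below in C: an 8-bit index selects one of the 256 table entries (one per 1 MByte segment, as stated in section 2.3), each entry supplies the 16-bit SCI node ID and the upper address bits, and the low 20 bits of the 68040 physical address pass through unchanged. The exact bit placement is an assumption; only the field widths and the 256-entry table are taken from the figure and the text.

```c
/*
 * Sketch of the 68040 -> SCI outgoing address translation of figure 6.
 * Field placement is an assumption based on the widths shown in the figure
 * (NodeId(15:0), Offset(15:0), Page(11:0) plus a 20-bit in-segment offset);
 * the real Apple interface may arrange the bits differently.
 */
#include <stdint.h>

struct map_entry {
    uint16_t node_id;   /* SCI NodeId(15:0)                      */
    uint16_t offset_hi; /* SCI address bits 47:32                */
    uint16_t page;      /* SCI address bits 31:20 (12 bits used) */
};

static struct map_entry map_table[256];   /* 256 x 1 MByte segments */

/* translate a 32-bit 68040 physical address into a 64-bit SCI address:
 * 16-bit node id in the top bits, 48-bit address offset below it */
uint64_t outgoing_translate(uint32_t phys)
{
    const struct map_entry *e = &map_table[(phys >> 20) & 0xFF];

    return ((uint64_t)e->node_id          << 48) |
           ((uint64_t)e->offset_hi        << 32) |
           ((uint64_t)(e->page & 0x0FFFu) << 20) |
           (uint64_t)(phys & 0x000FFFFFu);
}
```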

Figure 7: R3000 Firmware SCI node (an SCI daughter board carrying the node logic and a communication mailbox sits on the RIO mother board, which provides the R3000 CPU with I/D caches, the firmware and the VMEbus master/slave logic)

2.5 First SCI software

For SCI related software we followed three objectives: diagnostic utilities and basic tools; fast data movers and testers; SCI servers.

2.5.1 Diagnostic utilities and basic tools

Our test setup is composed of several modules, all of which have been integrated and validated. A small cluster of Sun stations is used as file server, to cross-compile software and to connect to the remote microprocessors.

On VMEbus we kept our environment as simple as possible for download, test and login: all the CPUs are running single-task monitors with some capabilities for debugging and diagnostics.

Figure 8: RD24 test ringlet (a Sun station, a FIC 8234 and a RIO with its SCI interface are connected over VMEbus and through the SCI/VME bridge to the SCI test ringlet)

For the PT-SBS915 (S-Bus to VME interface [21]) we have developed a general-purpose library to open windows in a SunOS process' virtual address space, windows transparently mapped onto VME.


These windows are normally non-sharable and have as parameters the VME target base address, the VME protocol and other parameters. The library recognizes modules by reference (e.g. "RIO" and "vmeRam") and maps them at run-time into the user's space at the proper VME address. A system-wide translation table is under the control of the station's administrator to define the VME hardware configuration. A small test package has been written to validate this library via simple read/write test cycles to any VME RAM area.
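The sketch below shows how such a window-mapping library might be used from a SunOS process. The function names (vme_map_module, vme_unmap) are hypothetical placeholders; only the idea of mapping a module by symbolic reference ("RIO", "vmeRam") and running simple read/write test cycles over the window is taken from the text.

```c
/*
 * Sketch of a test using the window-mapping library described above.
 * vme_map_module() and vme_unmap() are hypothetical names standing in for
 * the real library calls.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* hypothetical library calls */
extern volatile uint32_t *vme_map_module(const char *name, size_t *size);
extern void               vme_unmap(volatile uint32_t *window);

int ram_probe(void)
{
    size_t size;
    volatile uint32_t *ram = vme_map_module("vmeRam", &size);
    if (ram == NULL)
        return -1;

    /* simple read/write test cycle over the mapped window */
    for (size_t i = 0; i < size / sizeof(uint32_t); i++) {
        ram[i] = (uint32_t)i;
        if (ram[i] != (uint32_t)i) {
            printf("mismatch at word %zu\n", i);
            vme_unmap(ram);
            return -1;
        }
    }
    vme_unmap(ram);
    return 0;
}
```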

In parallel, a RAM tester has been developed for Sun stations. It controls and verifies accesses to a generic bank of RAM (internal Sun station RAM, VME RAM and SCI RAM). The goal of this package is to probe all RAM accesses via all types of protocols, data patterns and transfer sizes, with the help of a simple validation process (optionally running on a target machine). The control interface has been developed using the Motif toolkit for X11. The result is a set of menus, push buttons and status windows duplicated over the test client and the test server (figure 9).

Figure 9: RAM tester client control panel (screen shot of the Motif client showing the SCI tester configuration, test control, buffer and transfer sizes, data patterns, connection state and transfer statistics)

The first VME-based module targeted by our developments has been the RIO. The initial task was to access the resources of the RIO remotely from the Sun's console. After a first attempt via the RS-232 connection, we worked on the VME link offered by the PT-SBS915 and the RIO on-board monitor from IDT [29]. With the help of system programmers from CES we have jointly implemented a link to the RIO via the VME backplane, using generic memory mapped accesses to the RIO monitor's resources. This link allows remote reset, download of S Records (the output of the cross "C" compiler) and remote login from a Sun station process. As it needs only memory-mapped access to VME, the same client package has been ported with very little effort to a FIC 8232 running OS9, and a test suite on LynxOS is planned for the near future.

2.5.2 Fast data movers and testers

When the development of the FIFOs passed the reference checks, we designed some simple fast data movers. The target here is to transfer data at the maximum speed between one or two SCI nodes for performance evaluation and functional validation.

Optimization has been applied to reduce run-time delays. Three versions of this package are available: a client, a server and a client/server (for single-node loopback tests). This model is used with some hardware shortcuts to reduce latencies between the reception of move request packets and their acknowledgement from the Cbus side. All the developed models have the capability to check or not check the data packets. The rigidity of these testers made them unusable for the design and debugging phases; here something more flexible and with different capabilities was required. An SCI list compiler and executor has been written for a "mixed" Sun station/remote microprocessor environment. A list of SCI-oriented commands is accepted from the input channel (a file or the terminal) and converted into an internal optimized format.

The resulting list can be executed either from the Sun station via the S-Bus to VME bridge (at low speed) or directly from the RIO CPU (to reduce latencies). A set of ASCII directives is available to start SCI transactions, handle addresses and pointers (to remote and local I/O data buffers), compare blocks of data and output report information. The list executor can ignore, report or acknowledge errors coming from the Node Chip. Faulty actions can be retried from 0 (no retry) to an infinite number of times. As the list executor is under the control of a CPU, the throughput cannot reach the SCI capabilities and latencies are quite high. However, the double buffering option of the RIO to SCI I/O board helps pipelining consecutive SCI commands.

Figure 10: SCI list compiler (Sun) and executor (R3000) (terminal sessions showing a command list being downloaded from the Sun into the RIO and executed on the R3000 side, reporting Cbus transactions and errors)

2.5.3 SCI servers

The last activity of our phase one software plans concerned the SCI servers for firmware driven environments (Sun stations, TURBOchannel stations and RIOs). The goal of these servers is the control of generic-purpose state machines with high-level capabilities (cache coherent transactions, bridging between incompatible hardware links, multi-port transfers). Firmware gives high flexibility and easy debugging at the price of a high latency between packets and a moderate throughput.

2.5.3.1 The RIO server

The TURBOchannel/SCI connection is handled by a server driven by a synchronous polled state machine. Priority is given to incoming SCI packets (to reduce external latencies and to avoid time-outs); their target address is translated via a paging algorithm. The address space visible from SCI must be defined by external CPUs. Outgoing memory-mapped I/O requests from TURBOchannel follow a similar paged translation scheme. DMA requests coming from TURBOchannel and VME give all the parameters to the RIO server explicitly, so no translation/verification is required.


2.5.3.2 SCI Exerciser and Tester (SET)

The SCI Exerciser and Tester (SET) [17] is a program suite developed as a Master's degree project at the University of Oslo in close collaboration with Dolphin. The library emulates a coherent SCI memory plus a CPU/cache pair, and supports all selected, locked, coherent and non-coherent transactions. The two software models use the Dolphin S2Vbridge™ in packet mode to interface to the NodeChip. Currently the SET programs are being adapted to the RIO to SCI I/O board.

Figure 11: SET functional diagram (two CPU/cache software models and a memory model running under Unix are each connected through an S2Vbridge to the SCI hardware)

This gives the ability to build an SCI test environment very similar to a real multiprocessor system with multiple CPU and memory nodes. Such a test system has the same functions as a real system but, because most of the external device logic is a software model, it works at a lower speed.

2.6 SCI hardware and development tools from Dolphin SCI Technology

Dolphin [22] is recognized as the leading implementer of SCI products. Its product suite currently includes the SCI NodeChip™, an SCI mezzanine board, an SCI starter and evaluation kit and a complete SCI development environment consisting of tools and test equipment. During the project, CERN has had access to the Dolphin products as they have become available, and Dolphin has received feedback on its SCI developments.

2.6.1 SCI Node Chips and SCI HW development tools (Dolphin and Univ. Oslo)

Dolphin's SCI NodeChip™ (DST501A) [2, 3] integrates a defined subset of the SCI physical and logical protocols in one VLSI chip and offers a bandwidth per SCI node of 500 MB/s. The NodeChip is implemented in the GaAs FX series (FX350K) gate array technology from Vitesse Semiconductor Corporation, CA. Engineering samples of the chip mounted on the SCI Mezzanine Board and cables were delivered to CERN in March 1993. Between the 14th and 16th of April, the RD24 project, as the first site external to Dolphin, successfully completed a demonstration of a CERN-made SCI node communicating with Dolphin's VMEbus to SCI node (the SCI Evaluation kit). Dolphin's development tools provide system designers with performance evaluation tools as well as tools for the implementation (SCI NodeChip and Cbus Verilog models) and verification of application designs, as provided to the RD24 project. CERN and Dolphin jointly designed, as an application example in 1992, a Cbus to MC68040 DMA interface.

2.6.2 Bridge to VMEbus processors (Dolphin and Univ. Oslo)

Dolphin has prototyped an SCI-VME adapter and chaired the IEEE 1596.1 SCI/VME Bridge architecture committee [14]. A prototype of the SCI-VME bridge is operational and has been used at CERN for the first SCI tests, under the control of a commercial VMEbus processor. The final VME bridge will implement memory and cache and therefore depends on the availability of an SCI cache controller chip. The SCI-VME bridge is one of the key modules to interface SCI to a large base of existing VMEbus equipment and to preserve large DAQ software investments in VME.

2.6.3 SCI Tracer hardware (Dolphin)

As part of its product suite, Dolphin is developing an SCI Tracer for debugging and diagnosis of SCI systems [27]. The RD24 project (UiO, Dept. of Physics and CERN) has provided valuable input to the definition of the SCI Tracer functionality, and has developed control software in close collaboration with Dolphin [26, 28]. The SCI Tracer is a special logic analyzer which can acquire and analyze SCI symbols from an active SCI link at the full speed of 1 GByte/s. The hardware is implemented as a multiboard VME module. It consists of: an SCI board; a map board with memory maps for trigger patterns to control the tracing process; and a storage board with 256 KBytes of store memory. Different types of triggers can be recognized and used as the basis for storing sequences of SCI packets. During the project, the first version of the SCI Tracer - the LinkProbe™ - has been made available as a product. This is a front-end to a logic analyzer (such as the Tektronix DAS 9200), and is upgradable to a complete SCI Tracer.

2.7 Diagnostics, Fastbus-SCI node and Simulation (University of Oslo)

The SCI activities at the University of Oslo are strongly coupled to both the CERN and Dolphin activities.

2.7.1 SCI Tracer software

The software consists of the Tracer Control Language (TCL) and its compiler. TCL is a C-type language which is used to define the tracing conditions. The compiler translates the instructions into bit maps that are downloaded into the Tracer memory maps. The compiler checks the syntax of the source program as well as the semantics in relation to the SCI specifications. The first version of the TCL system has been running since autumn 1992 and has been extensively tested in a simulated environment [26].

2.7.2 SCI to Fastbus Interface

The interface is developed for a CERN Host Interface (CHI) Fastbus master [23]. The connection is made via the CHI XBus, which is essentially the MC68030 bus. The design is based on the same concept as the RIO interface, namely FIFO buffers between the NodeChip and the XBus, controlled by an embedded processor, an AMD 29200. Testing of the XBus-Cbus interface logic will start in summer 1993; a first version of the interface is expected to be operational before the end of 1993.

2.7.3 Simulation

The University of Oslo is strongly participating in the CERN MODSIM simulation activities for large architectures (see 2.9 on page 15).

2.8 Interface strategy to Futurebus+ at INFN Rome and Lecce

Both Rome and Lecce received, only in March 93, profile-F Futurebus+ starter kits from Nanotek, consisting of 14-slot crates powered at 600 W. Each contains an arbiter module and an interface module based on an R3000 processor. This set is meant to be suitable for tests and new hardware development in a laboratory. Also expected are two Futurebus+ 16 MByte memories.


The installation of the system was difficult because the ventilation of the crate is missing and a cooling system with external fans had to be arranged. Hardware manuals are absent. The mechanics were of poor quality, and the insertion of the processor module into the crate caused damage to the connectors of the module. The material was rejected and a replacement is expected. The functioning of the R3000 based interface was checked using the IDT/sim monitor [29], which permits the user to operate the CPU under controlled conditions, examining and altering the contents of memory, manipulating and controlling resources of the R3000, loading programs from host machines and controlling the path of execution of these loaded programs. We plan to purchase IDT/c, the cross C compiler, and IDT/kit, the Kernel Integration Toolkit, in order to complete the R3000 development software. The study and the evaluation of Futurebus+ will continue according to the delivery of the material.

... ringlet at CERN. A starting point is the CADENCE/Verilog model of the NodeChip-Cbus-FIFO approach: we can use it to start the development of the SCI-Futurebus+ bridge. This is still highly interesting in the light of long term experiments at LHC and medium term experiments like KLOE that plan to use processors on Futurebus+ platforms and possibly SCI as interconnect system. An increased participation in simulation using the MODSIM software is planned, as Rome has licences and manpower for such developments on a SPARCstation.

2.9 Simulation of SCI DAQ architectures (CERN, University of Oslo)

SCILab, a set of SCI modelling tools, was developed to simulate the data flow of large LHC type DAQ systems containing ~ 1000 nodes. A resolution down to SCI packet size (~ 100 ns on average) allows accurate simulation of congested data pathways.

2.9.1 Simulation tools

MODSIM II, an object oriented simulation package from CACI Products Company, La Jolla, Ca., was chosen after a market survey by RD13. It is also used at the SSC Laboratories. The low level SCI protocols which govern the exchange of packets between SCI nodes have been completely written in MODSIM II. There are also provisions for building SCI networks consisting of rings interconnected by bridges or switches. For the cache coherency, the C-code produced by the IEEE [10] has been linked into the MODSIM code. Various scripts based on standard UNIX tools have been used to prepare the input and output data. Results can be further processed by PAW or most commercial display packages. Large SCI systems can be efficiently simulated on desktop workstations. The amount of code written at CERN has been kept to a strict minimum.

2.9.2 Applications

Generic SCI nodes such as processors, memories, cache controllers, bridges and switches can be parameterized by estimated or measured values of hardware. More complex modules need additional code. The object oriented approach leads to great advantages since new modules can "inherit" most properties from existing ones and the structure of the existing program does not need to be modified. SCILab has been used by RD11 to simulate the Global Level 2 Trigger based on SCI hardware. We have made preliminary studies of the cache coherency [13] and investigated the scalability properties of switch based SCI systems.


2.9.3 Scalability of SCI (CERN, University of Oslo, Thomson, Dolphin)

Since scalability is crucial for DAQ applications at LHC, we present simulation results for an event builder based on an SCI switch. The throughput of the switch has been calculated as a function of the number of input/output channels. The result shows linear scalability over the simulated range of 2-64 channels. Such a switch can be built from elementary chips similar to the SwitchLink described in Chapter "SCI to SCI Bridges and Switches" on page 20. The size, complexity and cost of a large switch obviously depend on the number of ports of the switch chip. We have investigated the use of 2Rx2R¹, 4Rx4R and 8Rx8R chips and found that this has a negligible effect on the throughput of the composite switch. An example of a composite switch with 16 channels on the detector side and 16 channels on the CPU farm side is given in Figure 12. The example uses 32 elementary switch chips, each with 4 ports (a 2Rx2R or 4-switch in our terminology). Such a composite switch would provide 10 GBytes/s net data throughput, but this number scales with the size of the switch.

(The number of nodes in an edge ring does not have to be 8 as shown here.)

Figure 12: 16Rx16R bi-directional composite switch made of 32 2Rx2R chips

The use of SCI switches for event building has many advantages. The throughput capability of each channel is high (~ 600 MByte/s in our model), resulting in a small physical size of the superswitch; reliable SCI protocols ensure that data passes through the switch without loss or corruption; the switch is self-routing and provides flow control; all connections are bi-directional, eliminating the need for an additional network for synchronization between the front-end data sources and the CPU farm on the output side; and the switch can also be used to carry traffic generated by experiment controls, diagnostics and displays with little disturbance of the main data flow.

1. The terminology NRxNR or 2N-switch: e.g. a 2Rx2R switch or 4-switch indicates a switch which interconnects 4 rings.

2.9.4 Simulation results

The amount of on-chip memory required to equalize fluctuations in the input stream (assuming a uniform random distribution) is small, as illustrated in Figure 13. This shows the throughput of a 64Rx64R switch (built from 2Rx2R chips) as a function of on-chip memory (in units of SCI packet buffers). A value of 8 packets has been chosen for all subsequent simulations.

Figure 13: Throughput of a superswitch as a function of the size of the input/output FIFOs (raw and net throughput in GByte/s, using dmove256 packets, versus FIFO size in SCI packets)

The net throughput of an NRxNR superswitch based on switch chips with 4, 8 or 16 ports, using the SCI dmove64 transaction (a non-coherent, pipelined 64 byte write operation), is summarized in Table 1. The actual traffic on the links (raw throughput) is ~ 50% higher for the dmove64, since the raw throughput includes address and control information in addition to data. The ratio between raw and net data throughput depends on the choice of the SCI transaction and would be much lower for the dmove256, and slightly higher for the nwrite64.

Table 1: Net throughput (in GBytes/s) of an NRxNR superswitch formed from different elementary switch chips, using the dmove64 transaction

  Superswitch size NRxNR     2      4      8     16     32     64
  2Rx2R switch chips       1.3    2.5    5.1   10.0   19.6   38.7
  4Rx4R switch chips              2.5           9.8          37.5
  8Rx8R switch chips                     5.0                 37.5
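A rough consistency check of Table 1 can be written in a few lines of C. The sketch below assumes the ~600 MByte/s net throughput per channel quoted above, so that the net throughput grows as about 0.6 GByte/s per channel, and assumes a banyan-style multistage arrangement of 4-port (2Rx2R) elements, which needs (N/2)*log2(N) chips and matches the 32 chips quoted for the 16Rx16R example. Both formulas are illustrative approximations, not results of the simulation itself.

```c
/*
 * Rough scaling model for the composite switch: ~0.6 GByte/s net throughput
 * per channel, and (N/2)*log2(N) 4-port elements for an assumed banyan-style
 * arrangement. Compare with Table 1 (e.g. N = 16 -> ~9.6 GByte/s, 32 chips).
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double net_per_channel = 0.6;               /* GByte/s (dmove64) */

    for (int n = 2; n <= 64; n *= 2) {
        double net_throughput = net_per_channel * n;   /* GByte/s          */
        double chips_2rx2r    = (n / 2.0) * (log(n) / log(2.0));

        printf("N = %2d: ~%5.1f GByte/s net, ~%3.0f 2Rx2R chips\n",
               n, net_throughput, chips_2rx2r);
    }
    return 0;
}
```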

The proof of scalability of a composite switch based on 2Rx2R elementary switch chips is given in Figure 14 which shows the raw and net throughput for two different SCI transactions (dmove64, nwrite64) as a function of the size of the switch (number of input/output channels).


Figure 14: Throughput of an NRxNR superswitch based on 2Rx2R switch chips (raw and net throughput for the dmove64 and nwrite64 transactions versus the size of the superswitch)

Finally, a comparison of the raw and net throughput of different sizes of the superswitch based on 2Rx2R, 4Rx4R and 8Rx8R chips for the dmove64 and nwrite64 SCI transactions is given in Figure 15.

Figure 15: Comparison of NRxNR superswitches built from 2Rx2R, 4Rx4R and 8Rx8R chips (raw and net throughput for dmove64 and nwrite64 versus the size of the superswitch)


Architectural studies

Chapter 3. Synchronization of distributed DAQ systems

When the controllers of distributed DAQ systems - with networks spreading over hundreds of meters¹ - keep their elements in synchronism within different data flows, several synchronization problems need to be solved: event number generation, embedded microprocessor firmware and file-system updates, broadcasting of calibration and experiment control tables, and data-driven flow control are some of them. Special solutions have been found in conventional systems, but no particular candidate suits the requirements of distributed DAQ systems optimally. SCI offers uniform solutions based on the Control and Status Register architecture, on its special protocols and on other internal features.

3.1 Uniformity

Non-uniform distributed DAQ systems impose several problems on software/firmware maintenance and portability. Bridges are often non-transparent, slow and unreliable and imply incompatible access protocols and software libraries, which do not facilitate the design and management stages. SCI based DAQ systems can make good use of the uniformity between all the nodes on the network. Registers are accessed in the same way and at the same addresses, and transparent bridges are capable of alternative routing to bypass hot-spots and unavailable portions of the network. Identical access to all the components simplifies the system architecture. Uniformity of data structures is assured by the IEEE P1596.5 proposal on shared data structures [11].

3.2 CSR-1212 functions (SCI, Futurebus+ and Serialbus)

The IEEE Std 1212 on "CSR Architecture" [12] prescribes a set of registers whose functions are accepted on all nodes at the same addresses and following the same protocols. The following functions are provided: node reset, status information, unique node ID, indirect access to the internal resources, interrupt generation and control, time-of-day clock, and message mailboxes.

3.3 SCI Time stamps

One of the "missing" features of the networks used in current DAQ systems is a standard, precise and high-resolution time-stamp. SCI nodes have an optional set of clock registers where the time-of-day can be accessed with an accuracy of 2⁻³² seconds (~ 233 picoseconds) and a time-span (wrap around) of 136 years. This clock is initialized at node reset and kept in phase across the network via small software adjustments. Being 64 bits wide, the CLOCK_VALUE register can easily be used as a time stamp for events and for the logging of processes with a minimum of software overhead: no libraries are needed to access, pack and unpack the register (a short example follows below).

3.4 Coherency and firmware updates

Distributed microprocessor environments require firmware updates. Historically, two methods have been used: on-board updates (EPROMs, floppies, disks) and off-board updates (network file systems, boot servers).
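To make the CLOCK_VALUE time stamp of section 3.3 concrete, the C sketch below assumes that the 64-bit register simply counts in units of 2⁻³² seconds, as the quoted resolution (~233 ps) and wrap-around (~136 years ≈ 2³² s) suggest; the exact register layout is defined by the CSR Architecture [12].

```c
/*
 * Using the 64-bit CLOCK_VALUE register as an event time stamp, assuming it
 * counts in units of 2^-32 seconds (an assumption consistent with the
 * resolution and wrap-around quoted in section 3.3).
 */
#include <stdint.h>
#include <stdio.h>

/* seconds elapsed between two CLOCK_VALUE samples */
double clock_delta_seconds(uint64_t t0, uint64_t t1)
{
    return (double)(t1 - t0) / 4294967296.0;   /* 2^32 ticks per second */
}

int main(void)
{
    uint64_t stamp_a = 0x0000000100000000ull;  /* 1.0 s after reset */
    uint64_t stamp_b = 0x0000000180000000ull;  /* 1.5 s after reset */

    printf("delta = %.9f s\n", clock_delta_seconds(stamp_a, stamp_b)); /* 0.5 */
    return 0;
}
```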

1. In some cases, distances of kilometers are reachable in SCI via optical fibers and special lasers at reasonable latencies.


These procedures do work with the current DAQ systems, where a small number of processors needs to be updated. The required time ranges from hours to days. Coherency of these upgrades needs to be checked by individual actions (have we changed all the EPROMs?) which are prone to omissions.

Figure 16: Conventional vs SCI-based microprocessor architecture

A typical microprocessor architecture (see figure 16) uses a local bus-oriented scheme which leads to non-uniformity between local devices, network and I/O modules: different handling, special mapping, dedicated system-dependent device drivers. Using an SCI interface, the connected SCI system can be seen as local RAM, handled transparently by the SCI controller. The CPU then has access to a fast and coherent link. After a simple bootload from EPROM, SCI can be used like a processor bus. Reconfigurations and updates become a selective cache refill procedure, at a small cost for the network [7].

Chapter 4. SCI to SCI Bridges and Switches

SCI offers a point-to-point link for a bus-like service. Its ring structure is sensitive to hardware failures and peak overloads, and it does not scale well when the number of nodes exceeds some practical limit. Applications demanding high-speed, low-latency point-to-point connections need to escape from the simple ringlet paradigm. For all these reasons, both SCI to SCI bridges and SCI to SCI switches will become vital elements of complex, reliable, highly efficient and truly scalable SCI architectures.

Figure 17: SCI Bridge (2x2) and Switch (4x4)

An SCI to SCI bridge (figure 17) can be seen as two SCI nodes in a back-to-back architecture. All the packets accepted by one of the two nodes (the "near" port) are passed over to the other node (the "far" port). Each port has the basic structure of a standard, simplified SCI Node Chip with some special features. The port-to-port connection is simple (as there is no contention or routing to be handled) but might introduce problems for the synchronization and pipelining of consecutive packets. An SCI switch uses the same basic element, made capable of routing incoming packets to more than one destination. Broadcast packets might have to be passed in parallel to several far ports, according to the broadcast scheme in use.


The same simplifications and complications of a bridge port apply also to a switch port, with some extra problems arising from multi-master contention, broadcasting and synchronization. The RD24 project has developed a basic building block called "SwitchLink". Scalable, pipelined and concurrent, SwitchLink is based on the existing SCI node chip design. It interconnects up to "N" ports, where N is a parameter imposed only by physical constraints (packaging, cooling, pinning). With its modular structure, it allows multi-chip implementations in mixed-logic configurations. Each SwitchLink basic element has a private link to each of the other ports and receives one input link from each of them. For an NxN SwitchLink there is a total of N private links, connecting each port to all its far ports. These connections are simplified by the SCI characteristic of being a "link"; they are not a crossbar switch - although they provide identical functionality - and are much simpler to implement than a full matrix switching network. Once a packet exits from a Node Chip output queue, it is immediately routed to its destination at minimum run-time cost, independent of the number of ports of the switch and of the number of destinations of the packet (broadcasting). A per-port controller regulates the traffic and handles a score table, where the status of the remote ports is tagged. Congested ports are marked as "unavailable" and alternative routes can be probed in parallel. As an ultimate solution, packets can be "busied" (a standard SCI reaction for congested nodes) so that retries are performed from the source node. Complete freedom is given to the routing algorithm, as it has been kept independent of the SwitchLink design.
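The per-port decision logic just described can be summarized in the following illustrative C sketch. The structures, the routing hook and the status values are hypothetical; as stated above, the routing algorithm itself is deliberately left open in the SwitchLink design.

```c
/*
 * Illustrative pseudo-logic for a per-port SwitchLink controller: a score
 * table tags the status of the far ports, congested ports are marked
 * unavailable, alternative routes can be probed, and as a last resort the
 * packet is "busied" so that the source node retries.
 */
#include <stdint.h>
#include <stdbool.h>

#define N_PORTS 4                      /* e.g. a 2Rx2R / 4-switch element */

enum port_state { PORT_AVAILABLE, PORT_UNAVAILABLE };

struct switch_port {
    enum port_state score[N_PORTS];    /* score table: status of far ports */
};

enum route_result { FORWARDED, BUSIED };

/* hypothetical routing hook: maps a target SCI node id to a candidate port */
extern int route_lookup(uint16_t target_id, int alternative);

/* hypothetical hardware action: push the packet onto a far port's link */
extern bool forward_to_port(int port, const uint32_t *packet);

enum route_result handle_packet(struct switch_port *p, uint16_t target_id,
                                const uint32_t *packet)
{
    /* try the preferred route first, then probe alternatives */
    for (int alt = 0; alt < N_PORTS; alt++) {
        int port = route_lookup(target_id, alt);
        if (port < 0)
            break;                               /* no more candidates     */
        if (p->score[port] == PORT_AVAILABLE &&
            forward_to_port(port, packet))
            return FORWARDED;
        p->score[port] = PORT_UNAVAILABLE;       /* tag congested far port */
    }
    return BUSIED;   /* standard SCI reaction: source node will retry */
}
```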

Chapter 5. Data driven synchronization

Data driven algorithms used in distributed DAQ systems are complicated, need small latencies and might have a negative impact on the performance of the network. The use of a centralized "farmers' status table" simplifies these algorithms but introduces serious "hot spots". In a conventional architecture, updates to the central table can be notified via individual messages or global broadcasts. In this chapter we look at three possible scenarios for data-driven architectures running over SCI links.

5.1 Centralized tables

In the first approach we have one centralized table describing the status of the farmer nodes; the table is accessed coherently (read) from the data sources.

Figure 18: Centralized table access (legend: DS: data source (front-end), P: farmer node, C: controller, T: farmers' status table)

The updating of the table (step 1) starts a cache invalidation process (steps 2, 2.1 and 2.2) along the two axes (farmers-sources and sources-sources). The sources reload their cache lines (step 3) and create a new sharing list (step 4). This method produces traffic in all possible directions, with a number of packets proportional to O(data sources caching the line) + O(data sources that will cache the line), with some potential multiplicative factors introduced by networks and protocols. It needs cache coherence hardware support and might make good use of the SCI logarithmic extensions.

5.2 Distributed tables

In the distributed table approach (figure 19), the farmers' status table is duplicated over all the data sources.

Figure 19: Distributed table access

Every update (step 1) translates into an experiment-wide broadcast, vertical (step 2.1) and horizontal (step 2.2). This approach might look too expensive for the network; in reality it is quite efficient, as SCI broadcasts do not flood the network with messages and the broadcast traffic does not interfere with the experimental data stream. At any moment we can have a maximum number of packets of O(number of farmers) * O(number of ringlets). Bridges and switches can duplicate broadcast messages with very low latency (a hardware binary-copy process). The disadvantages of this method are: all the data sources receive a table update, whether they need it or not; there will be contention on the data sources' private resources; broadcast messages cannot be combined by bridges or switches; and this system relies heavily on SCI broadcasting schemes that might not be implemented in all generations and versions of SCI Node Chips.

5.3 Concentrated tables

The experiment-wide broadcast can be avoided by duplicating the farmers' table over a relatively small number of tables, one per front end ringlet or sub-partition (figure 20).

Figure 20: Concentrated table access

The farmers update cache lines via a coherent write (step 1), which is distributed to the distributed tables (step 2). The data sources then fetch their nearest copy (step 3) coherently or non-coherently. Cheap to implement, with small impact on the network, this scheme is flexible and can use broadcasts and/or coherency when they are available. The efficiency depends on the number and the position of the distributed tables within the network.


EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH

CERN/DRDC 93-20 RD24 Status Report 5 May 1993

RD24 Second Phase

Application of the Scalable Coherent Interface to Data Acquisition at LHC

A. Bogaerts¹, R. Divia, R. Keyser, H. Müller¹, G. Mugnai, P. Ponting, D. Samyn CERN, Geneva, Switzerland

B. Skaali, E.H. Kristiansen², D. Wormald, J. Wikne, B. Wu, H. Kohmann University of Oslo, Physics Department, Norway

S. Gjessing University of Oslo, Institute of Informatics, Norway

S. Falciano, F. Cesaroni, G. Medici INFN Sezione di Roma and University of Rome, La Sapienza, Italy

P. Creti, M. Panareo INFN Sezione di Lecce, Italy

A. Sytin, A. Ivanov, A. Ekimov IHEP, Protvino, Russia

K. Lochsen, E. Rongved, S.E. Johansen Dolphin SCI Technology A.S., Oslo, Norway

A. Mistry, A. Guglielmi, A. Pastore Digital Equipment Corporation (DEC), Joint Project at CERN

F-H. Worm, J. Bovier, A. Lounis Creative Electronic Systems (CES), Geneva, Switzerland

S. Robison, D. North, G. Stone Apple Computer, Inc. Cupertino USA

E. Sanchis-Peris, V. Gonzalez-Millan, J.M. Lopez-Amengual, A. Sebastia IFIC, Valencia, Spain

E. Perea
Thomson-CSF Semiconducteurs Specifiques, Orsay, France

Abstract

We present the objectives of the next phase of RD24. This work will be centred around multiple SCI ringlets, using SCI bridges and switches. Cache and cache coherency protocols will be tested for application in DAQ architectures. A refinement of previously prototyped node interfaces, diagnostic tools and software is foreseen for first application in HEP experiments. Simulation of LHC-sized SCI systems will continue with improved parameters.

1. joint spokesmen
2. SINTEF, Oslo, Norway



Chapter 6. Goals for the Second Phase

6.1 Future work on SCI node design

A new, low-cost CMOS version of the SCI NodeChip, designed by Dolphin and produced by LSI Logic [32], is expected in the fourth quarter of 1993. The interface to application hardware will remain compatible. Since both versions are of interest to DAQ applications, we intend to pursue both lines. Serial transmission over optical fibres (distances of ~10 km) or coaxial cables (distances of ~25 m) is supported by parallel-to-serial converter chips [9] from Hewlett Packard, at speeds of up to 100 MByte/s. This speed is expected to increase as well. We intend to pursue serial transmission together with the CERN SL Division and Dolphin.

Two complementary design goals for SCI interfaces are speed and transparency. Their combination is very desirable for processor interfaces, and design attempts to merge speed and transparent access have started. A future goal is a cost-effective direct interface to the cache controllers of existing CPUs, since this is a natural way of packetizing the data. This requires design activities on the side of computer and chip manufacturers; RD24 sees its role as advisory. Good candidates are the Mbus (SPARC) and the 88K bus (Apple PowerPC). Caches enhance the performance of a system considerably and are now standard on nearly all CPU chips. Caching becomes particularly important when, due to large distances, signal propagation delays become noticeable. When data is shared, caches must be coherent. The NodeChip from Dolphin supports cache coherency, but an additional (CMOS) chip is required for interfacing to user logic. Dolphin have a Verilog design for the Cache and Memory Controller (CMC), but its availability is not expected in 1993. This chip would simplify the packetizing of data considerably.

Simulation parameters can be improved using measured values from existing hardware. The IEEE C-code, in which the cache coherency protocols are defined, is now stable. RD24 is joined by new collaborators from both the Physics and Informatics departments of the University of Oslo, which will boost our expertise in the areas of simulations, switches, SCI topologies and cache coherency.

Interfaces which could not be finished in the first phase due to the late arrival of the NodeChip will be completed (interfaces to Turbochannel, Futurebus+ and Fastbus). Others will be enhanced and adapted to the expected CMOS chip (VME/SCI bridge, RIO data mover, 68040 interface). New collaborators (IFIC, Valencia) intend to interface SCI to the DSP. The later expected availability of the CMC controller should lead to the design of second-generation, shared-memory interfaces with possibilities for caching. Currently, Dolphin offers an SCI-VME evaluation board with minimal functionality as part of their starter kit. An improved version could be used for DAQ and accelerator control applications. An improved interface to a workstation will also be important for applications, and we intend to make use of an SBUS to SCI interface when this becomes available.
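To illustrate the speed-versus-transparency trade-off discussed in this section, the sketch below contrasts the two access styles in C. The "remote" SCI memory is simulated by a local buffer, and the function names and window layout are invented for the purpose of the example; they do not correspond to the NodeChip, the CMC or any Dolphin software interface.

/* Two ways of moving data to a remote SCI buffer.  The remote memory is
   simulated by a local array; on real hardware the "transparent" version
   would let ordinary stores (and the CPU cache) generate SCI packets,
   while the "block" version mimics a DMA engine issuing pipelined
   64-byte SCI moves.                                                     */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WINDOW_SIZE 4096
static uint8_t remote_memory[WINDOW_SIZE];  /* stand-in for a front-end buffer */

/* transparent, shared-memory style: simple for software,
   well suited to scattered accesses                                      */
static void write_transparent(size_t off, const void *src, size_t len)
{
    memcpy(&remote_memory[off], src, len);
}

/* I/O style: explicit block move in 64-byte chunks,
   best for large, data-driven transfers                                  */
static void write_block(size_t off, const void *src, size_t len)
{
    for (size_t done = 0; done < len; done += 64) {
        size_t chunk = (len - done < 64) ? len - done : 64;
        memcpy(&remote_memory[off + done], (const uint8_t *)src + done, chunk);
    }
}

int main(void)
{
    const char frag[] = "event fragment";
    write_transparent(0, frag, sizeof frag);
    write_block(64, frag, sizeof frag);
    printf("%s / %s\n", (char *)remote_memory, (char *)(remote_memory + 64));
    return 0;
}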
6.2 SCI bridges (Dolphin)

Bridges can be constructed from two SCI node chips connected back to back. These may be of different speed, to connect a CMOS ringlet (e.g. an optical fibre) to a GaAs ringlet. This requires minimal additional routing capabilities. Dolphin will provide minimal routing capabilities in the CMOS chip and build simple SCI bridges. A port of an SCI bridge or switch connects to a ring. This allows for diverse topologies; for example, a 4-switch could be constructed from 4 SCI bridges. A large SCI system constructed from these bridges (figure 21) would still scale.

Figure 21: 4 Bridges used as a 4-switch
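As an illustration of the "minimal routing capability" such a bridge needs, the C sketch below forwards a packet to the other side whenever the target node ID does not belong to the local ringlet. The contiguous node-ID range per ringlet and the structure layout are simplifications invented for this example; the actual NodeChip routing mechanism may differ.

/* Minimal routing decision of a simple SCI bridge built from two node
   chips back to back: keep a packet on the ring it arrived on if the
   target node is local, otherwise pass it through to the other side.    */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct ringlet {
    uint16_t first_node;     /* lowest node ID reachable on this ring    */
    uint16_t last_node;      /* highest node ID reachable on this ring   */
};

struct bridge {
    struct ringlet side[2];  /* the two back-to-back node chips          */
};

/* returns the side (0 or 1) on which the packet should continue */
static int route(const struct bridge *b, int arrived_on, uint16_t target_id)
{
    const struct ringlet *local = &b->side[arrived_on];
    bool is_local = (target_id >= local->first_node &&
                     target_id <= local->last_node);
    return is_local ? arrived_on : 1 - arrived_on;
}

int main(void)
{
    /* one bridge of a 4-switch: ringlet 0 holds nodes 0-15,
       everything else is reached through side 1                          */
    struct bridge b = { { { 0, 15 }, { 16, 65535 } } };
    printf("target 7  -> side %d\n", route(&b, 0, 7));
    printf("target 42 -> side %d\n", route(&b, 0, 42));
    return 0;
}

A 4-switch as in figure 21 is then simply four such bridges whose second sides are joined by an internal ringlet.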

6.3 Switches (Thomson, Dolphin, CERN, University of Oslo)

A true SCI switch, providing full bandwidth on all ports instead of sharing a single ring, will be superior and more economical. The design and production of such chips demands a serious involvement of several partners. RD24 has been invited to become an associate partner in a Eureka project submitted by Thomson and Dolphin. This should lead to switch production in three years. RD24's role is to participate in the switch specifications. The participation from the University of Oslo has also been boosted, to bring in the results of academic research in the areas of switches and MPP topologies. CERN, together with the Physics Department of the University of Oslo, will concentrate on an event builder and simulations.

Chapter 7. Cache Coherency

Coherent caching is an option of the SCI standard which is already incorporated in the NodeChip. RD24 has brought together a number of ingredients which give us confidence that we can undertake such investigations. These are: expertise, industrial support and a suitable demonstration project.

7.1 Expertise (University of Oslo)

The Department of Informatics of the University of Oslo, which has made major contributions to the SCI cache coherency algorithms, will join RD24. Dolphin have implemented the "typical set" of coherency protocols in the NodeChip. The University of Oslo have developed software which emulates a coherent memory and cache controller using the VME-SCI bridge (this software has been used to generate SCI test patterns to debug the NodeChip). The IEEE C-code which defines the coherency protocols has been incorporated in the CERN simulation program. We are thus confident that we have the necessary knowledge in the collaboration.

7.2 Industrial Support

The VME/SCI bridge from Dolphin can be extended with SCI cache and memory. This will allow existing VME-based CPUs and software to evaluate the benefits of cache coherency. We expect the support from SCI to evolve.
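The coherency protocols referred to above are built around a distributed sharing list: memory holds only a pointer to the head of the list, and every caching node holds forward and backward links to its neighbours. The C sketch below shows this bookkeeping in a much simplified form. The state names are those of the IEEE 1596 typical set, but almost all states, transactions and error cases are omitted, and the prepend shown here leaves out the exchange that updates the old head; it should be read as an illustration, not as the IEEE C-code itself.

/* Simplified bookkeeping behind SCI cache coherency: a distributed,
   doubly linked sharing list whose head pointer lives at the memory
   controller.  Directory storage therefore grows with the number of
   sharers, not with (lines x nodes).                                    */
#include <stdint.h>

#define NO_NODE 0xFFFFu

/* per-line state at the memory controller */
struct mem_tag {
    uint16_t head;           /* node ID of the head of the sharing list  */
    uint8_t  state;          /* e.g. HOME (no sharers), FRESH, GONE      */
};

/* per-line state in each cache */
struct cache_tag {
    uint16_t forw_id;        /* next sharer towards the tail             */
    uint16_t back_id;        /* previous sharer; NO_NODE at the head     */
    uint8_t  state;          /* e.g. ONLY_FRESH, HEAD_DIRTY, MID_VALID   */
};

/* A new reader prepends itself: it obtains the old head from memory,
   becomes the new head and links forward to the old head.  (The message
   that makes the old head update its back pointer and leave its HEAD_*
   state is omitted here.)                                                */
static void attach_reader(struct mem_tag *m, struct cache_tag *mine,
                          uint16_t my_id)
{
    mine->forw_id = m->head;
    mine->back_id = NO_NODE;
    m->head       = my_id;
}

int main(void)
{
    struct mem_tag line = { NO_NODE, 0 };        /* no sharers yet       */
    struct cache_tag c5 = { 0, 0, 0 }, c9 = { 0, 0, 0 };
    attach_reader(&line, &c5, 5);                /* node 5 becomes head  */
    attach_reader(&line, &c9, 9);                /* node 9 prepends      */
    return 0;
}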

7.3 Shared memory over long fibers for control applications at CERN

The CO group of the SL division intends to buy two SCI/VME bridges and a pair of transmit/receive boards supporting transmission over optical fibers. The interface logic to connect the interface signals of the boards, involving the change of signal levels and skew adaptation, would be designed and implemented, if necessary, in the SL division. The resulting ringlet would be used for the evaluation of SCI hardware and software in the context of the control of a large accelerator. Though not requiring the highest bandwidth of SCI, the accompanying functionality, namely network-wide interprocess communication and cache coherency of the shared-memory model, is of great interest.

7.4 Coherent Interfacing to Workstations and MPPs

Coherent interfacing to the CPU, cache and memory system of workstations and MPPs is beyond the capabilities of RD24 and requires serious involvement from industry. A high-speed, standard SCI interface to such computers would provide the necessary I/O bandwidth to serve as an interface to the experimental data. Memory-mapped I/O and local caching would be beneficial to third-level trigger applications. Ancillary control, monitoring and visualization software could profit from cache coherency. We invite collaboration with the computer industry in these areas.

Chapter 8. Collaborations with R&D projects

8.1 LASCALA

SCI links will be used within DEC to communicate fast and with low latency between the nodes that constitute a multiprocessor system. A fast interconnect will enable the boards in a multiprocessor environment to interchange results at a rate which approaches the CPU cycle. A project of this kind is being set up with academic and industrial partners to create the first instantiation of this idea, based on the AXP DECchip 21064 and (probably) the future CMOS SCI node chip. The Chorus technology will be ported to this implementation to provide both distributed, real-time features and a Unix environment through Chorus-MiX. The Rutherford Laboratories are involved in the hardware interfacing and, together with RD11, in the application software. The partners work together in the LASCALA Esprit proposal.

8.2 Other R&D Projects

We collaborate with RD11 and RD13 in the area of simulation tools. RD11 has carried out simulations of the Global Second Level Trigger with SCILab. We collaborate with RD11 in the area of interfacing to front-end readout hardware. We are discussing possible SCI applications with RD31 (TPC readout) [30] and NESTOR (deep undersea neutrino experiment) [31].

Chapter 9. Milestones and Responsibilities

Unless stated otherwise, both hardware and software are included in the list of items.


9.1 DAQ Components

SCI Bridge                           Dolphin (hw) + CERN (sw)           6/93 - 6/94
Fastbus/SCI Bridge                   Univ. of Oslo                      6/93 - 6/94
RIO Data Mover                       CERN +                             6/93 - 6/94
68040 Interface                      CERN + CES + Apple                 6/93 - 6/94
Turbo Channel Interface              CERN (hw) + DEC JP (sw)            6/93 - 1/94
Futurebus+/SCI Bridge                Rome + Lecce                       6/93 - 6/94
DSP Interface                        IFIC Valencia                      6/93 - 6/94
Quadra Macintosh Interface           Apple ATG group                    6/93 - 6/94
SBUS Interface                       Dolphin

9.2 Accelerator Control

Long distance Optical Fibre          CERN (SL) + Dolphin                6/93 - 1/94
SCI Bridge (Controls)¹               Dolphin (hw) + CERN (sw)           6/93 - 6/94
Cache Coherency                      Dolphin (hw) + CERN (sw)

9.3 Cache Coherency

Simulations (SCILab)                 CERN + Univ. of Oslo               6/93 - 6/94
Emulations (VME/SCI Bridge)          Univ. of Oslo + Dolphin            6/93 - 6/94
CMC (Cache & Memory Contr.)³         Dolphin                            1/94 - 6/94
Advanced Coherent Interfaces         invitation to Computer Industry

9.4 System Integration

A multiple ringlet system built from the components mentioned above.

High Speed Ringlet (1 GByte/s)       all                                6/93 - 6/94
Low Speed Ringlet (100 MByte/s)      all                                6/93 - 6/94
SCI/SCI Bridge                       Dolphin                            1/94 - 6/94
Serial Transmission (Fibre/Coax)     CERN + Dolphin                     6/93 - 6/94
Diagnostics                          Univ. of Oslo (sw) + Dolphin (hw)  6/93 - 6/94
Shared Memory                                                           6/93 - 6/94
LHC DAQ Architectures (Simulation)¹  CERN + Univ. of Oslo               6/93 - 6/94

1. Improvement, work already started
2. To be confirmed (requires CMC)
3. To be confirmed


Acknowledgments

We thank Michel Ferrat (ECP/EDA), who has helped us out on numerous occasions with the design of PC boards at very short notice. Francisco Lozano-Alemany, technical student (ECP/DS) from the Universidad Politécnica de Madrid (Spain), has written the firmware for the RIO. D. Gustavson (SLAC) and D. James (Apple) - chairman and vice-chairman of the IEEE SCI standardization committee - made available their enormous expertise on SCI. A. Perrelle (ECP/DS) has been the interface between RD24 and the SCI community. O. Barbalat (DI) has streamlined the relations with our multiple industrial collaborators. A. Wiedermann (DI) and A. Unnervik (FI) have worked out the legal and commercial framework of often complex triangular agreements. Finally, we thank the management of the ECP Division for solving manpower and financial problems.


The financial provisions for phase 2 (1993-1994) are as follows:

                        kCHF
CERN DRDC                 85
CERN SL Div.              50
University of Oslo
IFIC Valencia             35
INFN Rome                 47
INFN Lecce                30
Apple                     30
Thomson                   15
Total                    332


List of References

[1] Applications of the Scalable Coherent Interface to Data Acquisition in LHC, CERN/DRDC/91-45
[2] DST501A GaAs NodeChip Functional Specification V 1.3, August 1992, Dolphin SCI Technology, P.O. Box 52, Bogerud, N-0621 Oslo, Norway
[3] DST501A GaAs NodeChip Electrical Specification V 0.3, January 1993, Dolphin SCI Technology, P.O. Box 52, Bogerud, N-0621 Oslo, Norway
[4] Letter of Intent, Chapter 6 on Data Acquisition, CMS, CERN/LHCC 92-3
[5] Letter of Intent, Chapter 5, Trigger and Data Acquisition, ATLAS, CERN/LHCC/92-4
[6] SCI to TURBOchannel interface, J. Buytaert, A. Bogaerts, R. Divia, A. Guglielmi, H. Muller, OPEN Bus Systems '92, Oct. 92, Zurich
[7] Scalable Coherent Interface and LHC: a good marriage?, R. Divia, A. Bogaerts, J. Buytaert, H. Muller, C. Parkman, P. Ponting, D. Samyn, CHEP 92, Sept. 92, Annecy
[8] A preview of the Scalable Coherent Interface standard (SCI) for very high rate Data Acquisition, H. Muller, ECP Seminar, 19 April, CERN (slides only)
[9] HDMP-1000 Tx/Rx Pair, Technical Data Sheet, Hewlett Packard
[10] IEEE Std 1596, SCI Logical, Physical and Cache Coherence Specifications, Draft 2.0, November 18, 1991 (draft available on public server sunsci.cern.ch)
[11] IEEE P1596.5, Shared-Data Formats Optimized for Scalable Coherent Interface Processors, Draft 0.95, March 21, 1992 (on anonymous ftp server sunsci.cern.ch)
[12] IEEE Std 1212, CSR Architecture (draft on anonymous ftp server sunsci.cern.ch)
[13] Simulation of SCI Protocols in MODSIM, A. Bogaerts, J-F. Renardy, D. Samyn, Workshop on Data Acquisition and Trigger Simulations for High Energy Physics, April '92, SSC, Dallas
[14] IEEE P1596.1, SCI to VME Bridge Architecture, March 1992 (on anonymous ftp server sunsci.cern.ch)
[15] SCI-VME bridge Functional Specification, advance information, please contact RD24
[16] Cbus specification V 2.0, August 1992, Dolphin SCI Technology, P.O. Box 52, Bogerud, N-0621 Oslo, Norway (on anonymous ftp server sunsci.cern.ch)


[17] SCI Exerciser and Tester (SET) User Manual, H. Kohmann, University of Oslo, Physics Dept.
[18] New buses and links for Data Acquisition, H. Muller, A. Bogaerts, D. Linnhofer, R. McLaren, C. Parkman, Nucl. Instr. and Meth. in Physics Research A315 (1992) 478-482
[19] RIO 8260 RISC I/O processor User manual, Creative Electronic Systems S.A., CH-1213 Geneva
[20] FIC 8234 Twin 68040 processor User manual, Creative Electronic Systems S.A., CH-1213 Geneva
[21] Model PT-SBS915, SBus to VME adapter, User's manual, Performance Technologies, Inc., Rochester, New York, USA
[22] Dolphin SCI Technology, P.O. Box 52, Bogerud, N-0621 Oslo, Norway
[23] The CHI, a new Fastbus Interface and Processor, H. Muller et al., IEEE Trans. on Nucl. Sci., Vol. 37, No. 2, April 1990
[24] SCI interface for Radstone Technology's 68040 based 68-42 boards, B. Solberg, CERN/Dolphin, August 1992
[25] SCI DMA option for the 68040 bus, Report (draft), A. Ivanov, CERN/IHEP Protvino
[26] The SCI Tracer, B. Wu, B. Skaali, I. Birkeli, Open Bus Systems 1992, October 1992, Zurich
[27] SCI Tracer Specification, Preliminary version 2.0, I. Birkeli, Dolphin SCI Technology, P.O. Box 52, Bogerud, N-0621 Oslo, Norway
[28] SCI Tracer Software, B. Wu, Department of Physics, University of Oslo, August 1992
[29] IDT PROM Monitor, Integrated Device Technology, Inc., Santa Clara, California (USA)
[30] Development of a Time Projection Chamber with high two track resolution capability for experiments at Heavy Ion Colliders, CERN DRDC RD31, May 1992
[31] NESTOR (NEutrinos from Supernovae and TeV sources Ocean Ranch), Athens, Rome, Moscow, Kiev, Hawaii, Wisconsin, SCRIPPS
[32] Press release, 7 December 1992, LSI Logic Corporation, Milpitas, California 95035, USA
