Editorial The LJ(�ita/ Techllical.fourna/is a rete reed The f(JIIowing arc trademarks of Digital Jane C. Blake, Matugin�; Editor journ.tl published quarrerlv bv Digital Equipment Corporation: AlphaScn·er, Park, Kathleen ,vi. Stetson, Editor Equipment Corporation, 50 Nagog AJ phaStation, Cl. DC:Cnct, DECNfS, Helen L. Patterson, Editor AK02-3/IB, Acton, MA 01720-9843. DIGITAL, the DIGITAL logo, DIGITAl. Hard -cop'' subscriptions can be ordered bv UNIX, and PowcrSrorm. Circulation n sendi g a check in U.S. funds (made pavabk fkum.n is ,1 rr.Hkm.1rk ot"Son�· Corpor.nion. Catherine M. Phillips, Manager to Digit;11 Equipment Corporation) to the Kristine M. Lowe, Administrator publishcd-bv address. General subscription CRAY is .1 regrstcrcd tro1eicmark of Crow tJtcs are $40.00 (non- U.S. $60) for four i>'ucs Research, lnc. and $75.00 (non· U.S. S liS) for eight issues. Production Dircct3D is a rradcm;ukof31)bbs, lnc.l.td.

Inquiries, address changes, and compli­ mentary subscription orders can be sent to rhc Digital Technical.fournalat the published-by address or the electronic mail address, [email protected]. fnquiries can also be made by calling the Journal office at 978-264-7556.

Comments on the content of anv paper and requests ro (Onracr Juthors .1re welcomed .md mav be sentto the managing editor at the published-bv or electronic mail address.

Copyright 1998 Digital Equip ment Corpor;uion. Cop,·ing without tee is per­ mined provided tlur such copies .tre made for use in educational instimtions b,· facult\· members and arc nor distributed for com­ mercial ad,·antage. Abstracting with credit of Digital Equipment Corporation's author­ ship is permitted.

The information in the.fourna/is subject ro ch;�ngc without notice and should not be construed as a commitment bv Digital Equipment Corporation or by the corn pan· Cover Design ies herein represented. Digital Equipment Recent advances in physical technology Corporation assumes no responsibility for have signiticanrly impmved the capabilities anv errors that rnav appear in the Journal. of parallel SCSI in three parameters: speed , fSSN 0898-901X interconnect density, and contiguration expansion (device count, length, topology, Document;ttion Number EC-P8826-20 control). On the cover, the graphs reprc· Book production was done b)' Quantic sent three systems with different sets of Communications, Inc. parameter values, rhar is, three unique points in the spced-sizc-contiguration space. Hot plugging of devices and segments and dynamic speed changes can decrease or increase the p;1tamercr values without system disruption. The opening paper in this issue ckscribes these advances in paral lel SCSI tec hnology.

The cover design is by Lucind;t O'Neill of the DIGITAL Industrial and Graphic Design (;roup. The ed itors ti1Jnk .1uthor· Bill H,,m for his help in developing the cover concept. Contents

Foreword Richard Lary 3

Recent Advances in Basic Physical Technology for William E. Ham 6 Parallel SCSI: UltraSCSI, Expanders, Interconnect, and Hot Plugging

Development of Router Clusters to Provide Fast Failover Peter L. Higginson and Michael C. Shand 32 in IP Networks

Shared Desktop: A Collaborative Tool for Sharing 3-D Lawrence G. Palmer and Ricky S. Palmer 42 Applications among Different Window Systems

Challenges in Designing an HPF Debugger David C. P. LaFrance-Linden 50

Digital Technical Journal Vol. 9 No.3 1997 Editor's Introduction

This issue of tile Digital Technical ficaiJy on net\vorks requiring high named Aardvark, tJ1at "reconstructs" journal presents papers on a range availability, such as telecommunica­ for the HPF programmer a single of computing subjects, beginning tion and stock exchange net\vorks. source-level view, even though the with recent advances in storage tech­ The authors analyze failure cases and program has several flows ofconu-ol nologies, foUowed by net\vork routcr presentthe solutions that reduced and the data are distributed. David cluster enhancements, new desktop service-loss rime from approximately LaFrance- Linden discusses the chal­ software for sharing 3-D applications 30 to 45 seconds to 5 seconds in both lenges faced in creating the debugger across platforms, and an experimental LAN and WA.l'\1 environments. and describes useful techniques and High Performance Fortran debugger. Collaboration software for desk­ concepts, such as logical entities, that DIGITAL's storage engineers have top systems can be broadly defined can be generally applied to debugger been leaders in the definition of the to encompass a range of capabilities, design. parallel small system inter­ fi:om a simple transfer of data between Readers interested in past issues face (SCSI) ANSI standards and in users, such as e-mail sent over a net­ of the]ouma/are invited to visit the

related technology improvements. work, to real-time sharing ofrext, jou.rna!Web site at http:/ /vvww. Bill Ham's paper focuses on four graphics, and audio and video data. digital.com/info/dtj/. Our next advances in the physical features of Larry and Ricky Palmer have designed issue will address such topics as opti­ SCSI that resulted in major increases a sofhvare product, called Shared mization of NT executables on Alpha, in SCSI capabilities and minor distur­ Desktop, for users who want to share a new graphics program, and VLM. bances when incorporated in existing three-dimensional graphics applica­ A Special Issue on programming lan­ installations. The discussion spans tions and audio across net\vorks. guages and tools is being developed developments from SCSI-2 through Notably, the design differentiates for publication in the fall ofl998. UltraSCSI, including speed increases itself by supporting mulriple operat­ in the synchronous data phase; longer, ing systems, currentJy enabling real­ more complex configurations enabled time interopcration among vVindows by bus expanders; physical versatility and UNfX systems. The authors dis­ inherent in a decreased size ofthe cuss the decision to create a "view­ interconnect; and dynamic removal port;' which is a part of the desktop Jane C. Blake and replacement of devices on an screen, and issues they addressed dur­ Managing Editor active bus (hot plugging). ing implementation, including proto­ The subject of our next paper is col splitting, screen capture and data networks, and the emphasis of the handling, and dissimilar fi·an1e bufters. engineering is on customer require­ They conclude with ideas for possible ments for reliability and availability. enJ1ancemcnt of the product in the Router clusters, described here by future.

Peter Higginson and Mike Shand, In a previous issue of tl1ejou mal were developed to provide tast fail­ featuring technical computing topics over response in lP net\vorks and are (vol. 7 no. 3), Jonathan Harris et al. defined as a group of routers on the described DIGJTAL's Fortran 90 same local area net\vork (LAN) pro­ compiler that implements High viding mutual . New router Performance Fortran version 1.1, cluster protocols and mechanisms a language tor writing parallel pro­ restrict the loss of service that results grams. An outgrowth of that work from a failure on the net\vork, speci- is an experimental debugger, code-

2 Digital Technical Journal Vol. 9 No.3 1997 Foreword

Welcome to the winter 1997-98 issue neering proptietary storage systems of the Digital Technicaljoumal. This to engineering open storage systems. issue does not have a single theme; The SCSI bus was developed dur­ it contains a potpourri of papers on ing the early 1980s as one of many a wide range of technical topics. This attempts to standarcLze the provides the toreword writer with a to storage devices. It succeeded beyond small gift and a not-so-smaiJ headache. me expectations of its developers,

The gift is the opportunity to tout largely because it supported a device the continuing fecundity of DIGITAL's model tJut was abstract enough to be engineering community. All the extensible but inexpensive enough to Richard Lary be implemented in the technology of DIGITAL Storage Technical Director papers in this issue of the journal come from product development tJ1e time. For all its advantages, how­ groups in DIGITAL, and allthe tech­ ever, SCSI suffered from poor engi­ nology described herein is directly neering at the physical level. This was applicable to the problems of using a direct result of the way it was devel­ in the real world. The oped. The diverse corporate repre­ papers themselves cover a wide range sentatives that defined SCSI did not of topics: designing storage buses have the time or money to specif)' and and their infrastructure; buiJding tP build cuswm bus infi·astructure com­ routers that reduce network delays ponents (transceivers, cables, termi­ caused by link or router failure; sharing nators, etc.), so they used commonJy 3-D graphical and audio data across available parts. A lack of sophistica­ networks of computers with different tion in specifYing physicaJ interface windowing systems; and debugging parameters resulted in a specification programs written in languages tJ1at tJ1at aJiowed too much component incorporate data parallelism. variation. As a result, it was difficult The headache, of course, stems to build reliable, multi-box systems from this very diversity. Any attempt using SCSI. to derive some set of common under­ DIGITAL's attitude towards SCSI lying principles other than "make during this period was to ignore it better stuff" from this collection is and hope it would go away. We had doomed to sophistry. And my techni­ designed our own proprietary DigitaJ cal background is too narrow to pro­ Storage Architecture ( DSA), which

vide any significant embellishment to utilized an abstract and extensible any of the papers outside the domain device model and also incorporated of storage systems. So, with apologies m

Digital Technical Journal Vol. 9 No.3 1997 3 t!·om rhc proprier:trv sror:tge architec­ Our position ll'as untenable. vVe had shon-rcrm conllgurarion guidelines tures oforher brge SI'Stcms compa­ to ch:�.ngc our strategv and embrace the tor our packaging �1rchitecture and

nies, our customers wnc hap lw, �111d bus rlut ,,.c had so srudiousil' ignored. long-term technical proposJis tor rhe

the storage business was prollt:tble. We designed a modular packaging SCSI plll'sic:ll bus architecture. Bill We were reeling quire pleased with :�rchirecture tor SCSI de�·ices (known Bam has been the head ofSBTO ourselw -and we were prot{JLmdlv wmmerciallv as StorageWorks) �md since irs inception :md has also been

ignorant ofrhe power of:t successful :1 ser of storage arrav conu·ollcrs thar our rcprcscnt:trivc to rhc SCSI corn­ open marker sr:mchrd, since one had inrcrbccd these de1'ices to our svs­ mince on :1ll matters relating ro the

never existed in the sror:tge world. tcms (and systems t!·om orher major physical bus interconnect. During the l�mer lulfofthe 1980s, vendors as well). We :�lso became In the summer of l993, Bill com­ SCSI grew steadily in popubriry until active participants in rheSCSI stan­ pleted �1 study ofrhc signJ.I integrity ir domin:trcd rhc workst:trion and (brds process. Where DIC!TAL h�1d issues surrounding par:tllcl SCSI. His small-server nurkcrs. These svstcms prn·iouslv sent one or rwo engineers conclusions were startling. The SCSI

had at most�� kw disk drives on to SCSI standards meetings snicrly to standards commirrcc lud, over the

them, andSCSI's signal integrity gather inh:.>rnution, we starred ro send \'ears, made enough improvements in

problems were mJ.nage:�ble in th:tr up to halfa-dozcn engineers to listen, the b�1sic rr�1nsrnission line char:tcreris­ context. Thev II'CIT nor m;1nageabk learn, p:1rticipare in debare, help rics ofrhc SCSI bus rhar mosr of rhe in rhc larger and more demanding 11·ith the grunt II'Ork of rhe st�lJl(h·ds remaining signal integritY problems data center SI'Stems, and so SCSI 11·as process, and make propos:ds ro amend ,,.c,-c due rorhc 1·�1riarions in compo­ nor used there. The SCSI standards or extend the st:mdard in direcrions IKnt parameters J.llm\'\:d b1· rhc SCSI

group ,,·:-�s ;111·are ofrhe bus's deficien­ usdi!l ro us :md our customers. specillurion. Exercising righter con­ cies, holl'cvcr, :tnd as the decade pro­ Our nell' modular pacbging trol m-er componcnr l':lriarion-­

gressed, the group made amendments design alloll'cd our customers ro rhmugh building selected compo­

ro the srand;Hd ro elimin�ue man1· of i nsr:1ll and remo1·e storage de1·iccs nents or through purch:tsc specillc:t­ them. Bv the turn ofrhe decade, sc1 - rhcmsch·es and ro mig1·�ue sror�1gc rions ,,·irh our supplicrs-ll'ould nor

er::d independent subsystem n:ndors dc1·ices ben1·een S\'Stems, e,·en onh· produce excellent signal integrin· were selling subsystems utilizing SC:Sl bct\\'Ccn s1·srems built bv different in our packaging bur II'Ould all011' the de1·ices ;1S storage t(lr large J)!C!TAL wstcm 1·endors. This modularity maximum dock r:�rc of rhc bus ro be svsrcms. These subsystems did not, pro1·ed to be a ,·en· l'aluable te:trure doubled ll'hile m�1int:�ining excellent

in general, h�11·e the te:Jtures, pert{Jr- ro our customers. Ho11·e,,er, ir signal inrcgrit\' and b;1Ckll'ards com­

11Jancc, or robustness of our su bws­ required us to build a plll'sic.ll infra­ p�uibili n· with existing SCSI de�·iccs. rerns, burthey were signillc:�nrlv structure for theSCSI bus rh:tt had Bill's results also indiutcd rhar the

cheaper and improving �11! the rime. rbe robustness needed Lw our Lnge maximum clock r:1te could be incrc:1scd

By 1991, ir had become obvious ro svsrcms and that could accommodate CI'Cil further, wirh more 1\'0rk, us rh:tr we would nor be able to com­ a great de�11 oh-ariabilirv in contigur:�­ Tbis discovcrv c�1n1e ar a criric1l perc with these systems in rhe long rion, and to use a bus rhar was known rime in rhc evolution of the SCSI

run. They were leveraging an entire ro have residual sign�11 inregriry prob­ sr�md:t rd. 1\!luch of rhcSCSI stambrd indusrrv's investment and talent :md lems in its physical interconnect. vVc com mince's cH(Jrt in the c:-trly p�1rt of were reaping rhe cost hcndirs of ll'ere undcrstandablv ll'orricd about the 1990s was being spent in modif\1-

high-volume manubcnn·ing; whereas this, worried enough to charter :t ing rhc SCSI sr:�ndard so rhar serial we had to design and nunuhcrurc (�1r small group of engineers as �l SCSI buses could cu-rv rhe higher lcvd rclarivclv low volume) cverv component Bus Technical Office (SBTO) within SCSI bus protocols. The committee of even· DSA svsrcm oursell'l:s. tht: sror:tge group, and to dt:1·elop had sr:�ned this ll'ork under rhe

4 Vol. 9 No.3 1997 assumption rhar parallel SCSI was maximum clock rate if the components mittee debut, UltraSCSI is solidly "our of gas" in pertormance, and in the physical interconnect met the entrenched in the storage market. In the new serial bus variants would tighter specifications. Our motive in fact, storage market anaJysrs are now supplant it by mid-decade. However, doing this was purely selfish: we were projecting that the combined volume by 1993 not only was the definition not ready to choose among the serial of devices on all serial SCSI buses and implementation of the serial bus bus proposals, yet we would soon (yes, there are still three, but the going slower than expected, but there need more performance than parallel market has already picked one, Fibre were rlu·ee independent and incom­ SCSI could offer. A higher perfor­ Channel, as the winner) will nor patible serial bus proposals, each with mance parallel SCSI would allow us exceed parallel SCSI device volumes unique useful features and unique to improve our storage subsystem until early in the next century. And drawbacks, each with a cadre of sup­ performance without having to stake the SCSI committee has finished porters among the industry represen­ our fortunes on a potential Betamax . extending the paraJlel SCSI specifi­ tatives. The marker would ultimately Bill's presentation at tbe SCSI cation to achieve a second doubling choose which serial buses would committee meeting was met with of maximum bus clock and is in the thrive; bur it was highly unlikely that enthusiastic approvaJ. It turned out­ midst of defining a third doubling. all three would thrive. Storage ven­ surprise!-that other system vendors Without hyperbole it can be said dors that made the wrong bus choice were tee ling as uneasy as we were that the technology embodied in Bill would suffer for it. Most gaJJing to about the serial SCSI buses. The pro­ Ham's paper has directly affected the the technophiles among us, the mar­ posal, christened UltraSCSI, was course of the computer storage indus­ ket's choice could not be predicted adopted as an extension to the parallel try, and it continues to affect posi­ from the technical merits of the con­ SCSI standard. Bill Ham and the tively DIGITAL's position in that tenders. I fit could, we'd all have SBTO then worked with component industry. Enjoy reading the paper Betamax VCRs in our homes today. vendors and the SCSI committee to and those that follow it in this issue. So, DIGITAL decided to have Bill develop the thinner cables, smaller present his results to the SCSI com­ connectors, and SCSI expander cir­ mittee at its November 1993 meeting cuits described in his paper, all with and recommend that the committee the aim of keeping parallel SCSI as a extend the SCSI specification to allow desirable alternative to the serial SCSI the bus to run at up to twice its old buses. Today, tour years after its com-

Digiral Technical Journal Vol. 9 No.3 1997 5 I "VtiJiam E. Ham

Recent Advances in Basic Physical Technology for Parallel SCSI: UltraSCSI, Expanders, Interconnect, and Hot Plugging

DIGITAL uses SCSI technology in most of its Introduction storage products and consequently has led Parallel SmaJJ Computer System Interface (SCSI) is the major standards and industry bodies to improve workhorse technology tor most of the storage applica­ the technology in the following areas: increased tions in DIGITAL products today. This device and synchronous data phase speed beyond fast SCSI; interconnect technology spans all system offerings longer, more complex electrical configurations fi-om the simplest to the most complex. SCSI was intro­ by means of expander circuits; versatile and duced to th e higher-end products in the early 1990s as more manageable connectivity th rough a the open systems fo llow-on to the DIGITAL propri­ etary Digital Storage System Interconnect ( DSSl) and smaller, improved physical interconnect; and Computer Interconnect (CI) technologies. dynamic device insertion and removal. Data As system demands have increased, SCSI has evolved phase transmission rate extension is ac hieved to meet the needs. DIGITAL has made considerable through understanding and controlling silicon contributions to the technology and led the effort to chip timing and transmission media parameters. achieve industry standardization. This paper details the most significant developments in the physical fe atures Using expander devices to confine transmission of parallel SCSI technology over the last several years line effects to shorter segments allows large th:1.thave allowed it to continue to serve DIGITAL cus­ increases in the maximum distance between tomers in an eHective, competitive way. The discussion devices and in the device population within the targets the following fo ur important areas: same SCSI domain. Expanders enable complex, l. Speed increases in the synchronous data phase, hublike configurations to be created without which resulted in the ANSI definition ofUltraSCSI changing existing SCSI devices or software. (Fast-20 SCSI) technology! The use of 0.8-millimeter connector technology 2. Development of software-invisible circuits, gener­ and consideration of cable losses has reduced ally called expanders, that enable segmentation of the physical size of the external shielded inter­ SCSI domains into easily managed pieces connect by approximately two thirds, decreased 3. New connector and cable technology, namely the Ve ry High Density Cabled Interconnect (VHDCJ) the number of parts required to support com­ device, that decreases the interconnect size and plex configurations by a factor of 10, and complexity by many told2 increased the interconnect density to the same 4. Dynamic removal and replacement of devices on an level used in serial SCSI. Finally, the mating and :.�ctive bus, which is referred to as hot plugging demating events that occur during device inser­ DIGITAL made su bstantial contributions in the tion and removal produce a spectrum of small, fo ur areas. This work included creating the expander undetectable, electrical distu rbances on the and interconnect standards projects; leading the work­ active bus that appear to be limited by the ing groups that defined the Fast-20, expander, and physics of the media and device capacitance. interconnect standards; providing data for the Fast-20 and hot-plugging projects; and proposing and gai ning approval fo r the hot-plugging standard. The author has taken a phenomenological approach throughout, because in most cases there are too many unknowns to achieve a rigorous analytical result. This

6 Digirc1l Technic3l Journal Vol. 9 No.3 1997 paper fo cuses on developments fr om SCSI-2 through repetition rate is 20 megahertz (MHz). Table l con­ UltraSCSI and specifically does not address the new tains the generally accepted terminology related to Low Vo ltJge Difterenrial (LVD) technology being data phase speeds. introduced tor the highest-speed applications. Because of the wired-OR property, each signal in the bus must be driven to a known state even if no Pedigree SCSI device is acmally driving the sign:d. SCSI uses the SCSI is defined in several ANSI srandards13' and in the logical 0 state (negated state ) as the undri\·en state and material that was developed to create these standards.5 " uses d1 e bus terminators to drive the signal to this state The standards were generated over the last decade in the absence of any driving devices. The device signal through a cooperative effort oL:�pproximately 60 major drivers must overcome this terminator-driven logic companies in the computer and computer support state of 0 in order to send a logical l (asserted state ) industry. As a result of this pedigree, d1e prime directive onto the signal line. fo r SCSI technolot,•y is interoperability of devices SCSI signals must support all frequencies, ti·om stat­ designed and manufactured by different companies. ically dtiven by the terminators onlv (DC) to the third The details ofthe physical designs used to implement harmonic of the fa stest signal edge in the synchronous SCSI may not be visible to users and researchers; these data phase. In many cases, the same wire must support details contain much of the marketing and technical all these fr equencies at diffe rent times during the SCSI differentiation between the products of the participat­ protocol. ing companies and are therefore hidden in the silicon The highest signal edge slew rates to r UltraSCSI design. The behavior at the device connector pervades are approximately 500 millivolts per nanosecond the SCSI specifications. The basic assumption is that as (mV/n s ). A 2-volt (V) transition requires approxi­ long as the properties arc compatible at these connec­ mately 4 ns/5.4 ns/meter (m) = 0.74 m tor a signal tors, device substitution is possible. Thus, SCSI devices edge (assuming 5.4 ns/m as the propagation velocity may be both interoperable and of differentdesigns. of the signal edge). Therefore, some relief exists because the connectors and cable assembly termina­ Basic Architecture tions are much smaller than the signal edge length; the This section reviews the basic architecture of parallel connectors and terminations do not need to have care­ SCSI. The SCSI bus is �l parallel, multidrop, wired- OR fu lly controlled properties. configuration. This allows the use of the technology available in the connector and cable assemblv industry to optimize Signal Multiplexing and Phases The parallel signal the interconnect properties without the considerable construction of the bus allows multiplexing of some design, manufacturing, and test burden imposed by signals during diffe rent phases of communication so controlled impedance requirements. that the same signal lines mav have \'Crv diffe rent fu nc­ tions in different phases. The physical behavior of sig­ Transmission Modes The transmission mode or· a nals is usuallv limited bv the phase during which the SCSI bus is determined by the properties of d1e shortest pulses are used and the demands tor signal terminators that, by definition, constitute the ends of integrity are the highest. The limiting SCSI phase is the bus. Terminators also supply most of the energy the data phase (pavload phase) that is executed with required to operate the single-ended transmission-mode the highest synchronous rate. For UltraSCSI, this peak devices and additionally provide the required matching

Table 1 Te rminology for Data Phase Speeds

Maximum Transfer Maximum Byte Maximum Byte Rate (Million Rate (Narrow) Rate (Wide) Data Phase Speed Name transfers/second)' (Mega bytes/ second) (Megabytes/second)

Asynchronous Unspecified Typically - 3 Ty pica lly - 6 Slow (synchronous) 5 s 10 Fast (synchronous) 10 10 20 Ultra (synchronous)2 20 20 40 Ultra2 (synchronous)l 40 40 80 Ultra3 (synchronous)' 80 to 100 80 to 100 160 to 200

'One transfer is 1 byte in narrow mode and 2 bytes in wide mode; 1 byte equals 8 data bits plus 1 parity bit. 2Ultra is synonymous with Ultra 1 and Fast-20. 3Ultra2 is synonymous with Fast-40. 'Rates not yet finalized; U ltra3 is synonymous with Fast-80 or Fast- 1 00.

DigiraJTec hnical Journ�l Vo i. Y);o. 3 lYY7 7 to the characteristic impedance ofrhe tr:msmission line. idcntific1tion (!D) line. After exami n i ng the asserted In d i ffe rential SCSI, the terminators prm·idc �1 small I D Jines to determine \\'hich dc\·ice has the highest ID, portion of the mnJII encrf:,'l' required ro opcrnc the �1 11 but the dn·ice \\'ith the highest ID release the BSY bus; the diftercntiJI drivers suppk the rcm�1indcr of l i ne . This lc1\ CS onh one device, the \\'inner, asserting the energy. the BSY line. While the current in the BSY line is read­ Drivers that want to rr:tnsmit an asserted state j usti ng itsclfri·om a mul ri plc-drivcr asserted condition s must overcome the biasing pro,·ided L) \' th e termina­ to a single-d ri\n asserted condition, noise pulse ( call ed tors. The dri\Trs operate loc1ih· on the bus �md :1 lter '' i red-OR glitches ) propagate throughout the length of the stare in their immediate ,·iciniry \\·hen the\' S\\ itc h the sig11<1i line md J11<1\' be detected coll ccti\·ely JS :tn on <1 nd off. For single-ended SCSI, the 0 stare is erroneous phase. Thcrd()rc, one of the archi tecrural approx i m:1te ly 2.5 V a nd the I stJte is <1 pproximareh· limits f()l' par<1ilcl SCSI is rhc time requi red tor these 0.5 V. For high -,·oltage diftercmial SCSI, the 0 state ,,·ired-OR gl itches to settle. This bus settle ri me is set b,· is approximately -1 V ro -2 V, :t n d rhc 1 state is protocol <1t 400 ns and must be i nterpreted r improving SCSI derive ro r negated States :md eJectrOm<1gnetic CJ1Crf';\' in the ri·om appropri<1tc h' managing the transmission lines, current flo\\'ing th rough the inductance of the trans­ t:tking Jlklll tagc of the multidrop :trchitccture oHcrcd mission l ine tor asserted states . Ultimately, the te rmina­ l)\' o para I lei \\ i red -0R structme, using statc-of-the-<1rt tors will set the state back to negated Jrtc r the drivers tech nology h-om the interconnect and silicon industry, cc1sc to source or sink current; ho\\'cver, this onlv hap ­ �u1d making innovative usc of the time req uired for the pens attcr the round- tri p prop:tgation dcl:l\' ti·om the ,,·ired-OR gli tc hes to settle. These techniques arc the dri\·cr to the brrhcst terminator if the bus docs not b�1sis ofthe de\ clopmcnt lw !)] G lTAJ , in the rour a reb ha\·c matched characteiistic impedance propnrics . �1ddresscd in this paper. Approximately the same energy transt(mnations Speed incrcJscs in the synchronous cb ta phase arc occur tor d iffe rential SCSI, but signiri e1 nr current is b:tsed prim:trih· on incrc1 sing the timing precision s u ppl ied bv the drivers fo r both the asserted and the in the silicon n·anscci\Trs h,· u si ng nc\\'cr sil icon tec h­ ncg::tted states. nolog,·. The inte rconnect properties rem<1in l argch' unc h ;mgcd ti·om those used ror fa st SCSI. Multidrop Requirements The mul ti drop :t rchitccturc Circuits rlut enable segmentation of SCSI domJins � requires a cominuous lo\1'- rcsistance p:t th Clilcd the 111 to easi lv m 1 11aged pieces are based on svstematic bus path bet\\'ecn the terminators and �1 IIm' s dc1·iccs isolation of tLU1smission line properties and usc of to be attac hed to this path. The number <1 nd proper­ \\·i red-OR noise pulse properties . No sort\\'are, inter­

tics of these artJchcd de\·iccs varv. \l'idclv. because of connect, or device ch anges needed to usc these circuits. many factors including the speed of opcr<1 tion , the New connector and c:tblc tcc hnolof:,'l' is based on overall l ength of the bus, and the transmission mode. <1 n inno,·,ni,·c 0 . 8 -mi l l imctcr (mm) rihhon-st\' l c con ­ Att<1c hcd dc\·ices �1 1\,·a,·s disturb the tL1nsmission line I1ccror tcc lmolog\· that optimizes the toLll SCSI clcc­ properties ofrhc bus path; the kc\' to successful opera­ tric11 requirements \\'ith the capabi l i ties of cable and tion is in the management of the magnitude of these connector design. d isturbances. Dvnamic removal and rcpbcemcnt of devices on an General ly, the more capacit�m cc or e lccn·ic1l length :tcti,·e bus, i.e., hot plugging, is based on the mul tidrop the is . ; the dC\·ice has, rnore disrupri,·c it PLlcing- d C\·i ccs architecture, '' hich enables devices to be 1dded or roo close together along the bus path c1n cause them replaced ,,·irhout affecting continuit�· bet\,·cen other to appear elecrric11ly as a single super disruptive device. de,·ices. Hor�) lugging depends on undcrstJ ndi ng and Pbcing them roo r�u· apart c1n rcsLiit in an mcrall bus ll1<1naging the clectric1l disturbances created during length that is too long. the insertion or t'Cll10\'�11. The rc m.1indcr ofthis p.1pcr pro\·idcs details ofthese Wired-OR Glitches During the arbin·,ltion phase, t(>ur arc:ts or' i mproveme nt. The end result of these when the SCSI d evices decide \\'hich devices \\'ill be extensions to the basic phvsical architecture of parallel sending pavload data to or rrom each other, multiple SCSI is �1 m:tjor increase in its capabilities, accom pa ­ dC\·iccs mav assert the same control line ( BSY) at the nied b\· onh' J ,·en· minor disturbance to the installed s�1mc time . E;JCh dc,·ice that \\'isbcs to communicate base, especia l!\· the sofu,·arc . asserts both the BSY line and its respective device

8 Dig:it,ll Te chnical )oum,ll \'ol. \! �"· 3 ]997 Increasing the Synchronous Data Phase Speed single-ended version. Therefore, most of the interest was in making the fast single-ended version work Beginning with the SCSI-2 standard , the synchronous adequately. transmission mode is available for transferring payload Taking si ngle-ended SCSI from asynchronous and data between SCSI devices. The devices select this slow synchronous ( 5 megatranskrs per second ) to the mode by mutual agreement before any synchronous fast synchronous technology was difncult. The prevail­ data is passed . The agreement is achieved by using the ing opinion was that the SPI standard represented asynchronous transmission mode, which is slow but the fi nal improvement to paraJJel SCSI. This view usualh' reliable. set the stage to r a number of alternate physical techno­ The synchronous data phase uses the DATA and logies based on the serial point-to-point transmission PARIIT bit lines fo r the data and either the REQ or schemes used in communications technologies, e.g., the ACI<.control line as a signal that the receiver uses Fiber Distributed Data Interface (FDDI) and , fc>r capturing the data. The term synchronous derives to be used fo r higher-performance storage applications. fi·om a specified timing relationship between the bit DIGITAL's Storage Bus Technical Office had seen line signal edges and the REQ or ACK signal edges. many instances of difncult implementations that were (The ra iling edge of the ACK signal is used when the the result of less-than-opti mal underst:mding and data phase transmission originates fr om the SCSI ini­ management of the specincation margins. No credible ti::Jtor, and the fa lling edge of the REQsignal is used study had been presented on the margi ns available in when the transmission originates from the target.) SCSI, so the thrust was to create baseline characteris­ There is no synchronous relationship between the tics of multid rop parallel SCSI to determine where internal timing references on different SCSI devices, so unused margin might exist. the receiver must bufrer the received data before intro­ Little data was available on tl1e precise reasons why ducing the data into its internal data management specific implementations of fast synchronous SCSI did structure. This buffe ri ng is usually accomplished by not work. The system would hang or report various me

Digiral Technic�! ]ounJJ! Vo l. 9 No. 3 1997 9 to rccci1·c. The transmitting side is c1llcd the nciter, Oscilloscope mc<1Sul-cments provide the basis fo r setting Jnd the recei1·ing side is ullcd the comp:trJtor. comp li a nce spcciticuions , since these measurements Recei1·ed data is committed to the compar:ttor bv can lx pcrt(mned at the connectors. The basic question using one bit line as the l:ttching ACK.signal in a man­ that needed :1 11 ans11 er ll'as, Can parallel SCSI be oper­ ncr exactlv like that specified in svnchronous SCSI ated at cb·atcd speeds with reasonable m argi n to bil­ trJI1smissions. The test environment allm1·s the posi­ ure> D!CITAL opti mized the special test em·ironrncnr tion of the ACK signal to be adjusted ll'ith respect to to Jnswcr this question. Other speciflcations that would the data signa l edges. be necess;1rv to ensure imeroperab le operation between Since the comparator knows the dJta p�1ttcrn that is UltraSCS! devices could be de1ived if it appeared possi­ transmitted , it is possible ro isoLnc the precise data bit ble to achieve the end result. that caused tile transmission error. This kind of error­ The dat

SCSI INTERCONN\ECT SYSTEM

EXCITER BOARD

ACK TIMING REFERENCE

COMPUTER SYSTEM

Figure 1 Spcci

10 Dil',iLll Te chnical )ounLll \'ol. l) l\:o 3 [l)l)7 boards in this case. A key fe ature ofrhis kind of testing Figure 2 shows a typical error r:�te plot fr om a sim­ is that the test does not necessarily stop when an error ple single-ended configuration made from ordinary is detected. In fact, the environment may detect errors SCSI interconnect hardware and transceivers being 100 percent of the time. This acceptable behavior tested at the maximum UltraSCSI rate. Each data Jllows mapping of the complete bit-error response of point represents a 3-second sample (60 million trans­ the svstem. fe rs ) at each ACK position. The ACK position is incre­ mented in 0.1 -ns steps fo r a total of 240 independent Sample Data from the Special Test Environment The tests in the plot. To minimize the testing time, we test environment allows a multitude of tests to be per­ tested only the time ranges fr om -3 to 9 ns and 44 to fo rmed . The test scheme described in this section is 56 ns. The i ndividual data points are not distingu ish­ the one that wJs used to establish the basic timing able in this presentation, and there is \'erv little scatter margins availabk ti·om normal SCSI silicon, cables, between neigh bori ng points. In Figure 2, the error rate connectors, and terminators. of l is used to indicate that no errors were detected , A random repeating data pattern with 16 thousand since the log ofO is not easy to plot. diffe rent bit combinations was used as the basic data Examination of the raw data reveals that the plot is pattern.This pattern was transmitted over a period of monotonic in detected error rate to the fourth decinul ri me, and the number of errors detected was recorded. place. This indicates an extremelv predictable situation In this test, an error is defined as one or more bits in the as fa r as be havior of the same set of hardware is con­ recei\·ed data transkr that do not match the transmitted cerned. That is, there is vi rtually no Gaussian jitter pre­ bit. To acquire a new error rate data point, the transmis­ sent, and a SCSI system could be designed to be quite sion test is repeated by using exactly the same number of reliable and stable at the maximum UltraSCSI rate. transfers in the same time period with the same data pat­ Extending the sample period to 5 minutes made no tern but with some test parameter changed. difference in the position of the key features. Using the VirtuaJ iy any parameter can be varied to r different 3-second sampling ti me, the entire data set could be tests. For a given plwsical configuration, the most use­ acquired automaticallv in approximately 12 minutes. fu l parameter fo r determining the timing margin is the The onset of errors is extremely sharp as the ACK position of the ACK pulse with respect to the data position approaches the critical position. One hundred edges. The basic d::tta then becomes tht number of picoseconds changes the observation from 0 to 864 errors detected and the position of the ACK pulse edge. errors near tht 8-ns position. On the otl1er end, the There are two basic random variables operating in 50.1 -ns time produced 7 errors, and the 50.2-ns time this scheme: the datJ pattern and tbe jitter induced bv produced 425 errors. No errors were detected at anv non-data-dependent sources. It is easy to separate these of the times between 50.1 ns and 7.9 ns. This data two variables by using extremes in the data pattern: shows that there are no strange etkcts that prevent vny tew transitions and the maximum number of tran­ SCSI fr om operating at the maximum UltraSCSI rate . sitions (every data edge bas a transition, i.e., alternating As the ACK position proceeds into the region of l/0 pattern ). Although this level of precision is avaii­ more errors, a condition is finally re:�ched in which all Jble, we will see that we reaJiy do not need to botl1er the transfers have errors. On the one hand, tl1e proba­ fo r parallel SCSI Jt the maximum UltraSCSI rate. bility that one transfer has the same data content as its

10,000,000,000 (1 OlO) (f) II 100,000,000 w lL (108) (f) z <>: 1,000,000 II (106) I- lL 0 10,000 II (104) w m 2 100 ::::l (102) z 1 (10°) -10 0 1 0 20 30 40 50 60 ACK EDGE POSITION (NANOSECONDS) KEY: GOOD TRANSFERS

-- TRANSFERS WITH AT LEAST ONE BIT ERROR

Figure 2 T\·pid Ulrr�SCS! Error Rare Plot

Digiral Technical Journc1l Vo l. 9 No. 3 1997 ll neighbor's is ver\' smaiJ \\'ith rhis random data pattern. The initial errors usually originate fr om the same On the orher hand, since a random data pattern is bit. This bit is rhe one with rhe most unfavorable tim­ being used, there is a reasonable chance that a bir will ing skew with respect to the ACK signal. The cliff is actuallv match thJt rransrnitrcd in one state but nor in not perkctlv sharp because there is a 50 percent the other state. The random dat�l pattern tends to chance that the data transmitted is the same as that spread out the ri m<.: between rhe first error and the last expected even under the error case and, more impor­ good transfer. In rhe limit, ti.>r perfectlv random data, ranth•, because there is some leve l ofjirter present. It is this time is a measure of the total timing imprecision in this jitter that sofTens the clitf Thus, the first errors the system. detected happen when the skew of the weakest bit This imprecision indud<.:s skew in rhe exciter and adds ro rhe tai l of the jitter distribution. Only a kw comparator boards, in the SC:SJ dri\ crs and recci\US, errors arc present because only a small part of the jitter and in rhe cable transmission media (including lo�1ds, if popu Ln ion extends fa r enough to trigger the error. any), and all forms ofjitter. For the test conditions shown SCSI systems wi l l experience virtually no errors in Figure 2, the total difkrcnce is 3.6 ns near rhc 5-ns because ofrbese mechanisms in service if one operates point and 5.4 ns ne�1r the 52-ns point. This shcm s that 1 ns or more a\\'a\' ri-om an error cliff. the skew specifications in rhe SCSI sondard arc C)\'er­ Note th:n these results ti· om the special test en\'iron­ specified as cornp:u·ed to actual hardware pcrrc>rm :m ce. ment almost alwavs yield margins higher than those The data shown in Figure 2 is reprcsentati\'e of a calcubtcd ti-om a set ofinteroperabiliry specifications. large ,·ariety of configurations up to apprm: im<1tel\' 3 This is because the inreroperabiliry specifications must meters long and loaded or up to much longer point­ :- tllow margi n fo r e:

10,000.000,000 ( 1 QIO) (j) a: 100.000,000 w LJ.. ( 1 08) (j) z I <{ 1,000,000 \ a: ( 106) f- \ LJ.. 0 10.000 I I a: (10'') I w CD I ::2; 100 => (102) z

1 (10°) -1 0 0 10 20 30 40 50 60 ACK EDGE POSITION (NANOSECONDS)

KEY: GOOD TRANSFERS

-- TRANSFERS WITH AT LEAST ONE BIT ERROR

Figure 3 Fasr-40 Error R.ue Plot

12 Digiral Tec hnical Journal \'ol. <.) '\o. ,, 1997 Additional Te sts Other tests that are usefulwith the asserted state at 1.3 V, and the asserted period begins special test environment are ground offset etTects, ter­ when the asserted state has been detected. On the nega­ minator power effects, correlation of time domain tion side, the signal must ri se to at least 1.6 V before the plots on the signals with error rate distributions, hot­ receiver can detect a negated state, and a negated state plugging testing (which results in good error detec­ must be detected ifthe input signal reaches 1.9 V. In the tion), and comparison of the impact of diffe rent cables SCSI-2 and SPI standards, any point between 0.8 V and and transceivers. Test results of this nature are not 2.0 V could be used as the timing measurement. included in thispaper because the impact of these vari­ ations depends on many parameters and the results Sample Ultra SCSI Signals may not be generally applicable. Numerous variations on the details of the signals can be produced in UltraSCSI configurations. This section Ti ming Specifica tion Methodology shows two types of signals as representative examples With the increased emphasis on ti mi ng precision fo r that validate UltraSCSI as viable under certain condi­ UltraSCSI technology, it was necessary to introduce tions. The firstcase explores a configuration that actu­ better specifications fo r the measurement of timing ally exceeds the recommended specifications. This is a parameters than those in the SCSI-2 and SPI stan­ complex cabled environment wid1 a cluster ofloads on dards. Figure 4 shows th e precise measurement points one end and some distributed loads on the other end. and fe atures used for the specificationof single-ended The second case shows the signals over a 25-meter, UltraSCSI signals. single-ended point-to-point bus. The effects of the finite slew rate on the signal edges are accounted fo r largely by specif)ring the voltage levels Complex Loads Figure 5 speciti.es a complex con­ that coincide with the receiver input levels. Thus, the fi guration and the single-ended SCSI signals that setup time ends when the receiver is able to detect an result at various positions along the bus. The logic

REO or ACK

1.9 v MAY BE .,IV DETECTED 1.6 v ------�------MAY BE i i ,�/(" DETECTED : : 1.3 v ------1- -r------i J SHALL BE i i i : , DETECTED : : : 1.0 v ------J_ ------_,_ -� ------.L- -- I I 1 I I 1 1 1 ' ' ' ' ' ' , I I 1 , I I I 1 I � SETUP ---.j �HOLD TIME � : i i.-ASSERTION _..; �NEGATION _____.; PERIOD PERIOD ' i i : i I J o 1

DATA DATA BUS VALID BUS

1.9 v

1.6 v

1.3 v

1.0 v

Figure 4 Single-ended UltraSCSJ Ti ming Measurements

Digital Technical Journal Vol 9 No 3 1997 13 sign al that is driving tile SCS I driH:r chip is the ;m d th

dc1·icc positi on 4, just after :1 rc i;Hin:ll· long run �Yi th rota I bus length mon.: than 6 to 8 centimeters (em)

no loads. This signal is below th e 1-V lc1 cl hut h;1S <1 bevond the b;Kkplanc. This reduced bus length is ve rv slow assertion slew rate rh:1t causes considcr;\ ble rather se,·ere when compared ro that allowed :u rhe loss of asserted state pulse width. This complex con­ maxi111um f:1 st SCS I tra nstCr rate (a total of 3 meters ).' fi guration works with the receivers used hut docs In the section Small, Improved Interconnect, we show

not have the timing margin req uired by the r;;1St-20 how to overcome this 1 . 5 - mete r, 8-device limit hv standard . using an acri1·c SCSI inrerconnect.

By varving the posi tion ofrhc loads so tint there arc Appl�·i ng the timing measure ment methods shown

no loads between the driver and the ti rst lo,ld (not in figure 4 to the wavctorms in Figure 5 illustr<\tes

shown ), the signal at the fi rst load ci c1·icc IS degraded that more c1rcful riming speci fi cation methods do even more than at position 4 in figure 5. This is indeed help signihunrh· to keep the ti ming margin one reason th:lt the o,·er:1ll length of single-ended h igh enough to usc.

UltraSCSI with mam· loads is restricted to 1.5 meters

3 METERS (10 FEET) OVERALL LENGTH (INDIVIDUAL MEASU REMENTS IN CENTIMETERS)

7.62 30.48 30.48 210.82 TERMINATOR

THREE 12.45-CM STUBS. FIVE 7.37-CM STUBS, 25 pF EACH 25 pF EACH

DRIVER DEVICE 7 6 5 4 3 2 0 POSITION

ACK SIGNALS

6 DRIVER INPUT 4 LOGIC SIGNAL 2 DRIVING SCSI DRIVER DEVICE 0 POSITION 4

2 7

z 4 0 (Jj > 2 6 0 0 a: w ()_ (f) � 0 5 > N

2 4 :�::=@<=:0==::::::=====""'O DRIVER INPUT 4 0

10 NANOSECONDS PER DIVISION

Figure 5

ClrrJSCSI SignJls in a Compl<.::--: J3us

14 Vu l. \) �" 3 1\.>97 Point-to-point Configuration If lo:tds :1rc present Summary of Developments in the Area of Increased only Jt the ends of the bus, the transmission line Syn chronous Data Phase Speed between SCSI devices improves elcctric:�lly. This The UltraSCSI (Fast-20) speed increase can be attrib­ occurs simply because the loads significantly disrupt uted to a systematjc examination of the margins present rhe characteristic impedance and cause ref-lections and in actuaJ SCSI hardware and to the elimination of the attenuation. The point-to-point signal at 25 meters excess margi ns. Advances in the integrated circuit indus­ has better Jmplitude and timing m:1rgins than signals trv enabled silicon designs to be specified \\'ith tighter in much shorter buses with closely spaced loads. controls on the driver and receiver ti ming and threshold

Figure 6 shows a typical example of a point-to-point properties than were possi ble when the SCSI-2 or SPI UltraSCSl signal . The format used in Figure 6 is the standards were developed. All the important changes sJme fo rmat used in Figure 5. needed ro r SCSI devices are CO IHJincd in the silicon designs fo r the dri,·ers and receivers. As a result, the user Differentia l UltraSCSI sees no ruffcrcncc between the :1ppe�1r:1nce of UitraSCSI Differential UltrJSC:SJ uses the same contigur:�tion rules and that ofordin:�rv SCSI. as rast SCSI (25-mctcr total length, 20-cm [ 8-inch] The system integrator must usc a more restrictive stu bs, 16-device load)' and uses the same timing val ues set of configuration rules than required fo r bst and as single-ended UltraSCSl. The larger signal amplitudes slow SCSI. Also, the only impact on softw are is the and the common mode rejection propertv of diffe ren­ add ition of :1 new speed agreement code for tbe r:ttcs tial transmissions help overcome the transmission line uniquelv supported by UltraSCSI. This negotiation weaknesses in hea,·il\' loaded and long buses. As \\'ith is done precise ly the same \\'a\' r( Jr UltraSCSI as tor :my high -voltage ditkrential system the costs-in terms any other to rm of SCSI. Finally, UltraSCSI devices of money, power, and space-are higher. are 100 percent backward compatible with ta st

25 METERS (82 FEET) OVERALL LENGTH

TERMINATOR ------TERMINATOR

DRIVER ONLY END LOADS FOR THIS TEST RECEIVER SHIELDED 34-PAIR EXTERNAL CABLE

ACK SIGNAL 6 DRIVER INPUT 4 cr------.

2

N 2

0 RECEIVER INPUT AFTER 25 M

NANOSECONDS PER DIVISION 10

Figure 6 Poinr-ro-point Ulrrc!SC:SI Sign81s

DigirJl Technical )ournJI Vo l. 9 .:--Jo. 3 I 997 15 Bus Expanders • Rcqui1·e th:�t terminator p011 er not be connected bct11·ecn the segments being coupled As noted previously in the discussion of complex loads, • Do nor need to know the nego tiated data phase there arc rather severe limits on the con figurations that speed or am' other ,·ariable propert\' of a crans:1ction can be achic1·cd ll 'ith single-ended UltraSC:SI '' hen • Re quire that there be no electrical or logicaJ connec­ implemented in ;1 single bus. The cxrcllSion to pJLlllcl ti on of the Dl HSENS line (a single-ended sig11:1l SCSI architecture that overcomes this constr:1int that indicates the tr:tnsmission mode being used on iiwoh-cs using acti1-c circuits that connect SCS I buses the bus segment) bet\1·ecn scgmcnrs being coupled elecrricallv but isolating them fr om c:1ch other Ill a transmission line sense. These circuits luvc the gen eral • Issue :1 SCSI bus RESET signal on one segment on name expanders, since tbev expand the contigur:nion detecting transmission mode (singlc-ended/LVD, capabilities of paralleJ SCSI. etc. ) changes on the othn segment

Each individual bus has t\\'O termin;Hors and its 011·n Simple cxp;mdcrs :1I'C becoming available from sevcr:tl transmission mode (single ended or diffcrenti:tl)and sources in the industrl' tor usc 11·ith Ultr:�SCSI. obevs transmission line-b:tscd contigur�1tion rules as if

it were the only bus in the system. When used 11·ith Domain Rules Using Simple Expanders expanders, these individual buses Jrc e11Jed bus seg­ When using only simple expanders in :1 donuin, six ments. The co llection of SCSI devices in all the bus rules must be obsen·cd :

segments that ;l iT clcctric:tlh· connected together is l. All bus segments in the domain must com ph· ll'ith caJied the SCSJ domain. One ex:1111ple of ;1 SC:SJ domain using expanders is shown in rigu rc 7. Note th eir Ind ividu:ll bus segment length limits and other that \\'hen using expanders, it is possible to h:11 c bus segmcn r-rclatcd rcq ui re ments.

segments that do not ha1·e ;JIW SCSI initiators or t:J r­ 2. An1· segment bct11·een t\I'O other scgmenrs must

gets but only serve to fo rm ;m clcctricl[ Intcrcon ncct support the highest ped(J rl));lncc levd that can be between other bus scgmcnrs. ncgori:ttcd bct11 een the t11 o other segments. t:or cx;1mplc, ti\ O 11·idc UltraSC:Sl segments musr not Exp ander Properties be scp;lrated bl' a segment that does not support

Exp:111dcrs arc available in t\\'O b:�sic t\ vcs: simple :1nd both ll'idc SCSI and Ultr:tS CSI. bridgi ng. Bridging opanders beh:11c ;JS a SCSI initi:t­ 3. The: t11�1xin1urn propagJtion dela�- bct\,·ecn �l n�· tor or tJrger, 11 hcrcas simple cxpJnders ha1·e :1 set of t11·o dc1 ices in the domain cannot exceed 400 ns. properties that ndce the m look like ;\ piece of ll'ire A special case exists ro r devices th:tt usc extrcmdv ll'ith dcl:11' to the protocol . Simple expanders Jong times fo r responding to BUS FREE (the so-e1llcd l3US SET DELAY )-thc onc-11·ay pro�1a­ • Cannot initiate SCSI IDs ;J nd arbitr;J ti ons and C:lll­ gation limit is 300 11s instc1d of400 ns. not originate messages, :�Ithough the expanders em read messages sent from initiators ;m d targets 4. The nu mber of ;J ddressa blc dc1·iccs cannot exceed 16 unless the dom:tin coma ins bridging expanders. • Alloll' minim;JI Jrbirration prop:tg:nion deL11· S. A br The REQ/A.CK ottsct ncgoti:tted bcmeen ;1 Jl\' t\\'O dc1·iccs must be large enough to ensure that • Do not intertCre with the REQ/ ACK oftsct count Jdcq u,ltc offset ;1 nd buftC1·ing is av;lilable to accom­ Al low min/max pulse ll'idrhs to be m:ti ntaincd • mod:Jtc the round-tri p ti me bet\\'ccn the de1 ices. • Require the fi ltering of the SCSI RESFT line For the maxinHII11 Ultr;JSCS I rate 11·ith a 400 ns

• Al low :t rbitrary placement of the initiJtor ;m d the mJ ximum one-ll'av dom

BUS SEGMENT BUS SEGMENT

TWO-PORT TH REE-PORT EXPANDER EXPANDER

TWO-PORT EXPANDER BUS BUS SEGMENT SEGMENT

Figure 7 SCS I Donuin Built Using Ex�);Jnders

I I ( 11\ 1 i r' ·1 l 'l···r1 · :ll ill l minimum offset is 18. (This of'fset level is derived cables have historically been large, bulky, and generally bv consideri ng a maximum round-trip time of800 difficultto manage.

ns at 50 ns per transfer [ 800/50 = 16J and some­ Spearheaded by activities mat began in 1995 in the what arbitrarily adding two transfers to account fo r SFF (formerly Small Form Factor) industry group, some additional delay due to the processing time in standard ization is under way of two new connector the silicon.) f::m,jues tJ1at offer unprecedented levels offunctionauty and true multisourcing of complete connectors to r Achieving the 400-ns one-way domain delay parallel SCSI. These fa milies are the Very High Density requires expanders that will not pass the wired-OR Cabled Interconnect (VHDCI)2 shielded connectors glitch (noted earlier in the introduction) between bus that reduce me overall size of an external connector by segments. This filtering of the glitch allows the bus two tJ,jrds and me Single Connector Attachment-2 segments to settle individually. ( SCA-2 )' unsl,jelded connectors that integrate into The propagation delay through an expander a single connector all the functions needed to run a directly subtracts fr om the physical distance between peripheral. The VHDCI fa mily revolurjonizes the devices. It is therefore desirable to use expanders external SCSI interconnect and the controller parts of with small delays. For a single-ended-to-single-ended the internal SCSI interconnect; the SCA-2 fa mily does application, the delay can be as low as 10 ns. For a the same tor me internal device interface. single-ended-to-differential application, the delay is For tl1e first time, complete connectors-not just typically around 100 ns, which is another significant the mating interface-are being standardized . This penaltv to using differential bus segments. fe ature is essential to achieving interchangeability and More detail concerning these rules and other prop­ second sourcing fo r connectors with the same style of erties is available in the draft ANSI document: SCSI termination-side contact. The VH DCI fa mily is speci­ Enhanced Parallel !nteJjace, 5 which was edited by the fied in 26 different fo rms, all with exactly the same author of this paper. mating interface, so that virtually anv kind of device

Summary of Improvements Related to Bus Expanders or cable assembly design can be accommodated. Interestingly, this array of choices for the connectors The use of simple expanders dramatjcaiJy extends the utility of single-ended UltraSCSI. The most obvious does not increase the complexity of the interconnect example is the ability to introduce point-to-point but rather opens up new wavs fo r prod uct developers segments where additional length is needed . A less obvi­ to design products while maintaining a simple and ous example is tJ1e abiuty to create star or hub configu­ physically interoperable separable connector interfKe. rations by clustering simple expanders into a local In tact, this ability to accommodate a variety of prod­ physical area. An example of a tJ1 ree-port SCSI hub uct design requirements without changing tJ1e separa­ is shown in Figure 7. Note the three simple expander ble interface is one reason that SCSI is becoming less circuits internally connected within the hub. Simple complicated . expanders also make it possible to mix single-ended and Similarly, the fa mily of SCA-2 connectors tor SCSI ditfe renrjal SCSI devices in tJ1e same domain, to achjeve internal devices and cables is fo llowing the VHDCI the full 16-device count, to add and remove bus seg­ standardization model, with a significant number of ments \\�thout shutti ng down tl1e entjre domain, and to intennatable fo rms being standardized. These connec­ achieve differential performance witJ1out incurring me tors offe r the ability to bring all the SCSI signals, all the extra cost of ditlerential . Bridging expanders offe r me power and ground connections, and all the optional same transmission isolation as simple expanders and signals, such as IDs, spindle sync, and power faj l, out of may allow increasing tJ1e number of devices in the the device through a single unshielded connector. This domain to as l,jgh as 946,5 but bridging expanders are fe ature dramatically shrinks the cost and complexity of not as well developed as simple expanders and \vii i not interconnecting an array of SCSI devices. be explored in deptJ1 in tlus paper. Using an SCA-2 connector, the device mav be Note that me improvement in signal integrity is dra­ inserted into a without using cables. If the matic when using expanders with backplane applica­ SCA-2 and backplane combination is not used , a SCSI tions. Therefore, it is good practice to use an expander cable (50-pin or 68-pin conductor), a to ur-lead power whenever connecting a SCSI cable to a backplane that cable fo r ground and power (5-V and 12-V), and one contains SCSI devices. or more smaller cables for the IDs etc., are required to r euery device in me system. Each of these cables is Smaller, Improved Interconnect routed ditterently, has different current carrying and other electrical requirements, and has very ditk rent Another recent development in paral lel SCSI techno]­ connectors. Although this cabled option is fl exible and ogy is the introduction of much smaller external phys­ ofters significant advantages in some systems, it is usu­ ical interconnects and more capable internal device al ly not the best solution in the device array and mod­ interconnects. The SCSI connectors and shielded ular packaging applications that are required fo r the

Digital Technical Journal Vo l. 9 )io.3 1997 17 higher-end applications. Therd(Jre, the SCA-2 is a sig­ upper and lower connectors \Vithout interrerence. The nincant factor in the dramatic reduction in complexirv specifications ot· the VHDCI interfa ce ensure that of higher-end SCSI de,·ice applications. neighboring PC opti on slots will nor ha\'(:in terference even if al l the SCS I ports have cable assemblies attached . VHDCI Connectors The VH DCI connector is useful for multipart appli­ The physical size of the VH DCI connectors is much cations such as RA ID (redundant a rray of inexpensive smaHer than the earlier versions, as seen in figure 8. disks) controllers. Figure 10 shows examples in which Because ofit s low profile, the VH DCI 68-pin bmilv is the ,,·ide ' ersion of the connector ramilv has allowed at approximately half the height and twicethe width of the l c1st a doubling of the number of ports possible in a latest external connecror, the High ­ single conrroller t( mn bctor. As illustrated in Figure Speed Serial Data Connector ( HSSDC). figure 9 shows 10, the dC\ ice design enables up to fo ur wide SCSI a comparison of the VHDCI and HSSDC connectors. ports on a single PC option card cutout. The same panel space is required tcx either· technologY. The VH DCI rete ntion scheme is also significanrlv The VHDCI connectors shown in Figure 9 are simplified by introd ucing a three-way retention post closely spaced, but the orientation of the poLlrizing ti.1rt he bulkhead connector. This post accepts ( 1) the shield connection is 180 degrees dirterent betweenthe com-cntional jackscre\\·s, (2) a squeeze-ro-rele;ise clip upper and lower connectors. This arrangement allows for positi\'e retention with rapid release, or ( 3) a detent an offSetcable assemblv to be used ,,·here one side is fl at. ri ng retention th;lt requires a stronger pull than that This same cable assembly mav be used on borh the re q u i red with no retention but no action other than

SCSI-1 LOW-DENSITY NARROW (50 PINS)

SCSI-3 HIGH-DENSITY WIDE OR NARROW (68 PINS)

VHDCI WIDE OR NARROW (68 PINS)

VHDCI NARROW (36-40 PINS) MICRO SCSI

Figure 8 External SCSI Connectors

Figure 9 Comparison of rht:68-Pin VHDCJ ;md fibre Channel HSSDC: Ccmnccrors

18 Di,!;itJ!Te chnical )ourrLli \'ol. \1 �o. � 1997 FOUR WIDE SCSI PORTS ON A SINGLE PC OPTION CARD

EISNISA CARD

Figure 10 Four \Vide SCSI Ports on a Single PC Option C�rci

pulling or pushing. The choice of retention type is and its contacts are imbedded in the housing where made in the cable assem bly. All 68-pin VHDCI cable they cannot move or become distorted. assemblies that compll' with the SFF speci ri cati ons work on all 68-pin VHDCI mati ng connectors. SCA-2 Connectors Figu re 11 shows the details of tl1e 68-pi n VHDCI The SCA familyuses an 80-position, leaf�style contact system. The lip in the jack post provides the securing to interface all active SCSI lines, three power voltages, point f'(JI' squeeze-to-release clips and fo r split-ring and de\·ice conu-ol signals. This connector is consider­ detent retention. The center of the jack post is threaded ably smaller than the collection of the three diffe rent for usc \\�th jackscrews. connectors used fo r power, options, and SCSI bus in a Al though smaller than th e high-density connector, cabled system. There are two basic versions of SCA the VHDCI connector is durable. It has no pins that connectors: SCA- 1 and SC:A-2. Both versions are can bend; its retention scheme uses the same-si ze unshielded and usefu l only within shielded enclosures. jackscrew thread as the high-density wide co nnector; The SCA- 1 has 80 positions with all contacts designed

CABLE/BOARD SIDE DEVICE SIDE

THREE-WAY RETENTION

Figure 11 01'er;111 View of the 6il-pin VHDCI System

Digital Ted1 nid Journal Vo l. 9 l'o.3 1997 19 to be the same length. The SCA-2 can be mated ro rhe The use of backplanes to r direct device attachment SCA-1 but has advanced grounding contacts and is possible because all the electrical connections fo r the sequenced signal and power contacts for supporting device are available in one connector on the device. bot plugging and blind mating (no visual fe edback This design eliminates the cables used to attach the during mating). Both versions are <1 \"ailable in mam· dn·ice and the space required fo r the connectors, rhus styles, which diffe r bv the termination-side structure significant!\· shrinking the size required to package and O\'erall orientations. multiple dc,·ices. The SCA- 1 is not a documented standard and is being replaced by tbe SCA-2. The SCA-2 connector ExternalSCSI Cable was introduced to SFF in 19957 as rbe first stepto\\'ard The extcnul cable fo r SCSI is shrinking also, through formal standardization. the usc or smaller-gauge wire, better dielectrics, and Two special features exist in the SCA-2 connector. Jess jacketing material, as ill ustrated in Figure 13. First, two contacts, one on each side of the connector, formerlv, wide SCSI required a cable of'approximatelv provide the fi rst make/last break to r the ground con­ 12.70 mm· (0.50 in) in diameter (a l26.677-mm2 nection. This design ensures that a common elecrric-d [ 0.196-in' J cross section) with 28-gauge wire. Today, ground is established between the device rtably ti ts within the device bound:�ries. gaugc wire sizes is a good optimization.

_... � ADVANCED _..- � DEVICE SIDE / / J GROUNDI NG / CONTACTS

_... / _..- �- / /...... -;/ / _..­ 7 / / / ; / / ADVANCED GROUNDING / CONTACTS

�'

CABLE/BACKPLANE SIDE

Figure 12 0\'CI'�\11 Vic,,· of the SCA-2 Connector Svsrem

1')97 20 Digit

30-GAUGE WIDE WITH 28-GAUGE WIDE IMPROVED DIELECTRIC

32-GAUGE NARROW 30/32-GAUGE NARROW (MICRO SCSI)

Figure 13 Extern�l SCSICabk Diameters

Summary of the Benefits Derived from a Smaller, directJy part of the advancements in parallel SCSI. Improved Interconnect The SCSI terminator power noise is determined by The VHDCI connector and smaller cables combine the size of the decoupling used on the SCSI termina­ to offe r a robust yet user-friendh, revolution in SCSI tors and the size of the capacitance on the device interconnect. The leaf-style contact of the SCA con­ being inserted. This noise is easily controlled bv nector eliminates problems with bent pins that fr e­ ensuring that these sizes meet the values specitied in quently bedevil the older wide SCSI connector. the SPI standard.4 The ability to usc up to fo ur wide UltraSCSI ports in The delicate case is when the SCSI signal lines arc a single PCl option slot increases the SCSI connec­ involved, which is the subject of this section. To deter­ tivity per PC! slot to 60 devices (from 15 devices). mine the nature and magnitude ofthese signal line dis­ l3yusing multiple PC! slots, hundreds of SCSI devices turbances, one must understand the fo llowing three can be connected to a single PC or . mechanisms: ( l) the overall sequence of events, ( 2) In addition, the SCA-2 connector implements the the electrical dynamics of connector contacts when essential contact sequencing required to perform SCSI used in tJ1e SCSI application, and (3) the electrical device hot plugging. consequences on the bus when the device makes/ breaks contact with the SCSI signal line. Device Insertion and Removal Bus Tra nsients There are two sequences of interest: insertion and removal. The removal process is easy to grasp afterthe The multidrop fe ature of the SCSI bus al lows device insertion process is understood . removal and replacement without disturbing the commu­ nications between other SCSI devices, if the electricaJ dis­ Single-ended Device Insertion turbances caused by the device being added or removed The initial conditions considered fo r SCSI device arc not detected by any other SCSI devices. Thus, it is insertion assume a SCSI device with its ground solidly architectmally possible to dynamically reconfigure the and continuously connected to the ground of the device population without interrupting existing data SCSI bus. This connection is easily accomplished, tor transmission processes bet:\veen operationaJ de,�ces. example, by using sequenced contacts where the The transients involved with device insertion and device ground makes connection well before any removal include mechanical vibrations, power distrib­ signal connection. In this state, the SCSI device pins ution instabilities, SCSI terminator power noise, elec­ present a maximum fu lly discharged capacitance of trostatic discharge (radiation and induced current), approximately 25 picofarads (pF). After the device and SCSI signal line noise . Al l except the SCSI signal signal pin contacts the bus, this capacitance becomes line noise and the terminator power noise are handled charged (by extracting charge fr om the bus) to the by the storage system design and therefore are not voltage on the signaJ line at the time of the insertion.

Digital Tec hnical )ournol Vo l. 9 No. 3 1997 21 These values range from �1 pproximatelv 3 V f()l. paper is not present!)' a,·aiLlblc from actual SCSI hard­ negated Jines to nearly 0 V t(Jr asserted lines. ware. The disturbances in the DSSI bus arc larger than Since the SCSI d evice being inserted is logicalk off those seen in the SCSI bu s, because the DSSI voltages (i.e., there is no driver current), the onll· current that arc slightk higher (3.5 V tc1r DSSI compared to 2.8 V needs to tl oll' is that required to charge the 25-pF fo r s1nglc-ended SCSI), �m d the instrumentation capacitance. This is sharpl v ditlc rent fr om man1· con­ capacit�mce ( -lO pF) adds signiricmtl1· to the device nections in electronics in which current tloll's capacitance because of the state of the art t()r scope through the contact after an electrical cont:tct has probing in 1990. Numerous tests with modernscope been established. probes (0.6 pf or less ) of SCSI hardware have shown In the case where no bus vo ltage ch:tnges occur that the SCSI disturbances �1re indeed qualitati,·ch· the except as a result of the device insertion, the insertion s�1me bur signiticanth· less in size than those sho\\'n transient begins ll'ith the initi: tl contact and ends " hen here ti·om the DSSI hardll'�lre. there is no fu rther bus \'Oitage ch�mge ll'ith time (the The mechanisms described appil' to am· s1·stem in steady stJte vo l tage ). Once the device pin 1'0l t�1ge \\'hich rhc insertion transient is caused bv the charging reaches the steady state bus voltage, no fu rther current of a sm�1ll clp�Kitance . Figure 14 shows the basic test flows through the contact. setup. A device is inserted into a connector with scope Therefore, once the de�·ice capacitance becomes probes �1 tt,Khed on either side of the mating intert:tcc

charged to the steady state signal line 1·ol tage , no tl11Ther �1 11d '' ith ,1 11 �1dditional probe �1ttached to the bus some disturbances to the signal line \'Oitage wi ll occur e1·cn 1f distance ri·om the connector. The 1·oltage on the the contact opens momentarih· dw·ing a channi ng dc1·ice side of the connector is used as the trigger event. The voltage on the dc1·icc capacitance ch.mgcs signal into a digit:d storage scope so that the e1·ems

during the transient from a discharged state (zero vol t­ bct()l'e, d u ri ng, and �1 ti-e r the mating event em be age ) to the steady state signal line voltage, with the cur­ examined . This is clearlv �1 si nglc-c,·ent t1·pc of mea­ rent alll'avs flowi ng into the de�·ice capacitance. surement, so �1 high sampling r:tte ( 1 bi llio n s�1n1ples If the signal line voltage ch�1nges after the insertion per second ) �m d signiticmt scope memon· is requi1-cd transient is compl eted (because ofe,·cnts such ,1 s heing to c1 pture the 1\·a,·etorms. The scope probes used ha1·e driven bv other devices, Lw noise, or bv the inserting a 1-megohm input resisunce. device beginning to usc its own driver), then current The connector used tor the tests in this section has will again begin to flow through the contact. This is <1 multiple p:tL1llel pins that :t il mate and dcmate in the norma l SCSI condition tor contacts in scn·ice. If the same ge11eL1I time period . There is no intention:�! signal line 1·oltage ch anges during the insertion tr�1n­ difference 111 the pi n lengths. The time rcbtionship sient because of events other than the connector con­ bet11·een the m�Hing CI"Cilts 011 t\\'O ncigh bming pins tact effects (e.g., signals ch:mging becau se of being ll'as cxr)lored ti rst. Bv choosing neighboring pins, the driven by other devices, other noise ), then it is more diftc rences betl\'ecn the pins is kept to a minimum difficult to determine exactly where the insertion tran­ so the time ditkrences observed should represent the sient ends. The beginning of the insertion transient best pin-to-pin synchronization in a m:tti ng el'ellt. will still be marked bv a charging of the device cap:H:i­ For this test, a probe was attached to each of t\\'O tance. Enmples presented later in this paper shcl\\ pins, and the connector alone (not part ot· a de1·ice ) both insertion events and dri,·ing c1·ents fi ·on1 other 1vas 111:1tcd to the bus segment connccto1·. Figure 15 devices occurring at the same time. sho\\'s the results. The time required fo r complete contact mating on Both pins appear to show instantaneous transitions all SCSI signals in the bus is up to six orders of magni­ bet\veen tile charged and discharged states on the time tude greater than the time required ro r a SCSI signal to scale th�1t \\'�1S required ro c1pture both C\'ents on the change stJte. Therefore , signJI ! cl·cl changes �He lik.el1· s:�me plot. The mating e1-cnts :11-c sep:1rated lJ I' approx­ during the insertion process. The electrical bcha1·im of imateh- 19 milliseconds, and there is no c1·idence of the contact as it continues \\'iping (sliding ah:er initi�1l an\' disc h�ngi ng ah:cr the initial chargi ng has occurred . contact is made) fr om its initial comact point to its Since the scope probes have �1 !-mego h m input resis­ final resting position becomes :1 critical part oF the tance, any lack of contact during the wi pe portion ot· process. The fo llowing subsections explore this beh�11'­ the mati ng ll'ill allo\\' the capacitance to d1sch�1rgc ior in detail. through the probe ll'ith a rime constant of approxi­ mateil' NC. ll'hcrc R is the scope probe resistance and Connector Insertion Dynamics The data prcscmed in Cis the sum of the connector pin and probe cap:lci­ this section were derived fi·om a DIGITAL DSSI bus in tanccs. Assu m ing a total of 10 pf, this gi1'Cs a decJv 1990. The DSSI bus is nearly identical to the SCSI bus, ti me constant ot'lO microseconds. and many of the results apply without modiricnion figure 16 shows another mating event on pin 1 at to SCSI. Si m i l ar data have been obscn·ed on the SCSI a 500 times more sensiti1·e time scale. ln this case, bus, but the complete set of d,lta presented in this some e1 idence of momentan· opens is seen \\'ith the

22 DigitJI Teclmic1l )ourml Vo l. 9 \:o 3 I 997 TERMINATOR BUS SEGMENT TERMINATOR

REMOTE SCOPE 1..----- SCOPE PROBE PROBE NEAR CONNECTOR ! CONTACTS CONNECTOR �IJ' CONTACTS SCOPE PROBE _L USED FOR TRIGGER

------CT ELEMENT BEING INSERTED/REMOVED LOCAL GROUND

Figure 14 Test Setup �or Insertion/Removal Transients

4 � 3 (j) 1- z 2 - MATING EVENT _j 0:: 0 c w (9 0 4: � 0 3 > _- MATING EVENT z z 2 0:: "'0:: 0

5 MILLISECONDS PER DIVISION

Figure 15 Tim,· R.cbtionship between i\'laring b·cnrs fo rT" o Pins in the Same Connector

/ DEMATING EVENT/

3 ------'"""'--\\ .-----..,\ ' � � (j) 2 ·,, '-,_ 1- � _j Q_ 0 c � w 0 -� (9 4: 1- _j 3 0 (\j - �� > 2 z z 0:: 0::

0 --�---· -- -�------�------�--�- ... 10 MICROSECONDS PER DIVISION

Figure 16 Contact Counce Events

expected decay dynamics. The actual time constant is a The initial mating event on pin 1 still appears to be bit longer than 10 microseconds because of some instantaneous on the time scale used in Figure 16, but capacitance in the connector pin. This bounce behav­ some slope is visible in the second bounce event. Also, ior may or may not be present during the initial stages during the second decay period, a shelf in the decay of the event shown in Figure 15, but clearly the behav­ indicates that a partial, high-resistance contact was ior is not visible in the figure. Toobserve the suite of btiefly experienced. Pin 2 is not close to making a con­ transients that exist in tbe mating process, one must tact at the time range shmvn in Figure 16. The figure examine the transients at sever�1l difkrent time scales. shows a small amount of cross talk in the pin 2 voltage In general, this requires repeating the mating events, waveform caused by the pin l transients. since the dynamic range of the scopes used was insuffi­ This data clearly shows that the details of the mat­ cient to capture all the detail in a single event. ing process are highly complicated and intrinsically

DigitalTe chnic<�lJournJI Vol. 9 No. 3 1997 23 unpredictable. Therefore, the best we can hope fo r is almost no cross ta lk into pi n 2. E1·ents with these char­ to establish some limiting cases tor the important Jcteristics arc some11 hat r:1re and are called grad ual par:�mctcrs. The limiting featu res shown in hgurc 16 transients in this paper. arc the extre mely rapid initi:�l mating e1-cnt and th e Figure 19 shows a closer look at the rapid transiellt decay times. vVe examine these r:�pid tra nsie nts 1n t\'pc of mating event. In this figure, we have added a detail later in this section. The dccav times arc deter­ de�·ice ca�1acitance ofapproxinutelv 20 pF to the scope mined by the actual contact resist:� nce :�nd the resis­ probe to r a totJI of approximately 30 pF. Notice that tJnce of the leakage path to local ground. For normal the transient req u i res 2 to 3 ns to substantiallv com­ SCSI devices, there is very little lc:1kage to ground on plete its chargi ng . There is a ratio of n earl y 10' between the device pin so the opens produced by the bounce the mating events on diffe rent pins in the same connec­ h:IVe no effect. tor and the rapid transient of a single contact. Some cases observed i ndi cate m u ch more complex bounce structures. Figure 17 sho11 s a case in 11·hich the Limiting Parameters for the Rapid Transient The mating connection is not established until more rhan q u estion of 11·hether the c1pid transient sholl'n in 700 microseconds have p:lSSed . Fi gure 19 is the II'Orst case needs tO be e xplored The data iJ1 Figures 15 through 17 11cre all acquired because the duration of the transient affects the d istur­ from the same connector contact during separate matjng bance on the bus. Some bou nding �e atures and some processes. Tvpically, the details of the mating event arc i m pli cations of the observed behavior of the rapid very difrcrent even under norn.inally identjcal conditions. transient arc noted in this su bsection . Another type of mating event is shown in Figure 18. Assumi ng th�n the transient event occurs in 2 ns :�nd This event requires approximately 10 microseconds to that the velocity of i m pi ngement j ust prior to the first make the transition fr om uncharged to ch:�rged, and mating event is 1 meter per second, then the distance there is no bounce. This par tic u lar cvem prod u ces traveled by the conract II'Ould be 2 nanometers (nrn).

4

3 "' (jJ z 2 � Ci: 0 C. L.lJ "' 0 � 0 3 > z � 2 Ci: z Ci:

100 MICROSECONDS PER DIVISION

Figure 17 Extended ''dating Bounce b-e nts

4

3 --�-- ----� -� -----�·�------�------�-�--�----...... _...�------

� ...... /. (jJf- z 2 -' ./ 0 Ci: C. · ------MATING EVENT w "' <( f- -' 0 3 > N z 2 Ci: z Ci:

10 MICROSECONDS PER DIVISION

Figure 18 Gr;ldu;ll Mating E1'Cnt,No .Bounce

24 Dil!i t:ll T�chnic:1l [ourn:ll Vol. ') �o. 3 \')')7 -3 if) 1-- _j 0 2- MATING EVENT...... ______w 2 <5 <( 1-- _j 0 >1 z 0::

0 f------�.J

5 NANOSECONDS PER DIVISION

Figure 19 Dcr�ilcd Sn·ucrmc of the R.apidTL 1nsicnt

This distJIKe is equi\·alent to a rcw atomic distances. mcrease in contact area. Attempts to use scanning 'The distance tLl\·eled during the gradual transient electron microscopy to examine the actual colltact shown in figure 18 would be approximatelv 10 area were not fr uitful in establishing the actual initial microns, and during the extended bouncv case shown physical contact area because of the se\'ere physic:II dis­ in figure 17, approximately l mm. The velocity rcJr the ruption that occurs on the microscopic level and l:mer two cases would likely be somewhat reduced because of the small sizes involved. becHise of the mechanical interference between Under these assumptions, the physical contact are1 the pins, and the actual distance traveled is probably is assumed to be ( 2 nm )' or4 x 10-14 cm2 in the fo llow­ signiricantlv less. There is little opportunity, however, ing calculations. The current densit\' to support the r()l' the \'eiocitv to be reduced tcJr the rapid transient, 50-mA rapid transient current is therefore approxi­ ;m d this distance of 2 nm is probablv at least the matelv 10'2 A/cm2 Typical current densities in copper correct order of magnitude. and other metals arc less than 10'' A/em'. The electro­ The tcJIIowing calculation shows the total current migration onset curre11t is of this same order. The cur­ b·els required to charge the capacita nce in 2 ns. rent densit\' in the ra pid SCSI transient is a million times greater than that which metal can support. Q= CX=30 10 12 pF 3.5 V X X To support the massive current densit\', the actual = 10.5 X 10 11 coulombs, contact area must be much larger than the initial plws­ where Q represents the total charge, C is the capaci­ ical contact area assumed in the above calculations. tance, and V is the voltage. Since this charge is trans­ The author believes that this can be explained by a krred in a time t of 2 ns, the average current is micromolten metal-to-metal joint that is fo rmed upon initiaJ contact and that the ti·ont of the melt propagates Q/1 = 10.5 x I0- 11 coulombs/( 2x IO''ns) (probablv through phonon interaction) at approxi­ = 52.5 milliamperes (mA). matelv the speed of sound in the metal. This process For a gradual transient that takes 10 microseconds, would create crudclv a thousandtold increase in the the ;1 \'erage current is approximately 10 microamps. effecti,·e insertion ,·elocitv and would result in a mil­ These calculations show that the most sc\·crc ampli­ lionfold increase In contact area, since the melt ,,·mi ld tude disruption to the signal on the bus occurs with propagate in all directions. the rapid transients, since relati,·eiv large current must This mechanism would prod uce reasonable current be supplied in a short time to charge the capacitor. densities and would tcmn an intimate metal-to-metal The next item to be examined is the current densitv interEKe with both contacts that would aid in reduc­ thJt must exist during the transient. Since the contacts ing the contact resisuncc. The micromelt size becomes move only 2 nm Jnd the surbcc ti n ish of actual con­ rapidly selt�limiting, with the expanding contact are

Digiral Tccbnic11 ]oum:ll Vo l. 9 No. 3 1997 in this paper. One special \'ariation, hoii'C\cr, is \\Urth In the process shown in Figure 20, it is probable that noting-the combination of rapid and gradual tran­ a gradual-tvpe contact is bei ng maintained somewhere sients in the same mating process. Sometimes the mJt­ else in the contact, since no voltage decay is evident i ng process starts with a gradual transient and then when the rapid transient ends. Indeed, it is to be shifts to a rapid transient. Figure 20 shows a complex expected that the rapid transient mechanism would mating process in which ( 1) a gradu<1ltr:m sient initi­ not operate afterthe capacitance is charged to a certain ates, (2) <1 rapid transient starts but does not complete, level, since there would not be enough energy diffe r­ (3) the rapid transient ends, (4) another gradual tran­ ence to initiate and sustain a rapid transient. Therefore, sient process starts, and ( 5) another rapid transient the gradual transient is the behavior derived fr om an fi nishes the charging process. extrapola6on of the normal mechanisms that produce This observation is consistent ll'ith se\'eral possible contact resistance . This detailed discussion is pursued microprocesses d ur ing which the initial r:�pidtr Jnsient because we must u nderstand the basic phvsical mecha­ extinguishes betore completion. nisms to gain confidence that we are considering the worst-case disturbances. • The micromelt becomes plwsicall\' torn apart 1)\' the advanci ng motion of the contacts. (This process Single-ended Device Removal is u nli ke ly because of the ex.cessi1cl\ ' slm1· pl1\'sical During the process ofremm·al, the device pin separates motion . ) fr om the bus. Since both the bus and the device are at • The micromelt explodes. (This process is likely. ) the same voltage just bdorc the separation, no current • The micromelt becomes resistive th rough the cont­ is Hawing unless the bus voltage changes when the con­ amination of the melt with insuiJting m ateria l . tact is in the process of separati ng. Therefore, in most

• The micromelt fr ont reaches a thin region and cases the separation process causes no disturbance. opens because of the lack of material. Bounce can occasionally be observed during the demating process when there is a leakage-to-ground • The microinelt fr ont reaches an insul :rtingr egion . patiJ present Oil the device side. Of course, if a voltage On fu rther movement of the contacts, a new rapid decav occurs and the contacts re-connect, the mecha­ transient condition is encou n tcrcd betll'een diffe rent nisms are essential[\' the same as tor the insertion tran­ metallic peaks or· the contacts, and a IlL\\' rapid tran­ sient. The ke1' poinr is that no additional mechanisms sient begins. Fi gure 21 shows a conceptu

Ul 3 � 0 2.. LlJ 2 C) � PARTIAL RAPID 0 TRANSIENTS > 1 z 0::

0

1 MICROSECOND PER DIVISION

Figure 20 Gradu<11 Jnd l\;1 pid Tra nsients in the SJtnc Maring l'roccss

26 Dig:ir�lTc dmictl Journal Vol. 9 i':o. .:l 1997 bus disturbance to the voltage on the device side. The MATING SURFACE 1 bus voltage is reduced while it supplies the necessary charge to tl1e device pin. After the device capacitance is

TIME A charged, the bus resumes its voltage level before the FIRST MICROMELT insertion transient (more or less). STARTS BETWEEN ASPERITIES ON In tilis test, the bus pulse is approximately 3-ns wide THE SURFACES at its midpoint; its peak amplitude is approximately 1.25 V. This pulse is significantly larger in amplitude than that produced fr om a device alone. MATING SURFACE 2 One of tl1e more interesting features of the signals in Figure 23 is the lack of commonality or tracking in the signals afterthe rapid transient has passed. In the MATING SURFACE 1 simplest interpretation, one would expect both sides TIME B of the connector to have nearly the same voltage FIRST MICROMELT (at tile least to be witilin the accuracy of the 0.1-ns OPENS BEFORE OPEN ------CAPACITANCE IS propagation r.i me bet\veen the probes). The fo llowing FULLY CHARGED discussion addresses the author's current thinking on (CONTACTS HAVE NOT MOVED the reasons tor tl1 is lack of tracking. SIGNIFICANTLY) Instrumentation effects, such as resonance or differ­ ences in probe properties, were ruled out by using MATING SURFACE 2 botil probes on the same signal and noting that there was little diffe rence in the signals reported fr om each

MATING SURFACE 1 channel. Later, ty pically aftera few microseconds, the voltages do become effectjvely the same.

TIME C: Because a significantvoltage difference is present for CONTACTS MOVE relatively long times, there must be a significantvol tage TO SECOND PEAK AND START THE source bet\veen tile contacts to support this observed SECOND MICROMELT difference. In the initial stages, the difference between THAT COMPLETES THE CHARGING the pin voltages is approximately 3 V. If the current is the one calculated in rbe section Limiting Parameters

MATING SURFACE fo r the Rapid Transient, that is, approximately 50 2 mA, rl1en the current-limiting impedance must be at least

KEY: 3/0 .05 = 60 ohms. This impedance, coupled with the -- DIRECTION OF MOTION parasitic capacitances and inductances, serves to blunt Q MOLTEN AND PA SSING CURRENT tile instantaneous electrical energy transfers that would be implied by a very low source impedance. If tl1e • RESOLIDIFIED AND NOT PA SSING CURRENT source impedance were very low, then both sides would have to track shortly afi:er the initial contact. Figure 21 Part of tilis limiting impedance is tile loaded or local ArchitcctLJrc ofCombinat ion Gradual/Rapid Maring Event transrnission line impedance oftl1e bus. The charactetis­ tic impedance is nominally approximately 100 ohms for same pins in exactly tl1e same connector used fo r tile an unloaded bus. Since tl1e bus connector is attached to e\·ents in th e top of the figure, and tl1ere is no evidence of tile middle of tile ti ne, both sides are available to supply :my activity on pin 2. Pin 2 demated long before any charge and tl1e effective charging impedance would activit\' was seen on pin l. Again, this underscores tile be approximately 50 ohms. A 30-pF capacitance would unpre�iictability of the detaiJs of any given event. have a charging t.i me constant of 1.5 ns. This time con­ stant fitstl1e observations well during tile rapid tra nsient Impact of Device Insertion and Removal on Bus Signals itself but does not fitthe riming parameters of tile volt­ This section contains several examples of the noise age clifferences observed well after the rapid transient. prod uced on the bus side of the connector. Actual Elevated local temperatures are almost certainly pre­ devices with approximately 25 pF of capacitance were sent during the rapid transient (near the melting point used to obtain tl1e data. This capacitance value is of the metals1 ), so it seems plausible tilat the mystery increased bv the probe capacitance. On tbe bus side, voltage source is basically thermal electromotive force there is also some increased capacitance caused by the (EMF)between th e pins. Al lowing a fe w microseconds probe used to acquire the bus side signal. Figure 23 to achieve thermal equilibrium and subsequent loss shows the basic impact of a rapid transient on the bus of the tilermal EMF aJso seems quite plausible. These side of the connector and the time relationship of the details are inviting fi.trther detailed investigation but

Digital Technical journal Vo1 9 No.3 1997 27 4 3---

_ __ DEMATING EVENT

100 MICROSECONDS PER DIVISION

(a) Demating En:nt \\'ith 60-microsecond Separarion

4

3 ·---. - {jJ DEMATING EVENT � z 2 -- WITH BOUNCE 0 a: 2:-

0 \. ___. _____

100 MICROSECONDS PER DIVISION

(b) Demating £,·cnr wirh Bounce

Figure 22 Dem:ning Evcnrs

{jJ 3 r------�

� 2 BUS SIDE / w (.9 <( f- 0..J > 1 � z DEVICE SIDE a:

NANOSECONDS PER DIVISION 5

Figure 23 Mosr Severe Noise Pu lse Observed

do not affect the practical concl usions as applied ro The point extracted from these charging-impedance parallel SCSI. and settling-ti me observations is simply that the over­ As added evidence fo r thermal dfects, experiments all cnergv transfer rate is limited by the microphysics with early LVD SCSI devices that use a 1.2-V bus level of the process. This means that Figu re 23 almost cer­ instead of the 3.5-V bus level shown in Figure 23 tainlv illustrates the worst-case distu rbances. transfer much less energy and have a much shorter It has been noted that the bus pulse is similar to that pro­ settling time before both sides of the contact track. duced by a stub on the bus and a signal with a f:1 St 1isejfa!J

These LVD re sults will be reported separatelv. time.In a sense, we really are charging a stub in either case,

28 Di)l;it�l Technical journal Vol 9 No.3 19'.17 and in both cases the loaded or local characte1istic imped­ of the attenuation and dispersion depend somewhat ance of the bus limits tlKextent ofth e clisturbancc. on the bus media used. To more accurate!\' measure the noise pulse pro­ The rapid transient bus pulses arc sholl'n on actual duced when a device is added to the bus, measurements data pu lses in Figure 25. The top trace in the figure were pertormed wi thout a scope probe attached to the shows a rapid u·ansient pu lse on a negated part of

The pulse measured in Figure 24 has approximately range below 2 V. This negated stare is a bit higher than half the amplitude of the pulse in Figure 23. This is usuallv fo und, so the bus pu lse is starting from a higher more red uction in amplitude than one wou ld expect point. If the pulse h:td started ti·om a lower point, to r from the removal of 10 pF from the effective device example about 2.5 V, the pu lse amplitude wou ld not capacita nce, and this diffe rence, while not completely have been as large. Further discussion of the recei1·er explained, is in the b1·orable direction. The noise pulse detection range appears later in this section. that reached the next device (where it could be The signals in Figure 25 were pu rposely chosen detected as an error) would be even smaller, because of to have broad falling edges of approximately 15 ns. the dispersion and attenuation in the bus and because Normal SCSI signals arc 5 ns or Elster.The broad edges the neighboring de1·ice would need to have its 25-pF maximize the chance that the bus pulse \\'ill produce a capacitance charged also. The signal at the measure­ signal slope reversal ohbe type that can produce mul­ ment poi nt 2 meters away in Figure 24 indicates tiple edges. The bottom trace in Figure 25 shows a bus the intensity of the attenuation and dispersion to be pulse in the most sensitive part of the bl l i ng edge. This expected in the ra pid transient bus pulses. The details pulse produces almost no slope re1·crs::llbecause bl' the

3.5 U1 r- ...J 0 2:. 2.5 \2 METERS AWAY w CJ " .....�------

(/) '·· •' ::::> '-..- (l) \ 2.5 AT CONNECTOR

NANOSECONDS PER DIVISION 5

Figure 24 Bus Pulses \\'ith :-..!o Scope Probe Att:�ched to the l)c,·ice

4 . U1 2 r- ""' BUS PULSE ON ...J 0 NEGATED STATE 2:. 0 w CJ ------�- (l) 0

20 NANOSECONDS PER DIVISION

Figure 25 Bus Pulses on Actll

Digital Tc( hnical journal Vol. 9 �o. 3 1997 29 time it is ready to become positive-going, the data sig­ Howc,·cr, the anticommon mode case will always have nal has fa llen so much that there is no voltage source the positive and negative signal lines within a diffe ren­ to drive the signal more positive . At th e beginn ing of tial logical voltage level of ground, and the transients the fa lling edge , the slew rate is increased bv the bus \\'ill therefore be small. Even in the anticommon mode pulse; in the middle, the edge is extended and collSc­ case, the dkct is at most a slight shift in the ti me \\'hen quently the overall ti me required tor the fa lling edge is the diffcrcmial state change is obsen·ed, since the tran­ almost exactly the same as tor the tilling edge that has si ent d istu rb:111ces are so sm::li I. no bus pulse (see the top trace ). In the pathological differential case, large com mon Therefore, the main effect of rapid transient pulses mode levels exist on both the positive and the negative occu rs when they intersect the signal edges (where a signals. 'The insertion transient will be larger because state change is expected :.111\WJ\' ), and the ef!Cct is th e bus volt:�gc is larger. This c1sc is even more rare since movement of tbe position of the ed ge bv no more it requires both coincidental pin mating and coinci­ than 2 ns fr om the normal position. This mo,·cmcnt is dental Lu·gc common mode. already accounted fo r in the SCSI st:tndard as jmlsl! The other case considered that can have a u nique distortion ski?W, so there is no important eftect. etlect on ditkrentiaJ systems is that ofextended bounce. If the mating event happens while the bus signal is This c1sc extends the effective mating time to the point in the asserted state, there is little effect since little when some overlap between the transient activi ty on the charge is transferred. If the event happens in the rising pins is more likelv. Re call that the extended bounce case edge, there may not be enough \'Oitage diffe rence to \\'JS onlv ,·isible when a leakage mechanism was ;l \';1ilable start a rapid transient-again, there is little eftCct. lf:t to discharge the incoming dc,·ice capacitance . In actu:tl rapid transient is initiated on a rising edge, the imp:tct de,·iccs, no signiticam le:1 kagc occurs so a bounce event is still a sm:dl shift in the position of the edge. In docs 110t produce disturbances after tl1is tirst contact. any arbitrary combination of signal level and type or' The difkrcntial signa l seen by the i ncom ing device transient, the bus disturbance will not be greater than may be scriouslv affe cted lw extended bounce if there those shown in Figure 24 and figure 25. is bus acri,·itv during this bounce. Consider, fo r exam­ ple, a c1sc in which the positi,·c signal cont�K t opened Differential bcctuse of a bounce C\'ent after achieving a ru ll ch�1rge. For diffc renti

30 Dip;iralTcc hniul journal Vo l. 9 �o. 3 1997 The worst-case differential tr:1nsienrs occur when 4. ANSIX3 .253-1995· 1nformation Te chnolog)I -SCSI-:3 one treats the differential system as two independe nt Parallel lnteJface (SPJ ) , X3TJ0/855D ( New York: American i single-ended SCSI buses-one fo r the positive signal National Standards Institute, 1995). Ava lable fi·om Global Engineering, IS Inverness Way East, Engle­ and one fo r the negative signal. wood, CO 80 112-5704, tel. (800) 854-71 59 or (303 ) The rapid transient becomes more and more 792 -2181, bx (303) 792-2192. detectable :1s bus speeds increase and the receivers and timing margins become more sensitive. Schemes to 5. SCSI Enhanced Parallel lnteJface (EP!), X3 T7 01 encourage the gradual transient are the best protec­ 1143D ( New Yo rk: Ame1ican NationaJ Standards Insti ­ tion against the ultimate problems caused by rapid tute, 1997). Available fi·om Global Engineering, IS Inver­

ness Way East, Engle ood , 801 1 2-5704, te l. (800) transients. The best-known method tor producing w CO 854-7159 or ( 303) 792-2181, tJ.x (303) 792 -2192. reliable gradual transients is to avoid a metal-to-metal contact during the initial contact and until the device 6. Information about 1/0 inredaces is avail able ar rhe T I 0 capacitance is ch::�rged . At this time, no such connector home page ar http:/jwww.s vm . com/t !O. system exists fo r SCSI applications. 7. Single Connector Attachment-2 ( SCA - 2 ) Specifications: SFF-80 15, SFF-8046, SSF-8048 , SSF-8066, and SSF- Overall Summary 845 1. Avai l a ble fi:om the SFF Committee, 14426 B lack Walnut Court, Saratoga, CA 95070, voice fa x -back ser­ Evolution in to ur significant hardware technologies vice (408 ) 74 1-1600, tel. (408) 867-6630 ext. 303, in the recent p::�st has enabled parallel SCSI to break fa x (408) 867 2115. through the barriers that were preventing it from delivering excellent value, flexibility, and growth to the Biography computer data storage industry. Application of more scientific methods, use of the latest silicon technology, and developments in the interconnect tedmology pro­ vi ded the to undation fo r these improvements. DIGITAL provided most of the basic data and led important stan­ dards and industry bodies to accompljsh this.

Acknowledgments

Fee Lee, Keith Childs, Chuck Bagg, Pak Seto, and Jonathan Salles were instrumental in developing the special test environment and fo r acq uiring much of the data presented in this paper. The author gratefully William E. Ham acknowledges the management and technical support Bill Ham joined DIGITAL in 1983 and has been working ti·om Pete Korce, Mike Chamberlain, Bob Passmore, in storage technology since 1990 as rhe manager of the Richie Lary, Ken Chester, Bob Rennick, Laura Storage Bus Technical Office (SBTO ). The SBTO group Wo odburn, and Ellen Lary. represents D I C lTAL at most of the i mportant standards and industrv bodies rh at are invoh·cd wirh the rransmis�ion of storage d·ata . Presently, these groups are SCSI, STA, References Fibre Channel, and SFF. The SBTO group also creates information rr om acruaJ laboratory testing and was respon­ I. ANSI X.-3 277- 1996 Information Te chnology sible fo r introduci ng Fast-2 0 tec hnology to ANSI in 1993

X3 T9 2/3 75R SCSI-3 Fa st-20, X3 T10/1 071D (New and hot-plugging tech nology ro SCSI in 1992. Bill has a York: American National Standards Institute, 1996 ) . Ph.D. in electrical engineering trom Southern Methodist Av ailable tr om Global Engineering, 15 Inverness Way University in Dallas, Texas, and has been active in the elec­ tronics ind ustry i n a variety of technologies since 1970. He East, Englewood, CO 80112-5704, rei. (800) 854- has held significant positions in silicon chip, printed circuit 7159 or ( 303) 792-2 181, fa x ( 303) 792-2192. board , mulrichip mod ule technology, and storage bus tec h·

2. Ve ry High Density Cabled Interconnect (VHDCI) nology. In ad dition, he is thepast technical editor ofrhe Specirlcation , SFF-8441. Available ti·om the SFF Com­ SSA physical �tandards, past editor ofrhe new SPI -2 SCSI standard, editor of the entire bmilv of SFF connector mittee, 14426 Black vVa lnut Court, Sara toga , CA documents, editor of the new Enha1;ced Parallel lntei·r;K c 95070, voice fi x- back service (408 ) 741-1600, tel. ANSI project ror SCSI, and secreta rv for rhe Fi bre Channel (408 ) 867-6630 ext. 303, fa x (408) 867-2115. . physical standards group (T1 1.2 ) . 3. ANSI X-.31.)1-1994. information 5_ystems-Small Computer S)•stem Jnterfa ce-2 (SCS!-2 ), X3T9.2/375 R ( New Yo rk: American National Standards I n sti tu te, 1994 ). Available fr om Global Engineering, 15 Inverness Way E�st, Englewood, CO 80 I 12-5704, tel. ( 800) 854- 7159 or (303) 792- 2181, f.u (303) 792-2192.

DigiraJ Technid Journal Vo l. 9 No. 3 1997 3) I Peter L. Higginson Michael C. Shand Development of Router Clusters to Provide Fast Failover in IP Networks

IP networks do not normally provide fast A DICITAL ro uter engineering team has refined and failover mechanisms when IP routers fail or extended routing protocols to guarantee a five-second when links between hosts and routers break. maxim um loss-ot� serYice rime during a single fa ilure in an Internet Protocol (IP) network. We use the term In response to a customer request, a DIGITAL router clusterto describe our improved implementa­ engineering tea m developed new protocols tion . A router cluster is defined as a group of route1·s and mechanisms, as well as improvements to on the same local area network (LAN ), prO\'id ing the DECNIS implementation, to provide a fast mutual backup. Ro uter clusters have been in service fa ilover feature. The project achieved loss-of­ since mid-1995. service times below five seconds in response Background to any single fa ilure while still allowing traffic to be shared between routers when there are The Digital Equipment Corporation Network Integration no failures. Se[ (DECNIS) \·er bri dge/router is a midrange to high-end pwduct designed and built bv a DIGITAL Ne tworks Product Business Group in Reading, U.K.1 The DECNIS performs iligl1-speed routing of II\ D ECnct, and OSI (open system interconnection ) pro­

tocob and can b�we th e fo llowing network interfaces : Ethemet, FDD I (fiber distri buted data inte rface ), AfM (asvnclmmous transter mode), HSSI (High­ Speed Serial I nter fJce ), Tl/El (digital transmission schemes), �md lower-speed WAN (wide area network ) interbccs. The D ECN!S bridge/router is designed

around a backplane, with a number of semi-autonomous line cards, a hardware based address

lookup engine, :- 111 d

D:1ta packe rs arc normallv handled co mpletely by the line cards and go ro the central processor on ly in exception cases. The DECNIS routers run a number ofhigh-pro rile, high·a\'ailabilit\', \\'ide-area data n etworks to r tele ­ phone scr\'icc pro"iders, stock exchanges, and che mi­ cal companies, as well as fo rming the b�Kkbonc of DIGITAL's internal network. Tv pic1JI\', the D EC NfS routers are d eployed 111 redundant groups \\ 'ith diverse interconnections, to provide \Tr\' high availability. A common req uirement is never to take the nerwork down (i.e., dllling mainte­ nance periods, connectivity is preserved but redun­ dancy is reduced).

.i2 Dirrirc1l Tcchnie<1l lnurnc1l Vo l. 9 No. 3 1997 Overview manages the transkrL1 1 and passes information about the call. The application uses User Datagram Protocol !!' is the most \\'ideh· used protocol to r communication ( UDP) p:1ckcts in IP 11 ith retransmission fi· om the bcrwcen hosts. Routers (or g�1tcways) arc used to link application itself. hosts that arc nor directlv connected . 'vV hen I P was Because this application requires a high leve l of data origin�1lly designed , duplication of \1\TANlinks was com­ net\\'Ol'k a\·ailabilirv, network designers planned a mon but duplication ofgatc\\'ays rix hosts \\'�1s rare, and duplicate nct\\'Ork with manv paired links :md some no mechanisms t(x avoiding bilcd routers or broken mesh connections. Particular problems arise \\ 'hen the links between hosts and routers we re developed. human initiator becomes impatient if there arc debvs ; In 1994, we began a projccr to rcsrrict loss-of however, the more critical requirement was one over sen·icc rimes ro bcJm,· fi ,·e seconds in response to anv which the network designers had no control. The single bilurc; fo r example, r:1ilurc of a router or its source of the calls is another system rhat makes a single elecrrical supply, t:1ilure of a link between routers, or high-level retransmission after ri ve seconds. If that rJ ilurc of the connection between the router and the retransmission does not receive a response, the "'hole LAi'\ on which the host resides. In contrast, existing svstem at the site is assu med to ha\'C L1 iled . This lc:Jds routi ng protocols IJa,·c rcCO\'Ct'\' times in the 30- to to new G1 1ls being routed to other scn·ice sires or sup­ 45 -sccond range, and bridging protocols arc no bet­ pliers, and manual intervention is required . ter. Providing rast bilover in Tf' networks required To resolve this issue, th e customer requested a cnh:111ccmenrs to mam' areas of the router's design to networking s�'stem th�lt \\'Ould recover fr om :1 single co,·er all the possible f1 ilure oscs. It also required the fa ilure in arw link, interf:1cc, or router within a ti,·c ­ im·cntion of ne\1' protocols to support rhc host-rourcr second period . The srandard test ( \\'hich both the cus­ inrcraction under IT'. This was ac hieved withou t tomer and we use ) is to start a once-per-second pi ng, requiring anv changes to the bosr IP code. and to expect to drop no more th:m fo ur consecutive · In rhis paper, \\'e start bv discussing our targets and ping packers (or their responses) upon llow this with a detailed a rdysis of the difkrcnt disruption when the bilcd component recovers. hilure c.1ses. We then show how we have modified the To meet the customer chall enge , the router group bc h�l\'ior of the routing conrrol protocols to

This problem is :1 \'oided by the adoption mechanism si bi!' throughout the \\'hole nct\\'ork; and a route w wi thin the router cl uster. H < i n g met these �1ims, as recalculation time during which 3 new set of routes is we ll �1s fast fa ilover, we ca n justifiably call the result caJculated and passed to the forward ing engi ne. router clusters. Detection times arc in the order of tens ofscconds;

fo r example, 30 seconds is :1 common dctault. ·rhc t\\'0 The Customer Challenge most popular link-stare routing control protocols in large IP net\vorks :1re Ope n Shortest Path First A p�1rricular customer, a telecommunications service (OSPF)2 and Integrated Intermediate Svstem-to­ provider, has an Inte lligent Sen·ices Net\\'ork <1 pplica­ Inte rmcdi�lte Svstem (lmegrated IS-IS) -'Th ese proto­ rion Lw which ,·oicc G11is can be rranstCrred to another cols ha,·e distribution "hold dO\ms" (to limit the operator at a difkrcnr location . The data network impact of· route tl aps) to prevent the generation of a

Digital Tec hnical )ourn�l Vol.9 \!u ,, I <.)97 33 new control message within some interva l (typically 5 or 30 seconds ) of a previous 011e. The distribution of the new information is rapid ( tvpicalh· less than one second), depending p rimarily on link speeds ;md network diameter; however, the distribution may be adversely affected by transmission errors which require retr:msmission. The default retransmission ti mes after packet loss vary between 2 and 10 seconds. The route

recalculation typicaJJy 1kt� es less th:m one second. ROUTER CLOUD These values result in total rccoverv times after fa ilures (for routing protocols with default settings ) in the 45- to 90-second range. Distance vector routi ng protocols, such as the Rou ting Information Protocol (RIP),4 tvpically take even longer to recover, parth• because the route com­ putation process is inherentlv distributed and requires multiple protocol exchanges to reach com·ergence, and partly because their timer settings tend to be tixed KEY: at re latively l on g settin gs . Conscqucntlv, th eir usc is ----- POSSIBLE MANAGEMENT LAN not fu rther considered in th is paper. ALSO PROVIDING REDUNDANT PATH Simi larly, bridging protocols, as stancbrd, use a 15- second timer; one of the worst-case recovery situations Figure 1 requires three timeouts, making 45 seconds in all. T1viccll Configuration fiJ r Ro uter Clmter Usc Another bridging recovery e1se requires :m u nsolicited data packet fr om a host and this results in an indeter­ minate time, although a ti meout will c1 use Hooding after a period . routers are DECNIS 500 routers with one or two In lP protocols, there is no simple \\'a\· tcJr a host ro WA N links and t\\'o Ethcrnets . The second Erhernetis detect the fa ilure of its g:lte\\ ay; nor is it simple t(Jr a used as a management rail and as a redundant local router to detect the fa ilure to communicate with a path betll'een routers one and two (Rl-R2) and host. In the fo rmer case, several minutes may pass between routers three and to ur (R3-R4). before an Address Resolution Protocol ( Al\2)entrv In rhc original plans tOr the customer network , the ri mes out and an alternari1-c gate\\·ay is c hosen; fo r rourer cloud consisted of' groups of routers ar two or some implementations, recovery mav be impossi ble th ree central sites ;m d pairs of links to tl1e host sites. In without manual intervention. Failure to communiote designi ng our sol u tion, however, we tried to allow any with a host m:1v be the result of fa ilure of the host numbn of routers on each LAN, interconnected bv a itself, which is outside the scope or' this project. gener.1 l mesh network. For test p urposes , both \\'e and Alternatively, it may be due to f: 1ilure of the LAl' , or the cusromer used this set-up ll'ith direct Rl-R3 and the router's LAN interface. In this case, there exists an R2-R4 Tl links as the network cloud. alternative route to the LAN through another router, ·we have to consider what h appens to packets travel­ but the routing protocols \\'ill not make usc of it unless ing in each direction dming a fa ilure: there is little gai n the subnet(s) on the LAN :11-c declared u nreachable. in deli1·ering the data and losing the acknoll'ledg­ Th is requ ires either manu�11 inten-cntion or timely ments. Since the direction of data f1ow docs not give detection of the LAN fa i l ure by the router. rise to additional complications in the network cloud, there ;\rc just t\\'O bilure cases: Analysis of the Failure Cases l. f;lil u rc of a router in the nctll'ork cloud

The tirst task in meeting the customer's challenge was 2. F:1i lure of a link in the network cloud

to analyze the various fa ilure and recoverv modes and We kee p these cases distinct because the r:1ilure and determine which existing managem ent parameters rccOI'CI'\' mechanisms are sl ighth' diffe rent. could be tuned to improve recoverv times. After th;lt, We �1 lso need to consider a fa ilure local to one oftiK new protocols and mechanisms could be designed to LANs on which the hosts are attached. A bilure l1ere ti ll the remaining shortcomings. has rwo consequences: ( l) The packets originated by 0 A typical net\\'ork configuration is sb 11'11 in Figure I. the host must be sent to a differentrouter, and ( 2) The The target net\\'ork is similar but has more sites �m d response packets trorn the other host through rhe net­ many more hosts on each LAN . M;lll\' of the site wod<. cloud must ;l lso be sent to a differenr router, so

34 Digit;tl Tt chnic:tlJournal Vol .')�o 3 1':1':17 that it can send them to the host. We break down this The customer requested enhanced routing, and the type oftailure into the fo llowing three cases: existing nenvork was a large routed WAN, so enhanc­ ing bridging was never seriously considered. Our expe­ 3. Packets fr om the host to a fa iled or disconnected rience has shown that the IS-second bridge timers can router be reduced only in small, tightly controlled networks 4. Packets to the host when the router fai ls and not in large WANs. Consequently, bridging is 5. Packets to the host when the router intert�1 ce fails unsuitable fo r fa st £1 ilover in large nenvorks. For link-state routing control protocols such as Note that we arc using the term router interfa ce OSPF and Integrated IS-IS, once a fa ilure has been fa ilure to include cases in which the connector fa lls detected recovery takes place in nvo overlapping out or some fa ilure occurs in the LAN local to the phases: a tlood phase in which information about the router (such that the router can detect it). In practice, fa ilure is disu-ibuted to all routers, and a route calcula­ fa ilure of an interface is rare. (Removing the plug is tion phase in which each router works out the new not particularly common in real networks but is easy routes. The protocols have been designed so that only to rest.) Figure 2 shows these fa ilure cases; this con­ figurationwas also used for some of the testing. local fa ilures have to be detected and manageable para­ meters control the speed of detection. Recovery of a link that previously tailed causes no problems because the routers will not attempt to use it Detection of fa ilure is achieved by exchanging Hello until afterit has been detected as being available. Prior messages on a regular basis with neighboring routers. to that, they have alternate paths available. Recovery Since the connections are usually LA.t� or Point-to­ of a fa iled router can cause problems because the Point Protocol (PPP) (i.e., \\'ith no link-layer ackno\\'1- router may receive traffic before it has acquired suffi­ edgments), a number of messages must be missed cient nenvork topology to fo rward the traffic cor­ before the adjacency to the neighbor is lost. The mes­ rectlv. Recovery of a router is discussed more fu lly in sages used to maintain the adjacency are independent the section on Interface Delay. of other traftlc (and in a design like the DECNIS mav be the only traffic that the control processor sees ). Can Existing Bridging or Routing Pro tocols Achieve Typical default values are messages at three-second 5-Second Fa ilo ver in a Network Cloud? intervals and lO lost for a fa ilure, but it is possible to In this section, we discuss the bilure of a router and the reduce these. fa ilure of a link in the nenvork cloud (cases l and 2 ) .

FAILURE CASE 5 ------1

FAILURE CASE 4 ------j

FAILURE CASE 3

FAILURE CASES 1 Failure of a router in the network cloud 2. Failure of a link in the network cloud 3. Packets from the host to a failed or disconnected router 4. Packets to the host when the router fails 5. Packets to the host when the router interiace fails

Figure 2 Diagt·am of"Fai lure Cases Targeted to r Recovet·y

Digital Tec hnical JounlJI Vo l. 9 �o 3 1997 35 Decreasing the Routing Ti mers serious case. Howe1'er discussion of this is deferred to The detault timer values arc chosen to red uce over­ the later section The Designated Router Problem. heads, to cover short outages, and to ensure that it is Our target ti1T seconds is made up of three seconds not possible to r long packets to cause the adjacencv to to r the bilure to be detected , leaving two seconds for expire accidentally bv blocking Hello transmission. the inf(mmrion about the failure to be fl ooded to all

(Note transmission of a 4,500-bvtc packet on a 64 routers and tor them to recalcu.late their routes. kilobit-per-second link takes half a second, and q ucu­ Within the segment of the network where the recov­ ing would normally require more th:�na packer ti me.) ct·v is required , this has been achieved (with some tun­ However, with high-quality Tl or higher link speeds ing ofthe software ). in the target network and priori tv queuing of Hellos in the DECNIS, it is acceptable to send the Hellos at one­ Recovery from Fa ilures on the LANs Local to the second intervals and count three missed as a r:1ilurc. End Hosts (Although we have successfu lly tested cou nts of two, we do not recommend that value tor customers on The prc1·ious section shows that we can deal with router WAN links because a single link error combined with a failure J.nd link bilure in the network cloud (cases l and

delay due to a long data packet would cause a spurious 2 ). Here we consider cases 3, 4, and 5, tl1 ose that deal fa i I u re to be detected. ) The settings of one second and with t:li lures on the LANs local tO the end hosts. three repeats were within the existing permitted From the point of view of other routers, a t:1 ikd ranges fo r the routing protocols. router on a LAN (case 4) is idc11tiul to a fa iled router in

When these shorter timers arc used, it is important the network cloud (case l ) : a router has died , and tbe that any LA Ns in the network should not be over­ other routers need ro route around it. Failure case 4 loaded to the extent that transmissions arc de laved . tllerdorc is remedied b)' th e timer adjustments The network managers should monitor WAN lin ks described in rbc previous section. Note that these timer and disable any links that have high error rates. Given adjustments arc an integral part of the LAi'J solution, the duplication of routes, it is better to disable and ini­ because tbcv allow the returningtraf fic to be re-routed.

tiate repairs to J bad link tlun to continue a poor ser­ These rimer adjustments c

In some cases, a t:1iled link mav be detected at a lower router interrace bils-tor TP req uires that the router level (e.g., modem signals or FDDI station manage­ c:tn detect a bilurc of its interface (for cxarnp.le, that ment) well betorc the routing protocol realizes that it the plug has been removed). If the LAN is an FDDI, has stopped getting Hellos and declares the adjaccncv this is trivial and \'irtually instantaneous because con ­ lost. (This can lead to good resu lts during resting, but tinuous signals on the ring indicate that it is worki ng it is essential also to test link-t:1 ilurc modes that arc not and the inrcrtacc directly signals fa ilure. For Ethernet,

detected by lower levels.) In the worst c�1sc, however, we fKcd J number of problems, partly due to our imple­ both the detection of a fJiJedrou ter or the detection of mentation :m d partlv due to tl1e nature of Ethernet itself a fa iled link rely on the adjacency loss and so ha\'c the We tormed a small ream to work on this problem alo11e. same timings. Because of the v

Loss of an adjacency causes a router to issue a be attached, tl1crc is no ctirectin dication oftailure: only an revised (set of) link-state messages reflecting its new indirect one bv fa ilure to successfi.tl ly transmit a packet view of the local topology. These link-state mcss:�.ges within a one-second interval. For maximum speed, the arc fl ooded throughout the network and c1usc cvcrv DECN IS implementation gueues a ring of eight buffe rs router in the network to recalculate its route tables. on the tra nsmit intertace and docs not check tor errors However, because the two or more routers will nor­ until a ring slot is about to be reused. This means tl1at an mally time out the adjacency at ditTcrent times, o11c error is onlv detected some time afterit has occurred, con­ message arrives first and causes a premature n:calcula­ suming much of our five -second budget. tion of the tables. Therefore it may require a subse­ The control software in the DECNIS management quent recalculation of the route tables before a new processor has no direct kn owledge of data traftl.c two-way path can be utilized. We had to tunc the because it passes directly between the line cards. router implementation to make sure that subsequent Thcrctorc it sends test packets at regular intervals to recalculations were done in a speedv manner. tinct ou t ifthc interface has fa iled . By sending large test During initial testi ng of these parameters, we discol·­ p

36 DigitJl Tcchnic1l jourml Vo l. 9 No. � 1997 reduced rhe rimers and increased rhe fi-equencv of rest than two rou ters in a cluster, and to continue to route packets to be able to detect interface fa ilure within traffic by reasonably optimal ro utes. In addition, we three seconds. (The rest packets have the sender as wished to not confuse network management protocols destination so that no one receives them and, as usual , about the true identitv of the routers involved �m d, more than one hilure to transmit is required betore if possible, to share trafflc O\'er rhe WA N links where the interbce is declared unusable.) appropriate. This initial solution caused several problems when it was deployed ro a wider customer group; we had more Electing a Primary Router complaints than previously about the bandwidth con­ In our solution, rhe ti rst req uirement is tor other sumed by the rest messages and, more seriouslv, a routers on the LAN to detect that J router has railed or number of instances of pre,·iously working networks become discon nected , and to have a primarv router being reported as unusable. These problem net\\'orks elected to org:\ll ize recoven'. This is achie,·cd bv :�II \\'ere either exceprionallv busv or had some otherwise routers broJdcJsting packets (called IP Standbv undetected hardware problem. Over time, the net­ Hellos) to other routers on the LAN e\·en' second. works with hardw:1re problems were fixed , and we The highest priority (with the highest IP address mod ified the ti mers to avoid hlse triggering on very breal<.i ng ties) router becomes the primarv router, and busv networks. Clearly, the three-second target fa ilure to receive IP Standbv Hellos fr om another

required more thought. router fo r 11 seconds (three is the ddault) causes it to Se,·eral enh hosts) is both "deprecated" by the RFCs and has tlxcd request (using rhc standard ARP protocol") tc) r rhc 45-second timers that exceed our recoven· target. ra ke IP add ress and thus takes the traffic fr om the host. Customers ha,·c a \\ 'ide range of I l' implementations AfterJ bilurc, a newlv elected primarv rou ter broad ­ on their hosts, and reliance on nonstandard fe atures is casts an ARP re quest containing rhe inf-ormation tlut difficult. The particular target application t-o r this work the f:1ke IP address is now associated with the new pri­ ra n on person:�[ computer systems with a third-party mary router's MAC address. This causes the host to IP stack, and we obtained a copy r(>r testi ng. Such IP update its ta bles and to fo rward �1 ll rraftic to tht: new stacks sometimes do not have sophisticated recO\·ery primarv router. schemes and discussion with various ex perts led us to The sending of TCMP redirects' bv rhe routers has to be lieve that \\·c should not re lv on anv co-operation be disabled in ALU> mode . Redirects sent bv a router t-rom the hosrs. wou ld uuse hosts to send traffic to an !P address other Among other objectives, we wanted to be inde­ than the fa ke JP address controlled by the cluster, and pendent ofthe routing control protocol in use (if any), recovery J-i-om El ilure of that router would then be to permit both a mesh stvle of networking and more impossible. Disabling red irects causes an addition:�!

Digiral T� (hniul I uu rnal Vo l. 9 �o 3 l ')97 37 IP problem. If the pri marv router's WAN link fa ils, all the In MAC mode, the hosts are configured with the packets have to be inefficiently t(Jrwarded back over the add ress of any rou ter in the cluster as the default gate­ LAN to oth er routers . To avoi d this problem, we i ntro­ w;w. (The concept that it does not matter which router duced the concept of monitored circuits, whereby the is chosen is one of the hardest fo r users to accept. ) priority of a router to become the primarv depends on Som e load shari ng can be achieved by sett.i ng different the state of the WA..t'\J link. Thus, t he pri m<�rv router addresses in di tk rent hosts . changes when the WAN link fa ils ( or all the links ta il if Si nce tl1e DECNIS is a bridge router, it has the cJ.pa­ there are several), and tl1e hosts send the packets to the bi litv to receive all packets on Eiliernet and many MAC new pri mary (whose WAN link is still intact). addresses on FDD!; thus a l l packets on all t he special

ARPmode has a number of dis

because redirects have ro be d isabled . The monitored associated with the unus �:d DECnet area 0. Thev are circuit concept works onlv on the first ho p fi·om the ideal because the\' are part of the locallv administered router; more distant fa ilures cannot change the IP group �m d ha,·e implementation efficiencies in the

Sta ndby p riotitv and 11l<1Y resu lt in inefficicllt routing:. DECNIS because the DECnet hi-ord (AA-00-04-00 ) is Most seti ol!Siy, tlKr u les 1:or hosts acti ng on information already decoded , and thev :u-e 16 addresses differing in in ARP requests have only a "suggested implementa­ one nibble only (i.e., AA-00-04-00-0x-00, where xis tion" status in the RFCs,and we fo und several hosts that the hexadecimal index or· th e ro uter in the cl uster). did not change when req uested. or were ,·erv slow in Note that A� requests sent bv the rou te r m ust also doing so . (Note thJ.t we did consider broadcasting an contain the specia l MAC address in the sou rce hard­ AlU) response, but there is no allowance in rbe specif-ica­ \\·;m: address field of rhe A� packet, otherwise the tions fo r tl1is message to be a broadcast packet, \\'hereas hosts' ARPt a bles rn::l\'be updated to contain the \\'rong an ARP request is norrnall v a broadcast packet.) MAC address. MAC mod e has minor d isad vantages. Initiallv, it is MA C Mode IP Standby (to Re-route Host Tra ffic) e

To solve these problems, we looked t<.x :1 mechan ism ever, this can be lost alter red i rects. In addition, a smal l that did n ot relv on any host particip:ttion. The result chance of packet duplication exists during recoverv \\'as what we termed MAC mode . Here, eKh router because there ma\' be a short peri od when both uses i ts own IP address (or addresses t( >r multiple sub­ routers are recei,·i ng on the same special MAC ad dress nets) but ans\\'ers A� requests with one of a group of ( 11·hich does no t happen in AlUJ mode because the

special J\ti.ACa ddresses, confi gured t

part ofthe rou ter cl uster conf-iguration . When ::1 router prefe ra ble to a period when no router is recei ving on

fails or becomes d iscon nected., the pri m ar y (or the rhat address. newly elected primary) ro uter adopts the ta iled router. By adopt, we mean it respond s to ARPrequ ests fo r the Interface Delay fa iled router's IP address with the bilcd router's spe­ Reccntll', \\·e added an i nterrace deJa\' option to ame ­

cial MAC address , and it recei,·es and t( Jr\\·ards J.ll liorate a situation more likelv to occu r in large net­ packets sent to the biled router's special MAC address works. Jn tl1is situation, a router, re booting after a (in addition to traffic sent to the prim:�ry router's own power loss, a reboot, or a crash , reacquires irs special special MAC address and those of ;my other tailed MAC address bdcJt-c it has received all of the rou ti ng routers it has adopted ) . updates ti·om nei ghboring routers and thus drops The immediate ad vantages of J'v1 AC mod e are that r�lckets sent to it (ond worse, returns "unreach:Jble" to

ICMP red i rects can conti nue to be u se d , and, p ro,·id ­ the host). Tvpicallv, the main LAN i n iti al ization wo uld ing the red irects are to routers in the cl uster, the fa st be deLl\'ed t( Jr 30 seconds wh ile rou ti ng table updates

fa ilover will continue to protect :�gainst fu rther ta il­ were re ceived m·e r the WAN interfaces and an\' other ures. The mechanism is complete l y tr:111sparent to the LA N interfa ces. The backup continues to operate dur­

hosr. In a cl uster with more than two ro uters, the pri­ ing this 30 seconds. (Note that with I ntegrated IS-IS, mary router will use redirects to cause traffi c ( resulti ng we could. have d el ayed IP on the whole router, but we ti·om failure ) to use other routers in the cluster ifthev did not do this because it would not have worked fo r

have better routes to specific destinations. Thus multi­ OSPf, \\'hich requ ires !P to do the u pdates . ) We use a

ple routers in a cluster and mesh llct\\'orks are sup­ fi xed configurablc time rather than a ttem pti n g to

ported. This al so sokes tl1e pro bl em of hosts n ot detect the end of u pd::1tin g, becau se determ ini ng com­ timing out redi rects (an omission com mon to manv IP ple ti on is diffi cult if the network is in a stare of flux or i mp le mentations derived from BSD ) , because the redi­ the router's IVA N links are down. rected address has been adopted .

38 Di�irJI Te chnical )llurJLll \'ol. 9 ;-.;o. 3 1997 Redirects and Hosts Th at Ignore Th em Special Considerations for Bridges When a router issues an ICMP redirect, the RFCs state We do not recommend putting a bridge or layer 2 that it must include its own I P address in the redirect switch benveen members of a router cluster, because packet. A host is req uired to ignore a redirect received during fa ilover, action wou ld be required fr om the fr om a router whose IP address is not the host's next bridge in order fo r the primarv router to receive pack­ hop address fo r the particular destination address. ets that previously were not present on its side of· the Therefore, it is necessary to ensure that the address of bridge_ We cannot rely on this bei ng the case, so we the fa iled router is correctly included when issu ing a must have a way of allowing bridges to learn where the redirect on its behaiL In the DECNIS implementation, special MAC addresses currently are. More impor­ because the destination MAC address of a received tantly, if bridges do not know where the special MAC packet is not available to the control processor, the pri­ addresses are, they often use much less efficient (Hood ­ mary router cannot tell wbetl1 er a redirect has to be ing) mechanisms. issued on be halfofitselfor one of the adopted routers. For greater traceability (and simpler implementa­ The primary router therefore issues multiple redi­ tion), we use the router's real MAC address as the rects-one fo r each adopted router (in addition to its source address in data packets that it sources or tor­ own). Since redirects are rare, this is not a problem, wards. We use the special MAC address as the source but they could be avoided by passing the MAC desti ­ address in the IP Standby Hellos. Since the Hello is natjon address oft he original packet (or just five bits to sent out as an lP multicast, it is seen by all bridges or f1ag a special MAC address and say which it is) to the switches in the local bridged network and causes them control processor. to learnthe location of the address (whereas data pack­ It is contrary to the basic IP rules fo r hosts to ignore ers might nor be seen by non-local bridges). Since we redirects 8 Despite the rules, some hosts do ignore are sending the Hellos every one second anyway, there redirects and continue sending traffic which has to be is no extra overhead. sent back over the same LA..l'\1.These cause problems in vV hen a pri mary router has adopted routers, it cycles all networks because of the load, and, in the DECNIS the source MAC address used fo r se nding its Hello implementation, because every time the line card rec­ bet\veen its own special MAC add ress and those ofrhe ognizes a redirect opportunity, it signals the control adopted routers. We also send out an additional Hello processor to consider sending a redirect. This may immediately when we adopt a router to speed up happen at data packet rates and is a severe load on the recognition of the change. control processor, which slows down processing of Since the same set of special MAC addresses is used routing updates and might then cause our five-second by al l router clusters, we were concerned that a bridge recovery target to be exceeded. that was set up to bridge a non-lP protocol (e.g. , local To reduce the problems caused by hosts ignoring area transport [LAT]) but not to bridge IP, might be redirects, we improved the implementation to rate­ confused to see the same special MAC add ress on limit the generation of red irect opportunity messages more than one port. (This has been observed to hap­ by the line cards. We also recommend in the docu ­ pen accidentally, and the resultant meltdown has led mentation that, where it is known that hosts ignore us to avoid any risk, however slight, of trus happen­ redirects (or their generation is not desired), the ing.) Hence we make 16 special MAC addresses avail­ routers be connected by a lower-cost LAN than the able and recommend to users that they al locate them main service LAN (such as the management LANs uniquely within a bridged domain, or at least use dis­ shown in Figure l ). Normally, this would mean link­ joint sets on either side of a bridge. ing (just) the routers by a second Ethernet and setting its routing metric so that it is preferred to the majn The Designated Router Problem LAN fo r packets that would otherwise traverse back on the main LAN to the other router. This has t\vo advan­ vV hile testi ng router fa ilures, we discovered addition:d tages. Such packets do not consume double band­ delays during recovery due to the way in which link­ width and cause congestion on the main LAN, and state protocols operated on LANs. In these cases, the they pass only through the fas t-path parts of the fa ilure of routers not hand ling the data packets can router, which are well able to handle full Ethernet also result in interruption of service due to the control bandwidth. mechanisms used . In MAC mode, it is also possible to definea router For efficiency reasons in link-state routing proto­ that does not actually exist (but has an IP address and cols, when several routers are connected to a LAN, a special MAC address) and is adopted by another they elect a designated router and the routing proto­ router, depending on the state of monitored WAN cir­ cols txeat the LAN as having a single point-to-point cuits. Setting this as the default gateway is another way connection between each real router and a pseudo of coping with hosts that ignore redirects. router maintained by the designated router (rather

Digitol Tcc hnid journal Vol. 9 No. 3 ll)l)7 39 than cotlnections ber11·een all the routers ). The desig­ nor �' ure case. ) \Ve h�11 c nor, so rar, fel t that it is s nated rou ter issues link-st�1 te packe t on beh�1lf of the ll orth ll' hi l e to break the rules Lw �1 l lowin g a shorter pseud o router, showing it �1s having connections to hol ding time fm OSPF. each rea l router on the loc1 l LA �, and each router issues a link-state packet showing con nection ro the Co nclusions

pse udo router. This mech�1nism opcr�1tes in �1 bro�1dlv similar \\'�\\' in borh Integra ted IS-IS and OSPr:; the \Vc successfulh· designed rklo,ld exhibits hysteresis, thus minimizing unneccss�1rv des­ and imcrruptions ;1 fte1· bil u res of less than hve secon ds

igna te d router changes. in both LA� .1 nd \VA.N crll'ironments. This capabilitl' For ro uting ta ble calcu l�1tions, a n·�1nsit path o1 n the h:1.s been d e plm ed in the product since the middle of k LAl'-J is t�1 e n From a route1· to the fl<'eudo router and 1995. An Tm(rnet E11g,ine ering Task Force (IET I-' ) then to �m other router on the Hence �111\' clunge LAN. group is currenth' attempting ro produce a standard in pse udo router sLltus disrupts calcuL1tion ofrhe net­ protocol to meet this need .'' work map. l When a design�ued router tails, a s ew oF updates Acknowledgments l occurs; each rourer on the LA� oses rhc adjacenc1· m s the old desi gnated muter �1 11d issues �� 11 e11 l i nk - tate V1rious mem bers of tiJe route1· engineering team in prk �• s Ethernetc a ble -o u t detection. hr as the other routers arc concerned; the old desig­ nated router stavs, disconnected, in the tables r( >r as References s long as 20 min ute ro an hour. This h,1 ppens �lt le1 el I l •m d at lcl·el 2 in Integrated IS-IS, res u ting in tll'ice ]) lk1'h .1 nd S. Bn·ant, "The DFC :NIS 500/600 Multi­ x as manv updates. The interacti ons �l iT com pl e ; in protocol Bridge Ro uter ;1 nd Gatc\\·,n·," VJ,!< ii({/ l('chni­ general, the1· result in the en in multiple, ne11· 1c>l. no. s d g of col joumol :1, I (\Vinrcr 19l)3) : 84-9� link-stare mess�1ges . 2. ). J\.lo1·, "OS l'f' Versi o n 2," lntc rn cr Fnginrcring T�l Sk Apart h·om the pure distri bution and processing r I994 ) Force, ll..f'C 1:1�3 (M.1 c h problem of these updates and nn1 .l ink-sure p�1 ekets, there a1·e deliberate dcLws added . A minor one is th�u 3. l\. C1 llot1, "L\ c· of OSI IS-IS rc >r Ro uting; in TC:P/11' '' nd o " updates in Integrated IS-IS arc rare-limited on LANs Dtul l-:1\\ir n mull,, lnrcrnc r Engi nccri ng T1 sk (to minimize the possibilir�· of mcos.1ge loss ). A llLljor hm:c, lU:C: 1 195 (Decc111bcr 1990). une is rhat �1 particu r i n k-sr te p ke t c111not be 4 . Rou Jnform,Hiot1 .la J a ac C. llcdri..:k, ·· tin� Protocol," Inrcr· h u t Rl-'C u updated ll' ithi n a old in g rime from �� pre�·i ous pda e net 1-:nginecring Ll sk 1-'orcc, I 058 (I n e 198� ). (tolimit the number of messages <1Ctuallv g:encL1ted ). J. l'm rc l, "L>cr Dn;1gra 111 Protocol ," S R I \:cr11·ork The debult holding time is 30 seconds in Inrcgr.1ted J lnl(mnation C:cmcr, ,\ lcnlo P.uk, C1liL, lu-C: 7o� IS-IS; ir em be red uced to I second in the e1 ent 11-c (Au gu st l91W) r( >und th at the best solution 11'<15 ro allo\\' as manv as 10 n s s 6. D. l'lutJlllKr, "1-:thcrnct Ad dress Roo lurion l'rotocol," updates in a 1 0 -seco d period . The re a on r( Jr th i is g itl that the rl rst u pd ate usu�dh· (()l\t�1ins int(>rmation lmcrncr En c crin� L1sk 1-'orcc, Rl-'C 826 (\:on:mbcr 19�2 ) about the disconnection and it is highlv desi ra bl e ro get the update ll'i th the con ncction om �1s bst •ls possi­ 7. r. I'm re i, ··]ll[crllct Control J\ .lcsS•lgc l'rorocol," ]ll[erner ble. In addition, 111 th e 11·ider nef\1.<>1-k, �1 11 uplbtc can Engineering -Ll.'>k force, Rf'C 79 2 (Scpwnber 19� I ). overtake <1 nd replace a pre1·ious one. 8. 1\ . 1\r;lcicn, "Rc q uircnll'lltS f (Ocrobcr 1997). two routers, one m u st be the design�ued router so it is

4 Biographies

Peter L. Higginson Peter Higginson manages the advanced development work on router products fo r DIGITAL's Internerworking Products Engineering Group in Reading U.K. (IPEG· Europe). His responsibilities include improvi ng commu· nications on customers' large networks. Most recently, he contributed to the corporate Web gateway strategy and future router products. Peter was issued one patent on efficientATM cell synchronization and has applied fo r several other parents related to networks. He has published many papers, including one on a PDP-9 tor DEC US in 1971. Before joining DIGITA L in 1990, Peter was the sofuvare direcror fo r UltraNet Ltd. (now part of the Anire Group), a maker ofX.25 equipment. For 12 years before that, he was a lecturer in the Departmenr of Computer

Science, University College London. He received an M. Sc. in computer science from University of London in 1970 and a B. Sc. (honours) in mathematics fr om University College London in 1969. Peter connected the firstnon · U.S. host to r.he Arpanct in 1973.

Michael C. Shand Mike Sband is a consulting software engineer with DIGITAL's Network Products Business in Read ing, U.K. He is cu rrently involved in the design of !P routing algorithms and the system-level design of networking products. Formerly, Mike was a member of the NAC (Networks and Communications) Architecture Group where he designed DECnet OS!, Phase V routing archi· tecturc. Before joining DIGITAL in 1985, Mike was the assistant director (systems) of the Computing Centre at Kingston University. He earned an M.A. in the natural sciences from the University of Cambridge in 1971 and a Ph. D. in surface chemistry from Kingston University in 1974. He was awarded six patents (and has fiJ ed another) in various aspects of networking.

Digiral Technical Journal Vo l. 9 No. 3 1997 41 I

Lawrence G. Palmer RickyS. Palmer Shared Desktop: A Collaborative To ol for Sharing 3-D Applications among Different Window Systems

The DIGITAL Group has designed a An adi-�1 JKcd product de1·dopment effo rt undertaken software application that supports the sharing Lw gr�1phics engineers in the DICITAL Wo rkstations Croup led to the creation of a new som\'Jre application of three-dimensional graphics and audio across called Shared Desktop. One project goal was to enable the network. The Shared Desktop application collabor,nion ,1 mong users of three-dimensional (3-D) enables the collaborative use of any application g1·aphics wmkstations that run either the UNlX or the over local and long-d istance networks, as well Wind o11s i\'T operating S\'Stem. Another goal \\'as to as interoperation among Windows- and UNIX­ allow these users to access the high-perr<: mnance 3-D based computers. Its simple user interface cap�lbilitics of tllCir ortice workstations ri-om their laptop co m pu ters or home-based personal computers employs screen capture and data compression (PC:s) that run the Windows 95 svstem ami do not techniques and a high-level protocol to transmit h:t\T 3-D gra phics hardware. This goal necessitated packets using TCP/IP over the Internet. a cross-operating-s,·stcm application th�n could effi­ cient!�' and ctfectin·h· handle 3-D graphics in real time �md sh�1re these graphics with machines such as laptop computers �m d PCs. In this p:tper, 11·c begin with a discussion of the software currmtlv a1·�1ilable r(lr computer collabor:ttion. We then discuss the de1 clopment of the Shared Desktop applica­ tion, r(Kusing on the user interf:lCe, protocol splitting, screen capture �md d�lta handli ng, �m d dissimilar !Tame buftcrs. \ Vc conclude ll'ith sectionson JdditjonaJ uses and h1ture di1n:tions ofthe Shared Desktop prod uct.

Current Collaboration Software

Computer colhboration mav be ddi n ed as the interac­ tion betll'ccn computers Jnd their human users over a local or long-distJnce network. In general, it invokes a tr�msfcr oftntu�1l, g;r�1phical , audible, and visual infor- 111<1tion ri·om one colbborator to �m other. The parti­ cipants slurc control information either Lw means of com p u ter gc n nated synchronization e1·ents or bv human voice �m d visu<11 mm'ement.' Speciricallv, computer collaboration involves com­ municating;hics �1cbpters with hard \\'JIT acceleration. ( Comf•u ter· .1 idcd design/com purer-assisted manu­ fl cturing I CAD/C:Aiv! ] applications like Parametric Tec lmolos>�' CorpoLltion's [ PTC ] Pro/ENCINEER usc hardll'are accelerators through OpenCL' or

+1 i)i�inl Tc chnic:ll )oum.ll Direct3 D' progrJmming protocols. ) Other computers e\perts in the collaborJtion soft\\'are r:Hhn th:m in the do not conL1in 3-D

Digir:1\Tc chniul lound Vo l. lJ :-.: " . 3 19')7 SERVER DIGITAL UNIX MACHINE

r------I ·I�� j �------,:I : /? L f J � : I I I I I I I I y + ...

VIFWPORT II W.NDO\V I I I \

CLIENT CLI ENT 2 CLIENT 3 1 DIGITAL UNIX MACHINE WINDOWS MACHINE WINDOWS NT MACHINE 95

KEY: GRAPHICS, AUDIO, AND MOUSE ---- AUDIO, KEYBOARD, AND MOUSE

Figure 1 Server Deskrop witl1 Viewport and Clienrs

When a user initiates a collaboration, the audio is otT Upon connection, parrjcipants can hear and interact by default but remains integral ro a session as a conve­ with the server. The resultant audio dialogue combined nience (as opposed to using tbe telephone). Through with the graphics, keyboard , and mouse interactions a pull-down menu operation, the server enables audio facilitate a collabora6on environment in which partici­ fo r all participants in one operation. The usual audio pants share an application. Since each user can operate management tools used to set microphone recording a separate mouse and keyboard, the auruo channel acts levels and speaker/headset play-back levels are avail­ as a synchronization mechanism to indicate which col­ able. As Figure l indicates, the UNIX machine collects laborator controls the shared applications at any given audio and distri butes it to the three client collabora­ moment. The participants communicate their actions tors. Likewise, the three clients collect audio and send verbally, interacting in much the same way as people it back to the server fo r mixing. In this way, all partici­ who arc sittingaround a table and working. pants can hear one another and interact with \\ 'hatevcr objects appear in the viewport on the server's screen. Design Features Figure 2 shows the Shared Desktop Manager !Tom the ini6ator's viewport running on the UNIX server. A par- For our implementation, we concentrated on three 6cipant may usc a Session pull-down menu to control the principal areas: protocol splitting, screen capture and viewport and to connect and disconnect otber con fer­ data handling, and dissi milar fr ame buffe rs. In this sec­ ence members. The Op6ons menu allows to r audio, tion , we discuss our investigation into using a protocol remote cursor, and call-back control. The application's sp litter and our decision to rely on screen capture and Help puU-down menu provides the usual help informa­ d:na handling. We also ruscussdis similar fr ame bufters. tion similar to a Windows help taci.lit:y or a We b browser's help. The window lists the status of attached clients. Protocol Splitting We looked fo r a way to distribute 3-D graphics among workstations and PCs that would be independent of the application, graphics protocol, architecture, operat­ �ession Q.ptions !::!elp ing system, and windowi ng system. On UNIX, we fo und application sharing provided by distributed win­ clientl (99 05 1 0) - Connected dows protocols. For example, the X Protocol' allows a client2 (99.0. 10.11)- Connected user to send an application to a nonlocal display and to client3 (99.05 11)-Connected send X applica6ons protocol messages to several screens simultaneously. A protocol splitter, however, has rusadvantages due to its requirements fo r band ­ Figure 2 width, programming, and latency. Shared Desktop Manager

44 Digital Tec hnical Journal Vo l. 9 No. 3 1997 Protocol splitters require distri bution of graphics the computational power of the Alpha microprocessor commands and display lists by means of a network. to r red ucing the data, we continued our investigation . Three-dimensional models often contain megabytes of We to und that this approach req uires the windowing graphical intormation that describe specific screen system to perform screen capture bv means of a non­ operations. vVhen displaying a model locally, these CPU-intensive routine ( [ Di\IIA] graphics operations move quickly and easily over sys­ as opposed to programmed l/0). Based on our tests, tem buses that are capable of handling hundreds of we concluded th at screen capture technology would be megabytes per second. Hmvever, when these same easier to implement than a protocol splitter, would graphics objects are copied over computer networks, have better latency fo r 3-D operations than a protocol the amount of info rmation can overload even the splitter, and would be easily adaptable to the various highest-speed networks. For example, using a 100- windowing systems and 3-D protocols we wished to megabvte (iviB) Pro/ENGINEER truck assembly, a have intemperate. current generation 3-D workstation can load, display, and rot:�te the truck once in approximately 2 minutes. Graphics Compression The screen caprure approach The same operation between two identical 3-D work­ requires a number of steps to efficiently prepare the stations takes 20 minutes when pertormed by a distrib­ data tor transmission. First, the contents of a viewport uted protocol, and the rotation of the truck does nor are captured, and the sample is s:�ved to r comparison appear tluid to the user. If the same data or application with successive samples. Second, the captured viewport is duplicated on every machine, only updates with sy n­ samples are diffe renced to find screen pixels that have chronizing events are distributed, but this requires that not been changed and delta values for those that have all machines have the same graphics hardware. been changed . Third , the resultant array of values is The programming software needed to r interopera­ compressed bv a fast, run-length encoding ( lU,E) of tion among dissimilar operating and windowing the array of difrerence samples. A more CPU-intensive systems using protocol splitting is quite involved . compression may now be applied. The fo urth step is to The ability to support Xl l desktops, Windows 95 apply LZ77 compression that reduces the remaining desktops, and Windows NT desktops while using mul­ RLE data to its smallest fo rm. In step fo ur, the original tiple 3-D protocols like OpenGL and Direct3D would data has been reduced while retaining its characteristics req uire that these protocols exist on all plattorms. so that it can be restored (uncompressed) without loss

Latency requirements tor 3-D are very stringent. on a receiving computer. This final lossJess stage of Thus, any network jitter makes even the best network compression occurs only if it reduces the amount of link create breakup (visual distortions) when rotating data and if the network was busv during a previous 3-D objects. Network jitter also causes delays in send­ transmission. LossJess compression is important fo r tl1 e ing window protocol messages; as a delay increases, nondestructive transfer of data fr om the server's screen the window events may no longer be useful. For exam­ to the clients' screens and has application in industry. ple, when rotating a 3-D object, the delayed events As Jn example, consider a doctor who is sharing an must propagate as the network permits although this x-ray with an out-of-town colleague. If the graphics may once again congest the nerwork since the events were compromised by a lossv compressor, the collabo­ may no longer be needed . The object has now rotated rators could not be guaranteed that the transmitted to a new view. The ability to drop some protocol mes­ x-ray was identical to the one sent. Witb the Shared sages in a time-critical wav is a requirement to r collab­ Desktop application, the doctor who is send ing the orating with 3-D objects, and the protocol splitter x -ray is guaranteed that the original graphics are approach to sharing has no solution tor this problem. restored on the colleague's display. In some torms of compression, data is tl1rown out by the algorithm and Screen Capture and Data Handling never restored, so that tl1e final screen data may not To overcome these issues, we investigated capturing the accuratelv reflectthe original graphics. Figure 3 shows screen display, the tinaI bitmap result of the interaction the steps in the capture and compression sequence. of graphics hardware and software th:tt the viewer sees. On the Alpha architecture, these compression steps Capturing the screen is in itself nothing new; it bas been are performed as 64 -bit operations, both in the data used t(x some time to include screen visuals in docu­ m

Dig:ir;ll Technical Journal Vo.1 . 91"o3 l997 4S rr.=====�� DIFFERENCE r------, ARRAY 64-BIT CAPTURED LZ77 VIEWPORTS RUN-LENGTH COMPRESS ENCODING

ADD NETWORK TRANSMIT TO HEADERS ALL CLIENTS

CAPTURE RAW CONVERT TO AUDIO SAMPLES ADPC

ADD N ETWORK TRANSMIT TO HEADERS ALL CLIENTS

Figure 3 Cro1phics c\ 1\d Audio Com�m.:ssion Dc1rc1 Ho11 Di.1gr.1111

Audio Compression Similar ro riJe t"-f,lph i cs com pres ­ asscmbh dcltclhase resides) is ;I n A lph aSrarion 500 sion described , the audio comp ression in Sl tcl tTd 11·ork'>t.1tion t·unning the i)JGITAL UNIX opu.ning Desktop itJI·olvcs sc1·eral step�. hrst, the ,w dio scm1 ples s1 stem 11 ith cl 1\JII'tTStmm 4D60 graphics contmller. are captured throu gh '1 microphone cm d sound cc1rd In this e \Cunple , an 800- b1 600- pi .\el bl' 24-bir Shared combinJrion. These samples c\1'1.: com pared 11 irh rhe Dcskto11 1 ie11port is be in� C1j.1tured, co mpressed, and background noise lc1 el (determined p 1ior ro hcg111nit1g tr<111Slllitted ro the Sh;lrcd Desktop client S\'Stem at a conrcrence ) to see if rhe .'> c1 1llJ1 1es JI"C u se fu l . Sc1111J11es .1 bour h1 e u pd cl tes per second. The upd,ue rare is below the background noise le1·el ;�re nor tLm skrred . determined lw the capttl l·e l' iell'p ort size, the e\tenr of This i m plements a si l e nce derccri on method ll'hnch1· detail chcmges betwee n captures, the amount of pro­ onlv u sefu l sc1111ples ll'ill :�ckmce to the ne\t le1 e l cessing po11 n needed lw the application ro make · of compression. Second, the ne \t comlxession uses c lungL � ro rhe model, c1 11d the speed ohhe nctll'lJrk.. G.7ll or othe r similar :� udto unn pression '>Lltldcnds In rhis o.unplc, 11 hen rorc1ri ng rhe rruck asscmbh·, a and com·errs ach pti1 e difrc rellti,ll pu lse code moduL1 com pressed strcun of 400 ro 500 ki l obl' tes pu second rion (ADPClvl ) samples ,l t 64 kdobirs pet· second inro is l';c ne r;� red ;� nd represents rhe ti1·c updaLes per second 16 ki lo bi rs per second ( 4: I losw compn:ssion ).'' ']'hird, lllCiltioncd. A simple assemblv m i ght be Jblc to do a this dara is then readv r(x rrcmskr to a recei1 ing. cotn roration 11 irh Shared Desktop C<1pturing a 11d n·Jnsmit­ pu ter so rhar it ma1' be decompres\cd ,1 1Jd ouq1ur ro a ring 15 upcbtcs per seco11d, cl lld a more comp l icated speaker or a headset. The audio stream res u lting ri ·om model 'like rhc truck ::tssemhk sholl'n) II'(Jllld rccei1·e these steps generates ar mosr I (J ki lo birs I'CI' scumd IC11 n tqxhtc'>pn second. ll'hen someone is speaking, ;md no output '' hen it is silent. Figu re 3 also shOII'S the audio comprL· ssion stq1s. Dissimilar Frame Buffers To com plete rhe reyuirc mems of our implemenurion, Data Transmission After the gt·cl p h ics :111Li ;Hi dio cb rc1 11 c tlCcdcd to share graphics inr(mnJtion ac ross dis·

s are collected 'l tlli comp res ed, thel' c1re co m bined ;1 11d simil.u· hc1rd11 ,1 \'C, i.e., mc1chines 11·ith difkrem gr<1 p hic rransmirred ;K ross the nct11·ork b1· rnu· protocol allm1·s fo1· coherellt JUdio, s�·nchronil.;ltiOll rio n t(JI· the pi xe l , i.e., ll'hat color the user 'ee s t(Jr a of graphics c1 11d cursor e1·ems, lCill 11·here rhc ics ourpurs, especialil' ri1r 3 - D graphics, and rhe loll'er-

\o\. 9 \'o. 3 [99;- Sbew Dir �-oe Dir s, .... Load c..tl9 Ed• C-"9 Tnoll Tnolo :so,s�... ce�er.o

Figure 4

Screen Capture of the Windo\\'s Shared Desktop ClientSharing Pro/ENG l0!EER \\'ith a U� I X Sh;lred Desktop Se n er

depth di�plavs are usuallv fo und on less-c1pable, 2-D The matrix shows input screen or ,·isual tvpe depth gr:�phics pbtrorms. Most laptop computers have low :Kross the top row and delineates output bitmap depth bit depth (8 to 16) displ;ws and no 3-D capabiliti es. on the lettcol umn. Bitmap depths of 4, 8, 15, 16, 24,

Commodity PCs also ty piclily have 8- or 16-plane and 32 are used in vVi ndows systems, ;md depths of4, depths. Gr.1phics devices that support 3-D graphics 8, 12, 24, and 32 are used in Xll. The x in the matrix provide deeper display types such as 24- bit or 32-bit. requires no conversion :1 11d is captured <1 11d displayed

Some de,·ices support a mix of sever;:li or <1 11 the bit without the need fo r additional con\'(.:rsion. The c depths listed in the matrix (below) either concurrently shows bitmap depths thJt can be expanded to the our­ or fo r the entire screen at one ri me. put fo rmat by using a colormap or by shirring pixels 'We defined a matrix ofscr een depths :�nd proceeded into the correct fo rmat. Th<.: d shows that information to fi ll in the ,·arious combinations so that the applica­ must be dithered to match the output. Dithering can tion would work effecti,·elv across diftercnt plartorms result in a minimal loss of inrc>rmation, but ,,.c ha\'C and graphics hardware c1pabil ities. The m;Hrix enables developed a \'en· good and efficient method of doing computers without 3-D capability to display the our­ this conversion. The m (mix mode) marks those visual put fi ·om 3-D-capable graphics devices. The matrix of types on X I l that can exist on the screen when the screen-depth combinations ro llows. root depth is 24 or 32; i.e., an 8-bit '' indow can be

present on a 24-bit displav. The mix mode requires <1 Output different interpretation of the 24-bir pixels prior to Bitmap Input Screen or Visual Ty pe Depth compression and transmission. Si nce no 12-bit output Depth 4 8 12 15 16 24 32 displays exist, 11 marks inapplicable transrcmnations.

4 X md md d d d d Altenute tc>rmats of 24 pixels ( 3 bvrcs per pixel and 8 e mx md d d d d bl u e/green/red [BGR] triples ) are su pported as \\'ell as 8-bit pseudocolor and 8-bit true color. 12 n n n n n n n

15 e me me X d d d Sample Uses 16 e me me e X d d

24 e me me e e X d Like other collaboration sofi:,,·are, the Sh;ned Desktop 32 e me me e e e X application can be used in remote situ;ltions to help

Digital Tcchnic"l Journal Vo l. ') No. 3 1997 -t7 people communicate and share data. These uses include to application sharing based on a protocol splitter, the telecommuting, debuggi ng/support, :m d education. Shared Desktop application ofters easier interoper­ abilitv and better latencv during 3-D operations. With Te lecommuting a protocol splitter approach, it is difficult to decide One feature we built into Shared Desktop is the �1 bility which, ifanv, graphics events to drop when network

to originate a sharing session fr om a remote location. jitter or network b:mdn-idth dela\'s occur. Our Our intent was to allow an individ ual to work outside approach is synchronized to the last screen capture. the officeen vironment on a home PC or J laptop com­ When the network is no longer congested, the current puter. In the telecommuting scenario, J workstation screen capture can be sent, thus minimizing the per­ with high-end graphics fi.mcrions and applications cei\'ed efrecr of the network delay. The only disadvan­ located in the office would call back rhc home user's rage to bitmap si1Jring is its requirement that the ]ow-end svstem and present the user \\·irh his work \\·indO\\·ing S\'Stem and displav driver implement a environment. For example, consider a user of PTC's DMA screen capture ru nction and not programmed Pro/ENGINEER who is working on a 3-D assembly I/0. DMA screen capture req uests have a minimal with a 100-MB database and must mJke a change to load on the \\'i ndowing system. the part from home. Prior to the Shared Desktop 'vVe Jre planning a number of improvements ro the application, the only options were either to mimic the ad\·�mced dt\'elopment \'ersion of Shared Desktop . work em·ironment at home or dri\'e to the office to In our initi�11 \\'Ork, we made no changes to the win­ make the change. To mimic work environment, the a dowi ng systems. Ideally, the product version might equipment at home must support Pro/ENGINEER have a mechanism that notifies an appl ication when software and might require 3-D hard\\ are. In �l ddi­ and \\'here another application has made changes to tion, the user would have to retrieve a recent version of the screen. With the added ability to capture onlv the 100-tvlBda tabase over the telephone lines, which those �1 reas of rhe en that ha\'e changed since the would rake many hours to copy. vV irh the Shared scre last notification, the windowing svstem could pertorm Desktop application, the user can access the l00-Ml3 the ri rst two steps in the capture process. part using the low-end computer O\'er stancbrd tele­ Although rhe compression scheme we implemented phone lines. The changes to the assemblv then occur works r(Jr most cases, some graphics may not com press on the system and to the large database :�t the office. well using the combination of RLE and LZ77. Instead , content-specific compression or adapti\·e compression Remote Debugging/Support techniques might be better applied. This is an Jrea of Another use of the Shared Desktop application is ror studv we hope to pursue. customer support or remote debugging. Consider the The current graphical user interface (GUI ) lacks user of a 3-D design application \\'ho d iscm-crs J bug some conferencing kawres. The product version will in a new version of the sotrw�1re. A complex model be packaged with other applications to provide \'ideo, often causes a bug rbat requires soft\\'Jre support to char, \\'hirebo�1rd, ri le transkr, and user locator/ obtain the database to re-create the problem. Using directory services. Shared Desktop, a user could shovv a support repre­ FinJllv, the sharing model we implemented fo r the sentati\'e the problem on the running :�pplication, JS Shared Desktop application is easily ported to other opposed to tiling a problem report. svsrems. Thus the application could be available fo r widespread usc. Off-site Tra ining

A remote training scenario provides �l ri nal example of collaboration using computers. The Sl1ared Desktop References application facilitates remote training b\' connecting I.A. Hopper, "l\mdor�1-A.n Experimental Svste h.Jr Mul­ students in a sharing session . Each student's desktop m timedia Applicltions," O(X!ralil!,� S)'stems lmwwm11cr�-; Gu ide !USC

f:ditirJI/ (Redmond, \V;1sh .: Microsoft CorporJ. ti on , 1995 ). Conclusion and Future Directions 4. ,1/ulllj.Joint St ill [m(},r.>,e und Allii(J/CifiOII f!rotocol. ITU-T Recommendation T.l26 (D1·aft) (Gene''<\: Inrcmational The Shared Desktop collaboration soft\\'are employs Te legraph ;lJld Telephone Consllltati,·e Com mittee, 1997). J simple user inrer£1ee that emphasizes ease of 3-D applicatjon sharing and audio conkrencing. Compared

4/l DigirJI T.:chnic1l journal Vo l. ') No. 3 l')97 5. R. Sc hciflcr, j. Gem·s, R. Ne\\'111;111, A. Menro, c111d A. Wo jL1s, .\.' \F/11{/ou· Si s/en/. C Librmy r111d !'mlucol Neji:rence(Burlingron, Mass.: DigitJI Press, 1990).

6. Pulsl! Cnde .\Jodulotion (PC\ IJ r{ I (Jice Freque11cies. C:ClTl Recommendation G.7l1 (GeJJt:l'a : lnrcrnJtionJI Tc lecommunic:nions Union, 1972 ) .

7. R. Palmer and L. Palmer, "Video Te lcconkn.:ncing rc>r Networked Workstations," United St;ltes P;uenr no. 5,375,068 ( 1994 ).

8. P Hcckbcrt, "Color lnHge Quantization r(>r rramc nuffc r

Displw," Om!jJIIIer Cmphics. \'OI. 16, no. 3 (I982 ) .

Biographies

Lawrence G. Palmer

LJJT\' Palmer is a consulting engi neer in \Vo rkstations G1·aphics Dc,·cloprncnr. He joined DIGITAL in I 984 and cLJJTcnrlv co- leads the Sh;ued Desktop proJect. He holds a B.S. in chcmisrrv ( summa cu m laude) from the Unin:rsin· ofOkL!Iwm;l and belongs to Phi Beta [(;lpp.!. He is co-im·cntor t(>r nine patents on cnJbling sofn,·arc tcchnologv r(>r audio-l'idco tckconkrcncing.

Ricky S. Palmer

Ri ckv Palmer is J consulting engineer in Workstations Cr:1phics Del'clopmcnt. He joined DIGITAL in 1984 and cum.:ntly co-leads the Shared Desktop proJect. He holds a n.s. in physics (mJgnr nine patents on enabling soriWr audio-\·ideo telcconkrencin.r. a o

Digital Te chnical journal Vo l 9 No. 3 !997 49 I

David C. P. LaFrance- Linden

Challenges in Designing an HPF Debugger

High Performance Fortran (HPF) provides As \\'e learn better 1\';1\'S to express our thoughts in the directive-based data-parallel extensions to �o rm oFeomputer prog;r::tms and to rake better advan­ tage ofhJrd\\'�lrc resources, \\'e incorporate these ideas Fortran 90. To achieve parallelism, DIGITAL's HPF and [nr�ldigms into the progLlmming !Jngu::1ges \H: compiler transforms a user's program to run as usc. l-'orrr�m 90' ' prm·ides mcch::�nisms to operate several intercommunicating processes. The ulti­ direct!\' on arravs, e.g., A= 2 *A to double each clement mate goal of an HPF debugger is to present the of A independc m o�· rank, rather than requiring the user with a single source-level view of the pro­ progr�1111ll1Cr to opn:tte on indi1·idual elements 11 ithin gram at the control flow and data levels. Since nesred Do loops. J.'d ::�m·of these mechanisms arc natu­ pieces of the program are running in several dif­ ral!�· d�1ta [Xlrallcl. High Per�ornnnce Fortr::�n ( HPFt• extends Fortran 90 11·ith data distribution directil'eS to ferent processes, the task is to reconstruct the t:JciliLHe comput�1 tions done in parallel. De buggers, in single control and data views. This paper pre­ turn, need to be enh::�nced to keep pace 11 i th nell' fe a­ sents several of the challenges involved and tures of the langu Jges. The ti.tmhmental user require­ how an experimental debugging technology, ment, holl'ever, remains the same: Present the control code-named Aardvark, successfully addresses tJo,,· ofthe program �1 11d its dat�l in terms of the ot·iginal many of them. source, indcpcndem ohl'l1:1tthe compiler h::�sdone or "·Jut is happening in the run-time support. Since HPF com pi lcrs ::turomatic 11lv distri bu tc data :md compu ta­ tion, rherebl' ll'idcning the gap bctll'cen :1ctu:1l execu­ tion �m d origin�1l source, meeting this req u ire ment is both more impmt�lllt ;l lld more difficult. Thts p�1pcr describes sel'eral of the challenges HPF crc1tcs t( Jr ,\ debugger �1 nd how an experime11t�1 l debug­ ging tech nolo�·, t ntc rt1<1l h" code-n�1med All'dl'ark.,suc­ cessful II addresses m�m'· ofrhem using techniques that ha1 e appliubilin· bl'l'ond HrF. For example, program­ ming p;1r;1Jigms common to explicit message-passing s1·srems �uch as the Mcss;1ge Passing Inter EKe ( MPI)'-­ cm benefit ti-om Ami1 ;1rk's methods. The Hl'f com piler ;l lld run time used is DICITAL's HPJ-' compiler,' ll'hich produces <1 11 execut:1ble that uses the run-rime support of DIGITA L's Par::t!lel Sofr11 ,\rL· Em·ironmcnr .'' DICITAL's HPf compikr tr<111st(mns a progr.1 m ro run �1 s sc1 cr:1l intercommuni­ cating processes. The fu ndamem:1i requirement, then, is to give the appe::�r::�nccof a single control Holl' and a single tb u space, c1·en though there arc sc,·nal indi­ ' idu�1l comrol rlo\\·s ;l mi the d:�ra has been distribu ted . ln the �n per, I imrud uce the concept oflogic1l entities and show how thev �1 ddress many of th e control tl ow challenges. A discussi on of �1 rich �l lld flexible data model th�n easiil handles distributed d::�ra t(Jllm,·s. I then poi m our difti culrics i m posed on user intcrbces, especi�1llv when the program is not in :1 complete!�·

50 \'ol. <) '.:o.·' I <)<)7 consistent state, and indicate how they can be over­ base classes and methods t()l- logical entities to include come. Sections on related work and the applicability of many tl1ings that are prob�1bly specific to phvsical enti­ logicJ! cmities to other areas conclude the p What is a stack ti-a me? What operationsarc expected to \\'Ork on all kinds of processes Breakpoints but actually on!�, work on phvsical processes> Experience Setting a breakpoint in J logical process sets �1 break­ to date is inconclusive. AJrd\·ark currently defines t11c point in each ph�·sical process and collects the physical

Digital Te( hlliul )ounlJ! Vo l. <) '>1 o. 3 1997 S I EXECUTION STATES:

LOGICAL RUNNING ------,

PHYSICAL RUNNING \ \ \\ \ I 1\ I II \\\ \ STOPPED 1 3-2-0- L- L 0-0-1 -2-3 213- L------+ TIME KEY LOGICAL THREAD L STARTS + . * PHYSICAL THREAD 0 STOPPED.

Figure 1

Determining When c1 Logiol Thrc1d Stof'S

<1 Aa r(h·a rk 1 breakpoint rcpresentJtions into logic1l brea kp oi nt. xh i e -cs rhe HcxibilitY of l'astlv di�krenr

r:or Hl'F, any action or condirion<11 expression is r�·pes ofcomp;1rison kevs (mach ine addresses Jnd logi ­ associated ll'ith the logical breakpoint, not \\'ith the ul Stof) reasons) bv IL\I'ing rhe comparison kel' tvpe bc physiul breakpoinrs. Consider the e\f)ressloiJ rhe most basic Aard1·<1rk base class, ll'hich is the cquil·­ ARRAY < 3, 4 l.LT . 5. Even if the clement is stored in "lc nt of ]a1"<1's Ob ject class, Jnd by using run-timc onlY one process, the entire HPF 1)rocess needs to stop tl'ping <1S neceSSJfl'. bdore the expression is evaluated ; otherwise, there is

r he potential tor incorrect d:tLl to be rc1d or f(>r Single Stepping

processes to conti nu e running ll'hen thev should nor. Single stepping a logical thread is accomplished by Th is requires each phvs ic:d process ro reach irs [)11\'sic:ll singl e src ppin g the plll'siCJI threads. It is not sufficient e1·,1luated. breakpoint bdore t he e xp ression can be to singlc step the ti rst th rc1d , 11·,1it fo r it to srop, and Once CI'Jiu,ue d, the process remains stupf)ed or COIJ­ then proceed ll'ith the oth er thrcads. If the progr:�m

tinues, depending on the resulr. For HPF, .1 brc1k · sratemcnt rcquires communic:�rion, thcn the entire '1 t point in a logical process implies glolx 1 l br breakpoint determination is ro bee ohhreads. :�sk A the thre<1

r d PHYSICAL PROCESSES For physic:�I DI GITAL UNIX rh e:� s, a s I G TRAP IN ITIAL LOGICAL STOP REASON a 1 AND THEIR BREAKPOINTS sign l cou ld bc a brcakpoint, ,,·irh the co mp< rison kc1· STOPPED AT COLLECTION being rhe progrJm counrcr :�dd ress of rhe porenti;ll PO STOPPED AT breakpoint instruction. This comp,1rison k.e1' is then P1 · STOPPED AT P2 STOPPED AT -B used to search the breakpoints installed Ill the plwsic1l process to determine: ll'hich (if ;1 111·) breakpoint ll'

For a logical thrc1d, the initi,11 ( logic1l ) sto�) I'C<\son LOGICAL could be :� brc:�kpoint ifeach of the ph\·sic1l threads i::. PROCESS' LOGICAL r-- B stopped Jt breakpoint, ;JS sboll'n in Figure 2. The 0 :1 BREAKPOINTS 0 comparison kcv in this c;Jsc is logi 1l stop re:1son 0 rbc c '--- itself. The brcakpoinrs of rhe componcnt stop rc;Jsons are th en comp,lred ro the compo11cnt hreak�)oints of rhe installed logical breakpoints to determine ir:� logi­ d is Figure 2 C1l brca kpoim ll'as rea ch e . If rh cre :1 march, the l.ogiul Brc.lki'Oint Determination logical thread's stop reason is uplhtcd .

DigitJJ Tc chnic11 )oum.1l Vo l. '.! :\o 3 lY'.!7 run rea�on is empowered to take the actions (e.g., set­ In the context of HPF, a user inrerr:1 ce registers its ti ng or enabling te mporary breakpoints) necessary to process change handler with the logical HPF process. carrv out its task. In this paper, the word empotrered During construction ot' the logical process, Aard\'ark means that the reason has a method that wi ll be called registers a ph}•sical-to-logical process change bondler to do reason-speciric actions to accomplish the rea­ with the physical processes. It is this physical-to-logical son 's semantics. This relieves user interfaces and other handler that coordinates the physic1l entities. When the clients fr om figuring out how to accomplish tasks. As a firstph ysical thread stops, as at time "*" in Fig-ure l,the resu lt, Aardvark defines a "get si ngle-stepping run rea­ handler is notified but notices tlut the ti mestamps do son" method tor threads. Clients use the resulting run not in dicate that the logical thread should be considered reason to continue the thread, thereby initiating tbe to have stopped . When the last physical thread stops, single-step operation. the handler then synthesizes a "stopped at co llection" Therdi:)re, single stepping a logical thread in logical stop reason, as tn Figure 2, and inri:mm the Aardvark involves calling the (logical) thread's "get (logical ) th read that it has stopped. single-stepping run reason" method, continuing the Aard vark defi nes some callbacks in process change thread with the result, and waiting tor the thread to handle rs that are to r HPF and other logical paradigms. stop. The "get single-stepping run reason" method to r These callbacks allow ::1 user interrace to implement a logical thread in turn calls the "get si ngle-stepping policies when a thread or process goes into an interme­ run reason" method of the component (physical) diate state. For example, at time "*" in Figure 1 a threads and collects the (physical) results into a logical physical thread has stopped but the logical thread is single-stepping run reason. When invoked, the logical not yet stopped. \Vh enever a phvsic1l thread stops, the reason continues each physical thread with its corre­ handler's "component thread stopped" callback IS sponding physical reason. invoked . A possible user interface policy is" Single stepping dramatically demonstrates the If the component thread stopped r()r a nasl )' rea­ autonomy or' the plws ical entities. When continuing • son, such as an arithmetic error, tr\' to stop �1 11 the a (logical ) thread with a (logical) single-stepping run other component threads immediately in order ro reason, the physical threads can start, stop, and be minimize divergence among the physical entities. continued asynchronously to each other and without am· inter\'ention ti·om a user interface, the logical enti­ • If this is the first component thread that stopped tor ties, or other clients. This is especially true if the thread a nice reason, such as reaching a breakpoint, start a was stopped at a breakpoint. In this case, continuing rimer to wait to r the other component threads to a physic�! thread involve� replacing the original stop. Ifthe ti mer goes off betore all the other com­ instruction, machine single stepping, putting back the ponent threads have stopped , rrv to stop them breakpoint instruction, and then continuing with the because it looks like the\' are not goi ng to stop on original run reason. Empowering run reasons (and their own. stop reasons) to efrect th e necessary state transitions • If this is rJ 1e last component thread, cancel any ri mers. enables physical entities to be autonomous, thus The user interface can provide the mc1ns for the user relie\·ing the logical algorithms �r om enormous poten­ to define the timer inten·al, as well as other attribu tes ti al complexity. of policies. These policies and their control mecha­ nisms are not the responsibility of the debug engine. Coordinating Physical Entities The previous discussion describes some logical algo­ Examining an HPF Call Stack rithms. The section "Starting �md Stopping" describes using ti mestamps to determine when a logical thread When an HPF program stops, the user wants to sec a becomes stopped (see Figure l), and the section call stack that appears to be a single thread of control. "Breakpoints" describes a logical thre::�d possibly Sometimes this is not possible, but e\'en in those c1ses, a reaching a breakpoint (see Figure 2). The physical debugger can offer a fair amount ofassistance. The HPF entities need to be coordinated so that the logical language provides some mechanisms that also need to algorithms can be run. In Aardvark, this is done with a be considered. The E X T R I N s I c c H P F _ L o c A U proce­ process chanp,ehand ler A process change handler is a dure type al lows proceJures written in Fortran 90 to set of callbacks that ::1 client registers with a process and operate on the local portion of distributed data. This its th reads, allowing the cl ient to be notified of state type is usefi.tl fo r computational ke rnels that cannot be changes. For example, if a user interface is notifiedthat expressed in a data-paral lel fashion and do not require a th read has stopped and that the reason is a UNIX communication. The EXTRINSIC(HPF SERIAL) signal , the user interface can look up the signal in a procedure rvpe aiiO\\ s data to be mapped to a single table to determine if it should continue the thread process that runs the procedure. This type is useful to r (possi bly discarding the actual signal) or if it should calling inherently serial code, including user interraces, keep the thread stopped .

Digit�! Tc chnic.ll journal \'ol . 9 :-:o 3 1997 53 which mav not be wri tten in 1-'ortclll. DIGITAL's HI'!-' rlcl\\. When a procedure is call ed, each thread executes compiler also su pports ltciuuing. 11 ·hich al lo\\·s scri:-�1 the c1 ll an d executes it ti·om the same place. A l ogical code to call p�1rallel HPF code. AJI these mech:�nisms brc:1kpoinr is rc:�ched \\'hen the plwsical threads are afkct the call stJck or ho\\' a user n�l\'iF:atcs the ell) stopped at the s:�mc place at rhc correspond ing phvsi­ stac k. They req uire underlving su pport h· om rile ul hrc:�kpoinrs. These cases lead to svnchronized debugger as well as user inrcrhce supflOrt. tr:�n1cs. The most common c:�use of an unSYnchronized ri ·amc is interrupting the progrJm during :1 computa­

Logical Stack Fra mes tion. F1 en 111 this usc, the dil'ergcnce is usuallv not very

Aardvark's logical e ntjty model applies to st:1ck fra mes: !Jrgc. One reason t(Jr a mu lri tr�1 me is rJ1e interruption of logica l stack fi-:1 mes collect sc1-cral plll'sical stack ft·a mes the progr:�m 11·hik it is communicating dat�l benveen and presenr :1 Sl'nthesizcd 'ie\\· of the ( IDgica l) c1ll processes. In this c:�sc, the code paths ca n e:�silv di1·ergc, stack. Currenrlv, Aard 1·a rk ddines t() ur nvcs of logic,1l depending on \\ hich threads :11-c sending, ll'hich arc stack frames to represent ditkrent scenarios that Cl ll recei1·ing, and how much data needs ro be m oved . be encountered: Sc1lar ti·anKs arc crc:�red because ofrhe semantic fl ow ot·

the program : the m,1in progr;1m unit is \\'ritre n in either I. SGl lar, in which onh· one pll l'sical thrud is senLlllti ­ a serial langu age or an HPI-' procedu re c::1lled an Glll�· active EXTRINSIC ( HP F _S ERIAL) proced ure tl'pc. 2. Synchronized, in which all the rhre�1ds :1re at rile The result of the alignme n t algorithm is a set of s:1me place in the same fu nction 6.-amcscol lected 1nto :1 call st�1ck. The norm::d n:�,·ig:�­ 3. Unsvnchronized , i n \\·hich all the thrc1ds

also . ri a l fo r sync i 4. Multi, in which no discernible re lationship exists V 1 b e lookup works best hron zed ti ·:1mcs :�nd,rC.> r HPr, works t(Jr unsvnchroni;.ed frames between the correspondi ng plll'sical threads as \\ ell. 1-'llows the te mp oral order of calls �1 11d tih·:lmcs inro a suck. tr ace ll'ith svnchronizcd ti·a mes. also correctl y h andles recursi1·e proced u res . The dls­ A narroll'cd fo cus can also Ionic be hind the scenes of tl1e adl'antage of starting at the outermost fr:� mes is th�n t\l'inning mechanism described in rhe next paragraph. e�H: h plwsical thread's enri rc stack must lK dererm i ned The second aid is to 1·ie11· a single plwsic:-�1 process in bd( Jre rhe logic:1l st:�ck c:�n be constr ucted . Usuc1lk isolation, efkcti1 ek turning off the pa.ral lel debu ggi ng the progr:.unmcr only wants the innermost tew ti·�1111cs, algorithms . This technique is usdi.d to r debugging so rime delavs in rhe construction process un red uce EXTRINSIC(HPF_LO CAL)

\'"1. ') \:u.� I '-)') 7 Twin ning or the line number, mav not be 1·alid. Nevertheless, a DIGITA L's HPF provides a fe ature called twinning user interface can provide useful int(xmation :� bout in which a scalar procedure can call :t parallel HPF the state of the program. The program used t(>r the procedure. This allows, fo r example, the main pro­ fo llowing discussion has a serial user interface m-itte n gram consisting of a user interface and associated in C and uses t\vinning to call a parallel HPF procedure graphics to be wri tten in C and have Fortran/HPF named HPF_ FILL_IN_DATA (see Figure 3). The do the numerical computations. The te ature is called HPF procedure uses a fu nction named MANDEL_ vAL tll'inning because each Fortran procedure is com­ as a non-data-parallel computatjonal kernel. The pro­ piled twice . The scalar twin is called fr om scalar code gram was run on five processes. (Twinning is a DIGITAL on a designated process. Its duties include instruct­ extension. Most H PF progrJms are ll'rirten entirelv in ing the other processes to call the scalar twin, distri b­ HPF. This example, ll'hich uses t\l'inning, ll'as chosen uting irs scalar argu ments according to the HPF to demonstrate the broader problem.) directives, calling the HPF lmn fr om all processes, Figure 4 shows the program interrupted during distributing the parallel dat:t back onto the desig­ computation . Line 2 of the ti gure contains a single mted process after the HPF rwin returns,and ti nally ti.m ction name, MANDEL_V AL. Line 3 contains the return ing to its caller. The HPF twin is called on all fu nction's source tile name bur lists tivcline numbers, processes with distributed dat:t and executes the implying that this is an unsynchronizcd ti-a me. In EKt, user-supplied boclv ofthe procedure. the user interface discovered that Aardvark crc1ted <\ n At the run-time level , the program's entry point is unsynch ronized logical fra me. Instead of trying to get normallv c;dled on a designated process (process 0), a single line number, the user interf.1 ce retrie1nl the and the other processes enter a dispatch loop waiting set of line numbers and presented them. Tn lines 4 hJ r instructions. Conceptually, such a program starrs through 10, the user intert:1 ce also presented the range in scalar mode and at some point transitions into paral­ of source lines encompassing the lines of all the com­ lel mode. An HPt debugger should represent this ponent processes. This user interfKe 's up comm�1nd transition. Aardvark accomplishes this by having (line 21) navigates to the calling ti-a me. In this exam­ knoll'ledge of the H I'F twinning mechanism. When it ple, the fr ame is svnchronizcd, causing the user inter­ notices plwsical threads entering the dispatch loop, EKe to present the function's source ri le

Digiral Tcd1 11ic1l joumJI \'ol. 9:-\o3 1997 subro utine hpf_ fi l l_i n_d a ta(target, w, h, ccr, cci , cstep, nmin, nmax) integer, intent(in) w, h byte, intent(out) target (w,h) rea l*S, intent(in) ccr, cci, cstep integer, intent(in) nmi n, nmax !hpf$ distribute target(*, cyclic)

integer ex, cy ex w/2 cy = h/2

forall(ix = 1:w, iy = 1:h) & targ et(ix,iy) = mande l va l(CMPLX(ccr + ((ix-cx)*cstep), & cci + ((iy-cx )*cstep), & KIND=KIND(O.OD O)), & nmi n, nmax)

contains

pure byte function mande l_va l(x, nmi n, nmax) com plex (KIND=KINDCO.ODO) ), inten t(in) x integer, intent(in) .. nmin, nmax

integer n

real(kind=KIND(0.0D0)) xorgr, xorgi , xr, xi, xr2, xi2, rad2 log i cal keepgoing

n = -1 xorgr = REAL(x) xorgi = AIMA G(x) xr xorgr xi = xorgi

do n = n + 1 xr2 xr*xr xi2 xi*xi xi = 2*(xr*xi ) + xorgi keepgoing = n < nmax rad2 = xr2 + xi2 xr = xr2 - xi2 + xorgr if Ckeepgoing .AND . (rad2 <= 4.0)) cyc le exi t end do

if Cn >= nmax) then mande l va l nmax-nmi n else mandel va l MOD (n, nmax-nmin) end if

end function mande l va l

end subrouti ne hpf_f i ll_i n_data

Figure 3 HP F _F I LL_I N_DATA Procedure (Source Cmk t(Jr Figures..J. and 5)

mechanism at ti·ames #5 and #6 (lines 23 :\Ild 24) is Examining HPF Data similar to the mechanism at h·ames #3 and #4 (lines 14 and 15) of Figure 5. In Figure 6, the1· do not c1 ll �1 Examining data gener:dlv in\'Clh·es determining where scalar twin but rather call the messaging library to the data is stored , ktching the data, and then presenr­ recei1·e instructions fr om process 0. The mess�1ging ing it. HPF presents difhculties in all three areas. librarv, however, is often not svnchronized among the Determining where data is stored requires rich and peers, and frame #2 (line 15) shcl\\ s a multiri·ame. This tl nible data-location representations ami associated user interface shows a multifi·;une as a collection or· operations. Fetching small amounts of data can be one-line summaries of the plll'sical t!·ames (lines 16 done nai1 ell', one element at a time, but fo r large through 20). amoums of data, e.g., data used tor visu�1lization, taster

56 \'ol.

2 #0: MANDEL_VA L(X = <>, NMIN 255, NMAX 510) 3 at mb .hpf . f90: 45,44,45,40,39 4 39 xr2 = xr*xr

5 40 xiZ = xi*xi 6 41 xi = 2*(xr* xi) + xorgi 7 42 keepgoi ng = n < nmax 8 43 rad2 = xr2 + xi2 9 44 xr = xr2 - xi2 + xorgr 10 45 if Ckeepgoing .AND . (rad2 <= 4.0)) cyc le 11 12 debugger> print x 13 $1 = # 20 21 debugger> up 22 #1 : hpf$hpf_fi l l_i n_da ta_( TARGET =<>, 23 W = 400, H = 400,

24 CCR = -0 . 76000000000000001, CCI = -0 .02, CSTEP = 0.001 , 25 NM IN = 255, NMAX = 510) 26 at mb .hpf .f90:14 27 14 forall(ix = 1:11, iy = 1:h) & 28 29 debugger> info address target 30 # 31 type INTEGERCKIND=1 ), DIMENSIONC1 :400,1 :400) 32 phys_count 5 33 add resses 34 0: Ox1 1fff71 f0 35 1: Ox1 1ff f7000 36 2: Ox1 1ff f7000 37 3: Ox1 1ff f7000 38 4: Ox1 1ff f7000 39 a rank 40 trank 2 41 diminfos dlo11er dupper plo11er pup per dist k 42 0 1 400 1 400 coL Lap 43 400 80 cyclic 44 45 debugger> info address target C100,100) 46 # 47 type INTEGERCKIND=1 ) 48 peernum 4 49 Locative #

Figure 4 Program 1 11tcrruprcd during Compurario11

methods :1re needed . Displaying cbta can usually use to be distributed so tlut successive arLw elements �1rc the ted1 11iques inherited fr om the underlving fortran not necessari ly stored in a single process or address 90 support, but some mechanism and corresponding space. These lead to d:tt�l that can be stored discon­ user interface handling is needed ll'hen replic1tcd data tiguouslv in memory as \\'ell as in ditkrellt memories. has diff<.:rentvalu es. Fortran 90 also introduced arrav sections, vector­ valued subscripts, and ti cld-ofarrav operations,'" Data-Location Representations 11·hich fu rther complicate the notion oF 11 here d�na is Re presenting 11·here data is stored is relati,·elv casv stored . Although evaluating an expression in\'(>h'1ng to do i n languages such as C and fortran 77: t he data Jn array can be accomplished by reading the entire is in a register or in �1 contiguous block of memory. �m-ay and performing the operations in the debugger, Fortr:ln 90 imrod uced �1ssu med -shape and deterred­ this appro:�ch is inefficient, especiallv t()r a result rhat is shape arr�l\'S,'' 11 here successive arLl\' elcmems are not sparse comp�1red to the entire arLl\·. A st�lndard tech­ neccssarilv adjacent in memorv. HPF aUO\\'S the Jrray nique is to perform address arithmetic �m d fe tc h onlv

\'ol. 9 );o_·' 1997 1 debugger> where 2 > #O(unsync) MANDEL_VAL at mb .hpf . f90:45,44,45,40,39 3 #1 (synchr) hpf$hpf_fill_i n_data_ at mb. hpf .f90 :14 4 #2 (synchr) hpf_f i LL_i n_data_ at mb .hp f.f90 :1 5 #3(scalar) mb_f ill_i n_data at mb .hpf .c:45 6 #4 (scalar) main at mb .c:421 7 #5 (unsync) _hpf_twinning_ma in_usurper at [ ...J/Li bhpf/hpf_twin.c:499, 506,506, 506,506 8 #6(synchr) start at [ ...J/a lpha/crt0.s:361 9 debugger> focus 1-4 10 debugger> where 11 > #O(unsync) MANDEL_V AL at mb .hpf.f90:,44,45,40,39 12 #1 (synchr) hpf$hpf_fi l l_i n_data_ at mb .hpf .f90 :14 13 #2(synchr) hpf_fi l l_i n_data_ at mb .hpf . f90:1 14 #3(synchr) _h p f_non_peer_O_ to_dispatch_Loop at [ ...J/Li bhpf/ hpf_twin.c:575 15 #4(synchr) _h pf_twinning_ma in_u surper at [ ...J/lib hpf/hpf_twin.c:506 16 #5(synchr) start at [ ...J/a lpha/crtO .s:361

Figure 5 Control FlO\\ of a T\\·i nncd Progr.1m lntcrruf'tcd durin"- Con1putation

1 debugger> where

2 > #O (scalar) __po ll at <> :41 3 #1 (scalar) <> at <> :459 4 #2 (scalar) XRead at <> :1110 #3(scalar) XReadEvents at <> : 950 6 #4 (scalar) XNex tEvent at <> :37 7 #S(scalar) Hand leXInput at mb .c:58 8 #6 (scalar) ma in at mb .c:452 9 #7(unsync) _h pf_twinning_ma in_u surper at [ ...J/lib hpf/hpf_t win.c:499,506, 506,506, 506 10 #8(synchr) start at [ ...J/a lpha/crt0 .s:361 11 debugger> focus 1-4 12 debugger> where 13 > #O (unsync) select at <> :,41 ,,41 ,41 14 #1 (unsync) TCP_MsgRead at [ ...J/lib hpf/msgtcp.c:,1057,,1057, 1057 15 #2 (multi) 16 17 TCP_RecvAva il at [ ...J/lib hpf/msgtcp.c:1400 18 swtch_pri at <> :118 19 _T CP_RecvAva il at [ ...J/lib hpf/msgtcp.c:1400 20 _T CP_RecvAvail at [ ...J/lib hpf/msgtcp .c:1400 21 #3 (unsync) _hpf_Recv at [ ...J/lib hpf/msgmsg .c:,434,488,434,434 22 #4(synchr) _h pf_RecvDir at [ ...J/li bhpf/msgmsg .c:509 23 #5(synchr) _h p f_non_peer_O_ to_dispa tch_loop at [ . . . J/libhpf/ hpf_twin.c:563 24 #6( synchr) _h pf_twi nni ng_ma in_u surper at [ ...J/li bhpf/hpf_twin.c:506 25 #7( synchr) start at [ ...J/a lpha/crtO .s:3 61

Figure 6 Control Flow of a Twi nned ProgLllll Interrupted 'vV hilc Idle in Scal.1r Mode

the

DigirJI Tcchnic.ll ]ourncll \'ol. CJ �o. 3 19CJ7 nenr tield and changes the element rvpe to th:�tof the shows that it is on process 4 at the memon· <1 ddress fi eld. A vector-valued subscript expression requires indicated by the contained loc:1tivc . Jdditioml su pport; the reprcscnt<�tionto r each dimen­ sion can be a vector of mcmorv oftsets instcJd of Fetching HPF Data bounds Jnd inter-element sp�King. As just mentioned, locati\'es prm'ide J method to tctch All arrays in HP� arc qual ified, explicitly or implic­ the data described by the locative. for a locative that itlv, with ALIGN, TEMPLATE,�md DISTRIBUTE dircc­ descri bes a single distributed arr:w clement (e.g., ti\ es."' DIGITAL's HPF uses <1 superset of the Fortran Figure 4, lines 45 through 49 ), the method extracts 90 descriptors to encode tbis information. Aardvark the appropriate physical thread from the logic:tlthr ead models HPF MTavs with another derivation of the and uses the contained locative to tC tch the data rcla­ locui,·e class that holds information similar to the H PF ti,·e to the extracted physic1l thread . Fora locati,·c that descriptors. The most pronounced diffe rence is that describes an HPF array, Aardvark currently iterates Aardvark uses a single locative to encode the descrip­ over the valid subscript space, determines the physical tors ti·om the set of processes. Aardvark knO\\'S that the process number and memon· offset tor each clement, loc.ll mcmorv addresses arc potentially different on and fetches the clement fr om the selected plwsical each process and maintains them as a vector, but cur­ process. for small numbers of elements, on the order rcntlv assu mes that processor-independent informa­ ot- a fe "· dozen, this technique h:1s �1cceptablc per­ tion is the same on all processes :md on lv encod es that formance. For large numbers of elements, e.g., tor information once. visualization or reduction operations, the cumulative Re ferring again to Figure 4, line 22 shows that the processing and communication delav to retrie,·c cJch

Processing �1 subscript triplet, e.g., from : to: stride, the common "read (contiguous ) memory" method. If in"olves adjusting the declared bounds and the :tlign� the process is remote, as it usuallv is \\'irl1 H Pf, the menr; it docs not

Processing c1 ticld-ofarray operation adjusts the element Com·crring a locative describing a Fortran 90 arrav type and oftset.s each memory address. section to a "read memory section" method should be When selecting a single array clement bv providing casv: thev represent nearlv the same thing. For �1 loca­ scalar subscripts, another tvpe oflocati"c is useful. This tive that describes a distributed HPf arrav, Aard,·ark locative describes on which process the data is stored would need to build (physical ) memory section and a locative relative to that selected process. For descriptions fo r each plwsical process. This can be example , line 45 of Figure 4 requests the location done by iterating 01·er the phvsical processes and information of a single arrav element. The result building the memory section tor each process. It is

Digir:1l Tel' llllical Journal Vol. 9 c-:o 3 1997 39 :� lso possi ble to build the memon· sections t()l" ;1 11 the ricalil· the s.1mc em point to targets located at difti:renr processes during; �1 singl e pass throuf':hthe louri,·c, bur mcmo1T ,1 ddrcsscs tor unrelated reasons, leading to

rhc per tcmn �1 1Kc gJins nu1· nor be l arge en ou g h w difti:rcnr lncmor,· add ress ,·alucs :� nd th erefore a f:llse w;11-ranr the added comploitv. positi\'c . To corrccrlv derdi.:rencc rhe poi mcrs, though, r\:11·d ,·ark IKtds tile difkrcnt mclllOrv :�dd ress \'�llues. Differing Va lues In short, it is rcJson

A debugger should be aware rh:�r ,·alucs might difkr ;lnd some are not , it also seems reason able to crcatc a

:� nd adjust the p1·cscnt�1 tion ofsuch l';ll ues acc ord ing � \ . d iikring '.1lucs object rhar holds rhe di tk 1ing resu lts. I NTVAR . LT . 5 Aardvark's ;1pprcK1ch is to ddlne a nc11· ki nd of value What if is used �1s the condition of a object c11led chj/(Tinp, mlues ro rc prescm <1 ,.;1luc ti·om l11·c;1kpoim ,m d some ''

;H thi s f�oint; �1 brgc number of , alucs could distr,Kt in common pr

the COtnponcnt ricld com p p t r is �1 distributed arL11' be saved :m d restored. Although logical cntirics II'Ould pointer. Aardvark does nor currenrlv process the array be used to coordinate rhc phvsical details, this is a lot ck scripror( s) to r s caL a r% compp t r ar the right pbce of 11·ork and has nor been prototvped. and as a result docs not recognize the expression as Related Work an JtT�\1'. As mentioned carlicr, A1rd1·ark reads a repli­ Gitcd �1 rrav element ti·om a singlc proccss. To process array< 1)% ompp t r, DJGITA L's representative to rile tlrsr meeting or· c all thc descriptors arc needed, e.g., r(>r thc base memon· �1ddresses in the plwsical the HPF User Croup reported a general IJmcnr processcs. The usc of this rcbtivclv ncw construct is among users about the lack of debugger support. '"-20 groll'ing rapid ll·, cb·aring the import�m cc of being Browsing rhc vVo rld vV ide Web rcveals little on the supported bv debuggcrs. topic of HP� debugging, although some eft(lrtS ha1 c pro1·ided various dcgrees of sophistication.

Ensuring a Consistent Vie w A program can h�ll'e irs plwsical threads stop ar the Multiple Serial Debuggers s�1me place bur be in diftercnt ircrations of a loop. A simplistic :1pproach to debugging support is to start Aard1·ark mistakcnlv presents this St<1te as Sl'n­ a traditional serial debugger on each componcm pro­ chroni;,ed and presents dat�1 as if it wcre consistent. cess , perhaps providing a scparate window r( lr e

Dig;iral Tc clmic11 )oum.1l Vol .') 1\!o. 3 1')')7 61 Prism DIGTTf\l .'s HPF; the equiv�1lcnt of twi nn i ng requires a

The Prism debugger ( ' crsions d:tri ng ri·om 1992 ) , t(n­ sn·lisric ,,.a1· of coding and declari ng a dispatch loop . merlv fr om Th inking Nb chines Corporation, provides Thre<1d.-b·cl SPMD USU<1liv bas J pool ofthrc1 ds wJit­ d e bugging support fo r CM �ortL111.�1 ! ! The ru n- rime ing i n �1 dispatch loop, requiring Aarch ·ark to kno"" model of Ci\tl Fortran is essentiallv si ngl e instruction, some nKc hanics ofrhe run-rime support. multiple data (SIMD), \\'hich considerabh- simplifies The di fte ri ng ,·alucs mecha nism c:�n app!v to data i11 managi ng the program. The program gets compiled SPMD fX1 ra digms. DIGITAL's recent introduction of into an executable that broadcasts manoi nstrucrions Th read Local Storage (TLS ),2' modeled on the Thread to rhe parallel machine, even on the CM-5 synchro­ LoLal Storage t:1cility of Microsoft Visual C++1'' with nized multiple instruction, multiple lb t�1 ( MIMD) similarities to TAsKc 0 MM oN ofCrav Fortra n /'1 provides machine. Prism prim�1rily debugs the single program another sour(e ohbc same \'ariable h,wing potentiJIIv doing the broadcasting. Thcrdore, oper

arrav to inc l ude :-� process axis. For example, <1 rwo­ code, a variable can have several simul t�lllcous litetimcs dimcnsional 400 x 400 �HTa\' distri buted < *,cYcLIc) ( e. g., rile resu lt of loop unrol ling) or no acti1·c lifetime on ti vc processes is presented as a 400 X 80 X 5 �1rrav. (e.g., bet\\'een �1 usage and the next assignmem). New For oplicit message sen ding programs, Prism controls lkri,·ations of the locati ,·e class ca n describe the mu lti­ the target processes and provides a ""·here gL1ph," ple homes or the nonexistent home of a 1·ariablc. which has some of the ,·isual cues that Aank1rk's logi­ Fetc hing b\· means of suc h �1 Joclti,·e creates !1CI\ kin ds cal t!·ames provide. of' .:! l ues thJt hold �1 11 the 1·�1lues or an i ndica tion that there is no ,.�1lue. User interfaces become �111·are of

To ta/Vie w these ne,,· kinds of ,·a] ues in ways siIll ilar to thci r Recent ( 1997) versi ons of the Tot•1IVic\\' debugger, a\\"arencss of differing val ues. fr om Dolphin lmercon nect Solutions, l nc., prmidc Aard,·�1rk.'s method of aski ng a thread to r a single­

some support tor the HPF compiler ti·om The Portland s tepping run reason and empowering the reason ro Group, lnc.:·'·" TotaJVic\\" prm·i des " process groups," accomplish irs mission (an be tl1c basis tClr si ngle step­ which are treated more like sets tor se t-wide opera tions ping optimized code. Opti mized code generallv inrer­ tJ1an like a S\'nthcsis imo <1 si ngle logicJI cntin·. As <1 lc.11·es instructions tium difkrcnr source lines, rendering . result, no uni6ed view otrhe c:-�11 stacks exists. To tal View the standard "execute instructions until the source line can "di"e" into a distributed H PF arr�1\' and present it as numbn c hanges" method of single stepping useless. a single

dat�1 is not cu rrentlv intcgL1tcd imo the cxprc�sion semantic c1·cnts of a source line, Aan.h-ark can construct system, hcl\\·ever, so a conditional breakpoint such as a single-stepping run reason based on scmantiL n·enrs A< 3, 4) . LT . 5 docs not \\'ork. To ta!Vic\\" is being t·ather than line nu m bers. Single stepping an optimized acti1·clv dc1·c lopcd; h.t turc 1-crsions \\'ill likclv prm·ide H PF program immcdiatelv reaps the bendits since logi­ more complete support t(n Hl'f. Cli steppi ng is built on phvsical steppi ng.

Applicability to Other Areas Interpreted Languages I .ogi cal entities can be used to support debugging

Manv of the techniques that Au·dv ,1rk incorporates can interpreted languages such as Java'1 and Tcl .'1 In this apph· to other �1 reas, includi ng the single program, use, tl1c ph \sic al process is the operating s1·stem's multiple data ( SPM D) paradigm, de buggi ng optimi;.ed process (the ]�1\'a Virtual Machine or the Te l inter­ code, and interpreted l;1ngu:�gcs. p reter),

tion .';-li Aardvark's twinning <1 1 gori rh ms em be used svmbol tables rather than the svmbol tables of the in both cases. Process-le\·el SP1\l D is simtlar to plwsic:-�1 process .

62 \'ol. ') :-.:o. 3 1997 Summary 10. It is possible to al11·avs usc logic;ll entities, but some­ times ir is easier to \\'Ork 11·ith the bui lding blocks 11·hcn HPF presents a v:�riety of challenges to a debugger, additional structure might be cumbersome. In a simi­ lar 1-cin, Fortran could eliminate scalars in E11·or of including controlling the program, examining its call arravs of rcmk 0, bu r Fortran chooses ro re tain SC arc ge neric names fo r categories. Which particular stop rc\Sons L\ 11 into and a rich data-location model can manage HPF arrays and ex pressions involving arrays. Many of these ideas 11·bich category is a sepaLltc design question. Once the catcgorv is determined, the po licv prese nted can be can applv to other debugging situations. On the sur­ pert(>rmed. fa ce, debugging HPF can appear to be a daunting task. Aard,·ark breaks down the task into pieces and attacks 12. Debuggers often build a (phl'siC\1) ca ll stack fr om

them using powerful extensions to f.1miliar ideas. innermost ri ·ame to outermost fi·ame, inrcrleal ing con­ struction with prese ntation . The intcrlcwing gi 1-es the

Acknowledgments appc1rancc of progress c1-cn i r· there arc occasionc1 l delays between fl-a mes. The totcll rime required to reach the outermost fr ames of each pl11·sical thrc1d, I am grateful to Ed Benson and Jonathan Harris fo r wh ich must occur bdorc consrnrcrion of a logical c1ll their unwavering support of my work on Aardvark. stack can begin and before anv prcsent�rion is possible, I also thank Jon;nhan's HPF compiler ream and Ed's could be noticeable to the user. Parallel Software Environment run-time team fo r u a pr

REAL :: ARRAY_ARG_2D C:,4:)

1. flrogmmmiug Languar;e hJJ1ran 9U ANS I X3.l98- A deterred s hape array Ius either the ALL o cAT ABLE or l992 (New York, N.Y.: American National Stan(brds Po INTER �ttri bu re , speci ri cs ncirher the lo11U nm I nsti tute, 1992). upper bound, and often contains local procedure I'

3. High Performance Fortran �orum, Hig h Pe1jormanc.e 14. An :-rrray section occurs 11·hen some su bscript specifics Fo J1tWI Lanp, uage -�)!ectjicaliou. \'ersiou 2.0. This more than one element. This can be done 11·ith a sub­ specirication is avaibblc b1· anom·mous ft p ri·om soft­ script-triplet, which oprionalh· spccirics the lo11n an d lib.ricc .edu in the dircctorv pub/HPF. Version 2.0 is upper extcnrs and cl stride, and/or ll'irh cl ll i nteger I'CC­ the ri le hpf-120 .ps.gz. tor, rclr example, 4. C. Koc.:l bcl , D. LovcmJn, R. Sc hrei ber, C. Steele, Jr., ARRAY_3DCROW1 :ROWN , COL1 ::COL_S TRIDE , PLANE_VEC) Jnd M. Zosc l' nx' 1- lip,h Pcrj(Jr/1/CIJ/U! Fo rlra II Ha lld­ hoo/<. ( Cam bridge , t\ tL lss.: MIT Press, 1994). A fi cld-of-arnw operation spccihes the array fcm11ed b1· a tl cld of· each structure clement of e1rra1·, r( lr 5. MPl forum, "i\lf"l-2: b; xrensions ro the lvkssage -Passing em example, lnterbcc," available at http:/jwww. mpi-r()rum.org/

docs/m pi-20-html/mpi2-report.html or 1·ia the Forum's TYPE

6. M. Snir, S. Otto, S. H uss- Lederm an , D. Walker, and In general, each of rhcsc spccirics disconriguous j. Dongc1 rra, :1/l'l n1c Co mple!r! Reji:rence>( Cambrrdge, memory. Mc1ss.: MIT Press, 1995). 15. "DEC Fortran 90 Descriptor format," OFC Fo rtrc/ll

7. W. () ropp, E. Lu sk, Jnd A. Skjellum, f'si11,1.!, .\JPI 90 l'

16. The HPF arra1· descriptor ror the 1·:1ric1ble ARRAY in the 8. J. Harris er al., "Comp iling High Pcrfornunce Fortran r(>r Distributed - memory Svstcms," Oigiwl Te cbuical HPI-' fragme nt .Jollrllctl. vol . 7, no. 3 ( 1995 ): 5-23. 1HPF$ TEMPLATE TCNROWS,NCOLS) 1HPF$ DISTRIBUTE TCCYCLIC,BLOCK) 9. E. Benson et al., "Design of Digital's ParJIIcl Soft ware REAL :: ARRAYC NCOLS/Z,NROWS ) Er11·imnment," Di,qilnl Te c/JJtical jou rnal. 1·ol. . 7, no. 3 'HPF$ ALIGN ARRAYCI,J ) WITH TCJ,I*Z-1 ) (1995): 24-38.

Digit� I Te chnical journal Vo l .<) No. 3 1997 6.1 contains componcnrs cotTc�pond ing to each of the 30. U "<)(! Cu 111 tl!f711ds o 11d /Jirecl it ·es Rej'aci!U' \.fa 111101 ALIGN, TEMPLATE, lici r, f(>r c\Jtnt>lc, ''"l l il t<'l l l :\ \\'bile Poper (l'vLl\- 1996 ). This p.1per is REAL :: MATRIX(NROWS ,NCOLS ) a\·ailablc :n hrrp:/ 1\\'\\w.j,wasoft.corn/docs/white/ 1HPF$ DISTRIBUTE MATRIX(BLOCK,BLOCK) Lm gcn'/ or 'i;1 ;\ notwrnousftp ti·orn ft]>.ja' .1Soft.corn in rhc dirc.:crorv docs/papers , f()r C\dlllple, the fi le 17. The ,-,lri;lblc SCALAR in "-h ircp;lpcr. ps. L1r. '/.. . 1HPF$ TEMPLATE TC4,4) 1HPF$ DISTRIBUTE TCCYCLIC,CY CLIC) 32 J_ Oustcrlwut, J'c/ C/1/(/ the yz, 'f(,o/l'il (Reading;, /vl.1ss 1HPF$ ALIGN SCALAR WITH TC*,2) ;\dd i >on-IVcsl c\·, 1994 ).

is l' Mti;lllv rcplicllcd and \\'i.ll be stored on the same [>r occssors th ;l t rhc logic<11 second colu m n ot rhc tern· Biography pla te T i s stored .

1 R. R. Coh n , Suun:e-!.cn:l I JciJII.f.!,.[.! ill,l!. o( .\lllt>l!htlicullt f'aml!elizcd l'ru.� IWIIS. Ph.D. Thesis, Carnegie lvk l

ion Un i\ Crsity ( October 1992 ).

19. HPJ:' User Crouf', Febru;�n· 23-26, 1997, S;� rH:l Fe , 0: e\\ J\ !cxico. lntorrn .nio r1 <1bom rhc mee t ing is :1\-,lil­ .l ble :H IHtp://""'"'' ·LH1 l.gt"/H I'I-'/

20. [n ,1 trip rcporr, DJC!Tr\l.'s n.:prcscntlaincd .1 bout th e bck of g:ood de buggr ng supt>mt t( >r Hl'F Those who h:1d seen our I AarJ,·.l rk-b:lsl·d I debugger David C. P. LaFt·anee-Linden liked it " l or .. _ l r\n i nd usrri:d H I'F u'n coll1plai ned I J);l\ id Llh:111cc-Lindl·r1 is a t.> rincip<1 1 soft" .11-c engineer e mphatica l h· <1 bour rhe l.1ck of good d c lJLrgging su p · rn DlCITAL's High l'crfornJ;1ncc Tcchni c1l CompuLing port. _ I i\ norher i nd usrri;� l H PF user's I higgot con· g;rollJ' Since.: joinint'- J)JC!TAl. in 199l,he h<1S \\ mked cem is the LKk of good deb uggi ng tc1 cilitics." on rools t<>r P'1r:1llcl f>rocl·ssing, including rhc HPI··c.!J>ablc dclHrg;g.n dcsnihed in rl1is paper. He has Jlso conrri bu ted 21. l'rf., llt r:�cr :,· c;u ,de (Ca mbridge. t\ L1s.s.· Th inking ro Lhc implcrncnution of the I'.Jr;lllcl Sofu, :1 rc En,·ironrncnt tvb ch i nes Corpor;nion, 1992 ). ,md ro rhe cornpi l c·rime put(mnancc.: ot' thc HPf compi ler. l'r·ior to Joining DIC rTAL, l),wid "orkcd ar S\'111 holics, 22. (:If Jl,r/!wt l'm,r; mllllllill,l!. (,"tude ( Cambr·i d �c , ,\ L1ss Inc orl li-om-crJd Slli'I'Ort, ncrworks, operating S\'stcrn Thinking i\Llchincs C:orv()J'crf(H·rnance,;md CPU architectu re . He rccei,ni

23. 'li>!ol\ iew I :,·ers (,"u ide ( Dt>l ph in lnruu mncct Solu­ :1 ll .S. in rn.trhcnntic' ti-om M IT in 1982. riom, Inc., I 997) . 'fllis guid e is a,·,,il:lhlc 'i<1 <1 11011\'· m ou s ltl' ti·om ft.p doll1hinic.s .ctHl1 in r h c toL1h i c\\/ DOC:LIMF:\ IATIO:---! d i rccron . r\r the tin1c of \\ri ring rhc ti le is TV -3.7.5-L'SERS-MA0.1 l1AL.[>s.Z.

24. I'C,H/'1-' I·'�',-:' (,"uide (\Vilson, illc, Ore.· The Portbnd Croup, Inc, 1997)

25 A'-'11' l_.ortm u 00 ji >r /)f,!�ito/ ('.\I\ (M:I\'l�;mL ,\ LJss.: Digi t.l] Equipment Corpor;nion, October 199:>\.

2(J. "fine-Tu ni ng l"<>"·er fortr:m," .1111'.\jJn> 1'01\"/:N /-IJJ·­ Im/1 --Pru.!.Jttllllll<'l'�' (,"uidc( 1\ \"urH:lirl \'ie\\·, C.1lif.: Silicon Graphics Inc, 1994-1 91)(> )

27. "ColllJ>ilation Comrol Statements :1 nd Com pi l e r Directi, cs," nu: Furl mil ')() C:\et· .\fctllilltf (M<11 nard , r\·Llss.: Digital Equ i p m c rH Corpor:1rion. t( nt hcorninl:'­ in 1998 )

2R. Nl'icmt· .\ utes j(J!·[fli:!.!. ilo/ 1'.\'!Y I I[T,iou I iO/J (;\h\ · na rd, !Ybss .: Digit�! Equit>mcm C :orpor;Hi on, 1997).

' 29. "The Thrud Am·i b u tc,'' in .\fiuosuji \·isu({/ 1 ++ C++ l.illt,l!. 11up ,t · Nef('rence (I ic•1:vitJ/1 .2.() ) ( Red nwr1d , I Vas !1. · t\l i croso ft Pn:ss, 1994 ) : 3RlJ-39 1.

64 Digit.li Tec hnical Journal Vol. � No. il l�97