Editorial Tl1e D(�ilul Tccl:ntiwljournalis a refereed The tdlo11 i11� .tre tr.tdem;�rk.s ofDigital J�ne C. Blake, t'vbna�ing Editor journal published quarterly bv Digital Equipment Corpor.uion: :\lph.1SerHT. Helen L. Patterson, Fdiror Equipment Corpor,nion, 50 Nagog Park, AlpluSwion, AltaVista, DECner, Dil;rr.\L., Karhlct·n M. Srerson, Ediror AK02-3/B3, Acton, MA 01720-9843. rhc DIGITAL logo, DIGITA L C:S:JX. and Open VMS H .1rd-copv subscriptions can be order ed lw Circulation sending o chec k in U.S. funds (made pavable BSME is a tr.1ckm.u-k ofR.'iA 1),1[,1 Sec11rill, Inc. Catherine M. Phillips, Adminisrr�ror ro Digirol Equip ment Corporation ) ro rhe ccM.1il is a rr.ldcm;Hk ol·,·c:Mail, Inc., a wholl,·. publi shed- bv address. General subscription owned subsidiar�· ot' Lon1s DcYclopmcnr Production rates ,\re 540.00 (non-U.S. $60) for tour issues Corporation. Christa vV. Jcssico, Production Editor and 575.00 ( non- U.S. S!!5) tor eight i"uo. CRYPTOCard is,\ registered rradenurk of Elizabeth McCrail, Tvpogr•1phcr UniversitY and college protessors and Ph.D. CRYPTOC.ud Corp<;ration. Pc[(;r R. Woodburv, Illustrator srudcms in the elecoicol engineering and com­ purer science tields receive cornplimcnrarv sub­ IBM is .1 registered trademark of International tion r. Advisory Board scrip s upon regucs DIGITALc ustomers Business 1\'i�chine.s Corpor;1rion. lll

DIG !TAL employees mav order subscrip ­ SccuriD is a rcgistncd tr:Hknurk of Security tions through Readers Choice at URL DynamiLs Technologies, Inc hrtp://wcbrc.das.dec.com or by entering S/Kcv is;\ regi,stcred trademark ot'Bcll VTX PROF! LE •H the OpenVM5 system Communications R.csc.1rch, lnc. prompr. SPEC is a registered rr.tdern;lfk ot'rhe Srand;ml Inquiries, address changes, and compli ­ l\:.:rfOnn;lncc Ev.1luation Corpor.1rion. nu:nt.Hy subscription orders can be sent ro rhc Digilal '{(>cbnica!juurnalar rhe UNIX is .1 rcgisrernl rradenurk in rhe l 1nitcd

published- bv address or rhe e lectronic St:ttcs .lnd orhc:r C

Comments on rhe conrenr ot'anv paper and requests ro conracr ourhors arc \\'clcomed

and rn.1y be scnr ro rhe managing editor .H rhc pubJisbcd-ll\' or electronic mail address.

C:op,Tighr © 1997 Digital Equipment Corpor;Hion. Cop,·ing without fee is pcr­ rnirrcd provided thar such copies arc made t

The inform:�rion in rhejouma/is subject to ch,1ngc \\'ithour notice and should nor be construed as a commirrncnr by Digital Equipmenr Corporation or bv the compan­ ies herein represented. Digital Equipment Cover Design Corpor.ltinn assumes no responsibility for Tunndin� and ti rt:walb arc two dh:crive :1ny errors thar may appear in the journal. rcchnologit:s t(>r ensuring secure commu­ ISSN 0898-901X nicnions between rhc public Internet and private n<:tworks. Our covet· depicts encap­ Documentation Number EC-P8429-18 sulated and cryptographically .,ccurcd dara Book productionwas done by Quant.ic as "unreadable" numbers rr:wcling in <1 Communications, 1 nc. protective runnel until reaching rhe tircwall. The tirew;1ll functions as;\ screen rhar permits only authorizt:d cLna to p�ss into rhe pri,·are network wht:rc· p.1ck<:ts ofdat1 can bt: dt:cryptcd wirh a key rh.n is shared between the sender and rhc receiver. DIGITAL implcmcnrarions of a runnel and � tirewall arc presented in rhis issue.

The cover design is by i.tKinda O'Neill ofrhc DIGITAL lnd usrri.ll

Foreword Paul J. Cormier 3

ALTAV ISTA INTERNET SECURITY AND MAIL

The AltaVista Tu nnel: Using the Internet Kenneth F. AJdenand Edward P. Wo bber 5 to Extend Corporate Networks

Protecting a Private Network: The AltaVista Firewall J. Mark Smith, Scan G. Doherty, Oliver J. Leahy, 17 and Dermot M. Tynan

Developing Internet Software: AltaVista Mail Nick Shipman 33

ALPHA-BASED WORKSTATIONS FOR NT AND UNIX

DIGITAL Personal Workstations: The Design of Kenneth M. We iss and Kenneth A. House 45 High-performance, Low-cost Alpha Systems

Design of the 21174 Memory Controller for Reinhard C. Schumann 57 DIGITA L Personal Workstations

FurtherReadings 71

Dtgttal Tcchnic:d journal Vol. 9 No.2 1997 Editor's Introduction

DIGITAL has pioneered many net­ their experiences in deploying the House discuss the primary reasons f(x working developments i.n its 40-year AltaVista Tunnel within DIGITAL. initiJting a wholly new design: simul­ history. A recent development, AltaVista, Once data arrives-almost-at irs taneously ro take advantage of new, has ctptured the popular imagination, destination, the firewall is a filtering high-performance memory technolo­ as evidenced by worldwide accesses, router that determines whicb data gies and ro implement at a low cosr.

averaging 18 million per day, to this packers will be allowed to pass from A new, low-cost core logic design Internet search engine. Introduced in rhc public to the private ncrvvork. was needed ro fimction as rhe CPU­ 1995, AltaVista indexing of the entire lv!JrkSmith, Scan Doherty, Ollie to-memory interface. The result, Internet \Vas made possible by 64-bir Leahy, and Ocr Tynan compare types described by Reinhard Schumann, VLM Alpha technology. The index of firewalls; describe firewall functions w:ts tl1e 21174 single-chip core logic proceeds today at a pace of more than �uch as alarm systems, autJ1cntication, ASIC f()r the Alpha . 6 million pages per day. DIGITAL's and reporting; and present tJ1c design Designers were able to meet tl1eir Internet developments, however, ofcheAiraVista Firewall tor DIGITAL own aggressive performance go:Ils bv go well beyond search functions. UNIX. The AltaVista Firewall com­ f(xusing on reductions in the main Business users need greatJy improved prises both application-level and memory latency that was attributable

security and protection to integrate p�Kket-filtering functionality and ro rhc memory controller subsystem the power oflnternet connectivity implements the p1inciplc "that which and by using as much of the raw into their businesses. It is this need is not expressly permitted is denied." bandwidth of the that is addressed in tl1e papers on The development of the Alta Vista CPU's data bus as possible. tunnels, firewalls, and electronic mail. Mail product is presented by Nick Subjects tor papers in the next Additional papers in the issue feature Shipman as a case study in the issues issue of rhe .fou mal inc I ude the high-performance, low-cost Alpha being engineers who design products par:dlel SCS I technology, shared microprocessor-based workstations t(Jr business users of the Internet. desktop software, and a high­ with unique design features, such He relates several of the fundamental performance debugger. as a single-chip core logic ASIC. assumptions about engineering pro­ "Tunnel" and "firewall" are strong jects that were overturned by the

metaphors rhar developers use to engineering team; tor example, connote rhe kind of security software product definitionhad conventionally

necessary to prorecr business com­ started witJ1 tJ1c tcclmical issues to be

munications rransmirted over the addressed and now starred instead

Internet. Tunneling protects data 11�tl1 a product pw-cbase price. .htrtl1er, Jane C. Blake as ir travels in the public Internet in an eftorr ro ensure product sim­ McmuMillJ:t Fditor by providing secure encapsulation plicity tor the target customer, they within the standard TCP/I P proto­ imposed the principle of simplicity col. However, as Ken Alden and Ted throughout tl1e project-simplicity Wobber explain, additional security in presentation, in design, in meth­ measures arc necessary, specifically, ods, ::md in implementation. cryptographically secure encapsulated A low-cost, high-performance packers. The authors describe how workstation has been designed by secure network-level routing can DIGITAL's workstation engineering be achieved by combining tJ1e well­ group. In the first of two papers known technologies of tunneling and about the DIGITAL Personal secure channels. The paper includes \IVorkstarions, Ken Weiss and Kenny

2 Di1,'.it;ll Technical Journal Vol. 9 No. 2 1997 Foreword

The products presented in this mature. These factors quickly led issue are good examples of theresuJts to the DIGITAL AltaVista Firewall that can be achieved in extremely product. short periods of time-six to nine DIGITAL responded to this mar­ months-when research, product ket demand quickly, initially moving development, and services groups the SEAL technology to a standalone work together to bring world-class product on DIGITAL UNIX plat­ products to market. forms. The same engineering group In the case of the Firewall product, that developed the SEAL technology research rook the lead early, while tor tJ1e Services group seamlessly Paul J. Cormier the Internet was still used almost moved to product engineering, first DirectorofEngineerinl!,,AltaVista exclusively by the scientific and tech­ in the Internet Business Group and nical community. As one of the first later to the AltaVista Group. This Fortune 500 companies to connect smooth transfer of experience a.Jlowed In this issue, we focus on Internet to the Internet, DIGITAL quickly DIGITAL to go to market after a software products that arc part of the saw the threat to the security of its short, six-momh development cycle Alta Vist:� portfolio. These p:�rticular network and, in response, members and to be one of the first vendors to products arc notable because of the of its research group developed the ofter a st.andalone, commerciJI fire­ fashion in which they have been initial firewall technology. As with wJII product. developed and brought to market. many innov:�t.ions, this technology Engineering has .learned ti·om With the commercial Internet sofr­ was recognized by DIGITAL's cus­ research and carried that knowledge w:�re industry moving quickly from tomers as sratc-ot:the-art and was in :�nd experience through services :md nonexistence tO the most competitive turn demanded by tl1em for their own directJy to the product engineering and fast-moving industry in existence, uses. While this complex technology community. Moreover, engineering engineers have recognized the need was stiU in its infancy, DIGITAL has adapted its process to sr::�y com­ for innovation at all stages of product Services was able ro deliver high-end petitive within the Internet market, lik. To succeed in Internet software, security solutions to companies that enabling DIGITAL to be a technol­ engineers must be innovators both desired to connect to the Internet. ogy leader witl1 the AltaVista Firewall in technology and in product devel­ Not surprisingly, these companies product on DIGITAL UNIX and, opment. Innovation begins with also needed to address the same net­ more recently, on the \Vindows NT the concept and continues through work security issues that DIGITAL platform. product development and on to raced as a consequence of collJlecting The AltaVista runnel, or secure delivery to the end user; and the to the public inti-Jstructure. Starting '�rtual private network product (VPN), cvcle continues. with the firewall technology, the has similar roots to those of the fire­ From the beginning of the Internet SEAL Firewall Service was born, and wall technology. Tunneling was born revolution, DIGITAL h:�s been a DIGITAL became one of the very first in response to a need for visitors at significant player in Internet somvare Internet security somvare providers. DIGITAL facilities to securely tra­ and solutions development. The As more and more enterprises verse the trusted internal network company's success is attributable connected to the Internet and with untrusted packets. Again, the to the many diverse and technicallv experienced the same security issues research community rook the lead. t::lientcd groups that are f(JCusing DIGITAL had been facing, it became With the tllllJleiing concept their resources on developing soft­ evident that both firewall technology reversed, that is, access allowed from ware and solutions for Internet users. and the market were beginning to the untrusted external net\vork (the

Digital Technical journ,ll Vol. 9 Nu. 2 1997 3 Internet) to the trusted corporate way to solve DIGITAL's mobile­ As has been proven with these network and with encryption added, worker problem. products, the model that is good for the rudimentary basis tor today's DIGITAL Services has also begun the company and t()r the customer

product was put forward. to ofkr 1 he runnel product, coupled is one that includes

The newly tonned Internet engi­ ll'ith int(mnation services' pilot • Researchers incubating and neering group was ready to take the experiences, as a solucion to its cus­ piloting the technology in

technology and prototype forll'ard, tomers-the same model used with the labs

putcing into action a new instance of the initial tirell'all technology. • Engineering groups t"

As was the case with the firewall, a tal­ move f·()rward in using the Internet • Services groups developing a

ented engineering group moved the JS an efkctive business tool, standards repeatable solution tor customers initial product to market within six arc emerging that DIGITAL is help­ DIGITAL will continue to move monrbs. DIGITAL was once again ing to define. Future products arc technologies rapidly ti-om research able to lead in the Internet space and being dc1·doped based on the stan­ through products and on to solutions, dajm the first VPN product to surface dards and include fean1res that allo11· thus accelerating rhe use of tbe in tl1e market, one that currently has other companies, "'ho may have verv Internet as a mainsrrean1 business many compecirors. different securitv strategies :llld poli­ tool and helping businesses take As was also the case with the fire­ cies, to take advantage of the Internet advantage of the Internet and be wall, DIGITAL recognized a good in a secure and productive manner. competitive in their own markets.

use of this technology to solve one of The model of research, product its own problems. The telecommlllu­ development, and services II'Orking cacions costs of moving d1e U.S. -based together to deliver innovative, cunjng­ sales force to home offices and con­ edge products and solutions that use

necting it back into the corporate net­ the ubiquity of the Internet to solve work were becoming excessive. The rcal-ll'orld customer problems will information services organization ran continue to expand DIGITAL's a pilot with 2,000 sales people, using Internet capabilities and ofkrings. local Internet connections and the A cornerstone of the research­ Internet tunnel to authenticate users product-development-services model to the DIGITAL corporate network. is the t:-�lent and mind-set of the prod­ The solution was perfect because the uct engineering group. The advan­ tunnel supplies the encrvption capa­ tage of keeping intact the core of the bility that ensures the privacy of con­ Internet and AltaVista engineering fidential business data as it traverses groups through the entire techno­ the public network inti·astructure. logical cycle that I present here has The results ofthe pilot were stag­ enabled the engineers to react quickly gering in terms of the savings in to changing requirements and market telecommu1ucations costs and keep­ conditjons. The group has consistently

ing our internal network secure. responded ll'ith t\\'0 major product With this pilot in hand, informacion releases per year and some minor

services moved to ofter the tunneling releases needed to satisf)• J particular, service to other internal groups as a significant demand.

4 Digital Technical journal Vol. 9 No.2 1997 I Kenneth F. Alden Edward P. Wobber The AltaVista Tunnel: Using the Internet to Extend Corporate Networks

The public Internet has become a low-cost The public Internet is fast becoming a ubiquitous and connection medium for joining remote employ­ inexpensive medium t

Overview

Before rhe Intcrnet became pervasive, corporate net­ works were built ti-om leased and dial-in telephone lines. Such networks carried su bsrantial costs for both communications equipment and telephone service. Usually, security relied on the inaccessibility of rhe physical medium, and over the years, the risk of wiretap has proved to be slight when compared to password

Digital Tcd111ical Journal Vol. 9 No. 2 1997 5 cracking or other higher-level �lttack. The reason for intermediate node ("man-in-the-middle") will not this is th�ltmost telephone systems are both proprietary inrention�11ly read or modin' the data portions of tun­ and cemrally managed, and they are tbcrdore not easy neled pJckcts. To prevent such unwanted tampering, to subvert in the brge without a subst�lnti:d budget. we cryptographically secure encapsulated packers for The Internet brings opportunity J.nd challenge to passJge over the public net\\'Ork. Abstractly, data the modern corporate network designer. Glob:1l con­ passed over this secure channel appears once Jnd nectivity makes it possible to replace expensive leased onlv once at the rccei,·er as sent by the sender. lines and communications equipment with Internet furthermore, an attacker observing the public net­ connections. However, such connections LKk the work cannot read this cbta. Thus, tunnel encapsula­ physical security of telephone lines. Furthermore, tion ensures that private-network daragrams unnot direct connection to the Internet poses numerous, interact with tl1C routing algorithms oft he public net­ well-documented security problems. Conscquentlv, \\'Ork, whereas secure channels guarantee that the tun­ many organizations find it necessary to isolate their neled dat�l arri,·c intJct fl·om an authenticated source private networks behind firewalls-filtcring routers and that privacy is maintained. that place constrJints on packets Jllo\\·ed ro pass figure L depicts a secure runnel in operJtion. Nodes

between protected and public networks. The policy A and R ::ti"C tunnel endpoints, that is, packet routers decisions made in configuring firewalls alwavs im·olve that forw�1rd to �1nd from tunneled routes. Node A a difficult trade-off between securiry and functionalitv. processes datagrams in private network X and deter­ Cryptography makes it possible to emulate most of mines which, if any, should be routed to private net­ the properties of physically secure wire using Internet work Y. Node A then enc:1psulatcs all such datagrams connections. When enc1psulated at a suitable protocol and sends them securely across its tunnel connection level , cryptographically secured data cJ.n be allowed to to node B. Node B checks the integrity of each trans­ trJ\'ersc firewalls without substanrialll· weakening mission and then decapsulatcs and fclrll'ards the security policy. However, the encapsulation protocol cbtagrams ro network Y. 'The process is symmetric, must require no implicit trust in the router nodes and although this is not t1icrurcd. links that make up the f�1bric of the insecure nct\\'ork. These methods can be used to connect any sort of To solve this problem, the protocol employed by private network; however, our product is specifically the AltaVista Tunnel uses a symhesis of t\\'O \\'ell­ designed to connect TP networks Lw tunneling !P dau­ understood networking constructs: tunncling proto­ gr

NETWORK X TUNNELED NETWORK Y IP PAC KET 1 \ l \ -' \ -' -'<( <(-' ( NODE A NODE B T �--� �f--1 f--1 a: a: � lL �. lL � ) \ T 1--- INTERNET ------1 T

Figure 1 A Secure Tunnel in Operarion

Di�it�l Technical jounul Vol. 9 0:o. 2 1997 The choice bcrween UDP and TCP depends on that tunnel endpoints need not be packaged with whether datagrams or byte streams best apply to tu n­ or dependent on a specific firewall implementation. neling. Since our application is inherently connection Eventually, the emerging standards wi ll probab.ly pre­ oriented, TCP offers a natural fir, while any UDP vail fo r static tunnels; however, no standards exist fo r design must include an explicit means fo r reliable con­ transient (mobile) users and our solution remains nection maintenance. In addition, byte streams elimi­ quite viable. nate constraints caused by packet boundaries, so fr agmentation and maximum size determination pose Implementation fe w difficulties. Furthermore, byte streams enable fo rms of cryptography and data compression that As with many tunnel implementations,' we provide would be awkward to implement using datagrams. Of tunneling by tricking the operating system's routing course this flexibility docs not come without cost. TCP layer into fo rwarding packets to an emulated network adds an extra layer of reliable transmission, and per­ device. This device does not transmit packets directly, packer headers arc large . but rather it encapsulates them as data witruna higher­ The previous discussion lends no clear advantage to level protocol. The Alt:Nista Tu nnel implementation either protocol option. We chose to implement the contains three major components: the tunnel applica­ AltaVista Tu nnel using IP-in-TCP in order to simplif)' tion, the protocol handler, and the pseudo-device dri­ firewall security policy. As shown in Figure 2, a tu nnel ver. The majn fu nction of the tunnel application is to connection usually traverses at least one fi rewall. In interact with the user or system administrator and to practice, a runnel virtual connection is composed of modif)r the system routing tables to make tunneled several distinct TCP connections laid end-to-end. routes avajlable. This code also maintajns a database of 'vVhcre TCP connectjons meet, there is a bidirectional acceptable partner endpoints and matching crypto­ relay process that shuffles packets in either direction. graphic keys. The protocol handler implements the Such a relay service is included with most ti rewalls.' tunnel encapsulation protocol and all associated cryp­ We also offer an intelligent relay that participates in the tography. The pseudo-device driver is responsible tor tunnel connection protocol and therefore allows more redirecting packers tl· om the local IP stack to the fl exibility in choosing destination endpoi nts. encapsulation protocol handler and vice versa. By using TCP connections and relays, we minimize Figure 3 shows how the components of the the policy changes required to permit tunnel txaversaJ. AltaVista Tu nnel cooperate to process tunneled IP All that is necessary is to enable TCP connections packets. The diagram depicts a single-user client and a between the tunnel endpoint, which is on the private tunnel server. Although the same basic su-ucture network, and the relay, which is just outside the fire­ applies to all tunnel endpoi nt software, there are su b­ wall. (Note that relays are logically outside the firewall, stantial di fferences between single-user and server con­ al though they might be implemented on the firewall figurations, and between the UNIX and Windows machine.) Whether a generic or an intelligent relay is implementations. For example, the single-user version used, fi rewall-traversal connections always originate usually runs only while the user is actively connected. on a locally controlled network. Furthermore, TCP On the server side, the tunnel application is a daemon connection requests are infrequent, and therefore process that continuously wajts for connection requests TCP traversals are more tractable to log at the firewall and services existing connections. The fo llowing three than arc datagrams. Although the firewall industry has sections discuss the individual system components in begun to develop standards fo r IP-in-IP tunnels,' " our detail and, where appropriate, point out the djfferences choice of IP-in-TCP gives us the clear advantage between the VJJ"ious softwareconfigurations.

SINGLE-USER TUNNEL TUNNEL RELAY FIREWALL SERVER

I INTERNET I

Figure 2 Tunnel with Intelligent Relay

Digir�l T�chnical )ourn"l Vo l. 9 No. 2 1997 7 SINGLE-USER TUNNEL TUNNEL SERVER

TU NNEL APPLICATION .

• PROTOCOL PROTOCOL • I'" , IP STAC K HANDLER HANDLER r II I NETWORK u NETWORK PSEUDO-DEVICE DEVICE PSEUDO-DEVICE 1 DEVICE I

Y. • , INTERNET r • • • ... PRIVATE NETWORK

KEY:

ENCRYPTED UNENCRYPTED

Figure 3 Svstem Components and Data Flow

The Tunnel Application Routing As mentioned in the Implementation section, our tun­ The primary flm ction of rhc tunnel application is to nel works by manipulation of the system routing table. present a user interface (UI). Although each instantia­ In some environments, such as \Vindows 95, there is tion of the user interf1ce is slightly different, the flmc­ no fu lly integrated notion of packet routing (some­ tion of the application remains the S<1111 C. The times called IP t(Jrwarding). However, there is support AJtaVista Personal Tu nnel '97, a single-user configura­ fo r multiple network devices. Each network device bas tion, offers a straighttorward graphical user interbce a uniquely assigned IP address so that the IP stack can (GUl) (see Figure 4) that allows the user ro register a determine which device to use when transmitting set of target runnel servers, select from this set, and packets. The AltaVista Tunnel pseudo-device appears then establish and tear down connections. The to the operating system as just another network emphasis is on simplicity. A tunnel connection may be device. There is a one-to-one relationship between started from either a command line interface or the tunnel connections and pseudo-devices. During con­ GUI. If the GUI is used to start a tunnel, the GUI win­ nection establishment, the tunnel application activates dow can be minimized and ignored until the end of a pseudo-device and modifies the routing table to the tunnel session. The �lpplication logs all interesting include any newly reachable private network or net­ events, reflects current state through the user inter­ \vorks. The application then restores the original state fa ce, and notifies the user ofexceptional events. In this upon termination of the connection. configuration, only traffi c h-om local applications is The tunnel server is implemented in a richer routing directed over the tunnel, and no inbound tunnel con­ environ ment. Each server typically routes an entire IP nection requests arc accepted. class-C subnet\vork (254 addresses) but may support In the server configuration, rhe tunnel application is partial subnetworks or multiple networks as well. A significantly more complicated. The primary tl.mction tu nnel server can maintain multiple connections, and of the server code is to restrict ru nnel access to ;Jutho­ this is accomplished by assigning a different IP address rized clients. To achieve this, the server application is to each pseudo-device/tunnel con nection. IP pseudo­ �1 lso responsible to r issuing cryptographic credentials device addresses :tt both ends of the runnel arc assigned and maintaining an authorization database . In addi­ dytumically or statically fr om a pool of' IP addresses tion to accepting connections, a tunnel server is capa­ control led by the server. The operating system, com­ ble of initiating them. In the "workgroup" tunnel bined with a routing management program such as configuration, two servers cooperate to maintain a gated/ pert

Digit,ll Tc chnicJI )ourn,ll Vo l. 9 No. 2 1997 : := AltaVista Personal Tunnel - [ crl] 1!!1� 13 file I unnels .E.dit �iew .hog Window !ielp

User Name IT unneiM gr@digital. com .Connect

Server Key ld Jeri

T unnel Server JLR L. digital. corn Port I E;E;66

1st Firewall tunnel -firewall. digital. corn Port j6666

2nd Firewall Port j

� Log------, [0�3/04/97 Ot:: 44: 2J] ..... Looking up ho::) addres::: information - [09/04/97 Ot:: 44:2J]�; tarting tunnel to C:R L. digital.com . . [0�3/04/97 o:::: 44: 34] Connected . now authorizing ... - [0�3/04/97 44: 5] o::::: 3 .t..uthorized. loading ps:eudoadapter. .. [09/04/97 0:3: 44 35] Pseudo adapter initialized. [09/04/97 0:3: 44: 35] Tunnel up . client I P addres:::' i�< 1 E;_1 1.160. 63 �J _j

0 For Help, press F1 I Disconnected I READ:

Figure 4 The AltaVista Personal Tu nnel '97 User lnterbce

Key Management and Access Control could have obtained a similar result with symmetric key In practice, a secure channel protocol is only as strong encryption, we believe that the current design will as the techniques it employs fo r naming and key distri­ a.llmv our system to scale up gracefully through the bution. In the AltaVista Tunnel system, we must name addition of public key certification. both tunnel servers and human users. (Tunnel users When a new user is registered, the tunnel server must be authenticated by name, not by IP address, generates a new RSA.key and key file fo r that user. The since many users acquire IP addresses dynamically from user's public key is inserted into the server's key file, rl1eir Internetservice provider. ) Because no ubiquitous and conversely, the server's key is inserted into the infrastructure exists to support such a namespace, our user's key tile. To obtain enough randomness fo r key software currently assumes a flat, server-specific nam­ generation, we carefully measure tbe elapsed time (in ing structure, much in the style of PGP.'' We use RSA machine instructions) to perform each of a sequence public-key cryptography'0 to establish secure connec­ of disk seeks. These results are then hashed to provide tions. Each tunnel endpoint maintains a key file that a seed fo r a pseudorandom number generator. There is contains a sequence of names and matching public substantial evidence that the air turbulence between keys-one (name, key) pair per potential destination. hard-disk heads and platters contributes sufficientran ­ Each key file a.lso contains the password-encrypted pri­ domness for such purposes." vate key of its maintainer. The key file is signed by this Both single-user and server tunnel applications use private key to prevent tampering. Note that the com­ key files, and the credentials stored therein, as a mini­ promise of any given (nonscrvcr) key file docs not mum requirement to r successful authentication and affect the security of other endpoints. Although we authorization. Our server software places additional

Digital Te chnid Journ�l Vol. 9 No. 2 1997 9 constraints on incoming connection requests. In :� ddi­ tion to recording a new user's public key, tunnel l. A sends to B: A, B, (P,�>l, Kl> '(S,) servers maintain a small set of tunnel configuration 2. B sends to A: B, A, {P�>,), K '(S�>) parameters for each user. These parameters define the range of IP address pairs that can be assigned to the A and B compute: P, = Best ( (P,,l 1\ { P,,l) server and client pseudo-device, the set of network A and B compute: S = S, S, routing entxies that are passed from server ro client at A and B compute: K = Reduce (Pt, S) tunnel fo rmation, and the minimum level of encryp­ tion strength permitted fo r a tunnel connection. Figure 5 Tu nnel Kev Exclungc Protocol Creating or initiating J ru nnel connection can be a complex task, considering the network path the tu nnel connection might traverse. This path can include nvo intelligent relays, any number of generic TCP/IP Figure 5 describes our key exchange protocol; K'( ) relays, and a tina! tunnel endpoint. Requiring the user signifies encryption with the public component of an to remember such a path would have made the tun­ RSA key pair. Suppose tunnel nodes A and B wish to nel exceedingly difticult to operate. Therefore, the share an encryption key. Both cJn determine their AltaVista Tu nnel stores this information in an external partner's public key fr om their local key fi le . As shown contiguration file. Each new user receives both a con­ in Figure 5, node A invents a random number S.,, figuration file and a key file to initialize a newly encrypts it with node B's public key, and sends it to installed tunnel application. These tiles provide all the node B's network address. This message also includes dJta necessary to run the tunnel application-the user {P,,l, a set of proposed cryptographic algorithms and need only press the connect button. key lengths that node A considers acceptable for com­ municating with node B. Upon receipt of message 1, The Protocol Handler node B similarly constructs and sends response 2. Now nodes A and B can choose P, , a negotiated The AltaVista Tu nnel protocol handler is responsible choice of key length and algorithm, by intersecting to r establishing secure virtual connections benveen ru n­ sets { P,,) and { P,,, l and then selecting the best available nel endpoints and fo r encapsulating and transmitting option using an a priori ranking. Both parries can also redirected IP packets as data. These connections are vir­ compute S, a shared key seed, by decryption and tual in that they are composed of several distinct TCP exclusive OR. Finally, nodes A and B can produce a connections joined by relays. Upon tl1e establishment of shared key by reducing the shared seed to a kev in a each new virtual connection, the tunnel endpoints manner specific to P,. For simplicity, this prot col is engage in a dialog to :tgree on security parameters fo r executed fo r every new connection, and by defau; lt, a that connection. For our purposes, a secure connection new connection is established every 30 minutes. This must have at least the to llowi ng properties: technique gu

• Integrity-Data received over the channel cannot decrypt message 1, and only node A can decrypt mes­ be modified in transit. sage 2. As a rcsu l t, both parties em believe that K is known only to each other. An intennediJtc node can­ • Exactly-once dclivcry-E:�chdatu m is received once not control the negotiated key by intercepting mes­ and only once. sage 1 and then retransmitting a modified version to • Privacy-An attacker may not learn the contents of node B. (Note that this represents a deniJI-of-service transmitted data by observing the nenvork. attack.) Both node A and node B, ho,vever, must take We use cryptography to provide these properties. As some care in choosing their algorithm proposals. An discussed in the previous section, key fi les fo rm the intermediate node can fo rce the resultant connection long-term basis fo r trust benveen tunnel endpoints. parameters P to be the weakest proposal jointly accept­ Prior to transmitting data, the parties must perform able to both parties. This problem would be elimi­ mutual authentication and agree on a key length, a set nated if messages l and 2 vv ere cryptographically of cryptographic algorithms, and a shared encryption signed at their origin. key. It is important that keys be negotiated periodi­ Once the key exchange is complete, it is easv to see cally, since this minimizes the benefit an attacker can how to achic,·c the essential properties of secure con­

gain fr om breaking a speciftc key. In the current nections. We sign all transmitted data by appending AltaVista Tunnel , we perform the very simple key the output of a keyed hash fu nction under K, where a exch

[() Digit

Digital Technical journal Vo l. Y No. 2 1997 11 OPERATING SYSTEM

1 TCP/IP� TCP/IP+ TCP/IP STACK TRANSMIT RECEIVE FUNCTION FUNCTION

TRANSMITI TED RECEIVEDt NETWOR PAC KETS NETWORK PAC KETS � I

� I ARP EMULATOR PSEUDO-DEVICE -- DHCP EMULATO R (VERSION 1.0 ONLY) ' t TRA SMIT RECEIVE PAT HJ PAT H � I I ENCAPSULAT ION I I DECAPSULAT ION I

ENCRYPTION DECRYPTIONt I I I I tI

REAL NETWORK ADAPTER (SENDS AND RECEIVES PACKETS ON THE PHYSICAL NETWORK)

Figure 6 Control Flow in the Windows Pseudo-Device

address of the remote tunnel server pseudo-device to The network pseudo-device in AltaVista Tu nnel '97 this emulated host. The IP stack is tooled into believing is simple by comp::�rison. With help fr om Microsoft,

there are two nodes on the emulated network LA.J'\1 - our implementation is now able to emulate a Windows the tunnel client, vvhich provides the local node, and dial-up adapter rather than a LA.!'\! . DiaJ-up adapters the tunnel server as the gateway host to the r<'al net­ are treated specially b}' the Windows IP stack. No ARP work or networks on the other side ofthe tunnel. packets are directed at dial-up devices, only one emu­ When the IP stack prepares to transmit a packet, it lated address must be maintained, and intormation must know the lvlAC address of the destination or the about gateways and dynamic IP addressing can be gateway host. If the stack does not know the MAC supplied oj ier l in k establishment. This, of course, address, it transmits an ARP packet to the pseudo­ perfectly matches the runnel's operating environment. device. The pseudo-device responds only to ARP The control tlow outlined in Figure 6 correctly requests to r the gateway host J\llA C address. To resolve describes the operation of this new pseudo-device this ARP req uest, the driver includes the fu nctionality implementation; however, as noted, ARP and dynamic of an ARP server. Note that the MAC address must address emulation is no longer required.

be unique to prevent a conflict withthe MAC address of a real device. Clearly, no Ethernet device will ever Dyn amic IP Address Binding contain '' i\tlAC address of 08-00-2B-OO-OO-Ol or Each tunnel is uniquely identi tied by the IP addresses 08-00-2B-00-00-02, which are the first two Ethernet assigned ro the pseudo-device at each endpoint. The addresses that Digital Equipment Corporation ever tunnel server uses a separate pseudo-device fo r each assigned . In the AltaVista Tu nnel, the first of these active tunnel. The tu nnel server implementation could addresses always serves as the pseudo-device 1\1A C have used a single I P address and pseudo-device tor address; the second serves as the i\tlAC address of rhe multiple tunnels, because each client is unaware of any gateway host. other's existence. However, that would have required

12 Dig;ir,d Te chnic1l journal Vo l. 9 No.2 1997 additional routing complexity at the tunnel server. By machines. Without using the tunnel, the same config­ using unique address pairs, the routing tables on both uration produced throughput averaging 8.8 megabits the client and server can be maintained easily without per second. This translates to a tunnel throughput effi­ platform-specific software. This design also permits ciency of about 73 percent. In both tests, processor conventional packet filtering on the tunnel server. The usage never exceeded 50 percent. tunnel address space can be a valid, externallyvi sible or In another test, we used t\Vo 150-MHz AJphaServer hidden nenvork, thereby supplying an almost unlim­ systems running DIGITAL UNIX version 3.2 as tunnel ited number of addresses. endpoints. This tunnel was used to route between two To fa cilitate a very large number of registered users Pentium 166-MHz processors running Windows 95 fo r any given tunnel server, we implement dynamic and LapLink, a popular remote access program fo r reuse of address pairs. Since dynamic addresses arc portable PCs.17 Tunneled file transfers between these negotiated at connect time, we need to bind IP computers (with L1 pLink compression turned off) addresses to pseudo-devices after tunnel connection averaged only 15 percent slower than transfers without establishment. For the Windows platfo rm, we usc the the tunneJ . Thus, by using the UDP-based LapLink Transport Driver Interface (TDI)"' from within the protocol, we were able to achieve a substantially better pseudo-device driver to perform both dynamic tunnel throughput efficiency tlun that reported fo r address assignment and routing table modification. FTP. As before, processor usage never exceeded 50 percent. From these simple tests we conclude that tu n­ Performance and Experience nel performance is not limited by CPU speed but instead by the nenvorking environment and payload Tunneling docs add overhead to data transmission. protocol. We also believe that TCP /IP window size This overhead fal ls into nvo categories. First, the may play a role in limiting tunnel throughput. encapsulating TCP connection adds network overhead The cost of cryptography at the client is not a seri­ by introducing an extra level of framing, plus any ous performance issue. A 133-?viHzPenti um proces­ acknowledgment and retransmission traffic that is sor can compute RC4 at more than 25 megabits per required to support reliable delivery. Second, cryptog­ second and keyed hashes at nvice that speed. The cost raphy adds to the per-packet processing cost, although of server cryptography is mitigated by the fa ct that this docs not generally become significant at low most client traffic is bounded by low link speeds and speeds. More to the point, transmission of encrypted that servers typically run on fast machines. data defe ats the compression present in many modems. Digital Equipment Corporation is using the The most significantperf ormance impact is observed AltaVista Tunnel product to support its mobile work­ when using the Telnct protocol. Because this protocol fo rce and telecommuters. Previously, the company sends fe w characters per packet, the encapsulation over­ used wide-area, dial-up telephone lines at extremely head is quite l1igh. In addition, the overall network path favorable rates, but more than 30 percent of the IP benveen Telnet client and server can be long enough to traffic that used this service had a destination address make character echoing sluggish. Remember that the outside the company. This meant that the company round-trip nenvork path may pass through at least two was acting as an Internet service provider ( ISP) and tunnels, a firewall, and potentially several other routers. was doing so at long-distance rates! A short-term eval­ Thus, to properly support interactive applications, it is uation revealed that users who remotely connected to essential to choose an Internet path so as to minimize the company network fo r more than 10 hours per the round-trip latency to the destinationnetwork. month would achieve substantial cost savings by con­ The performance is considerably better to r nonin­ necting through a public ISP and using the AltaVista teractive applications such as File Transfer Protocol Tunnel. Local calls are still directly dialed to remote (FTP) or Hypertext Transfer Protocol (HTTP) where access servers (RAS) located in areas where employee packets are usually ti lled to capacity. To prevent IP density is highest. In an early pilot, the top 100 RA.S packet fr agmentation, the tunnel pseudo-device users were offered ISP accounts and access using the reduces the maximum transmission unit size by the AltaVista Tunnel. The reduction in monthly tele­ amount of the encapsulation overhead. For full pack­ phone costs was dramatic-enough to fund each of cts, the encapsulation overhead is less than 5 percent. the users' ISP accounts fo r more than a year! In addi­ We have observed a peak rate tor tunneled FTP file tion, many of these telecommuters can now use transfers as high as 6.4 megabits per second. This mea­ higher-speed options such as cable moderns or surement was performed over an unloaded switched Integrated Services Digital Network (ISDN) to con­ Ethernet between a 200-megahertz (MHz) Pentium nect to the public nenvork, yielding an overall higher­ Pro client running Windows NT version 4.0 and a speed connection into the company than using 300-MHz DIGITAL AlphaServer system running traditional directly dialed 28.8 Kbps modem access. DIGITAL UNIX version 3.2. In this case, the FTP client At the time of this writing, more than 2,000 and server programs ran on the tunnel endpoint DIGITAL employees worldwide usc the AltaVista

Digital Technical ]ourn;1l Vo l. 9 No. 2 1997 13 Tunnel in their daily work. vVe have observed that a arc subtle problems that arise ti·om routine use of single tunnel server can handle at least 125 concurrent remotely connected machines. In the single-user tun­ users, although the average number of active users is nel, we disallow the fo rwarding of packets not origi­ much smaller. In fa ct, the ratio of active to registered ru ting on the local machine, but by definition, this users rarely exceeds l to 10. Currently, DIGITAL cannot be the case fo r tunnel servers. Consider what offers its empl oyees three tunnel access points within happens if a telecommuter uses a tu nnel server on his the United States and is planning to deploy additional or her home LAr"! to access a corporate net\>mrk. The tunnels overseas. home LAN is then automatically part of the corporate network. Now suppose that a housemate similarly Security Risks connects to another private nct\vork f!· om a machine on the same home LAN. This configuration could Products like the AltaVista Tunnel are not risk fr ee. allow unintended routing bet\\'ccn t\vo private net­ The most obvious risks have to do with cryptography. \vorks! Real-time routing is not the only risk. A com­ Cryptographic security is never absolute. One can only puter that is exposed to the raw Inte rnetor to a hostile hope to keep the cost of mounting an attack high corporate network could become infected with a virus when compared to the value of a successful attack. or Trojan horse program that becomes active only Thus, any sensible application of cryptography must upon tunnel connection establishment. be well ahead ofany expected attackers in terms ofkey All these threats are real; however, the benefitsto be length and algorithm strength. Barring fundamental gained t!·om tunneling are substantial. Any policy that change in the science of cryptography, prudent engi ­ involves the deployment ofiP tunnels should carefully neering is sufficient to provide this advantage. Since counterbalance these risks �md benefits. At the very our implementation does not mandate a specific algo­ least, machines that usc tu nneling, especially if iP for­ rithm or key size, the strength of our product's cryp­ warding is enabled, should be more carcfirlly managed tography can improve over time. than those directly connected to protected nervvorks. As discussed earlier, the tunnel encapsulation proto­ col protects against many common threats such as Summary eavesdropping, impersonation, replay, and man-in­

the-middle attacks. Dcnial-ofservice attacks such as The AltaVista Tunnel was jointly prototypcd by tl ooding a tunnel server with connection requests researchers at t\vo DIGITAL laboratories, the remain a problem, however, especially since connec­ Cambridge Research Laboratory and the Systems tion request processing is compute intensive. Newer Research Center. The prototype effectively demon­ key exchange protocols, tor example, Oaldcy,1' prcbcc strated that the Internet could be used to reduce

such costly operations with exchanges of random val­ telecommuting costs, and within a year, it had grown ues that then identify subsequent messages. This tech­ into a DIGITAL prod uct.1'' Since that time, the nique defeats simple flooding attacks by allowing the AltaVista Tu nnel product bas evolved to offer support server to control the rate at which identifiers are tix a variety of client and server plattorms, as well as issued . Our product might benefitfr om this approach, improved performance and enhanced cryptography. although the benefit would come at the cost of an The AltaVista Tunnel is an effective tool for extend­ additional network round-trip. ing corporate net\vorks. By combining tunnels and Attacks on password-protected key files are a greater secure channels, it allows Internet access to supplant concern,especially since laptop computers can easily be leased telephone lines without substantial Joss of stolen. If accessed directly, many password-protected security. Our deployment of this technology within containers are subject to dictionary attack, and key files DIGITAL has cut costs dramatically without substan ­ arc no exception. Ifan attacker succeeds in compromis­ tially afkcting net\vork performance. We expect that ing a key file, the attacker can masquerade as the key secure tunneling wi ll play an important role in servic­ ti le's owner. The fa ct that users arc often careless in ing the telecommuters of the fu ture. choosing passwords exacerbates this problem. Of course, tunnel servers must be carefully protected . Acknowledgments A tunnel server not only holds valuable keying infonna­ tion but also enjoys special privileges tor firewall traver­ We owe thanks to many people f(x the success of the sal. An attacker who can compromise such a machine AltaVista Tunnel project. Virgil Champlin wrote the

can compromise an entire nct\vork. Therefore, tunnel IP-in-TCP tunneling code on which we based our servers should be handled as carcfidly as firewalls. current UNIX pseudo-device. Mitch Lichtenberg pro­ Perhaps the greatest threat posed by IP tunneling is vidcci help and guidance concerningall issues related tlut it extends tl1e perimeter ot· any firewall it traverses. to the Windows operating system. Wick Nichols con­ Because the tunnel traffi cs in encrypted IP packets, tributed to our initial design and implementation. The auditing at the firewall is dinicult. In addition, there AltaVista Tunnel group turned our research prototype

14 Di,;ital Tec hnical Journal Vo l. 9 No. 2 1997 into presentable sofrware . Our cryptography is imple­ I 2. H. Krawczyk, M. Bellare, and R. Canetti, "HMAC: mented using the BSAFE tool kit tl·om RSA Data Keyed -Hashing tor Message Authentication," Internet Security, Inc. Finally, Ed Balkovich got us to work on R.FC 2104 (February 1997). This document is avail­ this in the first place. able at ftp://d s.inrernic.net/rfc/rfC2104.m. 13. lntormation about the RC4 algorithm is available in References and Notes "Amwers to Frequently Asked Questions About To day's Cryptography, Version 3.0, Question 87" 1. \V. Cheswick and S. Bcllovin, Firewa/ls and Internet t'rom RSA Laboratories, Redwood City, CA at Sew rity (Reading, Mass.: Addison-Wesley, 1994): http:/j w\vw.rs:LCom/rsalabs/newt:1 q/q87. htm I. 79-80. 14. D. Plummer, "An Ethernet Address Resolution Pro­ 2. B. Lampson, M. Abadi, M. Burrows, and E. Wo bber, tocol," Internet RFC 826 (November 1982). This "Aurhenrication in Distributed Systems: Theory and document is available at ftp ://ds.i nrernic.net/rfc/ Practice," ACM Tra nsactions on Computer Sys tems, rfc826.txt. vol. 10, no. 4 (November 1992): 265-310. This paper 15. Network Device Interface Specification,Windo ws 95 is av

7. GateD-Gateway Routing Daemon, Merit Gate· Daemon Consortium, http://www. gated .org.

8. J. Mogu l, "Using screend to Implement IP/TCP Security Pol icies," Network Note NN-16 (Palo AJro, Ca lif.: Digital Equipment Corporation, Network Svstcms Laboratory, July 1991 ). Reissued as NSL Te chnical Note TN -2. This document is available at http:/jwww. research.digital.com/nsljpublications/ TN·2.htm l.

9. P. Zimmermann, Th e Ojjlcial PCP Us er's Gu ide (Cambridge, Mass.: MIT Press, I 995 ). This book can Kenneth F. Alden be ordered at http://ww w-mitpress.mit.edu/mitp/ Ken Alden is a consulting engineer at DIGITAL's re..:ent· books/ comp/pgp·user. h tml. Cambridge Research L< boratory. Ken's research interests include novel network appliances, image processing, and 10. R. Rivest, A. Shamir, and L. Adleman, "A Method to r digital photography. In previous work, Ken conrributed to Obtaining Digital Signatures and Public-key Cryp­ various other softwareeng ineering etlorts tor the OpenVMS rosystems," Co mmunications of the ACM, vol . 21, operating system and OpenVMS workstations. He was the no. 2 (rcbruary 1978): 120- 126. lead graphical developer fo r the VAX/TPU workstation to Jiow·on, the technical director fo r the Europea.n 11. D. D;wis, R. lhaka, and P. Fenstermacher, "Crypto­ Networking/ Workstation Roadshow, and the cu a lyst to r graphic Randomness from Air Turbulence in Disk Network Services Integration Services' Internet services Drives," Advances in Cryptology -CRYPTO 94 development. Prior to joining DIGITAL in 1982, Ken Conference Proceedings (August 1994): 114-120. received a B.S. in industrial engineering/somv:�re engi· This paper is available at http:/jworld .std.com/ nccring trom the University of Michigan in 1981. He has three patents pending, two related to nmneling technology -dtd/r:�ndom/forward .ps. and one 011

Digital Tec hnical Journal Vo l. 9 No. 2 1':>97 15 Edward P. Wo bber Edward "Ted" Wo bber is a consLLltingengineer at DIGITAL's Systems Re search Center. Ted 's research interests include computer security, distributed systems, and collabor:u ive systems, and he has published extensively in these areas. Prior to joining DIGITAL, Ted worked to r Xerox Corp­ oration in Palo Alro, California, where he designed and implemented networking protocols during the early days ofcliem-server computing. Ted received a B.A fi·om Harvard College in 1975. He holds three patents and has ten patent applications pending.

16 Digiral Te chnical Journal Vol. 9 No. 2 1997 I J. Mark Smith Sean G. Doherty Oliver J. Leahy Protecting a Private Dermot M. Ty nan Network: The AltaVista Firewall

Connecting an organization's private network The advent of electronic commerce as a means of con­ to the Internet offers many advantages but also ducting business globally has resulted in an increasing number of organizations connecting their internal exposes the organization to the threat of an private netv,rorks to the Internet. Most users of the electronic break-in. The AltaVista Firewall 97 for Internet and the World Wide Web (vVWW) view the DIGITAL UNIX protects a private network from technologies involved as leading edge, but many are malicious attack or casual infiltration by screen­ unaware that the fo undations on which these tech­ ing all internetwork communication. It enforces nologies are built are quite old. the organization's network security policy so The Transmission Control Protocol and I ntcrnet Protocol (TCP/I P) were first developedin 1979. The that only allowed network traffic can cross the primary fo cus then was to ensure reliable communica­ firewall. When installed on a dual- or multi­ tions between groups of networks co1mected by com­ homed host, the AltaVista Firewall applies the puters acting as gateways.' At that time, security was not principle "that which is not expressly permitted is an issue because the size of this Internetwas small and denied" and uses patented technology to screen most of the users knew each other. The base technolo­ each IP packet that attempts to cross it. A highly gies used to construct this network contained many insecurities, most of which continue to exist roday2.t flexible access control grammar and a compre­ Due to a number of well-reported attacks on private hensive reporting and alarm system enable the networks originating fr om the Internet, security is AltaVista Firewall to detect and react to harmful now a primary concern when an organization con­ or dangerous events. The AltaVista Firewall also nects to the Internet ' Organizations need to conduct includes an HTML-based user interface to ease their business in a secure manner and to protect their configuration and management of the firewall. data and computing resources from attack. Such needs are heightened as businesses link geographically dis­ tant parts of the organization using private networks based on TCP /IP An organization implementing a secure network must firstdevelop a network security policy that speci­ fiesthe organization's security requirements fo r their Internetconnect ion. A network security policy speci­ fies whatcon nections arc allowed bet\veen the private and external net\vorks and the actions to take in the event of a security breach. A firewall placed between the private network and the Internet enforces the security policy by controlling what connections can be established bet\veen the 1:\vo net\vorks. All network traffic must pass through the firewall, which ensures that only permitted traffic passes and is itself immune to attack and penetration. This paper comprises 1:\vo parts. The first parr pro­ vides an overview offirewal ls, describes why an organi­ zation needs a firewall, and reviews the diffe rent types of tlrewalls. The second part fo cuses on the AltaVista Firewall 97 to r the DIGITAL UNIX operating system .

DigitJI Tec hnical Journal Vol. 9 No. 2 1997 17 It discusses the product's requirements and describes control over traffic at the IP level. The majority of fire­ its architecture. It then describes the important aspects walls of this type are custom-configured routers. of the product's implementation. Finally, fi.Iture enhance­ Application-level firewalls disable packet fo rwarding ments fo r the AltaVista Firewall are discussed. and provide application gateways (also known as proxies or relays) for each protocol tlut can cross the firewall. Firewall Overview The application gateway relays traffic 1attl crosses the firewall. It can impose protocol-specific and user­ Any organization that connects to the Internet should specific controls on each connection and can record all implement an appropriate mechanism to protect the operations performed by the user of the connection . private network against intrusion from the external This type of gateway tl1erdore allows an organiZ

network and to control the traHic that passes between control (I)which individuals em establish a connection, the two networks. The mechanism used depends on (2) when a user can establish a connection, and ( 3) what

the value of the asset being protected and the impact operations a user can perform. It also keeps a record of damage or loss to the business involved. Typical of tl1e session fo r tracking and reporting purposes.

reasons for using a firewall to protect a private network Most application-level firew

within a private network. • Level of control over connections

• To control internal user access to tl1eexternal network • Level of logging and reporting to prevent the export of proprietary information. • Ease of usc • To avoid the negative public relations impact of a • Flexibility break-in. • Ease of administration and configuration • To provide a dependable and reliable connection to the Internet, so that employees do not implement • Private nenvork information made available their own insecure private connections. Operating Philosophy A firewall is a device or a collection of devices that The operating philosophy that a firewall implements is secures the connection between a private, trusted net­ the ti.mdamentaJ element of the security of the nenvork work and another network. All the traftlc benveen connection. For maximum security, firewalls must these two networks must pass through the firewall; apply the principle "tl1at which is not expressly permit­ tllis enables the firewall to control the traffic. The fire­ ted is denied ." That is, unless the firewall permits a con­ wall permits only authorized traffic to pass benveen nection, that connection is not allowed. Many packet the 1:\vo networks. The organization's nenvork secu­ filters allow any connection that is not expressly pro­ rity policy ddines what traftic is authorized. The fi re­ hibited (screend is an important exception) ." " wall is immune to attack and penetration and provides Packet filters are unsuitable for usc as a firewall a reliable and dependable connection between the 1:\vo because they require the operator to specify careti.I ily networks. It also provides a single point of presence fo r what traff-ic is allowed and what is denied. Application­ the organization when, fo r example, connecting to a level tirewal ls are specifically designed to impl em en t public ncnvork such as the Internet. this philosophy, so that each application gateway The ti.mdamental role of a firewall is to provide a con­ allows only those connections specified by its configu­ trol mechanism t(x the IP traffic benvcen nvo connected ration and denies all other connections by dd�1U it. networks. Firewalls provide nvo types of controls and arc

categorized either as packet-filtering (packet-screening) Level of Control over Connections or application-levelim plementations. Application-level firewalls allow a significantly greater Packet-filtering or packet-screening fi rewalls con­ level of control over who can establish a connection and trol whether individual packets are fo rwarded or what operations can be performed over that con­ denied based on a set of rules.' These rules specify the nection. For example, a File Transfer Protocol (fTP) action to take for packets whose header data matches application gateway, with the assistance of a user the rule criteria. Typically, a rule specifies source and authentication system, can identi�r a user who wishes to destination IP addresses and ports and the packet type establish a connection and can control what fTP opera­ (for example, TCP and User Datagram Protocol tions that user performs. For example, the gateway can [UDP]). The actions arc usually to allow or deny the permit GET operations and deny PUT operations. packet. Packet-filtering firewalls provide a basic level of

!8 Digiral Tt·rhnic;ll Journ�l Vol . 9 No. 2 IYY7 Packet filters can only support host-.level control over Application-level firewalls suffer the same disadvan­ who can establish a connection, with no restrictions on tage. The more sophisticated products on the market the individuals who can connect and what operations C

Digital Technical Journal Vol. 9 No. 2 1997 19 out sacrificing quality of security and service. Users • The performance of the firewall is an important also require protection without modifyingtheir usage concern. Application gateways must run as dae­ of the Internet. They expect the firewall to protect mons to improve performance and avoid process them without obstructing their work. A firewall must start-up delays.

be as transparent as possible to users establishing con­ • It must be possible fo r engineers in the field to nections between the private and external networks. extend the functionality of the AltaVista Firewall without requiting code changes to the product. Product Requirem ents In addition to firewall products, DIGITAL also offers a The AltaVista Firewall comprises the fo llowing three custom firewall installation service as part of its net­ major components:

work services. The AJtaVista Firewall has some of its • The graphical user inte rface ( G U I) provides the origins in technology developed fo r systems integra­ functionality for the user to configure and manage tion engineers constructing custom firewall solutions the firewall and manages the configuration files that fo r large cusromers. These engineers continue to build specif)'how the firewall will operate. custom solutions using the AltaVista Firewall to pro­ • The application gateways and support daemons vide the basic firewall technology. The AltaVista provide the nenvork security specified bythe con­ Firewall must also be designed to accommodate the figuration files. needs of integrators delivering custom firew:�lls. As described above, there are nvo types offirewall. • The kernelis hardened and modifiedto prO\'ide the Ty pically, filter-based firewalls perform better. Most extra security fi.mctionality required fo r the fire­ detailed product evaluations compare the perfor­ wall's operation. mance of the products under review, so the AltaVista Fire\vall must equal or better the performance of its Th e Graphical User Interface leading competitors. The GUI is based on a set of hypertext markup lan­ J\ll ost firewalls require that the administrator Jog on guage (HTML) template files, computer graphic inter­ to the host at the console-remote logins are not fa ce (CGI) scripts, and an HTML browser. It is enabled fo r security reasons. However, as organiza­ menu-driven and guides the user through the config­ tions centralize their nenvork management opera­ uration and management of the firewall. The design of tions, a key requirement fo r the AltaVista Firewall is to the GUI involved many trade-offs benveen si mplicity provide secure remote management functionality. of the UI and access to the firewall fu nctionality. We The AltaVista Firewall must itself be secure, and oftencho se to maintain the simplicity of the ur at the the host it is installed on must also be secure. It is expense of functionality tor nvo reasons: a product requirement that, while being installed, the 1. We must protect unskilled installers from installing product secures the operating system against attack. the firewall in an insecure manner.

AltaVista Firewall Architecture 2. Skilled instal lers must still have access to the fi.mc- tionality through the configuration files. This section describes the constraints that applied to the The CGI scripts are written in the C language and use design of the AltaVista Firewall. It then introduces the a set ofHTML templates to generate the GUI pages. major product components and summatizes the main Few static pages are used, so the firewall can be modi­ functions th ey provide. Then it lists the steps taken to fied in the field to add more application gateways, establish a connection through :�n applicationgateway. authentication methods, abrms, and alarm actions. The fo llowing constraints were applied to the Because the GUI is HTM L-b:�sed, it can be ported design ofthe AltaVista Firewall: easily to any platform that supports an HTML browser. • In every design decision, the security of the firewall must be the primary concern. The Kernel The DIGITAL UNIX kernel has been hardened to • The basic installation and use of the firewall rnust protect the firewall host and modifiedto add fi.mction­ be simple and must be guided by a simple user ality required by the firewall. The fo llowing modifica­ interface. tions were made to the kernel to improve security and • All firewall component fu nctionality must be con­ to support the operation of the firewall: figuredusing text-based configuration files so that systems integrators can install custom solutions. • Two mechanisms have been added fo r protection These files must be consistent across platforms, so against routing attacks. that the same set of configuration files can be used We added nvo mechanisms to the kernel to prevent on diffe rent platforms to configure heterogeneous attackers from using routing information to launch firewalls. an attack through the firewall. The first mechanism

20 Digital Technical Journal Vo l. 9 No. 2 1997 �1 llows the firewall to be configured to ignore • dnsd: The name sen·icc daemon provides the requests to r it to change its routing tables. These Domain Name Service (DNS) to the application requests come in the to rm of I mernet Control gateways and to the users of the firewall. Message Prorocol (ICJ\II P) redirect messages,' which • fwcond: The tirev.:all control daemon monitors the the fi rewall ignores. The second mechanism allows application gateway and support daemons that th e firewall to be configured to drop al l IP packets must run on the tircwall and restarts any that stop that have source routing options set in the packet operating. header. Source routing can fo rce a packet to take a specified route through the Internetand is not nor­ In additionto these support daemons, the to llmving mally used. IP packets with source routjng options statically linked libraries provide common services to the application gateways and the other firewall components: specified are considered to be evidence of an attack, so it is valid fo r the firewallto ignore these packets." • The access control list (ACL) library allows or • Interrace access filtering has been implemented to denies access to clients based on the appropriate prevent IP spoofing. access control Jisr. Interface access fi ltering provides a mechanism to • The firewall logging (fwlog) library provides a uni­ specif)r what IP packets are to be accepted on a form logging f(x mat fo r all firewall components. particular network interface. The source address fo r This library also provides the interface between the each packet is inspected and compared agajnst a other firewall components and the alarm daemon. filter list. The packet is passed through to the kernel The main fe atures of the application gateways, sup- or dropped, depending on the corresponding filter port daemons, and libraries are described in more list action. This provides the ability ro prevent detail below. Figure l and the fo llowing steps show attackers on the external network fr om fo rging IP how these components interact when a network con­ packets so that they appear to come from trusted nection is established through an application gateway. hosts on the private network • l. The application gateway receives the connection • Interbce trust group support has been imple­ request. mented to support packet filtering. 2. The applicationgateway sends requests to the name Trust group support provides a mechanism ro spec­ service ( ro get the host name of theclient and if)' ,, color for a network interface. When a packer dnsd) the host name or I P address of the server. arrives on a given interface, the kernel marks the packet with the color of the interface. The applica­ 3. When the application gateway gets all the infor­ tion gateways and the packet-filtering daemon can mation available to it fl·om the name service, it use this information to apply filtering rules to al low sends a request to the ACL system to ask whether or deny the packet. to allow the connection. At this stage, the gateway does not have an authenticated name for the user • Transparent proxy support has been implemented attempting to connect, so it asks the ACL system to support transparent application gateways. Kernel whether unknown users are allowed to make the support tix transparent proxying is described in the reg uested connection. section on Packet Filtering. 4. If the connection is allowed , the application gate­ way makes a connection with the server. Data can Th e Firewall now flow between the client and the server The main fu nctionality of the firewall is provided by through the firewall. The gateway also generates a the application gateways that provide connection ser­ Jog message to r the connection. vices ro users on each side of the firewall. A number of daemons also implement common services to support 5. If the connection is denied by the ACL system, it the operation of the application gatewa}'S. These dae­ is still possible that the connection is allowed fo r mons are described here: certain users, so the :.1pplication gateway prompts the user f()r a user identifier so that it can be • screend The packet-screening daemon provides packet authenticated. filtering and transparent pro>..')'ing fu nctionality. 6. The user specities an identifier to the application • alarmd: The alarm daemon provides logging and gateway. The gateway sends a request to the alarm fu nctionality to the application gateways and authentication service (authd) to authenticate the other firewall components. user and initiates an authentication sequence. • authd: The authentication service daemon pro­ 7. The authentication service responds to the appli­ vides user authentication services to the applica­ cation gateway with a challenge for the user. The tion gateways. gateway displays the challenge to the user.

Digital Te chnic:tl Journ�l Vo l. 9 lo. 2 1997 21 FIREWALL SYSTEM

AUTHENTICATION SERVICE

SERVER APPLICATION CLIENT APPLICATION GATEWAY APPLICATION

� t t � ACL SYSTEM DNS DAEMON

ALARM SYSTEM (INCLUDES LOGGING)

Figure 1 Interaction between Appliution Gateway, Support Daemons, and Libraries

8. The user uses this challenge to generate a response. filtering fu nctionality. Packet-filtering functionality is The user then enters this response. This is passed provided by the well-known packet screening dae­ by the application gateway back to the authentica­ mon, screend,''·' which is shipped with DIGITAL tion service. The authentication service uses the UNI X. The AltaVista Firewall replaces the standard response to authenticate the user and confirms or DIGITAL UNIX version of screend with a modified denies that the user is authenticated to the gateway. version that extends its packet-filtering functionality If the identifierspe cifiedis unknown, the authenti­ and provides support fo r transparent proxying. This cation service fa kes an authentication sequence and section provides an overview of how screend operates then denies authentication. When authentication is as parr of the packet routing in a dual- or multi-homed denied, the authentication service logs an event host. It then outlines whv a firewall requires transpar­ that may result in an alarm being triggered. ent proxying and describes how screend is used to 9. If the user is authenticated successful ly, the appli­ implement transparent proxying. cation gateway sends another request to the ACL Screend is a daemon that runs in user space. It exam­ system, this time with the additional information ines every IP packet that is being fo rwarded by a fire­ of the authenticated user identjfier. wall and decides whether that packet should be fo rwarded or dropped. Screend bases its decision on a 10. Ifthe com1ection is denied , the �1pplication gateway set of filter rules specified in a configuration file that logs an event, which may result in an alarm being it reads at start-up. When an I P p:1cket is received triggered. The user can try to authenticate again. by a DIGITAL UNIX host, the fi rst check that the 11. 1f the connection is allowed, the application gate­ DIGITAL UNIX kernel performs is whether or not the way logs a message specifying that the connection packet is destined fo r this host. If the packet is destined has been accepted and makes a connection with tor another host, then most hosts discard the packet. the server. The client and server may now However, a DIGITAl UNIX host can be configured exchange data. (by using iprsetup to switch on ipfc:>r warding) to route 12. When either the client or the server terminates the packers that are destined for other hosts. This is useful connection, the application gateway logs two when hosts that are not on the same physical network messages: one specifics that the connection has need to communicate. In this c:1sc, a number of terminated, and another reports statistics such as routers must exist between the communicating hosts duration and amount of data exchanged. that can pass packets from one physical network to :mother. These routers have at least two network inter­

Packet Filtering bees, as shown in Figure 2. The router can then be The AltaVista Firewal l fo r the DIGITAL UNIX operat­ configured to pass packets between the nerworks on ing system provides both applicatjon-level and packet- each of its interfaces. In Figure 2, fo r example, if the

22 Digital Te chnical Journal Vo l. 9 No. 2 1997 ROUTING packet to the appropriate gateway. Then, if the gate­ NETWORK A GATEWAY NETWORK B way allows the connection, it builds a new connection between the firewall and the server. From the user's 1-- - perspective, the firewall is transparent and has played no part in the establishment of the connection. However, the gateway processes al l the user's requests. L ' L � When a connection is made by a transparent gateway 100.0.0.1 200.0.0.1 between a client and a server, two TCP connections 100.0.0.2 I I 200.0.0.2 exist. The first is between the client machine and the firewall. The client host address and the server host

Figure 2 address are the source and destination addresses of Routing Packets between Networks packets on this connection. The second TCP connec­ tion is between the fi rewall and the server host. The firewalland server host address are the source and des­ host on network A with address 100.0.0.1 wants to tination addresses of packets on this connection. send a packet to a host on network B, the packet is The AltaVista Firewall uses screend to implement initiallv sent to the routing host. The router receives transparent prm.)'ing. To do this, changes were made the pa ket on interf:1ce 100.0.0.2. The routing soft­ to the DIGITAL UNIX kernel and to the screend appli­ ware in� the kernel then sends the packet to the host on cation. The most significantcha nge to screend was to netvvorkB. add a third option to how screend deals with a packet. When a DIGITAL UNIX machine with ipforward­ Although the standard screend implementation can ing switched on receives a packet that is not destined only accept or reject packets, the implementation of fo r itself the packet is passed to a fo rwarding module screend that is shipped with the AltaVista Firewall can in the kernel. This module decides whether the packet also proxy a packet. In this case, the packet is not passed can be fo rwarded and what network interface to send on to the IP fo rwarding module in the DIGITAL it to. When screend is running on the host, a third UNIX kernel on the firewall. Instead, the kernel sets a operation is inserted afterthe host has decided that the flag in the packet to indicate that it is being proxied, packet is not tor itself and before the packet is sent to and the packet is sent back into the IP input module. the fo rwarding module. This intermediate fu nction When the IP input module detects that the packet sends the packet to the screend process, which decides is being proxied, it treats the packet as if it were whether the host is allowed to forward the packet, addressed to itself, and passes the packet to its own based on its set of rules. If screend rejects the packet, it TCP input module. This then passes the packet on to is discarded. Ifscreend accepts the packet, it is sent to the application gateway that is listening on the appro­ the fo rwarding module as normal. priate port. The application gateway determines what The AltaVista Firewall is primarily an application the intended destination fo r the packet is and, subject gateway firewall. This means that all hosts interacting to the access control specified fo r the gateway, creates with the firewall route packets to the application gate­ a new connection to the requested server. wavs on the firewall. The application gateway then Screend imposes an overhead on packet fo rwarding. op ns a connection to the remote service if the Most of this overhead occurs because of the need fo r a requested� connection is allowed. Because the fi rewall system call from screend to deal with every packet tra­ never has to fo rward packets, IP fo rwarding could be versing the AltaVista Firewall. The designers of the switched off at the firewall, and screend need not run. AltaVista Firewall decided that this overhead would If a user attempts to connect to a server on the other seriously degrade performance, and a cache fo r screend side of a firewall, however, the user must understand was implemented in the DIGITAL UNIX kernel. This that the application gateway is present. Instead of con­ cache mai ntains the ways in which screend deals with necting directly to the server wanted, the user must most open connections, so that only new connections fi rst connect to the firewall and then request an appli­ suffer the overhead of a system call fr om screend. cation gateway to establish a connection to the remote Screend can also be used as a packet filter fo r connec­ server. This can be especially awkward with some tions to pass through the firewall at the IP level. GUI-based clients. Packet-screening routers are not as safe as application Transparent proxying was developed to overcome gateways, so the AltaVista Firewall does not use this problem. When an application gateway that sup­ screend to fi lter packets through the firewall and does ports transparent proxying is running on the fi rewall, not provide a fe ature in the user interface to configure the user can make what appears to be a direct connec­ screend. Experienced firewall administrators, however, tion to the remote server. In reality, the connection can configure screend to allow IP-Ievel connections to fr om the client machine is routed through the firewall, pass across the firewall when they need to support pro­ which intercepts the connection and redirects the tocols fo r which application gateways are not available.

Digital Technical Journal Vol. 9 No. 2 1997 23 Application Gateways This requires that each of the support daemons, As previously stated, the AltaVista Firewall is primarily such as the alarm daemon, the name service daemon, an application gateway firewall. As the name implies, and the authentication service daemon, must also application gateways operate in user space at the appli­ process requests fi·omthe application gateways without cation layer of the open system interconnection (OSI) blocking. Th is design resu lted in significant improve­ model, controlling the traffic between directly con­ ments in performance and in memory usage. nected networks. A separate gateway listens on the Performance has also been improved by caching name appropriate TCP/UDP port on the fi rewall fo r each service lookups in the name service daemon. protocol that the firewall relays. This approach pro­ Independent tests have demonstrated that the vides a high level of control over all major TCI'/IP AJtaVista Firewall 97 pertC>nns better than packet-filter services and allows extensive logging, neither of which firewalls, even at Fast Ethernetspeed s. 1'1 packet-filtering techniques can provide. For example, The AltaVista Firewall currently provides the tell­ it is possible to allow only authenticated users to copy lowing application-level gateways: ti les, using FTP, out of a private network using the • vVW'0/ (including Hypertext Transfer Protocol FTP PUT command, while also allowi ng anyone to (HTTP), gopher, fTP, and secure socket layer copy ti les from external servers into the private net­ (SSL) fo rwarding) work using the FTP GET command. This high level of control is also apparent in the log Jiles, which show the • Simple Mail Transtcr Protocol (SMTP) source and destination addresses (both in rr fo rmat • FTP

and as a fully qualifieddomain name [FQDN]), as well • Te lnet as the commands executed, names ofriles transferred, • RealAudio/ReaiVidco and the number of bytes transferred in each direction across the firewall. • SQL*Net By ddault, the AltaVista Firewall is configured with­ • Finger our :.my screend packer filter ru les that might allow net­ • Generic TCP work traffic to cross thefirewall at the I P level. Theretore, • Generic U D P traffic traveling in both clirections must be cli rected to the firewall where it will be relayed by tJ1e appropriate Web traffic comprises the majority of activity on application gateway. This highlights another significant a typicaJ firewall. The W'vVW application gateway advantage of application-level firewalls-the ability to includes specificfi. mctionality for perform network address hicling. This allows customers • Blocking Java class files to protect users to use RFC 1918 (Address Allocation to r Private lnternets)" addresses on their internalnet work, since the • URLfil tering to deny access to certain Web sites gateway hides tile rr addresses of tile hosts on tile private • Caching Web documents to improve perrcmnance network inside the firewall.

Historically, application-level firewalls pertormed Alarm System poorly, because a new process was forked to handle each The Alta Vista Firewall uses a dedicated alarm system connec6on. For FTP and Te lnet, tl1 is is not a serious to monitor the security of the firewall and to respond issue; however, witi1 vVe b traffic,a single URL request to attempts to circumvent the security of the firewall. nlJy result in numerous oti1er requests for in-line images. It can also respond to less serious anomalies such as f(x The O\'erhead involved in to rking a new process each incorrect passwords. The alarm system is implemented connection is not acceptable. The application gateways by the alarm daemon alarmd, which monitors events used in the Alta Vista Firewall have been designed to generated by application gateways or other servers on address tilis limita6on, witi1out sacrificingsecurity. the ti rewall. These arc communicated to alarmd by Each application gateway is implemented using a sin­ means of log messages. gle process to handle all connections. To process each The alarm system associates one or more alarms connection, the gateway multiplexes between open I/0 with each event. E�Kh alarm includes one or more file descriptors using ti1e select( ) system call, performing actions to take when the event occurs . The alarm that nonblocking reads and writes, and buffering all data is b·iggered when an event occurs depends on the state until it can be passed on to tl1e client or server. This non­ of the firewall, that is, the current level of security blocking fi.mctionality allows the gateway to handJe awareness of the firewall. The state of the firewall is other connections without having to wait tor ti1e represented using the colors green, ye llow, orange, client/server to process the data already sent to it, or to and red. Each color represents a different security wait t(J r the name service daemon or tJ 1e autJJelltication awareness level, ranging ti·om under no threat to service daemon to ren1rn with a response. under serious threat.

24 l)ig;it:JITe chnical Journal Vo l. 9 No. 2 1997 The fo llowing list describes each state and the level Although users logging directly into a computer of awareness associated with the state. system or an internal network run certain risks of hav­ ing their passwords discovered by another person, • Green: The firewall has not detected any events, most organizations arc happy to usc what arc termed and all appears normal. reusable passwords to protect access to computer • Yellow: The fi rewall has detected one or more resources within a private network. However, when a events that may indicate a malicious attempt to user wishes to gain access from an external network compromise the firewall or the private network. If through the firewall to an internal network, the usc of no further event occurs in the next two hours, the reusable passwords presents an unacceptable risk due firewall returnsto the green state. to the availability of IP packet snifters. A packet snifter • Orange: The firewall has detected events that arc is a tool that allows an attacker to read the data in an IP construed as malicious. Events that cannot be cre­ packet and idcntif)r a user's identifier and password ated by harmless or accidental operation cause esca­ without the user becoming aware that this has hap­ lation to this state. For example, if the DEBUG pened. It is dear that a stronger fo rm of user authenti­ command appears during a mail (SMTP) session, the cation is required for usc with a firewall that controls firewall cannot revert from tl1c orange state to the access between rwo networks. yellow state; the firewallad ministrator must explicitly Several authentication mechanisms have been set the state back to a lower level once the threat has devised that arc based on the concept of a one-time been analyzed and appropriate action tal(cn. password. In the one-time password scheme, a parti­

• Red: The firewall has detected events that may cular user has a mechanism to generate a new password result in a breach of the security of the firewall or of each time he or she is requested to respond with a pass­ the private network if not addressed immediately. word, and the authentication system has a similar vVhcn an escalation to this state occurs, the inter­ mechanism to validate the one-ti me password as cor­ face to the external interface is usually disabled rect fo r that user. Most one-time password systems are immediately, preventing any fu rther IP traftic from based on a shared secret that is used in conjunction that network. with some software or a hardware device (commonly known as a handheld authenticator [HHA]). Typically, The current tircwall state is constantly displayed in the system challenges a user, often with a value. The the background of the G I and on the console moni­ user then uses an HHA to generate a response. The tor. The alarm system can be configured manually or authentication system receives tl1e response, and deter­ using the GUI and allows the fi rewall administrator to mines if it is the one expected from the user. Further specifY the alarm to be triggered tor each event occur­ attempts to authenticate involve a different challenge ring at one or more tircwall states. Each alarm speciE­ and a correspondingly diftercnt response. cation includes one or more actions. For example, an The AltaVista Firewall supports a wide range of user alarm can advance the fi rewall state, shut down an authentication mechanismsand can be easily modifiedto application gateway or the firewall, and notify the fire­ support additional mechanisms ifne cessary. The aut11cn­ wall administrator. The administrator can write shell tication mechanism used to autl1cnticatc a user can be scripts tor additional specificactions to take. This can different depending on whether the access requested is enable the fi rewall to actively counter a threat to its inbound or outbound. This allows an organization to security or that of the private network. apply less stringent controls to accesses tTom tl1c internal network. Users arc registered with the authentication Authentication Service system tl1rough the and their details are stored in a When users log on to most computer systems, they GUI, database. This database includes the inbound and out­ spccif)r a password to prove their identity to the sys­ bound authentication mechanisms that apply to tl1c user tem. User authentication is the process by which a

Disir.ll Technical journal Vo l. 9 No. 2 1997 25 When an application gateway on the fi rewall nizations may decide to prevent external hosts ftom requires a user to supply an identifier, a sequence of viewing information about internal hosts. Restricting actions occurs as fo llows: this type of- information allows an organization to pre­ vent a would-be attacker from gaining a fo othold to l. The gateway connects to authd and specifies the launch an attempt to subvert the secutity of a ptivate net­ identifierfo r the user. work. For example, an attackermay support a claim to be 2. Authd accesses the user database and, aftercheck ing an employee of-an organization by showing knowledge that the user is registered and that the user record of the network structure, or a recruitment agency m::�y has not expired, determines what mechanism to use identif)r personnel engaged in development activities. to authenticate the user. Traditionally, hiding information about internal 3. Authd contacts the relevant authentication server hosts required providing two separate name servers. :m d passes the user identifier to it. An internal name server ran on a server behind the 4. The authentication server generates a challenge fo r firnvall and handled internal DNS queries. The fire­ the user. Authd passes this challenge to the user. wall ran the second name setTer and handled DNS queries from external hosts. In this way, the complete 5. The user generates a response, which the gateway name service database was available to internal hosts, provides to the authentication server, and valida­ and the restricted database (which usually contains tion proceeds. records fo r the firewall itself and any external WWW or 6. Ifvalidation fails, the gateway is informed and access is FTP servers) was available to external hosts. The draw­ denied. If validation succeeds, the user may proceed. back of this approach is that a second host was In the event a request is received to validate a user who required to run the internal name service. is not registered or whose user record has expired, authd If an organization is not concerned that its internal will take a user authentication sequence. This additional name service is available to the outside world, the fire­ security measure ensures that a potential attacker cannot wall can act as a name server fo r both internal and determine ifthe specifieduser identifieris valid. externalque ries. This approach raises some questions Adding a new authentication mechanism is simple, regarding how information is delivered. For example, because configuration files specify the port to contact the mai l exchange (MX) records specified for internal fo r each type of authentication mechanism, and a hosts will be incorrect fo r hosts on the external net­ common authentication protocol governs the interac­ work. If the proper hierarchy is shown fo r mail deliv­ tion between authd and the authentication servers. ery (that is, the internal host has the lowest MX The authentication protocol used between authd priority, fo llowed by the local mail hub, then the main and the authentication servers is modeled on the FTP corporate mail hub, and then the firewall itself),ext er­ protocol but is much simplitl ed. Al l user authentica­ nal hosts will attempt to connect unsuccessfully to the tion sequences involve an initial step in which the three internal hosts before eventually sending the mail client (authd) specifies the user's identifier to the to the firewall. Although this does not consume con­ authentication server, fo llowed by one or more siderable bandwidth, it does cause a noticeable delay in challenge-response cycles in which the server requests mail delivery to hosts on the private network. a reply from the client (by challenging the server The AltaVista Firewall name service is implemented fo r some information), and the client responds. The using a name service daemon ( dnsd). Dnsd acts as a server processes each response and will either issue a gateway accepting DNS queries fr om both the private further challenge to the client or issue a result indicat­ and external networks. It also accepts DNS queries ing if the user is authenticated or not. directly from application gateways and supports dae­ The authentication service supports the fo l lowing mons running on the firewall. Two name servers based authentication mechanisms: on the standard Berkeley name daemon (bind) also

run on the firewall in a protected environment. One • SccureNet Kev acts as an internal name server containing information • CRYPTOCard concerning the fi.tll ch tabase of internal hosts; the

• WatchWord key other acts as an external name server containing only selected information. Both name servers are config­ • SccuriD card ured to accept requests fl .-om dnsd only. • S/Key The name service daemon dnsd considers both the • Reusable passwords (for outbound connections origin and the type of each DNS query it receives. only) DNS queries th::�torigi nate on the private network arc treated diffe rently fl·om queries that originate on the Th e Name Service external network. Requests received by the fi rewall Unlike the traditionalmodel of the Internet in which all are forwarded to one or the other of the name dae­ DNS'' information is available to all hosts, certainorga- mons, based on the form and origin of the question.

26 Digital Tc clmical journal Vo l. 9 No. 2 1997 Ta ble l shows rhc relationship berween q ueries origi­ appl ication gatevvays to load and i nterpret rl1e rules, and I1ating on the external and internal net\vorks, as well a comprehensive user interface to allow firew

as queries by gateways and other d aemons r unni ng on istrators to exploit most of the power of rl1c language .

rhe fi rewall. An authoritative query is a query that The architecture of the ACL system is very simple . It

concerns a host wi thin rhe domain of the firewall. is i mpl emented as a static l ibrary. This static library is Queries fr om external sites, fo r domains other than linked to an application gateway or any server that con­ those fo r which the fi rewall is authori tative, are trols access. The API to the library is minimal; it con­ rejected by def1ll lt, though this behavior can be dis­ sists of a fiJI1ction to load a rule set fr om an ACL fil e and

abl ed , a l l owi ng such q ueries to be fo rwarded to the a function that takes the details of a user's request as external name server. parameters and returns a decision to deny or allow rhe The main advantage of this duai-DNS system is that action based on the rule set. The AltaVista Firewall has it removes the requirement fo r a second host ro run been designed so that each application gateway or the internalname service. It also allows precise control server that uses the ACL system has a separate rule file. over rhe dissemination of name service information to However, the architecture ofthe ACL system does not

hosts on the external net\vork. The AltaVista Firewall require this, and ACL fileshari ng is possible. GUl allows an administrator to enter a host name and In the rest of this section, we describe the require­

specify whether or not that particu lar host name is ments fo r the ACL system and how they were addressed. visible to the externalnet \vork as weU as to the i nternal • The access control system must be reliable. Since net\vorks. Previousl y, it was possib le to specif)r only access control is a core function of a firewall, the whether or nor all internal hosts were visible, but the design ream considered the reliability of the ACL sys­ AltaVista Firewal l now al l ows this decision to be made tem as primary. The ACL system was designed as a fo r each host. The MX records are correctlyg enerated separate library that is statically linked ro the applica­ so that external hosts do not see MX records fo r hosts tion gateways. Static l i nking is used to ensure that or mail hubs on the pri vate net\vork. Similarly, if the this component cannot be easily replaced by

r l transter initiated fr om rhe external network. Zone specit) times ar which access is al owed. tra nsrcrs from secondary name servers on the internal • Becwse the access control language was extremely net\vork do not need to be authorized. tlexible, it was considered rl1at policies must be fa il­ safe

Ta ble 1 Handl ing DNS Queries

Queries for which the Queries for which the firewall is authoritative firewall is not authoritative

Queries from internal network Internal name server External name server Queries from external network External name server Reject the query

Digir

The ACL grammar is a simple rule-based language request is denied. This implements the requirement that su pports the definition of a list of deny and allow that any request that is not explicitly allowed is denied. rules for application gateways. Each application reads an ACL ti le to load the rule set it uses to make access Reporting and Logging control decisions. It is a primary fi.1 nction of a fi rewall to keep an audit The core of the ACL system is the algorithm used to trail or all network traffic tor both allowed and denied allow or deny access through the gateway. Each time connections. Most authors on the subject of ti rewalls the user makes a request to an application gateway, stress the importance of maintaining comprehensive the gateway builds a data structure describing the user logs so that break-in attempts can be identified and the requested action and makes a call to the ACL quickly, and the necessary actions can be taken against system. The data passed to the ACL system includes the attacker. These actions can inc.lude contacting th( the f()llowing information: fi rewall administrator of the host the attacker is using, blackl isting the host the attacker is using, or temporar­ • FQDN The canonical fo rm of the ofthe host fr om ily shutting down some of the services that the attacker which the user is making the call. is trying to nploir.

• The IP address of the host fr om which the user is Loggi ng in the AltaVista Firewall is implcmcmcd connecting. using a common library. This librarv is statically linked to all the fi rewall components and provides a unif(m11 • The user identifier, if the user has been authenticated . logging interface and unitorm entries in the fi rewall's • The action the user is requesting. Jog ti les. Static linking is used to ensure that this com­ • The server that the user wishes to access. Note that ponent cannot be easily replaced by an agent attempt­ this could be the firewall itsel f. ing to subvert a gateway. Each firewall component

The ACL system uses the ro llowing algorid1m to decide initializes the logging function on start-up, passing it whether to grant or deny access to a user request. an acronym that the library uses to tag all subsequent entries in log files. \Vhen a firewall component cal ls the • Search through all the time-based rules. If any time logging library to write an entry to the log flies, it rule denies access to the requested operation, log speciries a flag indicating the type of the log message that a request was denied because of a ti me rule and and the log information. Log messages arc of the set a Hag to indicate that the request was denied. t()JIowing types:

• Search through all the blacklist rules. If the source • fW_LO G. The message is informational. or target host is blacklisted, log that a request was denied because of a blacklist rule and set a tlag to • rvV_WA RN. The message is a warning; there may ind icate that the request was denied. be some security implications. This level usuallv

28 Vo l. 9 No. 2 1997 indicates that some configuration information The key requirements fo r remote management of a expected by the component is not present or can­ firewall are as tollows : not be accessed . • A secure channel between the firewall and the • F\t\I_EVENT. This message is a security event. remote management client Event messages are detected by the alarm system • A consistent user interface tor both local and as described above. remote management All logged messages incl ude the following information • Support fo r multiple administrators with the ability in the log entry vvritten to the log file: to control the tasks each administrator can perform • The current date and ti me Clearly, any firewall remote management capability • The component generating the message must be completely secure, offering no possibility of compromise. If an attacker can break into a firewal l's • The log message type remote management channel, then the complete sub­ • The log message information version of rbe firewall is likely. As well as ensuring that all firewall activity is logged, For ease of management, it is Jlso important that it is also critical that the firewall manages the Jogs files the user interface that is avJilable remotely is tbe same that are generated to ensure that the log information is as the one available locally, so that all firewall monitor­ retained fo r later analysis. The Al taVista Firewall auto­ ing, control, and configuration t:1cilities are available matically archives logs ti les on a monthly basis and fr om the remote host. includes fe atures to allow fo r the retention and/or Finally, the ability to support muJrjpJe administrators deletion oflog archives afterone year. The firewall also is required when a firewall is being managed remotely, monitors log disk space usage and will autornaticallv so that the organization that operates the firewall can intonn the firewall administrator if the amount df control who has access to the firewall and what opera­ space available fa lls below a specified size. If the log tions they can perform. Typically, an organization disk is fu ll, the firewall will disable itself so that no net­ restricts the ad ministrators who can configure or work trafficcan occur that is nor logged . reconfigurethe firewallso that changes to the security The AltaVista Firewall also provides comprehensive policy the firewall implements are tjghtly controlled. traffic-reporting fa cilities, including an automated As described previously, the user interface fo r the reporting service and a GUI-based custom report gen­ AltaVista Firewall is implemented using an HTML­ erator. The firewall's reporting system allows a firewall based approach, thus enabling any platform capable of Jdministrator to choose fr om a number of report supporting a Web browser to manage the firewall. The types, including summaries or detailed information, key requirement is to secure the connection between on individual gateways or on all firewall services. the tirnvall and the remote host the GUI traffic travels Reports can be generated tor daily, weekly, and over, because al l management actions are performed monthly periods. These reports can be mailed auto­ by the GUl Web server running on the firewall. The matically to a specifieddi stri bution list. need to authenticate both the fi rewall and the remote host is critical to this. Although SSL-based solutions

Remote Management can easily deliver firewall authentication, it is less easy Firewall manage ment has rapidly become a key issue to deliver remote host authentication, without provid­ to r organizations implementing connections to the ing an appropriate key generation mechanism fo r Internet, or fo r organizations required to protect sev­ potential clients within the firewall prod uct. A solution eral parts of their internal private network. Typically, is req uired that provides an encrypted channel fo r the fi rewall products allowed operator Jccess only at the GUI traffic and includes a mechanism that allows fo r firewall system's console to ensure the security of the authentication of both parties. installation. Most organizations today prefer to cen­ The Al taVista Tunnel product was selected to pro­ tralize their network control operations and expertise. vide a secure channel fo r remote management fo r two Firewall products that require console-only access are reasons: ( 1) It delivers a mechanism by which traffic often not considered. Also, many organizations place between the host running the AltaVista Firewall and a their fi rewalls at separate locations where console remote host is secured through encryption, and (2) Tt provides an easy-to-use mechanism tor authentication access fo r monitoring and control may be restricted of the two parties. For remote management, a tunnel to r various reasons. The AltaVista Firewall's remote is established between the remote host and the fire­ management provides a mechanism by which a central wal l. This tunnel provides the secure channel through network management body can control every firewall which the firewall can be managed fr om a remote within an organization. host. The firewall acts as a tu nnel server (running the

Digital Te chnical journal Vol. 9 No. 2 1997 29 AltaVista Workgroup Edition Ttmnel software), while possible that separate administrators can log in locally the remote host may be ei ther a tunnel server or a run­ and remotely. Ad ministrators arc granted privileges to nel cl ient (running the AltaVista Personal Edition monitor, control, and configure the firewall for each Tu nnel software ). In this way, an ad ministrator can GUI subsystem. One administrator may be able only manage a firewall fr om any plattorm that supports the to monitor the firewall status, while another adminis­ AltaVista Tu nnel . A modified version of the AltaVista trator may have the necessary privileges to configure Wo rkgroup Edition Tu nnel software is shipped with the security policies for application g:neways or to the Alta Vista Firewall, as \vell as AltaVista Personal manage the user authentication system. In addition, Edition Tunnel sofuvare tor Windows 95 and GUI privileges arc allocated separately t(1 r local and Windows NT platforms. The firC\vall tunnel sofuvare remote access, providing fu rther fl exibility to r admin­ is modified to restrict the number of concurrent tu n­ istrator privi lege control. nels that can be started to one (for licensing reasons) and to operate the tu n nel server on a nonstandard Future Enhancements port. This avoids clashes with organizations that arc relaying tu nnels by means of the firewal l. Large organizations that have JocaJ private networks in A ·we b browser running on the firewall connects to several geograph ies currently link these networks using the GUI Web server using the local host's address. In expensive private con nections. The growing avai lability contrast, a Web browser running on the remote host of inexpensive, high -speed Inte rnet connections and connects to the GUI Web server using the IP address of the development of secure IP tu nneling sofuvareprod­ the tu nnel endpoint. The endpoint is the pseudo-II' ucts, such as the AltaVista Tunnel, is prompting many address allocated to the ru nnel on the firewall. The GUI of these organizations to construct virtual private net­ Web server on the firewall is configured to allow only works (VPNs). Since each Internetconnection must be connections fr om the local host (the firewall) and fr om secure, the next logical step is to integrate the IP tun­ the tunnel endpoint (that is, the pscudo-IP address) of neling upability into tl1C AltaVista Firewall. the remote host. The \Ve b server automatically manages Larger orga nizations are moving away ti·om tl1e idea the GUI universal resource locators (URLs)generated ofhav ing one tirewall as a single choke-point connec­ to r each user inrer�ace component, depending on tion to the Intemet. Instead, multiple firewaJ is may be whetJ1er the connection originates locally or through a dispersed at several locations throughout a private net­ remote management channel. Because the AltaVista work, perhaps on diffe rent continents. It wi ll therefore Tunnel product adds a route on the host at one end of a be necessary to ensure that the network security of tunnel to the pseudo-IP address of the tunnel endpoint each fi rewall remains synchronized with the others. on another host, traffic between the browser and GUI The ability to provide secure enterprise management Web server is automatically routed through the tunnel of several ti rewalls ti- om a single location is a major

30 Di�;itol Tec hnical found Vol. 9 No. 2 1997 sumed at an alarming rate. !Pv6 provides a much References larger address space. l. W. Stevens, TCP/IP Illustrated. lloluml! 1: Tb e Proto­ • Security. 1Pv6 addresses authentication, integrity, and confidentiality issues. Because IPv6 corrects cols (Reading, Mass.: Addison Wcslev, ISBN 0-201- many of the threats and vulnerabilities associated 63346-9, 1994). with IPv4, the architecture and security policy of an 2. S. Be:Uovin, "SecUJitv Problems in the TCP/IP Protocol IPv6 firewall will significantly diffe r from those of Suite," Co mputer Co mmunicutiun Review vol . 19, an !Pv4 firewall. The integration of IPv6 support no. 2 (1989) 32-48. into the AJtaVista Firewall will require researching 3. R. Morris, A We akness in the 4.2 8Sf) !!nix TCPIIP any outstanding securitythr eats, as well as address­ Sojhmre (Murrav Hill, N.J.: AT&T Bell Laboratories, ing the practical issues of performance slowed by February 1985). cryptography. 4. CER T Coordination Center 1996 A111zua/ Report (http:/jwww .cert.org/ cert. re port.96. h tml ). Summary 5. ]. Mogul, R. Rashid, and M. Accetta, "The Packet The emergence of new technologies and the growth Filter: An Efficient Mechanism to r User Level Net­ of electronic commerce on the Internet means that work Code," Proceedings o,lthe 1 lth SFmpusiwn on network security will continue to increase in impor­ Ope rating Sy stems Principles. AGM SYSOPS, tance. The AltaVista Firewall addresses customer Austin, Texas (November 1987). requirements t(x securing their Internetcon nections 6. ]. Mogul, "Simple and Flexible Datagram Access Con­ by providing a powerful and fl exible fi rewal l product trols fo r UNIX-based Gateways" (Digital Equipment that includes

11. P. Albitz and C. Liu, DNS and LJIND. 2nd Edition We would like to acknowledge the fOllowing people (Sebastopol, Calif.: O'Reillv & Associates, Inc., ISBN t() [ their contributions to the design and development 1-56592-236-0, December 1996 ). of this product: Frank O'Dwyer, for his contributions to the access control system; Chris Evans and Eugene 12. IP M u l ticast Initiative (http:/jwww .ipmulticast.com). O'Connor tor their extensive user interface and docu­ 13. S. Deeri ng and R. Hinden, "InternetProtocol Ve rs1on 6 mentation work; Sarah Keating, whose testing con­ (!Pv6) Specification," RfC 1883 (December 1995). tributed to the security and robustness of the product; Tim Shine, engineering manager, who ensured the timely delivery of the product; Brendan Noonan, General References product manager, whose market knowledge was invaluable; Jeff Mogul, t(x his advice on packet fi lter­ \V. Cheswick and S. Bellovin, Firewalls and ln.ternel Security, Repelling the Wily Hacker ( Readi ng Mass.: ing; Dan Walsh and Ken Alden, fo r their advice and Addison Wesley, ISBN 0-201 -63357-4, 1994 ). assistance with the AltaVista Tunnel product; Kevin Carey and Tony Pitt, tor their help with beta testing; D. Chapman and E. Zwicky, Building bztemr!l Firewalls and the members of the AltaVista Internet Security (Sebastopol , Calif.: O'Reilly & Associates, Inc., ISBN Product Group. 1-56592-124-0, 1995).

Digital Technical Journal Vo l. 9 No. 2 1997 31 Biographies

Oliver J. Leahy A principal engineer in the AltaVista Security Products Group, Oliver Leahy has worked in security engineering fo r several years. He led the POLYCENTERSecurity Compliance Manager (PSCM) fo r NetvVare project and ]. Mark Smith Mark Smith joined DIGITAL in 1982 and is a principal worked on the development ofPSCM fo r OpenViY!S. Prior engineer in the AltaVista Internet Security Product to this, he worked in the Publishing Technology Group Group. In his current position, he has contributed to and in the DIGITAL Semiconductor Acquisition Group, the design of the AltaVista firewall product since its developing quality control and semiconductor test soft­ inception. In previous proJects, Mark worked on the ware. He holds a B.Sc. ( 1982) fi·om Trinity College Dublin development of DIGITA L's Firewall Service and on and an M.Eng. ( 1988) from the University of Limerick. the POLYCENTER Security Intrusion Detector pro­ duct. Mark was also project leader fo r a number of publishing products, including DECwrite version 3.0, and was chair of the Technical Committee fo r the ODA Consortium. In addition, J'vlarkhas represented DIGITAL within several publishing-rebted standards groups. Mark received a B.E. in electronic engineering ( 1982) from University College Dublin, Ireland and an M.Sc. in computer systems design ( 1990) fr om the University of Dublin, Ireland. He has lectured in com­ puting at University College Galway, Ireland and holds one patent.

Dermot M. Ty nan A principal engineer in the AltaVistJ InternetSecuritY Product Group, Dermot Tvnan is the architect of the AltaVista Firewall product. Prior to joining DIGITAL, he worked as a UNIX kernel engineer fcJr Altos Computer Systems, Unisys, and ICL. He designed sofuvJ.re and hardware components tor a real-time FFT system fcJr Ultrasystems Space and Defense, Inc. and developed inter­ process communication systems f(x British Telecoms' Work Management System. Dermot is an active member of the Internet Engineering Task Force (IETf ), the ISO Motion Picture Experts Group (MPEG ), and the InternetSociet y (ISOC). He is currently working on three Imernetdr aft Sean G. Doherty standards and is coinventor on a firewall patent. Sean Doheny joined DIGITAL in 1993 and is a senior somvare engineer in the AltaVista InternetSecurity Product Group where he has designed and developed components of' the AltaVista Firewall since its inception. Previously, Sean contributed to the design and development of the POLYCENTERSecurity Compliance Manager tor NetWare product. Prior to JOining DIGITAL, Sean was responsible fo r the development of system management somvare fo r Third Wave Ltd. Scan received a B.Sc. in computer applications from Dublin City University, Ireland (I992 ).

32 Digital Te chnical journal Vo l. 9 No. 2 1997 I Nick Shipman

Developing Internet Software: AltaVista Mail

The emergence of the Internet as a place where In late 1993, the Mail Interchange Group (MIG) people can conduct business prompted DIGITAL within DIGITAL started the AltaVista Mail develop­ to investigate the development of products ment program. At that ti me, the members of MIG bad substanti al experience in the development of elec­ specifically for use in this environment. Electronic tronic mail (e-mail) technologies; however, the new messaging systems based on Internet technolo­ products were being targeted �o r use on the Internet gies provide the communication medium for in an environment that was quite diffe rent fr om the many businesses today. The development of one fo r their previous prod ucts. In an effo rt to satisf)r AltaVista Mail illustrates many of the concerns the new customer base, the members of MIG reexam­ facing engineers who are designing products ined their design and development process. The Alta Vista Mail product emerged from efforts to for this new customer base. The results of our improve MIG's support fo r Internet-based e-mail experience can be helpful in many ways and technologies. Our previous products were electronic should be of interest to those involved in mail and directory servers fo r net:\vork backbone use, designing technologies for running Internet based on the most recent X.400 and X.500 standards. applications. Products suitable to r the Internet environment would clearly be quite diffe rent. This paper begins by presenting our analysis of Internet services and support sofrware and describing the transmission of e-mail on the Internet.The paper then discusses the implications of developing a prod­ uct fo r the Internet environment and explains the impact of those implications on the design and imple­ mentation decisions that defined rhe AltaVista Mail product. The paper concludes with the engineering assumptions and habits that had to be overturned to build the product set.

Internet Services and Software

During our initial analysis of the product possibili­ ties, we made several interesting observations about Internetservices and software,part icularly in compari­ son to the mission-critical products in MIG's existing portfolio. (Our observations could more accurately be called assertions-it was and is remarkably difficult to get hard information about Internetus e.)

l. The academic/research/technical community determined the nature of the Internet's service offerings. Most of the software defining the Internet's services was generated by and �o r this community.

Digit� I Technical Journal Vo l. 9 No. 2 1997 33 2. The real market opportunity would not be among a gateway to Lotus cc :Mail post offices; and a gateway the academic/research/technical community but to Microsoft Mail post offices. would be drawn fr om ordinary businesses. The SMTP/POP3 server is the mail system compo­ 3. Much of th e service software on tJ1e Internet was nent responsible to r accepting messages from mail unsatisf..1 ctory fo r routinebusi ness usc, either because client programs. It transmits them toward the recipi­ SMTP it was unreliable or because it was difficult fo r end ents' servers and performs local delivery using users to deaJ with it. Even tJ1 ough !Teesoftware was the POP3 protocol. This process is described in more abundant, much of it did not work. Support to r tJ1e detail in the next section. !Tee software was a risk: some software had excellent peer support; Lm fortunately, not aJI end users were E-mail on the Internet aware of its existence or were able to access it. This section briefly describes how Internet e-mail is 4. We judged the operating system platforms com­ tr;mskrred fr om the originator to the recipient. monly used fo r service softwareto be unsuitable to r The originator's mail client program constructs a a large part of the business community. The various message according to the rules described in the UNIX platforms need skilled local stan; corrupted Internet standard , R.FC 822.1 The R.FC 822 standard or poorly configured Windows version 3.1 and defines a message as a sequence of short lines of7-bit Macintosh run-time environments would be diffi­ ASCII text, each terminated by a carriage-return and cult to diagnose and expensive to support. line-ked sequence (CRJ�F). The first lines are header Expanding into support of the Internet environ- fields; these are extensible but typical ly include the ment would require us to build native equivalents fo r originator and recipienr e-mail addresses, the date, and some of our existing server software. We believed that the message subject. The header ends with a blank the Windows NT platform offered a good fra mework line, and the remaining lines constitu te the body of the fo r systems that would work well in any business envi­ message. Figure 1 shows an SMTP dialogue that ronment and be easy to support. includes an .RFC 822 message. lni tiaJJy, we req uired the fo iJowing products: a server Where appropriate, the mail program can also to iJow that supported the Simple Mail Transfer Protocol the Multipurpose Internet MaiJ Extensions (MIME) (SMTP) and tJ1cPost OfficeProtocol version 3 (POP3); rules in R.FC 1521 and R.FC l522L'; these describe

SMTP command/response Comments Caller opens connection 220 serve r1. altavista.co.uk AltaVista Ma il Server's welcome message V1 . 011.0 BL22 SMTP ready helo client1 .altavi sta .co.uk Cliem gives irs own host name 250 OK Host name was acceptable maiL fro m: Identifiesreturn pathfo r nondelivcrv reports 250 OK Re turn pad1 was acceptable rcpt to: A recipient [( )r this message 250 OK Recipient was acceptable data Message to llows 354 Star t ma il inpu t; end with . OK to start message header Da te: Mon, 7 Jul 1997 08 :30:13 +0100 Message's date From : Fred Originator field To : Bi ll Recipient field Subject: Examp le message Subject field Blank line ends message-header fi elds Hi Bi ll, Conrem lines ..

This is a test message . It's not very Long .

Fred ... End of content 250 OK Message has been accepted quit No more tnessagcs; signing off 22 1 reds v r.altavista.co.uk closing connect ion finished

Figure 1 Example SMTP Dialogue

34 Digital Technical journal Vo l. 9 No. 2 1997 how to construct a message body to transfer typed and value. Eventually, the mail must be delivered to the structured data and how to pass non-ASCII characters most preferred host. in header fields. Figure 2 shows an example of a mes­ For each target domain, the SMTP server uses the sage constructed according to the MIME standard. SMTP protocol to transfer the message to the The originator's client submits the message to a domain's most preferred, reachable M.,'{ host. (The nearby SMTP server using the SMTP protocol:' This most preferred host may be unreachable from the local very simple protocol uses short, CRLF-terminated server: it may be switched off for a while, or it may be lines of7 -bit ASCII text to transfer its commands and behind a firewall, a machine that protects a private net­ responses. To submit a message, three commands are work by limitjng access fr om the open Internet to the used: the MAIL, RCPT, and DATA commands intro­ machines inside the protected network.) The chosen duce the originator's e-mail address, the recipients' host, if it is not the most preferred, fo rwards the e-rn;lil addresses, and the RFC 822 message data, message to a more preferred host andso on, until the respectively. message reaches the recipient domain's most preferred The SMTP server examines each recipient's e-mail host. That host then delivers the message to an area address to decide where the message should be sent. from which the recipient's mail client program can Recipients

MIME data Commems

Da te: Man, 7 Jul 1997 08 :30:13 +0100 Normal RFC 822 header fields From : Fred To : Bi ll Subject : Bi na ry attachment MIME-ve rsion: 1.0 This is a MIME message .. Content-type : mu ltipa rt/mixed; boundary="zzzB ounda ryzzz" ... consisting of a list of body parts Blank line ends message-header fields Start .. --zzzBounda ryzzz ... of first body part .. Content-type : text/plain; charset="us-asci i" ... in ASCII plain text ... Content-Transfer-Encoding: ?bi t ... using 7 -bit encoding Blank line ends bod)'-part-header fields Hi Bi ll, Bodv-part contents

Here 's a binary fi le. It's four bytes of all 1 's.

Fred End ... --zzzBoundaryzzz ... of first bodypart and start of second ... Content-type : application/octet-stream; ... a stream of bytes called too.dat. name="foo .dat" Content-Transfer-Encoding: base64 ... using base64 encoding Blank line ends body-part-header fields /1/!!w== Hex FFFFFFFF encoded in base64 End .. --zzzBoundaryzzz-- ... of message

Figure 2 Example MIME lv!essage

Digital Te chnical journal Vo l. 9 No. 2 1997 35 4 (IMAP4 ), do offer certain user advantages; however, Our extstmg approach to development involved they do not perform the basic job of delivering mes­ obtaining agreement fr om many groups within sages any better than POP3.'' Even though support fo r DIGITAL concerning the nature of a problem area IMAP4 vv as added to AltaVista Mail version 2.0, POP3 and the architecture of any solutions, and then imple­ remains the method of choice fo r retch ing messages menting product versions ag:�inst the architecture. fr om a remote server: this protocol is so si mple that it This process is slow and expensive with considerable is hard to implement incorrectly. management overhead. For the Al taVista Mail product, we decided insteJd Product Design Decisions to direct a small team to ge nerate a product-quality prototype as quickly as possible and to ship that proto­ The definition of the AltaVista Mail prod uct set did type as a product. In the interests of rapid develop­ not start with technical issues. Instead, it started with ment, we would deliberately discard much of tbe an assumption about the purchase price of a product. traditional Phase Review Process but would use regu­ Even though the price we chose was not used f(>r the lar, informal monitoring to ensure that the prototype released product, our assumption turned out to be, remained acceptable to our target customer. perhaps, the most useful and powerful design tool Al l design and implementation decisions would available during development. be judged by their efte ct on this target customer, not by their ad herence to an architecture. All fu ture Product Pricing and Organiza tional Concerns development would be guided by customer te ed­ We were interested in exploring the implications of back. This method is fa r less expensive, delivers a offering a product at a very low price and selling in product fa r sooner, and is more JikeJy to reflect cur­ very large quantities to make the business worthwhile. re nt customer needs. MIG's previous products had been priced at the oppo­ site end of the price scale: they were expensive, but Te chnological Concerns thev were valuable to the relatively rew customers who Our ideas on product pricing and the design process needed their fu nctions. led to three initial design decisions. (Interestingly, as we explored the necessary organi­ First, nonexpert users must be able to get the fu ll zational and technical changes involved in moving to value fr om the product. Setting up and configuring the low end, we realized that they would in no way the product must involve answering the minimum jeopardize our ability to sell at the high end. The orga­ number of quest.ions. Each question must relate to a nizational changes improve efficiency no matter what topic on which the user can re:�sonably be expected to the product price, and the technical changes make for have an opinion. The user must not be asked questions a better, more usable product regardless of the cus­ about the internal operation of the product, only tomer profile.) about topics with an external significance. Our starring point was to investigate tl1e engineer­ The product must offer the minimum number of ing implications of building a product that would sell operational controls. (Some high-end customers fo r $100. We concluded the fo llowing: demand many controls. If necessary, these controls could be added in a later version; but the product must • To maximize the number of produ cts sold, we not depend on them , they should not be presented to would have to satiS�' the largest imaginable cus­ the average user, and those users who insist on seeing tomer base and not exclude a potential customer them should be charged a premium to cover the addi­ fo r any reason. tional support costs. We wo uld explicitly accept that • Customers attracted to a low purchase price wil l there are certain customers we should not aim to sat­ also require low running costs. No hidden costs is�' and certa.in features we should never offer. ) could be Jssociated with ru nning the product. Second, the product must never go wrong. The • Support costs would have to be kept to a minimum. prod uct must never encounter any internal errors, Ifeach customer needed telephone support several only those caused by fa ilures in its operational en vi­ times over the life of the prod uct, the $100 price ronrnent. Any environmental fa ilure must be would not cover support expenses. We wou ld hJve reported completely and accurately in terms that the to ai m at receiving zero support calls. user can understand. After a fa ilure has been fixed , the

• Vve would have to minimize the implementation product must start working again with no fu rther and maintenance cost and deliver products and intervention. If an environmental fa ilure or an opera­ updates as early as possible. The Internet market tor intervention corru pted the software, reinstalling moves quickly, particularly at the low end, and a the kit must get the system working again. The prod ­ long development cycle loses sales. uct must not depend on any product that docs not fo llow these rules.

36 Digiral Tcch niCl! Journal Vo l. 9 No. 2 !997 Third, the product must be Inexpensive to build the mm1mum and maximum retry intervals fo r and maintain and must use a rapid development cycle. SMTP connection attempts, not the specific ti mes The product-and each of its components-should at which the server tries to communicate. deliver the maximum customer-perceived value fo r the • The server determines if an event is relevant to sys­ minimum engineering investment. Although the tem security and responds according to its own rules. number of fe atures should be minimized, the fu nc­ Re peated authentication f3.i.l ures result in mailboxes tions delivered should be sufficient to be useful to a and originating host addresses being locked out to r a large customer base . time; the administrator can manually reset the lock­ out but cannot control how long it lasts, nor how Implementation Decisions many fai lures are judged to be an attack.

The most important decision was to aim fo r simplicity Unfortunately, user interfaces cannot be avoi ded above all else: simplicity of design , implementation, entirely. Therefore the goal must be to minimize the and presentation. Simplicity delivers reliability and amount of user interaction required witl1 tl1e server and inexpensive implementation and maintenance. It also to make user interaction easy to perform and as reassur­ helps to ensure that a prod uct is comprehensible to its ing as possible to the user. We were able to design the end user and does not behave in baffling ways, even SMTP/POP3 server to be easy to set up and use, and when it is not working due to an external infl uence. we reduced the user interaction with the gateways A related decision was to use no method or tool that to the minimum. Gateways are notoriously difficultto migbt encourage complexity by helping to manage it: set up; we simplified tl1e process of installation to the no fo rmal design methods, no automated design veri ­ extent that a central cc:Mail or MSmail administrator fication, and no automated system test. The developer can talk an inexperienced user th rough the installation. should immediately fe el the pain of building a complex The SMTP/POP3 server requires an administrator structure or one that requires an elaborate system test to perform only the fo llowing to ur actions:

and th us be encouraged to think again. • Load the software to a chosen volume Of course, proper engineering practice dictated that • Te ll the server about the local network configuration we should use a repeatable system test with properties • that were we ll understood, but we deliberately never Set up mail accounts (mailboxes) fo r tl1e local users automated the test. We also used remedial tools such • Check that the server is not experiencing problems as locally developed libraries to look fo r memory and The AltaVista Mail product implements these tasks handle leaks. (These tools do not tempt developers through three user interfaces: the setup (instal lation) into bad habits.) procedure, a Windows-based ad ministration graphical user interface (GUI), and Wo rld Wide Web fo rms. Us er Interaction with the Server Ideally, the server should perform its job with no com­ Setup Procedure. Apart ti·om loading the software, ment, and the user should fe el no need to think about the setup procedure has several other fu nctions. lf the server's performance. Product documentation the software has been corrupted, the setup proce­ should be minimized , and there should be no printed dure repairs the service by resetting the server's documentation at all. entire environment to a known state. The server's We designed the server to make many of the prod ­ account privileges are reset, a new password is gener­ uct's operational decisions, rather than leave the deci­ ated, and the database directories and ti les have their sions to an administrator: fi le protection reset. • The administrator cannot control the ro uting Setup continues by asking the minimum number of process. Messages are sent to the targets defined in questions required to ailow the server to work. The DNS, and no local rules nominating other targets administrator responds by naming any firewall used , or are supported . The administrator can nominate a if in dial-up mode, naming the remote mail server. fi rewall but cannot say when to use it (the server Final ly, it runs the server's self-test, which is described uses it automatically, when all the other targets have in more detail later in this paper. It is important that been to und unreachable). self-test run during setup. If the system is not working,

• The operational logs are purged automatically. The the administrator needs to be told the precise causes of administrator can only control how long any the problem immediately, before he or she can become Jogged event is guaranteed to be stored before confused by subsequent symptoms of system fa ilure. being purged, and this interval cannot be selected on a per-log or per-event basis. Windows-based Ad ministration GUI. The Windows­ based administration GUI controls the Alta Vista Mail • The server's network connections are scheduled by server thro ugh a simple, text-based administration the server itself The administrator can on ly control

Digiral Te chnical Journal Vo l 9 No. 2 1997 37 protocol over a Transmission Control Protocol/ marion on all the visible controls, and assistance in Internet Protocol (TCP/I P) link. Therefore the GUI troubleshooting. Apart from the equivalent help text can control servers elsewhere in the network. in the vVorld Wide Web fo rms, this is the only product The GUI has two modes. The default mode is a sim­ documentation. ple menu with links to functions that invoke the most The self-test is one of the most important adminis­ common administrative tasks: network configuration; tration fu nctions fo r sites at which mail is a mission­ adding users and mailing lists; red irecting mail and critical service. The self-test checks that all aspects of changing passwords; and self-test. This mode, shown tbe local machine's environment that are necessary to r in Figure 3, is a nonthreatening interface fo r inexperi­ the server to operate do indeed work; it also checks enced administrators; they do not need any other that the server is responding correctly on all the net­ information to use the server successfully. This means work ports it serves. In a redundant environ ment, the that locJI users with no specialized knowledge can set self-rest checks every element to make sure partial fJ..il­ up sma.l l and undemanding sites. ures are reported. For example, a host generally knows Administrators of large or busy sites need to usc the of t\vo or more DNS servers, only one of which needs advanced mode, shown in Figure 4. This interface is to be working fo r the mail server to run. Because the similar to the style of the Windows Explorer GUI and mai l server will not see a problem until the last DNS gives access to every server and mailbox control, the server dies, the self-test must report any partial failure. messages in the server, and its operational logs. The self-test is vital : a regu l ar check is tl1e only way to Both modes include an on-line help file thatgives be sure rh:n a background server is working. A server a brief introduction to the system, reference infor- cannot be guaranteed to inform an administrator of

�AltaVista Mail administration tasks R.Ef

Pick an administration task for mrmen.altavista. co.uk

�··························································································�

�et up the AltaVista Mail server :...... ;

Create a .!!!ailbox for a user

Change a mailbox Qassword

fiedirect a mailbox to another user

Create a mailing Jist

Iest the AltaVista Mail server

.E,xplore the AltaVista Mail server

Figure 3 Wi ndows-based Administration GUI, Simple Menu

38 Digiral Technical journal Vo l. 9 No. 2 1997 @. AltaVista Mail - Exploring mrmen.allavista.co.uk I!I�EJ � erver Idil �iew I ools !::!elp

-� Server · mrmen.allavisla.co.uk Conligu alion Status Oienl·specific a I I I l+ .!.1 D mains Clienl lype M ailbOlC isa jinal destination p El lit Logs tit AdminislralionOperalions !undefined :::1 MailbOlC J!assword.. tit EIIOIS Usernames delivered lo this mailbox I lit I mapO per alions Mailbox memJlers... Username Redirecl lhis usern�� lit M essageO peralions I Administrator tit Pop30 peralions Postmaster Adminislr at ion privileges Security tit fOOl P A9minislrate lit SrnlpOperalions SYSTEM r MQOitor - £9 Mailboxes r Read a!l mail B Oeadlellers 8dd.. Modify ... B.emove r Y{rile to logs r9 atetG />Jays I I I t9 Mailing lists R§direcl remaining usernames lo " liffltmM! IN ick. S [email protected] + B Tests AfJer redirection or lisl expansion return reports lo + i9 Users I

Message idenl�ier I Last modified lime I

II

Figure 4 Windows-based Ad ministration GUI, Advanced Mode

problems because the problems may affect the notifi­ Several details help make the three user interfaces cation path used. The AltaVista Mail self-test aJiows an approachable and nonthreatening. administrator to perform the necessary check with a When the software req uires the user to answer a single "button click." The self-test also runs as part of series of questions, it presents a dialogue box chajn a regular cleanup procedure. Errors are reported to (sometjmes known as a "wizard''). Used properly, this the server's error log, so a less active administrator can technique allows the user to concentrate on one thing monitor the system by reading this log every fe w days. at a time, with all rustracting material rudden. Every question in a diaJogue box cbajn gives an World Wide Web Forms. The World Wide Web explanation of what information is needed, any suitable (W'NW) fo rms interface is provided by a built-in defaults or examples, a suggestion of whom to contact Hypertext Transfer Protocol (HTTP) server. This to findth e answer, and a safe way to abort the process. interface ofte rs the same fa cilities as the Windows If the user knows the answer, he or she will be able to administration GUI. Because a Windows Explorer­ recognize it in the example. Users who do not know style GUI is more difficult to present using a Web the ans>ver wiiJ not be intimidated by the wording. browser, we implemented the "power user" options The logic works in terms the user understands, not on the top- level task menu of the WWVVfo rm. These in terms of the software'soperati on . The gateways, fo r options make the initial interface somewhat intimidat­ example, contain a question that the software does not ing, because they include controls whose functjon may use other than to make the text of the succeeding dia­ not be understood by an inexperienced administrator. logues relate to the user's environment. The fa miliar controls, however, are grouped together. Some controls can easily be invoked in error but The Web GUT is shown in Figure 5. cannot be redefined to make the error less likely. In

Digital TechnicalJour nal Vo l. 9 No. 2 J 997 39 [lAltaVista Mail Administration - Microsoft Internet Explorer l!lliJEJ Eile fdit Y:iew §.o Fgvorites !::!elp

co. Address http://mrrnen.altavista. uk· 1 656/ avrnail/rnodse• Q c

"'c \lta \:'ista l\Iail �4. dtninistration Explore the server

(· I ,) erver attn.b utes

Firewall host

Mailse rver host

Aliases j L _r Maximum message hops 132 hops Jviessage exptryint erval 11 72800 seconds Log event exptry in terval lsG�OO seconds Configuration poll interval IGo seconds Work poll ir1terval 15 seconds Cleanup in terval 13600 seconds Sender mirwnum retry interval 1120 seconds Sender maximum retry intervzJ 17200 seconds Modifyth ese attributes Reset these attributes

Done

Figure 5 WWW Forms Ad ministration GUI

these cases, the resulting dialogue box confirms the function might be "Next" or "Skip"; but the moment control's fi.mction and offers the opportunity to try any text is entered , the dctault fu nction changes to again. For example, it is easy to hit the add-username­ "Add" (or whatever normally happens to the input to-mailbox control instead of the add-mailbox con­ text). This helps the user ignore the interface and con­ trol, and this confi.tsion cannot easily be eliminated centrate on the meaning. with a revised ddinition. The add-usernamedia logue therefore warns that it does not add <1 m

40 Digir:1l Technical }ourml Vol. 9 No. 2 1997 fu l of top-priority goals, and from these we had ofcode is directly affected by asynchronous events, it learned a great deal about designing for the highest becomes difficult to be convinced that the code possible performance. We had also learned that the works. In addition, the code is unlikely to keep work­ single-minded pursuit of performance is expensive, ing as it is maintained. Synchronous metllods-mulri­ disruptive to implementation, and prone to error. threading with thread-synchronous calls, polling, This experience yielded a set of informal rules tor and timeours-will normally yield perfectly adequate cost-effective design regarding application perfor­ performance; they need only be avoided for very mance. These rules proved to be etkctive during the low-level code such as GUI window procedures. development of the AltaVista Mail product set. 8. Where latency requirements permit, polling fo r (Remember, these rules relate to the performance of new work or configuration changes can be a better typical applications: they would not apply to writing an solution than an active notification path. Polling is operating system or other low-level code.) easy to implement and robust. There is no need to

l. Estimate the minimum disk ljO that the product ensure that an idle program is completely inactive: operations will require, but treat this value only as a since modern operating systems page rather than sanity check against gross waste of resources. Do swap, minor CPU attention every few seconds not insist that this minimum be achieved. imposes no noticeable performance penalty. 2. Avoid checking special cases in a task's input data to Our approach to performance-and its nearly com­ avoid processing steps. Such optimizations are very plete subordination to simplicity-can be seen in sev­ likely to cause maimenance errors to go unnoticed eral aspects of AltaVista Mail's operation. and greatly increase the cost of the system test. The server's fu nction is to switch messages, so its 3. Never optimize tasks that consume only CPU time; database is a set of messages organized by target. Each examine only those algorithms suspected of being message is held as a single text file; the text consists of high-order consumers. It is almost never worth the SMTP commands that would introduce each mes­ optimizing error cases. sage to the server. A message fileis held in a directory that denotes irs 4. Do not use complicated buffer management target, whether that target is a remote domain, a local schemes to avoid copying data. Complicated code mailbox, or a thread of the SMTP server itself When is prone to maintenance errors, and performance the router decides on the target, it copies the message will not be helped much. Modern computers are to its target directory and deletes the original. When a very good at copying buffers of data but relatively message splits to multiple targets, no attempt is made slow at executing the complex branch logic that to share the common message data. might be required to avoid copying data. Although we could have designed a tar more effi­ 5. Take advantage of the high-performance system cient way of representing the message database and routines in modern operating systems. Do not splitting the message data as it flows through the server, build a memory allocator or a disk cache: the oper­ our experience with previous products suggested tl1at it ating system developer has already spent t:u more was not necessary. Because the storage system we chose on performance than an application developer can was a simple one, we could affo rd to throw it away if it affo rd to spend. (Even if the operating system rou­ did not work under load. Happily, it did work, and we tines do not pert(xm well, the problem wi ll be com­ gained a highly robust storage system with excellent mon to all applications that use the platform.) performance tor a trivial investment. Spend time solving the customer's problem, not When a message is passed from one thread of the repeating operating system development. SMTP server to another, it is lettin the target thread's 6. Use the operating system disk cache. It can be input directory. There is no notificationpath; the tar­ worthwhile to read even quite large amounts of file get thread discovers the message by polling. data in multiple passes if that will simplif)' program To make the source code as comprehensible as pos­ logic: the data will normally stay in memory tor a sible, the server uses extremely simple protocol subsequent pass. If the data has been flushed from parsers. The source code is organized in terms of a the cache tor the next pass, then the system had a programmer's understanding of the protocol, not better use fo r the memory, and any attempt to some abstraction that might be more efficient. The avoid multiple passes by remembering substantial parsers scan the input data as many times as is conve­ state will degrade performance, nor improve it. nient to extract the data they need at each level. The

7. Never offer an asynchronous application program­ result is secure and reliable protocol machines that are mer interface (API), and avoid using asynchronous easy to verifY and modifYwhen necessary. modes of otherwise synchronous services, even if Despite the apparent lack of care for conventional that technique is presented as the way to obtain performance concerns, extreme workloads must be good performance. When more than a tiny region supported. Al l the server's components must support

Digital Tc chnic1l journJ! Vo l. 9 No. 2 1997 41 huge numbers of messages, messages of huge size, (A related observation is that the code will be more messages with huge recipient lists, and messages bound reliable when onJy one kind of success outcome is fo r huge numbers of targets. ldeally, the system should allowed. We fo und this to be true to some extent: it is impose no limits below those imposed by the underly­ equivalent to saying that a linear, nonconditional fl ow ing machines. An extreme workload should cause no of control is more reliable than a highly conditional problems or substantially reduce the work rate. one. However, as long as every success outcome does The POP3 server has to be able to support tens of occur during the product's normal operation, the code thousands of messages in a mail box. On connecting, that handles it will probably continue to work, and the most POP3 clients ask fo r a mailbox listing, which system test will have reason to check that it does.) involves counting the size of each message in the mail­ The server also has to report any errors it encoun­ box. Ifthousands of messages arc present, this can tJke ters: it must say which server operation could not be so long thJt the client disconnects, believing the server performed, precisely what system condition caused it has fa iled. Subsequent reconncctions see exactly the to fa il, and what to do about the problem. This appar­ same problem, and no mail fl ows. To avoid this prob­ ently cont1icting requirement is handled with a second lem, the server returnsa partial listi ng to the client if it rule: error handling must be completely separated believes the fu ll listing is taking too long. When the fr om error reporting. full listing is finished, it is written to disk to be Error reporting works with the structure of the returned to the client the next time it connects. server. The server's flow of control and breakdown SMTP rules state that servers must support at least into threads were designed specitically to support pre­ 100 recipients per message, and that ideally there cise error reporting. Each thread has a well-defined should be no limit. Our existi ng customer base expects purpose that can, if necessary, be explained to the no limit. Arbitrarily large recipicm lists can be allo­ administrator, or at least named in a diagnostic report cated in virtual memory simply by configuring a page without causing confusion. Because a thread is fi.d ly in ti le of suffi cient size. However, very large lists then charge of its task, it is able to report the significance of impose a severe and highly variable load on the system. any fa ilures it encounters to the admin.istTator. To keep the system load within reasonable bounds, we A routine uses a simple stack-based error-posting placed an upper bound on the amount of virtual mem­ module to report an encountered error. The report ory claimed . The server limits the number of splits a includes a description of the tailed operation, the oper­ single message can mal

42 Digiral Tcchniul Journal Vol . 9 No. 2 1997 Additional Necessary Features and a low-level attribute handling system to read and In general , we tried to avoid adding fe atures and keep write server data, a customer can change both the in mind the server's one basic function: to move mes­ appearance and the function of the We b pages, simply sages fr om place to place with minimal user interven­ by editing HTML files. tion. When fo rced to add a fe ature, we aimed to keep it as simple and inexpensive to implement as possible, Experience with the Product yet ensure that the fe ature offered the greatest real value. Obviously, this involved a trade-off, but it was The AltaVista Mail product set has achieved its design usually clear how fa r to go. goals and has validated the implementation rules we The server needed a log subsystem to report important imposed on it. It has been inexpensive to develop, sup­ events, fo r example, errors encountered and suspected port, and maintain; is a well-behaved and effective securityviolations (attempts to break into the server). To solution; offe rs excel lent performance; and has yielded gain the maximum benefit fr om the log subsystem, we very fe w bugs. also used it to report the normal activities of the server Its major deficiency is that AltaVista Mail, by itself, in sufficient depth to perform a complete analysis of its does not fo rm a complete solution. Despite our efforts work. This allows the logs to be used fo r load monitoting, to ensure that it can be run by inexperienced adminis­ performance analysis, message tracing, and billing and trators, it relies on complex external technologies. accounting. Enough information is logged to identify Dial-up networking, Internet Protocol address assign­ exactly what has happened to each message submitted to ment, and the other aspects of the interface to the the server. Header information allows the originator and Internet service provider present problems that are the recipients to be identified,and checksums fo r enve­ not easy to deal with. lope and content al low duplicate messages to be detected. Our major problem is the configuration of M.X Duplicate detection is useful as a diagnostic aid and to (routing information) records in DNS. Although the avoid billing multipletimes fo r a single message. product reports misconfiguration accurately, users call The POP3 protocol does not report a message's us to findout what to do about it. Better integration actual recipients (as opposed to the To: and Cc: fields, with DNS would substantially reduce our support which may not be complete or even related to the real load. recipients). It therefore cannot be used as a way of Overall, we tried to keep t.he number of controls to delivering messages to gateways: it is only suitable fo r the minimum. In retrospect, we did build in a fe w that final delivery to recipients. For this reason, the addi­ perhaps should not have been provided. For example, tion of a proprietary interface could not be avoided. the administrator has complete control over the SMTP We chose to implement an APIbecause it offers the timeouts. Although this is required by the relevant simplest possible interface to the message data: RFCdocu ments, we should have had the confidence sequential access to the return path, the recipients, the to pick values that worked everywhere rather than pro­ header fields, and the lines of content data. At the vide the controL same time, it provides routines that encapsulate much On the other hand, providing more control in cer­ of the complicated and error-prone logic that gate­ tain areas would have expanded the range of cus­ ways often need. tomers who could use the product. For example, some For example, the API allows a gateway that is fe tch­ low-end customers need to control the schedule on ing a message to handle each recipient individually: it which SMTP connections and DNS requests are made, can accept, nondeliver, redirect, expand, or send fo r and the dial-up fa cilities we provide are too crude to retry each of the recipients it sees. The gateway auto­ do this. maticaLly generates any required new messages, includ­ ing nondelivery reports, messages containing those Conclusions recipients sent fo r retry, and messages with new recipi­ ents added by redirection or mailing list expansion. This section reviews the organizational, design, and The use of an API also guarantees that a gateway's implementation rules we fo und most helpfl1l in build­ operation is fu lly Jogged. When message-IDs and orig­ ing the initial Alta Vista Mail products. These ideas are inator and recipient addresses are translated between the ones we recommend to engineers who are starting SMTP and the foreign representation, the correspon­ a new area of work. dence is logged . Messages can then be traced, even across gateway boundaries. Running a Development Project In addition to these features, we extended the ser­ If possible, develop a new application as a product­ vices of the built-in HTTP server, which offers the quality prototype, not as the implementation of an administration We b pages. With a combination of architecture. A prototype can be brought to market server-parsed hypertext mark-up language (HTML) quickly and inexpensively and will generate helpful

Digital Technical Journal Vol. 9 No. 2 1997 43 fe edback from customers. It is much more difficult to Design and Implementation make sure an abstract architecture is not expensively Only a limited number of clever and difficultcompo­ addressing problems that do not need to be solved. nents can be designed and built into the product. Do not develop prototypes that are not product­ Make sure they are the ones that will make the most quality. A body of unshippable code that has been difference, and make sure each difti.cult part is as small built as a proof of concept does not demonstrate the

practicality of building a product. Take shortcuts, bur Outside the most critical areas, use the simplest designs not in any area that affects the application's fitness tor possible. Do not simplif)' to spend less time on design: use or maintainability. If the outcome is a shippable simplit-)1 to improve the quality of the product and to prototype, the choice can then be made whether to reduce the cost of implementation and lll

opment would choose the extra functionality a devel­ 5. C. Partridge, "Mail Routing and the Domain System," oper would like to build. RFC 974 (January 1986).

6. P. Mockapetris, "Domain Names-Concepts and Facili­ Defining a Product ties," RrC 1034 (November 1987). Technical issues are never the most important consid­ erations in defining a product. The most important 7. P. l'vlockapctris,"D omain Names-Implementation and thing to analyze is the customer profile. Instead of Specification," RFC 1035 (November 1987). building a product, the goal should be solving prob­ 8. J. Mvers and M. Rose, "Post Office Protocol-Version lems fo r the customer. The more accurately the cus­ 3," RFC 1725 (November 1994). tomer's problems can be characterized, the more 9. M. Crispin, "Internet Message Access Protocol-Ve r­ effectively the product will solve them. sion 4, Revision 1 ," RFC 1730 (December 1996 ). At the beginning, assume that the product will be sold at a very low price. Think through the implica­ tions; they will indicate the nature of an acceptable Biography product. Then, if the price is higher, add the extra fea­ tures that the increased price will require. Nick Shipman joined DIGITAL in 1982 afterearning Assume that the product will be supported directly a B.Sc. in computer science fr om University College London. As a member of the Mail Interchange Group from the engineering group. Imagine what would be (MIG), Nick has been involved with the design and devel­ necessary to support the product and definereq uire­ opment of a variety of e-mail and directory server products ments to impose on the product that will minimize the and their test tools. He is currently a principal software cost of providing support. Then, if the product is sup­ en gineer with the DIGITAL UK Engineering Group, ported elsewhere, add the extra teatures that the exter­ where he is working on advanced development of Internet­ based servers. nal support structure will require.

44 Digital Tcchnid journal Vol. 9 No. 2 1997 I Kenneth M. We iss Kenneth A. House DIGITAL Personal Workstations: The Design of High-performance, Low-cost Alpha Systems

The new DIGITAL Personal Workstations In 1995, DIGITAL began development of a low-cost for Windows NT (a-series) and DIGITAL UNIX system implementation of the 21164 microprocessor, incorporating emerging memory technologies to (au-series) incorporate a 21 164 Alpha micro­ improve system performance. This paper discusses the processor, a highly integrated core logic inter­ major architectural and design fe an1res of the DIGITAL face, synchronous main memory and cache, Personal Wo rkstation a-series and au-series, low-cost, and commodity PC parts. The traditional core high-performance systems powered by the 21164 logic chip set has been designed as a single-chip Alpha microprocessor. It fo cuses on some of the ASIC. The high-performance uniprocessor work­ unusual design fe atures incorporated into the new core logic chip, the 21174 application-specificint egrated cir­ station includes a low-cost interrupt scheme, cuit (ASIC). The paper also addresses a novel clock dis­ tight timing control of clocks for maximal per­ u·ibution strategy with fe edback fo r skew reduction, a formance, and a flash ROM interface. ]ow-cost interrupt scheme, and a capability to boot directly from reprogrammable fl ash read-only memory (tlash ROM). The original project was to build a platform fo r the DIGITAL Personal Wo rkstation running the Microsoft Windows NT operating system. This system is desig­ nated as the a-series (e.g., 433a, 500a, 600a, etc.). Soon alter shipping this product, the same hardware compo­ nents were qualified and shipped with the DIGITAL UNIX operating system as the au-series (e.g., 433au, SOOau, 600au, etc.).

Motivation

For years, main memory technology has been based on fa st-page-mode dynamic random-access memory (DRAM). In the personal computer (PC) market, fast-page DRA1\11s were soldered onto single in-line memory modules (SIMMs). Although the access times of these memories have been decreasing (i.e., 80 nanoseconds [ns]-+ 70 ns -+ 60 ns), improvements in CPU technology and core logic chip sets have evolved more rapidly. In 1995, the PC market began to move to extended data out (EDO) memories, which were roughly twice as fast as the fa st-page memories, and burst EDO was on the horizon. Several D RAl\11 ven­ dors began to talk about new synchronous DRAMs (SDRA1\11s), which provided even greater performance and the promise of reaching commodity price levels by 1997.

Digital Te chnical journ:1l Vol. 9 No. 2 1997 45 Red uced instruction-set computers (RISC) in gen­ Design Overview eral, and Alpha machines in particular, require faster memories. Alpha crus run much fa ster than the lead­ In traditional system-level designs, the memory data ing complex instruction-set computer (CISC) (e.g., bus goes through the core logic chip set.'·' This Intel x86) processors and need a fa ster path to memory approach isolates the CPU side of the data bus fr om to keep pace \V.ith rl1e demands to r instructions and the large electrical load of the memory chips and inter­ data. DIGITAL and orl1er Alpha system developers have connect, thereby al lowing the CPU to access its cache been satis�1ing this need with wider memory buses; fo r more quickly on the lightly .loaded bus. This approach inst

r------1 I 1 CACHE I ... r� --- ___ ADDRESS

r---.. r---.. r---· I I I I I I 21 164 OUICKSWITCH •:2a: '':2 a: ''2 a: ' ALPHA :2- ...... :2- ...... :2-I DATA BUS ,-

64

PCI{

Note: Components with dashed outlines are optional.

Figure 1 Syste m Block Diagram

46 Digit.11 Te chnical Journal Vo l. 9 No. 2 1997 In the open mode, they allow the CPU to access its component engineers within DIGITAL, we switched B-cache in isolation. When closed, memory data trav­ over to synchronous memory. (This was a fo rtuitous els through the bus separator with a small ( 2 50 pico­ decision, since commodity PC platforms are now mov­ seconds [ ps] through 5 ohms) propagation delay, ing to SDRAM, assuring its place as the next high­ thus avoiding the latency/perf ormance penalty of volume memory technology. ) going through the chip set. From our first look at memory bandwidth with the Incorporating high-volume commodity parts into a new SDRAM, we realized that designing a B-cache to design is Jess expensive than adding extra pins to a rel­ keep pace with it would be difficult. The system's atively low-volume, high-pin-count ASIC. The small memory, running at 66 megahertz (MHz), would have size of the bus separators allows them to be placed more bandwidth than any other Alpha workstation's closer to the CPU and cache fo r a shorter, quicker B-cache. In fa ct, it was difficultto design a B-cache that CPU interrace. Finally, a low-end cacheless configura­ was not the limiting factor in memory bandwidtl1 . tion can be built without them. From tl1e cache technologies supported by the 21164, Another feature of our CPU-to-memory interf:1ce we selected synchronous How-through static R.Ai\11 is that the "chip set" is a single chip. Although the cache parts. Although this choice was expensive, an original design called fo r three chips, new packaging external cache that significantly increased performance technologies were making larger pin-count devices was imperative fo r our product. more affo rdable and manufacturable.' We compared Our next major decision was choosing our ASIC various implementations fr om several ASIC vendors technology and package. After evaluating quotes for and eventually chose a single-chip implementation. different chip configurations from several vendors, we (The Te chnology Choices section in this paper dis­ selected the single-chip implementation in a 474-pin cusses the derails.) ceramic ball-grid array package (BGA) offered by Finally, making the B-cache optional is a new feature International Business Machines Corporation (IBM). fo r Alpha workstations. Recent Alpha workstations The package measures only l.O by 1.25 inches and have been hampered by the relatively slow, fast-page­ uses flip-chip technology, in which the die is bonded mode main memory coupled to an increasingly high­ directly to the ceramic package without lead wires. speed processor. With this imbalance, a large external The ceramic carrier is a seven-layer design that con­ B-cache has been essential to speed up CPU accesses by nects the die pads to a 50-mil (.050-inch) BGA fo r red ucing average memory latency and memory band­ assembly onto the printed circuit board . Contrary to width requirements. our initial concern, the package was not too difficultto We investigated an internal research project in which route on the printed circuit board. In addition to its very fast memory systems were coupled to the 21164 small size and reasonable price, the package offered CPU. A.n experimental cacheless machine used a 512- .low inductance (short) power, ground, and signal bit-wide (fast-page mode), low-latency memory system connections, which provided fo r good signal integrity. and performed very weU on many memory-intensive Figure 2 shows a photograph of the ASIC package. benchmarks. This work helped inspire the cacheless con­ Most of the other technology choices we faced do cept fo r the 21164 CPU; however, the performance of not warrant specific mention, except to r the system th e experimental system was relatively mediocreon some enclosure. Since our system was destined to become a benchmarks (cache intensive), which led us to conclude we needed to offer an optionalB-cache as well.

Te chnology Choices

As mentioned previously, system development was concurrent. with some significant changes to main memory technology. Our main issue was determining which of the new technologies-EDO, burst-EDO, synchronous, or Rambus-would be the next stan­ dard in the commodity PC business. This was a key decision : a wrong choice could lead to prohibitively expensive memory, effectively making the system a high-cost platfo rm . Our firstana lysis pointed to burst-EDO, since it was the most compatible with the existing fast-page mem­ ories. Several months into the design, however, many ofthe key memory vendors abandoned burst-EDO in Figure 2 fa vor ofSDRAM .. \Nith help and advice fr om memory Core Logic ASIC Package

Digital Tc chnie<1l JournJI Vo l. 9 No. 2 1997 47 DIGITAL Personal Wo rkstation, it had to be highly com­ The external PCI bus is specified to run as high as patible with DIGITAL's leading-edge Intel processor­ 33.3 MHz, a limit required fo r successn1l operation of based workstation (i-series). The use of common industry-standard I/0 option cards. The core logic components makes it easier tor development, manufac­ AS IC runs its PCI interface at PCI speeds and thus turing, distribution, and service. As a result, we had to needs a separate 33.3-MHz PCI clock. Since SysCLK fitour system into an existing DIGITAL PC enclosure, is running at 66.6 MHz, we can simply divide it by t\VO use the same power supply, have similar option config­ to arrive at a nominal 33.3-MHz PCLK. urations, and so fo rth. This constraint had both good In addition, the core logic AS IC has a clocking sys­ and bad ra mifications, which we discuss in a later sec­ tem capable of running these clock domains at many tion, Logic Partitioning and Enclosure. diffe rent ratios. It allows fo r oper

In Zen aml the Art of /VIolorcyc/e Ma /11/enance, 83.3 M Hz. With the CLK divisor set at t\vo (N = 2 in Ro bert M. Pirsig argues that one can look at a com­ the drawing), the output of the PLL will be running at

plex system from many diffe re nt viewpoints.; In that 166.6 MHz. With the PCLK divider at five (P = 5 ), the spirit, the logic/system design is illuminated by its resultant PCf clock rate becomes 33.3 MHz. The clocking details. We will touch on the major points dividers are clocked by both a positive PLL clock �m d of interest in r.hissect ion. an in,·erted PLL clock, allowing for symmetry in the output clocks fo r odd divisors. Clocking Overview Any significantdr iftor skew in rbe two major clock­ The main memory system W An odd frequency mix divide ratio set to nine, yielding a 66.6-MHz SysCLK, such as this one is often achieved with asvnchronous whereas a 500-MHz system wou ld set the ratio to boundaries and often with much pain. Asynchronous eight fo r a 62 .5-MHz system clock. A programmable, interfaces are difficult to design and to verifY. delayed version ofSysCLK, SysCLK_2, also is available Furthermore, the transitions across these boundaries from the 21164 CPU and is used as the main clock can often be slow and add data latency. The 21174 source in the 21174 core logic ASIC. In the system, core logic AS IC design allows tor flexible operation SysCLK_2 is delayed nearly a fi.dl clock cycle, so that by fr equencies, while maintaining a synchronous design. the time it reaches the core logic ASIC, it arrives The 2117 4 ASIC was designed to accommodate roughly coincident to SysCLK. several difkrent frequency combinations, and the internal data and control signals that cross the CLK to Clock Divisors and Clock Domains PCLK boundary were implemented accordingly. As Figure 3 shows a simplified block diagram of the the design progressed, however, we fo und it was too logic/system clock system. Once it enters tbe core difficu lt to run the external memory and B-cache sub­ logic ASIC, SysCLK_2 makes irs way to the input of systems fa ster than 66.6 MHz, given the current tech­ a phase locked loop (PLL) by means of a multiplexer nologies. As an unfortunate result, these fe atures have (MUX). SysCLK_2 passes straight through the MUX, not yet been used. which is a power management te aturc and is not cur­ rently used . The output of the PLL goes into two pro­ SDRAMClock Generation grammable divisors, one to generate CLK, the main Assho wn in Figure 3, the memory/DRAM clocks are clock fo r the core logic ASIC, and the other to gener­ generated fr om the 21174 AS IC. Normally, large ate PCLK, the Peripheral Component Interconnect ASICs-such as this one-have relatively slow IjO (PCI) bus logic interface to the core logic ASIC. Since cells and significant delay variation fr om chip to chip. CLK is the fe edback source to the PLL, Sys_CLK and This typically makes large ASICs a poor choice for dri­ CLK arc always at the same frequency. ving critical clocks (such as the memory clocks) in

41-: Digital Tec hnical Jou rnal Vo l. 9 No. 2 1997 SRAM_FI LL_EN SRAM_FILL_CLK 2 3 .,.------'=----=------,

21174 CORE LOGIC ASIC B-CACHE (OPTIONAL) DRAM CLK - TO SRAM_CLK [ (24 COPIES SRAM_CLK_IN 6 DIMMs) 3 COPIES l 13 12

ST_CLK

21 164 ALPHA CPU DRAM_CLK_IN FEEDBACK

PLL INPUT

ALTERNATE POWER CLOCK MANAGEMENT SWITCHOVER

PCI_CLK 6 (OPTIONS)

PCI CLK DISTRIBUTION (TO OUTPUT LATCHES ONLY)

Figure 3 Blo.:k Diagram of Clocks

high-speed system design, because right ti ming control The variable delay line consists of 128 independent of clocks is a key fa ctor fo r maximal performance. delay cells, wh ich have a delay of ro ughly 200 ps each. Therefore, most high-speed designs use costly, skew­ Each delay cell maintains the same polarity ofthe clock controlled bu ffe r chips and elaborate distribution but inverts it twice to compensate fo r diffe rent gate schemes to get tightly controlled clocks. Often, these rise and t1 ll rimes. By adjusting the number of delay circuits use PLLs as well to control the absolute arrival cells in the path, the 21174 ASIC can control the tim­ rime of the clocks. AJI these decisions result in good ing of the output clock. With that accomplished , the performance but at a price, both in cost and board area. core logic ASIC must now determine and control the Like other large ASICs, the 21174 has a large riming number of delay elements used. variation on irs I/0 ceJis. Unlike other large ASICs, it As Figure 4 shows, the 21 174 AS IC generates 13 employs a novel feedback circuit that automatically identical copies of the DRAM clock; 12 copies go to compensates fo r delay uncertainty and ultimately deliv­ the memory arrays (two per DIMM). The thirteenth ers tightly aligned clocks. Figure 4 shows some details copy fe eds back into the ASIC. ConcepruaJiy, the extra ofthis circuit, which is used to generate the main mem­ copy goes to a phase detector, which compares it ory clock, DRAM_CLK. This clock originates inside against its own internal clock, CLK. If the sampled the ASIC as a copy of the tightly aligned internal clock, D.RAM_CLKleads CLK, then the phase detector auto­ CLK, which is then distributed through a variable delay matically increases the delay of the DR.Al\1_CU( vari­ line. The resultant delayed clock is then sent, in multi­ able delay line by adding one more element to the ple copies, out oftl1e chip to tl1e SDR.Al\11 memory dual delay chain. Conversely, if D.RAM_CLK trails CLIC, in-line memory modules (DIMMs). one element of the variable delay line is removed.

Digital Techniol journal Vo l. 9 No. 2 1997 49 2117� Ccm£LOGK: ASIC

OCI.K OELJW LiNE. 1 it'8ELE\ENTS OCutC> 13 DRAM I I I I o/., 1 I I I :::: I I' I CUt: INJECTION I t:e_lf!f : -� � -� ___ - -�

' ' . ' ' ' - .J.- _ � ------.J ------....

Figure 4 DRAJ'v!_CLKClock Aligner

The alignment circuitry continues to add (or remove) up and synchronize any metastable state that may one delay element until the clocks are slightly past rJ1e occur and send a clean output ro the variable delay line optimal alignment; at tl1at point, it then subtracts (or control circuitry. adds) a delay element. The circuit rh us brings the clocks Although ASICs can have a large variation in output close to alignment and then toggles back and fortl1 by delay from chip to chip and fo r diHe rent voltages and one delay clement, which adds slight but acceptable jitter tempc:ratures,ceU delays are correlated to a large: degree to tl1e clocks. within a chip. In short, if one 1/0 ceU in a chip is run­ The phase detector is logically a D flip-flop, with n.ing fa st, rJ1ey will all be running fa st. Thus, by lining up CLK driving its clock input and DRAM_CLKas its data one oftl1e DRAMCLKs, they are a.U in rough alignment input. The rwo output states tl1us represent "CLK leads witl1 it. Altl1ough tlliscompens ates fo r tl1e output buffer DRAM_CLK" and "CLK trails DRAM_CLI(." The cir­ delay, the rest of the clock fe edback path is not automat­ cuit pings back and fo rth betvveen the final two delay ically correlated. elements because the phase detector does not directly The external printed circuit board wiring, however, indicate that clocks are in alignment. Also, once the has relatively little skew contribution because propaga­ clocks are in alignment, timing could driftbecause of tion delays in etch are relatively constant and easily cal­ delay changes caused by temperature or voltage varia­ culated. (The wiring will correlate to the etch delay of tions, which require the circuit to be lefton. tl1e other clocks going to the DINI Ms.) The input The goal of th.is clock circuit is to precisely align the buffer cell delay ofthe ked back clock is not correlated, phase detector's clock, CLK, with its data, the input but this is relatively fast and not consequential. copy of DRAM_CLK. This nearly guarantees that the Some additional aspects of tl1e circuitry arc wortl1 flop will be operating in violation of its normal setup­ mentioning. For example, in rl1e 21174 AS IC imple­ and-hold timing window because its data are changing mentation, the automatic delay alignment circuitry can at the clock edge. (State elements like ro have a setup­ be selectively enabled or disabled by software (it powers and-hold window around the clock where the data is up disabled). Softwaree

50 Digiral Technic.ll journal Vol. 9 No. 2 1997 generated fr om DCLK, which made the normally dif­ optimize victim eject timing to the core logic ASIC, ficu lt job of interfacing to the high-speed memory a th e system uses a fo rwarded copy of the ST_ CLK clock relatively easy one in terms of timing. sourced fr om the 21164 CPU and sent to the 21174 The clock-alignment circuitry was one of the first ASIC (see Figure 3). This clock, SRAM_CLIZ_IN, things we checked after we powered on the first system. works similarly to the 21164 CPU's wave pipeline fe a­

As soon as we wrote the bit to enable the auto alignment ture and allows ejection of a victim fi-om the cache to fe ature, thecl ocks snapped into the calculated/expected the 21174 ASIC in five SysCLK cycles rather than the positions. eight cycles it would have taken without it.6 For flexi­ bility in future B-cache upgrade cards, this clock B-cache Clocks routes through the B-cache module, which allows its timing to be fine-tuned fo r the specificB-c ache mod­ Each of the two ST_CLIZ pins on the 21164 CPU ule parts. Figure 5 shows this timing. are split into two separate copies through a resistor­ splitting network. The resultant to ur copies of Flash ROM and Boot ST_CLI ( route to the B-cache module. Three of these copies are available fo r the synchronous static RAMs The 21174 core logic ASIC implements the boot on the cache module, and the fo urth copy routes to code, the console firmware, and the nonvolatile RAM the 21174 core logic chip. The current implementa­ (NVR) interface in the system, another example of a tion of the B-cache module uses only one copy of system fe ature that was integrated into the core logic ST_CLI(, which is buffe red and sent to the individual ASIC. A single, one-megabyte (MB) fl ash ROM holds synchronous static RAM chips. the power-on self-test (POST), two 16-kilobyte copies The system platform delivers three additional copies ofNVR, and the AlphaBIOS and SR.J.\1 consoles. These of the DRAM_CLIZs to the B-cache module, called separate components were combined to save board SRAM_FILL_ CLI(, which also take advantage of the space that would have been used by separate parts: DRAM_CLI( alignment just described. These clocks serial ROM, NVR, and nvo programmable ROMs. can be used to clock the synchronous static RAMs dur­ The more highly integrated design also reduced ing memory fills to give the least amount of clock skew costs, which was consistent with our design goals. The between the memory and the cache. A separate set of single flash ROM is connected on the address bus time-aligned signals, SRAM_FILL_EN, can be used to between theCPU and the core logic ASIC (see Figure 1 ). switch between the ST_CLIZ and SRAM_FILL_CLIZ By placing tl1ese critical software components logically sources. closer to the CPU, rather than on the I/0 bus, we The 21174 ASIC implements a victimbu fte r fo r dirty provided a greater ability to diagnose the system when (updated) B-cache data being evicted to memory.< To not all of its parts are working. Flash ROM was used

CPU CLK IJ'UUUU llSLilJlru mn_n.n_

SYSCLK, 66 MHZ

ST_CLK AT SRAM

INDEX AT SRAM

VICTIM DATA AT 21 174 -+------+-----{ D1 1

FORWARDED ST_CLK AT 21 174

DATA SAMPLING POINTS AT 21 174 ASIC (FIRST THREE CAUGHT AT FORWARDED CLOCK, LAST OCTAWORD CAUGHT AT SYSCLK.)

Figure 5 Victim Eject Timing Diagram

Digital Techn.icalJournal Vol. 9 No. 2 1997 51 to ease updating the fi rmware in the field, allowing inadvertent corruption of the flash ROl'vl. This fe ature upgrades from a fl oppy diskette (possi bly downloaded is also useful fo r module-level debug, because serial ti·om the DIGITAL Web site). ROM parts with specifictest scripts can be made. The initial power-on boot sequence on the system is also unique among Alpha workstations. All Alpha CPUs Interrupt Controller can usc a serial RO M to load the primary instruction cache (I-ca che) at power-on; once the I -cache is full, The system provides separate inputs fo r every possi­ execution begins. Alternatively, the 21164 CPU can ble PCI interrupt (Figure 6 ), avoiding the problems bypass the serial ROM load and begin executing that shared interrupts bring: longer latency, unneces­ directly fr om memory. These instruction stream sary l/0, configu ration restrictions, and so fo rth. We (!-stream) accesses miss in the CPU's empty internal implemented a serial I/O scheme to handle PCI inter­ caches, causing external fills. The system implements rupts and miscellaneous IjO. This has the advantage this fo rm of power-on, using the core logic ASIC to of bypassing the Industry Standard Architecture (lSA) intercept !-stream fills and retrieve the data from flash interrupt handlers entirely, except fo r true ISA inter­ ROM. This saves the board space and cost associated rupts. Many previous designs run the PC! interrupts with either a specialized serial ROM part or the logic through the PCI-to-ISA bridge, which requires addi­ to serialize a standard ROM. tional l/0 to determine the device that needs attcn­ At initial power-on, the core logic AS IC interprets tion. In the design of the system, we tried to avoid the CPU requests fo r reads from addresses 00.0000.0000 need fo r time-wasting l/0 accesses wherever possible, through 00.03FF.t:FFF as requests fo r data from moving the access closer to the CPU if it could not be the flash ROM. The transaction begins with a eliminated entirely. Further down the hierarchy of read_block_miss command from the CPU. The core buses (CPU to PCI to ISA), latencies increase and the logic AS IC asserts cack_l to acknowledge the command possibility of contention rises. For a high-frequency and then asserts addr_bus_req to request the private lU SC CPU such as the 21164, stalling the CPU has a use of the CPU -to-21174 ASIC address bus. lt then significantly higher penalty than fo r slower processors. issues reads ro the fl ash ROM by passing address and The core logic AS IC uses only three pins to control control information on the now-reserved address bus. external shift registers fo r interrupts and general­ Since the core logic ASIC owns the bus, it does not purpose inputs and incorporates a fo urth pin that han­ have to adhere to conventional usage: the address bits dles general-purpose outputs (such as control bits or are fr ee fo r reassignment. Addr<31:12> is used as the light-emitting diodes [LEDs]). The external logic is byte address into the flashRO M; the eight-bit datum shown in Figure 6, where int_clk is the shiftclock into is returned on addr< 11 :4>. Sixty-four successive bytes the external shift registers, int_sr_load_l is the load read fi·om the flash ROM arc packed into a bufferin pulse f(x both input and output, int_sr_in is the serial the core logic ASIC and returned to the CPU to com­ input stream, and the serial output is on gp_sr_out. plete the original fillrequest. Inside the core logic AS IC, we used several registers There are two diffe rent address ranges used fo r fl ash in the interrupt controller. Allint errupts and general­ ROM fills:one starting at address 00.0000.0000 (to purpose inputs are readable in the 64-bit int_rcq regis­ allow power-on fi·om code in flash ROM) and another ter, letting the interrupt dispatch code determine at address OF.FCOO.OOOO (above any possible mem­ relative priorities. A corresponding int_mask enables ory). Both ranges access the same fl ash ROM data. or disables each input as an interrupt. To save external POST code quickly jumps to the high address range, inverters, an eight-bit register (int_hilo) is used to disabling the low range and fi·eeingth ose addresses fo r change the polarity of individual inputs in the low bvtc use by memory before it begins to size the DIMMs. (inputs are assumed to be active-low signals, but some Byte read and write access to the flash ROM is also inputs arc contrary). Another eight-bit register is supported fo r access to NVR and fo r updati ng the int_routc, which can selectively route some inputs to firmware. This address range starts at C7.COOO.OOOO. different interrupt lines on the CPU; although most Only two function-specificpin s on the core logic ASIC interrupts arc delivered on the CPU's irq_hd > line, are used fo r the flash ROM interface, write enable this fe ature allows the clock to come in at a higher ( t1ash_wc_l , deasserted during fills) and chip enable priority and other inputs to cause a machine check, a (flash_ce_l ). The fl ash ROM's output enable is con­ power- fai l interrupt, or a halt condition. trolled through an address line, addr<39>. The interrupt controller can be configured fo r the A socket is provided on the system mother board number of bits used, in blocks of eight, fo r as many as to allow fo r the use of a real serial ROM part. The 64 bits. We set the interrupt controller fo r 32 bits, circuitry automatically detects the presence of a serial implemented as fo ur, cascaded, eight-bit parallel-to­ ROM and will direct the 21164 CPU to boot fr om the serial shifi: registers and two serial-to-parallel parts, serial ROM port if the part is installed. This allows which was sufficient fo r the system. The timing of the a serial ROM to be used in case of damage to, or shifi: clockis adjustable to balance the cost of the shift

52 Dit;iralTec hnical Journal Vol. 9 No. 2 1997 21174 CORE LOGIC ASIC

INTERRUPTS TO CPU

0 <{ f- 0 ::;) _j � 0 � I I I _j cr: cr: cr: (.) (/) (/) (/) I I I I f- f- f- (l_ � � z CJ

CLK LD SHIFT STORE

DATA GENERAL OUTPUT IRQ SHIFT REGISTER OUT SHIFT REGISTER

IRQ/ IRQ/ IRQ/ GPO 1 GP0 2 GPO n GPIt 1 GPI t 2 GPIt n

INTERRUPTS OR GENERAL-PURPOSE OUTPUTS GENERAL-PURPOSE INPUTS

Figure 6 Interrupts and Gener�l-purpose f/Os

registers against the interrupt latency. These t\vo para­ For example, the state of the CPU's interrupt pins meters are maintained in the int_cnfg register. during system reset determines the external clock A small additionalinterrupt latency was introduced by characteristics of the 21]64, such as the SysCLK divide the serial interface. We judged this to be an acceptable ratio and the SysCLK,_2 delay (mentioned earlier). At compromise, since we eliminated several I/0 accesses power-on, these are set by pull-up and pull-down per interrupt. Note th at when multiple interrupts occur resistors on the module to the slowest speed. After withinthe same shift cycle, all arc available to the CPU's sorrware establishes the CPU's operating fr equency, it interrupt dispatch routine simultaneously, with only one can configurethe interrupt logic to drive these lines to read of the core logic ASIC's int_req register. the correct value and reset the machine. The next rime the CPU restarts, software recognizes the right clock Soft Reset parameters and conrjnues with the boor code.

The core logic ASIC generates and drives all the sys­ Embedded Real-time Counter and Timer Interrupt tem reset lines, including irs own. It receives a DC_OK (DC power supply levels charged and okay) signal A 64-bir real-time counter in the rt_count register, froman external voltage monitor, waits tor its internal ticking with the system clock, is useful fo r microscopic PLL to Jock, and then delays fo r a programmable (by ri ming measurements. The core logic ASIC provides means of externalpull-up /pull-down resistors) num­ an int_timc alarm register to interrupt the CPU at spe­ ber ofcycles before deasserting reset to the system. cificvalues ofrt_count. The system uses the real-time Also, the core logic ASIC implements a softreset clock in one ofits peripheral I/0 chips (super I/0) fo r teaturewhereby softwarecan reset the entire machine, the system timekeeper, bypassing the ISA interrupt with the exception of a few key registers in the core controller and bringing its interrupt straight to the logic ASIC. This allows software to reconfigure certain input shiftregisters, which saves PCI bandwidth and key CPU and core logic ASIC clocking parameters, hit processor cycles. reset, and acquire the new timing.

Vol. 9 No. 2 1997 53 FC Interface

Two pins on the core logic ASIC were set aside fo r general-purpose inputs or outputs, under total soft­ ware conu·ol. In the system, these are used by software to implement an FC (inter-I C) interface fo r clock and data. This serial interface allows input of static config­ uration data from the six DIMMs and the B-cache module. At power-on, flrmware reads information fr om small, serial electrically erasable read-only memo­ ries (EERO?vls) on each of these modules. Thus the interface can easily establish system conflguration as well as set up memory and cache timing. The DIMM FC RO M contents are defined by the Joint Electron Devices Engineering Councils (JEDEC) standards and can be programmed in place. Since the PC data is read on.ly once at power-on, it was not important to make this interface fast. Consequently, it was implemented as a "bit-bang" port, in which each transition on eachline is controUed by firmware accesses to logic registers. Firmware is responsible fo r timing of Figure 7 the protocol's clock and fo r setup and hold of the data. Electronics, .MotherBoard, and Riserfo r rhc Also, firmware handles deserialization(b yte packing) of DIGITAL Personal Wo rkstation, a-Series the incoming data stream.

logic Partitioning and Enclosure and performance. For those users with a CPU­ intensive application who do not need high-quality, Most of the system logic is partitionedonto two boards: high-performance graphics, a less expensive adapter a riser card and the mother board (often referred to as may be adequate. Demanding users, such as those the main logic board [ MLB ]). The riser card is used to r doing mechanical computer-aided design (CAD), wi ll aU components common across the DIGITAL Personal find that tl1e high-end graphics cards offer a signincant Wo rkstation Line;it includes five option slots, audio, and boost in quality and performance. Ethernet logic. All internal cables connect to tl1e riser, A small computer systems interface (SCSI) controller which is intended to be common between the plat­ is provided as a PCI op tion card fOr several reasons. fo rms. Figure 7 shows a photograph of the system logic. First of all,it was more expensive to embed SCSI and Because ofthe shared Iiser, me a-series and i-series sys­ provide a bulkhead card fo r an external connection. tems have much in com mon. For example, th ey use the same PCI-PCI bridge chip, witl1 identicaloption slot lay­ outs. The CO-qual ity audio and 10/100 megabits per second Ethernet logic arc common, as are tl1e bulkhead cards fo r these signals. The MLB contains the CPU, core logic chip (set), cache, memory, and miscellaneous external connec­ tors. Figure 8 shows the optional B-cache module and a custom-designed memory DIM.!\1.Be cause there are no internal connections, it is easy to remove an Intel MLB and replace it with an Alphaboard . In a design compromise, we added a PCI-IDE chip on tl1e M LB to match me partitioning cl.i.ctated by co st­ ing i-selies machines, which have integrated device elec­ tronics (IDE) built into their core .logic. The natural place fo r such a device would have been on the 1iser card. A graphics controller was not embedded so the customer could select from adapters of varying cost

Figure 8 Optional 2-MB B-Cache and 64 -MR OIMM

54 Digital Te chnical Journal Vol. ') No. 2 1997 Using an option card allows us to move to better SCSI his work on the cacheless machine. Finally, special solutions as they become available. Customers with a acknowledgment is due to Reinhard Schumann, who light disk IjO load may choose to use a less expensive contributed an extraordinary amount ofin spiration to IDE hard disk. the design and perspiration to the implementation. Although there were many benefits to sharing so many components between the a-series (Alpha) and the References i-series (e.g., Intel x86), the development process was not conflict-ffee. The Alpha development team fa ced l. S. Nadkarni, vV. Anderson, L. Carlson, D. Kravitz, M. several significant problems. Because the i-series devel­ Norcross, and T. \Venners, "Development ofDIG ITAL's opment started sooner, much of the feature set and PCI Chip Sets and Evaluation Ki t fo r the DECchip logic partitioningwas already committed before vve had 21064 Microprocessor," Dil; ital Te chnical jo urnal. vol. 6, no. 2 (Spring 1994): 49-6 1. a chance to participate in the design. The original enclo­ sure proved to be inadequate, and we had to make 2. ]. Zurawski, J. Murray, and P. Lemmon, "The Design changes fo r cooling, FCC containment, and mechanical and Verification of the AJphaStation 600 5-series Wo rk­ support of option cards. Also, there were difficulties station," Digital Te chnical journal, vol. 7, no. 1 that reflected the ditkrent market requirements tor PCs (1995): 89-99. and workstations. For instance, an external HALT 3. Pe ntium Processors and Related Products (1\ll t. button, which fo rces a running machine to the firm­ Prospect, Ill.: Intel Corporation, Order No. 241732· ware console, is a standard fe ature in the UNIX and 002, ISBN l-55512-239-6, 1995 ).

OpenVMS markets but is missing in the DIGITAL 4. R. Schumann, "Design of the 21174 Memory Con­ Personal Workstation. troller for DIGITAL Personal Workstations," Digital Te chnicaljournal, vol. 9, no. 2 (1997): 57-70. Results and Summary 5. R. Pirsig, Zen and the A11 oflvlotorcyc!e Ma intenance

(New York: William Morrow & Co., 1974 ) . The 21174 core logic ASIC was implemented as a standard-cell ASIC design and uses about 320,000 6. DIGITAL Serniconductor 21764 (366MHz Th rott,r;h cells (250,000 gate equivalent) on a 7.2-miUimeter 433/VIHz) Alpba iVI icroprocessor Ha rdware Reference square die, routed in five layers of metal. The design iVI anua! (Maynard, Mass.: Digital Equipment Corpora­ uses 384 signal pins in a 474-pin ceramic BGA pack­ tion, Order No. EC-QP99A-TE, 1996 ). age; the remaining pins are used tor power and ground. The DIGITAL Personal Workstation a-series Biographies and au-series include the 21164 Alpha CPU, the 21174 core logic ASIC, synchronous DRAM, and a 64-bit PCI bus.

Acknowledgments

The authors would like to thank all the people who contributed their skills and time to the successful design of the Alpha microprocessor-based personal workstations known internally as the PYXIS/MIATA/ MX5 projects. Special thanks are due to the core engi­

neering design, development, and debug team: Arlens Kenneth M. We iss Barosy, Frank Calabresi, Jim Delmonico, JeffForsythe, Ken Weiss joined DIGITAL in 1983 as a software engineer Frank Fuentes, Paul Hill, Bob Hommel, Mark Ke lley, developing CAD (computer-aided design) applications Yong Oh, Rob Orlov, Rajen Ramchandani, Dan Riccio, fo r signal integrity analysis and physical design. He later Don Rice,Ty Rollin, Arnold Smith, and Dean Sovie. joined the Alpha Wo rkstations Group as a principal hard­ ware engineer and was the engineering project leader fo r Many thanks also are in order fo r our qualification the 21174 ASIC and the DIGITAL Personal Workstation team, including Bill Grogan, Bob O'Donnell, Phil a-series machines. In addition, Ken designed and imple­ Puris, Lynne Quebec, and Matt Twarog; our engi­ mented the ASIC and module-level clocking system and neering manager, Fred Roemer; our product manager, contributed to the signal integrity and timing verification Keith Bellamy; and our group's technical writer, of the system. He received a B.S. in electrical engineering trom Cornell University in 1983 and an M.S. in computer Carmen Wheatcroft. We also wish to acknowledge science from Boston University in 1991. Ken is now work· Dave Conroy fr om the Systems Research Center fo r ing fo r Sun Microsystems.

• Digital Technical Journal Vo l. 9 No. 2 1997 55 Kenneth A. House Kenny House is a principal software engineer f( Jr Wo rkstations Engineering and was the team leader fo r verification of the 21174 core logic ASIC in the Alpha microprocessor-based DIGITAL Personal Wo rkstation. Prior to this proj ect, he worked tor AlphaStation Engineering in software support and 1/0 integration, for VA Xstation Engineering as liaison between operating systems and hardware engineeringgroups, and fo r DECmate Engineering on tl rmwan:, diagnostics, drivers, and simulation. Kenny joined DIGITAL in 1992 after fi fteen years as an independent consultant. He holds a U.S. pa tcnr fo r a SCSI bus extender and was :�warded a B.S. in mechanical e ngi nee ring from the Massachusetts Institute ofTechnolog" in 1969.

56 Digital Tc dmical )ourn;tl Vol. ') No. 2 1997 I Reinhard C. Schwnann

Design of the 21174 Memory Controller for DIGITAL Personal Workstations

DIGITAL has developed the 21174 single-chip core As microprocessor performance has relentlessly logic ASIC for the 21164 and 21 164PC Alpha improved in recent years, it has become increasingly important to provide a high-bandwidth, low-latency . The memory controller in the memory subsystem to achieve the fi.1 ll performance 21174 chip uses synchronous DRAMs on a 128-bit potential of these processors. In past years, improve­ data bus to achieve high peak bandwidth and ments in memory latency and bandwidth have not increases the use of the available bandwidth kept pace with reductions in instruction execution through overlapped memory operations and ti me. Caches have been used extensively to patch over simultaneously active ("hot") rows. A new pre­ this mismatch, but some applications do not use diction scheme improves the efficiency of hot caches effectively. ' This paper describes the memory controJ ier in row operations by closing rows when they are the 21174 application-specific integrated circuit not likely to be reused. The 21174 chip is housed (ASIC) that was developed fo r DIGITAL Personal in a 474-pin BGA. The compact design is made Workstations powered with 21164 or 21164A Alpha possible by a unique bus architecture that con­ microprocessors. Before discussing the major compo­ nects the memory bus to the CPU bus directly, nents of the memory controller, this paper presents or through a QuickSwitch bus separator, rather memory performance measurements, an overview of the system, and a description of data bus sequences. It than routing the memory traffic through the then discusses the six major sections of the memory core logic chip. controller, including the address decoding scheme and simul taneously active ("hot") row operatjon and their eftect on latency and bandwidth.

Project Goals

At the outset of the design project, we were charged with developing a low-cost ASIC fo r the Alpha 21164 microprocessor fo r use in the DIGITAL line of low­ end workstations.2.J Although our goal was to reduce the system cost, we set an aggressive target fo r mem­ ory pertormance. Specifically, we intended to reduce the portion of majn memory latency that was attt·ibut­ able to the memory controJler subsystem, rather than to the dynamic random-access memory (DRAM) chips themselves, and to use as much ofthe raw band­ width of the 21164 data bus as possible. In previous workstation designs, as much as 60 per­ cent of memory latency was attributable to system over­ head ratl1er than to tl1e DRAM components, and we have reduced that overhead to less than 30 percent. In addition, the use of synchronous DRAMs (SDRAMs) allowed us to provide nearly twice as much usable band­ width to r tl1e memory subsystem and reduce the mem­ ory array data path widtl1 from 512 bits to 128 bits. Figure 1 shows the measured latency and bandwidth

Digital Te chnicaJ Journal Vo l. 9 No. 2 1997 57 400

350 r- 300 - - - 250 r- I - - r- 200 r- r-- - 150 f- 100

50

0 400-MHZ 180-MHZ HP 167-MHZ 433-MHZ 13j3-MHZ 200-MHZI 433-MHZ ALPHASTATION C180 ULTRASPARC 1 21174 CHIP PENTIUM PENTIUM PRO 21 174 CHIP 500 WITH 2MB TRITON 440FX WITHOUT L3 CACHE CACHE KEY: D BANDWIDTH, MB/S D LATENCY, NS

Figure 1 Measured LJtency and Bandwidth

characteristics attainable with the 21174 chip in the bandwidth fo r direct attachment to the data pins of DIGITAL Personal Workstation Model 433a and the 21164, and all three are available at little or no compares that data with measurements fo r se\'eral price premium over fa st-page-mode DRAMs. other systems. The STREAM (McCalpin) bandwidth vVe eventually chose SDRAMs for our design measurements were obtained fr om the Universityof because their bandwidth is better than that of the other Virginia's Web site.' The memory latency was mea­ two types, and because the industry-standaxd SDRAJ'vls sured with a program that walks linked lists of various use a 3.3-volt interface that is fullycompat ible with the sizes and extrapolates tile apparent latency from the 21164 data pins.s Fortuitously, SDRAJvlswere begin­ run times.' Note that the measured latency also ning to penetrate the mainstream memory market at includes the miss times fo r all levels of cache. the time we started our design, which provided assur­ Despite our ambitious memory performance goals, :tnce that we would be able to buy them in volume our primary goal was to reduce cost. We considered rrom multiple suppliers at little or no price premium. eliminating the level 3 (L3) cache fo und in earlier Although we eliminated the data path slices in the DIGITAL workstations from our system design -" Pre­ path fr om memory to the CPU, \.Ve still needed a data vious research indicated that a 21164 microprocessor­ path from the memory to the peripheral component based machine without an L3 cache could achieve interconnect ( PCI) I/0 bus. When we started the good performance, given good memory latency and design, it seemed clear that we would need multiple bandwidth 7 Clearly, elimination of the L3 cache chips to accommodate the data path because of' the would reduce perr(xmance; we chose instead to retain 128-bit memory bus, 16-bit error-correcting code the cache but as an optional fe ature. (ECC) bus, the 64-bit PCI bus, and their associated Typically, memory controllers used with Alpha control pins. We planned to partition the design to fit microprocessors have used a data path slice design, into multiple 208-pin plastic quad flat pack (PQfP) with two or to ur data path chips connecting the CPU packages. Our first design approach used two data to the DRAMs.Early in our design process, we realized path slices to connect the memory to the PC! bus and that these data path chips could be eliminated if the a third chip fo r the control logic.

memory cycle time were made fa st enough fo r nonin­ Aftera preliminary feasibility investigation with two

terleaved memory read data to be delivered directly to AS IC suppliers, we decided to combine all three chips the CPU. Previous designs used fa st-page-mode into one chip, using emerging ball grid array (BGA) DRAMs, bur they could not be cycled quickly enough packaging technology. Although a single chip would be to provide adequate bandwidth from a 128-bit-wide more expensive, the elimination oftwo chips and more memory. Consequently, a 256-bit or 512-bit memory than 100 interconnections promised a reduction in data path was used in those designs, and data from the main logic board complexity and size, thus partial ly off� wide memory was interleaved to provide adequate setting the additional cost of the si ngle chip design. In bandwidth to the CPU. Recently, several types of addition, the performance would be slightly better higher bandwidth DRAMs have been introduced, because critical paths would not span multiple chips

including extended data out (EDO), burst EDO, and and the design process would be much easier and bster. SDRAMs . All three memory types have adequate

58 Digiral Te chnical Journal Vo l. 9 No. 2 !997 System Overview nected, which speeds up CPU cache accesses signifi­ cantly. As a side benefit,direct memory access (DMA) Figure 2 shows a system diagram fo r the cacheless traffic to and from main memory can be completed design. Note that the memory array connects directly without interfering with CPU/cache traffic. The to tl1e CPU. Figure 3 shows a configuration with an arrangement shown in Figure 3 is used in the DIGITAL optional cache. In this configuration, a QuickSwitch Personal Wo rkstation, wiili the cache implemented as a bus separator isolates the CPU/cache bus from removable module . vV ith this design, a single mother­ the memory bus. (A QuickSwitch is simply a low­ board type supports a variety of system configurations impedance fieldeff ect transistor [PET] switch, typically sparming a wide performance range. several switches in one integrated circuit package, con­ The memory controller also supports a server config­ trolled by a logic input. In the connected state, it has an uration, shown in Figure 4. In the server configuration, impedance of approximately 5 ohms, so it behaves as many as eight dual in-line memory module (DIMM) almost like a perfect switch in this appJication.) From pairs can be implemented. To cope with the additional an architectural standpoint, it is not absolutely neces­ data bus capacitance fr om the additional DIM!Yls, the sary to separate the buses when a cache is provided, but DIMMportion of the data bus is partitioned into fo ur the bus separation substantially reduces tl1e capacitance sections using QuickSwitch FET switches; only one sec­ on the CPU/cache bus when the buses are discon- tion is connected to the main data bus at any one time.

21 164 ·---· ·---· ·-- I I I I I I ALPHA ' '2a:' '2a:' '2a: CPU DATA ::2;- 1-1:2-'-'2_ , 12B+ECC I -.0:I I -.0:I I-.o:, 10CL1 10CL1 10CLI 1- _ , _ _ ADDRESS , _ _ , , ,

I 36 CHIP ENABLE, WRITE ENABLE FLASH I 21 174 ROM I CHIP BGA ADDRESS -{> AND CONTROL I CONTROL 10

64 PCIf

Figure 2 CacheJess Wo rkstation Configuration

·------.. I I

.------'-r-�..., �c��-J ADDRESS ·---.. ·---.. ·-- I I 21 164 I I I I CKSWITCH ' g; g; Q; ALPHA ��� � :_: � :_: � DATA 1--.1..----1 r----,------7"-----t : CPU 128+ECC SEPARATOR :a�: : a�::a� : _ ,_ ADDRESS , _ - " , - " - " l 36 CHIP ENABLE, WRITE ENABLE FLASH I 21 174 ROM I CHIP �>----'------' BGA V ADDRESS AND CONTROL I CONTROL 10

Note: Components with dashed outlines are optional.

Figure 3 Wor kstation Configuration with Optional Cache

Digiral Te chnical Journal Vo l. 9 No. 2 1997 59 r---.. r---• I I I I r--- I r------�1 �� � ��: u 1_<{1 1-

·------.. I ::2' a: I I ::2'a:I I I �------�1 �� � �� : 1 on_ 11on_ 1 1- - J 1- - J r-��c��-J ·---.. ·---· .----.....l..--. I I I I ADDRESS 1 � � :_: � � 1------1 : 21 164 QUICKSWITCH .-c:r:: ••-

36 I L CHIP ENABLE, WRITE ENABLE F ASH 21 174 FAN-OUT ROM I CHIP AND �---L-�--�-L--�--L--� I BGA ADDRESS AND CONTROL I DECODE CONTROL 10

Figure 4 Server Configuration

In addition to the QuickS\\�tcb bus separators, the data cycles. Figure 5 shows the basic memory read server configuration requires a decoder to generate 16 cycle; Figure 6 shows the basic memory write cycle. separate chip select (CS) signals f[·om the 6 CS signals To provide full support fo r all processor and DMA provided by rJ1e 21174 chip. request types, as well as cache coherence require­

ments, the memory controller must implement :1 large

Data Bus Sequences variety of data transfer sequences. Table l s hows the combinations of data source and destination and rhe The 21164 CPU has a cache line size of 64 bytes, u·ansactions that use them. All memory and cache d:ua which corresponds to fo ur cycles of the 128-bit data transfe rs are 64-byte transkrs, requiring four cycles of bus. For ease of implementation, the memory con­ the 128-bit data bus. AJ I IjO transfers arc 32-byte troller does all memory operations in groups of to ur transfers, requiring two cycles.

66-MHZ SYSCLK I I I CMD ��: ----�-----T-----T-----T-----T-----r-----r----,r----,_---�

ADDR

DRAMADDR

/RAS

/CAS

/WE

D

Figure 5 Basic Memory Read Timing

60 Dip;iralTec hnical )oumal Vol . 9 No. 2 1997 66-MHZ SYSCLK I I I CMD ��:-----T----�----r----T-----r----T-----, I ��------,------,------,-----,-----�------,------, ADDR ��: ____ _L____ -L----�----L---��--�-----L

IRAS

/CAS

/WE

D

Figure 6 Basic Memory Write Timing

Ta ble 1 Data Bus Tra nsfer Ty pes

To 21 164 CPU Cache 21174 ASIC DRAM

From

21 164 CPU Private 1/0 write L2 victim Cache Private L3 victim DMA read DMA write' 21 174 ASIC 1/0 read Fill ' DMA read' L3 victim

Fi l l ' DMA write'·' DMA write DRAM Fill Fi ll DMA read DMAwrite'

Notes 1. Data from victim buffer. 2. Merge data.

The memory controller is designed to permit par­ tions is needed to avoid drive conflicts on the bus tially overlapped memory operations. The 21164 resulting from clock skews and delay variations CPU can emit a second memory-fillrequest before the between the two data sources. fill data from the first request has been returned. Figure 7 shows a timing sequence fo r tvvo read com­ Memory Controller Components mands to the same group of memory chips. Note that the second read command to the SDRAMsis issued Figure 8 shows a block diagram of the 21174 chip. while the data from the first read is on the data bus. Table 2 gives an overview of key ASIC metrics. A dis­ The two read commands (four data cycles each) are cussion of the entire chip is beyond the scope of this completed with no dead cycles betv.reen the read data paper, but note that there are two major control for the first read and the read data fo r the second read . blocks: the PCI controller manages the PCI interface, For most other cases, the data cycles of adjacent trans­ and the memory controller manages the memory actions cannot be chained together without interven­ subsystem and the CPU interface. The memory con­ ing bus cycles, because of various resource conflicts. troller has six major sections, as shown in Figure 9. For example, when the bus is driven fr om different The detailed functions of these components are SDRAMchips, at least one dead cycle between u·ansac- described next.

Digital Te chnical journal Vo l. 9 No. 2 1997 61 66-MHZ SYSCLK

CMD

ADDR

DRAMADDR

/RAS

/CAS /WE �: D - :� -=�====�======�====�====�====*·----�·--�

Figure 7 Pipelined Read Timing

ADDR D COMMAND SYSCLK t PLL AND CLOCK GENERATO R DIMM CLOCKS

[------DRAM_ADDR MEMORY CONTROLLER

FLASH CE --.---t------­

FLASH WE --.---+------'

PCI CONTROLLER 1•------1 l

PCI CTL PCI AD

Figure 8 Block Diagram of the LogicChip

Memory Sequencer ory addresses and memory commands. Table 3 lists The memory sequencer provides overall timing con­ the transaction types supported by the main sequencer trol f(x all portions of the 21174 chip except the PCI and the number of states associated with each type. interface_ To facilitate partial overlapping of memory The data cycle sequencer is dispatched by the main transactions, the sequencer is implemented as two sep­ sequencer and executes the four data cycles associated arate state machines, the main sequencer and the data with each memory transfer. Altl1ough the data cycle cycle sequencer. sequencer is simple in concept, it must cope with a wide The main sequencer is responsible for overall trans­ variety of special cases and special cycles. For example, action tlow. It starts all transactions and issues all mem- in the case of reads and writes to flash read-only mcm-

62 DigitJI Tc chnicJI journ.1l Vo l. 9 No. 2 1997 Ta ble 2 Ta ble 3 21 174 ASIC Summary Main Sequencer States

Function Core Logic Transaction Ty pe Number of States

Data bus clock 66 MHz Idle 1 PCI clock 33 MHz Wa it for idle 2 Memory transfer size 64 Bytes Cache fill 4 Memory latency 105 ns (7-1-1-1) Cache fill from flash ROM 4 Hot row latency 75 ns (5-1 -1-1) Cache fill from dummy memory 2 Data bus width 128 bits + ECC L3 cache victim write back 3 SDRAM support 16 Mbit and 64 Mbit 3.3V L2 cache victim write back 2 SDRAM CAS latency 2 or 3 DMA read 9 SDRAM banks/ch ip 2 or 4 DMA write 14 PCJ bus width 64 bits SDRAM refresh 2 Gate count 250,000 gates SDRAM mode set 2 Pin count 474 total (379 signal pins) PCI read 2 Package Ceramic BGA CSR read 3 Process .35-f.l.m CMOS Flash ROM byte read PCJ or CSR write 1 Flash ROM byte write 1 ory (ROM), stall cycles are needed, because of the slow Errors 2 access time and narrow data bus of the flash ROM. Total states 55 Table 4 gives the cycle types supported by the data cycle sequencer. Some comrol signals fi·om the memory sequencer data is selected on the internal data path multiplexer are driven fi-om the main sequencer, and some are dri­ one cycle before it appears at the data pins, so the ven tram the data cycle sequencer. In addition, a num­ select signal fo r the internal multiplexer is driven by ber of control signals may be driven by either state the main sequencer fo r the firstdat a cycle of the DMA machine. This is necessary because of pipeJining con­ write and is then driven by the data cycle sequencer tor siderations, as transactions are handed off from the three additional data cycles. The shared control signals main sequencer to the data cycle sequencer. For exam­ are implemented separately in each state machine, and ple, during a Di\1A write transaction, the Di\1A write the outputs are simply ORed together.

COMMAND A

VICTIM - ADDRESS HIT BUFFER ,--

,--- REQUEST r- - SDRAM_ADDR ADDRESS MULTIPLEXER OMA ADDRESS /WE r-- A r-- r-- /RAS /CAS SDRAM r-- ADDRESS r-- /CS r- DECODER L MEMORY f-- SEQUENCER SELECT '--- '---- ARBITER OMA REQUEST If CAS ASSERTED 1

REFRESH TI MER Lf

Figure 9 G.l.ock Diagram of rileMemory Controller

Digiral Te chnical Journal Vol. 9 No. 2 1997 63 Ta ble 4 address logic, the address bits are assigned to row and Data Cycle Sequencer States column addresses in a way that allows support of many

Data Cycle Ty pe different SDRAM chip types, always using the same row address selection and using only rwo possible Idle selections of column addresses. T:t ble 5 gives the Cache fill address bit assignments. Cache fill from victim buffer To fu rther accelerate the address decoding, the Cache fill from flash ROM input to the address multiplexer is sent directly f'rom Cache fill from dummy memory the address receiver aheJd of the receive register, and a L2 cache victim write back significant portion of the address decoding and multi­ L3 cache victim write back to 21174 ASIC plexing is completed bdore the first clock. L3 cache victim write back from 21174 ASIC to memory SDRAM Address Decoder DMA read from memory To achieve minimum latency, the address path of the DMA read from L2 cache memory controller is designed to send a row or col­ DMA read from L3 cache umn address to the SDRAMs as quickly as possible DMA read from victim buffer after the address is received from the CPU. This goal DMA write to memory conflicts with the need to support a wide variety of Read merge data from memory for DMA write DIMM sizes in a random mix, and the overall address Read merge data from L2 cache for DMA write path is extremely complex in fa n-out and gate count. Read merge data from L3 cache for DMA write The opening and closing of rows (row access strobe Read merge data from victim buffer for DMA write [RAS]) ;md tile launch of transactions (column access CSR read strobe [CAS]) is controlled separately fo r each DIMM PCI read pair by one of eight DIMM pair controllers. Each DIMM Flash ROM byte read pair controller, in turn, contains eight bank controllers to control the fo ur possible banks within PCI or CSR write each ofthe two groups ofmemory chips on the DTMM Flash ROM byte write pair, to r a total of 64 bank controllers. Figure ll shows the details. Address Multiplexer The address path and the launch of memory The address multiplexer accepts an address from the transactions are control led by fo ur signals fr om the 21164 address bus, fr om a DMA transaction, or fr om main sequencer: SELECT, SELECT_FOR _READ, the victim buffe r and then selects a subset of the SELECT_fOR_WRJTE, and SELECT_FOR_R.JVIW. address bits to be driven to the multiplexed address The SELECT signal directs the DIMM pair con­ pins of the SDRAMs. Figure 10 shows a block diagr

SELECT_16M BIT_DI MM_PAIR

SELECT_64MBIT _DIMM _PAIR

DEFAULT_SDRAM_ADDR

COLUMN ADDRESS FOR 16-MBIT DRAMS COLUMN MUX CPU ADDRESS COLUMN ADDRESS ADDRESS FOR 64-MBIT DRAMS A[27:4] DMA ADDRESS SDRAM_ADDR[13 0[ VICTIM ADDRESS [25 12] ROW ADDRESS

'------A[33 12[ (TO DRAM TRANSACTION CONTROLLER)

Figure 10 Address Multiplexer

64 Digit:1lTe chnical Journal Vo l. 9 No. 2 1997 Ta ble 5 Row and Column Address Mapping

SDRAM Banks Number of Bits, Bank Select Row Bits Column Bits' Chip Ty pe Row/Column Bits

16M x4 4 14/1 0 25,24 23:12 27:26,11:4 8M x8 4 14/9 25,24 23:12 26,11:4 4M X 16 4 14/8 25,24 23: 1 2 11:4 16M x4 2 14/1 0 24 25,23:12 27:26,11:4 8M x8 2 14/9 24 25,23:12 26,11:4 4M X 16 2 14/8 24 25,23:12 11:4 4M x4 2 12/1 0 23 23:12 25:24,11:4 2M x8 2 12/9 23 23:12 24,11:4 1Mx 16 2 12/8 23 23:12 11:4

Note 1. All 1 0 or 12 column bits for each SDRAM size are sent to the SO RAMs. For x8 or x16 SDRAM chip configurations, one or two high-order column bits are ignored by the SDRAMs.

SELECT FOR READ SELECT FOR WR1TE SELECT FORsa REMWCT ======�ll QOCK _ ==·------======;l �------� GROUP 0. 0 1 SANK ArSS ERT RAS ROVJ ADDRESS COMPARE f------, ASSERT CAS f------, I ACTIVE ROWREG1STER I IASSERT WE f--- 1----lGROUP 0. BANK 1----lG ROUP . BANK 21 0 CS OPO RANKO GROU P 01. BANK 1 1----lGROUP I. 1----l::�:: :1::�1 ::1BANK:: :2� � �JJ[�I §I II�II �I--)���::����� 1----lG ROU P 1 BANK 3 A -- DIMM PA1R o BASE ADDRESS AND soZE .... � 1 �;:tr-r-:J4H:[ >i-�:o:sc��DPO�R�:':'_NK�1 1 +��---ji.D:::,M:.�, :r.:,R::o:----;D;;;I�:;; E lll,e ADDRESS GROUP 1 S�E;;;;PLAE:;;;IR��;,� TD COMPARE GROUP 0 SE'LECTED YL _':: ,Hi1 OIMM PA1R O y �::::::::: ::::: ::: ::::::::::::::::::::::::::::::: CHIP DIMM PA R 1 DIM PAIR , BASE--ADDRESS A o SIZE :,� � �F==m ���'h�e R I� LJ ICS -_-_ -_ -_ -_ -_ -_ -_- _-_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_-_ -_ -_ -_ -_ -_ -_ -_ -_ ��� � -��-� I - ______� ���m OIMM PAIR 2 B ASE_ ODRESS A _ _ _ A 0 SIZE _ ���� !��� ______-�� ����f= :��w=�Ttt1s__ ::: �=::: =F�::- -::------::::::::------:- �:::::::: : :------� -_- - DIM .1 PAIR 5 ASE ADDRESS SIZE ______=- ______- B AND _ _ -�------�------i-�0�- -�-��------�-� ���w'lTTffin DIM' PAIR ° - DIM� PAIR 6 BASE ADDRESS A D SIZE -�C �mtflmml _-_ -_ -_ -_ -_ -_-_ -_-_ _-_ _-_ _-_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_ -_-_ -_ -_ -_ -_ -�� -_ - -_ - - - -_ _ Li'- ���� -�:���- --1=='�:m m�:m1 m�mJ ttj OIMM PA IR 7 BASE AODRESS o\NDSIZE -c______------41 ll l '-r-r---' -H AAS T y =L>- '-----t-+=l_)-� �� / Co\S L------rr� ��ME REFAESH - =l)- PAECHARGE_FOR_REFRESH ____j

Figure 11 SDRAJ.'vl Address Decoder

Digital Technic11 journal Vo l. 9 No. 2 1997 65 must be closed by a precharge command before the Hot Rows correct row can be opened. The main sequencer asserts The SDRAMs have an interesting fe amre that makes it SELECT continuously until a data transfer is started. possible to improve the average latency of the main The DIMM pair controllers do not need to report memory. They permit two (or fo ur) hot rows in the the completion of the SELECT operation. They simply two (or fo ur) banks in each memory chip to remain open the selected row as quickly as possible. However, active simultaneously. Because a workstation configu­ if none of the DIMlvl pair contwllers recognizes the ration has as many as three DIMM pairs, with either address, a nonexistent memory status is signaled to the one or two separately accessed groups of chips per mam sequencer. DIMM pair, as many as 24 rows can remain open The SELECT signal can be issued speculatively. There within the memory system. (In the server configura­ is no commitment to complete a data transfer to the tion, the memory controller can support as many as selected group of SDRAl\lls. For example, on a DMA 64 hot rows on eight DIMM pairs.) read, SELECT is issued at the same time that a cache The collection of hot rows can be regarded as a probe is launched to the CPU. If the CPU subsequently cache. This cache has unconventional characteristics, provides theread data, no memory read is needed, and since the size and number of the hot rows vary with the main sequencer simply deasserts SELECT. the type of SDRAM chip and with the number and The main sequencer asserts a SELECT_FOR_READ , type of memory DIMMs installed in the machine. In SELECT_FOR_WRITE, or SELECT_FOR_RMW the maximum configuration, the cache consists of 24 signal when it is ready to start a data transfer. cache lines (i.e., hot rows) with a line size of 16 ki lo­ SELECT_F OR_RMW starts a normal read sequence, bytes ( KB) , fo r a total cache size of 384 KB.Al though with the exception that the row hit predictor is the cache is not large, its bandwidth is 16 KB every ignored and the row is uncondition:ll ly leftope n at the 105 nanoseconds or 152 gigabytes per second. (The end of the read, in anticipation that the next operation bus between this cache and main memory is 131,072

wi ll be a write to the same address. (The subsequent bits wide , not counting the 16,384 ECC bits') write sequence wo uld work properly even if the row Table 7 gives the state bits associated "'-i th each bank were not left open, so this is strictly a performance within each group of chips on each DIMM pair in optimization.) In response to a SELECT_FOR_x sig­ the memory system. To keep track of the hot rows, a nal, the selected DIMM pair controller starts the speci­ row address register is provided to r each bank. There fiedtransaction type as soon as possible. If necessary, are 64 copies of the state given in Table 5, although the transaction start is delayed until the selected row only 24 of them arc usable in the workstation configu­ has been opened. \Vhen the DIMM pair controller ration. Prior to each memory access, the row address starts the data transaction (i.e., when it asserts CAS), portion of the current memory address is compared it reports CAS_ASSERTED to the main sequencer. agai nst the selected row address register ro determine

The memory controller provides almost unlimited whether the open row can be used or whether a new fl exibility to r the use of difkrent DIMM types among row must be opened. (Ofcourse, if the ROW_ACTIVE the three DIMM pairs (eight DIMM pairs in the server bit is not set, a new row must always be opened.) configuration). Table 6 gives the fields in the DIMM As with any cache, the hit rate is an important fa ctor pair control registers. The DIMM pair size, as well as in determining the effectiveness of the hot row cache. some SDRAM parameters, can be set individually per If a memory reference hits on an open row, latency is DIMM pair through these register fields. red uced due to the fa ster access time of the open row.

However, if the memory reference misses on an open

Ta ble 6 row, the row must first be closed (i.e., a DRAM DIMM Pair Control Register Fields precharge cycle must be done) before another row can be accessed. In the past, some memory controllers Field Use

ENABLE Indicates that this bank is insta lled and usable Ta ble 7 BASE_ADDRESS[33:24] Defines starting address State Bits for Each of 64 SDRAM Banks

SIZE[3:0] Defines total size of DIMM State Use pair, 16 Mbytes -512 Mbytes TWO_G ROUPS Ind icates that DIMM pair ACTIVE_ROW[25:12] The row currently in use, has two groups of chips if any 64 MB IT Selects 64-Mbit column ROW_AGIVE Indicates whether row mappings for this DIMM is open pair ROW_HIT_HIS TORY[3:0] Hit/miss state from last 4BANK Indicates that the SDRAM four accesses chips each contain four TIMEOUT_CTR[3:0] Tracks busy time after banks command is issued to bank

66 Digital Tec hnical journal Vo l. 9 No. 2 1997 were designed to keep the most recently used row VORTEX open in one or more memory banks. This scheme pro­ PERL vides some performance benefiton some programs. IJPEG On most programs, however, the scheme hurts perfor­ Ll mance, because the hit rate is so low that the access COMPRESS time reduction on hits is overshadowed by the time GCC needed to close the open rows on misses. M88KSIM The memory controller uses predictors to guess GO whether the next access to a row will be a hit or a miss. CINT95 BASE If a hit is predicted on the next access, the row is left open at tbe end of the current access. If a miss is pre­ WAVES dicted, the row is closed (i.e., the memory bank is FPPP precharged), thus saving tl1e subsequent time penalty A PSI TURB3D for closing the row if the next access is, in fa ct, a miss. APPLU A separate predictor is used fo r each potential open MGRID row. There are 64 predictors in all, although only 24 HYDR02D can be used in the workstation configuration. Each SU2COR predictor records the hit/miss result for the previous SWIM four accesses to the associated bank. If an access to the TOM CATV bank goes to the same row as the previous access to the CFP95 BASE same bank, it is recorded as a hit (whether or not the 0.99 1.04 1.09 row was kept open); otherwise, it is recorded as a miss. SPEC95 RELATIVE PERFORMANCE The predictor then uses the fo ur-bit history to predict (ADAPTIVE HOT ROWS/NO OPEN ROWS) whether the next access will be a bit or a miss. If the previous fo ur accesses are all llits, it predicts a hit. Ifthe Figure 12 previous fo ur accesses are all misses, it predicts a miss. Performance Benefit of Adaptive Hot Rows Since the optimum policy is not obvious tor the other 14 cases, a software-controlled 16-bit precharge policy register is provided to define the policy fo r each of the enforces a substantial mimmum memory latency fo r 16 possible cases. Softwarecan set this register to spec­ memory reads to allow fo r the worst-case completion if)' the desired policy or can disable the hot row time fo r an unrelated 21164 cache access. In most c:1 ses, scheme altogether by setting the register to zeros. this minimum latency erases the latencyimpr ovement

The precharge policy register is set to 1110 1000 obtained by hitting on an open row. 1000 0000 upon initialization, which keeps the row open whenever three of the preceding four accesses Victim Buffer are row hits. To date, we have tested only with this set­ The 21174 chip has a two-entry victim bufferto cap­ ting and witl1 the all-zeros and all-ones settings, and ture victims !Tom the 21164 CPU and hold them fo r we have not yet determined whether the default set­ subsequent write back to memory. A two-entrybu ffe r ting is optimal. The adaptive hot row feature provides is provided so mat write back can be deferred in fa vor a 23-percent improvement in measured (i.e., best­ of processing fillrequests. Normally, fill requests have case) memory latency and a 7 -percent improvement in priority overvicti m write backs. When the second vic­ measured bandwidth against me STREA.!'vl. bench­ tim butler entry is allocated, the victim write-back pri­ mark, as reflected in Figure 1. Testing of the adaptive ority is elevated to ensure mat at least one of the victims hot row fe ature shows an average performance is immediately retired, tl1us avoiding a potential buffe r improvement of2.0 percent on SPEC95tpbase and an overrun. The victim bufters have address comparators average of0.9 percent on SPEC95int base. to check me victim address against all fills and DMA The level of improvement varies considerably over reads and writes. On a victim buffe r address hit, read the individual benchmarks in the SPEC suite, as shown data or merge data is supplied directly from the victim in Figure 12, with improvements ranging from -0.6 buffe r: the data is passed over the external data bus to percent to +5.5 percent. avoid me use of an additional data path within t11C chip. For completeness, we also tested with the all-ones case, i.e., with the current row left open at the end Arbiter of each access. This scheme resulted in a 5-percent The memory sequencer must process requests !Tomsev­ performance loss on SPEC95tpbase and a 2-percent eral diffe rent sources. The arbiter selects which request pertormance loss on SPEC95int base. to process next. The arbiter must serve two conflicting The row hit predictor is not as usefi.d for configura­ goals: ( 1) priority service for latency-sensitive requests, tions with an L3 cache, because the 21164 interface and (2) avoidance oflockouts and deadlocks. Table 8

Digit�! Technical Journal Vol. 9 No. 2 1997 67 Ta ble 8 Arbitration Prioriti es

Arbitration If Request Register is Latched If Request Register is NOT Latched Priority

Highest Set SO RAM Mode CPU request Refresh (if overdue) Set SO RAM mode Victim write back (if both buffers are full) Refresh (if overdue) DMA read Victim write back (if both buffers are full) DMA address translation buffer fill DMA read DMA write (if both buffers are full) DMA address translation buffer fill PCI read completion DMA write (if both buffers are full) CPU request PCI read completion Victim write back Victim write back DMA write DMA write Lowest Refresh Refresh

gives the arbitration priorities. Normally, CPU cache by the 21164. In general, the memory controller starts misses are treated as the highest priority requests. the row access portion of the memory access at the However, the arbiter avoids excessive latency on other same time that it requests the cache lookup fr om the requests by processing requests in batches. When a CPU, in anticipation that the lookup will miss in the request is first accepted, the existing set of requests ( usu­ cache. For DMA reads, the 21164 microprocessor ally just one) is captured in a latch. All requests within may return read data, which is then used to saris()' the that group are serviced before any new request is con­ read request. For DMAwri tes, several flows arc possi­ sidered. During the servicing of the latched request(s), ble, as shown in Figure 13. several new service requests may arrive. vVhen the last If the DlvlA write covers an entire cache line, the latched request is serviced, the latch is opened and a memory controller requests that the CPU invalidate new batch of requests is captured . For a heavy request the associated entry, and the DMA data is then written load, tl1is process approximates a round-robin priority directly to memory. If the write covers a partial cache scheme but is much simpler to implement. line, a cache lookup is pertormed; on a hit, the cache Normally, Dlv!Awrite requests, L3 cache victims, and data is merged into the write bufter before the data is refresh requests are low-priority requests. However, a wtitten to memory. If the cache misses on a partial line second request of any of th ese types can become pend­ write, two outcomes are possible: If each 128-bit ing before the first one has been serviced. In this case, memory cycle is either completely written or com­ the priority ror tl1e request is increased to avoid unnec­ pletely unchanged, the data is written to memory essary blockage or overruns. directly, and the write to the unchanged portions is suppressed using the data mask ( DQM) signal on the Refresh Controller memory DIMMs. Ifthere are partially written 128-bit A simple, h·ee-running refresh timer provides a refresh portions, the fu ll cache line is read from memory and request at a programmable interval, typically every 15 the appropriate portions are merged with the DMA microseconds. The refresh request is treated as a low­ write data. priority request unti l half the refresh interval has expired. At that time, the refresh controller asserts a Ca che Flushing high-priority refresh request that is serviced before all The memory contro.l ler provides two novel fe atures other memory requests to ensure that refreshes are not fo r assistance in flushing the write-back cache. These blocked indefinitely by other traffic.When the arbiter fe atures are intended to be used in power-managed grants the refresh request, the main sequencer per­ configuratjons where it may be desirable to shut down fo rms a refresh operation on all banks simultaneously. the processor and cache fr equently, without losing memory contents and cache coherence. If the cache Cache Coherence contains dirty lines at the time of shutdown, these lines The Alpha I/0 architecture requires that all DMA must be f(m:ed back to memory before the cache can operations be fu lly cache coherent. Consequently, be disabled. The 21164 microprocessor provides no every DMA access requires a cache lookup in the direct assistance fo r fl ushing its internal cache or the 21164 microprocessor's level 2 ( L2 ) write-back cache external L3 cache, so it is generally necessary to read a and/or in the external L3 cache, which is controlled large block of memory to flush the cache. The mem-

68 Digital Te chnical journal Vo l. 9 No. 2 1997 but has frequent, short bursts of activity to service key­ board input, clock interrupts, modem receive data, or similar workloads.

Conclusion COPY PCI TO 21174 BUFFER The 21174 chip provides a cost-effe ctive core logic

PARTIAL subsystem fo r a high-performance workstation. The CACHE use ofSDRAMs, high-pin-count BGA packaging, and LINE FULL CACHE LINE hot rows make possible a new level of memory pertor­ mance, without the need tor wide memory buses or a t SENDt SEND FLUSH TO CPU INVALIDATE complex memory data path. TO CPU DIRTY NOT DIRTY Acknowledgments PA RTIAL OCTAWORDS • I would like to express my appreoatton to Jim

MERGE CACHE MERGE FROM Delmonico, Jeff Forsythe, Frank Fuentes, Paul Hill, DATA INTO 21174 MEMORY TO 21174 Kenny House, Mark Kelley, Yo ng Oh, Rob Orlov, WRITE BUFFER WRITE BUFFER Rajen Ra mchandani, Don Rice,Dean Sovie, and Ken NO PA RTIALS Weiss, who gave a year of their lives to implement and simulate the 21174 chip and who made it work on the t first pass! I would also like to thank Lance Berc, who t COPY 21174 • provided most of tl1e latency data, and Greg Devereau, BUFFER TO Arlens Barosy, and John Henning, who provided other MEMORY performance measurements.

References

DONE l. S. Peri and R. Sites, "Studies of Windows NT Perfor­ mance using Dynamic Execution Traces," Proceedings

of the Second USENIX Symposium em Ope ratinp, Sy s­ Figure 13 tems Desip,n and Implementation ( 1996 ) . DMA Write Flows 2. J. Edmonson et a!., "Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS !USC ory controller provides a large block of ROM (which Microprocessor," Digital Te chnical fourna!, vol. 7 no. 1 always reads as zeros) fo r this purpose. Of course, no (1995): 119-1 35. memory array is needed to support this fe ature, so the 3. Digital Se miconductor 21164 (366-.V!Hz Th rough implementation complexity is negligible. 433 1V!Hz) Alpha ;V/icroprocessor (Maynard, Mass.: In addition to the E1ke memory, the memory con­ Digital Equipment Corporation,Order No. EC-QP99A-TE, troller provides a cache valid map that can make cache 1996). flushing fa ster when the cache is lightly used. The 4. J. McCalpin, "STREAM : Measuring Sustainable Mem­ cache valid map is a 32-bit register, in which each bit ory Bandwidth in High Performance Computers," indicates whether a particular section of the cache has http:/j www. cs.virginia.edu/stream/ (Charlottesville, been (partially) ti lled. When a cache fill request is Virg.: University ofVirginia, 1997). processed, the memory controller sets the correspond­ 5. DIGITAL ing bit in the cache map register. The bits of the regis­ L. Berc, "Memory Latency," internal Web pages, September 1996. ter can be individually reset under program control. To use this fe ature, the cache must first becompl etely 6. J. Zurawski, J. Murray, and P. Lemmon, "The Design flushed and the register reset to zeros. Thereafter, the and Verification of the AlphaStation 600 5-serics Wo rk­ cache map register reflects which sections of the cache station," Digital Technical Jo urnal. vol . 7 no. l (1995) 89-99. can potentially contain dirty data. During a subse­ quent cache flush, the flush algorithm can skip over 7. D. Conroy, "Cacheless Ev5," DIGITAL internal document. those sections of the cache whose corresponding bits 8. KJYI44S4020B/Kj\IJS102JH CM OS S!JRAM Data Sheet in the cache map register have not been set. This fe a­ (Seoul, Ko rea: Samsung Electronics Company, Ltd., ture is useful primarily fo r cache shutdowns that occur 1996). bet\veenkeystrokes, i.e., when the system is nearly idle

Digital Tec hnic1l journal Vol. 9 No. 2 1997 69 Biography

Re inhard C. Schumaru1 Until recently, Reinhard Schumann was the architect fo r value-line Alpha workstations. In that capacity, he pro,�ded technical leadership fo r the design of the 21174-based DIGITAL Personal Workstation a-series and au-series and led the design effort fo r the 21174 ASIC. During his 18-year tenure at DIGITAL, he contributed to the design of many other products and was awarded fo ur patents. He holds B.S. and M.S. degrees in computer science fr om Cornell University and has published two papers on inte­ grated circuit design. He is now employed at Avici Systems.

70 Digital Te chnical Journal Vo l. 9 No. 2 1997 I

Further Readings

The Digital Technicaljo urnal is a refereed , quarterly Database Integration/ Alpha Servers & Wo rkstations/ publication of papers d1at explore d1e fo undations Alpha 21164 CPU of DIGITAL 's products and technologies. jo urnal Vo l. 7, No. 1, 1995, EY-Tl35E-TJ (Available only on the Internet) content is selected by the Journal Advisory Board, and papers are written by DIGITAL engineers and RAID Array Controllers/Workflow ModelsjPC LAN engineering partners. Engineers who would like to and System Management Tools contribute a paper to the Jo urnal should contact the Vo l. 6, No. 4, FaLl 1994, EY-Tl 1 8E-TJ managing editor, Jane Blake, at [email protected]. AlphaServer Multiprocessing Systems/DEC OSF /1 Symmetric Multiprocessing/Scientific Computing To pics covered in previous issues of me Digital Optim.ization for Alpha Te ch nical.fournal are as fo llows: Vo l. 6, No. 3, Summer 1994, EY-S799E-TJ

FX!32 Emulation and Translation/Visual Fortran/ AlphaAXP Partners-Gray, Raytheon, Kubota/ MEMORY CHANNEL 2 Interconnect/ObjectBroker DECchip 21071/21072 PCI Chip SetsjDLT2000 Security/StrongARM Microprocessor Tape Drive Vo l. 9, No. I, 1997, EC- N7963- 18 Vol . 6, No. 2, Spring 1994, EY-F947E-TJ

AlphaServer 4100 System/Oracle and Sybase High-performance Networking/Open VMS AXP Database Products fo r VLM/Instruction Execution System Software/Alpha AXP PC Hardware on Alpha Processors Vo l. 6, No. 1, Winter 1994, EY-Q01 l E-TJ Vo l. 8, No. 4, 1997, EC-N7629-l8 Software Process and Quality Internet Protocol V.6/Preservation of Historical Vo l. 5, No. 4, Fall 1993, EY-P920E-DP Computer Systems/Fortran fo r Parallel Computing/ Server Performance Evaluation and Optimization/ Product Internationalization Internet Collaboration Software Vol . 5, No. 3, Summer 1993, EY- P9 86E- DP

Vo l. 8, No. 3, 1996, EC-N7285-18 Multimedia/ Application Control

Spiralog Log-structured File System/Open VMS Vo l. 5, No. 2, Spring 1993, EY-P963E-DP

fo r 64-bit Addressable Virtual Memory/High­ DECnet Open Networking performance Message Passing fo r Clusters/Speech Vo l. 5, No. I, Winter 1993, EY-M770E-DP Recogn ition Software Vo l. 8, No. 2, 1996, EY-N6992-l8 Alpha AXP Architecture and Systems Vo l. 4, No. 4, Special Issue 1992, EY-J886E-DP DIGITAL UNIXClus ters/Object Modification Tools/eXcursion fo r Windows Operating Systems/ NVAX-microprocessor VAX Systems Network Directory Services Vo l. 4, No. 3, Summer 1992, EY-J884E- DP Vo l. 8, No. l, 1996, EY-U025E-TJ Sem.iconductor Technologies Audio and Video Technologies/UNIX Available Vo l. 4, No. 2, Spring 1992, EY-L521 E-DP Servers/Real-time Debugging Tools PATHWORKS: PC Integration Softwa1·e Vo l. 7, No. 4, 1995, EY- U002E-TJ Vo l. 4, No. I, Winter 1992, EY-J825E-DP High Performance Fortran in Parallel Environments/ Image Processing, Video Term.inals, and Printer Sequoia 2000 Research Technologies Vo l. 7, No. 3, 1995, EY-T838E-TJ Vo l. 3, No. 4, Fall l991, EY-H889E-DP (Available only on the Internet) Ava ilability in VA Xcluster Systems/Network Graphical Software Development/Systems Performance and Adapters Engineering Vo L 3, No. 3, Summer 1991, EY- H890E-D P Vo l. 7, No. 2, 1995, EY-UOOl E-TJ

Digital Tc chnie<1ljournal Vol . 9 No. 2 1997 71 Fiber Disn·ibuted Data Interface R. Dixson, N. Su llivan, J. Schneir, T McWaid, V. Tsc1i, Vol. 3, No. 2, Spring 1991, EY-H876E-DP J. Prochazka, and M. Yo ung, "Measurement ofa CD and Sidewall Angle Artifact with Two Dimensional CD A.FM Transaction Processing, Databases, and Fault-tolerant Metrology," Proceedings of the !ntemotiunal Saciet_v Systems of Ph oto-Op tiwl Instrumentation Engineers (SPIEJ Vo l. 3, No. l,Winter 1991, EY-FSSSE-DP iv/e tmlogy. Inspectio n. and Process Cuntrol)or VAX 9000 Series J\lficrolithography X (March 1996 ). Vol. 2, No. 4, Fall. l990, EY-E762E-DP A Elsherben i, 0. Rarnahi,and C. Smirh, "FDTD Analysis DECwindows Program of New Types of Micros trip Tra nsmission Lines for Hip;h Vo l. 2, No. 3, Summer 1990, EY-E756E-Dl' Frequency Applications," Prup, ress in Electruma::;uetics Res('etrch Sy mposium (Pltl

C. Gianos and D. Hobson, "Design Considerations fo r N. Anderson and D. Manley, "A Matrix Extension of Digital's PowerStorm Graphics Processor," IBMJournal Winograd's Inner Product Algorithm," Th eoretical of Research and !Jcllelopment (July 1996 ). Computer Science 131 (1994). B. Gonzalez, S. Knecht, H. Handy, and J. lb mirez, "The A. Barbieri, "Benefits Applications and Data Analysis Effect of Ultrasonic Frequency on Fine Pitch Aluminum Techniques fo r Linewidth Multilevel Experimental Wedge Wirebond," Proceedin[!,s of the 46th Flectronic Design," Proceedings of the lnternutiona! Sociel )' of Compo111.:1l!s and Te chnology Conference (May 1996 ). Ph oto-Optical lnstmmentation Engineers (SPit). iv!etro lor,;_)', Inspection, and Process Co ntroljiJY R. Hellweg, Jr., H. Pei, and L Wittig, "Precision ofa New Microlithography X (March 1996). Method to Measure Fan Structure-Bourne Vibration," Proceedings of !uternoise 96 (July 30-August 2, 1996 ). G. Bouclurd and P. Bannon, "Design Objective ofthe 0.35-micron Alpha 21 164 Microprocessor," 1996 HOT K Hornbach, "Competing by Business Design-The Chips V/1/Sym posium (August 1996 ). Re shaping of the Computer Industry," Long Ra nge Planning, vo l. 29, no. 5 (1996). C. Brench, "Experiences in EMI Model ing to r Product Design," JEtE /nternotirmal Sy mposium on ci vi C S. Knecht, B. Gonzalez, and K Sieber, "fusing Current (August 1996). of Short Aluminum Bond Wire," Fifth !ntc!Socicty Couference on Ihermal Ph er wmenu in !:lec trunic S)'.l!ems A. Charny, "Time Scale Analysis and Scalability Issues Fo r 0-Tf-IFRM V) (May 29-June 1, 1996 ). Explicit Rate Allocation in ATM Networks," JEJ;F/AOol/ Transactions on Netu·r;r/d ng (August 1996 ). J. Kowaleski, G. Wo lrich, T Fischer, R. Dupca.k, P Kro.:sen, T Plum, and A. Oelsin, "A Du:d-Execution Pipelined K. Coar, "Converting an AdvFS Filesystem to UFS," Floating-Point CMOS Processor," 1996 /h'f:'E Jnternutiunal DWtal Svs tems Repott (January/February 1996). Solid-Stote Circuits Co nference. D(gest of Tr?Chn ica/ Papers R. Collica, J. Ra mirez, and W. Taam, "Process Monitoring (February 1996). in Integrated Circuit Fabrication Using Both Yield and Spatial Statistics," Quali()' and !?e/iahility Engineering K. Mistry, "New Hot Carrier Failure Criterion to r j. r-Channel ln!emational, vol. 12 ( 1996 ) . Transistors Based on Transistor Leakage Currents," Solid­ Stole Flee/ron ics, vol . 39, no 11 ( 1996). S. DiPirro, "64-Bit Addressing in Open VMS 7.0," [)!GJTAL Age (December 1995). K Orvek, S. Dass, L. Gruber, S. Dumford , M. Piasecki, G. Poi iJ.rd , and I. 1-'ink,"Mi grating Deep- UV Lithography S. DiPirro, "Managing 64-Bit OpenVMS Systems," to the 0.25-f.Lm Regime: Issues and Outlook," Procecdiugs D1gital 5)1slems}uu rna/ (September/October 1995 ) . of the !nternationat Society of Photo-Optical !nstn.lln('n/a­ r ra S. DiPirro, "The Open VMS Alp ha Syste m-Code fion �:·n,r; ineers (SPill)- Op tical Mic olithog phy IX Debugger," Dig ital Svsrems Report (January/February (Mat·ch 1996 ). 1996). M. Shaw, R. DeLine, D. Klein, T Ross, D. Yo ung, and

R. Dixson, J. Schneir, T. McWaid, N. Sullivan, V. Tsai, G. Zelesnik, "Abstractions fo r Software Architecture cm d S. Zaidi, and S. Brueck, "Toward Accurate Linewidth Tools to Support Them,'' /Et't' Transactions on Sojizmre Metrology L:sing Atomic Force Microscopy <1 nd Tip Eu,r;iuecn ng (April 1995 ). ( :baracterizarion," Proceeding,· of the lnternotional R. Wo ods, "How Emulators Do Their vVork," UNIX Society of Ph oto-Optical lnstmmentation Engineers RC'iliC'IIJ( October 199 5). (SP/E): .\1etrology, Juspection, and Pmcess Control fo r t'vficrolithogmphy X (March 1996 ).

72 Di�itcl l Tc dmical Joun1<1l Vo l. 9 No. 2 1997