<<

Infrastructure and Tools for the TRIPS Prototype

Bill Yoder Jim Burrill Robert McDonald Kevin Bush KatherineCoons MarkGebhart SibiGovindan BertrandMaher Ramadas Nagarajan Behnam Robatmili Karthikeyan Sankaralingam SadiaSharif AaronSmith DougBurger StephenW.Keckler KathrynS.McKinley

Computer Architecture and Technology Laboratory Department of Sciences The University of Texas at Austin [email protected] - www.cs.utexas.edu/users/cart/trips

ABSTRACT between the motherboard and chips. A key theme of software development has been to adapt The TRIPS hardware prototype is the first instantiation of an Explicit off-the-shelf solutions when possible and to create homegrown Graph (EDGE) architecture. Building the , solutions when necessary. toolset, and system software for the prototype required supporting the The remainder of this paper is organized as follows. Sec- system’s unique dataflow construction, its banked register and memory tion 2 discusses the TRIPS language toolchain and the ap- configurations, and its novel Instruction Set Architecture. In particu- proach we used to implement a full-featured software devel- lar, the TRIPS ISA includes (i) a block atomic execution model, (ii) ex- opment kit. Section 3 discusses several of the TRIPS software plicit mappings of instructions to execution units, and (iii) predicated simulators, enabling us to prove out a number of design ideas instructions which may or may not fire, depending on the outcome during every stage of development. Section 4 discusses the of preceding instructions. Our primary goal has been to construct TRIPS operating environment, with its use of a host-based re- tools to consume standard and and generate source manager to download, execute, and control TRIPS bi- binaries both for the TRIPS software simulators and hardware proto- naries on the target processors. Section 5 covers the TRIPS type. A secondary goal has been to build the software infrastructure on build, integration, and testing processes. Section 6 describes standard platforms using standard tools. These goals have been met the availability of the tools and simulators for the TRIPS pro- through a combination of off-the-shelf and custom tools. We present totype. a number of design issues and their resolution in enabling end users to exercise the prototype ISA using familiar tools and programming interfaces. Finally, we offer download instructions for those who wish 2 LANGUAGE TOOLCHAIN to test-drive the TRIPS tools. The TRIPS toolchain is responsible for consuming source files written in ANSI C, a subset of GNU C, and Fortran 77; apply- 1 INTRODUCTION ing both classic and novel optimizations to the control flow structures; and generating binary objects that can be linked The development of the TRIPS processor has required build- with standard math and C runtime libraries. Support for ing a large software infrastructure and set of tools (or ttools), C++ and Fortran 90 applications has thus far not been judged including functional and timing simulators; an assembler, critical. Toolchain construction has been guided by a well- , and binary utilities; a high-level language compiler and designed set of components or , each with clearly spec- instruction scheduler; a set of optimized and unoptimized run- ified syntactical and calling conventions. Figure 1 shows the time libraries; a resource manager to coordinate and control various components of the toolchain. system resources; and a variety of build and test utilities. To manage the complexity of compiler development, we 2.1 Architectural Constraints implemented major components using well-defined interfaces, a variety of Open Source products, and modular construction. The TRIPS architecture has several features that require un- To create a in a timely fashion with modest usual support in the software toolchain. The ISA defines a resources, we limited the feature set of the operating environ- block atomic execution model, in which program functions ment, put the bulk of command and control on a stock desktop are subdivided into variable-size blocks of up to 128 instruc- Host PC running /, used off-the-shelf software and tions [1]. When fetched, each block is mapped onto a row development tools for the motherboard, and defined simple but of architectural registers and a grid of execution units. Each powerful protocols between the Host PC and motherboard and TRIPS block comprises a header chunk and one to four in- All stack items are aligned on high virtual memory 8-byte boundaries. environment = 0xFFFFFFFF and argv strings

argument save area

return address =0 sp of _start() back chain pointer=0 register save area

local variable area Stack grows down.

argument save area

return address Fixed size sp of main() back chain pointer link area of main()

register save area

local variable area

argument save area

return address link area of func() sp of func() back chain pointer

Figure 1: Toolchain links. register save area

local variable area

return address link area of leaf() struction chunks. Register read and write instructions are em- sp of leaf() back chain pointer bedded in the header chunk, in addition to fields for magic unused stack area number, block type information, and processor control flags. Application bss The ISA places strict limits on the number of instructions in a and data area block, on the number of register reads and writes, and on the low memory combined number of load/store instructions. places data beginning at 0x80000000. Optimally mapping instructions on the grid is critical, due Figure 2: Application memory organization. to the TRIPS dataflow execution model, in which each execu- tion node fires when its arrive. Cycle counts depend not only on cycles but on arrival times, them- selves determined by feeds from banked architectural regis- ters, a routing operand network (OPN), and a banked L1 data ¯ A common set of application programming interfaces cache. that can be used by other tools, such as the simulators Despite such contraints, numerous features have made the and . prototype easier to program, including regular and clearly de- We chose to support a simple version of the Link- fined instruction formats; a global 40- address space, which ing Format (ELF) to output a small number of string tables and enables software to access any chip register or memory lo- text, data, and bss program sections. Figure 2 shows how the cation through its unique address; and a uniform processor linker lays out application data in virtual memory. The figure exception model, in which all processor exceptions occur at also reflects the simple but effective of the block boundaries and present a consistent interface to the soft- TRIPS Application Binary Interface (ABI) [9]. One simpli- ware. fication that made binutils development easier was dropping Throughout toolchain development, we relied on a ro- support for shared libraries–all TRIPS are stati- bust functional simulator, an accurate timing simulator, and cally linked, which dramatically simplifies installing, version- a sophisticated set of performance tools to analyze execution ing, and toolchain components and applications. traces, to visualize instruction placement, and to pinpoint crit- The utility front ends are largely platform independent. For ical paths and execution bottlenecks. example, once we specified the TRIPS data types, such as 64- bit pointers, longs, and doubles and 32-bit integers and floats, 2.2 Binary Utilities the GNU utilities transparently provided support for allocating and accessing TRIPS scalar variables. The assembler, linker, and other utilities, such as nm, obj- However, to support the prototype ISA and high-level com- dump, and ar, are ports of the GNU binary utilities (binutils), piler. the TRIPS-specific back ends required a significant port- available from the Foundation [4]. ing effort. Descriptions of major customizations follow. A number of benefits have accrued from the decision to use the binutils package: 2.2.1 Block-oriented rather than line-oriented

¯ A leveraged and rich set of functionality based on a ma- ture codebase. The GNU front end assumes that each line of a source file

¯ A large installed user base and an active, experienced translates into one instruction word. Once the TRIPS Assem- development community. bly Language (TASL) had been defined, encoding the individ- ual 32-bit instructions proved straightforward, simply requir- To save instruction count (4 vs. 2 constant instructions), ing the tas assembler to parse and translate the source file a the compiler emits CALLO and BRO instructions for all jump

line at a time [14]. targets. Because the offset field specifies 128-byte chunks and ¾7

However, to support the TRIPS block-atomic model, TASL supports a 20-bit offset, any target within a byte-spread is syntax groups instructions and register reads and writes into reachable. blocks, demarcated with “block begin” (.bbegin) and “block However, for very large applications the 20-bit offset field end” (.bend) directives for that block. The tas assembler can be insufficient. In such cases, the linker must detect the therefore tracks the individual statements after the opening shortfall, “relax” the current code section, and insert a tram- .bbegin and enters them into a memory structure rep- poline above the current block which fully specifies the abso- resenting the execution grid and architectural registers. When lute target address with a fabricated BR instruction. The orig- encountering the matching .bend directive, the assembler tra- inal CALLO or BRO instruction in the block is retargetted to verses the , marks instructions that require relo- the adjacent trampoline block, which bounces execution to the cation, uses the GNU Description (BFD) routines far-flung address. Code trampolines introduce an extra level of to translate x86 Little-Endian formats to TRIPS Big-Endian, indirection. Fortunately, the need for them rarely occurs. and commits the whole block to disk. Throughout binutils development, a balance needed to be struck between treating instructions as individual 32-bit words 2.2.4 Disassembling TRIPS binaries and treating “instructions” as variable-size code blocks. In The binary utilities and debugger provide several disassembly addition, the ISA definition of 32-bit instructions and 64-bit routines. Again, the GNU model of one instruction per code pointers and data types required syntactical support to con- word had to be extended to support consuming a header chunk struct 64-bit data from pieces of the 32-bit instructions. Al- of 128 bytes at once, parsing and rearranging the header into though the TRIPS prototype restricts global physical addresses register reads and writes, reading the following one to four to 40 and although the ttools compiler limits programs to instruction chunks, and finally disassembling the individual 32-bit virtual addresses, nothing in the ISA or assembly lan- code words, while maintaining integrity of the complete code guage definition precludes full 64-bit addressing. block.

2.2.2 Block integrity checking 2.3 Compiler and Instruction Scheduler It is easy to write incorrect TASL code. The tas assembler pro- The TRIPS compiler is Scale [8], a -based set of 700+ vides simple error checking during the first pass while parsing classes offering classic optimizations as well as TRIPS- the individual instructions, to ensure that instruction formats specific optimizations. Scale has proven an ideal research ve- are correctly specified, that immediate values lie within range, hicle, supporting several code generators (Alpha, Sparc, Pow- that proper target operands are in place, and that no block con- erPC, TRIPS), configurable optimizations, and compilation tains more than 128 instructions. phases that can be arranged in arbitrary order. In addition, due to the target formats of the ISA, a second One early design decision was to partition TRIPS-specific round of checking is desireable, to ensure that each block is compilation into two major components: well-formed. Hence, after each block of instructions is as- sembled, a two-pass checker is invoked to ensure that source ¯ The Scale compiler, which produces an intermediate instructions properly target destination instructions, that predi- form of assembly code. cate outputs target instructions which in fact expect predicates,

that instructions expecting one or two operands receive them, ¯ An instruction scheduler (tsch), which translates the in- and so on. termediate output from Scale into low-level dataflow in- An automated TASL generator tg, which produces thou- structions and maps each explicitly onto the execution sands of lines of randomized but legal TASL code, proved in- grid. valuable in shaking out assembler/linker bugs before compiler readiness. The same tool was used during hardware bringup to As described below, this organization has enabled the gener- test processor functionality. ation of a RISC-like, high-level from the Scale compiler and the generation of a dataflow syntax from the instruction scheduler, with little loss in tool efficiency. 2.2.3 Limited reach of call-by-offset instructions The prototype ISA defines a number of constant instructions 2.3.1 Hyperblock generator which embed immediate values in 16 bits of the 32-bit word. Using a combination of generate and append constant instruc- The TRIPS prototype accepts up to 128 instructions per block tions (GEN and APP), the compiler can construct efficient in- and executes them in dataflow order and in parallel. Because struction sequences to generate addresses and literals. blocks of control-dependent applications typically con- Additionally, the ISA provides both absolute and relative sist of 5-6 instructions between branches, and because the pro- branching instructions. The CALL and BR instructions sup- totype ISA disallows branches to targets within a block, one port 64-bit absolute addressing, just as the CALLO and BRO of our key challenges has been to aggregate basic blocks into instructions contain offset fields to specify target addresses. much larger predicated hyperblocks [11]. The original goal of the hyperblock generator was simple: 2.3.3 High-level vs. low-level assembly lan- join basic blocks into large hyperblocks, the bigger the bet- guages ter [7]. The early design of the hyperblock generator placed hyperblocking in one of the early phases of compilation, con- are familiar with RISC-like, algebraic- trolling the degree of while and for , if conver- formulated languages which execute in program order. We sions, and function inlining. The compiler broke up and then therefore defined the TRIPS Intermediate Language (TIL) [10] re-constituted any of the resulting blocks that exceeded any of as a user-friendly target language, employing a simple and the prototype constraints, such as those containing more than consistent syntax: 32 loads and stores. opcode result, operand1 [, operand2] What we discovered: Example: The following code for block main$4 shows how architectural registers G3 and G12 are read into temporary ¯ The prototype instruction contraints are hard limits. registers T0 and T1 of the execution grid, how some amount of Even if one block in a million exceeds a hardware con- decision-making occurs to determine the outcome of a branch, straint, the resulting binary cannot execute properly. and how grid outputs are written back to registers G12 and G13 before the block terminates. ¯ It is difficult for early analysis to form accurate instruc- tion estimates. In some cases, instruction counts will .bbegin main$4 be overly pessimistic due to later optimizations that re- read $t0, $g3 move instructions. In other cases, instruction counts will read $t1, $g12 be overly optimistic, ignoring potential store instructions mov $t2, $t0 for spilling by the register allocator, which is invoked far addi $t3, $t1, 1 downstream from the hyperblocker. extsw $t4, $t3

¯ Block-splitting is difficult, in itself requiring not only re- tlti $t5, $t4, 10 verse if-conversion but analysis to build new hyperblocks tnei $t6, $t5, 0 from resulting fragments. Such regenerated code is typ- bro_t<$t6> main$3 ically inefficient and requires redundant hyperblocking bro_f<$t6> main$5 . write $g12, $t4 write $g13, $t2 Consequently, a significant structural revision was forced, in .bend which the hyperblock generator migrated to the compiler back end, to good effect. Unrolling while loops has subsequently The instruction scheduler, charged with mapping instruction migrated; for loop migration is soon to follow. Other compiler blocks onto the execution grid, consumes TIL files and pro- optimizations are anticipated: duces target-form output. Example: In this translated form of main$4, the TASL ¯ Hyperblocking based on execution profiles, to pull “hot”

basic blocks into the same hyperblock. specifies exactly how inputs are fed into the grid, which exe-

℄ ℄ cution nodes ((N0 -N 127 ) are brought into play, and where

¯ Reducing the number of block exits, to train the proto- the execution nodes direct their results. type’s branch predictor in execution patterns. .bbegin main$4

¯ Instruction merging, to minimize the number of redun- ;;;;;;;;;;; Begin read preamble dant instructions. [3] read G[3] N[3,0] R[0] read G[12] N[0,0] 2.3.2 TRIPS compiler driver ;;;;;;;;;;; End read preamble N[3] <0> mov W[1] The dynamic nature of the toolchain has required frequent and N[0] <1> addi 1 N[4,0] sometimes intricate modifications of command line options to N[4] <2> extsw N[8,0] W[0] various components. For example, a variety of options for hy- N[8] <3> tlti 10 N[12,0] perblocking, both in the compiler front end and back end, co- N[12] <4> tnei 0 N[16,0] existed for a number of months. To hide these changes from N[16] <5> mov N[20,p] N[24,p] the end user, while providing full access to individual options, N[20] <6> bro_t [0] main$3 the tcc compiler driver was developed. N[24] <7> bro_f B[1] main$5 Originally implemented as a simple script, tcc has ;;;;;;;;;;; Begin write epilogue grown to 2,500+ lines of to support dozens of command- W[0] write G[12] line options, both TRIPS-specific and gcc-compatible, such W[1] write G[13] as canned optimizations (-O3 and -O4), debugger options (- ;;;;;;;;;;; End write epilogue

g), and component options, including - to the , .bend

> < >

-Wc, -Wl,

¯ Increasing the amount of inlining used by the compiler and grouping like functions into multi-compilation units, elmination, copy progation and useless copy removal, loop in- so that the compiler can inline across modules. variant code motion, scalar replacement, tree height reduction, and other optimizations, but without hyperblocking (-O3). For fully optimized applications, a set of runtime libraries com-

¯ Defining routines in the C runtime startup module (crt.o) piled with hyperblocking is available and presented to the to interface with the program loader and set up the stack, linker when users compile their applications with -O4. Also, provide software floating point division, determine the for users of the debugger, the -g flag causes the linker to base physical address of the chip configuration space, link debuggable libraries compiled neither with the above op- implement setjmp(), and so on. timizations nor function inlining so that debugging runtime functions closely resembles those of conventional architec-

¯ Increasing the amount of buffering used by printf() and tures. the memory manager in calls to malloc(), to minimize runtime overhead. 3 SOFTWARE SIMULATORS

2.4.2 Math libraries We have developed a variety of software simulators for TRIPS, from a simple ISA simulator to a detailed Verilog chip - We chose the SunSoft Freely Distributable LIBM (libfdm) C tor. math library for its transcendental math functions [12]. The Simulators in the ttools release include a functional simu- libfdm routines are IEEE-754 conformant, portable, and well- lator (tsim arch) and a cycle-accurate simulator (tsim proc). tested. The intention is to replace key functions such as sqrt() Despite disparate goals, both share the same C++ code base with those from the IBM IEEE math library (MathLib) [5], due and key attributes: to its accurate table method, which executes efficiently on the

prototype. ¯ A built-in program loader, using the BFD utility routines, Due to area constraints, the prototype includes no floating- to load the code and data into target memory, to create point divide unit and no FDIV instruction. Starting from John TLB entries for the text and data segments as defined Hauser’s platform-independent implmentation of the IEEE by the linker, to initialize the process registers and stack, Standard for Binary Floating-Point Arithmetic [6], we ex- and to schedule the executable to run. tracted the 64-bit software divide routines, hand-tuned them

for efficient execution, and “baked” them into the C runtime ¯ Support for argv, argc, and envp program variables and startup module .o, available to all applications. for stdio, stdout, and stderr console I/O. ¯ A common execution model of fetching the next block 3.1.2 System call support of instructions, mapping the instructions onto the ar- Both simulators provide limited system call support, driven by chitectural registers and execution grid, injecting block the requirements of targeted applications. Currently, the sim- inputs into the grid, executing each instruction as its ulators support the following: brk(), close(), creat(), (), operands become ready, forwarding results to target exe- fstat(), getpid(), gettimeofday(), lseek(), lstat(), open(), cution nodes, and outputting register writes and memory read(), stat(), time(), unlink(), and write(). stores. The C runtime library translates system calls into SCALL instructions, which trap into the simulator’s system services

¯ A set of proxy services for handling system calls. module. The trap handler in turn proxies the service request on the host, according to the POSIX definition of the call.

¯ Common tracing and debugging output formats. 3.2 Timing Simulator

¯ A set of statistical collection routines. In addition to simulating TRIPS processor functionality, Both simulators model single processor behavior. A system (tsim proc) offers a detailed, cycle-accurate model of the simulator (tsim sys) has been developed to simulate execution prototype TRIPS processor. It models the internal organiza- of multiple processors, enabling the user to download and ex- tion and latencies of the processor and is intended to support ecute one or more applications under the control of the TRIPS processor-level performance analysis. Resource Manager, discussed in section 4. Whereas tsim arch models the behavior of architectural To manage complexity and guarantee deterministic behav- registers and execution nodes at the software-visible level, ior, all simulators execute as single-threaded processes on tsim proc additionally models the struc- x86/Linux-based hosts. tures within each. For example, tsim proc details the behav- ior of individual execution nodes when sent commands from the global control unit, when reading packets from and writ- 3.1 Functional Simulator ing packets to the operand network, when filling and consum- ing internal buffers, and during block flushes and commits–in The tsim arch functional simulator is an architecture-level addition to actually selecting, executing, and retiring instruc- simulator intended to model accurately the TRIPS processor tions. architecture. It is equivalent to a traditional instruction-set On a 2.9GHz Pentium 4, tsim arch can execute 800,000+ simulator (or functional ) and does not provide real- simulated instructions per second. Although the increased istic cycle counts. level of detail causes tsim proc to execute 300-800 times slower than tsim arch, we have been pleased to confirm that performance metrics obtained from tsim proc closely match 3.1.1 Execution model those of the actual prototype, discounting memory latencies outside the modeling capability of the timing simulator. Sta- Beginning with the header chunk of the first fetched block, tistical output from tsim proc includes instructions committed typically from the start() routine of crt0.o, tsim arch maps and those flushed (both speculatively and non-speculatively), the block reads onto the designated architectural registers and branch predictor hits and misses, icache and dcache hits and maps the individual instructions onto the execution grid. Each misses, speculative load hits and misses, forwarded stores, register read is forwarded to its target execution node or nodes. OPN packet reads and writes, busy and stalled cycles, buffer Beginning at the “first” active execution node, the simula- occupancy rates, and total cycles per block. tor visits one node after another, executing each instruction if its input operands are available and forwarding the result to the target node or nodes, until it has traversed the grid. Then the 4 SYSTEM SOFTWARE simulator begins a second pass through the grid, again firing instructions with ready operands, and a third, and so on, until This section focuses on system software as it executes on the the block completion logic is satisfied, typically when all store prototype hardware. However, much of this same software ex- instructions have fired, or until a program exception occurs. At ecutes equivalently on our system simulator platform. Those block completion time, the simulator commits all block out- interested primarily in using the software toolchain and simu- puts, both register writes and memory stores. The next block lators can skip to Section 6, “Release Information.” is fetched, based on the virtual address specified in an executed BR or CALL instruction. 4.1 System software goals Through the TLB entries, the simulator translates all mem- ory references to physical addresses. If a TLB miss or other System software goals include enabling efficient hardware fatal program exception occurs, a block playback capability bring-up; enabling users to download, execute, observe, man- outputs an execution trace for the . age, and applications; demonstrating the performance As a convenience, the simulator maintains symbol table in- and capabilities of the prototype on targeted workloads, such formation from the binary in order to print block and variable as on EEMBC and SPEC CPU2000 benchmarks; and support- names during trace operations. ing full hardware utilization of processor, chip, memory, and ¯ To service processor and pass them on to the Host PC. A custom Linux 2.6 kernel module has been developed to man- age PowerPC 440GP bus transactions with the External Bus Control unit on each of the four TRIPS chips. Extensions to the software on the PowerPC 440GP will handle TRIPS sys- tem calls locally.

4.3.1 Boot process

Figure 3: System software components. Wolfgang Denk’s Free Software project, the capable and Open Source u-boot and Linux 2.6 kernel tree, greatly mitigated the complexities of the PowerPC 440GP into Linux [2]. During a motherboard reboot, the PowerPC 440GP is reset board resources. and begins executing u-boot from flash memory to initialize processor registers and peripherals, including its UART, Ether- 4.2 System software components net, and SDRAM controllers. After copying itself to SDRAM and jumping there, the bootloader connects across the LAN The TRIPS system employs these components: connection to the Host PC to download the kernel into proces- sor memory and transfer execution to the kernel entry point.

¯ An x86/Linux Host PC to manage the motherboards and After re-initializing board components, the kernel connects provide a networked file system. across the LAN to learn its hostname and network parameters (via DHCP), to set the onboard clock (via NTP), to mount its

¯ A local area network to support communication among root file system (via NFS), and to accept logins (via SSHD). the motherboards and Host PC.

¯ A hardware debugger to provide initial access to the 4.3.2 EBI adapter PowerPC 440GP registers, TLB entries, peripheral bus, SDRAM controller; and to program the onboard flash At the end of boot process, a boot script creates a /dev/trips memory. device file, installs the ebi driver.ko kernel module, and in- vokes the ebi adapter user-level process.

¯ A bootloader programmed into flash memory. After detecting which TRIPS chips are physically present on the board, the EBI adapter spawns two threads–one to lis-

¯ An embedded executing on the mother- ten on a well-defined socket for Host PC commands from the board. TRM and another thread to wait on processor interrupts poten- tially from all four chips.

¯ A and daemon process to mediate commu- nication between motherboard and target processors. 4.4 Resource Manager

¯ A cross-compilation toolchain for the embedded Pow- The TRIPS Resource Manager (TRM), a user-level process on erPC 440GP processor. the Host PC, functions as a light-weight , ini- tializing the chips, allocating memory, downloading applica- ¯ A resource manager running on the Host PC to monitor and control the target processors, tions, servicing system calls, and monitoring performance.

Figure 3 shows components of the system software. 4.4.1 Low-Level communication To manage complexity, system software is organized in layers, from device driver on the motherboard to end-user We defined a low-level Level (HAL) clients on the Host PC. To leverage engineering resources, consisting of a protocol and API supported between the TRM much of the software derives from Open Source and off-the- executing on the Host PC and the EBI adapter executing on shelf products. each motherboard. The protocol is simple but powerful, con- sisting of individual read and write requests, which always originate from the TRM, and a signalling mechanism by which 4.3 Board Software the EBI adapter can notify the TRM of processor exceptions. Up to four dual-core TRIPS chips on individual daughtercards populate the prototype motherboard. The embedded PowerPC 4.4.2 System call support 440GP processor serves two major purposes: The prototype relies on the Host PC for all system services,

¯ To read and write addresses in the global memory space including console and file i/o, memory allocation, wallclock shared among processors. access, and process termination. When the application issues a system call, such as an enable users to configure clock speeds and morph modes, to open(), read(), or write(), a well-defined sequence occurs, as examine and modify current application status, and to produce specified in [9]: statistical information.

¯ System call parameters are written to registers G3 and following, according to compiler calling conventions. 4.6 Symbolic Debugger A special TRM client is the TRIPS symbolic debugger, tdb. ¯ The C runtime library places the numeric id of the system call in G0, updates the stack pointer in G1, sets the return Although many applications can be instrumented with printf() address in G2, and issues the SCALL instruction. statements or debugged on other platforms, the desire has been to support TRIPS-specific data types, breakpointing, block

¯ The processor sets an bit in the External Bus stepping, memory and register accesses, and runtime libraries. Controller unit and halts. Due to its powerful feature set and familiar interface, we chose to port the GNU debugger gdb to the prototype. The

¯ The device driver on the PowerPC signals the user-level work was highly leveraged, based on: EBI adapter, which in turn raises an exception for the

TRM. ¯ The portable x86/Linux gdb front-end, for command- line editing and platform-independent operations.

¯ During its main loop, the TRM pulls the request from its

network queue and discovers which processor on which ¯ The BFD library, used to read ELF executables. which

chip on which board has caused the exception. was already part of the TRIPS binary utilities, ¯ ¯ After determining that the exception is due to a system The TRIPS opcode library, already built into the assem- call, as opposed to a DTLB translation error, a divide- bler, to disassemble instructions on a per-block basis. by-zero error, a breakpoint, or other exception, the TRM reads the numeric ID of the system call from G0, reads Those portions of tdb requiring further customizations are dis- the pass parameters, depending on the type of system cussed below. call, and executes the call on behalf of the TRIPS ap- plication. 4.6.1 Host-target protocol

¯ Upon proxying the call on the Host PC, the TRM up- gdb normally offers support for remote debugging in two dates processor memory locations as needed and writes forms: the system call return value into G3.

¯ A platform-specific stub library is linked into each target

¯ Finally, the TRM clears the Processor Status Register to executable and handles interrupts and commands from restart processor execution. the host.

With remote servicing complete. the application resumes ex- ¯ A gdbserver runs as a proxy on the target as a separate, ecution at the return address specified in G2. According to heavy-weight process and controls the behavior of the compiler calling conventions. the TRM leaves the return value target application. of the system call in G3. If error occurs, such as an invalid write(), the return value will be -1 and the TRM will set pro- Both forms use a low-level string-based protocol between host gram variable errno appropriately, according to POSIX con- and target. Building on the capabilities of the TRM, we de- vention. fined a higher-level protocol and an API to communicate be- tween (a) the gdb backend on the Host PC and (b) the EBI on the motherboard. The API includes routines to download 4.4.3 Client-level communication and execute programs, to set and clear breakpoints, to continue Running as a background process, the TRM server supports and block-step execution, to read registers and virtual memory multiple users and multiple applications executing on multi- locations, and to map virtual to physical addresses. ple processors. Communication between TRM server and the From the debugger’s perspective, the TRM is both: user’s client application occurs across a TCP/IP socket, con-

¯ A virtual processor, responding to memory and register nected to the user’s terminal session. To simplify control flow, requests and breakpointing commands. all client requests to the TRM are blocking. For example, when

the user requests the TRM server to download an executable ¯ An operating environment, with session management, file from the Host PC into chip memory, the user’s session will program loader, file i/o, and system call support. wait for the request to complete. 4.6.2 Symbol table entries 4.5 TRM Clients The binary utilities and gdb front end already provide exten- A number of x86/Linux client utilities have been developed sive support for symbol table, data type, and in- so that users can easily allocate system resources on the Host formation in the form of stabs entries. However, the Scale PC, download applications, and run them. Additional clients compiler required extensive effort to generate symbol and line information. A major challenge has been to match source lines 4.6.5 Physical addressing against instruction blocks, as blocks typically contain numer- gdb restricts itself to virtual addresses, as determined by the ous source lines. If a code block subsumes multiple source linker at link time and the at runtime. In order to sup- lines, which line should the compiler identify as the “right” port physical memory lookups, we developed a private inter- line to represent the whole block? Although the compiler can face to the TRM, by which the debugger can query the phys- emit code with one-line-per-block, the resulting binary is arti- ical address of a given virtual address. The gdb e(X)amine ficially and excessively bloated. command has been extended to provide such lookups. To that end, the GNU Dynamic Data Debugger (ddd) has Example: The following interaction is based on the dhrys- been ported to the prototype system as a front end to tdb [3]. tone benchmark, when a breakpoint is set at function Proc 7() By providing a friendly user interface and scrollable visual and program variable Int Loc is examined. listings, ddd enables users to orient themselves in their source code while stepping through their application. (tdb) c Continuing.

4.6.3 Breakpoints Breakpoint 1, Proc_7 (Int_1_Par_Val=2, Int_2_Par_Val=3, Int_Par_Ref=0xffffeec0) The processor offers both break-before and break-after hard- at dhry.c:473 ware breakpoints on a per-block basis. Built on top are TRM *Int_Par_Ref = Int_2_Par_Val + Int_Loc; functions to set and clear both types as well as GNU platform- (tdb) ptype Int_Loc independent functions. tdb uses a mix of functions for its type = breakpointing support. (tdb) p Int_Loc When execution has stopped on a breakpoint and the user $1 = 1 issues a continue command, tdb uses TRM functions to re- (tdb) p &Int_Loc move the current break-before breakpoint, set a break-after $2 = (One_Fifty *) 0xffffee50 breakpoint, and continue execution. When it almost imme- (tdb) x/p &Int_Loc diately hits the break-after breakpoint, tdb uses TRM func- 0xffffee50: paddr=0x0147ffee50 tions to restore the original break-before breakpoint, clear the current break-after breakpoint, and resume execution transpar- We learn from this interaction that Int Loc is an integer vari- ently. In this manner, the user interacts with the debugger in able, is defined as an application-specific One Fifty type, familiar fashion. holds a value of “1”, resides on the stack at virtual ad- Note that due to the TRIPS block-atomic execution model, dress 0xffffee50, and occupies physical memory location single-stepping in tdb translates to block-stepping through the 0x0147ffee50. binary; that is, breakpoints are set at block boundaries as the processor executes a whole block at once. Externally, program order is maintained, but within a block instructions fire in non- 5 DEVELOPMENT PROCESSES deterministic order, governed by latencies within the operand network and memory request fulfillments. A future enhance- 5.1 Source Code Management ment will enable the programmer to pinpoint processor faults, We use the Concurrent Version System (CVS) to automate such as a divide-by-zero instruction; namely, an ISA simula- nightly builds from several local and remote source repos- tor sewn into the debugger will replay the block in software to itories. Nightly cron jobs pull from the repositories, build identify the offending instruction. toolchain and supporting components, run some number of “sanity checks” against the newly built components, and if successful, create a 130MB root file system with those com- 4.6.4 Stack unwinding ponents. Unless one of the key components breaks, developers As with any gdb port, we needed to implement stack unwind- enjoy fresh bits each morning. ing routines to provide backtracing, pass parameter, and local variable information. Our debugger required two key enhance- 5.2 Regression Testing ments: To verify operation of constantly evolving toolchain compo- nents, thousands of nightly regression tests are run after the

¯ In order to examine the current stack frame at function invocation, the compiler changed its stabs generation to root file system is installed. Depending on a weekly rotation ensure that breakpoints are set always after the function schedule, tests last from 2 to 16 hours. prologue. 5.2.1 SPEC CPU2000

¯ In order to determine which values are pushed on the Because the prototype supports both streaming and irregular stack, the debugger uses extensive disassembly and anal- control flow applications, the SPEC CPU2000 benchmarks ysis of the prologue, to trace the possible source of represent important workloads. To maintain the integrity of the STORE instructions back to the stack pointer. SPEC CPU2000 , care has been taken to customize only platform-specific information, such as pointer type defi- A variety of published papers are also included, as are slides nitions and Fortran formatted-output specifiers, and to leave in from the HPCA-12 full-day tutorial, “Design and Implmenta- place the original runspec tool and directory structure from tion of the TRIPS EDGE Architecture.” the SPEC distribution media. With a few simple commands, users can specify the toolchain version and compiler flags to use, build the bench- 6.3 Download Instructions marks in parallel, package the binaries and download them A request page for the ttools software is simple to fill out: to the Host PC, run them on the hardware, and generate http://www.cs.utexas.edu/users/cart/trips/ttools-req/ email reports, Webpages, and graphs showing block and cy- The form asks you to enter name, institution, email address, cle counts. and technical interest. Within 2-4 days, you will receive a download invitation and details concerning how to install and 6 RELEASE INFORMATION configure the 27MB gzipped package. The current release package of ttools is TRIPS Tools, Version REFERENCES 1.0. It is freely available by means of a Web distribution page. A description of release contents follows. [1] D. Burger, S. Keckler, K. McKinley, and et al. Scaling to the end of silicon with edge architectures. In IEEE Computer, 37 (7), pages 44–55, July 2004. 6.1 Release Components [2] W. Denk. U-boot - open source firmware for embedded powerpc, A variety of x86/Linux-based tools, TRIPS binaries, and sam- 2007. http://www.denx.de. ple sources are included in the release: [3] F. S. Foundation. Ddd - data display debugger, 2005. http://- www.gnu.org/software/ddd.

¯ C and Fortran cross-, binary utilities, ISA [4] F. S. Foundation. Gnu binutils, 2006. http://sources.redhat.com/- and cycle-accurate simulators, and performance analysis binutils. tools. [5] S. Gal and B. Bachelis. An accurate elementary mathematical

¯ Application and system header files. library for the ieee floating point standard. In ACM Transactions on Mathematical Sofware,, Vol 17, No. 1, pages 26–45, March

¯ Optimized and unoptimized runtime libraries. 1991. [6] J. Hauser. Softfloat, 2002. http://www.jhauser.us/arithmetic/- ¯ C and Fortran test files. SoftFloat.html.

¯ Source code for hand-optimized routines. [7] S. Mahlke, D. Lin, W. Chen, R. Hank, and R. Bringmann. Ef- fective compiler support for predicated execution using the hy- Note that the package does not include system software, such perblock. In Proceedings of the 25th Annual International Sym- as the resource manager, system simulator, debugger, and posium on Microarchitecture, pages 45–54, June 1992. hardware-specific components. [8] K. S. McKinley, J. Burrill, B. Cahoon, J. E. B. Moss, Z. Wang, Note also that the cross-tools are supported only on and C. Weems. The scale compiler. Technical report, University x86/Linux systems. of Massachusetts, 2001. http://ali-www.cs.umass.edu/scale/. [9] A. Smith, J. Burrill, R. McDonald, and et al. Trips intermedi- ate language manual - tr-05-22. Technical report, University of 6.2 Manuals and Reference Information Texas, March 2007. The following documents are included in the release: [10] A. Smith, J. Gibson, J. Burrill, and et al. Trips intermediate lan- guage manual - tr-05-20. Technical report, University of Texas,

¯ Tools Quickstart Guide March 2007. [11] A. Smith, R. McDonald, R. Nargarajan, K. Sankaralingam, ¯ Processor Reference Manual 1.2 D. Burger, K. McKinley, and S. Keckler. Dataflow predication. In Proceedings of the 39th Annual International Symposium on ¯ Chip Reference Manual 1.0 Microarchitecture, pages 236–245, 2006.

¯ Performance Manual [12] SunSoft. Freely distributable libm, 2004. http://www.netlib.- org/fdlibm/readme.

¯ Intermediate Language (TIL) Manual [13] F. von Leitner. Diet libc - a libc optimized for small size, 2006. http://www.fefe.de/dietlibc. ¯ Application Binary Interface (ABI) Manual [14] B. Yoder, R. McDonald, and et al. Trips assembler and assem-

¯ Assembler and Assembly Language (TASL) Specification bly language manual - tr-05-21. Technical report, University of Texas, September 2003.

¯ Format (TOFF) Specification