Software Infrastructure and Tools for the TRIPS Prototype
Total Page:16
File Type:pdf, Size:1020Kb
Software Infrastructure and Tools for the TRIPS Prototype Bill Yoder Jim Burrill Robert McDonald Kevin Bush KatherineCoons MarkGebhart SibiGovindan BertrandMaher Ramadas Nagarajan Behnam Robatmili Karthikeyan Sankaralingam SadiaSharif AaronSmith DougBurger StephenW.Keckler KathrynS.McKinley Computer Architecture and Technology Laboratory Department of Computer Sciences The University of Texas at Austin [email protected] - www.cs.utexas.edu/users/cart/trips ABSTRACT between the motherboard and chips. A key theme of software development has been to adapt The TRIPS hardware prototype is the first instantiation of an Explicit off-the-shelf solutions when possible and to create homegrown Data Graph Execution (EDGE) architecture. Building the compiler, solutions when necessary. toolset, and system software for the prototype required supporting the The remainder of this paper is organized as follows. Sec- system’s unique dataflow construction, its banked register and memory tion 2 discusses the TRIPS language toolchain and the ap- configurations, and its novel Instruction Set Architecture. In particu- proach we used to implement a full-featured software devel- lar, the TRIPS ISA includes (i) a block atomic execution model, (ii) ex- opment kit. Section 3 discusses several of the TRIPS software plicit mappings of instructions to execution units, and (iii) predicated simulators, enabling us to prove out a number of design ideas instructions which may or may not fire, depending on the outcome during every stage of development. Section 4 discusses the of preceding instructions. Our primary goal has been to construct TRIPS operating environment, with its use of a host-based re- tools to consume standard C and Fortran source code and generate source manager to download, execute, and control TRIPS bi- binaries both for the TRIPS software simulators and hardware proto- naries on the target processors. Section 5 covers the TRIPS type. A secondary goal has been to build the software infrastructure on build, integration, and testing processes. Section 6 describes standard platforms using standard tools. These goals have been met the availability of the tools and simulators for the TRIPS pro- through a combination of off-the-shelf and custom tools. We present totype. a number of design issues and their resolution in enabling end users to exercise the prototype ISA using familiar tools and programming interfaces. Finally, we offer download instructions for those who wish 2 LANGUAGE TOOLCHAIN to test-drive the TRIPS tools. The TRIPS toolchain is responsible for consuming source files written in ANSI C, a subset of GNU C, and Fortran 77; apply- 1 INTRODUCTION ing both classic and novel optimizations to the control flow structures; and generating binary objects that can be linked The development of the TRIPS processor has required build- with standard math and C runtime libraries. Support for ing a large software infrastructure and set of tools (or ttools), C++ and Fortran 90 applications has thus far not been judged including functional and timing simulators; an assembler, critical. Toolchain construction has been guided by a well- linker, and binary utilities; a high-level language compiler and designed set of components or links, each with clearly spec- instruction scheduler; a set of optimized and unoptimized run- ified syntactical and calling conventions. Figure 1 shows the time libraries; a resource manager to coordinate and control various components of the toolchain. system resources; and a variety of build and test utilities. To manage the complexity of compiler development, we 2.1 Architectural Constraints implemented major components using well-defined interfaces, a variety of Open Source products, and modular construction. The TRIPS architecture has several features that require un- To create a runtime system in a timely fashion with modest usual support in the software toolchain. The ISA defines a resources, we limited the feature set of the operating environ- block atomic execution model, in which program functions ment, put the bulk of command and control on a stock desktop are subdivided into variable-size blocks of up to 128 instruc- Host PC running x86/Linux, used off-the-shelf software and tions [1]. When fetched, each block is mapped onto a row development tools for the motherboard, and defined simple but of architectural registers and a grid of execution units. Each powerful protocols between the Host PC and motherboard and TRIPS block comprises a header chunk and one to four in- All stack items are aligned on high virtual memory 8-byte boundaries. environment = 0xFFFFFFFF and argv strings argument save area return address =0 sp of _start() back chain pointer=0 register save area local variable area Stack grows down. argument save area return address Fixed size sp of main() back chain pointer link area of main() register save area local variable area argument save area return address link area of func() sp of func() back chain pointer Figure 1: Toolchain links. register save area local variable area return address link area of leaf() struction chunks. Register read and write instructions are em- sp of leaf() back chain pointer bedded in the header chunk, in addition to fields for magic unused stack area number, block type information, and processor control flags. Application bss The ISA places strict limits on the number of instructions in a and data area block, on the number of register reads and writes, and on the low memory combined number of load/store instructions. Loader places data beginning at 0x80000000. Optimally mapping instructions on the grid is critical, due Figure 2: Application memory organization. to the TRIPS dataflow execution model, in which each execu- tion node fires when its operands arrive. Cycle counts depend not only on opcode cycles but on operand arrival times, them- selves determined by feeds from banked architectural regis- ters, a routing operand network (OPN), and a banked L1 data ¯ A common set of application programming interfaces cache. that can be used by other tools, such as the simulators Despite such contraints, numerous features have made the and debugger. prototype easier to program, including regular and clearly de- We chose to support a simple version of the Executable Link- fined instruction formats; a global 40-bit address space, which ing Format (ELF) to output a small number of string tables and enables software to access any chip register or memory lo- text, data, and bss program sections. Figure 2 shows how the cation through its unique address; and a uniform processor linker lays out application data in virtual memory. The figure exception model, in which all processor exceptions occur at also reflects the simple but effective calling convention of the block boundaries and present a consistent interface to the soft- TRIPS Application Binary Interface (ABI) [9]. One simpli- ware. fication that made binutils development easier was dropping Throughout toolchain development, we relied on a ro- support for shared libraries–all TRIPS executables are stati- bust functional simulator, an accurate timing simulator, and cally linked, which dramatically simplifies installing, version- a sophisticated set of performance tools to analyze execution ing, and debugging toolchain components and applications. traces, to visualize instruction placement, and to pinpoint crit- The utility front ends are largely platform independent. For ical paths and execution bottlenecks. example, once we specified the TRIPS data types, such as 64- bit pointers, longs, and doubles and 32-bit integers and floats, 2.2 Binary Utilities the GNU utilities transparently provided support for allocating and accessing TRIPS scalar variables. The assembler, linker, and other utilities, such as nm, obj- However, to support the prototype ISA and high-level com- dump, and ar, are ports of the GNU binary utilities (binutils), piler. the TRIPS-specific back ends required a significant port- available from the Free Software Foundation [4]. ing effort. Descriptions of major customizations follow. A number of benefits have accrued from the decision to use the binutils package: 2.2.1 Block-oriented rather than line-oriented ¯ A leveraged and rich set of functionality based on a ma- assembly ture codebase. The GNU front end assumes that each line of a source file ¯ A large installed user base and an active, experienced translates into one instruction word. Once the TRIPS Assem- development community. bly Language (TASL) had been defined, encoding the individ- ual 32-bit instructions proved straightforward, simply requir- To save instruction count (4 vs. 2 constant instructions), ing the tas assembler to parse and translate the source file a the compiler emits CALLO and BRO instructions for all jump line at a time [14]. targets. Because the offset field specifies 128-byte chunks and ¾7 However, to support the TRIPS block-atomic model, TASL supports a 20-bit offset, any target within a ¾ byte-spread is syntax groups instructions and register reads and writes into reachable. blocks, demarcated with “block begin” (.bbegin) and “block However, for very large applications the 20-bit offset field end” (.bend) directives for that block. The tas assembler can be insufficient. In such cases, the linker must detect the therefore tracks the individual statements after the opening shortfall, “relax” the current code section, and insert a tram- .bbegin directive and enters them into a memory structure rep- poline above the current block which fully specifies the abso- resenting the execution grid and architectural registers. When lute target address with a fabricated BR instruction. The orig- encountering the matching .bend directive, the assembler tra- inal CALLO or BRO instruction in the block is retargetted to verses the data structure, marks instructions that require relo- the adjacent trampoline block, which bounces execution to the cation, uses the GNU Binary File Description (BFD) routines far-flung address. Code trampolines introduce an extra level of to translate x86 Little-Endian formats to TRIPS Big-Endian, indirection. Fortunately, the need for them rarely occurs. and commits the whole block to disk.