Teaching Computer Architecture/Organisation using simulators

Herbert Grünbacher Vienna University of Technology Treitlstrasse 3/182-2, A-1040 Vienna / Austria E-mail [email protected]

Abstract

Experience shows that many students, especially those with little hardware background, encounter difficulties in understanding the consequences and even the concepts of conventional instruction pipelining; superscalar instruction processing is even more complicated and harder to understand. It is particularly difficult to teach the concept of a pipeline statically. Therefore we developed software to simulate and dynamically visualize the processing of instructions by pipelined (superscalar) processors. Three simulators have been developed:
• WinDLX is based on Hennessy/Patterson's DLX architecture and is modeled at the architecture level; therefore very little processor-internal information is given.
• MIPSim is based on the MIPS processor from Patterson/Hennessy's book and is modeled at the computer organization level; functional units like register file, pipeline registers and multiplexers are visible, and MIPSim displays the content and dynamic behavior of such units.
• M10kSim is based on the MIPS R10000 architecture and models the instruction decode and dispatch unit, the branch unit, the instruction queues and the functional units (address calculation, both ALUs, floating-point adder, floating-point multiply/divide/square-root unit). Concepts like [...], branch history table, branch resume buffer and out-of-order execution can be explained easily using the simulator.
Teaching cache organization is an easier task; nevertheless, visualising cache activities helps in understanding the dynamics of a cache memory. Xcache is a simulator which displays the interactions between instruction memory and instruction cache, and between data memory and data cache, respectively.
The simulators are available for free download from http://www.vlsivie.tuwien.ac.at/CompArch

Introduction

Teaching the dynamics of pipelines and caches is rather difficult if done on a paper and pencil basis. In our experience students find it difficult to understand the principles and complications of pipelines and, to a lesser extent, of caches. To support teaching and give students an environment to experiment in, we developed several pipeline simulators and a cache simulator.
My experience is that students appreciate using simulators and, by using them, are easily introduced to the subject. Based on the knowledge gained from using the simulators they are motivated to study the subject further using books.
Almost all of our students have their own PCs and most of them run Windows95/NT. This was the main reason why we developed the simulators to run under MS Windows. It turned out that students particularly like to work at home, and they are usually well prepared to ask questions in class.

WinDLX

WinDLX is an MS-Windows (16 bit) based pipeline simulator for the DLX processor as described in [1]. DLX is modeled at the architecture level; very little about the underlying computer organization is known at that level.
After loading symbolic DLX assembler code, most of the information relevant to the CPU (pipeline, registers, I/O, memory, …) can be viewed and modified while executing the code step-by-step or continuously. WinDLX also offers statistics about pipeline behavior over time.
WinDLX works with several configurations: the structure (number of floating point functional units) and the latency of the floating point units can be changed. Forwarding can be enabled or disabled, and the memory size can be modified. Extensive online help is available to explain the simulator and the internals of DLX.
The "Register", "Code", "Pipeline", "Clock Cycle Diagram", "Statistics" and "Breakpoints" windows show internals of the pipeline. Further explanation is given below.

Figure 1 Main Window with open Code Window
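The effect of switching forwarding on or off can be approximated with a few lines of Python. The sketch below is not WinDLX code. It assumes a classic five-stage integer pipeline (IF, ID, EX, MEM, WB) with one instruction issued per cycle, a register file that can be written and read in the same cycle, and textbook timing: without forwarding a consumer's EX stage must come at least 3 cycles after the producer's EX stage, with forwarding 1 cycle (2 cycles after a load). The names Instr and raw_stalls and the four-instruction program are hypothetical and chosen for illustration only; stores are simplified to need their operands in EX.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                      # mnemonic; only "lw" (load) is treated specially
    dest: Optional[str]          # register written, or None (e.g. for a store)
    srcs: Tuple[str, ...] = ()   # registers read


def raw_stalls(program, forwarding):
    """Return (stall_cycles, total_cycles) for a straight-line program."""
    if not program:
        return 0, 0
    ready = {}    # register -> earliest cycle in which a consumer may enter EX
    prev_ex = 2   # the first instruction is in IF in cycle 1 and in EX in cycle 3
    stalls = 0
    for ins in program:
        earliest = prev_ex + 1   # in-order issue, one instruction per cycle
        ex = max([earliest] + [ready.get(r, 0) for r in ins.srcs])
        stalls += ex - earliest
        if ins.dest is not None:
            if not forwarding:
                gap = 3          # assumed: result is read back via the register file
            elif ins.op == "lw":
                gap = 2          # assumed: loaded value is forwarded after MEM
            else:
                gap = 1          # assumed: ALU result is forwarded after EX
            ready[ins.dest] = ex + gap
        prev_ex = ex
    return stalls, prev_ex + 2   # the last instruction still needs MEM and WB


# Hypothetical test program with a chain of RAW dependences.
program = [
    Instr("lw",  "r1", ("r2",)),
    Instr("add", "r3", ("r1", "r4")),
    Instr("sub", "r5", ("r3", "r1")),
    Instr("sw",  None, ("r5", "r2")),
]

for forwarding in (False, True):
    stalls, cycles = raw_stalls(program, forwarding)
    print(f"forwarding={forwarding}: RAW stalls: {stalls} "
          f"({100 * stalls / cycles:.2f} % of all cycles)")
```

Under these assumptions the example program needs 6 stall cycles in 14 cycles without forwarding and 1 stall cycle in 9 cycles with forwarding. The simulator reports the same kind of figures for real DLX programs.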

Code Window

The code window displays a three column representation of the memory: address (symbolic or in hex), the machine code in hex and the assembler command. Figure 1 shows the main simulation window with a code segment in the open Code Window. Color coding in the different simulation windows is consistent, e.g. WB (Write Back) is colored in blue. Double-clicking on instructions in any of the simulation windows displays pipeline status information in text form giving details about internal registers, operations, stalling and forwarding status.

Pipeline Window

The pipeline window shows the inner structure of the DLX processor - the five pipeline stages of the DLX processor and the floating point units (addition / subtraction, multiplication and division).

Clock Cycle Diagram Window

Figure 2 - the cycle diagram window - shows the timing behavior of the pipeline. The simulation shown is in the 4th cycle, the first command is in the MEM stage, the second in intEX and the fourth in IF. The third command, however, is denoted as "aborted". This is because the second command, jal, is an unconditional branch. This is known after the 3rd cycle, when jal has been decoded. During this cycle the command movi2fp (following after jal) has already been fetched, but the next executed command will be at another address. Therefore the execution of movi2fp must be aborted, leaving a "bubble" in the pipeline. The branch address of jal is named "InputUnsigned". By clicking Memory/Symbols in the main window, the correspondence between the used symbols and the actual addresses is shown.
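The situation in the clock cycle diagram can be reproduced with a small Python sketch. It is not taken from WinDLX. It assumes an ideal five-stage pipeline without stalls in which an unconditional jump is recognised in its ID stage, so that the one instruction fetched during that cycle is aborted. The function name cycle_diagram is hypothetical, and the names given to the first and fourth instruction are placeholders because the text does not name them.

```python
STAGES = ["IF", "ID", "intEX", "MEM", "WB"]


def cycle_diagram(fetched, n_cycles):
    """fetched: (name, is_jump) pairs in fetch order, one fetch per cycle.

    Returns a list of (name, row); row[c] is the stage occupied in cycle c + 1.
    """
    rows = []
    abort_this = False
    for i, (name, is_jump) in enumerate(fetched):
        row = [""] * n_cycles
        if abort_this:
            # Fetched while the preceding jump was still being decoded:
            # the instruction is discarded after IF and leaves a bubble.
            plan = ["IF", "aborted"]
        else:
            plan = STAGES
        for k, stage in enumerate(plan):
            if i + k < n_cycles:
                row[i + k] = stage
        rows.append((name, row))
        # Only a jump that is actually executed aborts its successor.
        abort_this = is_jump and not abort_this
    return rows


fetched = [
    ("instr 1", False),                 # placeholder name
    ("jal InputUnsigned", True),
    ("movi2fp", False),                 # fetched in cycle 3, then aborted
    ("instr at InputUnsigned", False),  # placeholder for the jump target
]

for name, row in cycle_diagram(fetched, 4):
    print(f"{name:24}" + "".join(f"{stage:9}" for stage in row))
```

In the last column (cycle 4) the four rows show MEM, intEX, aborted and IF, which matches the state described for Figure 2.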

Figure 2 Clock Cycle Diagram

Breakpoint, Register and Statistics Window

Setting breakpoints stops the simulation at user-defined points.
The register window shows all registers, not just the [...], and their content in hex.
The statistics window provides information about general aspects (e.g. number of simulation cycles), the hardware configuration used in the simulation, stalls and their causes, conditional branches, load/store instructions, floating point stage instructions and traps. Usually, the absolute count of events and the percentage are given, e.g. "RAW stalls: 17 (7.91 % of all cycles)".
The statistics window is very useful for comparing the effects of changes in the pipeline configuration.

MIPSim

MIPSim is a pipeline simulator for the MIPS processor as described in [2]. MIPS is modeled at the computer organization level. Functional units like register files, pipeline registers, ALU and multiplexers, as well as data and control flow, are visible.
The user can write small programs (currently only a subset of the MIPS instruction set is implemented) and watch the pipeline doing its work, modify the program and the content of data memory and register file 'on the fly' and go on simulating to see the effects.
At present MIPSim models a rather simple pipeline without hazard detection and forwarding units.

Assembler Program / Instruction Memory Content

The leftmost window in Figure 3 shows the program code. The program can be executed in single step or running mode. By setting the pointer (in essence the program counter) to a particular address, manual jumps in the program can be accomplished. Double-clicking on the Instr. box opens a window in which the instruction memory content (the program) can be modified.

Data Memory Content

Double-clicking on the Data box opens a window in which the data memory content can be modified (overwritten) interactively.
Modifying the content of instruction and data memory is very valuable for experimenting with the pipeline, e.g. to show data hazards.

Control/Data Flow Signals

After the program code has been executed, data path and control signals can be displayed by clicking on them. The instruction content of the different pipeline stages is displayed on top of each stage.
Extensive help as well as an introductory tutorial is available online.

Figure 3 MIPSim Window with Data and Control Signals

M10kSim

The R10000 is a dynamic superscalar microprocessor which implements the 64-bit MIPS Instruction Set Architecture [3], [4]. It fetches and decodes four instructions per cycle and dynamically issues them to five fully-pipelined low-latency execution units. Instructions can be fetched and executed speculatively beyond branches. Instructions graduate in order upon completion. Although execution is aggressively out-of-order, the processor still provides sequential memory consistency and precise exception handling.

Model of the R10000

Our R10000 model concentrates on the most important issues of a superscalar architecture, and we wanted to have an easy-to-learn, not too complex user interface. The following parts of the processor are modelled:
Instruction decode and dispatch unit, responsible for instruction fetching, instruction decoding, register renaming and finally dispatching the instruction to the appropriate queues. The dispatcher works together with the branch unit when predicting the outcome of conditional branches. During this process they need to access the branch history table and the branch resume buffer, which are therefore also simulated. As soon as instructions are dispatched to the queues they are also given an entry in the active list, which is also part of our simulation.
All of the R10000's instruction queues, namely an address queue, an integer queue and a floating-point queue, are included in the simulation. To determine which operand results are ready, they access the busy table, which is also simulated.
The remaining parts of the simulation are the five functional execution units: the address calculation unit, both ALUs, the floating-point adder and the floating-point multiply/divide/square-root unit.
Data is read from and written to memory, which can be viewed and modified during the simulation.
The memory is simplified and is assumed to be accessible without any delay. Exception handling is not implemented. The functional units simulate latencies and repeat rates correctly, but their internal pipeline structure is not visible as it is in MIPSim. Only a reduced set of instructions is implemented.

Figure 4 M10kSim Main Window

The M10k Simulator

The left part of Figure 4 shows the assembler window and the active list window. The block diagram on the right side of the main window shows the main components of the simulator.
During a simulation run, the instructions, represented as small balls in different colours, "wander" along the connections between these elements, thus demonstrating their flow through the superscalar instruction pipeline. When an instruction reaches one of the processor's units, such as a queue or a functional unit, its "ball" representation disappears and the unit takes over the display of the instruction.
Clicking on the Queue, Register and Data Cache boxes displays the content of the respective functional units.
Extensive help and an introduction on how to use the simulator are available online.

Xcache

Xcache simulates and visualizes the behaviour of a cache on a step-by-step basis rather than performing statistical evaluations.
However, to enable advanced cache performance analysis, an interface to Mark D. Hill's cache simulator DINERO was incorporated.
Xcache uses the same format for memory reference patterns as DINERO. The user may specify cache parameters like associativity, size, etc., then load an input stream and watch the cache at work. Alternatively, it is possible to define the command-line parameters for DINERO, run a simulation with it and view the results.

Figure 5 Xcache Main Window
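The step-by-step view of a cache working through a reference stream can be imitated with a short Python sketch. It is not Xcache code. It assumes a single set-associative cache with LRU replacement in which every reference, read or write, allocates a block. It also assumes the traditional DINERO "din" trace format: one reference per line, consisting of a label (0 = data read, 1 = data write, 2 = instruction fetch) and a hexadecimal address. The class and function names, the cache parameters and the six-line trace are hypothetical.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache with LRU replacement (illustrative model only)."""

    def __init__(self, size_bytes, block_bytes, assoc):
        self.block_bytes = block_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (block_bytes * assoc)
        # One OrderedDict per set: keys are tags, ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Reference one address; return True on a hit, False on a miss."""
        block = addr // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            return True
        if len(lines) >= self.assoc:
            lines.popitem(last=False)     # evict the least recently used block
        lines[tag] = None
        return False


def parse_din(text):
    """Yield (label, address) pairs from din-style lines: '<label> <hex address>'."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            yield int(fields[0]), int(fields[1], 16)


def step_through(cache, trace_text):
    """Return one (label, address, hit) tuple per reference, in trace order."""
    return [(label, addr, cache.access(addr))
            for label, addr in parse_din(trace_text)]


# Hypothetical trace and parameters: 64-byte cache, 16-byte blocks, 2-way.
trace = """\
2 0
0 100
2 4
1 200
2 0
0 100
"""

for label, addr, hit in step_through(Cache(64, 16, 2), trace):
    print(f"label={label} addr=0x{addr:03x} {'hit' if hit else 'miss'}")
```

With these parameters all six addresses map to the same set, so the trace shows two hits (the third and fifth reference) and one LRU eviction that later causes the final miss.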

Implementation

The simulators have been written in C++. We make the source code available provided that we get the modified sources in return and that the executables are made freely available to the public.

Summary

We have no quantitative measures of how much our simulators have improved the teaching of computer architecture. But we do know that students spend more time getting familiar with pipelining than they spent using the paper and pencil approach. We also know that students come well prepared to the lab which accompanies the course and, even better, come with proposals on how to improve the pipeline. Another point which is attractive to students is that they can work at home; this is also helpful to the University as it reduces the load on our computers.

Acknowledgment

The simulators have been developed as part of diploma theses at our department. Special thanks go to G. Raidl, M. Frigeri, G. Gridling, Ch. Fuss and J. Silhan.

References

[1] D.A. Patterson / J.L. Hennessy: Computer Architecture - A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, California, 1990
[2] J.L. Hennessy / D.A. Patterson: Computer Organization & Design - The Hardware/Software Interface, Morgan Kaufmann Publishers, San Mateo, California, 1994
[3] Presentation of the R10000, Hot Chips, August 17, 1995
[4] J. Heinrich et al.: MIPS R10000 Microprocessor User's Manual, Version 2.0, MIPS Technologies, Mountain View, CA, October 1996
[5] WinDLX, MIPSim, M10kSim, Xcache at http://www.vlsivie.tuwien.ac.at/CompArch