Retargeting Open64 to A RISC processor

-- A Student’s Perspective

Huimin Cui, Xiaobing Feng Key Laboratory of Computer System and Architecture, Institute of Computing Technology, CAS, 100190 Beijing, China {cuihm, fxb}@ict.ac.cn

[6]. Abstract Open64 has been retargeted to a number of This paper presents a student’s experience architectures. Pathscale modified Open64 to in Open64-Mips prototype development, we create EkoPath, a compiler for the AMD64 and summarize three retargeting observations. X8664 architecture. The University of Open64 is easy to be retargeted and the Delaware's and Parallel procedure takes only a short period. With the Systems Laboratory (CAPSL) modified Open64 retarget procedure done, the compiler can to create the Kylin Compiler, a compiler for achieve good and stable performance. Open64 's X-Scale architecture [1]. Besides the also provides many supports for debugging, with targets mentioned above, there are several other which a beginner can debug the compiler supported targets including PowerPC [6], without difficulty. We also share some NVISA [9], Simplight [10] and Qualcomm[11]. experiences of our retarget, including In this paper, we will discuss three issues methodology of verifying the compiler when retargeting Open64: ease of retarget, framework, attention to switches in Open64, performance, and debuggability. The discussion importance of debugging and reading generated is based on our work in retargeting Open64 to code. the MIPS platform, using the Simplight branch as our starting point, which is a RISC style DSP 1 Introduction with mixed 32/16 bit instruction set. During our discussion, we will use GCC (Mips target) for Open64 receives contributions from a comparison. This retarget work is just a student number of compiler groups around the world, project, so we have some non-goals, which will from industry as well as from academia. It is be discussed in section 5. derived from the SGI MIPSPro64 compiler [1]. And we will also share some of our Good performance and industrial strength origin retargeting experiences, in four folds: 1) Steps to make the Open64 compiler a popular choice for verify the compiler framework. 2) Turn on/off research projects. The Open64 compiler supports switches for special hardware features. 3) /C++ and Fortran95 languages and many Debugging is very important, not only for researchers from industry and academia have retarget, but also for later extensive work. 4) Pay contributed to the retargeting of the Open64 attention to the generated code if it doesn’t work

1 as expected. isa//isa.cxx The paper is organized as follows. Section isa//isa_properties.cxx 2 introduces the retarget procedure. Section 3 isa//isa_subset.cxx presents our major retarget results and analysis. isa//isa_registers.cxx Section 4 summarizes our experiences and isa//isa_enums.cxx expectations. Section 5 shows our non-goals and isa//isa_lits.cxx section 6 concludes the paper. isa//isa_operands.cxx isa//isa_hazards.cxx 2 Suggested Retarget Procedure isa//isa_print.cxx isa//isa_pack.cxx This section presents our retargeting isa//isa_bundle.cxx procedure, and it is applicable if one can find a isa//isa_decode.cxx similar processor in Open64’s supporting targets. isa//isa_pseudo.cxx Here, letters (A,B,..) represent the file creation abi//abi_properties.cxx order, and numbers (1,2...) represent the building proc//proc.cxx order. proc//proc_properties.cxx (A). Add make rules for your target() in proc//_si.cxx top level Makefile.gsetup This directory contains most of the basic (B). Create dir targia32_targ setting for retarget purposes. Most retarget This is the build directory for your new problems will be due to wrong settings in this target. Host is specified as ia32. directory. All these files should be written (C). Create and set up Makefile in each subdir carefully. under targia32_targ (3). Build targ_info Make sure the BUILD_TARGET is set (F). Deal with frontend in kgccfe and kg++fe correctly so it goes to the right target directory directories. for each sub component. In kgccfe/, kgccfe/gnu/config, kg++fe/ Set your BUILD_VENDOR correctly. gnu, kg++fe/gnu/config, create directory of (1). Build include by copying from ia64 or x8664 and (D). In common/com, create / config_ fixing the values and the include file names as targ.h well as define statement. Add macros for Is_Target_xxx, xxx Fix up the gnu_config.h to include the represents the supported processors of the corresponding config.h file for your target. new target, e.g, Is_Target_R10K() for MIPS. Fix up the /config.h, / Add ABI macros for the new target. hconfig.h, /tconfig.h, /tm_p.h, to Set GP area size. include the corresponding file for your target. Set target debugging information. Make sure Makefile and Makefile.gbase In common/util, create c_qwmultu.c by know to go to gnu/ directory also. It copying from ia64, or as appropriate. requires one to check whether gnu/ has (2). Build libcmplrs, libiberty, libcomutil been added into SRC_DIRS. Sometimes, it helps to show the compile In include/elf.h, the #define for Elf32_Byte line during building of the compiler, simply and Elf64_Byte can always be enabled. change gcommondefs and gcommonrules in If need to add intrinsic, the following files /make/. needs changing. (E). In common/targ_info, create dir's ‹ intrinsic.def – defines and their files. Use this order:

2 INTRINSIC_ID, property of consistency. intrinsic (order is important, config_cache_targ.cxx sets cache models. indexed by INTRINSIC_ID) Targ_em_* files are for assembly output ‹ FE definition files, kgccfe/gnu control, section types, relocations, dwarf etc. ‹ Builtins.def – id, name, One only need to change these files when focus prototype, attribute in BE portion. ‹ Builtin-types.def – needed if new Basically one needs to tailor files type is needed mentioned about and then continue building ‹ Wfe_expr.cxx – translate GNU components as follows. builtin to WHIRL. (4). Build gccfe (G). Create the following files: (5). Build g++fe common/com//config_platform.h (6). Build ir_tools common/com//config_asm.h (H). Create be/be/ and be/com/ common/com//config_cache_targ.cxx directories common/com//config_elf_targ.cxx Create driver_targ.cxx, fill_align_targ.cxx common/com//config_host.c in be/be/ dir common/com//config_platform.c Create betarget.cxx, sections.cxx in common/com//config_targ.cxx be/com/ dir common/com//config_targ.h There is no need to be optimal at this point, common/com//config_targ_opt.cxx just make it work, and then make it perform. e.g. common/com//config_targ_opt.h One can set Can_Do_Fast_Multiply to return common/com//targ_const.cxx FALSE, and return to that later. common/com//targ_const_private.h (7). Build be common/com//targ_ctrl.h (8). Build libelf, libelfutil, libdwarf, libunwindP common/com//targ_em_const.cxx (I). Create be/cg//register_targ.h common/com//targ_em_dwarf.cxx be/cg//tn_targ.h common/com//targ_em_dwarf.h be/cg//op_targ.h common/com//targ_em_elf.cxx Create one’s own target specific portion in common/com//targ_em_elf.h lib/elf.h (or appropriate elf.h), define the right common/com//targ_sim.cxx relocations, sections, etc. common/com//targ_sim.h (9). Build wopt All these files can be copied from the (J). Create files in be/cg/ directory of a similar target, and modified as Fix cgtarget_arch.h, such as CGTARG_ required. Copy_Op, ... Make sure the endianness is set correctly. It There might be needs to add more things in is set in config_host.c, config_targ.cxx and be/cg/op.h common/com/config.cxx. Fix cgdwarf_targ.cxx, Find_Spill_TN, ..., Make sure the assembly syntax for .s file unwind table in dwarf. output is set correctly. It is set in config_asm.c. Copy other files from a similar processor’s Make sure the calling convention and directory. Go over expand.cxx, whir2ops.cxx parameter passing are set correctly. They are set carefully. And exp_branch.cxx, exp_divrem.cxx, in targ_sim.h and targ_sim.cxx. If you need your exp_loadstore.cxx, expand.cxx, entry_exit_targ. own ABI definition, just change this file, and do cxx needs to be changed or tailored substantially. not forget to modify targ_sim.cxx for (10). Build cg

3 (11). Build driver 3.2 Main Results (12). Build ipl (13). Build lno Observation 1 (See also section 3.3.1). A (14). Build inline student can complete the retarget procedure in (15). Build whirl2c about two months, if an appropriate supported (16). Build ipa target processor can be used to start the retargeting. In our case, the Simplight SL1 3 Retarget Results & Discussion processor is a RISC based processor with 16 bit instructions and DSP extensions. We chose this 3.1 Machine Assumption & as our starting point for its RISC basis as ISA. benchmarks Observation 2 (See also section 3.3.2). Performance of the generated Open64 compiler Machine Assumption: is satisfying, in two folds. Processor: 2f, out of order 1. We expect the retargeted Open64 ISA: MIPS3 compiler will generate very good code due to the Frequency: 666MHz excellent middle ends such as Wopt and Lno Issues: 4 components. Indeed, when we achieved ALU: 2 competitive or better performance compared to a FALU: 2 matured GCC compiler, for all the benchmarks MEM Unit: 1 we tested as a result of our effort. It is customary to use SPEC CPU2000 or 2. Open64 provides obvious performance 2006 to evaluate a compiler performance, but improvement with optimization option switched that is our non-goal (discussed in section 5). As a from –O2 to –O3, especially for C++ programs student project, we only consider the following with higher abstraction levels. benchmarks: Stanford benchmark, the Observation 3 (See also section 3.3.3). abstraction penalty benchmark written by The debuggability of Open64 compiler and its Stepanov [3], two programs chosen from SPEC generated code is good, it is not difficult for a CPU2000 - bzip2 and mcf. beginner to learn and use. Stanford benchmark [2] consists 7 small programs: hanoi, bubble, matmul, perm, qsort, 3.3 Detailed Discussions queen, and sieve. The abstraction penalty benchmark summarizes the characteristics of 3.3.1 Ease of Retarget generic programs; it measures the performance Following the suggested retarget procedure, under different levels of abstraction for C++ we completed our retargetting work in two programs. The baseline is a non-generic version months. In a short period, our compiler has of the kernel function, as shown by Figure 3.1. passed all the benchmarks mentioned in section 3.1, with option from –O0 to -O3. It means that double data[SIZE]; … … our compiler framework is reasonable and the for(int i = 0; i < iterations; ++i) { double result = 0; major components work well. for (int n = 0; n < last - first; ++n) As we know, compiler development needs result += first[n]; check(result); to deal with the issues of ISA, ABI, code } generation, and some machine-dependent Figure 3.1 non-generic version in abstract functions or methods. All these issues are penalty benchmark covered in the suggested retarget procedure, as presented in Section 2.

4 Porting GCC is similar; it requires the comparison of -O2 and -O3 for Open64 and developers to consider ISA and code generation GCC. Open64 –O3 option provides obviously in the .md (machine description) file. While ABI, better and more stable performance for higher processor information and machine-dependent abstraction levels, comparing with –O2 option. subroutines should be considered in some While GCC has no obvious difference for the machine-dependent header files and C source two options. files [4][5]. These issues are common for both Open64 and GCC. 100 90 80 -O2 3.3.2 Performance Results and 70 60 -O3 50 Discussions 40 second) 30 20 In this subsection, we will present our 10 0 Performance(additions per Performance(additions 0 2 4 6 8 0 2 performance comparison with GCC for the three 1 1 benchmarks mentioned earlier. The performance test number is measured with just the retarget procedure done, (a) Open64 performance it should be emphasized that we did NOT do any n machine-dependent performance tuning of the 65.8 65.6 65.4 Open64 compiler during this exercise. -O2 65.2 -O3 Figure 3.2 shows the performance 65 64.8 comparison for abstraction penalty benchmark s per second) 64.6

Performance(additio 0 2 4 6 8 2 between Open64 and GCC. We can see that 10 1 Open64 provides better performance for both test number –O2 and –O3 options. (b) Gcc performance Figure 3.3 Performance comparison of -O2 and

100 -O3, for abstraction penalty benchmark. 80 60 Open64 Figure 3.4 shows the performance 40 GCC comparison for Stanford benchmark. Open64

per seconds) 20 0 provides better performance than GCC, with Performance (additions 0 2 4 6 8 0 2 1 1 options –O2 and –O3. The vertical axis is Test number normalized execution time to GCC –O2. We can (a) performance comparison with –O2 see that matmul shows dramatically performance increasing for Open64 –O3 option. The reason is

100 that loop tiling is applied, which is effective for 80 matrix multiplication, especially when the 60 Open64 problem size is large. 40 GCC

per seconds) 20 0 2.61 2.05 Performance (additions 0 2 4 6 8 10 12 Test number (b) performance comparison with –O3 Figure 3.2 Performance comparison for abstraction penalty benchmark

Figure 3.3 shows the performance Figure 3.4 Performance of Stanford benchmark

5 Figure 3.5 shows the performance of the assemble code more easily. Open64 and GCC for bzip2 and mcf, from SPEC CPU2000. Open64 and GCC have no obvious 4 Experience and Suggestion performance difference. In this section, we will share our

experience and our suggestion for Open64. We 140 120 hope our experience might be useful for future

100 open64 -O2 developers to port Open64 in a short time. Some 80 gcc -O2 issues are related to the methodology of 60 open64 -O3 gcc -O3 40 retargeting, and some might be trivial but useful.

execution time(sec) execution 20 Experience 1: How to verify the basic

0 issues of the compiler framework step by step bzip2(test) mcf(train) quickly? Our experience is choosing benchmarks Figure 3.5 Performance of bzip2 and mcf. from simple to complex. The following shows

3.3.3 Debuggability our steps. Step 1: taking "hello world"(-O0) as the Open64 provides many supports for startup. If it works well, it means the modules of debugging the compiler, including: 1) dumping front-end \ ir-transformation\ code-generation\ out the program code and symbol table between lib-function-call are reasonable. phases. 2) tracing of optimization and analysis Step 2: focus on the variations of "hello process. 3) builtin options to allow one to world"(-O0). Change data type of the parameters, isolating program file that triggers the bug, and and change the number of parameters. Add some isolate the PU, the expression, basic block, etc control-flow into the test-case, eg, branch and that exhibits the bug. 4) some dump methods loop. If all the variations work well, it means the that can be called in GDB. ABI and variant-as-parameter are reasonable. With these supports, the compiler can be Remember to test for 64-bit immediate values if debugged in an intuitive way. Here are our steps: you are retargetting for a 64-bit platform. 1) use isolation tools to find which PU in which Step 3: turn -O0 into -O2 and -O3 for file causes the bug. 2) read the assemble code of hello-world and its variations. If it works, the that PU carefully, to find out which BB or BBs optimization FRAMEWORK works. But are translated incorrectly. 3) dump out the optimization is a complex issue, and needs more intermediate representations between phases as efforts. a .t file, read the .t file carefully to find out the Step 4: taking "stanford benchmark" for bad optimization phase. 4) Now we can know test (-O0 to -O3). If the suite works well, it which optimization causes which BBs error. And means the compiler works for multiple can use GDB to debug the optimization phase. procedures. And some bugs on code generation Dumping methods can be used in debugger, such and optimizations would be found and fixed as WHIRL dumper, BB dumper, etc. during this step. Also Open64 provides friendly comments Step 5: testing for "abstract penalty" in the generated file, to help the reader benchmark. It can check the c++ frontend, and understand the assemble code. For example, a loop optimizations. loop would be commented with its line number Experience 2: Be familiar with Open64’s in source file, nesting depth, estimated iterations, switches. and unroll times. It clearly showed the loop There are many switches in Open64. If we transformations and helps the reader understand need turn on/off some machine feature, search its

6 corresponding switch first of all. If the switch 5 Non-goals already existed, things would turn much simpler. For example, delay slot, multiply-and-add, We mentioned earlier that we are divide instruction, etc. retargeting for an Open64-MIPS PROTOTYPE. Experience 3: Easy to debug is significant. So there are some issues we have not dealt with, It is not easy to modify all the files and we will discuss these issues in this section. correctly at first. We can fix the bugs by running, Multiple ISA support. As we know, MIPS tracing, debugging the compiler. After familiar has several ISAs, say, MIPS2, MIPS3, MIPS4, with that, we can do extensive changes to those MIPS64, etc. It should be controlled and chosen files. with the target machine, but we did not cover it. Debugging not only fixes bugs, but also In our prototype, only MIPS3 is supported. helps get familiar with the compiler more Dynamic Shared Object (DSO). DSO is a quickly and efficiently. library that is linked in at runtime, and it requires Experience 4: Pay attention to the generation of position independent code (PIC) generated code. and position independent data (PID). We have Some files will not affect the correctness of not tested the compilation for dynamic the compiler if they are not modified execution. appropriately, but they will affect the quality of Production Quality. Production compiler generated code. For example, if the branch cost requires elaborate code reviews and extensive was set inaccurately in CGTARG_Compute_ tests; especially the code generator part, to Branch_Parameters (be/cg//cgtarget.cxx), assure it works well and generates correct code. the code layout of if-statement would not match Much more benchmarks and testsuites would be that of the real machine. Remember to check the needed for test, such as, SuperTest, CPU2000, machine parameters if compiler generates codes Perennial, etc. with poor quality. As a student project, we only focused on Suggestion 1. Can the flags distributed in the framework of the compiler, and the IR multiple files be merged into one file? translation. Code generation modules still have For example, the endianness should be set rooms for improvement. in three files: config_host.c, config_targ.cxx and Optimization Tuning. After retarget common/com/config.cxx. We do not know procedure is done, each optimization should be whether there are other similar instances, but this tested for the new target, to assure that its really brings some inconvenience to the retarget expectation can be reached. procedure. If a flag can be set only in one file, it We only tested for –O2 and –O3 options, would be more facile. without paying attention to the individual Suggestion 2. Can some graphic tools be optimizations. So in the previous sections, we developed for view IR, flow, etc? said our optimization FRAMEWORK is Now the intermediate information is shown reasonable. It does not mean the optimizations in text format, including WHIRL node, control are at its best, especially the peephole flow graph, etc. It is time-consuming and optimizations in the code generator. requires the developer be familiar with all the Machine-dependent Optimization. Extra data structures. If there are some graphic tools, machine-dependent optimization is an important this information can be shown in an intuitive step of the retarget, including the machine way and helps the developers find the bug point parameters, memory hierarchy organizations, etc. more quickly. With these optimizations, the target machine’s resources can be efficiently used.

7 6 Conclusion Backend from Sim-nML Processor Description, master thesis, 2001 This paper shows Open64’s ease of retarget [6] Ming Lin, Zhenyang Yu, Duo Zhang, and good performance that can be obtained after Yunmin Zhu, Shengyuan Wang, Yuan Dong, retarget. Debuggability is also discussed with "Retargeting the Open64 Compiler to PowerPC our retarget procedure. The paper also shares our Processor," icesssymposia,pp.152-157, 2008 retarget experiences, including methodology of International Conference on Embedded Software verifying the compiler framework, attention to and Systems Symposia, 2008 switches in Open64, importance of debugging [7] Juha Haataja, Ville Savolainen, T3E and reading generated code. Also we present User’s Guide, (Center for Scientific Computing, some suggestions for Open64 community. Finland, 1997). [8] The Power Challenge Technical Report. On-line. as. http://www.sgi.com/Products/ hardware/ Power/index.html. Acknowledgement [9] Mike Murphy, 's Experience with Open64, Open64 Workshop at CGO 2008. Sun Chan guided our retarget procedure [10] K. M. Lo and Lin Ma, Quantitative from start, told us the methodology and detailed approach to ISA design and compilation for code steps of retargeting Open64. Fred Chow size reduction, Open64 Workshop at CGO 2008. provided the basic document for retargeting [11] Subrato K De, Anshuman Dasgupta, procedure (section 2). Sun also helped us Sundeep Kushwaha, Tony Linthicum, Susan organize and revise the paper. Brownhill, Sergei Larin, Taylor Simpson, We give our thanks to Professor Guang. R. Development of an Efficient DSP Compiler Gao for giving us this opportunity. Professor Based on Open64, Open64 Workshop at CGO Gao also gave us invaluable encouragements and 2008. discussions during the retarget procedure. We also thank to members from ICT and Simplight for their help and discussions, and thank to the reviewers for their valuable comments.

Reference

[1] Homepage of Open64. http://www.Open64.net/ [2] W. J. Price. A benchmark tutorial. IEEE Micro, 9(5):28--43, Oct. 1988 [3] Alex Stepanov. Abstraction Penalty Benchmark, version 1.2 (KAI). , Incorporated, 199?. [4] Homepage of GCC, the GNU Compiler Collection. http://gcc.gnu.org/ [5] Soubhik Bhattacharya, Generation of GCC

8