Retargeting Open64 to A RISC processor
-- A Student’s Perspective
Huimin Cui, Xiaobing Feng Key Laboratory of Computer System and Architecture, Institute of Computing Technology, CAS, 100190 Beijing, China {cuihm, fxb}@ict.ac.cn
compiler [6]. Abstract Open64 has been retargeted to a number of This paper presents a student’s experience architectures. Pathscale modified Open64 to in Open64-Mips prototype development, we create EkoPath, a compiler for the AMD64 and summarize three retargeting observations. X8664 architecture. The University of Open64 is easy to be retargeted and the Delaware's Computer Architecture and Parallel procedure takes only a short period. With the Systems Laboratory (CAPSL) modified Open64 retarget procedure done, the compiler can to create the Kylin Compiler, a compiler for achieve good and stable performance. Open64 Intel's X-Scale architecture [1]. Besides the also provides many supports for debugging, with targets mentioned above, there are several other which a beginner can debug the compiler supported targets including PowerPC [6], without difficulty. We also share some NVISA [9], Simplight [10] and Qualcomm[11]. experiences of our retarget, including In this paper, we will discuss three issues methodology of verifying the compiler when retargeting Open64: ease of retarget, framework, attention to switches in Open64, performance, and debuggability. The discussion importance of debugging and reading generated is based on our work in retargeting Open64 to code. the MIPS platform, using the Simplight branch as our starting point, which is a RISC style DSP 1 Introduction with mixed 32/16 bit instruction set. During our discussion, we will use GCC (Mips target) for Open64 receives contributions from a comparison. This retarget work is just a student number of compiler groups around the world, project, so we have some non-goals, which will from industry as well as from academia. It is be discussed in section 5. derived from the SGI MIPSPro64 compiler [1]. And we will also share some of our Good performance and industrial strength origin retargeting experiences, in four folds: 1) Steps to make the Open64 compiler a popular choice for verify the compiler framework. 2) Turn on/off research projects. The Open64 compiler supports switches for special hardware features. 3) C/C++ and Fortran95 languages and many Debugging is very important, not only for researchers from industry and academia have retarget, but also for later extensive work. 4) Pay contributed to the retargeting of the Open64 attention to the generated code if it doesn’t work
1 as expected. isa/
2 INTRINSIC_ID, property of consistency. intrinsic (order is important, config_cache_targ.cxx sets cache models. indexed by INTRINSIC_ID) Targ_em_* files are for assembly output FE definition files, kgccfe/gnu control, section types, relocations, dwarf etc. Builtins.def – id, name, One only need to change these files when focus prototype, attribute in BE portion. Builtin-types.def – needed if new Basically one needs to tailor files type is needed mentioned about and then continue building Wfe_expr.cxx – translate GNU components as follows. builtin to WHIRL. (4). Build gccfe (G). Create the following files: (5). Build g++fe common/com/
3 (11). Build driver 3.2 Main Results (12). Build ipl (13). Build lno Observation 1 (See also section 3.3.1). A (14). Build inline student can complete the retarget procedure in (15). Build whirl2c about two months, if an appropriate supported (16). Build ipa target processor can be used to start the retargeting. In our case, the Simplight SL1 3 Retarget Results & Discussion processor is a RISC based processor with 16 bit instructions and DSP extensions. We chose this 3.1 Machine Assumption & as our starting point for its RISC basis as ISA. benchmarks Observation 2 (See also section 3.3.2). Performance of the generated Open64 compiler Machine Assumption: is satisfying, in two folds. Processor: Loongson 2f, out of order 1. We expect the retargeted Open64 ISA: MIPS3 compiler will generate very good code due to the Frequency: 666MHz excellent middle ends such as Wopt and Lno Issues: 4 components. Indeed, when we achieved ALU: 2 competitive or better performance compared to a FALU: 2 matured GCC compiler, for all the benchmarks MEM Unit: 1 we tested as a result of our effort. It is customary to use SPEC CPU2000 or 2. Open64 provides obvious performance 2006 to evaluate a compiler performance, but improvement with optimization option switched that is our non-goal (discussed in section 5). As a from –O2 to –O3, especially for C++ programs student project, we only consider the following with higher abstraction levels. benchmarks: Stanford benchmark, the Observation 3 (See also section 3.3.3). abstraction penalty benchmark written by The debuggability of Open64 compiler and its Stepanov [3], two programs chosen from SPEC generated code is good, it is not difficult for a CPU2000 - bzip2 and mcf. beginner to learn and use. Stanford benchmark [2] consists 7 small programs: hanoi, bubble, matmul, perm, qsort, 3.3 Detailed Discussions queen, and sieve. The abstraction penalty benchmark summarizes the characteristics of 3.3.1 Ease of Retarget generic programs; it measures the performance Following the suggested retarget procedure, under different levels of abstraction for C++ we completed our retargetting work in two programs. The baseline is a non-generic version months. In a short period, our compiler has of the kernel function, as shown by Figure 3.1. passed all the benchmarks mentioned in section 3.1, with option from –O0 to -O3. It means that double data[SIZE]; … … our compiler framework is reasonable and the for(int i = 0; i < iterations; ++i) { double result = 0; major components work well. for (int n = 0; n < last - first; ++n) As we know, compiler development needs result += first[n]; check(result); to deal with the issues of ISA, ABI, code } generation, and some machine-dependent Figure 3.1 non-generic version in abstract functions or methods. All these issues are penalty benchmark covered in the suggested retarget procedure, as presented in Section 2.
4 Porting GCC is similar; it requires the comparison of -O2 and -O3 for Open64 and developers to consider ISA and code generation GCC. Open64 –O3 option provides obviously in the .md (machine description) file. While ABI, better and more stable performance for higher processor information and machine-dependent abstraction levels, comparing with –O2 option. subroutines should be considered in some While GCC has no obvious difference for the machine-dependent header files and C source two options. files [4][5]. These issues are common for both Open64 and GCC. 100 90 80 -O2 3.3.2 Performance Results and 70 60 -O3 50 Discussions 40 second) 30 20 In this subsection, we will present our 10 0 Performance(additions per Performance(additions 0 2 4 6 8 0 2 performance comparison with GCC for the three 1 1 benchmarks mentioned earlier. The performance test number is measured with just the retarget procedure done, (a) Open64 performance it should be emphasized that we did NOT do any n machine-dependent performance tuning of the 65.8 65.6 65.4 Open64 compiler during this exercise. -O2 65.2 -O3 Figure 3.2 shows the performance 65 64.8 comparison for abstraction penalty benchmark s per second) 64.6
Performance(additio 0 2 4 6 8 2 between Open64 and GCC. We can see that 10 1 Open64 provides better performance for both test number –O2 and –O3 options. (b) Gcc performance Figure 3.3 Performance comparison of -O2 and
100 -O3, for abstraction penalty benchmark. 80 60 Open64 Figure 3.4 shows the performance 40 GCC comparison for Stanford benchmark. Open64
per seconds) 20 0 provides better performance than GCC, with Performance (additions 0 2 4 6 8 0 2 1 1 options –O2 and –O3. The vertical axis is Test number normalized execution time to GCC –O2. We can (a) performance comparison with –O2 see that matmul shows dramatically performance increasing for Open64 –O3 option. The reason is
100 that loop tiling is applied, which is effective for 80 matrix multiplication, especially when the 60 Open64 problem size is large. 40 GCC
per seconds) 20 0 2.61 2.05 Performance (additions 0 2 4 6 8 10 12 Test number (b) performance comparison with –O3 Figure 3.2 Performance comparison for abstraction penalty benchmark
Figure 3.3 shows the performance Figure 3.4 Performance of Stanford benchmark
5 Figure 3.5 shows the performance of the assemble code more easily. Open64 and GCC for bzip2 and mcf, from SPEC CPU2000. Open64 and GCC have no obvious 4 Experience and Suggestion performance difference. In this section, we will share our
experience and our suggestion for Open64. We 140 120 hope our experience might be useful for future
100 open64 -O2 developers to port Open64 in a short time. Some 80 gcc -O2 issues are related to the methodology of 60 open64 -O3 gcc -O3 40 retargeting, and some might be trivial but useful.
execution time(sec) execution 20 Experience 1: How to verify the basic
0 issues of the compiler framework step by step bzip2(test) mcf(train) quickly? Our experience is choosing benchmarks Figure 3.5 Performance of bzip2 and mcf. from simple to complex. The following shows
3.3.3 Debuggability our steps. Step 1: taking "hello world"(-O0) as the Open64 provides many supports for startup. If it works well, it means the modules of debugging the compiler, including: 1) dumping front-end \ ir-transformation\ code-generation\ out the program code and symbol table between lib-function-call are reasonable. phases. 2) tracing of optimization and analysis Step 2: focus on the variations of "hello process. 3) builtin options to allow one to world"(-O0). Change data type of the parameters, isolating program file that triggers the bug, and and change the number of parameters. Add some isolate the PU, the expression, basic block, etc control-flow into the test-case, eg, branch and that exhibits the bug. 4) some dump methods loop. If all the variations work well, it means the that can be called in GDB. ABI and variant-as-parameter are reasonable. With these supports, the compiler can be Remember to test for 64-bit immediate values if debugged in an intuitive way. Here are our steps: you are retargetting for a 64-bit platform. 1) use isolation tools to find which PU in which Step 3: turn -O0 into -O2 and -O3 for file causes the bug. 2) read the assemble code of hello-world and its variations. If it works, the that PU carefully, to find out which BB or BBs optimization FRAMEWORK works. But are translated incorrectly. 3) dump out the optimization is a complex issue, and needs more intermediate representations between phases as efforts. a .t file, read the .t file carefully to find out the Step 4: taking "stanford benchmark" for bad optimization phase. 4) Now we can know test (-O0 to -O3). If the suite works well, it which optimization causes which BBs error. And means the compiler works for multiple can use GDB to debug the optimization phase. procedures. And some bugs on code generation Dumping methods can be used in debugger, such and optimizations would be found and fixed as WHIRL dumper, BB dumper, etc. during this step. Also Open64 provides friendly comments Step 5: testing for "abstract penalty" in the generated file, to help the reader benchmark. It can check the c++ frontend, and understand the assemble code. For example, a loop optimizations. loop would be commented with its line number Experience 2: Be familiar with Open64’s in source file, nesting depth, estimated iterations, switches. and unroll times. It clearly showed the loop There are many switches in Open64. If we transformations and helps the reader understand need turn on/off some machine feature, search its
6 corresponding switch first of all. If the switch 5 Non-goals already existed, things would turn much simpler. For example, delay slot, multiply-and-add, We mentioned earlier that we are divide instruction, etc. retargeting for an Open64-MIPS PROTOTYPE. Experience 3: Easy to debug is significant. So there are some issues we have not dealt with, It is not easy to modify all the files and we will discuss these issues in this section. correctly at first. We can fix the bugs by running, Multiple ISA support. As we know, MIPS tracing, debugging the compiler. After familiar has several ISAs, say, MIPS2, MIPS3, MIPS4, with that, we can do extensive changes to those MIPS64, etc. It should be controlled and chosen files. with the target machine, but we did not cover it. Debugging not only fixes bugs, but also In our prototype, only MIPS3 is supported. helps get familiar with the compiler more Dynamic Shared Object (DSO). DSO is a quickly and efficiently. library that is linked in at runtime, and it requires Experience 4: Pay attention to the generation of position independent code (PIC) generated code. and position independent data (PID). We have Some files will not affect the correctness of not tested the compilation for dynamic the compiler if they are not modified execution. appropriately, but they will affect the quality of Production Quality. Production compiler generated code. For example, if the branch cost requires elaborate code reviews and extensive was set inaccurately in CGTARG_Compute_ tests; especially the code generator part, to Branch_Parameters (be/cg/
7 6 Conclusion Backend from Sim-nML Processor Description, master thesis, 2001 This paper shows Open64’s ease of retarget [6] Ming Lin, Zhenyang Yu, Duo Zhang, and good performance that can be obtained after Yunmin Zhu, Shengyuan Wang, Yuan Dong, retarget. Debuggability is also discussed with "Retargeting the Open64 Compiler to PowerPC our retarget procedure. The paper also shares our Processor," icesssymposia,pp.152-157, 2008 retarget experiences, including methodology of International Conference on Embedded Software verifying the compiler framework, attention to and Systems Symposia, 2008 switches in Open64, importance of debugging [7] Juha Haataja, Ville Savolainen, Cray T3E and reading generated code. Also we present User’s Guide, (Center for Scientific Computing, some suggestions for Open64 community. Finland, 1997). [8] The Power Challenge Technical Report. On-line. as. http://www.sgi.com/Products/ hardware/ Power/index.html. Acknowledgement [9] Mike Murphy, NVIDIA's Experience with Open64, Open64 Workshop at CGO 2008. Sun Chan guided our retarget procedure [10] K. M. Lo and Lin Ma, Quantitative from start, told us the methodology and detailed approach to ISA design and compilation for code steps of retargeting Open64. Fred Chow size reduction, Open64 Workshop at CGO 2008. provided the basic document for retargeting [11] Subrato K De, Anshuman Dasgupta, procedure (section 2). Sun also helped us Sundeep Kushwaha, Tony Linthicum, Susan organize and revise the paper. Brownhill, Sergei Larin, Taylor Simpson, We give our thanks to Professor Guang. R. Development of an Efficient DSP Compiler Gao for giving us this opportunity. Professor Based on Open64, Open64 Workshop at CGO Gao also gave us invaluable encouragements and 2008. discussions during the retarget procedure. We also thank to members from ICT and Simplight for their help and discussions, and thank to the reviewers for their valuable comments.
Reference
[1] Homepage of Open64. http://www.Open64.net/ [2] W. J. Price. A benchmark tutorial. IEEE Micro, 9(5):28--43, Oct. 1988 [3] Alex Stepanov. Abstraction Penalty Benchmark, version 1.2 (KAI). Silicon Graphics, Incorporated, 199?. [4] Homepage of GCC, the GNU Compiler Collection. http://gcc.gnu.org/ [5] Soubhik Bhattacharya, Generation of GCC
8