Effective Analysis with Minimization

Geet Tapan Telang, Research Engineer, Hitachi India Pvt. Ltd., Bangalore, India. Email: [email protected]
Kotaro Hashimoto, Software Engineer, Hitachi India Pvt. Ltd., Bangalore, India. Email: [email protected]
Krishnaji Desai, Researcher, Hitachi India Pvt. Ltd., Bangalore, India. Email: [email protected]

Abstract: Embedded software developed from open source often carries a substantial amount of code that is ineffective for a given configuration. Such code reduces debugging efficiency, hampers readability for human inspection and enlarges the search space for analysis. In domains such as real-time embedded systems and mission-critical systems, this may cause inefficiency and inconsistencies that lead to lower quality of service, reduced readability and increased verification and validation effort. To mitigate these shortcomings, we propose a method of minimization, with easy-to-use tool support, that leverages preprocessor directives with GCC to cut out disabled #ifdef blocks. Case studies on the Linux kernel tree, BusyBox and industry-strength OSS projects indicate a reduction in lines of code of roughly 5%-22% in the base kernel. This increases the efficiency of analysis, testing and human inspection, which may help in assuring the dependability of systems.

Fig. 1: Overview of minimization.

I. INTRODUCTION

The growth of the Linux operating system and other real-time embedded systems in dependable computing has raised concerns in areas such as mission-critical systems. The size of the code has grown to approximately 20 million lines (Linux), and at this scale and complexity it becomes impossible to follow traditional methods for meeting safety and real-time requirements.

As tools for analysis and testing become more advanced, safety and timing requirements can be verified and validated by capturing evidence and justifications. Most of these tools aim to meet coverage expectations in terms of code, execution times, resources, throughput and fault tolerance, which together define the overall dependability of a system.

Code coverage with different analysis and test tools is therefore a major part of verification and validation. To perform it effectively, we propose a method of minimization devised with functional safety and real-timeliness in mind, aiming at a narrowed search space for verification, fewer false positives, easier human inspection and shorter verification time. The term minimization, as shown in figure 1, signifies the removal of unused code enclosed in #ifdef and #if blocks. Given the target code and its configuration file, the minimization process produces compilable code without the disabled #ifdef and #if blocks and other unused lines of code. This differs from an ordinary build with GCC, where the code being compiled still contains the #ifdef blocks.

The evaluation is exercised on targets such as the Linux kernel source tree and the BusyBox [1] tree, together with a similar quantification for other OSS projects.

II. BACKGROUND

In the OSS domain, many developers contribute towards common goals such as real-time behaviour and safety mitigation, and as a result the source code lacks strict guidelines. Although checks are made with semantic patches [2] and other utilities before source code is committed, no clear coding guidelines are followed that would make the source code easy to inspect and analyze.

The primary problem with OSS code in the embedded domain is the use of preprocessor directives and conditional compilation driven by varying configuration options. The Linux kernel alone has more than 10,000 configuration flags that have to be enabled or disabled and are used in preprocessor directives in the source code [3].

The configurations of OSS code are easy to enumerate and apply through the configuration flags and the configure command. However, the source code still contains all of the conditional compilation code with its preprocessor directives. Too much configuration-dependent conditional code is difficult to inspect and to analyze with static analysis tools. As the number of configuration options increases, their use in the code also increases significantly, resulting in analysis complexity.

To solve this problem, a few tools are built with awareness of preprocessor directives: during source code analysis they read the system configuration and directives in order to analyze relevant code selectively and skip disabled code automatically. Examples are GNU cflow and Coccinelle. The problem is that dead code elimination is not the main purpose of these tools. Preprocessor directives are best handled by the corresponding compiler. Standalone tools cannot do a good job because they do not have the required constructs configured for effective application of conditional compilation. Hence a standardized minimization method is needed that is agnostic to other static analysis tools and to human inspection, and that leverages the compiler technology that can best apply the preprocessor directives.

For this purpose, the GCC [4] based preprocessor is selected for realizing the minimization technique. GCC is chosen because of its widespread use in the OSS community, including Linux and BusyBox.

A. Linux kernel configuration

Configuration plays a very important role in building any software. For OSS it is an essential prerequisite, because the software is developed by different developers with multiple configurations concurrently. The Linux kernel is a classic example of the variety of configurations that can be supported.

The Linux kernel configurations are available in arch/*/configs. To alter the configuration, integrated options such as make config, make menuconfig and make xconfig are available. Once configuration is done, it is saved in the .config file. In .config, an option set to =y indicates that the driver is built into the kernel and =m that it is built as a module; otherwise the option is not selected [5]. The .config file appears as below:

    #General setup
    CONFIG_INIT_ENV_ARG_LIMIT=32
    CONFIG_CROSS_COMPILE=""
    # CONFIG_COMPILE_TEST is not set
    CONFIG_LOCALVERSION=""
    # CONFIG_LOCALVERSION_AUTO is not set
    CONFIG_HAVE_KERNEL_GZIP=y

B. GCC preprocessor configuration

GCC [4] is the compiler of choice, used from utilities up to the operating system level, and has been rigorously tested for several years using tools such as CSMITH [6]. Hence configuration-based preprocessing is best applied with GCC, and it is our choice for realizing the methodology.

The GCC preprocessor implements a macro language that is used to transform C programs before they are compiled. Its output is similar to the input; however, preprocessor directive lines are replaced with blank lines and comments are replaced with spaces, depending on the configuration. Sometimes directives are duplicated in the preprocessor output; the majority of these are #define and #undef, which carry certain debugging options [7].

III. RELATED WORK

The proposed minimization methodology improves static analysis efficiency and eases code inspection. In prior work on source code stripping [8], GCC options are used to tweak preprocessor directives so that conditional compilation code is stripped according to the enabled configurations. This helps in generating .c code in which the conditional directives have been applied to remove the redundant code. The limitation is that the approach is not generalized to a complete source code tree.

One alternative is cflow [9], [10], a GNU tool that can preprocess input files before analyzing them, with the preprocessing option integrated into the tool itself. However, it is difficult to use because all the required preprocessor options need to be copied and pasted for each execution of cflow, which invites errors.

Another alternative is to take the GCC compilation log and, using GREP, tweak the logged options for each file with the required preprocessor options. From this, the required .c and .h files, with code stripped according to the preprocessor configuration options, can be generated. In the subsequent sections, an approach is proposed with Makefile integration and subsequent post-processing for easier code inspection and a narrowed search space.

IV. MINIMIZATION APPROACH

A. Minimization Process

The minimization approach consists of a collection of processes that tweak integrated Makefile options to produce compilable minimized code.

1) Definition: The term minimization signifies an efficient way to obtain a stripped set of source code in which all code that is not required according to the .config file is left out. It is often hard to debug, maintain and verify code because of the macros and preprocessor directives that are expanded during compilation with GCC. Our approach reduces this difficulty: the target source code is freed from the selected preprocessor options and from macro expansion, which reduces the source code. The approach first used for minimization relies on the GREP command. It filters the GCC compiler commands that were actually used and generates, as dedicated output, a source tree consisting only of the useful internal components selected by the .config file.

The GCC compiler: In the general scenario, source code modules are compiled with certain GCC options to make them work [11]. A typical set of GCC options, taken from the kernel build, looks like the following snippet:

    gcc -Wp,-MD,arch/x86/tools/.relocs_32.o.d \
        -Wall -Wmissing-prototypes \
        -Wstrict-prototypes -O2 \
        -fomit-frame-pointer -std=gnu89 \
        -I./tools/include -c -o \
        arch/x86/tools/relocs_32.o arch/x86/tools/relocs_32.c

To reduce unused code, -E -fdirectives-only is added to the options illustrated above to produce human-readable output that is easier to review, debug, maintain and verify.

2) Problem: The major difficulties with the GREP-based approach are listed below:

• It requires a complete build in advance to obtain the full set of used GCC commands written in the build log.
• Text parsing (grep for the gcc commands) is required, and its input has to be acquired from the build log.
• Finally, the source code needs to be modified to remove #include lines.

With the minimization approach, however, code reduction is achieved by executing the minimize.py script, which requires no pre-build, no build log parsing and no code modification.

The minimization approach is implemented as a Python script using a stripping technique [12] with the following properties:

• Elimination of configuration conditionals such as #ifdef, #if and #endif.
• Preservation of #define macros.
• Preservation of #include sentences.

The stripping was first exercised on the Linux kernel source code by tweaking the GCC preprocessor options for the complete kernel source tree.

The preprocess() function of the minimization script takes the gcc options passed via the Makefile as inputs, appends the -E -fdirectives-only flags and preprocesses the target C files.

The next step is the identification and deletion of the expanded header contents that appear in the preprocessed output once the make command is executed. To remove header content, the line markers that exist in the preprocessed file of the kernel source are used as clues. For example:

    # 30 "/usr/include/sys/utsname.h" 2

The stripHeaders() function of the minimization script takes the preprocessed C file, searches for the preprocessor output that corresponds to #include lines and deletes the included contents, guided by the line markers. The file name and line number of #include content are conveyed in the preprocessor output. In the marker # 30 "/usr/include/sys/utsname.h" 2, the number 30 signifies that the following line originates from line 30 of the file utsname.h. The flag 2 signifies returning to that file after another file has been included, whereas flag 1 signifies the start of a new file.

The stripHeaders() algorithm finds the line markers, which have the form # number "file name" followed by optional flags. If the file name is the target C file, it copies the line and checks the flag. If the flag in the line marker is 2, the algorithm marks the position "TO BE REPLACED", which indicates that an #include line was there.

Finally, the #include sentences are restored from the original source code by copying the relevant #include lines. The restoreHeaderInclude() function of the minimization script takes the header-stripped preprocessed file, searches for the "TO BE REPLACED" marks, compares against the original C file and copies in the original #include lines. Once these steps are completed, the only change is the deletion of unused code; #include and #define lines are unchanged.
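The marker-guided stripping and restoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of minimize.py: the function names strip_headers and restore_includes, the placeholder format, and the assumption that each #include occupies exactly the line before the one the preprocessor returns to are all hypothetical choices made for this example.

```python
import re

# GCC line marker: # <line> "<file>" <optional flags>
LINE_MARKER = re.compile(r'^#\s+(\d+)\s+"([^"]*)"((?:\s+\d+)*)\s*$')
# Hypothetical placeholder; it records the original line number of the #include.
PLACEHOLDER = "/* TO BE REPLACED {} */"
PLACEHOLDER_RE = re.compile(r'^/\* TO BE REPLACED (\d+) \*/$')
INCLUDE_RE = re.compile(r'^\s*#\s*include\b')


def strip_headers(preprocessed, target):
    """Keep only lines that originate from `target`; mark expanded includes."""
    out = []
    in_target = False
    for line in preprocessed:
        marker = LINE_MARKER.match(line)
        if marker is None:
            if in_target:
                out.append(line)
            continue
        lineno, filename = int(marker.group(1)), marker.group(2)
        flags = marker.group(3).split()
        in_target = filename == target
        if in_target and "2" in flags:
            # Flag 2: returning to the target after an included file.
            # Assumption: the #include sat on the line just before this one.
            out.append(PLACEHOLDER.format(lineno - 1))
    return out


def restore_includes(stripped, original):
    """Replace each placeholder with the #include line of the original file."""
    out = []
    for line in stripped:
        mark = PLACEHOLDER_RE.match(line)
        if mark:
            idx = int(mark.group(1)) - 1
            if 0 <= idx < len(original) and INCLUDE_RE.match(original[idx]):
                line = original[idx]
        out.append(line)
    return out
```

Recording the line number in the placeholder, rather than matching includes in order, keeps the lookup correct even when the original file contains #include lines inside disabled blocks that the preprocessor never expanded.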

Fig. 2: Minimization technique process flow.
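The preprocessing step, in which the compile flags received from the Makefile are extended with -E -fdirectives-only, might look roughly like the sketch below. The helper name build_preprocess_command is hypothetical, and dropping -c and -o (so that the preprocessed text is not written to the object file path) is an assumption of this example rather than documented behaviour of minimize.py.

```python
def build_preprocess_command(compile_args, source):
    """Turn the arguments of a compile command into a preprocess-only command."""
    kept = []
    skip_next = False
    for arg in compile_args:
        if skip_next:
            # This is the file name that followed "-o".
            skip_next = False
            continue
        if arg == "-c" or arg == source:
            continue
        if arg == "-o":
            skip_next = True
            continue
        kept.append(arg)
    # -E stops after preprocessing; -fdirectives-only handles directives
    # but leaves macros unexpanded.
    return ["gcc"] + kept + ["-E", "-fdirectives-only", source]
```

The resulting list can be passed to subprocess.run to capture the preprocessed text, which is then the input to the header-stripping step.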

3) Solution: As depicted in figure 2, the Makefile is inherited and its CHECK option is tweaked: the existing CHECK feature in the kernel Makefile is replaced with the minimize.py script, which performs the minimization on the fly in a single execution pass. During the make process, minimize.py receives exactly the same options as the compile flags of each source file, along with the $CHECKFLAGS variable. The snippet below shows the on-the-fly approach:

    $ make C=1 CHECK=minimize.py CF="-mindir ../minimized-tree/"

The preprocessing step processes the source files with the options gcc -E -fdirectives-only. This command removes disabled #ifdef blocks and expands #include lines while preserving #define macros.

Fig. 3: Code reduction through minimization technique.

Figure 3 illustrates the minimized code after the make process, where the #ifdef and #if blocks are removed. The minimization script minimize.py does not support minimization of include files. The reason for this exclusion is that a single include file is referred to from multiple C files, and the resulting minimized include file would not be identical for all of them. If the compile options of the C files differ, the effective definitions at compile time differ too, and this changes which #ifdef blocks in the included file are active.

B. Minimization methodology

1) Prerequisites: The script developed for the minimization approach requires the following commands on the host machine:

• diffstat
• diff
• echo
• file
• gcc (and the other tools required to build the Linux kernel or BusyBox)
• python (both 2.x and 3.x are supported by the minimization technique)

2) Usage: The minimization script is executed as follows:

1) Navigate to the source directory. Example:
    $ cd linux-4.4.9
2) Copy minimize.py to the kernel directory.
3) Prepare the configuration by tuning the .config file and storing it in the kernel tree directory. The .config file can also be generated by executing a make command. Example:
    $ make allnoconfig
4) Add the script directory to the path. For example:
    $ export PATH=$PATH:`pwd`
5) Execute make with the following CHECK options:
    $ make C=1 CHECK=minimize.py CF="-mindir ../minimized-tree/"

The value C=1 performs minimization only for files that are (re)compilation targets. C=2 performs minimization for all source files, regardless of whether they are compilation targets or not. The output directory is specified with the -mindir option in the CF flag.

Minimization is also applicable to sub-target sources. For example:

    $ make drivers C=1 CHECK=minimize.py CF="-mindir ../minimized-tree/"

In addition, the script is written so that, on successful execution, compilation and minimization are performed at the same time and the minimized source tree is generated under the directory ../minimized-tree/. Note that only the target C source files are minimized. The contents of other files (included headers etc.) remain as they are.

V. RESULTS

The minimization technique [12] has been evaluated on the Linux kernel and the BusyBox tree. The experiment measures the reduction obtained by executing the minimization technique on the original code base.

The evaluation was run on the following hardware: Processor: 3600 MHz, 64-bit width, 8 cores. Memory: 7891 MiB. Architecture: x86_64.

It was performed by comparing different configurations of the target source, in particular "allnoconfig" and "defconfig". The motivation for using different configurations is to check that minimization behaves as expected:

• With "allnoconfig", most features are disabled. This means a substantial number of disabled #ifdef blocks and therefore a large amount of code reduction, which leads to a higher minimization ratio.
• With "defconfig", only a part of the features are disabled, which leads to fewer disabled #ifdef blocks and less code reduction. Hence the reduction with "defconfig" is expected to be lower than with "allnoconfig".

A. Linux Kernel

Applying minimization to the Linux kernel with the "allnoconfig" and "defconfig" options removes a substantial amount of unnecessary code, as shown in figure 4. The metrics are as follows:

• allnoconfig: 64684 unused lines were removed from the kernel source, which constitutes around 22% of the original C code.
• defconfig: 103144 unused lines were removed from the kernel source, which comprises about 5% of the original C code.

Fig. 4: Minimization technique execution on Linux Kernel.

The minimization script minimize.py works not only with these limited configurations but also with other customized ones, including the PREEMPT_RT patch.

B. BusyBox Tree

Executing the minimization technique on the BusyBox tree with the "allnoconfig" and "defconfig" configuration options gives the following reduction metrics:

• allnoconfig: 51 out of 112 compiled C files were minimized. 5945 unused lines (34% of the original C code) were removed.
• defconfig: 296 out of 505 compiled C files were minimized. 20453 unused lines (11% of the original C code) were removed.

TABLE I: Complexity metrics in original and minimized targets.

    Target        Source                      Average line score   50%-ile score   Highest score
    Linux Kernel  Original                    23                   4               1846
    Linux Kernel  Minimized (x86 defconfig)   7                    3               194
    Linux Kernel  Minimized (allnoconfig)     5                    2               158
    BusyBox Tree  Original                    22                   9               283
    BusyBox Tree  Minimized (x86 defconfig)   21                   9               283
    BusyBox Tree  Minimized (allnoconfig)     19                   5               283
    PREEMPT_RT    Original                    10                   4               530
    PREEMPT_RT    Minimized                   7                    3               194

C. Quantification of other OSS projects

Apart from the Linux kernel and the BusyBox tree, the #ifdef and #if blocks that could potentially be removed from the open-source ARCTIC Core source code [13] have been quantified and compared with the Linux kernel. The motive is to quantify how beneficial the minimization approach can be for OSS projects such as ARCTIC Core.

The quantification is carried out by counting the total number of #ifdef and #if blocks and calculating its ratio to the total lines of code, as below:

    Total number of lines in all C files of Arctic Core source code = 407994 lines.
    Total number of #ifdef existing = 12744.
    Number of lines that can be reduced = 12744/407994*100 = 3.12%

Similarly, in the Linux kernel:

    Total number of lines in all C files = 15086494 lines.
    Total number of #ifdef existing = 85728.
    Number of lines that can be reduced = 85728/15086494*100 = 0.568%

The statistics above indicate that the chance of eliminating unused #ifdef switches is approximately 5.5 times higher in Arctic Core. This can be stated as a possible advantage of the minimization technique; however, the port to Arctic Core is yet to be realised.

VI. EVALUATION

A. Complexity statistics

To analyze the complexity of C functions, Linux with the PREEMPT_RT patch, the Linux kernel source and the BusyBox tree were evaluated by comparing the complexities of the C functions in the minimized and the original source code of each target. The statistics were acquired using the "Complexity" tool [14].

The complexity tool was used because it gives an idea of how much effort may be required to understand and maintain the code. The higher the score, the more complex the procedure. The minimized code shows comparably lower complexity scores, which signifies that it is easier to read and maintain [14].

Table I lists the measured complexities of the original and minimized targets (Linux kernel, BusyBox tree and PREEMPT_RT kernel). For the Linux kernel and BusyBox, the allnoconfig and x86_defconfig configurations were evaluated for the minimized code. The minimized code shows decreased, or at worst unchanged, complexity in terms of average line score, 50%-ile score and highest score in all three targets; the BusyBox highest score stays at 283.

B. Verification for the minimized built binary

The disassembled code ("objdump -d") of the binaries built from the minimized and from the original source code matches. The configurations and targets confirmed for BusyBox and the Linux kernel are as below:

• BusyBox-1.24.1: Checked configuration options include defconfig and allnoconfig.
• Linux kernel-4.4.1: Verified configuration option is allnoconfig.

VII. BENEFITS

A. Verification time and cost improvement

To assess the improvement in verification time, static analysis was run on the original and on the minimized kernel source tree and the results were compared. The analysis used Coccinelle, a program matching and transformation engine for C code, for which many semantic patches have been submitted to the mainline kernel repository [15], [16]. The verification executed a semantic patch [2] that detects functions whose declared return type differs from the type actually returned, scanning the source files (*.c and *.h) that are referred to from init/main.c in the kernel tree. The timing results of the static verification are given below:

    Average spatch execution time:
    Original Kernel Source: 12.37[s]
    Minimized Kernel Source: 2.24[s]

The minimized tree is analyzed around 5.5 times faster than the original kernel source tree.

B. False Positive reduction

A false positive is a test result that wrongly indicates that a particular condition or attribute is present. To mitigate such situations, static analysis [15] was conducted on the original and on the minimized kernel source tree. The number of meaningless detections was reduced in the minimized kernel source as follows:

    Number of detections using Coccinelle:
    Original Kernel Source: 126
    Minimized Kernel Source: 82

C. Easy Code Inspection

The minimization technique generates easy-to-read source code with the following properties:

• Unused #ifdef and #if blocks are removed.
• #include and #define lines are preserved.
• The same binary file is produced as from the original source tree.

D. Pruning function call graph

During analysis, every possible call path has to be identified in order to establish and trace the relationships between the program and its subroutines. A call graph is a directed graph that represents these relationships [17]. For the original source, the call graph displays every function call regardless of #ifdef switches, which results in a substantially complex graph that is difficult to trace. With the minimization technique, the call graph shows only the function calls that are used, thereby providing a minimized search space. Figure 5 illustrates the call graph before and after minimization.

Fig. 5: Call graph for Linux kernel before (left) and after (right) minimization.

With minimization, the number of nodes is reduced from 94 to 85 and the number of edges from 140 to 123, hence a narrower search space.

E. Extracting minimal subtarget sources

To identify easily which files in the source tree are used, for an efficient software walk-through, a subtarget can be specified in the minimization command. Minimization then extracts only the used source files. The following snippet adds the subtarget init to the minimization command:

    $ make init C=2 CHECK=minimize.py CF="-mindir ../min-init"

This extracts only the source files used by the subtarget, as shown in figure 6.

Fig. 6: Depended *.c files of Linux kernel in minimized form. Actually included *.h files.

VIII. CONCLUSION

The minimization technique helps substantially in improving the readability of source code, which results in efficient code review and inspection. It helps in narrowing down the search space by giving evidence for unused code. The technique has been evaluated on target platforms such as the Linux kernel, the BusyBox tree and the PREEMPT_RT kernel. A code reduction of approximately 5% or more, depending on the configuration, is achieved across the PREEMPT_RT Linux kernel, the Linux kernel and the BusyBox software. From an analysis standpoint, this provides essential benefits such as a reduction in verification time (spatch execution) from 12.37[s] on the original kernel source to 2.24[s] on the minimized kernel, and a reduction in false positives, where the number of detections reported by Coccinelle (static analysis) falls from 126 to 82.

This helps in application domains such as automotive, railways and industry. Future work for the minimization technique includes extension to other compilers such as LLVM [18], adaptation to architectures such as ARM, and support for various build systems, e.g. CMake and automake. Binary equivalence has been checked; however, establishing formal equivalence between the original and the minimized source code tree remains future work. The source code is available at GitHub [12].

REFERENCES

[1] E. Andersen. Busybox. [Online]. Available: https://www.busybox.net
[2] W. Sang. Evolutionary development of a semantic patch using coccinelle. [Online]. Available: http://lwn.net/Articles/380835/
[3] S. Zhou, J. Al-Kofahi, T. N. Nguyen, C. Kästner, and S. Nadi, "Extracting configuration knowledge from build files with symbolic analysis," in Proceedings of the Third International Workshop on Release Engineering, ser. RELENG '15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 20–23. [Online]. Available: http://dl.acm.org/citation.cfm?id=2820690.2820700
[4] GCC-Team. Gcc, the gnu compiler collection. Free Software Foundation, Inc. [Online]. Available: https://gcc.gnu.org/
[5] D. Gilbert. (2003, August) The linux 2.4 scsi subsystem howto. Linux Document Project. [Online]. Available: http://www.tldp.org/HOWTO/SCSI-2.4-HOWTO/kconfig.html
[6] Y. Xuejun, C. Yang, E. Eric, and R. John. Csmith. [Online]. Available: https://embed.cs.utah.edu/csmith/
[7] GNU-Manual, The C preprocessor, GNU. [Online]. Available: https://gcc.gnu.org/onlinedocs/cpp/index.html
[8] StackOverflow. Strip linux kernel sources according to .config. [Online]. Available: http://stackoverflow.com/questions/7353640/strip-linux-kernel-sources-according-to-config
[9] S. Poznyakoff. Gnu cflow. GNU. [Online]. Available: http://www.gnu.org/software/cflow/
[10] A. Younis, Y. K. Malaiya, and I. Ray, "Assessing vulnerability exploitability risk using software properties," Software Quality Journal, vol. 24, no. 1, pp. 159–202, 2016. [Online]. Available: http://dx.doi.org/10.1007/s11219-015-9274-6
[11] P. J. Salzman. The linux kernel module programming guide. [Online]. Available: http://www.tldp.org/LDP/lkmpg/2.4/html/x208.html
[12] K. Hashimoto, The Minimization script. [Online]. Available: https://github.com/Hitachi-India-Pvt-Ltd-RD/minimization
[13] ArcticCore. Arctic core autosar 3.1 repositories. [Online]. Available: www.arccore.com/resources/repositories
[14] B. Korb. Measure complexity of c source. [Online]. Available: https://www.gnu.org/software/complexity/manual/complexity.html
[15] G. Muller. (2015, October) Coccinelle. INRIA; LIP6; IRILL. [Online]. Available: http://coccinelle.lip6.fr/documentation.php
[16] V. Nossum. Impact on the linux kernel. [Online]. Available: http://coccinelle.lip6.fr/impact_linux.php
[17] G. Kaszuba. Python call graph. [Online]. Available: http://pycallgraph.slowchop.com/en/master/guide/intro.html
[18] The llvm compiler infrastructure. [Online]. Available: http://www.llvm.org