Cslim: Automated Extraction of IoT Functionalities from Legacy C Codebases

Hyogi Sim†,∗, Arnab K. Paul∗, Eli Tilevich∗, Ali R. Butt∗, Muhammad Shahzad‡

Oak Ridge National Laboratory†, Virginia Tech∗, North Carolina State University‡ [email protected],{akpaul,tilevich,butta}@vt.edu,[email protected] ABSTRACT CCS CONCEPTS Many Internet of Things (IoT) devices are resource-poor, •Software and its engineering → Embedded soft- possessing limited memory, disk space, and processor ware; Maintaining software; Software usability; capacity. To accommodate such resource scarcity, IoT software cannot include any extraneous functionalities KEYWORDS not used in operating the underlying device. Although Software Engineering, IoT legacy systems software contains numerous functionali- ties that can be reused in IoT applications, these func- ACM Reference format: Hyogi Sim†,∗, Arnab K. Paul∗, Eli Tilevich∗, Ali R. Butt∗, Muhammad tionalities are exposed as part of a larger codebase with Shahzad‡. 2019. Cslim: Automated Extraction of IoT Functionalities from multiple complex dependencies and a heavy runtime Legacy C Codebases. In Proceedings of International Conference on Dis- footprint. To enable programmers to eectively reuse tributed Computing and Networking, Bangalore, India, January 4–7, 2019 (ICDCN ’19), 6 pages. extant systems software in IoT applications, this paper DOI: 10.1145/3288599.3296013 presents Cslim, a cross-package function extraction tool for C. Cslim extracts programmer-speci functions from a source package and generates new source les 1 INTRODUCTION for a target package, thereby enabling the reuse of sys- The C language is the lingua franca of systems software. tems software in resource-poor execution environments, This language naturally ts the implementation require- such as the IoT devices. Cslim resolves all dependen- ments of various low-level system components, such cies by recursively extracting required functions, while as operating systems and rmware, which put an em- bypassing the complexities of preprocessor macro vari- phasis on increasing execution eciency and reducing abilities by operating on preprocessed source les. Fur- runtime footprint. C programs can be compiled into a thermore, Cslim eciently traverses and resolves the minimal number of machine instructions that can be calling dependencies by maintaining an in-memory re- deployed in compact binaries. In particular, minimiz- lational database. Finally, Cslim is easy to use, as it ing the executable binary code size becomes crucial in requires neither manual intervention nor source code restricted hardware and software environments (e.g., modications. Our prototype implementation of Cslim memory and storage capacity, available shared libraries, has successfully extracted a set of functions from SQLite etc.). Notably, an emerging trend of Internet of Things and GlusterFS, producing slimmed down executables (or IoT) [14] envisions intelligent and connected phys- that can be deployed on IoT devices. ical objects, from small hand-held devices to vehicles and large buildings. Undoubtedly, the binary code size can negatively impact the runtime performance, power usage, and building cost of small-scale IoT devices. ACM acknowledges that this contribution was authored or co- One of the peculiarities of C programming is a lack authored by an employee, or contractor of the national government. of standard libraries that represent common data con- As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for tainers and algorithms to facilitate the development pro- Government purposes only. Permission to digital or hard copies cess [6]. In C, even a string is simply a null terminated for personal or classroom use is granted. Copies must bear this notice array of characters that must be explicitly allocated. and the full citation on the rst page. Copyrights for components of Higher-level libraries, such as glib [5], and qt [8], intro- this work owned by others than ACM must be honored. To copy other- duce a space deployment overhead, as they require the wise, distribute, republish, or post, requires prior specic permission and/or a fee. Request permissions from [email protected]. shipment of an entire shared library to be able to use ICDCN ’19, Bangalore, India only a single module, such as a hash table; this overhead © 2019 ACM. 978-1-4503-6094-4/19/01...$15.00 can be prohibitive for deployments in small or restricted DOI: 10.1145/3288599.3296013 environments. To eliminate the overhead, programmers manually extract functions of interest from other soft- 2 DESIGN OF CSLIM ware packages, and modify them accordingly to t a Cslim has been designed to conform to the following target software package. Oftentimes, however, such criteria, to ensure its practicality. desired functions are chained in a complex calling de- No manual source code modication. Requiring pendency in the package, rendering manual extraction modications to existing source may hurt ongoing de- tasks tedious and error-prone. Software refactoring— velopment productivity and code maturity. Moreover, behavior-preserving code transformations [19]—can au- requiring source code modication prevents the inclu- tomatically extract these desired functions. Refactoring sion of new source packages, decreasing the tool’s adapt- can address the needs of the IoT community by reusing ability. the already stable and tested infrastructure to build ap- No manual processing. Semi-automated refactoring plications for IoT devices. can unreasonably burden developers, particularly for In this paper, we present Cslim, a cross-package func- larger codebases. Therefore, it is essential to obviate the tion extraction tool for C. As an input, programmers need for human intervention to ensure practicality. only need to specify a list of functions to be extracted Ease of maintenance. Individual source packages can from the source package. Once the function list is pro- always be updated (e.g., to x bugs, to add new features, vided, Cslim rst scans the source package, analyzes the etc.) after the needed functions have been extracted. To calling dependencies, and creates a reference database. accommodate such updates, the framework should sup- It then recursively resolves the calling dependencies, port incremental updates that free programmers from and calculates the nal list of functions in the correct hand-operated and error-prone manual patching. order to appear in the new source les. Finally, Cslim Figure 1 shows the overview of Cslim. The user spec- generates new .h and .c les, that are self-contained ies the list of functions to extract from the source pack- and thus can be embedded in other software packages. age, which we refer to as target functions. Cslim rst Cslim sidesteps the complexity of handling C prepro- bootstraps the source package, primarily to avoid com- cessor macros by operating on the output les of the C plications of preprocessor macro variabilities. After the preprocessor. As stated above, Cslim targets restricted bootstrapping, the source les in the source package are environments, such as IoT devices. Consequently, inject- scanned and all calling dependencies between functions ing static package congurations eliminates variabilities, are analyzed. Cslim stores the analysis results in the without compromising the ecacy and practicability of reference database. Next, any calling dependencies in Cslim. Furthermore, Cslim only manipulates the pack- target functions are resolved by consulting the reference age source code, thus being architecture independent. database. Finally, Cslim generates the self-contained To evaluate our prototype of Cslim, we used SQLite [9] target source les ready to be embedded in the target as a case because of its popularity in Android appli- package. We next detail each step of Cslim in turn. cation development. SQLite is preferred in small devices, due to its lightweight nature and single-tier database architecture. SQLite was designed to provide local data 2.1 Bootstrapping the source package storage for individual devices and applications. There- To start extracting functions from a source package, fore, SQLite databases require little administration, mak- we rst bootstrap the source software package in two ing them particularly well-suitable for devices that need phases: conguring the source package and running the to operate without expert human support, such as those C preprocessor. In most cases, conguring a source pack- used in the “internet of things.” In addition to SQLite, age requires running its configure script, which takes we also evaluate Cslim by successfully extracting func- input ags that specify the target device architecture and tions from GlusterFS [10], another open-source C pack- other building options, and generates appropriate build age. GlusterFS is a scalable, distributed le system that scripts. The output build scripts contain all necessary is well-suited for data-intensive tasks, such as media information to resolve the variabilities of preprocessor streaming and cloud storage. macros. Then, the bootstrapping is completed by run- The remainder of the paper is organized as follows. ning the build scripts (e.g., Makefile), but only up to the We explain our design (§ 2), and implementation (§ 3) C preprocessor, as Cslim works the source code level. of Cslim, followed by our initial experience and nd- The output source les, with all preprocessor macros ings from our prototype (§ 4). We also present related stripped, can be further processed by Cslim. research work (§ 6), and then our conclusion and some The preprocessor macros are known to signicantly future directions (§ 7). complicate the analysis and transformation of C pro- grams [12, 13, 20]. However, our bootstrapping step eliminates the need for Cslim to directly address these User Input Target Functions ...

Software Package Target Package

Bootstrap Dependency Dependency Code filefile target sourcefilefile Source Analysis Resolve Generation file file file

Reference Database

- Package Configuration - Source Analysis (cscope) - Recursive extraction - .c .h files generation - C Preprocessor - Calling dependencies - Resolve dependencies - Build script

Figure 1: Overview of the code extraction framework in Cslim. Based on the target function list, Cslim analyzes the source package, resolves the calling dependencies in target functions, and generates the target source les. complications. This shortcut should not aect the practi- location, including the source le name and the line use of Cslim, given our target development environ- number. The Calls table records all calling occurrences ments, which possess limited computing resources and in the source package. For instance, in Figure 2, function scarce libraries, (e.g., IoT devices) in which most soft- caller calling function callee means that if caller is ware packages are statically congured and optimized extracted, callee should be extracted as well. Cslim specically for a single target architecture. uses this reference database to resolve any calling de- pendencies between functions in the following steps. 2.2 Analyzing the source code Cslim needs to scan all the les in the source pack- After the bootstrapping, Cslim analyzes the source code age to populate the reference database. Cslim exploits to identify any possible calling dependencies within the traditional tools, cscope [3] and [2], for functions to extract. For a systematic analysis, Cslim scanning, as further described in § 3. records all functions and their calling interactions in a relational database, whose data schema appears in Figure 2. The Functions table records all functions in 2.3 Resolving the calling dependencies the source package, each with its name and denition’s After the source package has been analyzed, Cslim recur- sively resolves any dependencies in the target function list, as presented in Listing 1. Functions First, both targets and unresolved lists are popu- lated with the target functions, via populate_from_input() FID Function Name Source Line Number procedure. While targets is a static data structure, 1 !"!! " caller.c 24 unresolved keeps changing during the resolving pro- 2 !"!! callee.c 51 cess. For instance, a function is initially added to the ...... unresolved list, and then removed as soon as its calling dependencies are resolved. The algorithm resolves de- Calls pendencies by adding extra functions to be extracted CID Caller FID Callee FID into the compats list. Specically, the while loop in- spects the functions in the unresolved list, which is 1 1 2 initially populated with the target functions. For each ...... function (f) in the unresolved list, the algorithm queries, via invoking get_callees() to acquire any functions Figure 2: The simplied database schema to describe (callees) called by the function f. If there exists any, the function interactions in the source package. This - such callees are added to both compats and unresolve ample shows the table status of recording the calling re- lists. Then, the function f can be removed from the lationship between caller and callee, in which caller unresolved list. In this manner, the loop continues until calls callee. 1 unresolved = [] important to resolve the calling dependencies between 2 targets = [] those functions. For instance, if function A is called by compats = [] 3 another function B, function A should be written before 4 5 def populate_from_input(): function B to avoid any potential compile errors. 6 # read input file and return as a list After the source les are generated, Cslim nally 7 ... generates a build script, Makefile.am. The new build 8 script can facilitate the manual integration process, par- def get_callees(f): 9 ticularly when a package adopts the popular autoconf- 10 # look up callees from the reference 11 # database based build system. 12 ... 13 3 IMPLEMENTATION 14 targets = populate_from_input() 15 unresolved = populate_from_input() We prototyped Cslim with four separate programs writ- 16 ten in C++ and Python. For the initial analysis of the 17 while unresolved.len() > 0: source package, Cslim relies on the output of cscope for f in unresolved: 18 and ctags, which are readily available in most UNIX-like 19 callees = get_callees(f) cslim-createdb 20 for c in callees: systems. Specically, program, written 21 if c in not targets or compats: in C++, parses the output les of cscope and ctags 22 unresolved.append(c) and creates the reference database using SQLite [9]. 23 unresolved.remove(c) cscope is a popular development tool for browsing C compats.append(c) 24 code. ctags works by generating an index or (a tag) le Listing 1: The pseudo code of recursively resolving of the program names found in the input source and calling dependencies among target functions in header les. This tag le allows denitions to be quickly Cslim. and easily located by any utility tools. The remaining steps (i.e., resolving dependencies (§ 2.3) and generating source codes (§ 2.4)) are implemented the unresolved list becomes empty. Upon the comple- in Python. An additional SQLite database le maintains tion, the target list holds the target functions (in the data structures while resolving the dependencies. Fi- appearing order from the user input), while the compat nally, a bash script, cslim.sh and a Makefile automate list contains extra functions in the order the calling de- all extraction steps. The bash script takes the name of pendencies are resolved. As we will see shortly (§ 2.4), the input le, which lists target functions to be extracted, keeping the order or the compat is important in gener- as its argument. ating the correct source le.

2.4 Generating the source code 4 EMPIRICAL STUDY In this section, we demonstrate the viability of Cslim by First Cslim generates the source les that can be in- discussing experiments with two real-world softwares cluded in the target package. In addition to the source written in C: SQLite [9] and GlusterFS [10]. les with the target functions, e.g., functions.h and functions.c, Cslim also generates compat.h and compat.c, containing the declarations and denitions, respectively, 4.1 Synthetic C Software of the extra functions extracted to satisfy the calling We wrote a synthetic C program that has internal call- dependencies of the target functions. All source les are ing dependencies, to verify the correctness of Cslim. generated in a target directory, e.g., /project/compat, Specically, the program consists of a single header le which can be specied by the user. and four C source les. Each C le has 10 functions First, the target functions are extracted in the order of (40 functions in total), calling each other 37 times. We the targets list in Listing 1, which exactly follows the also wrote a main function, which calls 6 of the total order of the target function list, provided by the user. 40 functions. In the baseline case, we compile all C For each function, Cslim extracts the function signature les to generate a single executable le. Using Cslim, (or prototype) and denition from the original source we extract the functions from the synthetic C les and les, by consulting the reference database that was pre- compile the program with the output les generated by viously built (§ 2.2). Next, the functions in the compats Cslim. Table 1 compares the result. list in Listing 1, are extracted and written to the result- The number of functions are accounted from the re- ing source les, compat.h and compat.c in the reverse sulting executable le with command [4, 7], particu- order. Note that the appearing order in the compat.c is larly by counting the “T” section (text/code section). In Number of functions Code size bug xes) into the target source code by integrating with Baseline 46 13,336 Bytes a version control system (e.g., git). Cslim 20 8,432 Bytes Table 1: Comparison of resulting binary executables. 6 RELATED WORK the baseline case, all 40 functions from the source pack- We discuss the general challenges of refactoring C pro- age are included in the executable le. The executable grams, along with the solutions oered by prior research le also includes our main function and 5 other code works. The biggest challenge in refactoring C programs sections that are inserted by the compiler (e.g., _init, is the handling of C preprocessor macros, which is of- _fini, _start). In comparison, Cslim generated the ex- ten treated as a separate language [1, 11]. In essence, a ecutable le only with 20 functions. In fact, the calling macro is a name given to a block of C statements as a dependencies (from the 6 target functions) require 8 pre-processor directive. The #define directive denes functions to be extracted, resulting in extracting an identier and a character sequence that will be substi- 14 functions. Note that exactly 20 functions appear in tuted for the identier each it is encountered in the the resulting executable le, including the main func- source le. The identier is referred to as a macro name tion and 5 compiler-generated functions. The size of and the replacement process as macro replacement. the executable le also eectively decreases, only 63% TypeChef [16] detects errors without having of the baseline executable le. to generate all variants imposed by #ifdef directives. In TypeChef, the preprocessor partially processes the 4.2 Real-World C Software preprocessor directives, and separately analyzes the We have also tested Cslim to extract functions from variabilities. Although the approach of TypeChef is SQLite [9] and GlusterFS [10]. Specically, we have ex- promising, its parser requires implementation-specic tracted sqlite3PageMalloc and sqlite3PageFree func- knowledge about the target source code and cannot an- tions from SQLite, and all functions from the dictionary alyze arbitrary C codes. Another approach of handling implementation (dict_*) in GlusterFS. A total of 14 and C preprocessor macro is to design and develop a new 103 extra functions have been additionally extracted preprocessor language for C. For instance, ASTEC [17] respectively from SQLite and GlusterFS, to satisfy the features similar functionality and usage to the original C calling dependencies. preprocessor macro, but is analyzable and refactorable. However, it lacks support for C++. Moreover, convert- 5 DISCUSSION AND FUTURE WORK ing existing source code to ASTEC framework requires We discuss the limitations that we have observed from human intervention, making it practically less appealing. conducting the experiments. Although Cslim can ex- Yacfe [21] is a parser that can analyze C/C++ source code tract functions as expected, the output les still require without preprocessor macros. This heuristics-based few manual tasks for the following reasons. First, Cslim parser can handle large C projects, at the cost of compro- currently identies the function body based on simple mising correctness. A similar heuristics-based approach string matching, which only works with some coding is supported by Coccinelle [22], which automatically styles. Second, our current prototype cannot correctly generates documentation and collateral evolutions as extract custom type denitions that are required by the a le to assist developers of device drivers in the target functions. These limitations could be addressed Linux kernel. SuperC [13] is the rst framework that can by replacing the current parser with a more advanced completely parse C programs including the preproces- C source code parser, such as SuperC [13]. Also, Cslim sor macros. The preprocessor of SuperC preserves the cannot autonomously modify the build script to stop variability, which is subsequently resolved by the parser the build process after the C preprocessor, e.g., gcc -E. that spawns multiple processes for each static condition. Therefore, the user has to perform the bootstrapping Despite these alternatives, experienced developers still process (§ 2.1) by hand, before executing Cslim. We often resort to using the C preprocessor for portability plan to address these limitations in the future. and variability reasons [18]. Although refactoring the In addition, we plan to integrate the extra features preprocessed C code is ill-advised in general [12], Cslim into Cslim for providing a better usability. First, we plan leverages the unique features of our target environment to append an additional step after the code generation, (i.e., IoT devices), thus safely eliminating the source code to verify that the correctness of the extracted functions variabilities and operating on the preprocessed code. is not compromised. Lastly, we also plan to add the There are many works which focus on optimizing ability to incrementally merge any source updates (e.g., resource consumption in distributed systems [23, 24]. In this paper, we focus on several works which opti- [11] Alejandra Garrido and Ralph Johnson. 2002. Challenges of Refac- mize consumption of resources in constrained environ- toring C Programs. In Proceedings of the International Workshop ments for other languages. For instance, a code cache on Principles of Software Evolution (IWPSE ’02). [12] A. Garrido and R. Johnson. 2003. Refactoring C with Condi- management technique dynamically unloads dead and tional Compilation. In 2003. Proceedings. 18th IEEE International infrequently used code in JIT-based JVMs for mobile Conference on Automated Software Engineering. and embedded devices [26]. Specically, the presented [13] Paul Gazzillo and Robert Grimm. 2012. SuperC: Parsing All of C system proles applications online and o ine to select by Taming the Preprocessor. SIGPLAN Not. 47, 6 (June 2012). candidates for code unloading. Other techniques have [14] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. 2013. Internet of Things (IoT): A - also been used to reduce memory consumption in such sion, Architectural Elements, and Future Directions. Future environments. To balance memory and performance by Generation Computer Systems 29, 7 (2013). reducing re-translation, selective ushing of software [15] Apala Guha, Kim Hazelwood, and Mary Soa. 2010. Balanc- code has been proposed [15]. In addition, managing ing Memory and Performance through Selective Flushing of code cache while maintaining correctness and minimiz- Software Code Caches. In Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded ing memory consumption has also been proposed for systems. ACM, 1–10. resource constrained environments [25]. Cslim compli- [16] Christian Kästner, Paolo G. Giarrusso, Tillmann Rendel, Se- ments the aforementioned techniques, as it reduces the bastian Erdweg, Klaus Ostermann, and Thorsten Berger. 2011. code size in such resource constrained environments Variability-Aware Parsing in the Presence of Lexical Macros and by selectively extracting a minimal amount of target Conditional Compilation. In Proceedings of OOPSLA ’11. [17] Bill McCloskey and Eric Brewer. 2005. ASTEC: A New Approach functions from generic C libraries. to Refactoring C. SIGSOFT Softw. Eng. Notes 30, 5 (Sept. 2005). [18] Flávio Medeiros, Christian Kästner, Márcio Ribeiro, Sarah Nadi, and Rohit Gheyi. 2015. The Love/Hate Relationship with the 7 CONCLUSION C Preprocessor: An Interview Study. In 29th European Confer- We presented Cslim, an automatic source-code refac- ence on Object-Oriented Programming (ECOOP 2015), John Tang toring tool that extracts functions of interest from a Boyland (Ed.), Vol. 37. [19] Tom Mens and Tom Tourwé. 2004. A Survey of Software Refac- source package after resolving their calling dependen- toring. IEEE Transactions on Software Engineering 30, 2 (2004). cies, while bypassing the preprocessor macro complexi- [20] Jerey L. Overbey, Farnaz Behrang, and Munawar Haz. 2014. ties, and then generates source code that can be directly A Foundation for Refactoring C with Macros. In Proceedings of included in any target packages. Cslim supports re- the 22Nd ACM SIGSOFT International Symposium on Foundations stricted development environments, such as IoT devices, of Software Engineering (FSE 2014). [21] Yoann Padioleau. 2009. Parsing C/C++ Code Without Pre- vastly reducing their dependence on external library processing. In Proceedings of the 18th International Conference code. Without requiring any source code modications on Compiler Construction: Held As Part of the Joint European or manual intervention, Cslim can be easily adopted as Conferences on Theory and Practice of Software, ETAPS 2009 (CC a development tool by IoT developers. ’09). [22] Yoann Padioleau, Julia Lawall, René Rydhof Hansen, and Gilles Muller. 2008. Documenting and Automating Collateral Evolu- tions in Linux Device Drivers. In ACM SIGOPS Operating Systems REFERENCES Review, Vol. 42. [1] 1987. The C Preprocessor - GCC - GNU. https://gcc.gnu.org/ [23] Arnab K Paul, Arpit Goyal, Feiyi Wang, Sarp Oral, Ali R Butt, onlinedocs/cpp/. (1987). Accessed: Sept, 2018. Michael J Brim, and Sangeetha B Srinivasa. 2017. I/o load bal- [2] 2009. Exuberant Ctags. https://ctags.sourceforge.net/. (2009). ancing for big data hpc applications. In 2017 IEEE International Accessed: May, 2018. Conference on Big Data (Big Data). IEEE, 233–242. [3] 2012. Cscope Home Page. https://cscope.sourceforge.net/. (2012). [24] Arnab Kumar Paul, Wenjie Zhuang, Luna Xu, Min Li, M Mustafa Accessed: July, 2018. Raque, and Ali R Butt. 2016. Chopper: Optimizing data parti- [4] 2016. Autoconf - GNU Project - Free Software Foundation. tioning for in-memory data analytics frameworks. In 2016 IEEE http://www.gnu.org/software/autoconf/autoconf.html. (2016). International Conference on Cluster Computing (CLUSTER). IEEE, Accessed: Sept, 2018. 110–119. [5] 2018. GLib - GNOME Developer Center. https://developer.gnome. [25] Forrest J Robinson, Michael R Jantz, and Prasad A Kulkarni. 2016. org/glib/stable/. (2018). Accessed: July, 2018. Code Cache Management in Managed Language VMs to Reduce [6] 2018. Glibc - GNU. https://www.gnu.org/software/libc/. (2018). Memory Consumption for Embedded Systems. In ACM SIGPLAN Accessed: May, 2018. Notices, Vol. 51. ACM, 11–20. [7] 2018. GNU Binutils. https://www.gnu.org/software/binutils/. [26] Lingli Zhang and Chandra Krintz. 2004. Prole-Driven Code (2018). Accessed: Sept, 2018. Unloading for Resource-Constrained JVMs. In Proceedings of [8] 2018. Qt - Home. https://www.qt.io/. (2018). Accessed: Sept, the 3rd international symposium on Principles and practice of 2018. programming in Java. Trinity College Dublin, 83–90. [9] 2018. SQLite Home Page. https://www.sqlite.org/. (2018). Ac- cessed: Sept, 2018. [10] 2018. Storage for your Cloud. — Gluster. http://www.gluster.org. (2018). Accessed: Sept, 2018.