Chapter 1 Yu Wang: a Survey of Software Distribution Formats

Nowadays a wide variety of computer architectures and operating systems are in use, and software development no longer focuses on any single one of them. Instead, software companies and distributors are more concerned with how their software can be deployed as universally as possible. We refer to both architectures and operating systems as platforms. Without a multi-platform solution, given M instruction set architectures, producing software that is executable on all of them requires compiling it M times, or even more because of the combinations of different hardware components. With a multi-platform solution, by contrast, the software needs to be written only once, and the techniques of the solution let it achieve the same functionality as if the code had been compiled M times as in the previous case.

We define a software distribution format to be any form in which software is distributed. In this article we discuss several software distribution formats: fat binaries, software for virtual machines, and source code distribution, where the topic of virtual machines is divided into application oriented and platform oriented virtualization. In terms of compilation, these distribution formats can be considered fully compiled, half compiled and not compiled, respectively. At the end, a comparison of these formats is drawn in the concluding remarks.

1.1 Fat Binary

An immediate solution to universal software distribution is to compile the source code into several binaries for different architectures and to make all of them available on the end user's computer, so that the appropriate binary is selectively installed based on the architecture of that computer. This solution is known as a fat binary [25] or universal binary [1]: the compiled binaries are archived and compressed into a single package, and the system chooses which binary to install according to its architecture at install time.
Because a fat binary packages code for multiple architectures, one disadvantage is that the package to be distributed tends to be larger than with the other solutions introduced in this article. A fat binary for software of size s often ends up with a size of M × s for M instruction set architectures. However, because the fat binary format is technically easy to apply when producing software, the technique is still widely used; for instance, it made Apple Computer's migration from the PowerPC architecture to the x86 architecture in 2005 much smoother [26].

A solution similar to fat binary is the deployment of PocketPC applications using the Microsoft Installer format (MSI) [2]. A PocketPC is a handheld computer that runs a Windows CE based operating system by Microsoft and was originally developed in November 1996. The typical steps for installing an application on a PocketPC are to first install the setup package on a desktop workstation and then to transfer the cabinet binary file (CAB) to the PocketPC device during synchronization once the device is connected [14]. For historical reasons, current PocketPC devices come with processors of different architectures, such as Intel XScale (footnote 1), Hitachi SH3 (footnote 2) and MIPS (footnote 3), which raises a compatibility problem when distributing software applications. Since handheld devices are not as powerful as desktop computers, the solution should be as independent of the processor as possible. In this case, fat binary is most appropriate.

In terms of compilation, fat binary can be considered a fully compiled distribution format. Besides universal distribution for different architectures, fat binaries are also used for other purposes, such as applications with different localized versions. Such cases are rare and are outside the scope of our topic.
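The install-time selection and the M × s size estimate described in this section can be illustrated with a minimal sketch. It is not the layout of any real fat binary or CAB format: the dictionary-based package, the architecture keys and the helper names `fat_package_size` and `select_binary` are all hypothetical, and the payloads are placeholder bytes rather than compiled code.

```python
import platform


def fat_package_size(binary_sizes):
    """Total payload size: the sum over all bundled binaries (about M * s)."""
    return sum(binary_sizes.values())


def select_binary(package, arch):
    """Return the binary built for `arch`, as an installer might at install time."""
    try:
        return package[arch]
    except KeyError:
        raise LookupError(f"no binary for architecture {arch!r}") from None


# Hypothetical package: one placeholder payload per architecture (M = 3, s = 1000).
package = {
    "ppc": b"\x00" * 1000,
    "x86": b"\x01" * 1000,
    "arm": b"\x02" * 1000,
}

sizes = {arch: len(blob) for arch, blob in package.items()}
print("package size:", fat_package_size(sizes))
print("selected:", len(select_binary(package, "x86")), "bytes for x86")
# A real installer would key the lookup on the host architecture, for example:
print("host architecture reported by Python:", platform.machine())
```

With three architectures of 1000 bytes each the package carries 3000 bytes, while only 1000 of them are ever installed on a given machine, which is the size overhead discussed above.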
1 See www.intel.com/design/intelxscale
2 See http://www.superh.com
3 See http://www.mips.com

1.2 Application Oriented Virtualization

To make software adaptable to different platforms, one can consider a machine that exists virtually on top of each platform, so that the software needs to be written only for this virtual machine and, from the users' point of view, can be executed on every platform. Here we classify modern virtual machines into two types: application oriented virtualization, and platform oriented virtualization, which is discussed in the next section.

We define an application oriented virtual machine to be a virtual machine that provides a set of non-native instructions and allows applications compiled against this instruction set to be launched and executed in the virtual machine. The language of such an instruction set is commonly referred to as an intermediate language, and code in an intermediate language is called intermediate code. An application oriented virtual machine bridges the underlying platform and its applications, allowing the applications to run on multiple platforms.

Modern application oriented virtual machines are either abstract stack machines or register-based machines. A stack machine models and manipulates memory as stacks, where structured data and function calls are pushed onto the stacks and popped as needed. A register-based machine performs operations on a finite set of registers, such that input data and computed results are placed in the registers. To narrow the scope of our topic, register-based machines are not discussed here. For the detailed differences between the two types of machines, one can refer to [18].

Because the software is distributed as intermediate code, it eventually has to be compiled again, by the virtual machine's built-in compiler, into native machine language that the computer can understand.
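To make the stack machine model concrete, the following is a minimal sketch of an abstract stack machine that interprets a tiny intermediate language. The instruction names `push`, `add` and `mul` and the tuple encoding are assumptions made for illustration; they do not correspond to the instruction set of any particular virtual machine.

```python
def run(program):
    """Interpret a list of (opcode, *operands) tuples on an operand stack."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(instr[1])
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "mul":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction {op!r}")
    # By convention in this sketch, the result is the value left on top.
    return stack.pop()


# Intermediate code for (2 + 3) * 4: operands are pushed, operators pop them.
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
print(run(program))
```

The same `program` list could be interpreted by an equivalent `run` on any platform, which is the sense in which the virtual machine bridges the platform and the application.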
We often refer to the compilation from intermediate language to native machine language as code generation. Depending on when the compilation takes place, two types of code generation are considered. The first is called install-time code generation. As the name suggests, the intermediate code is compiled while the software is being installed. After this stage, the program is completely compiled into the machine language of the target platform and runs natively. One major weakness of install-time code generation is that, if the software to be compiled is relatively large in code size, the installation stage will probably take a long time to complete.

If this is the case, we switch to another type of code generation, called just-in-time (JIT) compilation. In this mode, intermediate code is compiled selectively while the software is running: the JIT compiler compiles only the intermediate code encountered in the current program state, such as the code after a conditional branch. Since compiling a block of code takes much less time than compiling all the code, large software as mentioned above can be loaded and executed sooner, with its usability preserved.

In JIT compilation, once intermediate code is compiled, the compiler output is saved in memory for possible future calls. As the program's lifetime progresses, more and more intermediate code is translated. Since native code is directly executable on the underlying hardware, it runs faster than the code blocks that are not yet compiled. As a result, the performance of software running in JIT mode gradually improves relative to the moment the program was loaded. Eventually all the code blocks are compiled into native code, and the performance reaches its maximum. We call the stage before the intermediate code is completely compiled the JIT warm-up stage.
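The compile-on-first-use behaviour and the in-memory cache described above can be sketched as follows. This is only an analogy under stated assumptions: the class `LazyCompiler` and the block names are hypothetical, the "intermediate code" is plain expression text, and Python's built-in `compile` produces Python bytecode rather than native machine code. What the sketch does show is that a block is translated only when execution reaches it and that later calls reuse the cached result.

```python
class LazyCompiler:
    """Compile each block on first use and cache the result (JIT-style)."""

    def __init__(self, blocks):
        self.blocks = blocks      # block name -> expression text (stand-in for intermediate code)
        self.cache = {}           # block name -> compiled code object
        self.compilations = 0     # how many blocks have been translated so far

    def run(self, name, env):
        code = self.cache.get(name)
        if code is None:
            # First encounter: translate the block and keep the output in memory.
            code = compile(self.blocks[name], name, "eval")
            self.cache[name] = code
            self.compilations += 1
        return eval(code, {"__builtins__": {}}, env)


blocks = {"then_branch": "x * 2", "else_branch": "x - 1"}
jit = LazyCompiler(blocks)

for x in (21, 30):
    # Only the branch actually taken is ever compiled.
    branch = "then_branch" if x > 10 else "else_branch"
    print(x, "->", jit.run(branch, {"x": x}))

print("blocks compiled:", jit.compilations, "of", len(blocks))
```

In this run the `else_branch` block is never reached and therefore never compiled, while the second call to `then_branch` skips compilation entirely; a program in which every block has been reached at least once has left the warm-up stage in the sense defined above.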
In terms of compilation, software for an application oriented virtual machine can be considered a half compiled distribution format.

It is suspected in [4] that the earliest proposal of the concept of JIT is [13], back in the 1960s. The author believed that function code could be compiled into machine code on the fly, so that no compiler output needs to be saved in physical storage, which conforms to the idea of JIT. Since then, several implementations of JIT compilers for different languages have been researched, such as [15], [10] and [22].

Java Virtual Machine

JIT became well known when Sun Microsystems released Java in the 1990s. Java includes a virtual machine called the Java Virtual Machine, or JVM, which was originally designed to run only applications written in the Java language. There do exist third-party implementations that compile source code in other programming languages into intermediate code executed by the JVM [11], but its designers integrated the JVM tightly with the whole Java development environment, and it therefore supports the Java object model directly, including inheritance and interfaces. Low-level methods of objects include static methods, virtual methods and interface methods. Aggregate data such as an object is stored only after memory has been dynamically allocated for it, and it is collected when it is no longer accessible. Scalar data is stored in a local variable, in a structure field, or on the stack of the abstract machine. When a method is invoked, scalar data and/or references to aggregate data are pushed onto the evaluation stack, and the return value of the method is presented on the top of the stack.

In the file system, compiled Java intermediate code exists as class files or as a single compressed JAR file. The filename of each class file has to be the same as the name of the public class within the file. Classes can be organized using the directory structure of the file system.
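As a small illustration of how class names relate to the directory structure, the sketch below maps a fully qualified class name to the relative path of its class file, assuming the usual convention that each package component becomes a directory. The helper `class_file_path` and the class name `com.example.Main` are hypothetical.

```python
from pathlib import PurePosixPath


def class_file_path(qualified_name):
    """Map a dotted class name to the relative path of its class file."""
    parts = qualified_name.split(".")
    # Package components become directories; the class name becomes the file name.
    return PurePosixPath(*parts[:-1], parts[-1] + ".class")


print(class_file_path("com.example.Main"))
print(class_file_path("Main"))
```

The same relative paths are what one would expect to find as entries inside a JAR file that bundles these classes.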
Each class file is in bytecode format, where JVM instructions are represented as one-byte opcodes ranging from 0 to 255. Hence the JVM intermediate language is formed from at most 256 instructions, of which about 200 are assigned in practice. These opcodes include instructions for load and store, arithmetic, type conversion, object creation, operand stack management, control transfer, method invocation, and so on.
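The one-byte opcode encoding can be illustrated by decoding and executing a short instruction sequence. The sketch below covers only five opcodes (`iconst_1`, `iconst_2`, `bipush`, `iadd` and `ireturn`) with their standard numeric values; the functions `disassemble` and `execute` are hypothetical helpers, the byte sequence is hand-written rather than taken from a real class file, and 32-bit integer overflow is ignored for brevity.

```python
# Opcode value -> (mnemonic, number of operand bytes that follow it).
OPCODES = {
    0x04: ("iconst_1", 0),
    0x05: ("iconst_2", 0),
    0x10: ("bipush", 1),
    0x60: ("iadd", 0),
    0xAC: ("ireturn", 0),
}


def disassemble(code):
    """Split a byte string into (mnemonic, *operands) tuples."""
    instructions = []
    pc = 0
    while pc < len(code):
        name, n_operands = OPCODES[code[pc]]
        operands = list(code[pc + 1 : pc + 1 + n_operands])
        instructions.append((name, *operands))
        pc += 1 + n_operands
    return instructions


def execute(code):
    """Run the decoded instructions on an operand stack."""
    stack = []
    for name, *operands in disassemble(code):
        if name == "iconst_1":
            stack.append(1)
        elif name == "iconst_2":
            stack.append(2)
        elif name == "bipush":
            value = operands[0]
            # The operand is a signed byte.
            stack.append(value - 256 if value > 127 else value)
        elif name == "iadd":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif name == "ireturn":
            return stack.pop()
    raise ValueError("code ended without ireturn")


code = bytes([0x04, 0x10, 0x28, 0x60, 0xAC])  # iconst_1; bipush 40; iadd; ireturn
print(disassemble(code))
print(execute(code))
```

Note that `bipush` consumes one extra byte as its operand, so an instruction may be longer than one byte even though every opcode itself fits in a single byte.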
