More Efficient Serialization and RMI for Java
To appear in: Concurrency: Practice & Experience, Vol. 11, 1999.

Michael Philippsen, Bernhard Haumacher, and Christian Nester
Computer Science Department, University of Karlsruhe
Am Fasanengarten 5, 76128 Karlsruhe, Germany
[phlipp|hauma|nester]@ira.uka.de
http://wwwipd.ira.uka.de/JavaParty/

Abstract. In current Java implementations, Remote Method Invocation (RMI) is too slow, especially for high performance computing. RMI is designed for wide-area and high-latency networks, it is based on a slow object serialization, and it does not support high-performance communication networks.

The paper demonstrates that a much faster drop-in RMI and an efficient drop-in serialization can be designed and implemented completely in Java without any native code. Moreover, the re-designed RMI supports non-TCP/IP communication networks, even with heterogeneous transport protocols. We demonstrate that for high performance computing some of the official serialization's generality can and should be traded for speed. As a by-product, a benchmark collection for RMI is presented.

On PCs connected through Ethernet, the better serialization and the improved RMI save a median of 45% (maximum of 71%) of the runtime for some set of arguments. On our Myrinet-based ParaStation network (a cluster of DEC Alphas) we save a median of 85% (maximum of 96%), compared to standard RMI, standard serialization, and Fast Ethernet; a remote method invocation runs as fast as 80 µs round trip time, compared to about 1.5 ms.

1 Introduction

From the activities of the Java Grande Forum [8, 20] and from early comparative studies [7] it is obvious that there is growing interest in using Java for high-performance applications. Among other needs, these applications frequently demand a parallel computing infrastructure. Although Java offers appropriate mechanisms to implement Internet scale client/server applications, Java's remote method invocation (RMI, [23]) is too slow for environments with low latency and high bandwidth networks, e.g., clusters of workstations, IBM SP/2, and SGI Origin.

1.1 Breakdown of RMI cost

We have studied the cost of a remote method invocation on two platforms:

- two PCs: 350 MHz Pentium II, running Windows NT 4.0 Workstation, isolated from the LAN to avoid packet collisions but connected to each other by Ethernet, JDK 1.2 (JIT enabled).
- a cluster of 8 Digital Alphas: 500 MHz, running Digital UNIX, connected by Fast Ethernet, JDK 1.1.6 (regular JIT); the JDK 1.2beta was too buggy.

For three different types of objects we measured the time of a remote invocation of ping(obj) returning the same obj. A part of this time is spent when communicating over existing socket connections. We have timed this separately (round trip, but including serialization). Finally, we measured the time needed for the JDK-serialization of the argument alone, i.e., without any communication.
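As a rough illustration of how such a measurement can be set up with standard RMI, consider the following minimal sketch; the interface name Ping, the registry URL, and the timing loop are our own illustration, not the paper's actual benchmark code.

import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote interface: the server simply returns the argument.
interface Ping extends Remote {
    Object ping(Object obj) throws RemoteException;
}

public class PingClient {
    public static void main(String[] args) throws Exception {
        Ping server = (Ping) Naming.lookup("//server/ping");
        Object obj = new int[32];   // e.g. the "32 int" case of Table 1
        final int RUNS = 1000;
        long start = System.currentTimeMillis();
        for (int i = 0; i < RUNS; i++) {
            server.ping(obj);       // serialize, ship, deserialize, return
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println((1000.0 * elapsed / RUNS) + " microseconds per ping");
    }
}

Timing the serialization alone works analogously, with the loop body replaced by a writeObject() call on an ObjectOutputStream that discards its output.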
Table 1. Ping times [µs] of RMI (=100%), socket communication including serialization, and just JDK-serialization alone. The argument obj has either 32 int values, 4 int values plus 2 null pointers, or it is a balanced binary tree of 15 objects each of which holds 4 ints.

       µs per object       32 int        4 int, 2 null    tree(15)
PC     RMI ping(obj)       2287          1456             3108
       socket(obj)         1900 (83%)    1053 (72%)       2528 (81%)
       serialize(obj)       840 (37%)     368 (25%)       1252 (40%)
DEC    RMI ping(obj)       7633          4312             14713
       socket(obj)         6728 (88%)    2927 (68%)       12494 (85%)
       serialize(obj)      4332 (57%)    1724 (40%)        9582 (65%)

Table 1 gives the results; other types of objects behave similarly. For large objects with array data the RMI overhead is about constant, so that serialization and low-level communication dominate the overall cost. As a rule of thumb, however, Java's object serialization takes at least 25% of the cost of a remote method invocation, and the cost of serialization grows with growing object structures, up to 65% in our measurements. The percentage is bigger for slower JDK implementations. The RMI overhead is in the range of 0.4 to 2.2 milliseconds. Benchmarks conducted by the Manta team [22] show similar results.

1.2 Organization of this Paper

We present work on all three areas (serialization, RMI, and network) to achieve the best performance improvements. After a discussion of related work in section 2, section 3 discusses the central optimization ideas and the design issues of a better serialization. Section 4 presents the key ideas and the design of a much leaner RMI. Both our serialization and our RMI can be used individually or in combination as drop-in replacements for the standard JDK equivalents. They are written entirely in Java and are portable. Section 5 briefly presents the Myrinet-based ParaStation network that we use instead of Ethernet to demonstrate that our RMI can easily be used over non-TCP/IP networking hardware. Section 6 discusses the benchmark collection and quantitative results.

2 Related Work

The idea of a remote procedure call (RPC) has been around at least since 1976 [26]. After the first paper designs, numerous systems have been built; Corba and DCOM are this evolutionary process's latest outgrowths. The earliest systems focusing on latency and bandwidth of RPCs include the Cedar RPC [1] by Birrell and Nelson and the Firefly RPC [17] by Schroeder and Burrows. Thekkath and Levy [19] study the design space of RPC mechanisms on several high-speed networks and identify the performance influence of all hardware and software components needed to implement an RPC, including the network controller and cache system.

It is known from this and other work that for a low-latency and high-bandwidth RPC at least the following optimizations need to be applied (a sketch of the pre-created handler pool idea follows the list):

- Stubs marshal user data into network packets; skeletons unmarshal the data. Marshaling routines need to be fast and need to avoid copying. Explicit and precompiled marshaling routines are needed. Thekkath even includes them into the kernel. The Firefly RPC reuses network packets and allows direct access to the packet data to avoid additional copying.
- A pool of processes or threads to handle incoming calls must be created ahead of time, so that their creation time is not added to the duration of the RPC. Analogously, a pre-allocated network packet should be assigned to any caller in advance.
- Upon an incoming call, the process or thread dispatcher should be able to switch immediately to the single process or thread that waits for that message.
- For small/short RPCs it might be advantageous if the caller spins instead of performing a blocking wait for the return message. Similarly, it might be better on the side of the callee to have the network thread execute user code. Both will save context switches and hence cache flushes.
- Datagram based solutions are faster than connection oriented protocols. All the features provided by the communication hardware should be exploited to reduce software overhead. For example, computation of packet checksums is better done in hardware. If the network does not drop any packets, there is no need to implement acknowledgement in software. Thus, any portable RPC design must allow the use of platform-specific and optimized transport layer implementations.
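The following minimal Java sketch illustrates the pre-created handler pool from the second point above; the class name HandlerPool, the port, and the buffer size are our own illustration, and a TCP echo loop stands in for real call dispatch.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class HandlerPool {
    static final int POOL_SIZE = 8;
    static final int BUF_SIZE = 4096;

    public static void main(String[] args) throws IOException {
        final ServerSocket server = new ServerSocket(4711);
        for (int i = 0; i < POOL_SIZE; i++) {
            // Thread and buffer exist before any call arrives, so neither
            // thread creation nor allocation is paid per call.
            final byte[] buffer = new byte[BUF_SIZE];
            new Thread(new Runnable() {
                public void run() {
                    while (true) {
                        try {
                            Socket s = server.accept(); // pre-created thread waits here
                            int n = s.getInputStream().read(buffer);
                            if (n > 0) {
                                s.getOutputStream().write(buffer, 0, n); // echo the call
                            }
                            s.close();
                        } catch (IOException e) {
                            return; // server socket closed; handler exits
                        }
                    }
                }
            }).start();
        }
    }
}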
RMI extends the ideas of RPC to Java [23]. In contrast to earlier RPC systems, RMI is designed for a single language environment, where both caller and callee are implemented in Java. There are two essential differences. First, RMI enables the programmer to use a global object model with a distributed garbage collector. Second, polymorphism works on remote calls, i.e., every argument can be of its declared type or of any subtype thereof. Polymorphism is not possible in conventional RPC systems (including Corba, DCOM, and RMI-over-IIOP) because it requires dynamic loading and binding of stub code.

Unfortunately, Sun's design and implementation of RMI does not address high performance. Whereas fast RPC implementations have learned to exploit hardware features, Sun's RMI seems to prevent exactly this. First, since Java disallows direct access to memory, it is hard to avoid copying; pre-allocated network packets cannot be used. Second, since RMI is tied to connection-oriented TCP/IP sockets, the socket API prevents streamlining the fast path needed for a fast RMI: for example, there is no way to teach the scheduler how to switch directly to a waiting thread, and there is no way to get rid of the status information needed for socket connections but unnecessary for remote method invocations. Special features of non-Ethernet high-performance communication hardware cannot be exploited.

It is a significant engineering problem to keep RMI's definition and API, but to redesign it internally so it becomes possible to capitalize on existing knowledge of efficient RPC implementations. Moreover, the distinctions between RMI and RPC pose at least two new problems. Whereas arguments and results of RPCs were of primitive data types (and structures thereof), RMI can ship Java object graphs. Sun's implementation of marshaling uses Java's object serialization and is hence done in a very general way by means of type introspection.
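For reference, the standard JDK mechanism that RMI's marshaling builds on is the following; the small self-contained example below (the Node class is our illustration, analogous to the tree objects of section 1.1) shows what happens to every object argument of a remote call.

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializeDemo {
    // Simple serializable graph node with four ints and two references.
    static class Node implements Serializable {
        int a, b, c, d;
        Node left, right;
    }

    public static void main(String[] args) throws Exception {
        Node root = new Node();
        root.left = new Node();
        root.right = new Node();

        // General-purpose serialization: the runtime discovers the
        // object's fields via type introspection and walks the whole graph.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(root);
        out.flush();
        System.out.println("serialized size: " + bytes.size() + " bytes");
    }
}

Every remote invocation pays for this generality; the following sections show how much of it can be traded for speed.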