Exploring Performance Improvement for Java-based Scientific Simulations

Xiaorong Xiang∗, Yingping Huang, Gregory Madey
Department of Computer Science and Engineering
University of Notre Dame, Notre Dame, IN 46556 USA
Telephone: 574-631-7596
Email: [email protected], [email protected], [email protected]

Abstract

There has been growing interest in using the Java programming language in scientific and engineering applications. This is because Java offers several features which other traditional languages (C, C++, and FORTRAN) lack, including portability, a garbage collection mechanism, built-in threads, and the RMI mechanism. However, the historically poor performance of Java has stopped it from being widely used in scientific applications. Although research and development on Java, resulting in JIT compilers, JVM improvements, and high performance compilers, has done much for runtime environment optimization and has significantly speeded up Java programs, we still believe that creating hand-optimized Java source code, and optimizing code for speed, reliability, scalability, and maintainability, are also crucial during the program development stage. In this paper, we demonstrate an agent-based simulation model that is built using the Java programming language and the Swarm simulation library. We analyze the performance of this simulation model from several aspects: runtime optimization, database access, object usage, and parallel and distributed computing. This simulation model possesses most of the characteristics which general scientific simulations have. These techniques and analysis approaches can also be generally used in other scientific simulations using Java.

1 Introduction

C, C++, and FORTRAN have traditionally been used for modeling scientific applications. Since the Java programming language was introduced by Sun Microsystems in the mid-1990s, there has been growing interest in using it in scientific and engineering applications. The reason for this is that Java offers several features which other traditional languages lack. Scientists use various platforms (such as Windows, Unix, and MacOS) for their scientific studies. It is impossible to deploy an application written in C, C++, or FORTRAN from one platform to another without rebuilding the application or changing the code. One of the most attractive features of Java is its portability: "write once, run anywhere." The Java runtime environment (JVM) provides an automatic garbage collection feature that reduces the burden of explicitly managing memory for the programmer. The Java built-in threads implementation and Java's Remote Method Invocation (RMI) mechanism make it easy to support parallel and distributed computing.

Although there are many attractive features provided by the Java language, and the J2EE architecture makes Java a potential language for scientific applications, performance remains a prime concern for program developers using Java. The way portability is implemented in the JVM imposes a penalty on performance. Ashworth (1999) [1] discussed several issues related to the use of Java for high performance scientific applications.

However, much research has been done to reduce the performance gap between Java and other programming languages. This research includes Just-in-Time (JIT) compilers that compile the bytecode into native code on-the-fly just before execution, adaptive compiler technology (Sun's HotSpot VM), third-party optimizing compilers that compile the Java source code to optimized bytecode, and high performance compilers that compile the Java source to native code for a particular architecture (e.g., the IBM High-Performance Compiler for Java for the RS6000 architecture). Bull (2001) [2] rewrote the Java Grande Benchmarks in C and FORTRAN. Bull also compared the performance between these languages in different Java runtime environments on different hardware platforms. The results demonstrate that the performance gap is quite small on some platforms.

Runtime environment optimization is an important aspect to study in order to achieve high performance. Determining and understanding the factors that affect the performance of scientific applications from a software engineering perspective, and identifying and eliminating the bottlenecks that limit scalability during the development stage, are also necessary for high performance scientific applications.

In this paper, several approaches for analyzing and improving the performance of a particular scientific simulation, which is used to study Natural Organic Matter (NOM), have been described. The NOM simulator was built using the Java programming language and the Swarm library from the Santa Fe Institute [3]. The NOM simulation model is a typical distributed, stochastic scientific application that uses an agent-based modeling approach. It generates a substantial data set that must be stored in a remote database and be able to be manipulated to produce useful information. The application simulates the behavior of a large number of molecules. An application of the NOM simulation model needs to run for a long time, often for days. Several aspects of the simulation model have been analyzed, including runtime optimization, database access, object usage, and parallel and distributed computing.

The NOM simulation model possesses the characteristics which typical scientific applications have. These techniques and analytical approaches can generally be used in other scientific applications. We expect that our experiences can help other scientific application developers find a suitable way of tuning and achieving higher performance for their applications.

2 Related works

More and more implementations of Java-based simulation environments and toolkits [4][5][6][3] indicate that the Java language and Java-based technologies have been, and will continue to be, widely used in simulation modeling and implementation.

Traditionally, a Java program is compiled into bytecode using a compiler, and a Java Virtual Machine (JVM) is needed to read in and interpret the bytecode. GCJ (http://gcc.gnu.org/java/), the GNU Compiler for the Java language, can compile Java source code into either bytecode or native code. GCJ has been integrated into GCC. Bothner (2003) [7] discussed the advantages, features, and limitations of GCJ in detail. Ladd (2003) [8] did several benchmarks using GCJ on the Linux platform and showed a performance gain of GCJ over other JVMs. The GCJ compiler is a project under development, and several limitations still exist; for example, GCJ cannot compile every Java program. Using GCJ, a Java application with the Swarm library can be compiled into native code, but doing so does not improve the performance.

There are many other runtime environment optimizers and high performance compilers. The IBM alphaWorks High-Performance Compiler for Java can be used on the OS/2, AIX, and Windows NT platforms. JOVE (http://www.instantiations.com/jove/) can only be used on Windows machines. The TowerJ environment (http://www.towerj.com), developed by Tower Technology Corporation, is another example of a high performance compiler; it is available for the Solaris and Linux platforms.

For large scale scientific applications written in Java, scalability can be improved using a parallel programming model. The parallel model can run with a single JVM in a shared memory multiprocessor system, or with multiple JVMs in a distributed memory system. The Java built-in threads mechanism is a convenient method for implementing parallelism in shared memory environments. However, for large scale applications that require large memory and CPU time, distributing the application over multiple JVMs in a distributed memory system, with some message passing mechanism for inter-VM communication, is a suitable way to address the requirements.

The standard Java Thread class is appropriate for the parallel programming paradigm in a single JVM environment. Since most scientific applications are CPU bound, the number of threads should match the number of processors in the hardware architecture in order to avoid context switching and achieve the best performance. Additionally, thread creation and destruction should be avoided by creating and managing a thread pool.

OpenMP [9], an open standard for shared memory directives, defines directives for FORTRAN, C, and C++. OpenMP provides a portable and scalable model that offers a simple and flexible interface for developing parallel applications in shared memory systems. JOMP [10] provides a set of OpenMP-like directives and library routines for supporting shared memory parallel programming in Java. It uses Java threads as the underlying parallel model and is most useful for parallelizing scientific applications at the loop level.

For distributed computing, Java provides a communication mechanism using sockets and RMI (Remote Method Invocation).
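The thread-pool guidance earlier in this section — one worker per processor, threads reused rather than created per task — can be sketched with the java.util.concurrent executors, which appeared in J2SE 5.0 after the hand-managed pools of this era; the class and method names below are illustrative, not code from the NOM model:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolSketch {
    // CPU-bound work (sum of squares 1..n) split across a fixed-size pool
    // matched to the processor count, so threads are reused, not churned.
    static long sumSquares(int n) {
        int procs = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(procs);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int p = 0; p < procs; p++) {
                final int offset = p;
                // Each task sums the strided slice offset+1, offset+1+procs, ...
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (long i = 1 + offset; i <= n; i += procs) s += i * i;
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(PoolSketch.sumSquares(1000)); // prints 333833500
    }
}
```

Sizing the pool to `availableProcessors()` is exactly the context-switch-avoidance rule stated above for CPU-bound work.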

Java RMI [11] is a message passing paradigm based on the RPC (Remote Procedure Call) mechanism. RMI is primarily intended for use in the client-server model rather than the peer-to-peer communication model. On the other hand, the explicit use of sockets is too low-level for developing a parallel application [12].

In C, C++, and FORTRAN, an explicit message passing interface (MPI) [13] standard has been defined for supporting communication of an application in cluster environments. MPI is the most widely-used standard for inter-process communication in high-performance computing. MPICH [14] and LAM MPI [15] are two successful examples of portable MPI implementations in traditional languages. Programming with MPI is relatively straightforward because it supports the single program multiple data (SPMD) model of parallel computing, wherein a group of processes cooperate by executing identical program images on local data values.

In 1998, a group of researchers of the Java Grande Forum worked on a specification of an MPI-like application programming interface for message passing in Java (MPJ) [16]. The current implementations of MPJ can be separated in two ways: as a wrapper to existing native MPI libraries, or written in pure Java. NPAC's mpiJava [17] is an example of the wrapper approach, using the Java Native Interface (JNI) to execute a native call. The JMPI project [18] implements message passing with Java RMI and object serialization. The jmpi implementation [19] is built upon the JPVM system. MPIJ [20] is a Java-based implementation of MPI integrated with DOGMA (Distributed Object Group Metacomputing Architecture). An MPJ implementation in pure Java is usually slower than wrapper implementations over existing MPI libraries, but pure Java implementations are more reliable, stable, and secure [19]. Getov (2001) [12] did an experiment that used the IBM High Performance Compiler for Java (HPCJ), which generates native code for the RS6000 architecture, to evaluate the performance of MPJ on an IBM SP2 distributed-memory parallel machine. The results show that when using such a compiler, the MPJ communication components are as fast as those of MPI.

OptimizeIt is a Java/J2EE performance tuning tool developed by the Borland company. It is a commercial tool, and a trial version can be downloaded from their Web site (http://www.borland.com/optimizeit/). OptimizeIt can display information about heap allocation, garbage collection, active threads, and class loading in the form of graphs. The CPU sampling information is also displayed in a tree structure. These data can be exported to a formatted text file (HTML). Information about object allocation and deallocation can be viewed in a user-selected order. OptimizeIt is a powerful tool that detects memory leaks and CPU performance bottlenecks in Java applications.

3 Simulation Model

Natural organic matter (NOM) is a mixture of heterogeneous molecules that come from animal and plant material in the natural environment. It plays a crucial role in ecological and bio-geochemical processes such as the evolution of soils, the transport of pollutants, and the global biochemical and geochemical cycling of elements [21]. NOM, a prevalent constituent of natural waters, is highly reactive with mineral surfaces [21]. While NOM is transported through soil pores by water, it can be adsorbed onto or desorbed from mineral surfaces. Sorption of NOM is an important consideration in the treatment of drinking water. The evolution of NOM over time from precursor molecules to mineralization is an important research area in a wide range of disciplines, including biology, geochemistry, ecology, soil science, and water resources.

NOM, micro-organisms, and their environment form a complex system. The global phenomena of a complex system can often be observed by simulating the dynamic behavior of individual components and their interactions in the system. Complex systems often have emergent properties. The evolution of NOM over time from precursor molecules to eventual mineralization involves various molecular transformations. These transformations involve chemical reactions, adsorption, aggregation, and physical transport in soil, ground, or surface waters. In order to provide scientists a test bed for their theoretical analysis and experimental results in NOM study, a Web-based simulator was built. We expect that the system will help scientists better understand the NOM complex system by providing them with information for predicting the properties of the NOM system over time.

The NOM simulation model is built upon the J2EE architecture running on a distributed cluster. This cluster has multiple dual-processor PCs running the RedHat Linux 8.0 and Windows 2000 operating systems. The machines in the cluster include an HTTP server, simulation servers, database servers, a reports server, and a data mining server.

The NOM simulation system is an agent-based stochastic model that can model NOM, mineral surfaces, and microbial interactions near the surface of the soil. In the stochastic model of the evolution of NOM in discrete time and space, NOM is represented as a large number of discrete molecules with varying chemical and physical properties. Individual molecules can be transported through the soil medium via water flow, adsorb on the soil particle surfaces, and react with other molecules, micro-organisms, and the environment. Individual molecules are represented as agents with their own complex structures. These molecules move along with the water flow and react with each other in a 2D discrete grid, i.e., a rectangular lattice composed of multiple cells. Each molecule can occupy at most one cell, and each cell can host at most one molecule. During the execution of the simulation, each molecule may move to another location, experience adsorption to or desorption from a particular site, and chemically react with other molecules.

At the beginning of a simulation, simulation parameters are read from the database into the simulation engine. The total number of molecules in the simulation can be determined by specifying the discrete grid size and the molecule density. The simulation time, defined by the user, is divided into a very large number of equal, discrete, and independent time steps. In each time step, each molecule in the system is chosen in random order, and the behavior of this individual is determined by a set of rules.

4 Performance Analysis

We analyzed the performance of this simulation model using the OptimizeIt tuning tool.

4.1 Data Structure

Selecting the appropriate data structure is important in a scientific application. Different data structures have significant impacts on the performance, and no single data structure is appropriate for all situations. The best way to decide which type of data structure to use in an application is to create some benchmarks that reflect how the structure is used in the application.

In the NOM simulation model, each molecule object needs to be held in a collection. Individual molecule objects are accessed randomly at each time step. They can also be added to or removed from this collection. The Java 2 platform provides two types of List structures: ArrayList and LinkedList. The ArrayList is a random access list, and the LinkedList is a sequential access list. The same operation has very different performance characteristics for the different types of List. Positional access is an operation used to access objects through their index in the List. Positional access is a linear-time operation in a sequential access list, and a constant-time operation in a random access list. However, the remove operation for the random access list is more expensive than for the sequential access list [22].

An algorithm for manipulating random access lists can produce quadratic behavior when it is applied to sequential access lists. In the NOM simulation model, in order to randomize the order of accessing the molecules in a List, the List needs to be shuffled at the beginning of each time step. The Collections class in the Java 2 platform provides a shuffle method to randomize the order of objects in a List. The shuffle algorithm used in the implementation has linear time complexity for random access lists and quadratic complexity for sequential access lists. In order to avoid the expensive operation that would result from shuffling a sequential access list, the shuffle method converts the list into an array before executing the shuffle, and converts the shuffled array back into the list.

Three benchmarks have been created in order to test the performance, using different List structures and operations. These benchmarks reflect exactly how the data structures have been used in the NOM simulation model. In benchmark A, the MoleculeList is implemented using ArrayList, objects are accessed using get(), and the MoleculeList is randomized using shuffle(). In benchmark B, the MoleculeList is implemented using LinkedList, objects are accessed using get(), and shuffle() is used. In benchmark C, the MoleculeList is implemented using LinkedList, objects are accessed using an iterator, and shuffle() is used. In each benchmark, the add and remove operations are measured, and the overall performance is measured to reflect the NOM application behavior.

Figure 1 shows the relationship between the different types of List with different operations. It also illustrates the behavior of the different operations when the List size is doubled. In the NOM application, molecules are uniformly distributed on the grid; doubling the grid size, therefore, will double the molecule number and the List size.

The overall performance gain for ArrayList has been offset by the add and remove operations. By doubling the grid size, the time for randomly accessing the LinkedList increases more than 6 times. For this particular application, the ArrayList is the best choice for implementing the MoleculeList. It achieves a maximum 2.8 speedup in this benchmark with grid size 400 X 300. When the grid size becomes larger, the speedup increases.
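As a rough illustration of the positional-access gap that these benchmarks measure, the following micro-benchmark (hypothetical code, not the paper's benchmark) sweeps both list types by index after a Collections.shuffle:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

public class ListBench {
    // Sum the list via positional get(): O(n) total for an ArrayList,
    // but O(n^2) total for a LinkedList, since each get(i) walks the links.
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) sum += list.get(i);
        return sum;
    }

    public static void main(String[] args) {
        int n = 20_000; // stand-in for the molecule count; the NOM grids are larger
        List<Integer> array = new ArrayList<>();
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < n; i++) { array.add(i); linked.add(i); }

        // Collections.shuffle copies a large non-RandomAccess list to an array
        // first, so both shuffles run in linear time.
        Collections.shuffle(array, new Random(42));
        Collections.shuffle(linked, new Random(42));

        long t0 = System.nanoTime();
        sumByIndex(array);
        long tArray = System.nanoTime() - t0;

        t0 = System.nanoTime();
        sumByIndex(linked);
        long tLinked = System.nanoTime() - t0;

        System.out.println("ArrayList  indexed sweep: " + tArray + " ns");
        System.out.println("LinkedList indexed sweep: " + tLinked + " ns");
    }
}
```

Iterating the LinkedList with an iterator instead of get() (benchmark C) removes the quadratic walk, which is why the choice in the paper hinges on how the collection is actually accessed.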

Figure 1: Performance comparison of different operations for ArrayList and LinkedList; the simulation runs for 500 time steps. Top: grid size is 400 X 300. Bottom: grid size is 800 X 300.

4.2 Object reuse

Objects play a vital role in object-oriented programming languages such as C++ and Java. The Java Virtual Machine (JVM) can automatically manage memory using the "garbage collection" mechanism. Java developers can allocate objects as necessary without considering deallocation, while C++ programmers must manually specify where an object in the heap is to be reclaimed by coding a "delete" statement. Excessively creating objects not only increases the memory footprint and the CPU time spent on garbage collection, it also increases the possibility of memory leaks.

Sosnoski (1999) [23] showed that the time for object allocation in a Java program running on the JVM is 50 percent longer than in the equivalent C++ code. This is caused by the overhead of adding internal information, used to help the garbage collection process, when allocating objects in the heap.

Different JVMs — different versions of the Sun JVM and the IBM JVM — use various techniques to automatically discover objects that are no longer being referred to and to recycle the memory periodically. Despite the automatic nature of the garbage collection process, a potential disadvantage is that it adds an overhead that can affect program performance, even in those JVMs, such as Sun's HotSpot JRE, where the garbage collection job runs in a separate thread.

Although the JVM is responsible for reclaiming unused memory, Java programmers still need to put effort into making it clear to the JVM which objects are no longer being referenced [24]. For example, in some situations a programmer needs to set an object reference to null manually in order to dereference the object. Although the memory leaks that are common in C++ are less likely to happen in Java, they can still occur due to poor design or simple coding errors.

An elegant way of reducing the overhead of object creation and destruction, and of improving performance, is object reuse. It can also reduce the probability of potential memory leaks. In order to reuse a certain type of object, several steps need to be followed:

• Isolating objects
Due to the overhead of object reuse, such as managing the object pools, only objects that need to be created and destroyed frequently are suited to this technique. For a large scale application, a powerful profiling tool is necessary to help developers detect this kind of object; OptimizeIt has been chosen in the development process of the NOM simulation model. In the NOM simulation model, at each time step a certain number of molecules can enter or leave the system. The molecule objects need to be created and destroyed frequently, and they are candidates for potential reuse.

• Optimizing object size
The size of objects has an effect not only on the memory footprint but also on the CPU time. It is worthwhile to estimate the size of a particular object and the number of instances of a given class in memory. A trade-off can be made by either reducing the object size or keeping the precision.

• Reinitializing objects
Although Sun's HotSpot VMs radically improved the performance of object allocation and garbage collection, object allocation and instantiation still have a significant cost, especially when the object size is small. A micro-benchmark was created for testing the time of creation versus re-initialization of the Molecule objects in the NOM simulation model. In this benchmark, 1000 Molecule objects are created, then reinitialized to their original state. The average creation time for these Molecule objects is 53.85 milliseconds, while the average reinitialization time is 3.9 milliseconds.

• Creating an object pool
In order to reuse certain types of objects, these objects need to be kept in a collection in memory, the so-called object pool. An object pool is used to store a free list of objects. Generally, an object pool can be implemented using a Vector, LinkedList, ArrayList, or a raw array. Selecting a suitable type of data structure depends on the type of operations used for managing the pool. In the NOM simulation model, the object pool has been implemented as a First-In-First-Out (FIFO) queue. When a new molecule object needs to be created, the method first checks the object pool. If there is an available object in the pool, the object is removed from the queue and reinitialized; if there is no object available, a new object is created. When a molecule object leaves the system, it is added at the end of the queue. The data structure chosen for the object pool implementation is LinkedList.

Object reuse is a simple and elegant way to conserve memory and enhance speed. By sharing and reusing objects, processes or threads are not slowed down by the instantiation and loading time of new objects, or by the overhead of excessive garbage collection.

4.3 Database connection and database query

For an application written in the Java programming language, communication with a database can be accomplished through a Java Database Connectivity (JDBC) driver, with all database input/output via SQL (Structured Query Language). Database vendors provide their own JDBC drivers that conform to the common Java API defined by Sun. Using JDBC allows developers to change the database location, port, and database vendor with minimal changes in code. JDBC offers several advanced techniques that allow the programmer to write high performance queries [25]. These techniques are presented here to show their significant impact on performance and scalability.

• Prepared statements
Query processing is the process of resolving a SQL query. It can be broken down into three basic phases: query parsing, query plan generation and optimization, and plan execution [25]. The query parsing phase is a syntax-checking process for the string-based SQL query that ensures the query statement is legal. If a SQL statement, or a set of similar SQL statements, needs to be executed repeatedly, the cost of the parsing process can be reduced by caching the previously parsed queries. JDBC provides this function with a PreparedStatement object. Unlike the Statement object, which needs to be sent to the database for parsing each time, the PreparedStatement is compiled in advance and can be executed as many times as needed. The speedup of a query varies according to the type of query (SELECT, INSERT, UPDATE, DELETE) and the complexity of the query (whether more than one table is involved).

• Batch updates
For a large scale scientific application, the application and the database normally reside on physically distributed machines. Substantial network latency can lead to very inefficient query execution. JDBC provides the batch update approach, which can decrease the network latency effect by executing a number of queries in one network round trip. The query processing, however, is not necessarily faster when using the batch update approach instead of the PreparedStatement. The trade-off comes in two forms: (1) batch updates can only be combined with a Statement object, and (2) the data needs to be sent from the client to the server in one large round trip. The batch update approach offers more benefit when the network connection is slow.

• Transaction management
In SQL terms, a transaction is a series of operations that must be executed as a single logical unit. When a connection is created using JDBC, the database is set in autocommit mode and each SQL statement is treated as a separate transaction. In order to allow two or more statements to be grouped into a transaction, the autocommit mode needs to be disabled using Connection.setAutoCommit(false). Treating a group of operations as a transaction is a safe way to guarantee the integrity of a database. For example, if a bank customer wishes to transfer funds from a savings account to a checking account, two update operations need to be sent. If these two calls are treated as two separate transactions, one may succeed while the other fails due to the network or other factors. This results in data that is not consistent in the database.

Explicitly committing a group of SQL statements is not only a safe approach, but also has a large performance impact, because the overhead of the commit operation is reduced. In the NOM model, all the data and information pertaining to the reacted molecules in the system are stored in the database at the end of each time step. The insertions of the data for all the reacted molecules at each time step are treated as one transaction to ensure data consistency in the system.

The NOM core simulation engine connects to the Oracle database using the JDBC thin driver. Figure 2 shows the performance comparison for data insertions using five approaches. The simulation runs for 96 time steps, with a total of 1782 insertions. The benchmark consists of five cases. In case 1, each insertion for each reacted molecule was treated as a single transaction with a Statement object. In case 2, each insertion for each reacted molecule was treated as a single transaction with a PreparedStatement object. In case 3, a group of insertions for every reacted molecule in one time step is treated as a single transaction with a Statement object. In case 4, a group of insertions for every reacted molecule in one time step is treated as a single transaction with a PreparedStatement object. In case 5, a batch update approach is applied.

4.4 Parallel data output with Java threads

Large-scale scientific applications always involve outputting large data sets. They are not only CPU bound processes, but also I/O bound processes, especially when the database server and the application server are on different machines. This is a common architecture design in large-scale scientific applications.

If the simulation server has multiple processors, multithreading can be used to overlap the computation and communication. More specifically, parallelism can be achieved by overlapping the computation and the I/O. There are, however, trade-offs between the simplicity of programming and the performance. When an application involves not only data reads but also data writes, several programming issues need to be considered to prevent deadlock and race conditions.

In the NOM simulation model, large amounts of data need to be written to the database at each time step.
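The per-time-step write path from case 4 — one PreparedStatement reused for every row, and one transaction per time step — can be sketched as follows; the table name, column names, and JDBC URL handling are assumptions, since the paper does not publish its schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TimeStepWriter {
    // Hypothetical table and columns standing in for the NOM schema.
    static String insertSql() {
        return "INSERT INTO reacted_molecule (time_step, molecule_id, mass) VALUES (?, ?, ?)";
    }

    // One transaction per time step: the statement is parsed once and
    // executed per row, and a single commit covers all of the step's rows.
    static void writeStep(Connection con, int step, int[] ids, double[] mass) throws SQLException {
        con.setAutoCommit(false);                  // group the inserts into one transaction
        try (PreparedStatement ps = con.prepareStatement(insertSql())) {
            for (int i = 0; i < ids.length; i++) {
                ps.setInt(1, step);
                ps.setInt(2, ids[i]);
                ps.setDouble(3, mass[i]);
                ps.executeUpdate();
            }
            con.commit();                          // one round of commit overhead per step
        } catch (SQLException e) {
            con.rollback();                        // keep the time step all-or-nothing
            throw e;
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length == 0) return;              // needs a real JDBC URL to run
        try (Connection con = DriverManager.getConnection(args[0])) {
            writeStep(con, 1, new int[]{1, 2}, new double[]{12.0, 18.0});
        }
    }
}
```

The rollback in the catch block is what preserves the per-step consistency the text describes: either all of a step's reacted molecules are recorded, or none are.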
The average time for one record insertion using a PreparedStatement object with transaction management is 4.7 milliseconds, as shown in the previous section. In order to parallelize the data writes to the database, a buffer (FIFO queue) is allocated and an extra thread is created. While the computational thread adds objects to the queue, the I/O thread removes objects from the queue and writes the data to the database. If there are no objects in the queue, the data-writing thread executes busy waiting. If the number of objects in the queue is equal to the queue size, the computation thread waits. In order to safely add to and remove from the queue, all accesses to the queue are synchronized.

The NOM simulation model has been tested and run for various numbers of time steps, from 96 to 579. Refer to Figure 3. By using a separate thread for data output, there is an average 1.3 speedup relative to the single-thread model for the NOM application.
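A minimal sketch of this producer/consumer arrangement: the paper guards a LinkedList with synchronized blocks and busy waiting, while this version uses a bounded ArrayBlockingQueue (a later java.util.concurrent class) to get the same wait-when-full/wait-when-empty behavior; the database write itself is stubbed out:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OverlappedOutput {
    static final int POISON = -1; // sentinel telling the writer thread to stop

    // Returns how many records the I/O thread consumed.
    static int run(int records) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(64);
        int[] written = new int[1];
        Thread io = new Thread(() -> {             // stands in for the database writer
            try {
                int r;
                while ((r = buffer.take()) != POISON) {   // blocks when the buffer is empty
                    // ... here the real model issues the ~4.7 ms JDBC insert ...
                    written[0]++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        io.start();
        try {
            for (int i = 0; i < records; i++) buffer.put(i); // blocks when the buffer is full
            buffer.put(POISON);
            io.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return written[0];
    }

    public static void main(String[] args) {
        System.out.println(OverlappedOutput.run(1000)); // prints 1000
    }
}
```

The bounded capacity reproduces the paper's rule that the computation thread waits when the queue is full, which stops the simulation from racing arbitrarily far ahead of the slower database writes.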

Figure 2: Comparison of data insertions using five approaches in the NOM simulation model.

Case 4, the PreparedStatement object with transaction management, has the best performance in the NOM application. It offers a 3.03 times speedup relative to case 1 in this benchmark.

Figure 3: Overlap of the computation and I/O using Java threads. The left figure shows the number of data insertions over the time steps. The right figure shows the speedup from using a separate thread for data output.

4.5 Choosing a JVM

Various Java Virtual Machines (JVMs), such as the IBM JVM and the Sun HotSpot JVM, have been implemented in conformance with the Java Virtual Machine Specification [26]. Sun Microsystems implements two types of HotSpot JVM, the Client VM and the Server VM, to meet the different requirements of different applications. Compared with server-side programs, client-side programs often require a smaller RAM footprint and a faster start-up time. The two HotSpot JVMs share the same runtime portion, but the main difference between them is found in their compiler technologies. The Server VM contains a highly advanced adaptive compiler that includes many of the optimization technologies used in C++ compilers [22].

Figure 4: Performance comparison for a NOM application run in the Sun Client VM and the Sun Server VM with different grid sizes.

The NOM simulation model application has been benchmarked using the two different runtime modes of Sun JVM 1.4.1_01 on RedHat Linux 8.0, for 500 time steps and 1500 time steps. Figure 4 shows that as the grid size and the number of time steps increase, the Server VM offers higher performance than the Client VM. The results show that choosing different JVMs can produce significant differences in performance.

Ladd (2003) [8] showed benchmark results for both the Sun and IBM JVMs, versions 1.3.1 and 1.4.1_01, on a Linux platform. According to Ladd, JVM version 1.3.1 has a higher performance level than version 1.4.1, and the IBM JVM has a better performance level than the Sun JVM. Choosing the appropriate JVM for a particular scientific application also involves consideration of the hardware architecture and the operating system.

4.6 Scalability

The scalability of the NOM simulation model involves two aspects: the required total number of simulation time steps, and the grid size. Two parallel programming models are implemented for the NOM simulation model. The Java thread version is implemented using built-in Java threads and runs on a single JVM. The distributed memory model is implemented using the mpiJava library with LAM MPI.

In the sequential implementation for simulating the NOM complex system, the behavior of individual molecules is simulated using the agent-based modeling approach. Molecules reside in cells on a 2D grid, and individual molecules can be transported through the soil medium via water flow. At each time step, molecules can move from one cell to another when a random event occurs. The time for finishing a simulation is largely determined by the number of molecules in the system and the number of time steps.

4.6.1 Java threads version

In order to parallelize the programming model, the original grid has been equally separated into two subset grids (e.g., an 800X300 grid is separated into two 400X300 grids). Two threads are created; each thread has its own grid object to place molecules and a collection to hold molecules. In each time step, the computation for individual molecules is executed on the two threads concurrently.

When a molecule moves across the boundary, the molecule is removed from the grid and placed into a local buffer in the current thread. At the end of the time step, a barrier is used to synchronize the two threads. After both threads reach this state, one of the threads executes the exchange operation by updating the state of the two grids and the two MoleculeLists. The threads are synchronized again before this thread finishes setting the boundary condition. Figure 5 shows the design.

Page 8 condition. Figure 5 shows the design. rank 0 rank 1 rank 2

JVM JVM JVM

JVM a c

Thread Thread time 0 1 step b d i Molecule Molecule List0 List1 Grid 0 Grid 1 Grid 2 a

Right Left Grid send send left b send Grid buffer buffer buffer

send right buffer Exchange Exchange buffer0 buffer1 end Barrier of time Barrier step Thread receive i left block 0 receive receive send buffer Exchange remove buffer and buffer X molecules Thread receive recv 1 right wait

Molecule add molecules List X Barrier

Update Update Update List& List& List& put grid grid grid a molecule b Right to Left Grid new Grid location

begin of Barrier time step Thread Thread i+1 0 1 Figure 6: The design for parallelism of NOM simula- tion model using MPJ.
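The two-thread design described above (local buffers for boundary-crossing molecules, a barrier at the end of each step, and an exchange performed by a single thread) can be sketched with java.util.concurrent.CyclicBarrier, whose barrier action runs exactly once per step in the last-arriving thread. The class, field names, and toy movement rule are illustrative assumptions, not the paper's Swarm-based code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Sketch of the two-thread grid split: each worker moves its molecules,
// parks boundary-crossers in a local buffer, and the CyclicBarrier action
// (run by whichever thread arrives last) performs the exchange.
public class TwoThreadGrid {
    static class Molecule { int x; Molecule(int x) { this.x = x; } }

    final List<Molecule>[] grids = new List[] { new ArrayList<>(), new ArrayList<>() };
    final List<Molecule>[] outBuffers = new List[] { new ArrayList<>(), new ArrayList<>() };
    final int half = 400;  // boundary column between the two 400x300 subgrids

    // Barrier action: both threads are blocked in await() while this runs,
    // so the exchange needs no extra locking.
    final CyclicBarrier barrier = new CyclicBarrier(2, () -> {
        grids[1].addAll(outBuffers[0]); outBuffers[0].clear();
        grids[0].addAll(outBuffers[1]); outBuffers[1].clear();
    });

    void step(int id) throws InterruptedException, BrokenBarrierException {
        List<Molecule> mine = grids[id];
        for (int i = mine.size() - 1; i >= 0; i--) {
            Molecule m = mine.get(i);
            m.x += (id == 0) ? 1 : -1;  // toy move: drift toward the boundary
            boolean crossed = (id == 0) ? m.x >= half : m.x < half;
            if (crossed) { mine.remove(i); outBuffers[id].add(m); }
        }
        barrier.await();  // exchange happens here, then both threads continue
    }
}
```

Each thread calls step(id) once per time step; because the exchange is the barrier action, no single thread needs to be designated the exchanger by hand.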

Figure 5: The design for parallelism of the NOM simulation model using Java threads.

Figure 6: The design for parallelism of the NOM simulation model using MPJ.

4.6.2 MPJ version

In the distributed memory model, each processor runs its own copy of the code. The ModelSwarm object contains a subset of the original grid. These subsets have equal size in order to ensure the computational balance of each node in the cluster. The basic design is similar to the Java thread model: each node in the cluster processes its own computation on its subset of the grid. When a molecule crosses the boundary, it is added into a local buffer. MPI.COMM_WORLD.Barrier is used to synchronize the processes at each time step. Molecule objects that cross the boundary are sent to the neighboring grids using the blocking send and receive modes MPI.COMM_WORLD.Sendrecv, MPI.COMM_WORLD.Send, and MPI.COMM_WORLD.Recv. Figure 6 shows this design. The molecule list and grid on each machine are updated after all the sending and receiving are finished.

4.6.3 Performance results

In order to measure the performance of the parallel implementations, a Linux cluster was built. The cluster consists of 4 PCs; each PC has dual 650 MHz processors and runs the RedHat Linux 8.0 operating system. The Java code was compiled with Sun's JDK 1.4.1_01 and executed on Sun's Java Virtual Machine.

Figure 7 presents the results for the sequential version and the Java thread version of the NOM application, run for 1500 time steps with various grid sizes. Both versions ran on a single dual-CPU Intel computer.

The experiments for parallelizing the NOM simulation model with the distributed memory model were made on the cluster of four PCs. LAM MPI 6.5.9 and mpiJava were used to build the execution environment. The NOM application was run on 2 and 4 machines.

Figure 8 shows the performance comparison between the sequential programming model and the MPJ model running on 4 nodes in the cluster; both models ran for 500 and 1500 time steps. The figure shows that the communication between nodes and the maintenance of the grid and MoleculeList offset the performance gained by distributing the job to different nodes when the problem size is small.

Figure 7: Performance comparison between the sequential programming model and the Java thread model in the NOM application (one thread vs. two threads on a single dual-CPU computer).

Figure 8: Performance comparison between the sequential programming model and the MPJ model running on 4 machines for 1500 time steps.

As the problem size grows (larger grid size or longer time steps), however, a performance improvement appears from distributing the job across nodes. Compared to the ideal performance gain of a factor of 4, the efficiency is low. Figure 9 illustrates the case where the job is distributed on four nodes of the cluster and run for 500 time steps with no synchronization between nodes: instead of sending the molecules that cross the boundary to other nodes, they are wrapped around to the other side of their subset grid. The figure shows that as the grid size increases, the speedup approaches the ideal linear speedup of 4.
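The efficiency remark above follows from the standard definitions of speedup and parallel efficiency. The helper below states them explicitly; the timings used are hypothetical, for illustration only.

```java
// Speedup and parallel efficiency as used in the comparisons above.
public class ParallelMetrics {
    // speedup = sequential time / parallel time
    static double speedup(double tSequential, double tParallel) {
        return tSequential / tParallel;
    }
    // efficiency = speedup / node count; 1.0 would be ideal linear speedup
    static double efficiency(double tSequential, double tParallel, int nodes) {
        return speedup(tSequential, tParallel) / nodes;
    }
    public static void main(String[] args) {
        // hypothetical timings: 400 s sequentially vs. 160 s on 4 nodes
        System.out.println(speedup(400.0, 160.0));       // 2.5
        System.out.println(efficiency(400.0, 160.0, 4)); // 0.625
    }
}
```

So a run that is 2.5x faster on 4 nodes is only 62.5% efficient, which is the gap between the measured and the ideal speedup discussed above.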

5 Conclusion

In this paper, several approaches for improving the performance and scalability of a typical scientific application, the NOM simulation model, have been presented. These approaches are summarized in Table 1. The speedup varies with the problem size and the situation; the speedup values in the table come from the benchmarks described in the previous sections, and each represents the approximate highest performance gain in its benchmark.

Figure 9: Comparison of the performance between the sequential programming model and the MPJ model on 4 machines, without synchronization and molecule exchange.

Besides the approaches listed in the table, using a native code compiler to compile the Java source code to native code can increase the performance of some applications.

Program profiling is a crucial step for high performance computation in Java-based applications. Two major aspects, CPU time and memory usage, need to be monitored.

Selecting the appropriate runtime environment for a particular application is important. Besides the JVMs evaluated here, the IBM JVM is also worth investigating.

Using Java built-in threads to parallelize Java applications is a convenient approach. How much performance can be gained from this parallelism depends on the JVM implementation and on how efficiently the operating system handles Java threads. Our experiments were run on a dual-CPU PC under the Linux operating system; it would be valuable to extend them to the Windows or Solaris operating systems.

Distributing jobs over multiple machines in a cluster environment is an efficient approach for large problems. However, the communication among nodes and the maintenance of the grid and list offset the performance gain. To reduce this overhead, instead of synchronizing all the processes at each time step, they can be synchronized every 5 time steps or more. For this particular application, it is also possible to distribute the job over multiple machines using MPJ and to combine the results in the database at the end of the simulation; there is then no communication at all after the job is distributed. However, validation is necessary for these two approaches.

Table 1: Summary of the performance improvements

Approach              Speedup  Comments
Data structure        2.8      Using ArrayList provides higher overall performance than using LinkedList in the benchmark.
Object reuse          -        Reduces the memory footprint, the garbage collection cycle, and the overhead of object allocation and deallocation. The performance gain is relatively small compared to the overall computation time in the NOM simulation model.
JDBC                  3        JDBC tuning can affect the performance of data I/O.
Parallel data output  1.3      Overlapping the computation and I/O using Java threads can improve the performance.
Java runtime          1.4      The Sun Server VM has higher performance than the Sun Client VM for large-scale scientific applications.
Java threads model    1.1      The performance of the Java threads model largely depends on the JVM implementation and on how efficiently the operating system handles threads.
MPJ model             1.5      Distributing the job to 4 nodes in a cluster gives relatively larger performance gains when the problem size is big and the communication between nodes is minimized. However, the gain is still much lower than the ideal performance gain (4).

References

[1] Mike Ashworth. The potential of Java for high performance applications. In The First International Conference on the Practical Application of Java, pages 19-33, 1999.

[2] J. M. Bull, L. A. Smith, L. Pottage, and R. Freeman. Benchmarking Java against C and Fortran for scientific applications. In Proceedings of the 2001 Joint ACM-ISCOPE Conference on Java Grande, pages 97-105, June 2001.

[3] Swarm development group. http://www.swarm.org.

[4] Yung-Hsin Wang and Szu-Hsuan Ho. Implementation of a DEVS-JavaBean simulation environment. In The 34th Annual Simulation Symposium, pages 333-338, Seattle, Washington, 2001.

[5] Peter H. M. Jacobs, Niels A. Lang, and Alexander Verbraeck. D-SOL: A distributed Java based discrete event simulation architecture. In The 2002 Winter Simulation Conference, pages 793-800, 2002.

[6] Repast. http://repast.sourceforge.net/.

[7] Per Bothner. Compiling Java with GCJ. Linux Journal, Jan 2003. http://www.linuxjournal.com/article.php?sid=4860.

[8] Scott Robert Ladd. Benchmarking compilers and languages for ia32. http://www.coyotegulch.com/reviews/almabench.html, 01 2003.

[9] OpenMP Architecture Review Board. OpenMP C and C++ application programming interface. Technical report, OpenMP Architecture Review Board, 1998. Available from http://www.openmp.org.

[10] Mark Bull and Mark Kambites. JOMP: An OpenMP-like interface for Java. In ACM 2000 Java Grande Conference. ACM, 2000. Available from http://www.epcc.ed.ac.uk/research/jomp.

[11] Sun Microsystems. Java remote method invocation specification. Technical report, Sun Microsystems, 1998. Available at http://java.sun.com/products/jdk/rmi/.

[12] V. Getov, G. von Laszewski, M. Philippsen, and I. Foster. Multiparadigm communications in Java for grid computing. Communications of the ACM, 44(10):118-125, 2001.

[13] MPI. http://www-unix.mcs.anl.gov/mpi/.

[14] MPICH: A portable implementation of MPI. http://www-unix.mcs.anl.gov/mpi/mpich/.

[15] LAM/MPI parallel computing. http://www.lam-mpi.org/.

[16] Bryan Carpenter, Vladimir Getov, Glenn Judd, Anthony Skjellum, and Geoffrey Fox. MPJ: MPI-like message passing for Java. Concurrency: Practice and Experience, 12(11), 2000.

[17] Mark Baker, Bryan Carpenter, Geoffrey Fox, Sung Hoon Ko, and Sang Lim. mpiJava: An object-oriented Java interface to MPI. In International Workshop on Java for Parallel and Distributed Computing, IPPS/SPDP 1999, April 1999.

[18] Steven Morin, Israel Koren, and C. Mani Krishna. JMPI: Implementing the message passing standard in Java. In International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops. IEEE, 2002.

[19] Kivanc Dincer. Ubiquitous message passing interface implementation in Java: jmpi. In 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IEEE, 1999.

[20] Glenn Judd, Mark Clement, and Quinn Snell. DOGMA: Distributed object group management architecture. In Concurrency: Practice and Experience, volume 10. ACM 1998 Workshop on Java for High-Performance Network Computing, 1998.

[21] Steve Cabaniss. Modeling and stochastic simulation of NOM reactions. Working paper, July 2002.

[22] Steve Wilson and Jeff Kesselman. Java Platform Performance: Strategies and Tactics, Appendix B. Addison-Wesley, 2000.

[23] Dennis M. Sosnoski. Java performance programming, part 1: Smart object-management saves the day. JavaWorld, Nov 1999. http://www.javaworld.com/javaworld/jw-11-1999/jw-11-performance.html.

[24] Steve Wilson and Jeff Kesselman. Java Platform Performance: Strategies and Tactics, Appendix A. Addison-Wesley, 2000.

[25] Greg Barish. Building Scalable and High-Performance Java Web Applications Using J2EE Technology. Addison-Wesley, 2002.

[26] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification, Second Edition. Addison-Wesley, 1999.
