Titanium: Language and Compiler Support for High Productivity Scientific Computing

Web Page: titanium.cs.berkeley.edu/

Participating UC Berkeley Faculty: Katherine Yelick (CS – [email protected]), Susan Graham (CS – [email protected]), Paul Hilfinger (CS – [email protected])

The major barrier to widespread use of scalable parallel machines is the difficulty of programming them. While message-passing libraries have proven very useful for the most dedicated performance programmers, many computational scientists who once used vector or shared-memory machines are not using large-scale clusters, because they are unwilling to make the significant programming investment that message passing requires. Titanium addresses the dual goals of programming ease and high performance by combining modern programming language features with an optimizing compiler. Starting with the Java language, Titanium retains Java's support for user-defined types, inheritance, strong typing, and safe memory management. Evidence from the business and systems programming communities suggests that these features result in higher programmer productivity than programming in C or C++. Titanium is based on Java, but uses an SPMD style of parallelism with a shared address space for high-performance programming, similar to UPC and Co-Array Fortran. The current Titanium compiler runs on most parallel platforms and provides programmers with a uniform abstract machine model that is independent of the underlying hardware. The key additions to Java are described in several papers on the Titanium web page.
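The SPMD execution model mentioned above can be approximated in plain Java: every thread runs the same program text, distinguished only by its id, and the threads synchronize at collective barriers. This is only an illustrative sketch, not Titanium code; the names NPROCS, partial, and run are invented for the example, and java.util.concurrent.CyclicBarrier stands in for Titanium's global barrier.

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative SPMD sketch: NPROCS threads execute the same code,
// write into a shared array (standing in for Titanium's shared
// address space), meet at a collective barrier, and then one
// "reduction" combines the per-processor results.
public class SpmdSketch {
    static final int NPROCS = 4;                  // assumed processor count
    static final int[] partial = new int[NPROCS]; // shared data
    static final CyclicBarrier barrier = new CyclicBarrier(NPROCS);

    public static int run() throws Exception {
        Thread[] ts = new Thread[NPROCS];
        for (int p = 0; p < NPROCS; p++) {
            final int id = p;
            ts[p] = new Thread(() -> {
                partial[id] = id + 1;             // local computation phase
                try {
                    barrier.await();              // collective barrier
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            ts[p].start();
        }
        for (Thread t : ts) t.join();
        int sum = 0;                              // reduction after the barrier
        for (int v : partial) sum += v;
        return sum;
    }
}
```

In Titanium the barrier would be a language-level collective whose correct global use is checked statically, rather than a library call.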
They are:  User-defined immutable classes (often called "value" classes)  Efficient multi-dimensional arrays  Support for rectangular grid-based computations using domain algebra  Unordered loop iteration to allow aggressive optimization  Safe, zone-based memory management  Statically check global synchronization  Library of collective operations (barrier, broadcast, reductions, etc)  Operator overloading  Templates (parameterized classes) The current Titanium implementation runs on a wide range of platforms, including uniprocessors, shared-memory multiprocessors, distributed-memory clusters of uniprocessors or SMPs (CLUMPS), and a number of specific supercomputer architectures (Cray T3E, IBM SP, Origin 2000). The distributed memory backends can utilize a wide variety of high-performance network interconnects, including Active Messages, MPI, IBM’s LAPI, and UDP. Titanium is based on a static compilation model and does not permit dynamic class loading. Portability comes from a two-step compilation process, in which the Titanium compiler (tc) translates the Titanium program into C with calls to a communication layer, with the C code compiled by either gcc or a vendor-supplied compiler. In spite of the use of two separate steps, programs can be easily compiled using a single compiler command.

[Figure: Speedup on the NOW (Sun cluster) vs. number of processors]

Figure 1: The immersed boundary method is used to simulate fluid flow in biological systems, including the heart, the inner ear, and blood clotting. The left-hand figure shows the fluid flow through a simple torus, and the right-hand figure shows early speedup numbers on a Sun cluster.

Several medium-scale applications exist in Titanium, including solvers for computational fluid dynamics based on adaptive mesh refinement, a micro-array selection application from computational biology that uses dynamic load balancing, a finite-element problem from earthquake engineering, and a tree-based n-body problem from astrophysics. In addition, the Titanium group has developed the first distributed-memory parallel implementation of the immersed-boundary method, which models systems of elastic fibers in an incompressible fluid. The figure above shows the result of a synthetic problem and some early performance numbers taken on a Sun cluster called the NOW machine. We have run the full human heart simulation on the Cray T3E and the IBM SP.

Our future work includes compiler and language support for irregular computations, new runtime implementations for future architectures, and more aggressive communication and memory-hierarchy optimizations based on a parameterized machine model. The current array abstractions in Titanium are designed for block-structured algorithms, and the compiler optimizes the computational patterns that arise in nearest-neighbor computations such as multigrid by reorganizing loops to reduce cache misses and improve register reuse. For other computational patterns, such as those arising in unstructured grids and some particle methods, more support is required to reduce the cost of random array accesses. This work has two aspects: runtime support for efficient irregular communication, and single-processor memory-hierarchy optimization, as in the BeBOP project.
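The loop reorganization described above can be sketched as cache blocking: the same nearest-neighbor sweep is restructured into tiles so that each block of the grid stays resident in cache while it is updated. This is a hand-written illustration of the kind of transformation the text attributes to the compiler, not Titanium compiler output; the block size B is an assumed tuning parameter.

```java
// Sketch of loop tiling for a 5-point relaxation sweep: the i/j
// iteration space is split into B-by-B blocks to reduce cache
// misses and improve register reuse, without changing the result.
public class TiledStencil {
    static final int B = 32;   // assumed cache-block size (tunable)

    public static double[][] relax(double[][] a) {
        int n = a.length, m = a[0].length;
        double[][] b = new double[n][m];
        for (int ii = 1; ii < n - 1; ii += B) {
            for (int jj = 1; jj < m - 1; jj += B) {
                int iMax = Math.min(ii + B, n - 1);
                int jMax = Math.min(jj + B, m - 1);
                for (int i = ii; i < iMax; i++) {
                    for (int j = jj; j < jMax; j++) {
                        b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                                        + a[i][j - 1] + a[i][j + 1]);
                    }
                }
            }
        }
        return b;
    }
}
```

The transformation is legal precisely because the sweep's iterations are independent, which is the property Titanium's unordered loops expose to the compiler.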
The existing applications in Titanium are also useful as application benchmarks for evaluating future research architectures and programming models being proposed by Sun researchers. We are working with the Sun HPCS Programming Models group to better understand the limitations of the Titanium model and the trade-offs in programmability and performance of other Java-like languages for high-performance computing.

Desired Support: Graduate student support.