Parallel and distributed Gröbner bases computation in JAS

Heinz Kredel
IT-Center, University of Mannheim
[email protected]

18. July 2010

Abstract

This paper considers parallel Gröbner bases algorithms on distributed memory parallel computers with multi-core compute nodes. We summarize three different Gröbner bases implementations: shared memory parallel, pure distributed memory parallel, and distributed memory combined with shared memory parallelism. The last algorithm, called distributed hybrid, uses only one control communication channel between the master node and the worker nodes and keeps polynomials in shared memory on a node. The polynomials are transported asynchronously to the control flow of the algorithm in a separate distributed data structure. The implementation is generic and works for all implemented (exact) fields. We present new performance measurements and discuss the performance of the algorithms.

1 Introduction

We summarize parallel algorithms for computing Gröbner bases on today's cluster parallel computing environments in a Java computer algebra system (JAS), which we have developed in the last years [28, 29]. Our target hardware are distributed memory parallel computers with multi-core compute nodes. Such computing infrastructure is predominant in today's high performance computing (HPC) clusters. The implementation of Gröbner bases algorithms is one of the essential building blocks for any computation in algebraic geometry. Our aim is an implementation in a modern object oriented programming language with generic data types, as provided by the Java programming language. Besides the sequential algorithm, we consider three Gröbner bases implementations: multiple threads using shared memory, pure distributed memory with communication of polynomials between compute nodes, and distributed memory combined with multiple threads on the nodes. The last algorithm, called distributed hybrid, uses only one control communication channel between the master node and the worker nodes and keeps polynomials in shared memory on a node. The polynomials are transported asynchronously to the control flow of the algorithm in a separate distributed data structure. In this paper we present new performance measurements on a grid cluster [6] and discuss the performance of the algorithms.

An object oriented design of a Java computer algebra system (called JAS) as a type-safe and thread-safe approach to computer algebra is presented in [31, 23, 24, 26]. JAS provides a well designed software library using generic types for algebraic computations, implemented in the Java programming language. The library can be used like any other Java software package, or it can be used interactively or interpreted through a Jython (Java Python) front-end. The focus of JAS is at the moment on commutative and solvable polynomials, Gröbner bases, greatest common divisors and applications. JAS contains interfaces and classes for basic arithmetic of integers, rational numbers and multivariate polynomials with integer or rational number coefficients.
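To make the use of generic types concrete, the following minimal sketch computes a Gröbner basis over the rational numbers. The class names (GenPolynomialRing, GenPolynomial, BigRational, GroebnerBaseSeq) are those of the JAS library described in the references above, but the exact constructor and parser syntax shown here are assumptions and may differ between JAS versions; the two generator polynomials are an arbitrary illustration.

    import java.util.Arrays;
    import java.util.List;

    import edu.jas.arith.BigRational;
    import edu.jas.gb.GroebnerBaseSeq;
    import edu.jas.poly.GenPolynomial;
    import edu.jas.poly.GenPolynomialRing;

    public class GBExample {
        public static void main(String[] args) {
            // polynomial ring Q[x,y,z] with rational number coefficients
            GenPolynomialRing<BigRational> ring =
                new GenPolynomialRing<BigRational>(new BigRational(),
                                                   new String[] { "x", "y", "z" });
            // generators of an example ideal, parsed from strings
            GenPolynomial<BigRational> f1 = ring.parse("x^2 + y * z - 1");
            GenPolynomial<BigRational> f2 = ring.parse("x * y - z^2");
            List<GenPolynomial<BigRational>> F = Arrays.asList(f1, f2);
            // sequential Groebner base computation
            List<GenPolynomial<BigRational>> G =
                new GroebnerBaseSeq<BigRational>().GB(F);
            System.out.println("G = " + G);
        }
    }

The parallel and distributed algorithms summarized in section 3 are intended to be used in the same way: only the Gröbner base engine object changes, while the generic polynomial data types stay the same.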
1.1 Parallel Gröbner bases

The computation of Gröbner bases (via the Buchberger algorithm) solves an important problem for computer algebra [5]. These bases play the same role for the solution of systems of algebraic equations as the LU-decomposition, obtained by Gaussian elimination, plays for systems of linear equations. Unfortunately the computation of such polynomial bases is notoriously hard, both with sequential and parallel algorithms, so any improvement of this algorithm is of great importance. For a discussion of the problems with parallel versions of this algorithm, see the introduction in [28]. For convenience, this paper contains summaries and revised parts of [28, 29] to explain the new performance measurements. Performance figures and tables are presented throughout the paper; explanations are in section 4.2.

1.2 Related work

In this section we briefly summarize the related work. Related work on computer algebra libraries and an evaluation of the JAS library in comparison to other systems can be found in [24, 25]. Theoretical studies on parallel computer algebra focus on parallel factoring and on problems which can exploit parallel linear algebra [32, 41]. Most reports on experiences and results of parallel computer algebra are from systems written from scratch or where the system source code was available. A newer approach of a multi-threaded polynomial library implemented in C is, for example, [9]. From the commercial systems there are reports about Maple [8] (workstation clusters) and Reduce [38] (automatic compilation, vector processors). Multiprocessing support for Aldor (the programming language of Axiom) is presented in [36]. Grid-aware computer algebra systems are presented, for example, in [35]. The SCIEnce project works on Grid facilities for the symbolic computation systems GAP, KANT, Maple and MuPAD [40, 44]. Java grid middle-ware systems and parallel computing platforms are presented in [18, 16, 7, 20, 2, 4]. For further overviews see section 2.18 in the report [15] and the tutorial [39].

For the parallel Buchberger algorithm the idea of parallel reduction of S-polynomials seems to have originated with Buchberger and was in the folklore for a while. First implementations have been reported, for example, by Hawley [17] and others [33, 43, 19]. For triangular systems multi-threaded parallel approaches have been reported in [34, 37].

1.3 Outline

Due to limited space we must assume that you are familiar with Java, object oriented programming and the mathematics of Gröbner bases [5]. Section 2 introduces the expected and developed infrastructure to implement parallel and distributed Gröbner bases. The Gröbner base algorithms are summarized in section 3. Section 4 evaluates several aspects of the design, namely termination detection, performance, the `workload paradox', and selection strategies. Finally, section 5 draws some conclusions and shows possible future work.

2 Hard- and middle-ware

In this section we summarize the computing hardware and middle-ware components required for the implementation of the presented algorithms. The suitability of the Java computing platform for parallel computer algebra has been discussed, for example, in [28, 29].

2.1 Hardware

Common grid computing infrastructure consists of nodes of multi-core CPUs connected by a high-performance network. We have access to the bwGRiD infrastructure [6]. It consists of 2×140 8-core CPU nodes at 2.83 GHz with 16 GB main memory, connected by a 10 Gbit InfiniBand and a 1 Gbit Ethernet network. The operating system is Scientific Linux 5.0, with shared Lustre home directories and a PBS batch system with the Maui scheduler.

The performance of the distributed algorithms depends on the fast InfiniBand networking hardware. We have also done performance tests with normal Ethernet networking hardware. The Ethernet connection of the nodes in the bwGRiD cluster is 1 Gbit to a switch in a blade center containing 14 nodes, with a 1 Gbit connection of the blade centers to a central switch. We could not obtain any speedup for the distributed algorithm over the Ethernet connection; only with the InfiniBand connection can a speedup be reported.

The InfiniBand connection is used with the TCP/IP protocol. Support for the direct InfiniBand protocol, by-passing the TCP/IP stack, will eventually be available in JDK 1.7 in 2010/11. The evaluation of direct InfiniBand access will be future work.

2.2 Execution middle-ware

In this section we summarize the execution and communication middle-ware used in our implementation; for details see [28, 29]. The execution middle-ware is general purpose and independent of a particular application like Gröbner bases and can thus be used for many kinds of algebraic algorithms.

[Figure 1: Middleware overview — master node: GBDist, DistributedThreadPool, Reducer Server, GB(), DHT client; client node: ExecutableServer, DistributedThread, Reducer Client, clientPart(), DHT client; DHT server; nodes connected by InfiniBand.]

The infrastructure for the distributed partner processes uses a daemon process, which has to be set up via the normal cluster computing tools or by some other means. The cluster tools available at Mannheim use PBS (portable batch system). PBS maintains a list of nodes allocated for a cluster job, which is used in a loop with ssh calls to start the daemon on the available compute nodes. The lowest-level class ExecutableServer implements the daemon processes, see figure 1. To be able to distinguish messages between specific threads on both sides we use tagged message channels. Each message is sent together with a tag (a unique identifier), and the receiving side can then wait only for messages with specific tags. For details on the implementation and alternatives see [29].

2.3 Data structure middle-ware

We try to reduce communication cost by employing a distributed data structure with asynchronous communication, which can be overlapped with computation. Using marshalled objects for transport, the object serialization overhead is minimized. This data structure middle-ware is independent of a particular application like Gröbner bases and can be used for many kinds of applications.

The distributed data structure is implemented by the class DistHashTable, called `DHT client' in figure 1, see also figure 7. It implements the Map interface and extends AbstractMap from the java.util package with type parameters and can so be used in a type-safe way. In the current program version we use a distributed hash table with centralized control; a decentralized version will be future work. For the usage of the data structure the clients only need the network node name and port of the master. In addition there are methods like getWait(), which expose the different semantics of a distributed data structure, as getWait() blocks until an element for the key has arrived.
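The blocking semantics of getWait() can be sketched with a small local model. This is not the JAS DistHashTable implementation; it is a hypothetical, single-JVM sketch (the class name DistHashTableSketch and all details are our own) that only shows how a Map-style get can be combined with a blocking variant which waits until a put, in the real middle-ware triggered by a message from the DHT server, has delivered the value.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical local model of the blocking-get semantics of a DHT client.
    public class DistHashTableSketch<K, V> {
        private final ConcurrentMap<K, V> table = new ConcurrentHashMap<K, V>();
        private final Object lock = new Object();

        // non-blocking lookup, as in java.util.Map
        public V get(K key) {
            return table.get(key);
        }

        // blocking lookup: waits until an entry for key has arrived
        public V getWait(K key) throws InterruptedException {
            synchronized (lock) {
                V value = table.get(key);
                while (value == null) {
                    lock.wait();           // released when put() notifies
                    value = table.get(key);
                }
                return value;
            }
        }

        // called when an entry arrives (locally here, from the network in JAS)
        public void put(K key, V value) {
            table.put(key, value);
            synchronized (lock) {
                lock.notifyAll();          // wake up all waiting getWait() calls
            }
        }
    }

With this blocking semantics a worker that asks for a polynomial which has not yet been transported simply waits instead of polling, which is what allows the communication to be overlapped with computation.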
3 Gröbner bases

In this section we summarize the sequential, the shared memory parallel and the distributed versions of the algorithms to compute Gröbner bases as described