Parallel Systems in Symbolic and Algebraic Computation
UCAM-CL-TR-537
ISSN 1476-2986

Technical Report
Number 537

Computer Laboratory

Parallel systems in symbolic and algebraic computation

Mantsika Matooane

June 2002

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500
http://www.cl.cam.ac.uk/

© 2002 Mantsika Matooane

This technical report is based on a dissertation submitted August 2001 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Trinity College.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/TechReports/

Series editor: Markus Kuhn

Abstract

This thesis describes techniques that exploit the distributed memory in massively parallel processors to satisfy the peak memory requirements of some very large computer algebra problems. Our aim is to achieve balanced memory use, which differentiates this work from other parallel systems whose focus is on gaining speedup. It is widely observed that failures in computer algebra systems are mostly due to memory overload: for several problems in computer algebra, some of the best available algorithms suffer from intermediate expression swell, where the result is of reasonable size but the intermediate calculation encounters severe memory limitations. This observation motivates our memory-centric approach to parallelizing computer algebra algorithms.

The memory balancing is based on a randomized hashing algorithm for dynamic distribution of data. Dynamic distribution means that intermediate data is allocated storage space at the time it is created, so the system can avoid overloading some processing elements.

Large-scale computer algebra problems with peak memory demands of more than 10 gigabytes are considered. Distributed memory can scale to satisfy these requirements: for example, the Hitachi SR2201, the target architecture in this research, provides up to 56 gigabytes of memory.

The system has fine granularity: task sizes are small and data is partitioned in small blocks. The fine granularity provides flexibility in controlling memory balance but incurs higher communication costs. The communication overhead is reduced by an intelligent scheduler which performs asynchronous overlap of communication and computation.

The implementation provides a polynomial algebra system with operations on multivariate polynomials and matrices with polynomial entries. Within this framework it is possible to find computations with large memory demands, for example solving large sparse systems of linear equations and Gröbner basis computations. The parallel algorithms that have been implemented are based on the standard algorithms for polynomial algebra. This demonstrates that careful attention to memory management aids the solution of very large problems even without the benefit of advanced algorithms. The parallel implementation can be used to solve larger problems than have previously been possible.

Contents

1 Introduction   9
  1.1 Motivation   10
  1.2 Randomized dynamic distribution   13
  1.3 Algebraic structures   15
  1.4 Outline of the dissertation   16

2 Literature review   19
  2.1 Target architecture   19
  2.2 Performance metrics   25
  2.3 Algorithms for data partitioning   35
  2.4 Computer algebra systems   36
  2.5 Parallel arithmetic   41
  2.6 Sparse systems of linear equations   43
  2.7 Gröbner bases   48
  2.8 Summary   51

3 Data structures and memory allocation   53
  3.1 Data locality   54
  3.2 Granularity   55
  3.3 Randomized memory balancing   56
  3.4 Local memory allocation   58
  3.5 Load factor   60
  3.6 Relocation   60
  3.7 Analysis of the randomized global hashing   62
  3.8 Weighted memory balancing   63
  3.9 The scheduler   66
  3.10 Summary   69

4 Parallel arithmetic   71
  4.1 Multiprecision integer arithmetic   71
  4.2 Parallel polynomial arithmetic   77
  4.3 Summary   80

5 Systems of polynomial equations   83
  5.1 Parallel Gröbner base algorithm   83
  5.2 Parallel sparse linear systems   90
  5.3 Summary   97

6 Conclusions   99
  6.1 Contributions   99
  6.2 Future directions   100
  6.3 Conclusions   101

A CABAL: a short manual   103
  A.1 CommSystem interface   104
  A.2 Polynomial interface   105
  A.3 Bignum interface   106
  A.4 Grobner interface   106

B The Hitachi SR2201   109
  B.1 The processor   109
  B.2 Pseudo vector processing   110
  B.3 Interconnection network   111
  B.4 Mapping of processors to topology   113
  B.5 Inter-process communication   113
  B.6 Memory hierarchy   115
  B.7 Programming environment   115

C Network topologies   117
  C.1 Fully connected network   117
  C.2 Ring network   118
  C.3 Star network   118
  C.4 Tree network   118
  C.5 Mesh network   119
  C.6 Hypercube   119
  C.7 The crossbar   121

D Message passing interface   123
  D.1 Features of MPI   123
  D.2 Communicators   124
  D.3 Point-to-point communication   125
  D.4 Type matching   126
  D.5 Collective communications   126
  D.6 Process groups and topologies   127
  D.7 Comparison of MPI and PVM   127
  D.8 Other communication libraries   128

List of Figures

1.1 Growing memory requirements for a GCD calculation   11
1.2 The comparative performance of memory and CPU   12
1.3 Development of a parallel computer algebra system   16

2.1 Resources in a parallel system   20
2.2 Classes of distributed memory parallel machines   22
2.3 Layering of parallel components   23
2.4 Communication layers   28
2.5 Partitioning of data with different granularity   32
2.6 Block distribution   36
2.7 The Bareiss determinant algorithm   46
2.8 Buchberger's algorithm for Gröbner basis   49

3.1 Communication categories affecting data locality   55
3.2 Fine granularity during multiplication   56
3.3 Randomized memory allocation   57
3.4 Dynamic partitioning at execution time   59
3.5 Margin of imbalance in memory load (κ)   60
3.6 Traffic during multiplication with 2 PEs   64
3.7 Traffic during multiplication with 3 PEs   65
3.8 Local process scheduling   67
3.9 Architecture of a parallel computer algebra system   68

4.1 Kernel components   71
4.2 Multiprecision addition   74
4.3 Parallel polynomial addition   78
4.4 Identifying blocks of the same term   78
4.5 Parallel polynomial multiplication   79
4.6 Execution time when increasing buffer size   80

5.1 Ordering s-pairs below the diagonal   85
5.2 Row algorithm for selecting critical s-pairs   85
5.3 Selecting leading term of a polynomial   87
5.4 Distributed s-polynomial computation   88
5.5 Reduction algorithm   88
5.6 Parallel reduction algorithm   89
5.7 Sparse matrix representation   90
5.8 Recursion tree for minor expansion   94
5.9 Costs at different levels of recursion tree   94
5.10 Parallelizing recursive minor expansion   96

A.1 Application programming interface   104

B.1 Vector inner product   109
B.2 Object code for vector inner product   110
B.3 Sliding windows for pseudo-vector processing   111
B.4 Some additional instructions in SR2201   111
B.5 3D crossbar switching network   112
B.6 Message passing with system copy   114
B.7 Remote direct memory access (rDMA) system   114
B.8 Comparison of some message passing systems   115

C.1 A fully connected network of 5 processors   118
C.2 Ring, linear and star topologies   118
C.3 A binary tree   119
C.4 A 2D mesh network   119
C.5 Hypercubes in 2,3,4 dimensions   120
C.6 A 2D crossbar network   121

D.1 A message passing interface (MPI) packet   ...
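To make the abstract's central idea concrete, the fragment below is a minimal sketch of randomized hashing for dynamic data distribution. It is illustrative only and is not taken from the CABAL implementation described in Appendix A: it assumes a term of a multivariate polynomial is identified by its exponent vector, and the names term_t, hash_term, owner_pe and the constant NVARS are hypothetical.

/* Minimal sketch (not the CABAL source) of randomized hashing for
 * dynamic distribution of polynomial terms across processing elements.
 * A term is identified here only by its exponent vector; hashing the
 * vector spreads storage pseudo-randomly, which is what balances the
 * memory load.  term_t, hash_term, owner_pe and NVARS are illustrative
 * names, not taken from the report. */
#include <stdint.h>

#define NVARS 4                     /* number of variables in the ring */

typedef struct {
    uint32_t exps[NVARS];           /* exponent vector of one term */
} term_t;

/* FNV-1a style 64-bit hash, folded over the exponent words. */
static uint64_t hash_term(const term_t *t)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (int i = 0; i < NVARS; i++) {
        h ^= (uint64_t)t->exps[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Decide, at the moment a term is created, which processing element
 * stores it; no static partition is computed in advance. */
static int owner_pe(const term_t *t, int num_pes)
{
    return (int)(hash_term(t) % (uint64_t)num_pes);
}

On a message-passing machine such as the SR2201, the processing element that creates a term would then send it to the PE returned by owner_pe, so that intermediate expression swell is absorbed evenly across the distributed memory rather than accumulating on any single PE.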