DESIGN OF A PARALLEL MULTI-THREADED PROGRAMMING MODEL FOR MULTI-CORE PROCESSORS

By Muhammad Ali Ismail

Thesis submitted for the Degree of Doctor of Philosophy

Department of Computer and Information Systems Engineering

NED University of Engineering & Technology

University Road, Karachi - 75270, Pakistan

2011

DESIGN OF A PARALLEL MULTI-THREADED PROGRAMMING MODEL FOR MULTI-CORE PROCESSORS

PhD Thesis

By Muhammad Ali Ismail

Batch: 2008-2009

Project Advisor:

Prof. Dr. Shahid Hafeez Mirza

Project Co-supervisor:

Prof. Dr. Talat Altaf

2011

Department of Computer and Information Systems Engineering

NED University of Engineering & Technology

University Road, Karachi - 75270, Pakistan

Certificate

Certified that the thesis entitled "DEVELOPMENT OF A NEW PARALLEL MULTI-THREADED PROGRAMMING MODEL FOR MULTI-CORE PROCESSORS", which is being submitted by Mr. Muhammad Ali Ismail for the award of the degree of Doctor of Philosophy in the Computer & Information Systems Engineering Department of NED University of Engineering and Technology, is a record of the candidate's own original work carried out by him under our supervision and guidance. The work incorporated in this thesis has not been submitted elsewhere for the award of any other degree.

Prof. Dr. Talat Altaf                          Prof. Dr. Shahid Hafeez Mirza
Dean (ECE), NEDUET                             Professor, UIT
PhD Co-supervisor                              PhD Supervisor

Acknowledgements

In the first place, I would like to thank the Almighty Allah for His countless blessings. In fact, all praise and glory belong to Him, and none has the right or worth to be worshipped but He.

Next, I would like to acknowledge my home university, NED University of Engineering and Technology, for giving me the opportunity and funding to conduct this PhD research.

I would also like to express my gratitude to my mentor and supervisor, Prof. Dr. Shahid Hafeez Mirza, for his generous supervision. His continuous support, encouragement, guidance, advice and comments helped me stay in the right direction to complete this research.

I am also very grateful to my co-supervisor, Prof. Dr. Talat Altaf, for his very kind advice, support and motivation throughout my PhD research.

Many thanks to my department, Computer and Information Systems Engineering, including my colleagues and its administrative and technical staff, for providing me with such a supportive and productive work environment.

Last but not least, special thanks to my family, particularly my parents, for their endless prayers and support.

CONTENTS

Abstract……………………………………………………………………………………………………………………..………….. v

List of Publications…………………………………………………………………………………………………………………. vi

List of Figures………………………………………………………………………………………………………..……..………… vii

List of Tables………………………………………………………………………………………………………………………..… x

1. Introduction…………………………………………………………………………………………………………………….. 1

1.1. Contributions of Dissertation 1
1.1.1. Multi-level Cache System for Multi-core Processors ('LogN+1' and 'LogN' Cache Models) 2
1.1.2. Multi-level Cache Simulator for Multi-core Processors ('MCSMC') 3
1.1.3. Multi-threaded Parallel Programming Model for Multi-core Processors ('SPC3 PM') 3
1.2. The Thesis Organization 4

2. Motivation and Challenges with Multi-Core Processors……………………………………………………. 5

2.1. Architectural Challenges 6
2.1.1. Memory Hierarchy 6
2.1.1.1. Cache Levels 7
2.1.1.2. Synchronization 7
2.1.1.3. False Sharing 8
2.1.1.4. Spinning 8
2.1.1.5. Communication Minimization 8
2.1.2. Architectural Support for Compilers / Programming Models 9
2.2. Software Challenges 9
2.2.1. Parallel Programming Models 9
2.2.2. Parallel Algorithm Models 10
2.2.2.1. Data Parallel Models 10
2.2.2.2. Task Graph Model 11
2.2.2.3. Work Pool Model 11
2.2.2.4. Master-Slave Model 11
2.2.2.5. Pipeline or Producer-Consumer Model 11
2.2.3. Decomposition Techniques 12
2.2.3.1. Recursive Decomposition 12
2.2.3.2. Data Decomposition 13
2.2.3.3. Exploratory Decomposition 13
2.2.3.4. Speculative Decomposition 13
2.2.4. Levels of Parallelism 13
2.2.5. Compiler Optimization 15
2.2.5.1. Parallelism 15
2.2.5.2. Removal of Data Dependencies 16


2.2.5.3. Memory Space 16
2.2.6. Related Tools for Performance and Parallel Debugging 16
2.2.7. Regular and Irregular Problems 17
2.3. Performance and Scalability Issues 18
2.4. Summary 19

3. 'LogN+1' and 'LogN' Cache Models: A Binary Tree Based Cache System for Multi-Core Processors……………………………………………………………………………………………………………………….. 20

3.1. Present 3-level Cache System and Related Improvements for Multi-core Processors 20
3.2. 'LogN+1' and 'LogN' Cache Models 22
3.2.1. Design Concept 23
3.2.2. Cache Hierarchy and Cache Size 23
3.2.3. Cache Hierarchy and Cache Frequency (Cycle Time) 28
3.3. Performance Evaluation 30
3.3.1. Average Cache Access Time 30
3.3.2. Probability of Cache Hits 32
3.3.3. Result Analysis 34
3.4. Summary 35

4. Queuing Modeling of 'LogN+1' and 'LogN' Cache Models……………….………………………………… 36

4.1. Queuing Theory and Kendall's Notation 36
4.2. M/D/C/K-FIFO Queuing Model for LogN+1 and LogN Cache Models 37
4.2.1. Basic Model 38
4.2.2. Performance Equations 39
4.2.2.1. Average Data Request Rate 40
4.2.2.2. Average Cache Utilization 41
4.2.2.3. Average Individual Cache Access Time 42
4.2.2.4. Average Request Queue Length 42
4.2.2.5. Overall Average Cache System Access Time 42
4.3. Queuing Model for 3-Level Cache System 43
4.4. Performance Evaluation 45
4.4.1. LogN+1 Model 45
4.4.2. LogN Model 48
4.4.3. Present 3-Level Cache System 48
4.4.4. Result Analysis 52
4.5. Summary 56

5. Simulation of 'LogN+1' and 'LogN' Cache Models Using 'MCSMC'…….………………………………. 57

5.1. Cache Simulation 57
5.2. MCSMC (Multi-level Cache Simulator for Multi-Cores) 58
5.2.1. Input Parameters Set 58
5.2.2. Software Modules 59
5.2.2.1. Cache Architecture Generator 60
5.2.2.2. Program Scheduler 60
5.2.2.3. Trace Generator 60
5.2.2.4. Replacement Policy Module 62


5.2.2.5. Results Generation 62
5.2.3. Serial / Parallel Execution of MCSMC 62
5.2.4. Comparison with CACTI Cache Simulator 65
5.3. Performance Evaluation 67
5.3.1. Simulation Environment 67
5.3.2. Result Analysis 67
5.4. Summary 72

6. SPC3 PM: A Multi-threaded Parallel Software Development Environment for Multi-Core Processors………………………………………………………...…………………………… 73

6.1. Currently Available Parallel Programming Tools 73
6.1.1. Commercially Available Multi-Core Application Development Aids 73
6.1.1.1. Intel's Multi-Core Application Development Aids 74
6.1.1.2. Microsoft's Multi-Core Application Development Aids 76
6.1.1.3. Sun's Multi-Core Application Development Aids 76
6.1.1.4. Other Commercial Multi-Core Application Development Aids 77
6.1.2. Other Standard Shared-Memory Programming Approaches Used for Multi-core Processors 78
6.1.2.1. Erlang 78
6.1.2.2. POSIX Threads (Pthreads) 79
6.1.2.3. OpenMP 79
6.1.3. Research Oriented Multi-Core Application Development Tools 79
6.1.4. Current Multi-Core Research Groups 81
6.1.5. Summary 83
6.2. Key Features of SPC3 PM 84
6.3. Design Concepts 85
6.3.1. Design Issues with Multi-Core Programming 86
6.3.2. Task Based Parallelism 89
6.3.3. Thread Level Parallelism 89
6.3.4. Decomposition Techniques 90
6.3.5. Task Scheduling 92
6.3.6. Execution Modes 93
6.3.7. Types of Problem Supported 93
6.3.8. Data Sharing 94
6.3.9. Compilation 94
6.4. Programming with SPC3 PM 96
6.4.1. Rules for Task Decomposition 96
6.4.2. Properties of a Task 97
6.4.3. Program Structure 99
6.4.4. SPC3 PM Library 100
6.4.4.1. Serial Function 100
6.4.4.2. Parallel Function 102
6.4.4.3. Concurrent Function 104
6.5. Performance Evaluation 106
6.5.1. Matrix Multiplication Algorithm 107
6.5.2. Serial Function 109
6.5.3. Parallel Function 113
6.5.4. Concurrent Function 119
6.6. Summary 125


7. Solving Travelling Salesman Problem using SPC3 PM..………………………………………………………. 126

7.1. Travelling Salesman Problem (TSP) 126
7.1.1. TSP Applications 126
7.1.2. TSP Solutions 128
7.1.2.1. Exact Algorithms 129
7.1.2.2. TSP Heuristics 129
7.1.2.3. Meta-Heuristics 129
7.1.2.4. Hyper-Heuristics 130
7.2. Lin-Kernighan Heuristic 130
7.2.1. Basic Lin-Kernighan Heuristic Algorithm (LKH) 130
7.2.2. Modified Lin-Kernighan Heuristic Algorithm (LKH-1) 133
7.2.3. Lin-Kernighan Heuristic Algorithm with General k-opt Sub-moves (LKH-2) 134
7.3. LKH-2 Software 135
7.3.1. Execution of LKH-2 Software 135
7.3.2. Flow Chart for LKH-2 Software Processing 138
7.4. Parallelization of LKH-2 Software using SPC3 PM 139
7.4.1. Flow Chart for LKH-2 Software Processing Parallelized using SPC3 PM 141
7.5. Performance Evaluation 142
7.5.1. TSP Library (TSPLIB) 142
7.5.2. Result Analysis 143
7.6. Summary 150

8. Conclusions and Future Work……………………………………………….…………………………………………. 151

8.1. Summary 151
8.2. Future Work 154

Appendix A: List of TSP instances in TSPLIB...... 156

References……………………………………………………………………………….……………………………………………. 159


Abstract

With the arrival of Chip Multi-Processors (CMPs), every processor now has built-in parallel computational power, which can be fully utilized only if the program in execution is written accordingly. Moreover, the existing memory systems and parallel development tools do not provide adequate support for general-purpose multi-core programming and are unable to utilize all available cores efficiently. This research is an attempt to come up with some solutions for the challenges that multi-core processing is currently facing. This thesis contributes by proposing a novel multi-level cache system design, the "LogN+1 and LogN cache models", for multi-core processors. The proposed cache system is based on a binary tree data structure and can replace the existing 3-level cache system in order to minimize memory-contention-related problems. This thesis also contributes by developing a new multi-threaded parallel programming model, "SPC3 PM" (Serial, Parallel and Concurrent Core to Core Programming Model), for multi-core processors. The SPC3 PM is a serial-like task-oriented parallel programming model which consists of a set of rules for algorithm decomposition and a library of primitives to exploit thread-level parallelism and concurrency on multi-core processors. The programming model works equally well for different classes of problems, including basic, complex, regular and irregular problems. Furthermore, a parallel trace-driven multi-level cache simulator, "MCSMC" (Multi-level Cache Simulator for Multi-Cores), has also been developed during this PhD research. It is a new addition to the family of cache simulators, with which one can simulate the present 3-level cache system or any customized multi-level cache system. Its parallel execution makes it more efficient and less time consuming, and its large set of input parameters provides a wide range of simulation scenarios.


List of Publications and US Patent

So far, this research has produced the following international journal and conference publications, plus one US patent application.

[1] M. A. Ismail, S. H. Mirza, T. Altaf, "A Parallel and Concurrent Implementation of Lin-Kernighan Heuristic (LKH-2) for Solving Traveling Salesman Problem for Multi-Core Processors using SPC3 Programming Model", Intl. J. of Adv. Comp. Sc. & App. (IJACSA), Vol. 2(7), 2011, pp. 34-43.

[2] M. A. Ismail, S. H. Mirza, T. Altaf, "Concurrent Matrix Multiplication on Multi-core Processors", Intl. J. of Comp. Sc. & Sec. (IJCSS), Vol. 5(4), 2011, pp. 208-220.

[3] M. A. Ismail, S. H. Mirza, T. Altaf, "LogN+1 and LogN Cache System for Multi-Core Processors", application accepted by HEC for US patent, May 07, 2010.

[4] M. A. Ismail, S. H. Mirza, T. Altaf, "LogN+1 and LogN Model: A Binary Tree Based Multi-Level Cache System for Multi-Core Processors", Intl. J. Comp. Sys. Sc. & Engg., submitted June 4, 2010; accepted in the first phase, result of the final phase awaited.

[5] M. A. Ismail, S. H. Mirza, T. Altaf, "Design of a Cache Hierarchy for LogN and LogN+1 Model for Multi-Level Cache System for Multi-Core Processors", in Proc. of International Conference on Frontiers of Information Technology (FIT)-09, ACM, Dec 16-18, 2009, Pakistan.

[6] M. A. Ismail, S. H. Mirza, T. Altaf, "Binary Tree Based Multi-level Cache System for Multi-core Processors", in Proc. HPCNCS-09, July 13-16, 2009, Orlando, Florida, USA, pp. 146-152.


List of Tables

Table-3.1: Cache memory size (KB) at different cache levels for different number of cores (LogN+1 Model) using GP...... 25
Table-3.2: Cache memory size (KB) at different cache levels for different number of cores (LogN Model) using GP...... 26
Table-3.3: Cache memory size (KB) at different cache levels for different number of cores (LogN+1 Model) using AP...... 26
Table-3.4: Cache memory size (KB) at different cache levels for different number of cores (LogN Model) using AP...... 27
Table-3.5: Cache frequency (GHz) at different cache levels for different number of cores (LogN+1 Model)...... 29
Table-3.6: Cache frequency (GHz) at different cache levels for different number of cores (LogN Model)...... 29
Table-3.7: Cache cycle time (nsec) at different cache levels for different number of cores (LogN+1 Model)...... 29
Table-3.8: Cache cycle time (nsec) at different cache levels for different number of cores (LogN Model)...... 29
Table-3.9: Average cache access time for LogN+1, LogN and 3-level cache system with different number of cores...... 31

Table-4.1: Summary of the request rate equations for different cache levels...... 41
Table-4.2: Probability of cache hit at different cache levels for different number of cores (LogN+1 Model)...... 46
Table-4.3: Probability of cache miss at different cache levels for different number of cores (LogN+1 Model)...... 46
Table-4.4: Request rate at different cache levels for different number of cores (LogN+1 Model)...... 46
Table-4.5: Cache utilization of different cache levels for different number of cores (LogN+1 Model)...... 47
Table-4.6: Individual average cache access time (nsec) at different cache levels for different number of cores (LogN+1 Model)...... 47
Table-4.7: Individual request queue length at different cache levels for different number of cores (LogN+1 Model)...... 47
Table-4.8: Average cache access time (nsec) of the LogN+1 cache model for different number of cores...... 47
Table-4.9: Probability of cache hit at different cache levels for different number of cores (LogN Model)...... 48
Table-4.10: Probability of cache miss at different cache levels for different number of cores (LogN Model)...... 49
Table-4.11: Request rate at different cache levels for different number of cores (LogN Model)...... 49


Table-4.12: Cache utilization of different cache levels for different number of cores (LogN Model)...... 49
Table-4.13: Individual average cache access time (nsec) at different cache levels for different number of cores (LogN Model)...... 49
Table-4.14: Individual request queue length at different cache levels for different number of cores (LogN Model)...... 50
Table-4.15: Average cache access time (nsec) of the LogN cache model for different number of cores...... 50
Table-4.16: Cache utilization for 3-level cache system for different number of cores...... 51
Table-4.17: Average cache access time (nsec) of the 3-level cache system for different number of cores...... 52
Table-4.18: Average cache access time for LogN+1, LogN and 3-level cache for different number of cores using queuing network analysis...... 52

Table-5.1: Input parameters for MCSMC with their default and maximum or other possible values...... 59
Table-5.2: Normal and detailed input parameters for CACTI...... 66
Table-5.3: Comparison of cache access time between MCSMC cache simulator and CACTI for various cache line sizes...... 66
Table-5.4: Average access time calculated for LogN+1, LogN and 3-level cache for different number of cores using simulation...... 68

Table-6.1: Comparison of parallel language features (OpenMP, TBB, and SPC3 PM)...... 83
Table-6.2: Different decomposition techniques using SPC3 PM on 'N-core' machines...... 91
Table-6.3: Pseudo-codes in C and SPC3 PM for a program finding the smallest number in an array 'A' of length 'n' using recursive decomposition...... 92
Table-6.4: Three different algorithms for serial matrix multiplication...... 110
Table-6.5: Execution time in seconds for each of three approaches, C++, SPC3 PM Serial function with 'auto core assignment' and SPC3 PM with 'specified core assignment', for different sizes of matrices...... 110
Table-6.6: Speedup of SPC3 PM Serial function with 'auto core assignment' and with 'specified core assignment' for different sizes of matrices relative to C++...... 112
Table-6.7: Parallel matrix algorithm for OpenMP and SPC3 PM (Parallel)...... 114
Table-6.8: Execution time (sec) for parallel matrix multiplication using OpenMP and the Parallel function of SPC3 PM for 4 parallel threads...... 115
Table-6.9: Execution time (sec) for parallel matrix multiplication using OpenMP and the Parallel function of SPC3 PM for 8 parallel threads...... 115
Table-6.10: Execution time (sec) for parallel matrix multiplication using OpenMP and the Parallel function of SPC3 PM for 12 parallel threads...... 116
Table-6.11: Execution time (sec) for parallel matrix multiplication using OpenMP and the Parallel function of SPC3 PM for 24 parallel threads...... 116
Table-6.12: Speedup for matrix multiplication using the SPC3 Parallel function with different numbers of parallel threads and different matrix sizes...... 118


Table-6.13: Parallel matrix algorithm for OpenMP and the Concurrent function of SPC3 PM...... 120
Table-6.14: Execution time (sec) for parallel matrix multiplication using OpenMP and the SPC3 PM Concurrent function for 4 concurrent threads...... 121
Table-6.15: Execution time (sec) for parallel matrix multiplication using OpenMP and the SPC3 PM Concurrent function for 8 concurrent threads...... 121
Table-6.16: Execution time (sec) for parallel matrix multiplication using OpenMP and the SPC3 PM Concurrent function for 12 concurrent threads...... 121
Table-6.17: Execution time (sec) for parallel matrix multiplication using OpenMP and the SPC3 PM Concurrent function for 24 concurrent threads...... 121
Table-6.18: Comparison of speedup obtained for the SPC3 PM Concurrent function with different numbers of concurrent threads and matrix sizes...... 121

Table-7.1: Minimum, average and total run time for the original serial LKH-2 software for each medium-size TSP instance...... 144
Table-7.2: Minimum, average and total run time for the parallelized LKH-2 software using SPC3 PM for each medium-size TSP instance...... 145
Table-7.3: Minimum, average and total run time for the original serial LKH-2 software for each large-size TSP instance...... 148
Table-7.4: Minimum, average and total run time for the parallelized LKH-2 software using SPC3 PM for each large-size TSP instance...... 148


List of Figures

Figure-2.1: Possible classification of problems……………………………………………..………17

Figure-3.1: Fully shared L2 cache, multi-core processor configuration...... 21
Figure-3.2: Semi-shared L2 cache, multi-core processor configuration...... 21
Figure-3.3: The binary tree based cache system (LogN+1 Model)...... 24
Figure-3.4: The binary tree based cache system (LogN Model)...... 24
Figure-3.5: Cache size at different levels for different number of cores (LogN+1 Model)...... 26
Figure-3.6: Cache size at different levels for different number of cores (LogN Model)...... 26
Figure-3.7: Cache size at different levels for different number of cores (LogN+1 Model)...... 27
Figure-3.8: Cache size at different levels for different number of cores (LogN+1 Model)...... 27
Figure-3.9: Comparison of average cache access time in 3-level cache system with LogN and LogN+1 cache models as number of cores varies, using probabilistic model...... 32

Figure-4.1: Atomic block model for LogN and LogN+1 cache systems...... 38
Figure-4.2: Queuing network model for LogN+1 cache model...... 39
Figure-4.3: Queuing network model for LogN cache model...... 39
Figure-4.4: Atomic model of L2 cache for N cores for 3-level cache system...... 43
Figure-4.5: Queuing network model for 3-level cache system...... 44
Figure-4.6: Utilization of L1, L2 and L3 cache in 3-level cache system with different number of cores...... 51
Figure-4.7: Comparison of average cache access time in 3-level cache system with LogN and LogN+1 cache models as number of cores varies, using queuing network analysis...... 53
Figure-4.8(a): Comparison between the average cache access time of LogN+1, LogN and present 3-level cache system estimated for 8 cores using probabilistic model and queuing analysis...... 53
Figure-4.8(b): Comparison between the average cache access time of LogN+1, LogN and present 3-level cache system estimated for 16 cores using probabilistic model and queuing analysis...... 54
Figure-4.8(c): Comparison between the average cache access time of LogN+1, LogN and present 3-level cache system estimated for 32 cores using probabilistic model and queuing analysis...... 54
Figure-4.8(d): Comparison between the average cache access time of LogN+1, LogN and present 3-level cache system estimated for 64 cores using probabilistic model and queuing analysis...... 54
Figure-4.8(e): Comparison between the average cache access time of LogN+1, LogN and present 3-level cache system estimated for 128 cores using probabilistic model and queuing analysis...... 55


Figure-4.8(f): Comparison between the average cache access time of LogN+1, LogN and present 3-level cache system estimated for 256 cores using probabilistic model and queuing analysis...... 55
Figure-4.8(g): Comparison between the average cache access time of LogN+1, LogN and present 3-level cache system estimated for 512 cores using probabilistic model and queuing analysis...... 55
Figure-4.8(h): Comparison between the average cache access time of LogN+1, LogN and present 3-level cache system estimated for 1024 cores using probabilistic model and queuing analysis...... 56

Figure-5.1: Simulator design methodology for parallel trace-driven multi-level cache simulator for LogN+1, LogN and 3-level models...... 61
Figure-5.2: Snapshot 1, highlighting the details of trace searching at different cache levels...... 63
Figure-5.3: Snapshot 2, highlighting the details of trace searching at different cache levels...... 64
Figure-5.4: Comparison of average access time in LogN and LogN+1 cache models with 3-level cache system as number of cores varies, using simulation...... 68
Figure-5.5(a): Comparison between the average access time of LogN+1, LogN and present 3-level cache system estimated for 4 cores using three different approaches, probabilistic model, queuing analysis and simulation...... 69
Figure-5.5(b): Comparison between the average access time of LogN+1, LogN and present 3-level cache system estimated for 8 cores using three different approaches, probabilistic model, queuing analysis and simulation...... 70
Figure-5.5(c): Comparison between the average access time of LogN+1, LogN and present 3-level cache system estimated for 16 cores using three different approaches, probabilistic model, queuing analysis and simulation...... 70
Figure-5.5(d): Comparison between the average access time of LogN+1, LogN and present 3-level cache system estimated for 32 cores using three different approaches, probabilistic model, queuing analysis and simulation...... 70
Figure-5.5(e): Comparison between the average access time of LogN+1, LogN and present 3-level cache system estimated for 64 cores using three different approaches, probabilistic model, queuing analysis and simulation...... 71
Figure-5.5(f): Comparison between the average access time of LogN+1, LogN and present 3-level cache system estimated for 128 cores using three different approaches, probabilistic model, queuing analysis and simulation...... 71
Figure-5.5(g): Comparison between the average access time of LogN+1, LogN and present 3-level cache system estimated for 256 cores using three different approaches, probabilistic model, queuing analysis and simulation...... 71
Figure-5.5(h): Comparison between the average access time of LogN+1, LogN and present 3-level cache system estimated for 512 cores using three different approaches, probabilistic model, queuing analysis and simulation...... 72

Figure-6.1: Threads orientation in SPC3 PM...... 90
Figure-6.2: The design concept of SPC3 PM...... 95
Figure-6.3: Auto allocation of a thread of a serial task using serial execution function...... 101


Figure-6.4: Allocation of a thread of a serial task on a specified core using serial execution function...... 101
Figure-6.5: Allocation of a thread pool of a parallel task using parallel function. Threads equal to the number of cores are spawned and scheduled on each core...... 103
Figure-6.6: Allocation of a thread pool of a parallel task using parallel function. Threads equal to the number defined are spawned and scheduled on each core accordingly...... 103
Figure-6.7: Allocation of a thread pool of a parallel task using parallel function. Threads equal to the number defined are spawned and scheduled to specified cores accordingly...... 104
Figure-6.8: Concurrent execution of two tasks, Task i and Task j, on four cores. Each task has two spawned threads scheduled on a core by the operating system...... 105
Figure-6.9: Concurrent execution of two tasks, Task i and Task j, on four cores. Each task is scheduled on the respective core as assigned...... 106
Figure-6.10: Comparison of execution time for each of three approaches, C++, SPC3 PM serial with auto core assignment, and SPC3 PM with specified core assignment, for different sizes of matrices...... 111
Figure-6.11: Speedup comparison of three serial approaches, C++, SPC3 PM serial function with 'auto core assignment' and with 'specified core assignment'...... 111
Figure-6.12: Comparison of execution times (sec) for parallel matrix multiplication using OpenMP and SPC3 PM Parallel for 4 cores...... 116
Figure-6.13: Comparison of execution times (sec) for parallel matrix multiplication using OpenMP and SPC3 PM Parallel for 8 cores...... 117
Figure-6.14: Comparison of execution times (sec) for parallel matrix multiplication using OpenMP and SPC3 PM Parallel for 12 cores...... 117
Figure-6.15: Comparison of execution times (sec) for parallel matrix multiplication using OpenMP and SPC3 PM Parallel for 24 cores...... 117
Figure-6.16: Speedup comparison of matrix multiplication using SPC3 Parallel with different numbers of parallel threads and different matrix sizes...... 119
Figure-6.17: Comparison of speedup, based on Table 12, for the SPC3 PM Concurrent function with 4, 8, 12 and 24 concurrent threads...... 122
Figure-6.18: Comparison of speedups for parallel matrix multiplication using OpenMP and SPC3 PM Concurrent for 4 cores...... 124
Figure-6.19: Comparison of speedups for parallel matrix multiplication using OpenMP and SPC3 PM Concurrent for 8 cores...... 124
Figure-6.20: Comparison of speedups for parallel matrix multiplication using OpenMP and SPC3 PM Concurrent for 12 cores...... 124
Figure-6.21: Comparison of speedups for parallel matrix multiplication using OpenMP and SPC3 PM Concurrent for 24 cores...... 125

Figure 7.1: A 3-opt move. x1, x2, x3 are replaced by y1, y2, y3 [Ref. 250]...... 131

Figure 7.2: Restricting the choice of xi, yi, xi+1, and yi+1 [Ref. 250]...... 132

Figure 7.3: Alternating cycle (x1, y1, x2, y2, x3, y3, x4, y4) [Ref. 250]...... 132


Figure 7.4: Non-sequential exchange (k = 4) [Ref. 250]...... 132
Figure 7.5: Sequential 4-opt move performed by three 2-opt moves [Ref. 250]...... 133
Figure 7.6: Stages in original serial LKH-2 software...... 138
Figure 7.7: Stages in parallelized LKH-2 software using SPC3 PM...... 141
Figure 7.8: Comparison of minimum time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the medium-size TSP instances...... 146
Figure 7.9: Comparison of average time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the medium-size TSP instances...... 146
Figure 7.10: Comparison of total run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the medium-size TSP instances...... 149
Figure 7.11: Comparison of total run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the large-size TSP instances...... 149
Figure 7.12: Comparison of average run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the large-size TSP instances...... 149
Figure 7.13: Comparison of total run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the large-size TSP instances...... 149


CHAPTER 1

Introduction

Chip Multi-Processors (CMPs) are designed to increase the efficiency and performance of a system by increasing multi-tasking, parallelism and throughput. A multi-core processor consists of multiple processing units residing in one physical chip, each having its own set of execution and architectural resources, and differs from the traditional shared-memory parallel architectures (SMPs) in both hardware and software design. The number of processors in a multi-processor system is often limited to four or eight, whereas in CMPs designers are planning to place hundreds or even thousands of cores in a single chip.

The cores in CMPs are more closely coupled than processors in SMPs. In multi-processor systems no cache at any level is shared, but in CMPs the L2 and L3 caches are shared by the multiple cores within a chip. Also, the interconnection scheme used for processor-to-processor and processor-to-memory communication is static in multi-processor systems and dynamic in CMPs. Similarly, from the software design aspect, CMPs face different challenges than SMPs. These include program or thread scheduling and better load distribution on the available cores. Next is the level of parallelism: CMPs favor thread-level parallelism, whereas multi-processors work better for process- or application-level parallelism. Other differences in software design for CMPs and SMPs include the design of threads, algorithm decomposition techniques, programming patterns, operating system support, etc.

The shift towards chip multi-processors (CMPs) from multi-processor systems has resulted in several challenges for hardware and software designers. With the change in technology from micrometer to nanometer, there is a significant increase in the number of cores in a chip. Now, it is the computer designer's responsibility to determine a computational structure that can transform the increase in cores into a corresponding increase in computational performance and efficiency. This challenge must be dealt with on several fronts, like modification of the basic architecture design of each processor (core) to increase single- or multi-thread performance, change in the architecture of the memory system, and development of new programming models for multi-core processors.

1.1 Contributions of Dissertation

This thesis is an attempt to come up with some solutions for the challenges that multi-core processing is currently facing related to its memory hierarchy design and parallel programming environment. This research has contributed by proposing a novel multi-level cache system design, the "LogN+1 and LogN cache models", for multi-core processors, and by developing a new multi-threaded parallel programming model, "SPC3 PM" (Serial, Parallel and Concurrent Core to Core Programming Model), for multi-core processors. Furthermore, a parallel trace-driven multi-level cache simulator, "MCSMC" (Multi-level Cache Simulator for Multi-Cores), has also been developed during this PhD research. The following sub-sections highlight the features of each contribution.

1.1.1 Multi-level Cache System for Multi-core Processors ('LogN+1' and 'LogN' Cache Models)

LogN+1 and LogN cache models are two possible implementations of a newly proposed multi-level cache system for multi-core processors, based on a binary tree data structure, in place of the present 3-level cache system. In the proposed cache system every cache is shared by its two descendant caches, except the caches at L1, which are private to the respective cores. The number of cache levels increases as the number of cores increases. This cache system maintains a true pyramid (memory hierarchy). The results obtained indicate that, for higher numbers of cores, the proposed LogN+1 and LogN cache models work more efficiently, with reduced overall average cache access time, than the present 3-level cache system, and the performance gain increases as the number of cores increases. The proposed cache system also minimizes the cache coherence problem because of its scalable and symmetric architecture. Cache load is well distributed and no cache at any level is over-utilized. Thus the proposed cache models are found to be well suited for multi-core processors. This new cache system has also enabled us to earn a US patent approval (the international patent number is awaited).
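To make the scaling concrete, here is a small worked example read directly off the binary-tree structure (the precise level definitions are given in Chapter 3): for N = 16 cores, the sixteen private L1 caches form the leaves of the tree; eight L2 caches each serve two L1 caches, four L3 caches each serve two L2 caches, two L4 caches each serve two L3 caches, and a single root cache serves both L4 caches, giving log2(16) + 1 = 5 cache levels under the LogN+1 model.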


1.1.2 Multi-level Cache Simulator for Multi-core Processors ('MCSMC')

A parallel trace-driven multi-level cache simulator named MCSMC (Multi-level Cache Simulator for Multi-Cores) has also been developed during this PhD research. This is an extra effort that we have added to our research, as there was no suitable simulator available to simulate our proposed LogN+1 and LogN cache models in a real-time environment. Currently available cache simulators do not support such a large number of cores and cache levels. The MCSMC is a new addition to the family of cache simulators and is distinguished from the existing ones in many ways. It can be used to simulate the present 3-level cache system or a customized multi-level cache system with a large number of cores. It has been tested for 2048 cores and 12 cache levels. Its parallel execution makes it more efficient and less time consuming. Its input set of eleven parameters also provides flexibility and a wide range of simulation scenarios.

1.1.3 Multi-threaded Parallel Programming Model for Multi-core Processors ('SPC3 PM')

The SPC3 PM (Serial, Parallel and Concurrent Core to Core Programming Model) is a new parallel programming model developed as a part of this PhD research. The development of SPC3 PM is motivated by the understanding that existing parallel development tools do not provide adequate support for general-purpose multi-core programming and are unable to utilize all available cores efficiently, as they are designed for either a specific parallel architecture or a certain program structure. The SPC3 PM is developed to equip the common programmer with a multi-core programming tool for scientific and general-purpose computing. The SPC3 PM provides a set of rules for algorithm decomposition and a library of primitives that exploit parallelism and concurrency on multi-core processors. The programming model is serial-like and task-oriented, and provides thread-level parallelism without requiring the programmer to have detailed knowledge of platform internals and threading mechanisms. It also has many other unique features that distinguish it from all other existing parallel programming models. It supports both data and functional parallel programming. Additionally, it supports nested parallelism, so one can easily build larger parallel components from smaller parallel components. A program written with SPC3 PM may be executed in serial, parallel or concurrent fashion on the available cores. Besides, it also provides a processor-core interaction feature that enables the programmer to assign any task, or a number of tasks, to any core or set of cores. Moreover, the ability to use SPC3 PM on virtually any processor or operating system with any C++ compiler makes it very flexible.

1.2 The Thesis Organization

There are eight chapters in all to describe the work and present the results. To begin with, in Chapter 1 we have highlighted the major contributions made through this PhD research. Chapter 2 highlights the current hardware and software challenges of multi-core processing to give the background which motivated this research. In Chapter 3 we present the design and working of the proposed LogN+1 and LogN cache models. Their performance evaluation and comparison with the present 3-level cache system are also discussed there.

In Chapter 4, M/D/C/K-FIFO queuing models for the proposed LogN+1 and LogN cache models are discussed. Later, the performance of the models is compared with the present 3-level cache system using queuing modeling, along with a summary of the results.

Chapter 5 discusses the design methodology and features of the developed multi-level cache simulator, MCSMC. Simulation results for the LogN+1, LogN and 3-level cache models using MCSMC are also examined and compared in this chapter.

In Chapter 6, a new parallel multi-threaded programming model, SPC3 PM, developed for multi-core processors, is presented. Its key features, design concepts, programming styles and performance evaluation are discussed in detail. Later, the performance of parallel and concurrent implementations of matrix multiplication for up to 10000 x 10000 elements using SPC3 PM is reported. Chapter 7 discusses in detail the performance and behavior of SPC3 PM for complex and irregular problems like the Travelling Salesman Problem (TSP). The parallelization of the LKH-2 software for solving TSP using SPC3 PM and its performance evaluation are covered in this chapter.

Finally, Chapter 8 summarizes the research performed. Additionally, suggestions for some related future work are also given in that chapter.


CHAPTER 2

Motivation and Challenges with Multi-Core Processors

Multi-core processors have two or more execution cores (processors) implemented on a single chip, each having its own set of execution and architectural resources. These multi-core processors are also termed Chip Multi-Processors (CMPs). Depending on the design complexity of the cores and the chip, they can be classified as follows: Homogeneous Multi-core, in which all cores are identical in all respects; Heterogeneous Multi-core, in which cores have different execution capabilities but the same ISA (Instruction Set Architecture); and Hybrid Multi-core, in which cores have different ISAs and execution capabilities. Multi-core processors are designed to increase efficiency by increasing multi-tasking, parallelism and throughput.

The nature of the challenges in CMPs differs from that of multi-processor systems (SMPs) in many ways. For example, the cores in CMPs are more closely coupled than the processors in SMPs. The L2 and L3 caches are shared by the multiple cores within a chip, whereas in SMPs no cache at any level is shared by the processors. This leads to a more complex cache and memory hierarchy design in CMPs than in SMPs. Also, the processor interconnection scheme is usually static in SMPs and dynamic in CMPs. Scalability is another challenge from the architectural point of view, as the number of processors in general SMPs is often limited to four or eight, whereas in CMPs designers are planning to place hundreds or even thousands of cores in a single chip. Similarly, from the software design aspect, CMPs also face different challenges than SMPs. These include program or thread scheduling and better load distribution on the available cores, and the level of parallelism, as CMPs favor thread-level parallelism whereas SMPs work better for process- or application-level parallelism. Other software challenges include the design of threads, algorithm decomposition techniques, programming patterns, operating system support, etc. The following subsections provide details of some of these challenges, categorized as architectural and software challenges.

2.1 Architectural Challenges

The shift towards multi-core architectures poses several challenges for computer architects. Due to the big change in technology from micrometer to nanometer, there is a significant increase in the number of cores on a chip. Now it is the computer designer's responsibility to determine a computational structure that can transform the increase in cores into a corresponding increase in computational performance and efficiency. This challenge must be dealt with on several fronts, like the basic architecture of each processor (core) to increase single- or multi-thread performance, the architecture of the memory system, and a holistic approach to supporting emerging programming models for multi-core processors.

Multi-core architectures have two main characteristics that differ from conventional multi-processor architectures. One is heterogeneity and the other is massive parallelism. Multi-core processors can have different types of cores and any arbitrary interconnection topology among them. A multi-core processor also has a large number of cores as processing elements. Thus, considering heterogeneous multi-core architectures with massive parallelism, parallel programming for multi-core is more complicated than conventional parallel programming. To cope with this difficulty, it is necessary to devise efficient models and methods of multi-core programming.

2.1.1 Memory Hierarchy

One of the most critical challenges that multi-core processors face is the memory system. Program execution is often limited by the memory bottleneck rather than by the unavailability or low speed of the processor. This is mainly because a significant portion of an application always resides in main memory until it is executed by the processor. Multi-core processors can exacerbate these problems unless care is taken to conserve memory bandwidth and avoid memory contention. Also, the absence of explicit commands to move data between core and cache in general programming languages makes it more difficult to place and trace data in the cache optimally. In fact, one can expect that a significant portion of the improvement in a multi-core processor will be devoted to the memory hierarchy [6, 44, 46, 65, 68, 72, 268].

The memory hierarchies in parallel machines are more complicated than in uniprocessor machines because of the existence of multiple private and shared caches in single shared-memory systems, especially in multi-core processors, where the L2 and L3 caches are shared by the multiple cores within a chip. This leads to a more complex cache and memory hierarchy design in CMPs than in other parallel architectures. Multi-core processors, or CMPs, add a number of issues, complexities and challenges to the memory system [263, 268]. In the following sub-sections, some major issues, like cache levels, data synchronization, removal of false sharing, and communication mechanisms, are discussed.

2.1.1.1 Cache Levels

Current general-purpose multi-core processors use a 3-level (L1, L2 and L3) cache scheme. In this cache model every core has its own private L1 cache, while the L2 cache may be private, shared or split. In the first case, L2 is kept local to a core, as in the Intel Pentium 4D and Itanium 2. In the second, more common, approach, each core has its own private L1 cache and shares L2 through L1; the Intel dual-core and Core Duo and the AMD dual- and quad-core processors have this configuration. In the third case, an L2 cache is usually shared by half of the cores; the remaining half of the cores have their own shared L2 cache, and all L2 caches are then connected to L3. Intel Xeon quad- and eight-core processors have two L2 caches, each shared by two cores. The Sun UltraSPARC T2 has four bi-shared L2 caches and one L3 cache for eight cores. With the present dual, quad and eight cores, the 3-level cache system is working well. But as the number of cores increases, it may become a bottleneck and result in data traffic congestion and performance degradation [44, 65, 68].

2.1.1.2 Synchronization

Sharing or memory contention occurs when one processor reads or writes a memory address that is cached by another processor. If both processors are reading the data without modifying it, then the data can be cached by both processors. If, however, one processor tries to update the shared cache line, then the other's copy must be invalidated to ensure that it does not read an out-of-date value. This is referred to as the cache-coherence problem. It is relatively easily solved for a small number of processors but becomes very complex and time consuming as the number of processors increases. In multi-core processors, where both shared and private caches at different levels co-exist, this cache coherence or synchronization problem becomes worse. All the existing synchronization techniques and cache-coherence algorithms must be revisited for multi-core processors.
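The cost of coherence is easiest to see at the software level when two threads repeatedly update the same shared word. The following minimal C++ sketch (illustrative only; it is not taken from the proposed cache models, and the iteration count is arbitrary) keeps a shared counter correct with an atomic increment, but every update still forces the other core's cached copy of the line to be invalidated:

    #include <atomic>
    #include <iostream>
    #include <thread>

    std::atomic<long> counter{0};          // shared word, cached by both cores

    void worker() {
        for (int i = 0; i < 1000000; ++i)
            counter.fetch_add(1);          // each update invalidates the peer's copy
    }

    int main() {
        std::thread t1(worker), t2(worker);
        t1.join();
        t2.join();
        std::cout << counter << '\n';      // always 2000000, at a coherence cost
    }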

2.1.1.3 False-Sharing

The scenario in which two different processors access logically different data that reside on the same cache line is termed false sharing. This problem leads to a difficult tradeoff: large cache lines are good for locality, but they increase the chances of false sharing. The phenomenon of false sharing can be reduced by ensuring that data objects that might be accessed concurrently by independent threads lie on different cache lines. With the use of 'padding' in data structures, the programmer or a compiler can avoid false sharing. In padding, empty bytes are inserted into data structures to ensure that different data elements lie within different cache lines. In multi-core processors, with the increase in the number of cores, the number of private and shared caches will also increase, and this might result in an increase in false sharing.
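In C++, the padding described above is commonly expressed by aligning each independently updated object to its own cache line. The sketch below is illustrative only; the 64-byte line size is an assumption typical of current processors, not a figure taken from this thesis:

    #include <thread>
    #include <vector>

    constexpr std::size_t kLineSize = 64;   // assumed cache line size

    struct PaddedCounter {
        alignas(kLineSize) long value = 0;  // padding: one counter per cache line
    };

    PaddedCounter counters[4];              // without alignas, all four counters
                                            // could share one line and false-share

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back([t] {
                for (int i = 0; i < 1000000; ++i)
                    ++counters[t].value;    // each thread touches only its own line
            });
        for (auto& th : threads)
            th.join();
    }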

2.1.1.4 Spinning

Spinning is a process in which a processor repeatedly tests some word in memory, waiting for another processor to change it. Depending on the architecture, spinning can have different effects on overall system performance. On SMP architectures, spinning consumes fewer resources, since in general SMPs the number of processors is often limited to four or eight. But the behavior of spinning on multi-core processors having a large number of cores has yet to be explored.
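In code, spinning is simply a busy-wait loop on a shared memory word. A minimal C++ sketch (illustrative only) in which one thread spins until another thread changes the word it is watching:

    #include <atomic>
    #include <iostream>
    #include <thread>

    std::atomic<bool> ready{false};        // the word being repeatedly tested

    int main() {
        std::thread waiter([] {
            while (!ready.load(std::memory_order_acquire))
                ;                          // spin: test the word again and again
            std::cout << "flag observed\n";
        });
        std::thread setter([] {
            ready.store(true, std::memory_order_release);  // the awaited change
        });
        setter.join();
        waiter.join();
    }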

2.1.1.5 Communication Minimization

Memory-to-memory and memory-to-processor communication also affect program execution and performance. In multi-core processors, with the increase in the number of cores and their respective caches, the inter-memory and memory-processor communication will increase exponentially. So, the memory hierarchy of multi-core processors should be designed in such a way that these communications are contained.

2.1.2 Architectural Support for Compilers / Programming Models

Another challenge for multi-core processor design is support from the existing compilers and programming models. These programming models should be capable of utilizing the resources of a multi-core processor efficiently and making programs suitable for multi-core processing. Most of the modern parallel programming models and compilers support two basic standard parallel architectures. One is the shared-memory architecture and the other is the distributed-memory architecture. In a shared-memory architecture there is a single shared memory address space which can be accessed by every processor. In a distributed-memory architecture, every processor has its own memory space; however, this memory space may be accessed by other processors with the permission of the local processor. Besides the development of new programming models, or the modification of existing programming models or compilers to support multi-core processors, hardware designers should keep the multi-core architecture as close and compatible as possible with the existing parallel programming models, compilers and other related tools [268, 269].

2.2 Software Challenges

Software development is also a major challenge for multi-core programmers. The software that runs on a multi-core processor must be capable of exploiting maximum parallelism and concurrency with good load distribution. Although much progress has been made on these problems, much remains to be done. Furthermore, multi-core processors require a new, specialized software development environment, including parallel programming models, compilers, debuggers, simulators, and other tools [178, 179, 200, 201, 202].


2.2.1 Parallel Programming Models

A parallel programming model is a specialized software approach to express parallel algorithms and match applications with the underlying parallel systems. It encompasses programming languages, areas of application, algorithms, compilers, libraries, communication systems, scheduling schemes, memory management, etc.

From the viewpoint of the application programmer, a multi-core processor provides a parallel architecture that consists of processors connected with each other via a communication network. Thus, to exploit the parallel architecture, parallel programming is required. Conventionally, there are two types of parallel programming model: shared-memory programming and message-passing programming; OpenMP and the Message-Passing Interface (MPI) are respective examples. When using conventional parallel programming models for multi-core processors, the conventional issues in parallel programming arise, e.g., shared memory versus message passing. There is also a need to identify the difference between conventional parallel programming and multi-core programming. One needs to exploit the characteristics specific to multi-core processors to use the parallel programming models in a more efficient way.

Parallel programming is difficult in part because high performance does not automatically follow from parallel implementation. To achieve the highest possible performance, the designer must also take a number of other considerations into account. First, the programmer must balance the load on the processors or cores so that no single processor or core dominates the running time. Second, very large problems require that the computation scale to a large number of parallel processors; the implementation must be crafted to achieve this goal. Third, some components of the problem, though serial, may be made faster by a partial parallelization strategy known as pipelining [179, 180, 184, 186].

2.2.2 Parallel Algorithm Models

One of the fundamental challenges of parallel programming is to determine the execution flow of a parallel program. The process of determining the execution flow of a parallel program is known as parallel program modeling. Standard parallel architectures support a variety of parallel algorithm models, as discussed in the following sub-sections. The programmer must choose the parallel algorithm model suitable for the application on a multi-core processor [191, 192, 193, 194, 266, 268].

2.2.2.1 Data-Parallel Model

In the data-parallel model, the program is divided into tasks where each task performs similar operations on different data. These tasks are statically or semi-statically mapped onto processes or cores. In data parallelism, the same or identical operations are applied to different data sets of a problem. The data-parallel model is suitable for both shared- and distributed-memory parallel architectures. Applications involving vector or matrix calculations are the best candidates for data-parallel models.
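As a concrete sketch of the data-parallel model (written with plain C++ threads for illustration; it is not SPC3 PM code, and the chunk and thread counts are arbitrary), the element-wise vector addition below applies the identical operation to disjoint chunks of the data, with the tasks statically mapped onto the threads:

    #include <functional>
    #include <thread>
    #include <vector>

    // Each task performs the same operation (c = a + b) on its own chunk.
    void add_chunk(const std::vector<double>& a, const std::vector<double>& b,
                   std::vector<double>& c, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            c[i] = a[i] + b[i];
    }

    int main() {
        const std::size_t n = 1 << 20, workers = 4;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n);
        std::vector<std::thread> pool;
        for (std::size_t t = 0; t < workers; ++t) {
            std::size_t begin = t * n / workers, end = (t + 1) * n / workers;
            pool.emplace_back(add_chunk, std::cref(a), std::cref(b),
                              std::ref(c), begin, end);   // static task mapping
        }
        for (auto& th : pool)
            th.join();
    }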

2.2.2.2 Task Graph Model

The task-graph model is generally used to solve problems in which the data associated with a task is large relative to the computation associated with it. In this model, tasks may be mapped statically or dynamically, and sometimes decentralized dynamic mapping may also be used. A task-dependency graph is used to analyze the parallel computations and the interaction pattern of the tasks mapped onto the available processors.

2.2.2.3 Work Pool Model

In the work pool model, there is no preferred pre-mapping of tasks onto the available processes. Mapping of tasks onto processors is done dynamically at run time. This dynamic mapping enforces proper load balancing. The runtime mapping of tasks may be centralized or decentralized. In centralized mapping, the tasks may be statically available at the start of processing, whereas in the decentralized approach tasks may be generated dynamically at run time and added to the global work pool, usually managed by the operating system.
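A centralized work pool can be sketched in C++ as a shared task queue from which worker threads draw at run time. This is an illustrative minimum, not SPC3 PM's scheduler; a production version would add condition variables and support for tasks generated during execution:

    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    std::queue<std::function<void()>> work_pool;   // centralized global pool
    std::mutex pool_mutex;

    void worker() {
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> lock(pool_mutex);
                if (work_pool.empty())
                    return;                        // pool drained: worker exits
                task = std::move(work_pool.front());
                work_pool.pop();                   // no pre-mapping: take the next task
            }
            task();                                // dynamic mapping balances the load
        }
    }

    int main() {
        for (int i = 0; i < 100; ++i)              // tasks available at the start
            work_pool.push([i] { volatile int x = i * i; (void)x; });
        std::vector<std::thread> workers;
        for (int t = 0; t < 4; ++t)
            workers.emplace_back(worker);
        for (auto& th : workers)
            th.join();
    }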

2.2.2.4 Master-Slave Model

In the master-slave model, a program has one or more master processes that generate work and allocate it to slave processes. In this model it is the responsibility of the master process to ensure proper work distribution among the slave processes. The master process may be a bottleneck and may become a single point of failure. This algorithm model is also known as the manager-worker model.

2.2.2.5 Pipeline or Producer-Consumer Model

In the pipeline model, the problem is divided into a succession of tasks, each performing a separate operation. These tasks are processed through a succession of processors, each of which performs the desired operation. This type of parallelism is also called stream parallelism. Software pipelining is an example of this model.
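A two-stage producer-consumer pipeline can be sketched in C++ with a shared buffer between the stages (an illustrative minimum with one producer and one consumer; real pipelines typically use a bounded buffer per stage):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> stage_buffer;   // stream of items between the two stages
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    int main() {
        std::thread producer([] {   // stage 1: produces the stream
            for (int i = 0; i < 100; ++i) {
                std::lock_guard<std::mutex> lock(m);
                stage_buffer.push(i * i);
                cv.notify_one();
            }
            std::lock_guard<std::mutex> lock(m);
            done = true;
            cv.notify_one();
        });
        std::thread consumer([] {   // stage 2: consumes items as they stream in
            long sum = 0;
            for (;;) {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [] { return !stage_buffer.empty() || done; });
                while (!stage_buffer.empty()) {
                    sum += stage_buffer.front();
                    stage_buffer.pop();
                }
                if (done)
                    break;
            }
        });
        producer.join();
        consumer.join();
    }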

In short, selection of a suitable algorithm model for the underlying parallel architecture is very important; otherwise, performance might be degraded. In some cases more than one model may be applied to the problem, resulting in a hybrid algorithm model, where multiple models may be applied in hierarchical fashion, that is, one after the other, or multiple models may be applied sequentially to different phases of a parallel algorithm.

2.2.3 Decomposition Techniques

Decomposition techniques are used for achieving concurrency and parallelism in a given program, and they become even more important for multi-core processors. The selection of a suitable or hybrid decomposition is also a challenge for multi-core programmers. These techniques are broadly classified as recursive decomposition, data decomposition, exploratory decomposition, and speculative decomposition. The recursive- and data-decomposition techniques are relatively general purpose, as they can be used to decompose a wide variety of problems. On the other hand, speculative- and exploratory-decomposition techniques are more special purpose in nature, because they apply to specific classes of problems [191, 192, 193, 194, 266, 268].

2.2.3.1 Recursive Decomposition

In recursive decomposition, a given problem is solved by first dividing it into a set of independent sub-problems and solving each of these sub-problems recursively. These sub-problems are further computed with similar division into smaller sub-problems. In other words, this decomposition induces concurrency in problems which can be solved using the divide-and-conquer strategy, which is the natural way to exploit the concurrency in the problem.
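For instance, finding the smallest element of an array (the example used later in Table 6.3) decomposes recursively: split the array in half, solve each half as an independent sub-problem, and combine the two partial results. A C++ sketch using std::async for illustration (SPC3 PM expresses the same decomposition with its own primitives; the serial cutoff value is arbitrary):

    #include <algorithm>
    #include <future>
    #include <vector>

    // Divide and conquer: each half is an independent sub-problem.
    int array_min(const std::vector<int>& a, std::size_t lo, std::size_t hi) {
        if (hi - lo <= (1u << 17))                 // small enough: solve serially
            return *std::min_element(a.begin() + lo, a.begin() + hi);
        std::size_t mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async,
                               array_min, std::cref(a), lo, mid);
        int right = array_min(a, mid, hi);         // this thread takes one half
        return std::min(left.get(), right);        // combine the sub-results
    }

    int main() {
        std::vector<int> a(1 << 20, 7);
        a[12345] = -3;
        return array_min(a, 0, a.size()) == -3 ? 0 : 1;
    }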

2.2.3.2 Data Decomposition

Data decomposition is used to exploit concurrency in algorithms that operate on large data structures. In data decomposition, the large data structures are partitioned into tasks such that each task computes a similar function on a different dataset.

2.2.3.3 Exploratory Decomposition

Exploratory decomposition is used for searching for solutions in a given space. In this decomposition the problem (search space) is partitioned into smaller sub-spaces. These sub-spaces are searched concurrently on the available processors for the desired solution.

2.2.3.4 Speculative Decomposition

Speculative decomposition is used for producer-consumer models in which a program selects one of many computationally possible solutions for further processing depending on the output of other computations that precede it. Discrete event simulation is an example of speculative decomposition.

As with the selection of an algorithm model, the selection of a suitable decomposition technique for the underlying parallel architecture is also very important; otherwise the performance might be degraded. In some cases more than one decomposition technique may be applied to the problem, resulting in a hybrid decomposition where the computation is structured into multiple stages and each stage may be based on a different type of decomposition technique.

2.2.4 Levels of Parallelism

Generally, problem solutions may contain various types of parallelism. The levels of parallelism available depend on the nature of the problem and on the underlying parallel architecture [188, 189, 190, 209, 262].

Functional-level parallelism is the kind of parallelism which arises from the logic of a problem solution. It occurs, to a greater or lesser extent, in all formal descriptions of problem solutions, such as program flow diagrams, data flow graphs, programs and so on [195, 196, 262].

Programs written in imperative languages may embody functional parallelism at different levels, that is, at different sizes of granularity. In this respect one can identify the following four levels and corresponding granularity:

- Parallelism at the Instruction Level (Fine-Grained Parallelism)

- Parallelism at the Loop Level (Middle-Grained Parallelism)

- Parallelism at the Procedure Level (Middle-Grained Parallelism)

- Parallelism at the Program Level (Coarse-Grained Parallelism)

Available instruction-level parallelism means that particular instructions of a program may be executed in parallel. The instructions can be either assembly (machine-level) or high-level language instructions; usually, instruction-level parallelism is understood at the machine-language level.

When considering instruction-level parallelism, one usually confines attention to instructions expressing more or less elementary operations, such as an instruction prescribing the addition of two scalar operands, as opposed to multi-operation instructions like those implying vector or matrix operations.

Parallelism may also be available at the loop level, where consecutive loop iterations are candidates for parallel execution. However, data dependencies between subsequent loop iterations, called recurrences, may restrict their parallel execution. The potential speedup is proportional to the loop limit or, in the case of nested loops, to the product of the limits of the parallelized loop levels.
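The contrast can be sketched as follows (an illustrative sketch; the loop bodies are ours): the first loop has independent iterations and can be distributed across cores, whereas the second contains a recurrence and must run sequentially.

    from concurrent.futures import ProcessPoolExecutor

    def body(x):
        return 3 * x + 1

    if __name__ == "__main__":
        a = list(range(10))

        # Independent iterations: b[i] depends only on a[i], so the
        # loop iterations may execute in parallel on different cores.
        with ProcessPoolExecutor() as pool:
            b = list(pool.map(body, a))

        # A recurrence: c[i] depends on c[i-1], so the iterations
        # cannot be executed in parallel.
        c = [0] * len(a)
        for i in range(1, len(a)):
            c[i] = c[i - 1] + a[i]

        print(b)
        print(c)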

Next, there is parallelism available at the procedure level, in the form of procedures that can execute in parallel. The extent of parallelism exposed at this level depends mainly on the kind of problem solution considered.

In addition, different programs are obviously independent of each other, so parallelism is also available at the user level. Multiple, independent programs are a key source of parallelism in everyday computing scenarios. Evidently, the different levels of parallelism in a problem solution are not exclusive and may coexist at the same time.

In contrast to functional parallelism, there is another kind of parallelism, called data parallelism. It comes from using data structures that allow parallel operations on their elements, such as vectors or matrices, in problem solutions. Data parallelism is inherent only in a restricted set of problems, such as scientific or engineering calculations or image processing. Usually, this kind of parallelism gives rise to massively parallel execution for the data-parallel parts of the computation; the actual achievable speedup depends on the characteristics of the application concerned [203, 204, 205].

From another point of view parallelism can be considered as either regular or irregular. Data parallelism is regular, whereas functional parallelism, with the exception of loop level parallelism, is usually irregular. Regular parallelism is often massive, offering several orders of magnitude in speedup [262].

2.2.5 Compiler Optimizations and Techniques

A compiler is responsible for converting a given code into an optimized executable form for the underlying architecture. Parallel compilation makes this process even more complex and challenging, especially in multi-core processor environments. If the parallel compiler fails to produce optimized parallel code for the underlying parallel architecture, the result will surely be performance degradation, greater data conflicts, poor load balancing and minimal utilization of the parallel resources. From the performance point of view, the most important memory-related tasks performed by a parallel compiler are optimizing parallelism, enhancing data locality and using memory efficiently [267, 269].

2.2.5.1 Parallelism

Tracing parallelism and optimizing it is one of the important and fundamental functions of parallel compilers; the parallelization strategy also determines how efficiently memory is used. However, maximum parallelism may not always be easy to achieve, because of several factors: data dependencies and unavoidable communications such as inter-process, inter-processor and processor-memory communications. Besides, different parts of a problem may demand different numbers of processors or cores, which makes proper load balancing much more complex.

2.2.5.2 Removal of Data Dependencies

Another important problem associated with a parallel compiler is ensuring proper data locality and removing data dependencies as far as possible. In addition, in a multi-core processor environment, special care should be taken over inter-core communication schemes, which might cause frequent cache-line invalidations and updates and result in increased overall latency.

2.2.5.3 Memory Space

Another memory-related problem connected to a parallel compiler is the efficient use of the memory system, which includes both main memory and cache. Memory must be allocated to the available processors according to the requirements of the processes scheduled on them. Special care must be taken to avoid memory-related issues such as false sharing and cache coherence problems.

2.2.6 Related Tools for Performance and Parallel Debugging

The goal of parallel computing is to obtain better performance, with lower execution time, than is obtained using serial computation; with the advent of multi-core processors this has become more challenging. Optimization is an important part of the development of a parallel program and requires knowledge of the underlying architecture, the programming model, and the mapping and parallelization strategies of the application [106, 107, 268].

There is a variety of performance tools available for present parallel architectures. These include various parallel program development studios, compilers, threading assistants, memory and threading error checkers, performance profilers and advisors, parallel program inspectors and composers, etc.

These tools are used to simplify threading in code by identifying those areas in serial and parallel applications where parallelism would have the greatest impact. They may be used to optimize C/C++ compilation, to employ performance libraries, and to fine-tune Windows applications for optimal performance, ensuring full utilization of the underlying parallel architecture; Microsoft Visual Studio 2010, among others, provides such a set of tools [83, 96].

Parallel debugging is much more complex than writing a parallel program. It ensures that a parallel program produces the required output and, beyond that, that the parallel program always produces the same answers on the same inputs, identical to those of the sequential program. A number of debugging tools have been developed; they can be classified into two categories, static analysis tools and dynamic analysis tools. Tools of the first type use static compiler analysis techniques, whereas dynamic analysis tools use shadow variables to resolve race conditions that occur at run time.

2.2.7 Regular and Irregular Problems

The nature of the problems or applications also has a great impact on the advantage that can be gained from multi-core processing. Problems may be divided into two classes, regular and irregular. Irregular problems are characterized by the difficulty of predicting the amount of work connected with parts of the input data; such problems are therefore difficult to parallelize and to make capable of re-adapting the work distribution at runtime so as to utilize all available cores / processors. Problems like scheduling, inventory management, automatic control, VLSI design and bioinformatics fall into this category of irregular problems. Figure 2.1 shows a possible classification of problems.

[Figure-2.1 classifies a problem as regular or irregular by how its parallelism is organized: by flow of data (pipeline; event-based coordination), by tasks (task parallelism; divide and conquer), or by data (geometric decomposition; recursive data).]

Figure-2.1: Possible classification of problems


2.3 Performance and Scalability Issues

From the programmer's perspective, scalability and load balancing are the two important issues in achieving high performance on parallel machines. Generally, large parallel systems are implemented by increasing the system size, so it is very important to determine how the performance of a parallel system is influenced by an increase in system size. Scalability not only measures the ability of a parallel architecture to support parallel processing at different system sizes, and the inherent parallelism of a parallel algorithm, but can also be used to predict the performance of large problems and system sizes from the performance of small problems and system sizes. With the introduction of multi-core processors, these performance and scalability issues have become more challenging [184, 186, 264, 268].

The performance of a parallel system depends on a large number of factors, all affecting the scalability of the parallel architecture and of the application program involved. Approaches to scalability determine the degree of match between a parallel architecture and an application algorithm: a parallel architecture can be very efficient for one parallel algorithm but bad for another, and vice versa, so for different architecture-program pairs the analysis may reach different conclusions. The main reasons why scalability may not be achieved in some applications include the nature of the application and its structure. Some applications have a larger serial region than parallel region, so most of the program must run sequentially. Other barriers to scalability are poor load balancing and the requirement for a high degree of inter-process and inter-core communication. Thus a major goal of parallel programming is to ensure good load balance, which is more challenging in the case of multi-core processors [179, 180, 264].

The goal of parallel processing is to have the running time of an application reduced by a factor that is inversely proportional to the number of processors or cores used. One way to define speedup is as the ratio of the running time on a single processor to the running time on a parallel machine. This type of scalability depends only on the architecture, not on the application; sometimes the application is the limit, and the further addition of processors or cores may even degrade performance. The term which combines both architecture and application behaviour is ''scaled speedup'': according to this concept, an application is said to be scalable if, when the number of processors and the problem size are increased by a factor, the running time remains unchanged [264, 268].

Load balancing is another issue that strongly affects the performance of a system. It means that the processors have nearly the same amount of program code to execute. In order to balance the computational load on a multi-core machine, the programmer must divide the computations and communications uniformly across all available cores.

2.4 Summary

Multi-core processors differ from shared-memory multi-processors (SMPs) in both hardware and software aspects. The hardware and software technologies being used for current multi-processors cannot be used directly for multi-core processors; all of them have to be revisited. New memory hierarchies and cache designs, core interconnection patterns and many related hardware issues have to be worked out to allow the maximum utilization of multi-core processors. Similarly, from the software design aspect, new programming models, libraries and related performance tools have to be designed, or the existing ones modified accordingly, in order to facilitate multi-core parallel programming and to make the most of multi-core processors.


CHAPTER 3

“LogN+1” and “LogN” Cache Models, A Binary Tree Based Cache System for Multi-core Processors

For general purpose multi-core computing, memory design, especially cache implementation, is one of the important issues. Program execution is often limited by the memory bottleneck rather than by processor availability and speed. Multi-core processors can exacerbate the problem unless care is taken to conserve memory bandwidth and avoid memory contention. Also, the absence in general programming languages of explicit commands to move data between core and cache makes it more difficult to place and trace data in the cache optimally. These foreseen problems are forcing hardware designers to define a proper cache system for multi-core processors that works optimally and effectively for all types of data structures. During the course of this PhD project we have proposed a novel multi-level cache system for multi-core processors; it is presented in our papers [3, 4] and has also earned a US patent approval. In this chapter we present this novel cache system and its two possible implementations (models), LogN+1 and LogN. The cache system is based on the binary tree data structure. To begin with, the present 3-level cache system and related improvements for multi-core processors are discussed. Next, the newly proposed cache system with its two models, LogN+1 and LogN, is discussed. Finally, the performance of the two proposed cache models is analysed and compared with the present 3-level cache system using a probabilistic mathematical model. The models are further analysed and evaluated using a queuing model and simulation, discussed in chapters 4 and 5 respectively.

3.1 Present 3-level Cache System and Related Improvements for Multi-core Processors

The current general purpose multi-core processors [26, 30, 35] use the present 3-level (L1, L2 and L3) cache scheme. In this cache model every core has its own private L1 cache. The L2 cache may be private, shared or split. In the first case L2 is kept local to a core, as in the Intel Pentium 4D and Itanium 2 [30]. In the second, more common approach, each core has its own private L1 cache and shares L2 through L1, as shown in figure-3.1; the Intel dual core and core duo [30] and the AMD dual and quad core processors [26] have this configuration. In the third case L2 is typically shared by half of the cores, the remaining half of the cores have their own shared L2 cache, and all L2 caches are then connected to L3, as shown in figure-3.2. The Intel Xeon quad and eight core processors have two L2 caches, each shared by two cores, and the Sun UltraSPARC T2 [35] has four bi-shared L2 caches and one L3 cache for eight cores.

With the present dual, quad and eight cores, the present 3-level cache system is working well [26, 30, 35]. But as the number of cores increases, it may become a bottleneck and result in data traffic congestion and performance degradation [44, 65, 68].

Figure-3.1: Fully shared L2 cache multi-core processor configuration

Figure-3.2: Semi-shared L2 cache multi-core processor configuration

To improve multi-core system performance and to avoid memory bottlenecks, present solutions concentrate on increasing bandwidth and/or on sharing some parts of the caches while keeping others private. Smart cache technology [30] is used by Intel and HyperTransport technology [20] by AMD, but these technologies have not been worked out for greater numbers of cores. Besides, an attempt has been made using a reconfigurable cache design [43], but a problem with this is the additional time required to reconfigure the cache for every new problem; the behaviour of this solution with multiple cores executing multiple programs is also yet to be explored.

Another approach in use is the multi-sliced bank cache [56]. This approach has low latency and a better hit rate, but its behaviour for larger numbers of cores is not specified. Crossbar networks and other processor-memory interconnections are also used to accommodate the increased number of cores within the existing 3-level cache system, but the problems with this approach are the very high cost of multiple read/write cache memories and the fact that the number of interconnections does not increase linearly with the number of cores.

In addition to hardware improvements, some software improvements are also being researched. Various modifications have been proposed to cache coherence protocols, such as snooping protocols [69] and directory-based protocols [16, 19]. Different techniques to exploit spatial locality at the compiler or operating system level have also been proposed for improving cache performance, such as the exploitation of page sharing [63, 72] and immediate spatial (IS) locality [53]; a hindrance to their working is that locality behaviour varies from program to program. Cache partitioning and sharing is another attempt to utilize multi-core processors: it refers to the partitioning of shared L2 or L3 caches among a set of program threads running simultaneously on different cores. A number of cache partitioning methods with different optimization objectives have been proposed, but this approach has not been exploited for greater numbers of cores [39, 63]. Transactional memory (TM) is also one of the techniques being researched to increase memory efficiency, implemented as software transactional memory (STM), hardware transactional memory (HTM) or hybrid transactional memory. Initially TM was developed to provide a concurrency control mechanism that avoids many of the pitfalls of lock-based synchronization in SMPs, but TM is now also being implemented in multi-cores to avoid memory-related issues [6, 261].

3.2 'LogN+1' and 'LogN' Cache Models

So far, the attempts being made, involving both hardware and software, have been based on finding adjustments of the present 3-level cache system to multi-core processors. In this section a novel binary tree based multi-level cache system for multi-core processors is presented. This cache system maintains a true pyramid (memory hierarchy): with an increase in base size (number of cores), its height (number of memory levels) is adjusted accordingly. The new cache system has two possible implementations (models), ''LogN+1'' and ''LogN''. With 'N' the number of available cores, the name of each model indicates the number of cache levels to be inserted in the binary tree based cache system implemented using that model. This new cache system promises to overcome many foreseen memory-related problems in multi-core processors as the number of cores increases, such as cache overloading, data congestion and scalability.

3.2.1 Design Concept

For a processor with 'N' cores, let $c_i$ $(1 \le i \le N)$ be the $i$-th core. The binary tree based multi-level cache system is then like a binary tree with 'N' terminals, the levels of cache being treated as the intermediate nodes of the tree. Each core has its own private L1 cache, so there are 'N' L1 caches in total. The L1 caches of two neighbouring cores are connected to an L2 cache, giving 'N/2' binary-shared L2 caches in total. Similarly, two adjacent L2 caches are connected to an L3 cache; the number of L3 caches is therefore 'N/4', and all of them are binary shared. The root of the tree is either the last level of cache or the main memory. These two configurations lead to the two possible models, ''LogN+1'' and ''LogN''.

In the LogN+1 model, the root of the cache system is a single final level of cache, connected to the main memory through a very fast, high-bandwidth interface. In this model the number of cache levels is $(\log_2 N + 1)$, as shown in figure-3.3.

In the LogN model, the $(\log_2 N + 1)$-th level of cache is removed, so the two caches at the $(\log_2 N)$-th level are connected to the main memory through a high-speed interface bus. In this model the number of cache levels is reduced to $(\log_2 N)$, as shown in figure-3.4.

3.2.2 Cache Hierarchy and Cache Size

In the case of the LogN+1 model, $\log_2 N$ cache levels, and in the case of the LogN model, $\log_2 N - 1$ cache levels are inserted between the N-core processor and the main memory. Taking care of the properties of the memory hierarchy is very important in the design of a cache system: the cycle time and the size of the caches should increase as one moves down the hierarchy. It is also important, for exploiting locality of reference, to decide the size and frequency (cycle time) of the caches at each level optimally. If the different levels of cache are not synchronized in terms of clock frequency, cycle time and size, then the principle of locality will fail, resulting in performance degradation.

Figure-3.3: The binary tree based cache system (LogN+1 model)

Figure-3.4: The binary tree based cache system (LogN model)

Consideration of the cache size at the different levels of the hierarchy is one of the important design parameters in memory system design. Cost and locality of reference are most affected by the cache size. It is easy to increase the cache size, but it results in too much cost; therefore a relationship should be maintained between the cache sizes at different levels so as to exploit the principle of locality at reduced cost. For the LogN+1 and LogN cache models, two approaches have been analysed for defining the hierarchy of the multi-level cache system: arithmetic progression and geometric progression. Using the arithmetic progression, the sizes of the cache levels in an 'n-level' cache system are defined as $\{a + kd\}_{k=0}^{n-1}$, such that the difference between two successive cache sizes is a constant $d$ and $a$ is the size of the L1 cache in the cache system. In the geometric progression approach, the sizes of the cache levels in an 'n-level' cache system are defined as $\{a r^k\}_{k=0}^{n-1}$, such that the ratio between two successive cache sizes is a constant $r$ and $a$ is the size of the L1 cache.
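The two progressions can be sketched as follows (our function names; it is assumed, consistently with the configuration used in the tables below, that the main memory size anchors the progression as its (n+1)-th term):

    def gp_cache_sizes(n_levels, l1_kb=64.0, mem_kb=1024 * 1024):
        # Geometric progression: sizes a*r**k, with a*r**n_levels equal
        # to the main memory size, so the ratio r is constant between levels.
        r = (mem_kb / l1_kb) ** (1.0 / n_levels)
        return [l1_kb * r ** k for k in range(n_levels)]

    def ap_cache_sizes(n_levels, l1_kb=64.0, mem_kb=1024 * 1024):
        # Arithmetic progression: sizes a + k*d, with a + n_levels*d
        # equal to the main memory size, so the difference d is constant.
        d = (mem_kb - l1_kb) / n_levels
        return [l1_kb + k * d for k in range(n_levels)]

    print(gp_cache_sizes(7))   # [64, 256, 1024, ...]: a 64-core LogN+1 system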

First, the sizes of the caches at the various levels are calculated using the geometric progression. Tables 3.1 and 3.2 show the possible cache sizes at the various levels for the LogN+1 and LogN models respectively. For this purpose, the present general processor configuration, that is, an L1 cache of 64 KB and a main memory of 1 GB, is considered. Figures 3.5 and 3.6 show comparison graphs of cache size at the different cache levels for the LogN+1 and LogN models respectively; the exploitation of the memory hierarchy may clearly be observed.

Table-3.1: Cache memory size (KB) at different cache levels for different number of cores (LogN+1 Model)


Table-3.2: Cache memory size (KB) at different cache levels for different number of cores (LogN Model)

Figure-3.5: Cache size at different levels for different number of cores (LogN+1 Model)

Figure-3.6: Cache size at different levels for different number of cores (LogN Model)

Similarly, in the second approach, the sizes of the caches at the various levels are calculated using the arithmetic progression. Tables 3.3 and 3.4 show the possible cache sizes at the various levels for the LogN+1 and LogN models respectively; for these calculations the same configuration as for GP is considered. Figures 3.7 and 3.8 show comparison graphs of cache size at the different cache levels for the LogN+1 and LogN models respectively. Here, too, the exploitation of the memory hierarchy may clearly be observed.

Table-3.3: Cache memory size (KB) at different cache levels for different number of cores (LogN+1 Model)

Table-3.4: Cache memory size (KB) at different cache levels for different number of cores (LogN Model)

Figure-3.7: Cache Size at different levels for different number of cores (LogN+1 Model)


Figure-3.8: Cache Size at different levels for different number of cores (LogN Model)

Comparing figures 3.5, 3.6, 3.7 and 3.8, it can be observed that, of the two approaches for determining the cache size at each level of the cache system, AP and GP, the GP approach is more uniform and more suitable for exploiting locality of reference. The GP approach also requires smaller cache sizes at the different levels than a cache hierarchy designed using AP, thus reducing the cost, complexity and access time of the cache system.

3.2.3 Cache Hierarchy and Cache Frequency (Cycle Time)

For calculating the cache clock frequency (cycle time) of the cache at each level, the geometric progression approach is selected, because of the observation made in the previous section. In this approach, the frequencies of the cache levels in an 'n-level' cache system are defined as $\{a r^k\}_{k=0}^{n-1}$, such that the ratio between two successive cache frequencies (cycle times) is a constant $r$ and $a$ is the frequency (cycle time) of the L1 cache in the cache system. Tables 3.5 and 3.6 show the possible cache clock frequencies at the different levels with different numbers of cores for the LogN+1 and LogN models respectively. Tables 3.7 and 3.8 show the corresponding possible cache cycle times. It is assumed that each core and its private L1 cache operate at 4.0 GHz and that the main memory operates at 1.00 GHz.


Table-3.5: Cache frequency (GHz) at different cache levels for different number of cores (LogN+1 Model)

Table-3.6: Cache frequency (GHz) at different cache levels for different number of cores (LogN Model)

Table-3.7: Cache cycle time (nsec) at different cache levels for different number of cores (LogN+1 Model)

Table-3.8: Cache cycle time (nsec) at different cache levels for different number of cores (LogN Model)


3.3 Performance Evaluation

After calculating the possible cache size and frequency (cycle time) of the cache at each level of the binary tree based cache system, the average access time and the probability of cache hits are calculated using a probabilistic mathematical method for both the LogN+1 and LogN models. For the worst-case analysis, all the caches are assumed to be single read-write; multiple read-write cache memories may be used at any particular level, or at all levels, for further performance gain.

3.3.1 Average Cache Access Time

In the LogN+1 cache model, every cache from L2 up to the $(\log_2 N + 1)$-th level has two descendants, except the L1 cache, which is private to a core and always available to that core. For the other levels, the fair probability of a cache serving one of its descendants is 1/2. So the worst-case average cache access time for a core to access the final level of cache in the binary tree based multi-level cache system is given by

$T_{max} = T_1 + 2T_2 + 2T_3 + \cdots + 2T_{\log_2 N} + 2T_{\log_2 N + 1}$   (3.1)

where $T_1, T_2, T_3, \ldots$ are the cycle times of the L1, L2, L3, ... caches respectively.

In the LogN cache model, the whole memory system consists of $\log_2 N$ levels of cache plus one level of main memory. The same considerations as in the LogN+1 model apply here, so Equation (3.1) is slightly modified and appears as Equation (3.2):

$T_{max} = T_1 + 2T_2 + 2T_3 + \cdots + 2T_{\log_2 N}$   (3.2)

On the other hand, for a 3-level cache system with a fully shared L2 cache, the L1 cache is private to each core, L2 is shared by all N cores, and L3 has only one descendant, as shown in figure-3.1; the worst-case average cache access time is then given by Equation (3.3):

$T_{max} = T_1 + N T_2 + T_3$   (3.3)

For a semi-shared L2 cache, where the L1 cache is private to each core, L2 is shared by N/2 cores, and L3 has two descendants, the worst-case average cache access time is given by Equation (3.4):

$T_{max} = T_1 + (N/2)\,T_2 + 2T_3$   (3.4)
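A small sketch of these formulas (the function names are ours; the per-level cycle times would be taken from tables 3.7 and 3.8):

    def t_max_binary_tree(cycle_times):
        # Equations (3.1)/(3.2): T1 + 2*T2 + ... + 2*T_last, where the
        # list holds log2(N)+1 levels for LogN+1 or log2(N) for LogN.
        return cycle_times[0] + 2 * sum(cycle_times[1:])

    def t_max_3level_semi_shared(t1, t2, t3, n_cores):
        # Equation (3.4): T1 + (N/2)*T2 + 2*T3.
        return t1 + (n_cores / 2) * t2 + 2 * t3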

Based on tables 3.7 (for the LogN+1 model) and 3.8 (for the LogN model), table-3.9 shows the results obtained using equations (3.1), (3.2) and (3.4). Figure-3.9 compares the average cache access times of the LogN, LogN+1 and 3-level cache systems as the number of cores varies, based on table-3.9.

Table-3.9: Average cache access time for LogN+1, LogN and 3-level cache system with different number of cores (probabilistic method)

Number    Cache Levels       Average Access Time (nsec)
of Cores  LogN+1   LogN      LogN+1 Model  LogN Model  3-Level (Semi-Shared L2)
4         3        3         2.30          2.30        2.30
8         4        3         3.37          2.30        3.09
16        5        4         4.44          3.37        4.68
32        6        5         5.52          4.44        7.85
64        7        6         6.60          5.52        14.19
128       8        7         7.68          6.60        26.87
256       9        8         8.76          7.68        52.23
512       10       9         9.84          8.76        102.95
1024      11       10        10.10         9.87        158.87


Figure-3.9: Comparison of average cache access time of the 3-level cache system with the LogN and LogN+1 cache models as the number of cores varies, using the probabilistic model

It can be observed that for small numbers of cores the average cache access times are nearly the same, but as the number of cores increases the difference between the average cache access times of the LogN+1 and LogN models and that of the 3-level cache system becomes more and more prominent, showing far better performance of the LogN+1 and LogN models in comparison with the 3-level cache system. It may also be noted that the LogN model has a slightly lower average cache access time than the LogN+1 model, and shows much better performance than the 3-level cache system as the number of cores increases beyond eight.

3.3.2 Probability of Cache Hits

Let $P(L_i)$ be the probability of finding the data at the $i$-th level of the binary tree based multi-level cache system. In a memory hierarchy, the presence of the data at a higher level is independent of the lower levels, but the converse does not hold. It is assumed that there is no miss at the topmost level; the data will therefore certainly be found at the highest level of the cache system. The probability of finding the data at cache level $L_i$, given that it is present at the higher cache level $L_{i+1}$, can be calculated using the Bayesian theorem [22]:

$P(L_i/L_{i+1}) = \dfrac{P(L_i)\,P(L_{i+1}/L_i)}{P(L_{i+1})}$   (3.5)


For a memory hierarchy system, the probability of finding the data at the higher level, given that it is present at the lower level, is always one, ignoring the chance of the cache being updated by its other descendant. Therefore Equation (3.5) becomes:

$P(L_i) = P(L_i/L_{i+1})\,P(L_{i+1})$, i.e. $P(L_i/L_{i+1}) = \dfrac{P(L_i)}{P(L_{i+1})}$

Similarly, $P(L_{i+1}) = P(L_{i+1}/L_{i+2})\,P(L_{i+2})$. By generalizing, we get Equation (3.6):

$P(L_i) = P(L_i/L_{i+1})\,P(L_{i+1}/L_{i+2})\,P(L_{i+2}/L_{i+3}) \cdots P(L_{N-1}/L_N)\,P(L_N)$   (3.6)

For both the present 3-level and the proposed cache systems, all the cores have their own private L1 cache, so the respective L1 caches are always available for serving the cores. For L2 in the 3-level cache system, the probability of serving one of its descendants is 1/N (fully shared) or 2/N (semi-shared); the probability of the same for a descendant in the proposed cache system is always 1/2, and this is valid for both the LogN+1 and LogN models.

For the LogN+1 model, another important parameter is the probability for a core to get service from the cache system. It is found that this probability for the proposed binary tree based multi-level cache system is the same as that of the 3-level cache system: there is no change in the probability of serving an individual core in this model, and it is not affected by the increase in the number of cache levels. Rewriting Equation (3.6) for the LogN+1 model:

$P(L_1) = \left(\dfrac{1}{2}\right)^{\log_2 N} \cdot 1 = \dfrac{1}{2^{\log_2 N}} = \dfrac{1}{N}$   (3.7)

Using Equation (3.6) for the 3-level, L2 semi-shared cache system, we get

$P(L_1) = \left(\dfrac{2}{N}\right)\left(\dfrac{1}{2}\right) = \dfrac{1}{N}$   (3.8)

which is the same as given by Equation (3.7).

For the LogN model, the probability for a core to get service from the cache system is higher than for the LogN+1 and 3-level cache models. Modifying Equation (3.6) for this model, we get Equation (3.9), whose value is greater than those of Equations (3.7) and (3.8):

$P(L_1) = \left(\dfrac{1}{2}\right)^{\log_2 N - 1} \cdot 1 = \dfrac{1}{2^{\log_2 N - 1}} = \dfrac{2}{N}$   (3.9)

3.3.3 Result Analysis

In this chapter the LogN+1 and LogN cache models were analysed using a mathematical probabilistic method. Using probability rules, the average time for each core to access the topmost cache level was calculated; similar methods were also used for the present 3-level cache system, in order to compare the results. Probability equations were also derived for an individual core being served, considering all three cache systems. From this study it can be observed that both cache models, LogN+1 and LogN, show a much lower average access time than the 3-level cache system, and that the performance gain increases as the number of cores increases (Figure-3.9). It is further observed, for both proposed cache models, that the probability of a core getting service from the L2 cache is much higher than in the present 3-level cache system. It can also be seen that the probability of a core getting service from the cache system as a whole is, for the LogN+1 model, the same as for the 3-level cache system, whereas it is higher for the LogN model.

One more characteristic of the proposed LogN and LogN+1 models is that the properties of the memory hierarchy are followed: with the increase in cache levels, the memory hierarchy is not disturbed, either by the individual cache sizes or by their cycle times / clock frequencies. Also, cost need not be an issue for the proposed multi-level cache system, because advances in memory technology have significantly reduced the cost per bit of memory.


3.4 Summary

In this chapter, a novel binary tree based multi-level cache system design for multi-core processors and its two possible implementations (models), the LogN+1 and LogN cache models, were presented. In this cache system every cache is shared by its two descendant caches, except the caches at L1, which are private to their respective cores. The number of cache levels increases as the number of cores increases; the cache system thus maintains a true pyramid (memory hierarchy), that is, with an increase in base size (number of cores), its height (memory levels) is adjusted accordingly. The models were analysed and compared with the existing 3-level cache system using a basic mathematical probabilistic approach. The results obtained indicate that, for higher numbers of cores, the proposed LogN+1 and LogN cache models work more efficiently and have a lower overall average cache access time than the present 3-level cache system, and that the performance gain increases as the number of cores increases. The proposed cache system also has a scalable and symmetric architecture: the cache load is well distributed and no cache at any level is over-utilized.


CHAPTER 4

Queuing Model of “LogN+1” and “LogN” Cache Models

Queuing theory is a well-established mathematical modelling method for studying and analysing the queuing phenomena in a system. Because of its realistic behaviour, it is used extensively to study the behaviour of various computer-related applications, such as scheduling, user and process management, batch processing, multi-programming, virtualization, and uni- and multi-processor computer architecture design. In this chapter, the two proposed cache models, LogN+1 and LogN (Chapter 3), are further analysed using an M/D/C/K-FIFO queuing model. The related performance equations for the average access time of an individual cache and of the overall cache system, and for the respective utilizations, are derived. Besides, a queuing model of the present 3-level cache system is also developed and its performance compared with the two proposed cache models.

4.1 Queuing Theory

Queuing theory is a mathematical method for analysing a client-server system. It examines every component of a system in line to be served, including the arrival process, the service process, the number of servers, the number of system places and the number of customers. A queuing network analysis is useful for determining many performance parameters, such as the mean response time, marginal probabilities, utilization, throughput, mean number of jobs, mean queue length, and mean waiting time, for any individual server and for the complete network [9, 14]. In queuing theory, any queuing system is described by Kendall's notation, first proposed by D. G. Kendall in 1953. This standard notation, 'A/B/C/K/N/D', is used to describe, characterize and classify the queuing model to which a queuing system corresponds. Each character in Kendall's notation defines a specific characteristic of a queue, as given below.


A: It defines the arrival process. It may be Markovian (M), Deterministic (D), Erlang (Ek), General (G) or Phase-type (PH).

B: It defines the service time distribution, in the same notation as the arrival process: Markovian (M), Deterministic (D), Erlang (Ek), General (G) or Phase-type (PH).

C: It defines the number of parallel servers in the queuing system.

K: It defines the capacity of the queuing system, or the maximum number of customers allowed in the system including those in service. When the number is at this maximum, further arrivals are turned away. If this number is omitted, the capacity is assumed to be unlimited, or infinite.

N: It defines the size of the population from which the customers come. If this number is omitted, the population is assumed to be unlimited, or infinite.

D: It defines the service discipline. It may be FIFO, FCFS, LIFO, LCFS, SIRO or PNPN.

4.2 M/D/C/K-FIFO Queuing Model for the LogN+1 and LogN Cache Models

Any cache hierarchy may be analysed using queuing theory by treating every cache as a server and each data request, whether from a CPU or from a lower level of cache, as a client. A complete cache hierarchy may then be considered an open queuing network in which multiple servers (caches) are attached in a specific pattern. A request from a client is served by a specific server (cache); if the server (cache) fulfils the request, the client leaves the queue, otherwise the request is sent to the next server (upper-level cache). The probability of a request being fulfilled or not at any server (cache) is the same as the hit or miss ratio, and the mean response time of a server is the same as the average cache access time. Using the queuing network, performance parameters such as mean response time (average cache access time), marginal probabilities, utilization, throughput, mean number of jobs, mean queue length and mean waiting time may be calculated for any individual server (cache) and for the complete network (cache hierarchy).

In queuing theory, M/D/C/K [5, 15] is one of the analytical models that generalize the solution of Markovian queues to the case of constant service time distributions. In the M/D/C/K model, the arrival process is Poisson and the service rate is constant and deterministic; C and K represent the number of parallel servers and the system capacity respectively.

4.2.1 Basic Model

Consider the proposed binary multi-level cache system, in which every cache level except L1 (which is local to a core) has two descendants. The average request rate $\lambda$ for the first level of cache is the same as the request rate made by the respective core. The request rates for the remaining levels are Poisson, as they depend on the probability of a cache miss at their respective descendants. The service rate of each cache depends on its clock frequency and cycle time, so it is deterministic. In our approach, every cache level except the first has two descendants (clients), of which only one can be served at a time, on a first-come first-served basis, with equal probabilities of being served. Therefore an M/D/1/2-FIFO queuing model may be applied for analysing any cache level except the first, for its utilization, throughput, average number of jobs, average queue length, average access time and average waiting time. This basic atomic M/D/1/2-FIFO queuing model is presented in figure-4.1.

[Figure: atomic M/D/1/2-FIFO block — a level-i cache serving two level-(i-1) descendants, with mean service time E[s] = Cs_i / 2.]

Figure-4.1: Atomic block model for LogN and LogN+1 cache system

The atomic M/D/1/2-FIFO model for the LogN and LogN+1 cache systems (figure-4.1) may be joined to form an open queuing network of the proposed binary tree based multi-level cache system. Every unit (cache) may be treated as a single object, which leads to a queuing network from which the performance parameters of the whole cache system can be computed. The queuing networks for the LogN+1 and LogN models are shown in figures 4.2 and 4.3 respectively.


Figure-4.2: Queuing network model for LogN+1 cache model

Figure-4.3: Queuing network model for LogN cache model

4.2.2 Performance Equations

Having mapped the LogN+1 and LogN models onto the M/D/1/2-FIFO queuing model, the respective performance equations can now be derived. In this section, equations for calculating the average data request rate at any cache level, the individual cache utilization, the individual cache access time, the average request queue length at an individual cache, and the overall average cache system access time are derived, with the following definitions.


Let $\lambda$ be the average data request rate of a core;

$\lambda_i$ be the arrival rate of requests at the $i$-th level of cache;

$P_{i,1}$ be the probability of the cache being hit at the $i$-th level;

$P_{i,0}$, such that $P_{i,0} = 1 - P_{i,1}$, be the probability of the cache being missed at the $i$-th level, in which case the search proceeds to the upper, $(i+1)$-th cache level.

4.2.2.1 Average Data Request Rate ($\lambda_i$)

The data request rate $\lambda_1$ at the L1 cache is always equal to the request rate made by the core, therefore:

$\lambda_1 = \lambda$   (4.1)

Let the probability of getting the data from the L1 cache be $P_{1,1}$, and the probability of a miss be $P_{1,0}$. In case of a miss, the data is searched for in the next upper-level cache, L2. Therefore the arrival rate at L2 is given by

$\lambda_2 = P_{1,0}\,\lambda_1 = P_{1,0}\,\lambda$   (4.2)

substituting for $\lambda_1$ from Equation (4.1). Likewise, for the L3 cache it is given by

$\lambda_3 = P_{2,0}\,\lambda_2 = P_{1,0}\,P_{2,0}\,\lambda$   (4.3)

substituting for $\lambda_2$ from Equation (4.2). Similarly for the L4 cache,

$\lambda_4 = P_{3,0}\,\lambda_3 = P_{1,0}\,P_{2,0}\,P_{3,0}\,\lambda$   (4.4)

substituting for $\lambda_3$ from Equation (4.3). Generalizing the request rate for the LogN model,

$\lambda_{\log_2 N} = \lambda \prod_{j=1}^{\log_2 N - 1} P_{j,0}$   (4.5)

and likewise for the LogN+1 model,

$\lambda_{\log_2 N + 1} = \lambda \prod_{j=1}^{\log_2 N} P_{j,0}$   (4.6)

Generalizing it for both models and for any level of cache,

$\lambda_i = \lambda \prod_{j=1}^{i-1} P_{j,0}$   (4.7)

A summary of the request rate equations for the different cache levels is given in table-4.1.

Table-4.1: Summary of the request rate equations for different cache levels

Cache Level     Request Rate                                    Hit Prob.        Miss Prob.
L1              λ1 = λ                                          P1,1             P1,0
L2              λ2 = P1,0 λ                                     P2,1             P2,0
L3              λ3 = P1,0 P2,0 λ                                P3,1             P3,0
L4              λ4 = P1,0 P2,0 P3,0 λ                           P4,1             P4,0
L(log2 N)       λ(log2 N) = λ ∏ P(j,0), j = 1 … log2 N − 1      P(log2 N),1      P(log2 N),0
L(log2 N + 1)   λ(log2 N+1) = λ ∏ P(j,0), j = 1 … log2 N        P(log2 N+1),1    P(log2 N+1),0
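Equation (4.7) is just a running product; a short sketch (our function) reproduces, for example, the first row of table-4.4 from the miss probabilities of table-4.3:

    def request_rates(lam, miss_probs):
        # Equation (4.7): lambda_i = lambda * P_{1,0} * ... * P_{i-1,0};
        # miss_probs[j] is the miss probability at level j+1.
        rates = [lam]
        for p_miss in miss_probs[:-1]:
            rates.append(rates[-1] * p_miss)
        return rates

    print(request_rates(2.0, [0.9996, 0.9950, 0.9370]))
    # [2.0, 1.9992, 1.9892...]  (cf. the 4-core row of table-4.4)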

4.2.2.2 Average Cache Utilization ($\rho_i$)

For the cache utilization, let $\rho_i$ be the utilization of the $i$-th level of cache, such that $\rho_i = \lambda_i\,E[s_i]$, where $E[s_i]$ is the average service time. Here the average service time of any level of cache is the same as its cycle time, $Cs_i$. Every cache except the first is shared by the two lower-level caches, with equal probability of being served, so the average service time for L2 and the levels above it may be given by $Cs_i/2$. Therefore:

$\rho_i = \lambda_1\,Cs_1$ for $i = 1$, and $\rho_i = \lambda_i\,Cs_i/2$ for $i \ge 2$   (4.8)

4.2.2.3 Average Individual Cache Access Time ($E[T_i]$)

For an M/D/1 queuing model, the average cache access time $E[T_i]$ of the $i$-th cache level can be calculated using the following equation (4.9) [5]:

$E[T_i] = E[s_i]\,\dfrac{2 - \rho_i}{2\,(1 - \rho_i)}$   (4.9)

4.2.2.4 Average Request Queue Length ($L_i$)

The average request queue length $L_i$ at the $i$-th cache level can be calculated as given in [5]:

$L_i = \rho_i + \dfrac{\rho_i^2}{2\,(1 - \rho_i)}$   (4.10)

4.2.2.5 Overall Average Cache Access Time ($E[T]$)

The overall average cache access time $E[T]$ of the LogN and LogN+1 queuing models can be calculated using Little's equation, $E[T] = L/\lambda$.

For the LogN model: $E[T] = \dfrac{1}{\lambda}\sum_{i=1}^{\log_2 N} L_i$   (4.11)

For the LogN+1 model: $E[T] = \dfrac{1}{\lambda}\sum_{i=1}^{\log_2 N + 1} L_i$   (4.12)

Generalizing it for the overall binary multi-level cache system with $n$ cache levels:

$E[T] = \dfrac{1}{\lambda}\sum_{i=1}^{n} L_i$   (4.13)
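Chaining equations (4.7), (4.8), (4.10) and (4.13) gives the overall figure directly, as in this sketch (our function; the per-level cycle times would be read from tables 3.7 and 3.8):

    def overall_access_time(lam, cycle_times, miss_probs):
        # Walks the cache levels of the binary tree based system:
        # arrival rate via Eq. (4.7), utilization via Eq. (4.8), queue
        # length via Eq. (4.10), combined with Little's law, Eq. (4.13).
        rate, total_jobs = lam, 0.0
        for i, (cs, p_miss) in enumerate(zip(cycle_times, miss_probs)):
            e_s = cs if i == 0 else cs / 2.0              # Eq. (4.8): E[s_i]
            rho = rate * e_s                              # Eq. (4.8): utilization
            total_jobs += rho + rho**2 / (2 * (1 - rho))  # Eq. (4.10): L_i
            rate *= p_miss                                # Eq. (4.7): next level
        return total_jobs / lam                           # Eq. (4.13)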


4.3 Queuing Model for 3-level Cache System

Consider the present 3-level cache system for N cores, where the L1 cache is private to every core, L2 is shared by half of the cores, and L3 is shared by two L2 caches. In contrast to the basic unit of the proposed binary multi-level cache model, where we have symmetry, here different queuing models must be applied for analysing the individual cache levels. For analysing the L2 cache level an M/D/1/(N/2)-FIFO queuing model can be applied, and an M/D/1/2 queuing model for analysing the L3 cache; L1 is simple to analyse, as it is private to a core. The queuing model for the L2 cache is shown in figure-4.4 and the complete queuing network is given in figure-4.5.

Let $\lambda_i$ be the data request rate at the $i$-th cache level, and let $E[T_i]$ and $\rho_i$ be the average cache access time and the cache utilization of the $i$-th cache level respectively. If $\lambda$ is the average data request rate of a core, the data request rate $\lambda_1$ for the L1 cache is always equal to the request rate made by the core, so we can write

$\lambda_1 = \lambda$   (4.14)

[Figure: atomic M/D/1/(N/2)-FIFO block — an L2 cache serving N/2 level-1 descendants, with mean service time E[s] = 2Cs_i / N.]

Figure-4.4: Atomic model of L2 cache for N cores for 3-level cache system

Considering the same definitions of the probabilities of cache miss and hit as in section 4.2, and using figure-4.4, we get

$\lambda_2 = \dfrac{N}{2}\,P_{1,0}\,\lambda_1 = \dfrac{N}{2}\,P_{1,0}\,\lambda$   (4.15)

substituting for $\lambda_1$ from Equation (4.14). The request rate at the L3 cache is given by

$\lambda_3 = 2\,P_{2,0}\,\lambda_2 = N\,P_{1,0}\,P_{2,0}\,\lambda$   (4.16)

substituting for $\lambda_2$ from Equation (4.15).

The cache utilization at each level may be calculated as follows.

For the L1 cache: $\rho_1 = \lambda_1\,E[s_1] = \lambda_1\,Cs_1$   (4.17)

For the L2 cache: $\rho_2 = \lambda_2\,E[s_2] = \lambda_2\,\dfrac{2\,Cs_2}{N}$   (4.18)

And for the L3 cache: $\rho_3 = \lambda_3\,E[s_3] = \lambda_3\,\dfrac{Cs_3}{2}$   (4.19)

The average cache access time and the average queue length at the $i$-th cache level can be calculated using equations (4.9) and (4.10) respectively. The overall average cache access time $E[T]$ of the 3-level cache queuing model can be calculated using Little's equation, $E[T] = L/\lambda$:

For the 3-level cache system: $E[T] = \dfrac{1}{\lambda}\sum_{i=1}^{3} L_i$   (4.20)

[Figure: queuing network for the 3-level cache system — each core has a private L1 cache; an L1 miss goes to an L2 cache shared by N/2 cores, an L2 miss to the L3 cache, and an L3 miss to main memory.]

Figure-4.5: Queuing network model for 3-Level cache system


4.4 Performance Evaluation

For evaluating and comparing the performance of both the proposed and the present 3-level cache systems using queuing networks, the presently available general purpose processor configuration is assumed, that is, each core operating at 4.0 GHz, a 64 KB L1 cache, and 1 GB of main memory operating at 1.00 GHz. For the computations related to the LogN+1 and LogN models, tables 3.5, 3.6, 3.7 and 3.8 from chapter 3 are also taken into account.

The proposed and the present 3-level cache systems may now be statistically analysed using the derived equations. For both proposed models, LogN+1 and LogN, the average data request rate at any cache level may be calculated using equation (4.7). Similarly, the individual cache utilization may be calculated using equation (4.8), the individual cache access time using equation (4.9), the average request queue length at an individual cache using equation (4.10), and finally the overall average cache system access time using equation (4.13).

For a detailed queuing network analysis of the LogN+1 and LogN models, two more input parameters are required. The first is the initial data request rate made by the cores to their respective private L1 caches; it is taken as half of the core speed. The second is the probability of data request generation. For this, an equal-probability method is used, that is, every cache block has an equal chance of being referenced. This is done for two reasons: firstly, two further important parameters, the number of running programs at any given time and their sizes, can be included in the calculations and their impact observed; secondly, it makes the queuing networks more generalized. (A specific trace generator, which does exploit the principle of locality, is used in the simulator developed during this PhD project; it is discussed in chapter 5.) The probability of finding the data at any cache level is calculated using equation (4.21):

$P_{i,1} = \dfrac{S_i}{2^{\,i-1}\,(M/N)\,S_p}$   (4.21)

where $S_i$ is the size of the level-$i$ cache (shared by $2^{i-1}$ cores), $M$ is the number of running programs, $N$ is the number of cores, and $S_p$ is the size of each program.
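A sketch of this calculation (our function; the cache sizes follow the GP of section 3.2.2, and a level-i cache is assumed to serve the working set of the 2^(i-1) cores sharing it) reproduces, for example, the 64-core row of table-4.2:

    def hit_probabilities(n_cores, levels, n_programs=64, prog_kb=10 * 1024,
                          l1_kb=64.0, mem_kb=1024 * 1024):
        r = (mem_kb / l1_kb) ** (1.0 / levels)    # GP ratio of section 3.2.2
        probs = []
        for i in range(1, levels + 1):
            size_kb = l1_kb * r ** (i - 1)        # size of one level-i cache
            sharing = 2 ** (i - 1)                # cores sharing that cache
            data_kb = sharing * (n_programs / n_cores) * prog_kb  # data mapped to it
            probs.append(min(1.0, size_kb / data_kb))             # Eq. (4.21)
        return probs

    print(hit_probabilities(64, 7))
    # [0.00625, 0.0125, 0.025, 0.05, 0.1, 0.2, 0.4]  (cf. table-4.2, 64 cores)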

4.4.1 LogN+1 Model

The following tables, 4.2 to 4.8, show the steps involved in calculating the average access time of each individual cache and of the overall cache system for the LogN+1 model. For the probability calculation (equation 4.21), 64 running processes (programs) of 10 MB each are assumed. Tables 4.2 and 4.3 show the probabilities of finding the data (cache hit) or not (cache miss) at each individual cache as the number of cores varies. Table-4.4 shows the data request rates at each individual cache. The utilization of each cache at the different levels is calculated in table-4.5. Tables 4.6 and 4.7 present the individual cache access times and the request queue lengths at the different levels respectively. Finally, table-4.8 shows the overall average access time for the LogN+1 cache model.

Table-4.2: Probability of cache hit at different cache levels for different number of cores (LogN+1 model)

Cores  Levels  P1,1    P2,1    P3,1    P4,1    P5,1    P6,1    P7,1    P8,1    P9,1    P10,1   P11,1
4      3       0.0004  0.0050  0.0630  -       -       -       -       -       -       -       -
8      4       0.0008  0.0044  0.0250  0.1414  -       -       -       -       -       -       -
16     5       0.0016  0.0054  0.0189  0.0660  0.2297  -       -       -       -       -       -
32     6       0.0031  0.0079  0.0198  0.0500  0.1260  0.3175  -       -       -       -       -
64     7       0.0063  0.0125  0.0250  0.0500  0.1000  0.2000  0.4000  -       -       -       -
128    8       0.0063  0.0210  0.0354  0.0595  0.1000  0.1682  0.2828  0.4757  -       -       -
256    9       0.0063  0.0184  0.0540  0.0794  0.1167  0.1714  0.2520  0.3703  0.5443  -       -
512    10      0.0063  0.0165  0.0435  0.1149  0.1516  0.2000  0.2639  0.3482  0.4595  0.6063  -
1024   11      0.0063  0.0151  0.0365  0.0882  0.2130  0.2573  0.3109  0.3756  0.4537  0.5481  0.6622

Table-4.3: Probability of cache miss at different cache levels for different number of cores (LogN+1 model)

Cores  Levels  P1,0    P2,0    P3,0    P4,0    P5,0    P6,0    P7,0    P8,0    P9,0    P10,0   P11,0
4      3       0.9996  0.9950  0.9370  -       -       -       -       -       -       -       -
8      4       0.9992  0.9956  0.9750  0.8586  -       -       -       -       -       -       -
16     5       0.9984  0.9946  0.9811  0.9340  0.7703  -       -       -       -       -       -
32     6       0.9969  0.9921  0.9802  0.9500  0.8740  0.6825  -       -       -       -       -
64     7       0.9938  0.9875  0.9750  0.9500  0.9000  0.8000  0.6000  -       -       -       -
128    8       0.9938  0.9790  0.9646  0.9405  0.9000  0.8318  0.7172  0.5243  -       -       -
256    9       0.9938  0.9816  0.9460  0.9206  0.8833  0.8286  0.7480  0.6297  0.4557  -       -
512    10      0.9938  0.9835  0.9565  0.8851  0.8484  0.8000  0.7361  0.6518  0.5405  0.3937  -
1024   11      0.9938  0.9849  0.9635  0.9118  0.7870  0.7427  0.6891  0.6244  0.5463  0.4519  0.3378

Table-4.4: Request rate at different cache levels for different number of cores (LogN+1 Model)

Cores  Levels  λ1      λ2      λ3      λ4      λ5      λ6      λ7      λ8      λ9      λ10     λ11
4      3       2.0000  1.9992  1.9893  1.8640  -       -       -       -       -       -       -
8      4       2.0000  1.9984  1.9896  1.9399  1.6655  -       -       -       -       -       -
16     5       2.0000  1.9969  1.9860  1.9484  1.8198  1.4017  -       -       -       -       -
32     6       2.0000  1.9938  1.9781  1.9388  1.8419  1.6098  1.0987  -       -       -       -
64     7       2.0000  1.9875  1.9627  1.9136  1.8179  1.6361  1.3089  0.7853  -       -       -
128    8       2.0000  1.9875  1.9457  1.8769  1.7653  1.5888  1.3216  0.9478  0.4969  -       -
256    9       2.0000  1.9875  1.9510  1.8456  1.6991  1.5009  1.2436  0.9302  0.5857  0.2669  -
512    10      2.0000  1.9875  1.9547  1.8696  1.6549  1.4040  1.1232  0.8268  0.5389  0.2913  0.1147
1024   11      2.0000  1.9875  1.9575  1.8861  1.7198  1.3535  1.0052  0.6927  0.4325  0.2363  0.1068


Table-4.5: Cache utilization of different cache levels for different number of cores (LogN+1 Model)

Cores  Levels  ρ1     ρ2     ρ3     ρ4     ρ5     ρ6     ρ7     ρ8     ρ9     ρ10    ρ11
4      3       0.500  0.397  0.627  -      -      -      -      -      -      -      -
8      4       0.500  0.353  0.497  0.686  -      -      -      -      -      -      -
16     5       0.500  0.329  0.432  0.560  0.690  -      -      -      -      -      -
32     6       0.500  0.314  0.392  0.485  0.580  0.639  -      -      -      -      -
64     7       0.500  0.303  0.365  0.433  0.502  0.551  0.537  -      -      -      -
128    8       0.500  0.295  0.344  0.395  0.441  0.472  0.467  0.398  -      -      -
256    9       0.500  0.290  0.332  0.366  0.393  0.405  0.392  0.342  0.251  -      -
512    10      0.500  0.285  0.322  0.354  0.360  0.351  0.323  0.273  0.204  0.127  -
1024   11      0.500  0.282  0.315  0.344  0.356  0.318  0.268  0.209  0.148  0.092  0.047

Table-4.6: Individual average cache access time (nsec) at different cache levels for different number of cores (LogN+1 Model)

Cores  Levels  E[T1]  E[T2]  E[T3]  E[T4]  E[T5]  E[T6]  E[T7]  E[T8]  E[T9]  E[T10]  E[T11]
4      3       0.375  1.055  2.317  -      -      -      -      -      -      -       -
8      4       0.375  0.900  1.495  2.958  -      -      -      -      -      -       -
16     5       0.375  0.822  1.202  1.878  3.199  -      -      -      -      -       -
32     6       0.375  0.774  1.050  1.470  2.130  2.991  -      -      -      -       -
64     7       0.375  0.742  0.956  1.252  1.660  2.170  2.592  -      -      -       -
128    8       0.375  0.719  0.892  1.115  1.395  1.721  2.034  2.239  -      -       -
256    9       0.375  0.702  0.849  1.023  1.226  1.448  1.666  1.851  2.002  -       -
512    10      0.375  0.689  0.817  0.966  1.116  1.270  1.422  1.567  1.710  1.868   -
1024   11      0.375  0.678  0.791  0.921  1.056  1.158  1.260  1.368  1.490  1.633   1.807

Table-4.7: Individual request queue length at different cache levels for different number of cores (LogN+1 Model)

Cores  Levels  L1     L2     L3     L4     L5     L6     L7     L8     L9     L10    L11
4      3       0.750  0.527  1.152  -      -      -      -      -      -      -      -
8      4       0.750  0.450  0.744  1.434  -      -      -      -      -      -      -
16     5       0.750  0.410  0.597  0.915  1.456  -      -      -      -      -      -
32     6       0.750  0.386  0.519  0.713  0.981  1.204  -      -      -      -      -
64     7       0.750  0.369  0.469  0.599  0.754  0.888  0.848  -      -      -      -
128    8       0.750  0.357  0.434  0.523  0.616  0.684  0.672  0.530  -      -      -
256    9       0.750  0.349  0.414  0.472  0.521  0.543  0.518  0.431  0.293  -      -
512    10      0.750  0.342  0.399  0.451  0.462  0.446  0.399  0.324  0.230  0.136  -
1024   11      0.750  0.337  0.387  0.434  0.454  0.392  0.317  0.237  0.161  0.096  0.048

Table-4.8: Average cache access time (nsec) of the LogN+1 cache model for different number of cores

Number of Cores  4     8     16    32    64    128   256   512   1024
Cache Levels     3     4     5     6     7     8     9     10    11
E[T] (nsec)      1.26  1.69  2.06  2.28  2.34  2.28  2.15  1.97  1.81


From table-4.2 it may be observed that the probability of finding the data increases as one moves up the hierarchy, and this holds for the different numbers of cores as well, justifying the basis of our probability formula and the exploitation of the locality principle. Table-4.4 shows that the request rate decreases as one moves up, signifying that most of the requests are fulfilled at the lower cache levels. The utilization of each cache is uniform and well distributed, and no single cache at any level causes congestion in the proposed cache system, as shown in table-4.5; a number of calculations were made with different numbers and sizes of programs, and this behaviour was always observed. Table-4.6 shows the individual average access time of each cache at each level for different numbers of cores. Table-4.7 displays the average request queue length of each individual cache; it may be observed that the average queue length decreases as one moves up the hierarchy, clearly indicating that the request load is well distributed in the proposed LogN+1 cache model.

4.4.2 LogN Model

The same process of calculation, with the same input parameters, is carried out for the LogN model. The following tables, 4.9 to 4.15, show the steps involved in calculating the average access time of each individual cache and of the overall cache system for the LogN model. Tables 4.9 and 4.10 show the probabilities of finding the data (cache hit) or not (cache miss) at each individual cache as the number of cores varies. Table-4.11 shows the data request rates at each individual cache. The utilization of each cache at the different levels is calculated in table-4.12. Tables 4.13 and 4.14 present the individual cache access times and the request queue lengths at the different levels respectively. Finally, table-4.15 shows the overall average access time for the LogN cache model.

Table-4.9: Probability of cache hit at different cache levels for different number of cores (LogN model)

Cores  Levels  P1,1    P2,1    P3,1    P4,1    P5,1    P6,1    P7,1    P8,1    P9,1    P10,1
4      3       0.0004  0.0050  0.0630  -       -       -       -       -       -       -
8      3       0.0008  0.0099  0.1260  -       -       -       -       -       -       -
16     4       0.0016  0.0088  0.0500  0.2828  -       -       -       -       -       -
32     5       0.0031  0.0109  0.0379  0.1320  0.4595  -       -       -       -       -
64     6       0.0063  0.0157  0.0397  0.1000  0.2520  0.6350  -       -       -       -
128    7       0.0063  0.0250  0.0500  0.1000  0.2000  0.4000  0.8000  -       -       -
256    8       0.0063  0.0210  0.0707  0.1189  0.2000  0.3364  0.5657  0.9514  -       -
512    9       0.0063  0.0184  0.0540  0.1587  0.2333  0.3429  0.5040  0.7407  1.0000  -
1024   10      0.0063  0.0165  0.0435  0.1149  0.3031  0.4000  0.5278  0.6964  0.9190  1.0000


Table-4.10: Probability of cache miss at different cache levels for different number of cores (LogN model)

No. of   Cache    P1,0    P2,0    P3,0    P4,0    P5,0    P6,0    P7,0    P8,0    P9,0    P10,0
Cores    Levels
4        3        0.9996  0.9950  0.9370  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
8        3        0.9992  0.9901  0.8740  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
16       4        0.9984  0.9912  0.9500  0.7172  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
32       5        0.9969  0.9891  0.9621  0.8680  0.5405  0.0000  0.0000  0.0000  0.0000  0.0000
64       6        0.9938  0.9843  0.9603  0.9000  0.7480  0.3650  0.0000  0.0000  0.0000  0.0000
128      7        0.9938  0.9750  0.9500  0.9000  0.8000  0.6000  0.2000  0.0000  0.0000  0.0000
256      8        0.9938  0.9790  0.9293  0.8811  0.8000  0.6636  0.4343  0.0486  0.0000  0.0000
512      9        0.9938  0.9816  0.9460  0.8413  0.7667  0.6571  0.4960  0.2593  0.0000  0.0000
1024     10       0.9938  0.9835  0.9565  0.8851  0.6969  0.6000  0.4722  0.3036  0.0810  0.0000

Table-4.11: Request rate at different cache levels for different number of cores (LogN Model)

No. of   Cache    λ1      λ2      λ3      λ4      λ5      λ6      λ7      λ8      λ9      λ10
Cores    Levels
4        3        2.0000  1.9992  1.9893  1.8640  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
8        3        2.0000  1.9984  1.9786  1.7293  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
16       4        2.0000  1.9969  1.9792  1.8803  1.3484  0.0000  0.0000  0.0000  0.0000  0.0000
32       5        2.0000  1.9938  1.9721  1.8973  1.6470  0.8902  0.0000  0.0000  0.0000  0.0000
64       6        2.0000  1.9875  1.9562  1.8786  1.6907  1.2647  0.4617  0.0000  0.0000  0.0000
128      7        2.0000  1.9875  1.9378  1.8409  1.6568  1.3255  0.7953  0.1591  0.0000  0.0000
256      8        2.0000  1.9875  1.9457  1.8081  1.5931  1.2745  0.8458  0.3673  0.0179  0.0000
512      9        2.0000  1.9875  1.9510  1.8456  1.5527  1.1904  0.7822  0.3880  0.1006  0.0000
1024     10       2.0000  1.9875  1.9547  1.8696  1.6549  1.1532  0.6919  0.3267  0.0992  0.0080

Table-4.12: Cache utilization of different cache levels for different number of cores (LogN Model)

No. of   Cache    ρ1      ρ2      ρ3      ρ4      ρ5      ρ6      ρ7      ρ8      ρ9      ρ10
Cores    Levels
4        3        0.500   0.397   0.627   0       0       0       0       0       0       0
8        3        0.500   0.397   0.623   0       0       0       0       0       0       0
16       4        0.500   0.353   0.495   0.665   0       0       0       0       0       0
32       5        0.500   0.329   0.429   0.545   0.624   0       0       0       0       0
64       6        0.500   0.313   0.388   0.470   0.533   0.502   0       0       0       0
128      7        0.500   0.303   0.360   0.417   0.457   0.446   0.326   0       0       0
256      8        0.500   0.295   0.344   0.380   0.398   0.379   0.299   0.154   0       0
512      9        0.500   0.290   0.332   0.366   0.359   0.321   0.246   0.143   0.043   0
1024     10       0.500   0.285   0.322   0.354   0.360   0.288   0.199   0.108   0.038   0.003

Table-4.13: Individual average cache access time (nsec) at different cache levels for different number of cores (LogN Model)

No. of   Cache    E1[s]   E2[s]   E3[s]   E4[s]   E5[s]   E6[s]   E7[s]   E8[s]   E9[s]   E10[s]
Cores    Levels
8        3        0.375   1.054   2.302   0.000   0.000   0.000   0.000   0.000   0.000   0.000
16       4        0.375   0.900   1.490   2.816   0.000   0.000   0.000   0.000   0.000   0.000
32       5        0.375   0.821   1.198   1.836   2.774   0.000   0.000   0.000   0.000   0.000
64       6        0.375   0.773   1.045   1.443   1.978   2.387   0.000   0.000   0.000   0.000
128      7        0.375   0.742   0.952   1.229   1.569   1.888   2.038   0.000   0.000   0.000
256      8        0.375   0.719   0.892   1.099   1.331   1.552   1.716   1.835   0.000   0.000
512      9        0.375   0.702   0.849   1.023   1.186   1.336   1.466   1.592   1.753   0.000
1024     10       0.375   0.689   0.817   0.966   1.116   1.203   1.291   1.399   1.545   1.744


Table-4.14: Individual request queue length at different cache levels for different number of cores (LogN Model)

No. of   Cache    L1      L2      L3      L4      L5      L6      L7      L8      L9      L10
Cores    Levels
4        3        0.750   0.527   1.152   0.000   0.000   0.000   0.000   0.000   0.000   0.000
8        3        0.750   0.527   1.139   0.000   0.000   0.000   0.000   0.000   0.000   0.000
16       4        0.750   0.449   0.737   1.324   0.000   0.000   0.000   0.000   0.000   0.000
32       5        0.750   0.409   0.591   0.871   1.142   0.000   0.000   0.000   0.000   0.000
64       6        0.750   0.384   0.511   0.678   0.836   0.755   0.000   0.000   0.000   0.000
128      7        0.750   0.369   0.461   0.566   0.650   0.625   0.405   0.000   0.000   0.000
256      8        0.750   0.357   0.434   0.497   0.530   0.494   0.363   0.169   0.000   0.000
512      9        0.750   0.349   0.414   0.472   0.460   0.398   0.287   0.154   0.044   0.000
1024     10       0.750   0.342   0.399   0.451   0.462   0.347   0.223   0.114   0.038   0.004

Table-4.15: Average cache access time (nsec) of LogN cache model for different number of cores

Number of Cores               4      8      16     32     64     128    256    512    1024
Cache Levels                  3      3      4      5      6      7      8      9      10
Average access time (nsec)    1.25   1.24   1.40   1.38   1.33   1.26   1.19   1.14   1.11

The results obtained for the LogN model lead to the same observations as for the LogN+1 model. From table-4.9 it may be observed that the probability of finding the data increases as we move up the hierarchy, and this holds for different numbers of cores as well. Table-4.11 shows that the request rate decreases as one moves up, which signifies that most of the requests are being fulfilled at the lower cache levels. Utilization of each cache is also uniform and well distributed in this model, and no single cache at any level causes congestion in the LogN cache model, as shown in table-4.12. A number of calculations were made with different numbers of programs and program sizes, and this behaviour was always observed. Table-4.13 shows the individual average access time for each cache at different levels for different numbers of cores. Table-4.14 displays the average request queue length for each individual cache. It may again be observed that the average queue length decreases as one moves up the hierarchy, clearly indicating that the request load is well distributed in the proposed LogN cache model.

4.4.3 Present 3-level cache system

For the 3-level cache system, the average data request rate at any cache level can be calculated using equation (4.7). The respective cache utilization may then be calculated using equations (4.17), (4.18) and (4.19), the individual cache access time using equation (4.9), the average request queue length at an individual cache using equation (4.10) and, finally, the overall average cache access time using equation (4.20).

When the same input parameters are applied to the 3-level queuing cache model, over-utilization of the L2 cache can be observed, and L2 becomes more congested as the number of cores increases. This is shown in table-4.16. It occurs because all the private L1 caches communicate with the single L2 cache at the same time whenever their requests are not satisfied at L1, and this request rate increases as the number of cores increases. The scenario can be observed clearly in figure 4.6, which is based on table 4.16.

Table-4.16: Cache utilization for 3-level cache system for different number of cores

No. of Cores    ρ (L1)    ρ (L2)     ρ (L3)
4               0.500     1.333      1.999
8               0.500     2.667      1.999
16              0.500     5.333      1.999
32              0.500     10.665     1.999
64              0.500     21.325     1.998
128             0.500     42.633     1.998
256             0.500     85.200     1.996
512             0.500     170.133    1.993
1024            0.500     339.200    1.987

Figure-4.6: Utilization of L1, L2 & L3 cache in 3-level cache system with different number of cores. Over-utilization of the L2 cache may be noted

The following table-4.17 shows the overall average access time for the present 3-level cache system for different numbers of cores.


Table-4.17: Average cache access time (nsec) of present 3-level cache system for different number of cores

Number of Cores               4      8      16     32     64     128    256    512    1024
Cache Levels                  3      4      5      6      7      8      9      10     11
Average access time (nsec)    1.72   2.39   3.73   6.39   11.72  22.38  43.66  86.13  170.6

4.4.4 Result Analysis

After calculating the average cache access time for different numbers of cores using the respective queuing models for the LogN+1, LogN and 3-level cache systems, all these results are compared. The comparison is shown in table-4.18, and a comparison graph for the same is given in figure-4.7. From this queuing analysis, our results are found to be in conformity with the first approach: the overall average cache access time of the proposed cache system for larger numbers of cores is found to be much lower than that of the present 3-level cache system. The same observations are made, namely that for a small number of cores the overall average cache access time is nearly the same for all three cache systems, but as the number of cores increases the average cache access time for the LogN+1 and LogN models reduces sharply. It may again be noted that the LogN model has a slightly lower average cache access time than the LogN+1 model.

Table-4.18: Average cache access time for LogN+1, LogN and 3-level cache for different number of cores using queuing networks analysis

Queuing Network Analysis

Average Access Time (nsec)

Cache Levels Number Binary Tree Based Multi-level 3- Level Cache of Cores Cache system System (Semi Shared L2 LogN+1 Model LogN Model LogN+1 Model LogN Model Cache) 4 3 3 1.26 1.25 1.73 8 4 3 1.69 1.24 2.40 16 5 4 2.06 1.40 3.73 32 6 5 2.28 1.38 6.39 64 7 6 2.34 1.33 11.72 128 8 7 2.28 1.26 22.38 256 9 8 2.15 1.19 43.66 512 10 9 1.97 1.14 86.13 1024 11 10 1.81 1.11 170.66


[Figure: log-scale plot of average cache access time (nsec), 1 to 1000, against number of cores (4 to 1024) for the LogN+1, LogN and 3-level cache systems.]

Figure-4.7: Comparison of average cache access time in the 3-level cache system with the LogN and LogN+1 cache models as the number of cores varies, using queuing network analysis.

Figures 4.8 (a-h) show the comparison graphs of average cache access time for the LogN, LogN+1 and 3-level cache systems obtained with the two different analytical approaches used so far: the mathematical probabilistic model (chapter 3) and the queuing model (the current chapter).

For the two approaches the average cache access time differs because of their respective analytical methodologies and input parameters. It may also be observed, for all three cache models, that the first, probabilistic approach shows the greater average cache access time because of its worst-case analysis, while the queuing analysis yields a lower average cache access time because of the use of the fair probability rule.

However, it can be clearly observed that in both analytical approaches our proposed models require less average cache access time than the present cache system, and this difference in average cache access time increases as the number of cores increases. Further, it may be observed that, comparing the proposed LogN and LogN+1 models, the LogN model has a slightly lower average cache access time.

(a)


(b)

(c)

(d)


(e)

(f)

(g)


(h)

Figures 4.8 (a-h): Comparison between the average cache access time of LogN+1, LogN and the present 3-level cache system estimated for different numbers of cores using the probabilistic model and queuing analysis. It is assumed that the system has 64 running processes, 1 GB of main memory operating at 1 GHz, 64 KB of L1 cache, and cores operating at 4.0 GHz.

4.5 Summary

In this chapter the two proposed cache models, LogN+1 and LogN, were further analysed using the M/D/C/K-FIFO queuing model. The related performance equations for the average access time of an individual cache and of the overall cache system, and for the respective utilizations, have been derived. Besides, a queuing model for the present 3-level cache system has also been developed and its performance compared with that of the two proposed cache models. The results obtained with the queuing model were found to be in conformity with the first approach; that is, the proposed cache models had lower access time and greater efficiency and scalability than the present 3-level cache system.


CHAPTER 5

Simulation of 'LogN+1' and 'LogN' Cache Models Using 'MCSMC'

Simulation can be used to represent a model of system behaviour in the time domain. It has proven to be a key method for experimental computer architecture design covering caches, memory systems, IO systems, processor pipelining, instruction execution traces, etc. In the previous chapters 3 and 4, the performance of the proposed LogN+1 and LogN cache models for multi-core processors was evaluated and compared with that of the present 3-level cache system using a probabilistic mathematical method and M/D/C/K queuing network analysis respectively. In order to evaluate and analyse the proposed models in a real-time environment, simulation is used as a third independent tool. As no suitable simulator is available to simulate such a large number of cores and cache levels, there was no choice except to develop a simulator for the simulation of the LogN+1 and LogN models. For this purpose a parallel trace-driven multi-level cache simulator named 'MCSMC' (Multi-level Cache Simulator for Multi-Cores) was developed. In this chapter the software design and working of the developed simulator, 'MCSMC', is presented. Besides, the calibration of 'MCSMC' against CACTI, a standard cache simulator, is also discussed. The simulation results obtained for the LogN+1 and LogN cache models and the present 3-level cache system are analysed and discussed at the end of this chapter.

5.1 Cache Simulation

Simulation is a widely accepted tool for the evaluation of any proposed cache system under different application and configuration scenarios. This is because the high degree of configurability of cache memory requires extensive design-space exploration and identification of performance bottlenecks in the system under study. Cache performance parameters like size, line (block) size and replacement policy can vary considerably depending on configuration and workload. Cache memory is usually evaluated using either execution-driven or trace-driven simulation [59, 64]. Execution-driven simulation is performed at various levels of abstraction, from the algorithmic level to bit-accurate and cycle-accurate RTL. However, execution-driven simulation is slow and requires high-level skills in architectural modelling and application source-coding, plus a development toolkit. Trace-driven simulators accept a chronological stream of memory references and evaluate hit and miss statistics based on the selected configuration. Trace-driven simulation is an attractive way of exploring multi-level caches.
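To make the trace-driven approach concrete, the following minimal C++ sketch replays a chronological stream of addresses against a simple multi-level hierarchy and collects hit/miss statistics. It is purely illustrative; names such as CacheLevel and simulate_trace are assumptions, and the eviction here is naive rather than a real replacement policy:

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <unordered_set>
    #include <vector>

    // One cache level modelled only by the set of blocks it currently holds.
    // A real simulator would model sets, ways and a replacement policy.
    struct CacheLevel {
        std::unordered_set<uint64_t> blocks;  // resident block addresses
        std::size_t capacity = 0;             // capacity in blocks
        uint64_t hits = 0, misses = 0;
    };

    // Replay one trace (a chronological stream of addresses) through the levels.
    void simulate_trace(const std::vector<uint64_t>& trace,
                        std::vector<CacheLevel>& levels,
                        uint64_t block_size) {
        for (uint64_t addr : trace) {
            uint64_t block = addr / block_size;
            for (auto& lvl : levels) {
                if (lvl.blocks.count(block)) { ++lvl.hits; break; }  // hit: stop here
                ++lvl.misses;                                        // miss: try next level
                if (lvl.blocks.size() >= lvl.capacity)
                    lvl.blocks.erase(lvl.blocks.begin());            // naive eviction
                lvl.blocks.insert(block);                            // fill on the way up
            }
        }
    }

    int main() {
        std::vector<CacheLevel> levels(3);
        levels[0].capacity = 512; levels[1].capacity = 4096; levels[2].capacity = 32768;
        std::vector<uint64_t> trace = {0x100, 0x140, 0x100, 0x2000, 0x100};
        simulate_trace(trace, levels, 64);
        for (std::size_t i = 0; i < levels.size(); ++i)
            std::cout << "L" << i + 1 << " hits=" << levels[i].hits
                      << " misses=" << levels[i].misses << "\n";
    }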

Currently available cache simulators like the FLASH simulator [18], SimOS [62], SIMICS [55], SimpleScalar [51], Cachegrind [71], CASTOR [43], Dinero III & IV [21], DCMSim [7], MSCSim [49], Mlcachec [74], SMPCache [75] and many others, whether commercial, research-oriented or didactic, usually support two or three levels of cache. Some of them have an option for multi-level cache but do not support multi-core processors, and where such support exists they cannot be configured for the proposed LogN and LogN+1 multi-level cache systems.

5.2 ‘MCSMC’ - Multi-level Cache Simulator for Multi-Cores

The 'MCSMC' (Multi-level Cache Simulator for Multi-Cores) is a parallel trace-driven multi-level cache simulator developed to simulate multi-level cache designs for multi-core processors. This simulator was specially developed during this PhD research as no current cache simulator supports such a large number of cores and cache levels. The developed simulator has been calibrated against CACTI (discussed in section 5.2.4) and tested for up to 2048 cores and 12 cache levels. It is coded in Visual C++ using OpenMP and the Win32 process/thread libraries.

5.2.1 Input Parameters Set

The 'MCSMC' simulator has an input set of eleven parameters, including one for defining the cache system and one for defining the number of simulation runs. All these parameters are defined in an input configuration file. If a cache system is simulated for at least 10 values of each performance parameter, then millions of different scenarios can be simulated. The following table-5.1 describes the input parameters with their default and maximum or other possible values.

Table-5.1: Input parameters for 'MCSMC' with their default and maximum or other possible values.

Input Parameter          Default              Maximum / Other Possible Values
Cache Model              3-level Cache (3)    LogN+1 (13) and LogN (12)
(Cache Levels)
Number of Cores          8                    2048
Core Frequency           2 GHz                8 GHz
Main Memory Size         4 GB                 32 GB
Main Memory Speed        1333 MHz             2066 MHz
L1 Cache Size            1 MB                 4 MB
Cache Line Size          32 KB                1 MB
Number of Programs       64                   2048
Each Program Size        10 MB                Depends on main memory available on the
                                              computer running the simulation
Replacement Policy       LRU                  LIFO / FIFO / User defined
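For illustration only, an input configuration for a LogN run might look like the following key/value listing. The actual syntax of the 'MCSMC' configuration file is not reproduced here; the parameter names below merely mirror table-5.1 and the simulation environment of section 5.3.1:

    CacheModel        = LogN        ; LogN | LogN+1 | 3-level
    NumberOfCores     = 64
    CoreFrequency     = 4 GHz
    MainMemorySize    = 1 GB
    MainMemorySpeed   = 1333 MHz
    L1CacheSize       = 64 KB
    CacheLineSize     = 32 KB
    NumberOfPrograms  = 64
    EachProgramSize   = 10 MB
    ReplacementPolicy = LRU
    SimulationRuns    = 10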

5.2.2 Software Modules

'MCSMC' is developed using a modular approach. It has five independent software modules, coded in such a way that any module can be modified easily without changing the rest of the simulation code. The modules can also be executed in parallel; parallel execution of the modules is discussed in section 5.2.3. Each module and its functionality is discussed below. Figure-5.1 shows the software design methodology used in the simulator.


5.2.2.1 Cache Architecture Generator

This software module in the simulator is responsible for generating LogN, LogN+1 and 3-level cache models. All the necessary queues, stacks, arrays and other data structures are initialized according to the input parameter file.

5.2.2.2 Program Scheduler

In our simulator the program scheduler is responsible for scheduling programs (jobs) on cores. For 'N' cores and 'P' programs, if N=P, one program is assigned to each core. For N>P, a single program is assigned to multiple adjacent cores; this is done because two adjacent cores share a cache at the next level, so the probability of data hits increases. For N<P, multiple programs are assigned to each core.
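A minimal sketch of this assignment rule (hypothetical code, not the actual MCSMC scheduler module) maps program indices to core indices for the cases above:

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Returns, for each core, the list of program indices assigned to it.
    // N = number of cores, P = number of programs.
    std::vector<std::vector<int>> schedule(int N, int P) {
        std::vector<std::vector<int>> byCore(N);
        if (N >= P) {
            // Each program occupies a contiguous group of adjacent cores, so
            // neighbouring cores share the same program's data at the next level.
            int group = N / P;
            for (int c = 0; c < N; ++c)
                byCore[c].push_back(std::min(c / group, P - 1));
        } else {
            // More programs than cores: distribute programs round-robin over cores.
            for (int p = 0; p < P; ++p) byCore[p % N].push_back(p);
        }
        return byCore;
    }

    int main() {
        auto s = schedule(8, 4);  // 8 cores, 4 programs: two adjacent cores per program
        for (std::size_t c = 0; c < s.size(); ++c) {
            std::cout << "core " << c << ":";
            for (int p : s[c]) std::cout << " P" << p;
            std::cout << "\n";
        }
    }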

5.2.2.3 Trace Generator

This module is responsible for the generation of the traces used as the memory reference data set. Different methods for generating memory traces are already available [11, 12, 13, 42, 46, 48, 50, 52, 57, 60]. For our simulation we have used distribution-driven trace generation with a static data-size characterisation approach [59]. In this approach the Least Recently Used stack model is used to concisely capture the key locality features of a trace, and a two-state Markov chain model is used for trace generation. The generated trace also exploits the Pareto Principle [40], or 90/10 rule [10], highlighting the significance of locality in evaluating and optimizing cache performance. Besides, the trace generation method in our simulation is kept separate from the rest of the code, so any trace generation method can be coded and applied directly without affecting the remaining code.
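The idea can be pictured with the following illustrative sketch, in which a two-state Markov chain decides whether the next reference is drawn from near the top of an LRU stack (strong locality, giving 90/10-like behaviour) or is a fresh address. The transition probabilities and stack depths used here are assumptions, not the values used in MCSMC:

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <iostream>
    #include <random>
    #include <vector>

    // Generate a synthetic trace whose locality follows a two-state Markov chain:
    // in the LOCAL state, addresses are re-drawn from the top of an LRU stack;
    // in the DISTANT state, a new random address is introduced.
    std::vector<uint64_t> generate_trace(std::size_t length, uint64_t addr_space) {
        std::mt19937_64 rng(42);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::uniform_int_distribution<uint64_t> fresh(0, addr_space - 1);

        bool local = true;                 // current Markov state
        const double p_stay_local = 0.9;   // assumed transition probabilities
        const double p_stay_distant = 0.5;

        std::deque<uint64_t> lru;          // LRU stack of recent addresses
        std::vector<uint64_t> trace;
        trace.reserve(length);

        for (std::size_t i = 0; i < length; ++i) {
            uint64_t a;
            if (local && !lru.empty()) {
                // Re-reference something near the stack top (strong locality).
                std::size_t depth = std::min<std::size_t>(lru.size(), 10);
                a = lru[rng() % depth];
            } else {
                a = fresh(rng);            // a brand-new reference
            }
            // Move the address to the top of the LRU stack.
            for (auto it = lru.begin(); it != lru.end(); ++it)
                if (*it == a) { lru.erase(it); break; }
            lru.push_front(a);
            if (lru.size() > 4096) lru.pop_back();
            trace.push_back(a);
            // Advance the Markov chain.
            double r = coin(rng);
            local = local ? (r < p_stay_local) : (r >= p_stay_distant);
        }
        return trace;
    }

    int main() {
        auto t = generate_trace(100000, 1u << 20);
        std::cout << "generated " << t.size() << " references\n";
    }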


[Block diagram: the input parameters (cache model: LogN, LogN+1 or 3-level; number of programs; size and frequency of main memory; number and frequency of cores; replacement policy; each program size; speed and size of the L1 cache; cache line size) feed the Cache Architecture Generator and the Scheduler. Each core has its own Trace Generator and private L1 cache; pairs of cores share a binary L2 cache, pairs of L2 caches share a binary L3 cache, and so on through the (N-1)th and Nth level caches down to main memory, with the Replacement Policy Module operating across all cache levels.]

Figure-5.1: Simulator design methodology for parallel trace-driven multi-level cache simulator for LogN+1, LogN and 3-level models


5.2.2.4 Replacement Policy Module

This module is responsible for replacing old cache blocks with new ones according to a defined policy. For our experiment we applied the Least Recently Used (LRU) policy, as it is a widely accepted and commonly used replacement policy. During simulation, this module is executed as a completely independent thread which checks all the caches at each level and replaces cache blocks accordingly.
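A compact way to realise LRU for a single cache, shown here as an illustrative sketch rather than MCSMC's own data structure, is a recency-ordered list plus a map for constant-time lookup:

    #include <cstdint>
    #include <iostream>
    #include <list>
    #include <unordered_map>

    // One cache with LRU replacement: the most recently used block is at the front.
    class LruCache {
        std::size_t capacity_;
        std::list<uint64_t> order_;  // recency list
        std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;
    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        // Returns true on hit; on miss, inserts the block, evicting the LRU victim.
        bool access(uint64_t block) {
            auto it = pos_.find(block);
            if (it != pos_.end()) {                    // hit: refresh recency
                order_.splice(order_.begin(), order_, it->second);
                return true;
            }
            if (order_.size() == capacity_) {          // miss: evict LRU victim
                pos_.erase(order_.back());
                order_.pop_back();
            }
            order_.push_front(block);
            pos_[block] = order_.begin();
            return false;
        }
    };

    int main() {
        LruCache c(2);
        std::cout << c.access(1) << c.access(2) << c.access(1) << c.access(3)
                  << c.access(2) << "\n";  // prints 00100: block 2 was evicted by 3
    }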

5.2.2.5 Result Generation

This module is responsible for results generation. At present, results are generated in text file format; graphical reporting is under development. For every core two text files are produced: one keeping the traces and the other recording the hit/miss ratio and cache access time. These files may be processed to obtain the average cache access time for each core and for the overall system. The average access time for each trace may also be computed, but this takes longer depending on the number of cores and the number of programs being simulated. The detailed path of any trace may also be inspected to see at which cache levels it hit and missed.

5.2.3 Serial / Parallel Execution of ‘MCSMC’

At the start of a simulation, the cache architecture generator module initializes the cache system according to the given input parameter set. Next, programs are assigned to the cores by the program scheduler module. In a multi-core processor, each core has its own private L1 cache and a separate set of code to execute, so a separate trace generator for each core is embedded in the simulation code. Trace generation and searching the core's private L1 cache is a complete and independent task, and hence is coded as a single thread.

The 'MCSMC' is designed for parallel execution and is written using OpenMP. Each independent function of the simulator is written as a separate module so that the modules can be executed in parallel with one another. This approach makes the simulator more efficient and less time consuming. As the OpenMP implementation used supports a maximum of 64 parallel threads, up to 64 cores can be simulated in parallel. Mutual exclusion at shared cache levels is enforced by semaphores. Figures 5.2 and 5.3 show snapshots at different stages during a simulation.
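The per-core threading structure can be pictured with the following OpenMP sketch. It is illustrative only: the L1 lookup is a placeholder, and an OpenMP critical section stands in for the semaphores used in MCSMC at the shared levels:

    #include <omp.h>
    #include <cstdio>

    int main() {
        const int cores = 8;
        long shared_level_accesses = 0;   // stands in for a shared L2/L3 structure

        #pragma omp parallel num_threads(cores)
        {
            int core = omp_get_thread_num();
            long private_hits = 0;
            // Each thread generates and searches its own trace (one core = one thread).
            for (int i = 0; i < 1000; ++i) {
                bool hit_in_l1 = (i % 2 == 0);   // placeholder for a real L1 lookup
                if (hit_in_l1) { ++private_hits; continue; }
                // Shared cache levels must be accessed under mutual exclusion.
                #pragma omp critical(shared_cache)
                { ++shared_level_accesses; }
            }
            std::printf("core %d: %ld private L1 hits\n", core, private_hits);
        }
        std::printf("shared-level accesses: %ld\n", shared_level_accesses);
    }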


Figure-5.2: Snapshot 1, highlighting the details of trace searching at different cache levels.

In figure-5.2 the details of trace searching at different cache levels are highlighted for the LogN cache model, for 64 cores with a 6-level cache hierarchy (CL0 to CL5) and 64 programs being executed. Line 1 represents a trace '1348' generated for program 35. It is searched first in CL0; 'Hit 0' indicates the trace is not found among the 4096 traces of the same program present in CL0. It then enters CL1, where the next number is an index to the cache within the second level. Afterwards the trace is searched in CL2, CL3 and then CL4. In CL4, that is at the 5th cache level, the required trace '1348' is found. Similarly, the next trace '4718', generated for program 36, is not found at any cache level and will be fetched from main memory. The next trace '988' for program 37 is found in the CL0 cache and leaves the cache system.

Figure-5.3: Snapshot 2, highlighting the details of trace searching at different cache levels.

Figure-5.3 displays a summary of the traces generated and where they were found; it also indicates the number of hits out of the total visits. It is for 8 cores (0-7) executing 8 programs (0-7), with each program assigned to one core. The number of cache levels in this scheme is 3 (LogN model). The window can be divided into three sections. Moving from bottom to top, the first eight double lines represent the eight cores and their private L1 caches. The array length indicates the number of programs assigned to the core and the array index is the program index; 'P' is the program number, 'V' the number of visits and 'h' the number of hits. The last double line may be read as: program 7 is executing on core 7; 1000 traces were generated, of which 589 are hits. The second section, the next four double lines, shows the L2 caches: for 8 cores there are 4 L2 caches. The last double line of this section may be read as: this cache is used by programs 6 and 7; for program 6, L2 was visited 502 times with 391 hits, and for program 7 it was visited 491 times with 381 hits. The last section, consisting of the first three double lines, shows the status of the L3 caches. Here there are two caches, each maintaining 4 programs. The topmost double line reads: the L3 cache holds four programs, 0-3; for programs 0, 1, 2 and 3, L3 was visited 92, 121, 105 and 134 times respectively, all of which are hits.

5.2.4 Comparison with CACTI Cache Simulator

In order to validate our developed simulator MCSMC, it has been calibrated against CACTI [257, 258, 259]. CACTI is a standard analytical tool that takes a set of cache/memory parameters as input and calculates access time, power, cycle time and area. CACTI was originally developed by Dr. Jouppi and Dr. Wilton in 1993 and has since undergone five major revisions. CACTI is also used by Hennessy and Patterson in their seminal book "Computer Architecture: A Quantitative Approach" [40]. The features of CACTI include:

- Calculates power, delay and cycle time for a cache system
- Supports direct-mapped caches, set-associative caches, fully associative caches, embedded DRAM memories and commodity DRAM memories
- Supports multi-port uniform cache access (UCA) and multi-banked multi-port non-uniform cache access (NUCA)
- Calculates leakage power, taking into account the operating temperature of the cache
- Provides an interface to perform trade-off analysis involving power, delay, area and bandwidth
- Provides support for 90nm, 65nm, 45nm and 32nm technology nodes

CACTI has two input parameter modes: a normal mode and a detailed mode. The following table highlights the input parameters of both modes.


Table-5.2: Normal and detailed Input parameters for CACTI

Normal Input Mode:
- Cache Size (bytes)
- Line Size (bytes)
- Associativity
- No. of Banks
- Technology Node (nm)
- Read/Write Ports

Detailed Input Mode:
- Cache Size (bytes)
- Line Size (bytes)
- Associativity
- No. of Banks
- Technology Node (nm)
- Read Ports
- Write Ports
- Single Ended Read Ports
- No. of Bits Read Out
- No. of Bits per Tag
- Type of Cache (Normal | Serial | Fast)

In the calibration process, readings obtained from MCSMC for the 3-level cache system were compared with CACTI results in its normal input parameter mode. The following table 5.3 shows the cache access time comparison between the MCSMC cache simulator and CACTI for various cache and line sizes. The MCSMC simulator was found to agree with CACTI to within 90% to 95%.

Table-5.3: Comparison of cache access time between MCSMC cache simulator and CACTI for various cache line sizes

Cache Size       Access Time (nsec)
                 MCSMC     CACTI

Line Size 2 KB
4 MB             4.15      4.25
8 MB             4.75      4.70
12 MB            5.19      5.16

Line Size 4 KB
4 MB             7.69      7.66
8 MB             7.81      7.84
12 MB            8.10      8.04

Line Size 8 KB
4 MB             17.69     17.73
8 MB             14.75     14.80
12 MB            14.92     14.88


5.3 Performance Evaluation

To study the performance of the proposed LogN+1 and LogN cache models and the present 3-level cache system in a real-time environment, a number of simulations were run using the MCSMC simulator for each of the LogN, LogN+1 and 3-level cache systems with the same or varying input sets. Based on the results, the behaviour of the average cache access time for all three cache models is analysed in detail.

5.3.1 Simulation Environment

All the simulated results presented in this section are obtained for a system having 64 running processes, 1 GB of main memory operating at 1 GHz, 64 KB of L1 cache, cores operating at 4.0 GHz and LRU as the replacement policy. The simulation was run a number of times with the same parameters and the average was taken. These simulations were run extensively on both Microsoft Server 2003 and Server 2008 operating systems installed on an Intel Server 1500ALU with dual six-core hyper-threaded Intel Xeon 5670 processors and 24 GB of system memory.

5.3.2 Result Analysis

Table-5.4 shows the average access times for the LogN+1, LogN and 3-level cache systems with variation in the number of cores, and figure-5.4 shows the graphical representation of the average cache access times in table-5.4. It can be observed that our proposed cache system performs much better for higher numbers of cores than the present 3-level cache system. In fact, the average cache access time remains almost independent of the number of cores, unlike in the case of the 3-level cache, where it increases with the number of cores.

It is further observed that the results obtained from simulation vary even with the same input set. This is because a different trace is generated for each simulation run. In a real-time situation, locality behaviour also varies from program to program and has a direct impact on the hit/miss ratio and on the access time.


Table-5.4: Average access time for LogN+1, LogN and 3-level cache for different number of cores using MCSMC simulator

Simulation

Number      Cache Levels       Average Access Time (nsec)
of Cores    LogN+1   LogN      Binary Tree Based Multi-level      3-Level Cache System
            Model    Model     Cache System                       (Semi Shared L2 Cache)
                               LogN+1 Model   LogN Model
4           3        3         1.284          1.272               2.009
8           4        3         1.723          1.267               2.450
16          5        4         2.108          1.429               3.190
32          6        5         2.332          1.445               4.127
64          7        6         2.404          1.389               6.832
128         8        7         2.360          1.322               10.461
256         9        8         2.233          1.267               16.903
512         10       9         2.068          1.230               30.386
1024        11       10        1.936          1.213               58.410

[Figure: log-scale plot of average access time (nsec), 1 to 100, against number of cores (4 to 1024) for the LogN+1, LogN and 3-level cache systems.]

Figure-5.4: Comparison of average access time in the LogN and LogN+1 cache models with the 3-level cache system as the number of cores varies, using simulation.

To compare the performance of the present 3-level cache system and the proposed binary tree based multi-level cache system for multi-core processors (the LogN and LogN+1 models), three different analysis approaches are used: a mathematical probabilistic model, queuing network analysis and simulation. All three approaches produced results quite independently, and no contradiction is seen among these results.

The simulation results were found to be in conformity with the results obtained using the first two approaches, and they also confirmed that the proposed cache models work much better and have a much lower average access time than the 3-level cache system (Figure-5.4).

Figures 5.5 (a-h) show the graphical representation of the average cache access time of a core in the LogN, LogN+1 and 3-level cache systems using all three approaches. The average cache access time differs between approaches because of their respective analytical methodologies and input parameters. It may also be observed for all three cache models that the first, probabilistic approach shows the greatest average cache access time because of its worst-case analysis. Queuing analysis yields a lower average cache access time because of the use of the fair probability rule, whereas the simulator generates the least average cache access time because of the use of a specific trace generator which exploits the principle of locality. However, it can be clearly observed that for all three approaches our proposed models require less average cache access time than the present cache system, and this difference increases as the number of cores increases. Further, comparing the proposed LogN and LogN+1 models, the LogN model has a slightly lower average cache access time.

[Figure panels (a)-(h): bar charts of average access time (nsec) for the LogN+1, LogN and 3-level cache systems at 4, 8, 16, 32, 64, 128, 256 and 512 cores respectively, comparing the three analysis approaches.]

Figures 5.5 (a-h): Comparison between the average access time of LogN+1, LogN and the present 3-level cache system estimated for different numbers of cores using three different approaches: probabilistic model, queuing analysis and simulation. It is assumed that the system has 64 running processes, 1 GB of main memory operating at 1 GHz, 64 KB of L1 cache, and cores operating at 4.0 GHz.

5.4 Summary

In this chapter, MCSMC (Multi-level Cache Simulator for Multi-Cores), a parallel trace-driven multi-level cache simulator developed during this PhD research, was discussed. This simulator was developed to evaluate and analyse the proposed LogN+1 and LogN cache models in a real-time environment, as there is no suitable simulator available to simulate such a large number of cores and cache levels. It is developed using a modular approach and a set of eleven input parameters. The simulator has been calibrated against another standard cache simulator and tested for up to 2048 cores and 12 cache levels. The simulation results obtained were again found to be in conformity with the results obtained using the first two approaches, i.e., the mathematical probabilistic model and the queuing network analysis, and it was confirmed that the proposed cache models work much better and have a much lower average access time than the 3-level cache system.



CHAPTER 6

SPC3 PM: A Multithreaded Parallel Software Development Environment for Multi-Core Processors

Multi-core processors are becoming common, yet writing even a simple parallel structure is tedious with existing threading packages. Writing an efficient and scalable parallel program is much more complex, particularly when scalability is required, that is, the ability of a program to gain performance as the number of processor cores increases. In this chapter, to begin with, a survey of currently available related parallel tools is presented. Later, we discuss SPC3 PM, a new parallel multi-threaded programming model for multi-core processors developed during the course of our PhD project to maximize performance gain and scalability.

6.1 Currently Available Parallel Programming Tools

Many individual researchers, research groups and commercial software vendors such as Intel, Microsoft, Sun Microsystems and others are continuously trying to provide assistance with the multi-core challenges and multi-core programming. Many new and/or derived parallel programming languages are being devised, tuned parallel libraries are being used for common conventional tasks, and related frameworks are being proposed to address certain levels of parallelism in multi-core programming. A brief summary of such programming tools appears in the following paragraphs.

6.1.1 Commercially Available Multi-Core Application Development Aids

Companies like Intel, Microsoft, Sun Microsystems, NVIDIA and many others have developed a variety of parallel languages, frameworks, libraries and related tools to aid multi-core programming as discussed in the following subsections.


6.1.1.1 Intel's Multi-Core Application Development Aids.

Intel offers a wide variety of tools, libraries and parallel languages for exploiting parallelism and concurrency on multi-core architectures [76, 77, 106, 107]. These include parallel languages like Intel Ct [78, 79] and Cilk++ [80], libraries like ArBB [81] and TBB [82], and tools like Intel Parallel Studio 2011 [83].

Intel Ct: Intel Ct is a programming model developed by Intel to facilitate the exploitation of its future multi-core chips [78]. It is based on the exploitation of SIMD (Single Instruction, Multiple Data stream) behaviour to produce automatically parallelized programs. Intel and RapidMind have since combined their efforts to produce a successor of Intel Ct named Intel Array Building Blocks (ArBB), a sophisticated library for vector or data parallelism [81].

Intel Cilk++: Intel Cilk++ is an extension to C and C++ that offers a quick, easy and reliable way to improve the performance of programs on multi-core processors [80]. It provides a simple model for parallel programming, while runtime and template libraries offer a well-tuned environment for building parallel applications. With its keywords, C and C++ developers can move quickly into the parallel programming domain and utilize data parallelism. It is built on the Cilk technology developed at M.I.T., which is designed to provide a simple, well-structured model that makes development, verification and analysis easy. As it is an extension to C and C++, programmers typically do not need to restructure programs significantly in order to add parallelism, and it is being used in the development of many multi-core applications. From the literature study under reference, however, it appears that it cannot be used as a major development tool, as it provides limited parallelism and no support for functional parallelism, and a good parallel program always requires data or functional restructuring [84, 85, 86, 102, 103, 104].

Intel Array Building Blocks (Intel ArBB): It provides a generalized vector parallel programming solution that frees application developers from dependencies on particular low-level parallelism mechanisms or hardware architectures [81]. It comprises a standard C++ library interface and a runtime, and claims to produce scalable, portable and deterministic parallel implementations from a single high-level source description. It is suitable for GPU programming exploiting data parallelism [88, 90].

Intel TBB: Intel Threading Building Blocks (Intel TBB) offers a rich approach to expressing functional parallelism in a C++ program [82]. It is a library that helps the programmer take advantage of multi-core processor performance without having to be a threading expert. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for scalability and performance. The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages such as POSIX threads, Windows threads, or the portable Boost Threads, in which individual threads of execution are created, synchronized and terminated manually. A TBB program creates, synchronizes and destroys graphs of dependent tasks according to algorithms; tasks are then executed respecting the graph dependences. This approach places TBB in a family of solutions for parallel programming aiming to decouple the programming from the particulars of the underlying machine. The library is getting considerable attention from multi-core developers, but it lacks support for data parallelism, exploitation of concurrency, and processor or core affinity [86, 87, 88, 89].
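For illustration, a typical TBB fragment expresses a parallel loop as tasks over a range (standard TBB usage; header paths can vary between TBB versions):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 1000000;
        std::vector<double> v(n, 1.0);

        // The library splits the range into tasks and maps them onto worker threads.
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    v[i] *= 2.0;
            });

        std::printf("v[0] = %f\n", v[0]);
    }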

Intel Parallel Studio 2011: It provides Microsoft Visual Studio C/C++ developers a tool suite that includes an innovative threading assistant, optimizing compiler and libraries, memory and threading error checker, and threading performance profiler [83]. Intel Parallel Studio components include Intel Parallel

Advisor [94], Intel Parallel Composer [93], Intel Parallel Inspector [92], and Intel Parallel Amplifier [91].

Intel Parallel Advisor is used to simplify threading in code by identifying those areas in serial and parallel applications where parallelism would have the greatest impact [94]. Intel Parallel Composer provides an optimizing C/C++ compiler and performance libraries, and supports Intel Parallel Building Blocks [93]. Intel Parallel Amplifier fine-tunes Windows applications for optimal performance, ensuring cores are fully exploited and new processor capabilities are utilized [91], and Intel Parallel Inspector claims to boost reliability by delivering the easiest, fastest and most comprehensive method for Microsoft Visual Studio C/C++ developers to proactively analyse code and diagnose multi-threading errors [92].


6.1.1.2 Microsoft’s Multi-Core Application Development Aids

Microsoft is currently working on parallel computing and trying to integrate parallelism into its application development products [95]. Its main development platform for parallel programmers, Visual Studio 2010 [96], provides a new technology, the Concurrency Runtime [97, 98], which provides a common scheduling layer that gives applications better control over resource allocation. Microsoft is also looking at new languages, libraries and services for developers, like Axum; in addition it has developed PLinq, which adds parallelism to its Linq language-integrated query technology [99, 100]. The Microsoft Parallel Patterns Library (PPL) is another library, based on C++ templates, for multi-core programming that is similar to Intel TBB. Some high-level Intel TBB algorithms and containers have corresponding abstractions in PPL. The PPL uses the Concurrency Runtime (ConcRT) for task scheduling and load balancing [215].

Axum: Microsoft is also working on a new language for parallel programming named Axum, formerly known as "Maestro". This project aims to validate a safe and productive parallel programming model for the .NET framework. It is a language for web architecture and follows the principles of isolation, agents and message-passing to increase application safety, responsiveness and scalability. Other advanced concepts that this language explores are data-flow networks, asynchronous methods and type annotations for refining side-effects. It is yet to be made available to developers [99, 100].

6.1.1.3 Sun's Multi-Core Application Development Aids

Sun Microsystems is considering multi-core issues as potential improvements to the Java Virtual

Machine. The Java application layer has supporting functions built into the programming model for applications to take advantage of multiple cores and multiple processors. To address parallelism, Sun has taken a two-pronged approach: parallelizing the virtual machine and supporting applications with a concurrency model. This concurrency model is needed where applications do massively serial work, such as large data processing applications. The Java Platform included a concurrency framework that features

APIs to let developers process large amounts of data. The framework also lets developers break up a task into smaller tasks to be executed on different threads in parallel. With the planned Java Development Kit 7, Sun's implementation of Java SE 7, Sun is planning a new type of garbage collection for memory management that is more concurrent and parallel [107, 108, 177].

6.1.1.4 Other Commercial Multi-Core Application Development Aids

Other professional software developers are also making attempts to address the evolving multi-core issues. Many new parallel or concurrent languages, parallel libraries, compilers and tools are being launched, like Clojure, which provides capabilities for multithreaded JVM programming [109, 110], and Scala, which is interoperable with Java [111, 112]. Among vendors with multi-core-oriented tools are Cilk Arts [114], Coverity [115], Fortify [116], HMPP [117, 118], SureLogic [119], etc.; some of these are discussed in this section. The problem with all these tools is that they are either architecture-specific or application-specific.

HMPP: It is a hybrid compiler with powerful data-parallel code generators. The HMPP target generators let developers rapidly prototype and evaluate the performance of hardware-accelerated critical functions. The code generators are specifically designed to extract the most data parallelism from C and FORTRAN kernels and translate them into the programming model and language of the target, such as NVIDIA CUDA or SSE [117, 118].

IBM Cell: It allows the development of Cell Broadband Engine applications on x86, x86-64, 64-bit PowerPC (PPC64) and Cell BE-based Blade Center hosting platforms [120, 121]. It contains development tools including both PPU and SPU compilers, software libraries, a system simulator and a Linux kernel.

OpenCL: OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs and other processors. Its development included many industrial and institutional partners, including AMD, ARM, IBM, Intel, NVIDIA and Texas Instruments [122, 123]. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. It is a low-level API standard for heterogeneous computing that also runs on the CUDA architecture; using OpenCL, developers can write compute kernels in a C-like programming language to harness the massive parallel computing power of NVIDIA GPUs and create compelling computing applications. OpenCL provides parallel computing for GPUs using task-based and data-level parallelism.

Tilera API: Tilera provides multi-core processors that deliver high-performance computing for multi-core embedded applications such as networking and digital video processing. It has its own specific API for programming them [124].

Plurality: Plurality offers hardware and software solutions that simplify the task of migrating from serial processing to multi-core processing [125]. Unlike many other many-core or multi-core processors that are designed as application-specific processors, Plurality's Hyper-Core Architecture Line (HAL) processors are intended to be general-purpose accelerators. There are many applications for which HAL is suited without requiring any modification, like graphics acceleration, image processing, video surveillance, gaming, networking, security and communication.

Quick Threads: Quick Thread is a runtime library and programming paradigm for writing multithreaded applications in 32-bit and 64-bit environments using C++, Fortran and mixed-language programs [126]. It supports thread affinity, data binding affinity and NUMA architecture. The design goal of Quick Thread is to produce a minimal-overhead mechanism for distributing work in a multi-threaded environment [127, 128].

6.1.2 Other Standard Shared Memory Programming Approaches Used for Multi-core Processors

There are some standard shared and distributed memory programming models and libraries that were earlier proposed for SMPs and distributed environments. They are also currently being used for multi-cores. Some of them are described below.

6.1.2.1 Erlang: It is a programming language which has many features more commonly associated with an operating system than with a programming language, like concurrent processes, scheduling, memory management, distribution, networking, etc. [139, 140, 141]. The initial open-source Erlang release contains the implementation of Erlang and a large part of Ericsson's middleware for building distributed high-availability systems. Some attempts have been made to use it for multi-core processors, but they failed due to its process-level execution approach. As a result it has failed to get enough attention from multi-core programmers [142, 143, 161].

6.1.2.2 POSIX Threads (pthreads): It is a low-level Application Programming Interface (API) for writing multithreaded applications [144, 145]. Being equipped with a rich set of multi-threading development tools, it is used in shared memory parallel application development. It uses the FORK/JOIN model for thread creation and management; its basic thread operations involve thread creation, synchronization, termination, data management, process interaction and scheduling. Because of its low-level programming approach, it is used extensively in the development of multi-core related tools and applications. However, the use of pthreads requires detailed knowledge of hardware and software platforms and an in-depth understanding of threading mechanisms [146, 147, 148]. A minimal FORK/JOIN example is sketched below.
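The following sketch shows generic textbook pthreads usage (not tied to any particular tool above): the main thread forks four workers and then joins them:

    #include <pthread.h>
    #include <cstdio>

    // Each worker receives its own id and would process one slice of the work.
    void* worker(void* arg) {
        long id = reinterpret_cast<long>(arg);
        std::printf("thread %ld running\n", id);
        return nullptr;
    }

    int main() {
        const long n = 4;
        pthread_t threads[n];
        for (long i = 0; i < n; ++i)      // FORK: create the threads
            pthread_create(&threads[i], nullptr, worker,
                           reinterpret_cast<void*>(i));
        for (long i = 0; i < n; ++i)      // JOIN: wait for their completion
            pthread_join(threads[i], nullptr);
        return 0;
    }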

6.1.2.3 OpenMP: Along with Intel Threading Building Blocks, another abstraction for C++ programmers is OpenMP, the most successful parallel extension [167]. OpenMP is a language extension consisting of pragmas, routines and environment variables for FORTRAN and C programs. OpenMP is designed for data-level and loop-level parallelism, and it is being used extensively to exploit these levels of parallelism in multi-core processors [149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160].
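As an illustration of the pragma style, the loop below is divided among the available threads with a single directive (standard OpenMP usage):

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<double> a(n), b(n, 1.0);

        // Loop-level data parallelism: iterations are split across the threads.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * b[i];

        std::printf("a[0] = %f, max threads = %d\n", a[0], omp_get_max_threads());
    }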

6.1.3 Research Oriented Multi-Core Application Development Tools

Besides commercial software companies, a number of researchers from academia are also giving their input to address the evolving multi-core issues. Much related research has been put forward proposing new concepts, or modifications to existing ones, to make them suitable for multi-core programming. These efforts focus on parallel or concurrent languages, parallel libraries, compilers and tools. Unlike the commercial multi-core programming tools, they are more generic and focus on general-purpose programming. Some of them are discussed in this section.


SWARM: SWARM (SoftWare and Algorithms for Running on Multi-core) has been introduced as an open source parallel programming framework. It is a library of primitives that exploit multi-core processors. The SWARM programming framework is a descendant of the symmetric multiprocessor (SMP) node library component of SIMPLE [132]. SWARM is built on POSIX threads and allows the user to use either the already developed primitives or direct thread primitives. SWARM has constructs for parallelization, restricting control of threads, allocation and de-allocation of shared memory, and communication primitives for synchronization, replication and broadcasting [129, 130, 131].

Manticore: It is a runtime model. Manticore is a joint research project between researchers at the University of Chicago and the Toyota Technological Institute at Chicago. The project explores the design and implementation of programming languages that support heterogeneous parallelism on multi-cores [133, 134, 135, 136].

Unified Parallel C (UPC): UPC is a parallel extension of C Language. The UPC standard is maintained by the UPC Consortium, an open group of industry, government, and academic institutions. UPC programs are modeled as a collection of threads that share a single global address space that is logically partitioned among the threads. UPC supports data parallelism via shared arrays. Shared arrays are partitioned among threads such that each thread has part of the array allocated in its shared space. UPC supports a range of homogeneous multi-core architectures; threads can execute on different cores and different processors. On shared memory systems, threads communicate through memory. On distributed memory systems, threads communicate through runtime layers like MPI [168, 242].

ParMa (Parallel programming for Multi-core Architectures): It is an on-going ITEA 2 project that aims to develop tools for exploiting parallel computing in multi-core embedded systems [137].

Sequoia: Sequoia is a research project at Stanford University [169, 170, 242]. Sequoia explicitly targets the management of data in a processor's memory hierarchy, i.e., allocation and movement, to achieve high performance on Cell architectures. It is an extension of C but introduces programming constructs that result in a different programming model. Sequoia programs consist of computational functions called tasks. Tasks support data parallelism; they are written without any references to the specific Cell architecture.

SMOKE: It is a model framework, built to take advantage of all available cores, that maximizes the performance of the processor in a purely gaming environment [138].

In addition to the above, some more programming languages are being developed to aid multi-core programmers, like Co-Array Fortran and Titanium [171, 172, 242]. Co-Array FORTRAN and Titanium add language extensions to FORTRAN and Java, respectively. StreamIt was developed at MIT and targets applications that process streaming data, e.g., real-time audio or video [173]. Under the US Defense Advanced Research Projects Agency (DARPA), the High Productivity Computing Systems (HPCS) program is also working on many new high-performance computing languages to allow applications to scale to future parallel systems and to improve the productivity of future programmers. The HPCS program has produced Chapel, Fortress and X10, developed by Cray, Sun and IBM, respectively [174, 175, 176].

6.1.4 Current Multi-Core Research Groups

There are a number of active research groups worldwide working to resolve multi-core challenges, and all of them are trying to come up with solutions for the current multi-core challenges. Some of the research groups with the greatest contribution and impact are discussed below.

MCA (Multi-core Association)

This is an open membership organization that includes companies and programmers implementing multi-core technology [162]. It is involved in the development of a standard set of application programming interfaces (APIs) that support multi-core communications, resource management, task management and debug facilities. These APIs aim to provide a foundation for a multitude of services and functions including load balancing, power management, reliability and quality of service. One of these APIs is the Multi-Core API (MCAPI), a message-passing API being developed to define the basic elements of communication and synchronization required for closely distributed embedded systems. This and related APIs are under development and promise to aid multi-core based application development in future [163, 164, 165, 166].

UPCRC Illinois (Universal Parallel Computing Research Center, Illinois)

It is a joint research group of the computer science department of the University of Illinois, USA, and the Coordinated Science Laboratory, funded by Microsoft and Intel. The group is focusing on the exploration of new applications, related programming models and architectural improvements for multi-core processors. On the software side, the group is focusing on the development of new client applications that require high performance; these applications can utilize high levels of parallelism and make use of software technologies developed to facilitate the exploitation of parallelism in multi-core processors. On the hardware side, they are trying to develop hardware technologies to facilitate the continued scaling of multi-core systems to hundreds and thousands of cores [216].

HiPEAC (High Performance and Embedded Architecture and Compilation)

It is a European network on high performance and embedded architecture and compilation: a cluster of researchers from industry and academia interested in high performance computing. The network has thirteen research clusters/task forces, of which two, namely 'Multi-core architecture' and 'Programming models and operating systems', focus purely on existing multi-core challenges. These clusters are exploring hardware and software challenges like multi-core processor architecture, memory hierarchy, technology and application impact on architecture, and architectural support for parallel programming models and operating systems [217].

UPMARC (Uppsala Programming for Multi-core Architectures Research Center)

It is a research program of Uppsala University, Sweden. This group focuses on the development of new tools and approaches to make parallel programming easier, and on demonstrating their effectiveness through prototype implementations on real problems. Major areas of this research group include parallel applications and their performance, efficiency, predictability, ease of programming and correctness [218].


6.1.5 Summary

In short, many attempts are being made, either using existing development tools or by proposing modifications to them, to fully exploit parallelism in multi-core architectures. But so far all the approaches are either architecture-specific or deal only with a certain level of parallelism, that is, functional, data or loop-level parallelism. The majority of the commercially available multi-core development tools are specifically designed for embedded systems and GPUs. On the other hand, some researchers have focused on exploiting parallelism in general-purpose multi-core processors, but all such developments provide a limited set of functionality and focus only on certain levels of parallelism. Some major advancement has been made by Microsoft and Intel by adding multi-core programming and debugging support to their existing development tools. OpenMP, an open standard, and TBB from Intel are the most extensively used tools for developing parallel applications for general-purpose multi-core processors. They provide different methodologies for parallel application development: OpenMP provides data-level and loop-level parallelism using threads, whereas TBB focuses on exploiting functional-level thread-oriented parallelism using tasks. POSIX threads are also used to develop multi-threaded applications for multi-core processors to exploit data-level parallelism, but they require highly skilled low-level programming.

Multi-core support can be fully utilized only by a programming language having support for all types of parallelism at the thread level and having the capability of parallel and concurrent execution. Unfortunately, none of the existing programming techniques being used to develop multi-core applications has all of these features. SPC3 PM is developed to provide all of these supports in a single parallel programming model. The following table-6.1 compares the features of SPC3 PM with two other major multi-core programming approaches.

Table-6.1: Comparison of Parallel language features (OpenMP, TBB, and SPC3 PM)

    Language    Functional    Data          Loop          Core          Serial    Parallel    Concurrent
                parallelism   parallelism   parallelism   interaction
    OpenMP      X             Y             Y             X             Y         Y           X
    TBB         Y             X             Y             X             Y         Y           X
    SPC3 PM     Y             Y             Y             Y             Y         Y           Y

    (Y = supported, X = not supported)

6.2 Key Features of SPC3 PM

SPC3 PM (Serial, Parallel, Concurrent Core to Core Programming Model) is a serial-like, task-oriented, multi-threaded parallel programming model for multi-core processors that enables developers to easily write new parallel code or convert existing code written for a single processor. The programmer can scale it for use with a specified number of cores and ensure efficient task load balancing among the cores.

The development of SPC3 PM is motivated by the understanding that existing general-purpose languages do not provide adequate support for parallel programming, while existing parallel languages are largely targeted at scientific applications and do not provide adequate support for general purpose multi-core programming. SPC3 PM is developed to equip a common programmer with a multi-core programming tool for scientific and general purpose computing.

The SPC3 PM provides a set of rules for algorithm decomposition and a library of primitives that exploit parallelism and concurrency on multi-core processors. It helps to create applications that reap the benefits of processors having multiple cores as they become available.

It provides thread parallelism without requiring the programmer to have detailed knowledge of platform details and threading mechanisms for performance and scalability. To use the library, a programmer specifies tasks instead of threads, and the SPC3 PM library maps those tasks onto threads and threads onto the available cores in an efficient manner. As a result, the programmer is able to specify parallelism and concurrency far more conveniently, and with better results, than using raw threads.

The programming model, SPC3 PM, helps the programmer to control multi-core processor performance without being a threading expert. The ability to use SPC3 PM on virtually any processor or any operating system with any C++ compiler makes it very flexible.

It has many other unique features that distinguish it from all other existing parallel programming models. It supports both data and functional parallel programming. Additionally, it supports nested parallelism, so one can easily build larger parallel components from smaller parallel components. A program written with SPC3 PM may be executed in serial, parallel and concurrent fashion. Besides, it also provides processor core interaction to the programmer: using this feature, a programmer may assign any task or a number of tasks to any core or set of cores.

The key features of SPC3 PM are summarized below.

 The SPC3 PM is a new shared memory programming model developed for multi-core processors.

 It provides a set of rules for algorithm decomposition and a library.

 The SPC3 PM works in two steps: it defines the tasks in an application algorithm and then arranges these tasks on cores for execution in a specified fashion.

 It provides task based parallelism.

 It exploits thread-level parallel processing.

 It supports all three programming execution approaches, namely Serial, Parallel and Concurrent.

 It provides direct access to a core or cores for maximum utilization of the system.

 It supports all major types of parallelism, such as loop, data and functional parallelism.

 It supports major decomposition techniques, such as data, functional and recursive decomposition.

 It allows easy programming as it follows the C/C++ structure.

 It can be used with other shared memory programming models like OpenMP, TBB, etc.

 It is scalable and portable.

 It supports both 32 and 64 bit programming environments.

 It is designed for an object oriented approach.

 It provides data sharing using C structures.

6.3 Design Concepts

Multi-core devices are evolving in both architecture and core count. This trend encourages software developers to decouple code from hardware, enabling applications to move between different architectures and to scale automatically with the available core count.

An appropriate programming model can enable this decoupling and can provide scalability and enhanced performance for multi-core processors [178, 179, 200, 201, 202]. Serial programming models cannot make the most of multi-core processors, as they execute the program using a single process / thread model and do not support either implicit or explicit parallelism. The existing parallel programming models are also unable to fully utilize the available potential of multi-cores because of many design issues. These issues include the single operating system, core interconnection, nature of parallelism, level and type of parallelism, task and thread design, task and thread scheduling, algorithm decomposition techniques, programming patterns, data sharing, etc.

This section covers some major design issues with multi-core programming and then explores the design features of SPC3 PM that make it suitable for multi-core processors. Figure-6.2 shows the design concept of SPC3 PM.

6.3.1 Design Issues with Multi-Core Programming

In moving to the multi-core era, the computer industry has effectively given up on running conventional programs faster, and is focusing on running programs that efficiently utilize the available cores. This makes the current revolution in processor design very different from its predecessors. Previous revolutions were mostly invisible to programmers; the current revolution will eventually require that programs be rewritten in some concurrent and parallel programming language [180, 181, 187, 191].

The parallel and concurrent execution of many high-end scientific and commercial programs written by expert programmers is showing enough success, but a lot remains to be done to make this possible for conventional programmers in multiple problem domains as well [192, 193, 200].

Only parallelized applications can exploit the additional performance offered by a multi-core processor. In fact, since the individual cores on a CMP are often slower than the large single-core processors of the past, non-parallelized applications may in fact be slower on multi-core processors. Also, since the number of cores will grow exponentially over time (under the new interpretation of Moore's Law), any application, in order to grow in performance, must be written to use any number of cores in a scalable fashion. In other words, multi-core processing is only intensifying an already challenging problem [179, 180, 184, 186].

Most software today is grossly inefficient, because it is not written with sufficient parallelism in mind. Breaking up an application into a few tasks is not a long-term solution. So, in order to make the most of multi-core processors, either a lot of parallelism is needed for efficient execution of a program on a larger number of cores, or concurrent execution of multiple programs on multiple cores should be implemented [183, 199].

Whether a program executes at process or thread level on multi-core processors also plays a vital role in performance. Optimal application performance on multi-core architectures can be achieved by effectively using threads to partition software workloads [209]. Multi-threaded applications running on multi-core platforms have different design considerations than multi-threaded applications running on single-core platforms. On single-core platforms, assumptions may be made by the developer to simplify writing and debugging a multi-threaded application; these assumptions may not be valid on multi-core platforms. The programmers have to look into each and every platform and threading detail [188, 189, 190].

Another issue with multi-core programming is the level of parallelism selected. At the lowest level, SIMD instructions can fetch a set of operands and perform an operation on them in a single CPU cycle. This level is also known as instruction-oriented data-level parallelism [203, 204, 205]. At the next level, multi-core architectures often include simultaneous multi-threading (SMT), which allows multiple threads to execute different instructions in the same CPU cycle. This level reflects thread-oriented functional-level or data-level parallelism. At the highest levels of parallelism, multi-core architectures can use multiple cores for the execution of multiple applications, processes or threads. This is known as application-, process- or thread-oriented functional-level parallelism, or simply application level, process level and thread level parallelism respectively. Programs, which now must map onto this hardware, often exhibit parallelism at different granularities, roughly categorized as either fine- or coarse-grained. Pure data level parallelism is being successfully implemented on GPU and Cell architectures, whereas for general purpose multi-core processors a programming language should provide both data and functional levels of parallelism for maximum utilization of the processor [195, 196].

Selection of the programming model and decomposition technique also has a great impact on program execution. The data-parallel model focuses on performing operations on a data set which is usually regularly structured in an array [191, 192, 195]. A set of tasks operates on the data independently on separate partitions. Data parallelism is usually classified as SIMD/SPMD, whereas the task-parallel model focuses on parallel execution of independent processes or threads. These processes or threads behave distinctly, which emphasizes the need for inter-process communication. Task parallelism is a natural way to express message-passing communication. It is usually classified as MIMD/MPMD or MISD [195, 196].

In brief, efficient performance on modern multi-core processors requires an aggressive approach to parallelism. There are many performance mechanisms in modern processors, including but not limited to multiple cores, that depend on the level of parallelism, programming model, problem decomposition, and execution model. In short, general multi-core processors can be fully utilized only by a programming language having support for task-oriented thread level parallelism, assisting both data and task parallel models, and having the capability of parallel and concurrent execution with core affinity. None of the existing programming techniques being used to develop multi-core applications has all these features. The SPC3 PM is developed to provide all these supports in a single parallel programming model.

The programming model presented in this thesis takes care of all the issues just discussed. It allows concurrency and parallelism to be expressed in a C++ program for multi-core processors. It is built using the Win32 and Win64 thread APIs, which allows the user to use either the already developed primitives or direct thread primitives. It helps the programmer to control multi-core processor performance without having to be a threading expert. The ability to use SPC3 PM on any processor with any operating system using any C++ compiler makes it very flexible. It uses functions for common parallel and concurrent iteration patterns, enabling programmers to gain increased performance and throughput from multiple processor cores. Programs written using SPC3 PM can run on systems with a single processor core, as well as on systems with multiple processor cores. The following sub-sections highlight the basic design features of the SPC3 programming model.

6.3.2 Task Based Parallelism

The SPC3 PM exploits task based thread-level parallelism. A program in SPC3 PM may be written simply in C/C++ in a function style. The programmer has to identify and define each independent piece of code as a TASK according to the SPC3 PM 'Task Decomposition Rules' discussed in sections 6.4.1 and 6.4.2.

In the SPC3 PM, each task is treated as a running thread. On compiling a program written in SPC3 PM, all the identified tasks are encapsulated as individual threads. These threads are then scheduled and executed accordingly. This mechanism hides the details of thread / process creation and management from the programmer. The whole job of encapsulating individual tasks into threads is performed by the low level routines of SPC3 PM written using the Win32/64 thread API. These tasks can be executed in serial, parallel or concurrent fashion on any specified core or pool of cores using the SPC3 PM library functions. Any task that has dependencies on other tasks, or does not exploit parallelism, may be executed serially. Any task exploiting data or loop parallelism may be executed in parallel on all available cores or on any specified pool of cores. Independent serial tasks of similar or different programs may be executed concurrently on different cores to improve system throughput and minimize execution time. Multiple instances of a task with different data sets may also be executed using the concurrent function, effectively executing that single task in parallel.
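As a rough illustration of this encapsulation (an assumption based on the description above, not the actual SPC3 PM source; the helper name RunTaskSerially is hypothetical), a task coded as a plain function can be wrapped in a Win32 thread like this:

    #include <windows.h>

    // A task coded as an ordinary thread entry point.
    DWORD WINAPI Task1(LPVOID param)
    {
        // ... independent computation of Task1 on the data behind 'param' ...
        return 0;
    }

    // Encapsulate one task in one thread and run it to completion,
    // as a serial-mode execution would.
    void RunTaskSerially(LPVOID taskData)
    {
        DWORD tid;
        HANDLE h = CreateThread(NULL, 0, Task1, taskData, 0, &tid);
        WaitForSingleObject(h, INFINITE);   // wait for the task to finish
        CloseHandle(h);
    }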

6.3.3 Thread Level Parallelism

The SPC3 PM exploits thread-level parallelism. The thread level is the best choice for exploiting parallelism on multi-core processors because of their architecture and the sharing of memory down to the lowest level, that is, a core's private cache. Many other shared memory parallel languages used for multi-cores, like OpenMP and TBB, also exploit thread-level parallelism. Process level parallelism is not a good choice for multi-core processors, as the processes of an application do not share resources and communication among processes is costlier than communication among threads. Cores in CMPs share the memory system, and data may easily be shared at cache level using threads. The thread level programming approach decreases the inter-communication cost and the program execution time.


The SPC3 PM uses a process / thread model for the execution of a program. A program written in SPC3 PM is executed as a process and all its tasks are executed as spawned threads. In functional parallelism, one associated thread is created for each task. In loop and data parallelism, each task may have a pool of threads. The number of threads in a pool may be defined either by the programmer or by the operating system; the default number of threads in a pool is equal to the number of available cores. The independent spawned threads (tasks) are then scheduled on the different available cores according to the program structure to execute the application in parallel. Thread spawning and the distribution of a task's computations into a pool of threads are done using Win32/64 API functions and some directives of OpenMP. Scheduling of a thread or pool of threads is discussed in sections 6.3.5 and 6.4.4. Figure-6.1 shows the thread orientation in SPC3 PM.

[Figure: three execution modes of a program written using SPC3 PM are shown. Serial: a single task spawns a single thread executing on a single core. Parallel: a single task spawns multiple threads (a pool of threads) executing on the available cores (loop and data parallelism). Concurrent: multiple tasks spawn multiple threads executing on the available cores (functional parallelism).]

Figure-6.1: Threads Orientation in SPC3 PM

6.3.4 Decomposition Techniques

The SPC3 PM supports all major decomposition techniques, such as functional, data and recursive decomposition. These techniques are commonly used to decompose a wide variety of problems. The speculative and exploratory decomposition techniques, which are usually applied only to scientific classes of problem, are also supported by the SPC3 PM. The following table-6.2 shows how different decomposition techniques may be applied to a program using SPC3 PM.

In functional decomposition, a given problem is divided in terms of operations instead of data. Independent operations that can be executed in parallel are separated and coded as independent tasks. These tasks are then executed in parallel. In SPC3 PM, this type of parallelism can be exploited using the Concurrent library function: all the independent operations may be coded as independent tasks and executed in parallel using the Concurrent function. Further study of the Concurrent function exploiting the functional level of parallelism is given in section 6.4.4.3.

Table-6.2: Different Decomposition Techniques using SPC3 PM on 'N-core' machines

In each case the application defines its tasks (TASK1, TASK2, .... TASKN) and then invokes the library:

    Loop level data decomposition:       parallel (TASK1, n)

    Higher level data decomposition
      and recursive decomposition:       concurrent (TASK1(Dataset1), TASK1(Dataset2), .... TASK1(DatasetN))

    Functional decomposition:            concurrent (TASK1, TASK2, .... TASKN)

Data decomposition is a powerful and commonly used method for deriving concurrency in algorithms that operate on large data structures. In data decomposition, computation is done in two steps: in the first step, the data on which the computations are performed is partitioned, and in the second step this data partitioning is used to induce a partitioning of the computation into tasks. As the operations that these tasks perform on different data partitions are usually similar, they can easily be programmed using either the Concurrent or the Parallel library function of SPC3 PM. For coarse-grained data decomposition, the Concurrent function with the same task operating on different data sets may be used, as sketched below. Fine-grained, that is loop-level, data decomposition may easily be implemented using the Parallel function of the SPC3 PM library. Execution of the Parallel and Concurrent functions exploiting the data decomposition technique is discussed in sections 6.4 and 6.5.
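For example, following the calling patterns of table-6.2 (the task and data set names are those used in the table), the two granularities could be expressed as:

    // Coarse-grained: the same task run concurrently on different data sets
    concurrent (TASK1(Dataset1), TASK1(Dataset2), .... TASK1(DatasetN))

    // Fine-grained (loop-level): one task computed by a pool of n threads
    parallel (TASK1, n)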

Recursive decomposition is the technique in which concurrency is induced in a given problem using a divide and conquer strategy. A given problem is solved by first dividing it into a set of independent tasks or sub-problems; each of these tasks is solved by recursively applying a similar division into smaller tasks, followed by the combination of their results. All these tasks can easily be executed in parallel on different cores with different data sets using the Concurrent function of the SPC3 PM library. The following table 6.3 shows different pseudo-codes in C and SPC3 PM for a program finding the smallest number in an array 'A' of length 'n' using recursive decomposition.

Table-6.3: Pseudo-codes in C and SPC3 PM for a program finding the smallest number in an array 'A' of length 'n' using recursive decomposition.

The table compares three versions: a plain serial code (SERIAL_MIN), a serial code using recursive decomposition (RECURSIVE_MIN), and the recursive decomposition implemented using the Concurrent function of SPC3 PM. The plain serial version scans the array directly:

    TASK1 (A, n)    // SERIAL_MIN (A, n)
    {
        min = A[0];
        for i = 1 to n-1 do
            if (A[i] < min) then min = A[i];
        return min;
    }
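Since the remaining code columns of table-6.3 did not survive extraction intact, the following is a minimal C++ sketch of the recursive variant, reconstructed from the standard divide and conquer formulation rather than from the thesis listing. The split into two independent halves is what the Concurrent function could execute as two tasks (e.g., TASK1 applied to two data sets) on different cores.

    // Recursive minimum by divide and conquer (illustrative reconstruction).
    int RECURSIVE_MIN(const int *A, int n)
    {
        if (n == 1)
            return A[0];                                 // base case
        int lmin = RECURSIVE_MIN(A, n / 2);              // left half:  one task
        int rmin = RECURSIVE_MIN(A + n / 2, n - n / 2);  // right half: another task
        return (lmin < rmin) ? lmin : rmin;              // combine partial results
    }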

6.3.5 Task Scheduling

In SPC3 PM, a task (from the programmer's perspective), or a thread (from the operating system's perspective), is the smallest schedulable unit. The SPC3 PM provides the special feature of scheduling a task or tasks on a particular core or pool of cores. The programmer has two options: either schedule a task on a particular core or pool of cores himself, or leave the scheduling to the operating system. In SPC3 PM, all the threads in execution can be bound to a core or set of cores. This is done using the concept of process, thread and core affinity. When a user specifies a task or tasks to be scheduled on a core or set of cores, the SPC3 PM library forces the operating system to schedule the task or tasks accordingly. In this case the default thread scheduler of the operating system is deactivated. If the user skips this option, the operating system uses its own default thread scheduling policies to schedule the SPC3 PM tasks (threads). In SPC3 PM, process and thread affinity is handled using the low level threading packages available in Microsoft Visual Studio, and core affinity is introduced using self-written assembly routines. Usage of this feature and its results are discussed in detail in sections 6.4 and 6.5 respectively.
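The thesis realizes core affinity with self-written assembly routines; as a rough Windows illustration of the same mechanism (the wrapper name BindTaskThreadToCore is hypothetical, while the API call is the public Win32 one), a task's thread can be pinned to one logical core as follows:

    #include <windows.h>

    // Pin a task's thread to one logical core, overriding the default
    // operating system scheduling, as 'specified core assignment' requires.
    BOOL BindTaskThreadToCore(HANDLE taskThread, int core)
    {
        DWORD_PTR mask = (DWORD_PTR)1 << core;   // one bit per logical core
        return SetThreadAffinityMask(taskThread, mask) != 0;
    }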

6.3.6 Execution Modes

The SPC3 PM provides all three execution modes, that is, serial, parallel and concurrent. A task may be executed in serial or parallel fashion using the Serial or Parallel function of the SPC3 PM library. A given problem can only be parallelized up to an extent, and it is therefore very difficult to utilize all the cores by dividing a single problem among them. The other way to achieve maximum utilization of multi-core processors is concurrent execution: the same task with different data sets, different tasks of a program, or different programs may be executed in parallel on different cores using the Concurrent function of SPC3 PM. Any task executed through the Serial function is forced to run serially on a single core. When a task is executed by the Parallel function, SPC3 PM divides its data among spawned worker threads and executes them in parallel on different cores. In the Concurrent function, SPC3 PM spawns a number of concurrent threads equal to the number of tasks passed to the function; these threads are then scheduled on the cores for execution. Detailed functionality of the execution modes is discussed in section 6.4.

6.3.7 Types of Problems Supported

The SPC3 PM can be used for all types of problems: linear, recursive, regular or irregular. For a linear or regular problem, where the flow of the program can be pre-determined, the SPC3 PM can be used effectively with all its functions. For recursive problems, where a task or tasks repeat themselves continuously, the Concurrent function of SPC3 PM with different data sets may be used. The SPC3 PM also works effectively for irregular problems that involve pointers and have communication structures that depend on the data. Parallel algorithms for these problems tend to be quite different from the serial algorithms and are often more complicated, requiring larger overheads; such algorithms include trees, graphs, and most sorting or merging algorithms. The SPC3 PM manages these types of problems by introducing serial, parallel and concurrent execution within the program at different stages.

6.3.8 Data Sharing

Parameter passing among parallel processes or threads has always been a challenge in shared memory programming models. The SPC3 PM handles private and shared data variables in terms of C structures. The parameters to be passed to a particular task are defined in a structure associated with that task, and the pointer to that structure is passed as the task's parameter. Both shared and private structures may be initialized and passed in this way. The concept of using a C structure as the parameter for data sharing makes parallel programming more convenient, as the C structure is well known to any C/C++ programmer. Using such a structure, any number of variables can be passed and returned; a structure also supports all types of data variables and their easy management.
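A minimal sketch of this convention (the structure layout and names here are illustrative; only the pattern of passing a single structure pointer in and out is taken from the text):

    #include <windows.h>

    // Parameter structure associated with one task.
    typedef struct TaskRData {
        int     n;       // shared, read-only problem size
        double *input;   // shared input array
        double  result;  // value returned by the task via the structure
    } TASKRDATA, *PTASKRDATA;

    // A value-returning task (suffix 'R' per the naming rules of section
    // 6.4.1): it takes only the structure pointer and writes its result
    // back into the structure.
    DWORD WINAPI TaskR1(LPVOID lp)
    {
        PTASKRDATA p = (PTASKRDATA)lp;
        double sum = 0.0;
        for (int i = 0; i < p->n; i++)
            sum += p->input[i];
        p->result = sum;
        return 0;
    }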

6.3.9 Compilation

The SPC3 PM offers an efficient approach to express concurrency and parallelism in a C++ program for multi-core processors. It supports both the 32 and 64 bit parallel programming environments. The SPC3 PM follows the C/C++ program structure; its tasks are like the functions of C/C++. The SPC3 PM library functions can be called from anywhere in a C/C++ program, as discussed in section 6.5. A program written using SPC3 PM can be compiled with any C++ compiler for any operating system.

[Figure: the layered design of the SPC3 PM parallel multi-threaded programming model for multi-core architecture. User Layer: the problem domain (linear, recursive, regular or irregular). Programming Layer: functional, data, recursive and speculative decomposition following the SPC3 PM rules for task decomposition; the SPC3 PM library alongside shared variables, the Win32 API, POSIX threads, OpenMP, TBB, etc., compiled with a C/C++ compiler or the respective compilers. System Layer: our customized thread scheduler, or the operating system's support for shared memory threading / scheduling. Hardware layer at the bottom.]

Figure-6.2: The design concept of SPC3 PM

6.4 Programming with SPC3 PM

The SPC3 PM provides higher-level, shared memory, task-based thread parallelism without requiring knowledge of platform details and threading mechanisms. Programming with SPC3 PM is based on two steps: first decomposing the application into tasks in accordance with the SPC3 PM task decomposition rules, discussed in section 6.4.1, and then writing the code using the SPC3 PM library, discussed in section 6.4.4. The library can be used in a simple C/C++ program. In SPC3 PM, the user specifies tasks, not threads, and the SPC3 PM library encapsulates these tasks onto threads. The result is that SPC3 PM allows parallelism and concurrency to be specified far more conveniently, and gives better results, than using raw threads.

The steps involved in the development of an application using SPC3 PM are described below.

1) The user determines that his application can be programmed to take advantage of multi-core processors.

2) The problem is decomposed by the user following the SPC3 PM 'Task Decomposition Rules' (see section 6.4.1).

3) Each task is coded in C/C++ as an independent unit that can be executed independently and simultaneously by each core.

4) The main program is coded using the SPC3 PM library, allowing the user to run the program in serial, parallel or concurrent mode.

5) The code is compiled using any standard C/C++ compiler.

6) The program is executed on a multi-core processor.

6.4.1 Rules for Task Decomposition

The user can decompose the application / problem on the basis of the following rules.

i. The user should break the problem down into various parts to determine whether they can exploit functional, data or recursive decomposition.

ii. Identify the loops for loop parallelism; these may be defined as Tasks.

iii. Identify independent operations that can be executed in parallel; these may be coded as independent Tasks.

iv. Identify the large data sets on which a single set of computations has to be performed. Target these large data sets as Tasks.

v. Tasks should be named Task1, Task2, ... TaskN. If a Task returns a value, it should be named with the suffix 'R', like TaskR1, TaskR2, ... TaskRN.

vi. There is no limit on the number of Tasks.

vii. Each Task should be coded using C/C++/VC++/C# as an independent function.

viii. A Task may or may not return a value. A Task should only take and return a structure pointer as a parameter. Initialize all the shared or private parameters in the structure specific to a Task. This structure may be shared or private.

ix. Arrange the tasks using the SPC3 PM library in the main program according to the program flow.

6.4.2 Properties of a Task

The SPC3 PM treats a task as a running thread. Such a task can be identified by observing the following characteristics.

Independent Tasks

A function or a portion of code which can run independently may be a candidate for a task. Running independently means the task should use separate resources from other tasks, its execution should not depend on the results of other tasks, and no other task should depend on its results. This is important to maximize concurrency and minimize the need for synchronization.

Runnable as a Process or Thread

A task should have a logical program structure and computation. It should be that part of a program which can be encapsulated effectively as a process or thread.


Asynchronous Processing

A task should consist of asynchronous elements of a program. Asynchronous elements are capable of being executed independently and scheduled by the operating system.

Inter-Task Communication

There should be minimal communication between the tasks. Coarse grain communication is preferable to fine grain in order to minimize communication between the tasks.

Response to Asynchronous Events

A task should be capable of handling events that occur at random intervals, such as network communications or interrupts from hardware and the operating system. Tasks can be used to encapsulate and synchronize the servicing of these events apart from the rest of the application.

Code Size and Memory Requirements

In a good program decomposition, all the candidate tasks in a program are nearly equal in terms of code size, memory requirements and computation.

CPU Intensive

A task should perform long computations. Time-consuming calculations that are independent of activities elsewhere in the program make a good task.


6.4.3 Program Structure

Each task (Task1 ... TaskN) has an associated structure holding its private or global parameters, and a task body performing the computation; the main program then arranges the tasks using the library functions:

    // For each task, define its parameter structure
    struct STRUCT_NAME
    {
        // private or global parameters
        // associated with the specified task
    };
    STRUCT_NAME *P_TASK1;     // ... *P_TASK2, *P_TASK3, ... *P_TASKN

    Task1(LPVOID)             // ... Task2, Task3, ... TaskN
    {
        // performing some computation
    }

    void main(void)
    {
        // any declaration;
        // any piece of code;

        Serial (Task1, P_TASK1);      // execution of Task1 in serial

        // any other code;

        Parallel (Task2, P_TASK2);    // execution of Task2 in parallel

        // any other code;

        Concurrent (Task3, P_TASK3, Task4, P_TASK4);  // execution of Task3 and Task4 concurrently
    }


6.4.4 The SPC3 PM Library

This library provides basic functions for writing parallel and concurrent programs using C/C++ structures. For programmers, it offers basic and simple functions to write parallel applications for multi-core processors. Internally, the library looks after all the complex mechanisms behind these functions: encapsulation of tasks into threads; thread management including creation, starting, suspending and termination; thread spawning; thread inter-communication; static or dynamic thread scheduling on cores; data sharing; and handling the serial, parallel and concurrent execution modes. The library is written using the Win32/64 thread API and assembly routines in the Visual C++ environment. It also includes a few routines of OpenMP to make use of some of its distinctive features, and provides three fundamental functions.

 Serial Function

 Parallel Function

 Concurrent Function

6.4.4.1 Serial Function: This function is used to specify a task that should be executed serially. When a task is executed using this function, SPC3 PM encapsulates the task into a thread, and this thread is executed serially to compute the associated task. The thread can be scheduled on any available core, either by the internal scheduler of the operating system or as specified by the programmer. This function has three variants.

Serial (Task i) The specified task 'Task i' is executed in a serial fashion and is scheduled on the available core / cores by the operating system. Figure-6.3 represents the auto allocation of a thread of a serial task using the Serial function of the SPC3 PM library.


[Figure: with Serial (Task i), TASK i executes on one core assigned by the OS; the remaining cores are idle or executing other tasks.]

Figure-6.3: Auto allocation of a thread of a serial task using serial execution function.

Serial (Task i, core) This is the second variant of the Serial function. It provides extra functionality related to core scheduling: the given task may be bound to a particular core for its execution. After a comprehensive comparison of the results obtained using these two variants of the Serial function, it is observed that this version has better performance; this is discussed further in section 6.5.2. Figure-6.4 represents the allocation of a thread of a serial task on the specified core using the Serial function of the SPC3 PM library; a thread of the serial task is scheduled on core 1.

[Figure: with Serial (Task i, 1), TASK i executes on core 1; cores 0, 2 and 3 are idle or executing other tasks assigned by the OS.]

Figure-6.4: Allocation of a thread on a specified core of a serial task using serial execution function


*p Serial (Task i, core, *p) This third variant of the Serial function has the capability to take and return parameters. The function takes a structure pointer as a parameter: all the parameters to be passed to a task are defined in a C structure whose pointer is passed to the Serial function. Similarly, all return values should be specified in a return structure.
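Putting the three variants together, an illustrative calling sequence (the task and structure names follow the pattern of section 6.4.3; the assignment of the returned pointer is illustrative):

    Serial (Task1);                  // variant 1: the OS picks the core
    Serial (Task1, 2);               // variant 2: bind the task to core 2
    p = Serial (Task1, 2, P_TASK1);  // variant 3: parameters passed and
                                     // returned via the structure pointer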

6.4.4.2 Parallel Function: This function is used to specify a task that should be executed in parallel. When a task is executed in parallel mode, a pool of threads is created to execute the associated task in parallel, with an option to distribute the work of the task among the threads in the team. These threads are scheduled on the available cores either by the operating system or as specified by the programmer. At the end of a parallel function there is an implied barrier that forces all threads to wait until the work inside the region has been completed; only the initial thread continues execution after the end of the parallel function. The thread that starts the parallel construct becomes the master of the new thread pool. Each thread in a pool is assigned a unique thread id, ranging from zero (for the master thread) up to one less than the number of threads in the pool. This function has four variants, described below.

Parallel (Task i) The specified task is executed in a parallel fashion and is scheduled on the available core/cores by the operating system. The number of threads in the thread pool is also set by the operating system and is always equal to the number of available cores. Figure-6.5 represents the allocation of parallel threads of a parallel task using the Parallel function of the SPC3 PM library: four threads, equal to the number of cores, are spawned for a task, and each thread is scheduled on a core.

Parallel (Task i, num-threads) The specified task is executed in a parallel fashion and is scheduled on the available core/cores by the operating system, with the number of threads in the thread pool set by the programmer via the num-threads variable. If the number of threads in the pool is less than the number of available cores, only that number of cores is utilized, and the selection of cores is done by the operating system. If the number of threads equals the number of cores, every core executes one thread. Finally, if the number of spawned parallel threads is greater than the number of cores, multiple threads are scheduled on each core.


[Figure: with parallel (Task i), four threads of TASK i execute, one on each of the four cores.]

Figure-6.5: Allocation of a thread pool of a parallel task using parallel function. Threads equal to the number of cores spawned and scheduled on each core.

Figure-6.6 shows the allocation of 'N' parallel threads of a parallel task using the Parallel function of the SPC3 PM library. Nine threads, as defined, are spawned for a task and scheduled across the cores; in this case each core executes more than one thread.

[Figure: with parallel (Task i, 9), nine threads of TASK i are distributed over the four cores.]

Figure-6.6: Allocation of a thread pool of a parallel task using parallel function. Threads equal to the number defined (9) spawned and scheduled on each core accordingly.

Parallel (Task i, core list) This variant of the Parallel function provides core specification functionality: the given task may be bound to a particular set of cores for its execution. Figure-6.7 shows the allocation of 'N' parallel threads of a parallel task on the specified cores using the Parallel function of the SPC3 PM library. Nine threads, as defined, are spawned for a task and scheduled on the three specified cores; core 2 is free to execute some other task assigned by the operating system.


[Figure: with parallel (Task i, 9) on cores 0, 1 and 3, nine threads of TASK i are distributed over the three specified cores, while core 2 is idle or executing another task assigned by the OS.]

Figure-6.7: Allocation of a thread pool of a parallel task using parallel function. Threads equal to the number defined spawned and scheduled to specified cores accordingly.

*p Parallel (Task i, core, *p) This fourth variant of the Parallel function has the capability to take and return parameters. The function takes a structure pointer: all the parameters to be passed to a task are defined in a C structure whose pointer is passed to the Parallel function. Similarly, all return values should be specified in a return structure.
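As a sketch of the mechanism implied by this description (an assumption about the internals, not the library's actual code; the name ParallelSketch is hypothetical), the parallel variants spawn a pool of workers and the master waits on all of them, which realizes the implied barrier:

    #include <windows.h>

    // Spawn nThreads workers for one task, then impose the implied barrier.
    void ParallelSketch(LPTHREAD_START_ROUTINE task, LPVOID data, int nThreads)
    {
        HANDLE *pool = new HANDLE[nThreads];
        for (int t = 0; t < nThreads; t++)
            pool[t] = CreateThread(NULL, 0, task, data, 0, NULL);

        // Implied barrier: continue only after every worker has finished.
        WaitForMultipleObjects((DWORD)nThreads, pool, TRUE, INFINITE);

        for (int t = 0; t < nThreads; t++)
            CloseHandle(pool[t]);
        delete[] pool;
    }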

6.4.4.3 Concurrent Function: This function is used to specify a number of independent tasks that should be executed in concurrent fashion on the available cores. These may be the same task with different data sets, or different tasks. When the tasks defined in this function are executed, a set of threads equal to or greater than the number of tasks is created, such that each task is associated with a thread or threads. These threads are scheduled on the available cores either by the operating system or as specified by the programmer. In other words, this function is an extension and fusion of the Serial and Parallel functions: all the independent tasks defined in a Concurrent call are executed in parallel, while each task is executed either serially or in parallel. This function has three variants.

Concurrent (Task i, Task j, .... Task N) The specified tasks are executed in concurrent fashion and scheduled on the available cores by the operating system. If the number of tasks is less than the number of available cores, multiple threads are created for each task in order to fully utilize the available cores. Figure-6.8 shows the concurrent execution of two tasks, Task i and Task j, on four cores; each task has two spawned threads, each scheduled on a core by the operating system.


[Figure: with Concurrent (Task i, Task j), two threads of TASK i and two threads of TASK j execute, one on each of the four cores.]

Figure-6.8: Concurrent execution of two tasks Task i and Task j on four cores. Each task has two spawned threads scheduled on a core by the operating system.

Concurrent (Task i, core, Task j, core, ......) This variant of the Concurrent function allows a particular core to be specified for each task; thus a desired core may be assigned to a particular task, and any given task may be bound to a particular core or set of cores for its execution. Scheduling a thread on a particular core, instead of leaving it to the operating system, shows greater performance and system throughput, as discussed in the subsequent section. Figure-6.9 represents the concurrent execution of two tasks, Task i and Task j, executed using the Concurrent function of the SPC3 PM library, with each task scheduled on the core assigned to it by the programmer.

[Figure: with Concurrent (Task i, 1, Task j, 3), TASK i executes on core 1 and TASK j on core 3, while cores 0 and 2 are idle or executing other tasks assigned by the OS.]

Figure-6.9: Concurrent execution of two tasks Task i and Task j on four cores. Each task is scheduled on the respective core as assigned


Concurrent (Task i, core, *p, Task j, core, *p, ......) This third variant of the Concurrent function has the capability to take and return parameters for each task. The function takes a structure pointer per task: all the parameters to be passed to a task are defined in a C structure whose pointer is passed to the Concurrent function. Similarly, all return values should be specified in a return structure.
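An illustrative call combining core assignment with per-task parameters (the task names follow figure-6.9, and the structure pointers follow the pattern of section 6.4.3):

    // Run Task i on core 1 and Task j on core 3, concurrently,
    // each with its own parameter structure:
    Concurrent (Task_i, 1, P_TASK_I, Task_j, 3, P_TASK_J);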

6.5 Performance Evaluation

This section discusses the performance evaluation of the SPC3 PM and its comparison with related programming environments, namely C/C++ and OpenMP. As discussed in section 6.3.7, SPC3 PM may be applied to nearly all types of problems. However, matrix multiplication is taken as the target algorithm because of its extensive computation and memory requirements, standard test sets, and broad use in all types of scientific and desktop applications. It serves as a building block in many applications covering nearly all subject areas: graph theory uses matrices to keep track of distances between pairs of vertices in a graph; computer graphics uses matrices to project 3-dimensional space onto a 2-dimensional screen; matrix calculus generalizes classical analytical concepts, such as derivatives of functions or exponentials, to matrices [220, 227, 219].

Serial and parallel matrix multiplication has always been a challenging task for programmers because of its extensive computation and memory requirements. With the advent of multi-core processors, parallel matrix multiplication has become more challenging still: processors now have built-in parallel computational capacity in the form of cores, and existing serial and parallel matrix multiplication techniques have to be revisited to fully utilize the available cores and to achieve maximum efficiency and minimum execution time [193, 219, 224, 225].

For the performance evaluation of SPC3 PM, a standard and fundamental matrix multiplication algorithm is selected. The algorithm is decomposed into tasks using the SPC3 PM task decomposition rules and then coded using the SPC3 PM library. To begin with, the program is executed using the Serial function of the SPC3 PM library and its performance is compared with a standard C/C++ code for the same algorithm [210]. The same matrix multiplication program is then executed using the Parallel function of the SPC3 PM library, and the results obtained are compared with the standard OpenMP code for the same algorithm [210]. Finally, the SPC3 PM matrix multiplication program is run using the Concurrent function of the SPC3 PM library, and the results are again compared with those obtained from the OpenMP code, because no other standard programming model supports concurrent execution of programs on multi-core processors.

For the execution of the programs, the latest Intel server 1500ALU with dual six-core hyper-threaded Intel Xeon 5670 processors is used. This system can execute a maximum of 24 (2 sockets x 6 cores x 2 hardware threads = 24) parallel threads. The operating systems used are Windows Server 2003 and 2008.

6.5.1 Matrix Multiplication Algorithms

Matrices offer a concise way of representing linear transformations between vector spaces, and matrix multiplication corresponds to the composition of linear transformations. The matrix product of two matrices can be defined when their entries belong to the same ring, and hence can be added and multiplied, and, additionally, the number of columns of the first matrix matches the number of rows of the second matrix. The product of an m×p matrix A with a p×n matrix B is an m×n matrix denoted C such that

    c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}

where 1 ≤ i ≤ m is the row index and 1 ≤ j ≤ n is the column index. The running time of square matrix multiplication is O(n^3). The running time for multiplying rectangular matrices (one m×p matrix with one p×n matrix) is O(mnp).

More efficient algorithms do exist. Strassen's algorithm [213], referred to as "fast matrix multiplication", has a multiplicative cost of O(n^2.807). Strassen's algorithm is awkward to implement compared to the naive algorithm, and it lacks numerical stability; nevertheless, it is beginning to appear in libraries such as BLAS [211, 212]. The algorithm with the lowest known exponent is the Coppersmith–Winograd algorithm [214], with an asymptotic complexity of O(n^2.376). However, the constant coefficient hidden by the Big O notation is so large that the Coppersmith–Winograd algorithm is only worthwhile for matrices that are too large to handle on present-day computers [211].
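As a check on the exponent quoted for Strassen's algorithm (a standard derivation added here for completeness, not taken from the sources above): Strassen multiplies two 2×2 block matrices with 7 block multiplications instead of 8, giving the recurrence

    T(n) = 7 T(n/2) + O(n^2),

whose solution, by the master theorem, is T(n) = O(n^{log_2 7}) ≈ O(n^2.807).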

Besides the serial matrix multiplication algorithms, many parallel matrix multiplication algorithms and implementations for SMPs and distributed systems have been proposed, such as the systolic algorithm [221], Cannon's algorithm [222], Fox's algorithm with square decomposition, Fox's algorithm with scattered decomposition [222], SUMMA [223], DIMMA [226], and 3-D matrix multiplication [228].

The majority of the parallel implementations of matrix multiplication for SMPs are based on functional parallelism. The existing algorithms for SMPs are not so efficient for multi-core and have to be rewritten using some multi-core supported language [191, 193]. These algorithms are also difficult for a common programmer to understand, as they require detailed subject knowledge. On the other hand, distributed algorithms, which are usually based on data parallelism, also cannot be applied to shared memory multi-core processors because of the architectural change. Some attempts have also been made to solve matrix multiplication using data parallel or concurrent approaches on Cell or GPU architectures [230, 231, 232, 233, 224, 225], but the associated problem with these approaches is architectural dependence; they cannot be used for general purpose multi-core processors.

6.5.2 Serial Function

To test the performance of the Serial function of our programming model, the standard serial matrix multiplication algorithm is implemented using C++ and the two variants of the SPC3 PM Serial function, that is, 'auto core assignment' and 'specified core assignment'. For SPC3 PM, the algorithm is decomposed into a task using the SPC3 PM task decomposition rules and then coded using the Serial function of the SPC3 PM library; the program is then executed using the two variants of the Serial function. For the C/C++ version of the program, the JavaMath benchmark [236] is selected, because it uses the standard matrix multiplication algorithm.

Some other implementations of matrix multiplication, like BLAS [237], ATLAS [238], LAPACK [239], Intel MKL [240] and the .NET Matrix library [241], have also been considered. After analysis it was found that these approaches use their own specialized algorithms rather than the standard matrix multiplication algorithm, so they cannot be used for comparison in our case. The following table 6.4 shows the three different implementations of the serial matrix multiplication algorithm.

Table-6.5 shows the execution time for each of the three approaches, C++, SPC3 PM Serial function with 'auto core assignment', and SPC3 PM Serial function with 'specified core assignment', for different sizes of matrices. Figure-6.10 shows the comparison graph of these three programming approaches.

Table-6.4: Three different algorithms for serial matrix multiplication.

Serial C/C++:

    procedure Matrix_Mul (a[][], b[][], c[][])
    {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }
    void main (void)
    {
        // initialize A[][], B[][], C[][];
        Matrix_Mul (A[][], B[][], C[][]);
    }

SPC3 PM, Serial with auto core assignment:

    Task (LPVOID lp)
    {
        PMYDATA data;
        data = (PMYDATA)lp;
        for (i = data->val3; i < data->val1; i++)
            for (j = 0; j < data->val2; j++)
                for (k = 0; k < data->val2; k++)
                    c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }
    void main (void)
    {
        typedef struct MyData
        {
            int val1, val2, val3;
            int A[][], B[][], C[][];
        } MYDATA, *PMYDATA;
        // initialize PMYDATA;
        Serial (Task, PMYDATA);
    }

SPC3 PM, Serial with specified core assignment: identical, except that the task is launched with

    Serial (Task, PMYDATA, core);


Table-6.5: Execution time in seconds for each of the three approaches, C++, SPC3 PM Serial function with 'auto core assignment' and SPC3 PM Serial function with 'specified core assignment', for different sizes of matrices.

                        Execution Time (Sec)
    Matrix Size        C/C++    SPC3 PM, Serial    SPC3 PM, Serial
                                (auto core         (specified core
                                assignment)        assignment)
    100 X 100              1          1                  1
    1000 X 1000           13         12                 12
    2000 X 2000          154        147                145
    3000 X 3000          551        527                522
    4000 X 4000         1374       1308               1286
    5000 X 5000         2707       2660               2565
    6000 X 6000         4642       4570               4457
    7000 X 7000         7285       7097               7022
    8000 X 8000        10925      10679              10560
    9000 X 9000        15606      15352              15032
    10000 X 10000      21497      20895              20609

From table-6.5 and figure-6.10, it can be clearly observed that the proposed Serial function of SPC3 PM takes less execution time than the standard C++ program. It can be further observed that, compared to the SPC3 PM Serial function with 'auto core assignment', the SPC3 PM Serial function with 'specified core assignment' takes still less execution time.

Figure-6.10: Comparison of execution time for each of the three approaches, C++, SPC3 PM Serial with 'auto core assignment', and SPC3 PM Serial with 'specified core assignment', for different sizes of matrices


Based on the readings in table 6.5, table 6.6 shows the respective speedups of the two variants of SPC3 PM Serial with respect to the standard C++ algorithm. Figure-6.11 highlights the comparison of speedups of the three programming approaches. From this figure it can again be observed that the Serial function of the proposed SPC3 PM has a higher speedup than the standard C++ program, and that the Serial function with 'specified core assignment' has more speedup than the variant with 'auto core assignment'.

Table-6.6: Speedup of SPC3 PM Serial function with 'auto core assignment' and with 'specified core assignment' for different sizes of matrices, relative to C++.

    Matrix Size        C++    SPC3 Serial,      SPC3 Serial,
                              auto core         specified core
                              assignment        assignment
    100 X 100            1       1.00              1.00
    1000 X 1000          1       1.08              1.08
    2000 X 2000          1       1.05              1.06
    3000 X 3000          1       1.05              1.06
    4000 X 4000          1       1.05              1.07
    5000 X 5000          1       1.02              1.06
    6000 X 6000          1       1.02              1.04
    7000 X 7000          1       1.03              1.04
    8000 X 8000          1       1.02              1.03
    9000 X 9000          1       1.02              1.04
    10000 X 10000        1       1.03              1.04

Figure-6.11: Speedup comparison of three serial approaches: C++, SPC3 PM Serial function with 'auto core assignment' and with 'specified core assignment'.


Both variants of the Serial function of the proposed SPC3 PM show lower execution time due to the encapsulation of a task into a thread. A normal C/C++ program is executed as a process, whereas in SPC3 PM each task is executed as a thread, and a thread generally incurs less communication and scheduling overhead than a process. As a result, the Serial function of our proposed programming model takes less execution time than the standard C/C++ code.

It is also observed that, compared to the SPC3 PM Serial function with 'auto core assignment', the Serial function with 'specified core assignment' takes still less execution time. This is due to the forced assignment of a task (thread) to a core. The operating system continuously shuffles a running task between the available cores: after a certain interval of time it reschedules the running task on a different core, and this switching between cores consumes CPU time and increases the execution time. The Serial function of SPC3 PM with 'specified core assignment' forces a task to execute to completion on the specified core; this forced assignment avoids the unnecessary switching, and hence the execution time decreases.

Both variants of the Serial function of SPC3 PM show a higher speedup than the C/C++ approach. These speedups are uniform and scalable. From table 6.6, the speedup for the Serial function with 'auto core assignment' ranges from about 1.02 to 1.08 and remains stable as the problem size increases. The same observation can be made for the second variant with 'specified core assignment': it has a somewhat better speedup, ranging from about 1.03 to 1.08, and likewise remains stable as the problem size increases.

6.5.3 Parallel Function

The same standard serial matrix multiplication algorithm is selected for parallelization. The parallel matrix multiplication algorithm is implemented with the standard shared memory programming model OpenMP to compare its results with the Parallel function of our proposed SPC3 PM library. The standard OpenMP parallel matrix multiplication program for the same serial algorithm is taken from the book 'Using OpenMP: Portable Shared Memory Parallel Programming' by Barbara Chapman, MIT Press. The following table-6.7 shows the algorithms for parallel matrix multiplication using OpenMP and the Parallel function of our SPC3 PM.

Table-6.7: Parallel matrix multiplication algorithms for OpenMP and SPC3 PM (Parallel)

OpenMP:

    void main (void)
    {
        // initializing the matrices
        int A[][], B[][], C[][];
        int core;   // number of parallel threads

        omp_set_num_threads(core);

        // initializing the parallel loop
        #pragma omp parallel for private(i,j,k)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }

SPC3 PM (Parallel):

    Task (LPVOID lp)
    {
        PMYDATA data;
        data = (PMYDATA)lp;
        for (i = data->val3; i < data->val1; i++)
            for (j = 0; j < data->val2; j++)
                for (k = 0; k < data->val2; k++)
                    c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }
    void main (void)
    {
        // initialize PMYDATA;
        Parallel (Task, PMYDATA);
    }

In the OpenMP implementation, the basic computations of addition and multiplication are placed within three nested 'for' loops, and the outermost loop is parallelized using the OpenMP directive 'pragma omp parallel for'. The matrices are solved using row-level distribution: the matrix is divided into sets of rows, equal in number to the parallel threads defined by the variable 'core', such that each row set is computed on a single core. For parallel execution with SPC3 PM, we can use the same program structure as used for serial execution. All the basic computations, that is, the three nested loops, are placed within a task; the only change required for parallel execution is to run the task using the Parallel function of the SPC3 PM library instead of the Serial function. The Parallel function enforces loop parallelism: it spawns a pool of worker threads and the data is divided among these threads. The number of worker threads in the pool is by default equal to the number of available cores unless specified; the SPC3 PM controls the number of threads via the 'num-threads' parameter of the Parallel function. These threads are then scheduled on the cores for parallel execution.

Tables 6.8 to 6.11 show the execution time for each of the two approaches, i.e., OpenMP and the Parallel function of SPC3 PM, for different sizes of matrices and different numbers of parallel threads.

Figures 6.12 to 6.15 show the comparison in execution time for these two approaches performing parallel matrix multiplication with 4, 8, 12 and 24 parallel threads.

Table-6.8: Execution Time (Sec) for parallel matrix multiplication using OpenMP and Parallel function of SPC3 PM for 4 Parallel threads

Matrix Size        SPC3 PM (Parallel)   OpenMP (Parallel)
100 X 100                   1                  1
1000 X 1000                 3                  3
2000 X 2000                35                 36
3000 X 3000               163                162
4000 X 4000               407                404
5000 X 5000               735                738
6000 X 6000              1250               1244
7000 X 7000              2076               2078
8000 X 8000              3095               3093
9000 X 9000              4565               4558
10000 X 10000            5415               5425

Table-6.9: Execution Time (Sec) for parallel matrix multiplication using OpenMP and Parallel function of SPC3 PM for 8 Parallel threads

Matrix Size        SPC3 PM (Parallel)   OpenMP (Parallel)
100 X 100                   1                  1
1000 X 1000                 2                  2
2000 X 2000                24                 25
3000 X 3000                87                 83
4000 X 4000               215                212
5000 X 5000               430                433
6000 X 6000               709                703
7000 X 7000              1091               1099
8000 X 8000              1740               1742
9000 X 9000              2499               2503
10000 X 10000            3265               3276


Table-6.10: Execution Time (Sec) for parallel matrix multiplication using OpenMP and Parallel function of SPC3 PM for 12 Parallel threads

Matrix Size        SPC3 PM (Parallel)   OpenMP (Parallel)
100 X 100                   1                  1
1000 X 1000                 1                  1
2000 X 2000                17                 18
3000 X 3000                64                 65
4000 X 4000               170                164
5000 X 5000               338                330
6000 X 6000               582                573
7000 X 7000               839                842
8000 X 8000              1285               1291
9000 X 9000              1809               1799
10000 X 10000            2670               2664

Table-6.11: Execution Time (Sec) for parallel matrix multiplication using OpenMP and Parallel function of SPC3 PM for 24 Parallel threads

Matrix Size        SPC3 PM (Parallel)   OpenMP (Parallel)
100 X 100                   1                  1
1000 X 1000                 1                  1
2000 X 2000                10                 10
3000 X 3000                39                 36
4000 X 4000                84                 86
5000 X 5000               170                171
6000 X 6000               313                303
7000 X 7000               480                476
8000 X 8000               716                710
9000 X 9000              1022               1011
10000 X 10000            1439               1431


Figures 6.12 to 6.15: Comparison of execution times (Sec) for parallel matrix multiplication using OpenMP and SPC3 PM Parallel for different number of cores

Tables 6.8 to 6.11 clearly show that the Parallel function of the proposed SPC3 PM and standard OpenMP take the same execution time for the variety of matrix sizes and numbers of parallel threads. The execution time increases with the matrix size and decreases with the number of threads in both models, which is quite natural. These results also indicate that the proposed SPC3 PM has the same performance for data-parallel execution as OpenMP; SPC3 PM does not incur any extra overhead. This is because both models use worker thread pools to compute the given loop, and the data is distributed equally among the worker threads.

Using tables 6.8 to 6.11 and table 6.5, the speedup of the Parallel function of SPC3 PM over the standard C++ algorithm is computed for different numbers of parallel threads and matrix sizes and presented in table 6.12. The speedup obtained is linear and scalable and depends on the number of parallel threads, not on the problem size. For any given number of parallel threads, the speedup remains stable as the problem size increases. This comparison is also presented in figure 6.16.
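For reference, the entries in table 6.12 follow the usual definition of speedup, S(N) = T_serial / T_parallel(N), where T_serial is the serial C++ time from table 6.5. For example, the entry 3.96 for the 10000 X 10000 case with N = 4 corresponds to the 5415 s parallel time in table 6.8, which implies a serial baseline of roughly 3.96 x 5415 ≈ 21,400 s (the exact baseline is given in table 6.5).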

Table-6.12: Speedup for Matrix multiplication using SPC3 Parallel function with different number of parallel threads and different matrix sizes.

Matrix Size        SPC3 Parallel N=4   SPC3 Parallel N=8   SPC3 Parallel N=12   SPC3 Parallel N=24

100 X 100 1.00 1.00 1.00 1.00

1000 X 1000 4.30 6.50 6.50 13.00

2000 X 2000 4.28 6.16 8.56 15.40

3000 X 3000 3.40 6.64 8.48 15.31

4000 X 4000 3.40 6.48 8.38 15.98

5000 X 5000 3.67 6.25 8.20 15.83

6000 X 6000 3.73 6.60 8.10 15.32

7000 X 7000 3.51 6.63 8.65 15.30

8000 X 8000 3.53 6.27 8.46 15.39

9000 X 9000 3.42 6.23 8.67 15.44

10000 X 10000 3.96 6.56 8.07 15.02


[Line chart: Speedup vs. matrix size (100 X 100 to 10000 X 10000); series: C++, SPC3 Parallel N=4, SPC3 Parallel N=8, SPC3 Parallel N=12, SPC3 Parallel N=24]

Figure-6.16: Speedup comparison of matrix multiplication using SPC3 Parallel with different number of parallel threads and different matrix sizes.

6.5.4 Concurrent Function

To analyze the performance of the third function of the SPC3 PM library, the Concurrent function, the same standard matrix multiplication algorithm used for the serial and parallel implementations is used, with the same program structure as the serial SPC3 PM version. All the basic computations, that is, the three nested loops, are placed within a task. The only change required for concurrent execution is to execute the task using the Concurrent function of the SPC3 PM library instead of the Serial function. The Concurrent function enforces functional parallelism. The idea is to execute this task concurrently on different cores with different data sets. Every task has its own private data variables defined in a structure 'MYDATA', a plausible layout of which is sketched below. All the private structures are associated with their tasks and initialized accordingly. Using the Concurrent function of SPC3 PM, the required number of concurrent tasks is initialized and executed; the Concurrent function spawns the number of independent threads defined in its call, and these threads execute concurrently on the available cores. Its performance has been compared with the standard OpenMP implementation of parallel matrix multiplication; it may be noted that, since no other model similar to the SPC3 PM Concurrent function is currently available, the comparison is made with the same OpenMP code as used in section 6.5.3.
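The structure definition itself is not reproduced in the listings; a plausible layout, inferred from how the fields are used as loop bounds in tables 6.7 and 6.13, would be:

    /* Hypothetical per-task private data; the field roles are inferred
       from the loop bounds in the listings (not the SPC3 PM source). */
    typedef struct {
        int val1;   /* one past the last row assigned to this task */
        int val2;   /* matrix order                                */
        int val3;   /* first row assigned to this task             */
    } MYDATA, *PMYDATA;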

The following table 6.13 shows the algorithms for parallel matrix multiplication using OpenMP and the Concurrent function of SPC3 PM.


Table-6.13: Parallel matrix algorithm for OpenMP and concurrent function of SPC3 PM.

Matrix Multiplication Algorithm (OpenMP, Parallel): identical to the OpenMP listing in table 6.7.

Matrix Multiplication Algorithm (SPC3 PM, Concurrent):

    Task(LPVOID lp)
    {
        PMYDATA data;
        data = (PMYDATA)lp;
        for (i = data->val3; i < data->val1; i++)
            for (j = 0; j < data->val2; j++)
                for (k = 0; k < data->val2; k++)
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

    void main (void)
    {
        // initialize the matrices and one private MYDATA per task
        concurrent (Task, Task, ..., Task);   // one instance per data set
    }
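As a sketch of the coding-time data distribution described above (illustrative only; how the library binds each private structure to its task instance is internal to SPC3 PM):

    /* Give each of four concurrent tasks a disjoint block of rows by
       initializing one private MYDATA per task; assumes n divisible by 4. */
    MYDATA data[4];
    int rows = n / 4;
    for (int t = 0; t < 4; t++) {
        data[t].val3 = t * rows;        /* first row of this task's block */
        data[t].val1 = (t + 1) * rows;  /* one past the last row          */
        data[t].val2 = n;               /* matrix order                   */
    }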


Tables 6.14 to 6.17 show the execution time for each of the two approaches, OpenMP and the Concurrent function of SPC3 PM, for different sizes of matrices and different numbers of parallel threads. Based on tables 6.14 to 6.17, table 6.18 shows the speedup obtained by the SPC3 PM Concurrent function for different matrix sizes and different numbers of concurrent threads. Figure 6.17 presents the comparison of speedups in graphical form.

Table-6.14: Execution Time (Sec) for parallel matrix multiplication using OpenMP and the SPC3 PM Concurrent function for 4 concurrent threads

Matrix Size        OpenMP (Parallel)   SPC3 PM (Concurrent)
100 X 100                   1                   1
1000 X 1000                 3                   3
2000 X 2000                36                  23
3000 X 3000               162                  85
4000 X 4000               404                 202
5000 X 5000               738                 396
6000 X 6000              1244                 682
7000 X 7000              2078                1086
8000 X 8000              3093                1619
9000 X 9000              4558                2303
10000 X 10000            5425                3161

Table-6.15: Execution Time (Sec) for parallel matrix multiplication using OpenMP and the SPC3 PM Concurrent function for 8 concurrent threads

Matrix Size        OpenMP (Parallel)   SPC3 PM (Concurrent)
100 X 100                   1                   1
1000 X 1000                 2                   1
2000 X 2000                25                  12
3000 X 3000                83                  44
4000 X 4000               212                 104
5000 X 5000               433                 204
6000 X 6000               703                 351
7000 X 7000              1099                 559
8000 X 8000              1742                 833
9000 X 9000              2503                1186
10000 X 10000            3276                1626


Table-6.16: Execution Time (Sec) for parallel matrix multiplication using OpenMP and the SPC3 PM Concurrent function for 12 concurrent threads

Matrix Size        OpenMP (Parallel)   SPC3 PM (Concurrent)
100 X 100                   1                   1
1000 X 1000                 1                   1
2000 X 2000                18                   8
3000 X 3000                65                  30
4000 X 4000               164                  72
5000 X 5000               330                 141
6000 X 6000               573                 242
7000 X 7000               842                 384
8000 X 8000              1291                 575
9000 X 9000              1799                 816
10000 X 10000            2664                1126

Table-6.17: Execution Time (Sec) for parallel matrix multiplication using OpenMP and the SPC3 PM Concurrent function for 24 concurrent threads

Matrix Size        OpenMP (Parallel)   SPC3 PM (Concurrent)
100 X 100                   1                   1
1000 X 1000                 1                   1
2000 X 2000                10                   7
3000 X 3000                36                  26
4000 X 4000                86                  58
5000 X 5000               171                 113
6000 X 6000               303                 197
7000 X 7000               476                 314
8000 X 8000               710                 467
9000 X 9000              1011                 661
10000 X 10000            1431                 905

Table-6.18: Comparison of speedup obtained by the SPC3 PM Concurrent function with different numbers of concurrent threads and matrix sizes

Matrix Size        N=4     N=8     N=12    N=24
100 X 100          1.00    1.00    1.00    1.00
1000 X 1000        4.33   13.00   13.00   13.00
2000 X 2000        6.70   12.83   19.25   22.00
3000 X 3000        6.48   12.52   18.37   21.19
4000 X 4000        6.80   13.21   19.08   23.69
5000 X 5000        6.84   13.27   19.20   23.96
6000 X 6000        6.81   13.23   19.18   23.56
7000 X 7000        6.71   13.03   18.97   23.20
8000 X 8000        6.75   13.12   19.00   23.39
9000 X 9000        6.78   13.16   19.13   23.61
10000 X 10000      6.80   13.22   19.09   23.75


Figure-6.17: Comparison of speedups, based on table 6.18, for the SPC3 PM Concurrent function with 4, 8, 12 and 24 concurrent threads.

From tables 6.14 to 6.17 it can clearly be seen that the SPC3 PM Concurrent function shows a much lower execution time than the standard OpenMP implementation. This is because, in the OpenMP implementation, it is the compiler's responsibility to distribute the data evenly over the parallel threads, and this data distribution consumes time; moreover, the algorithm has three-level nested loops, causing more difficulty and dependence in the shared data. In the Concurrent function of SPC3 PM, by contrast, functional parallelism is exploited: the data is distributed among the concurrent tasks at coding time, and the same task is executed concurrently with different data sets. This approach saves data distribution and thread communication time and reduces the complexity and data dependency between the threads.

Table 6.18 and figure 6.17 show the speedup, based on the readings in table 6.9 and tables 6.14 to 6.17, for the Concurrent function of SPC3 PM. The speedup obtained is linear and scalable and depends on the number of parallel threads, not on the problem size. For any given number of parallel threads, the speedup remains nearly constant as the problem size increases.

Figures 6.18 to 6.21 highlight the comparison of speedups for matrix multiplication using SPC3 PM Concurrent and OpenMP for 4, 8, 12 and 24 parallel threads with different matrix sizes. From these figures it can clearly be seen that the speedups obtained using the Concurrent function of SPC3 PM are much higher than those of OpenMP for any given number of parallel threads and any problem size. It can also be observed that the Concurrent function of SPC3 PM provides super-linear speedup: ideally, for any number of parallel threads, the speedup should equal that number of threads, yet for 4 parallel threads the speedup obtained from the SPC3 PM Concurrent function is between 6 and 7, and it remains constant as the problem size increases. Similarly, for 8, 12 and 24 parallel threads the achieved speedups are about 13, 19 and 23 respectively, independent of problem size.

Figure-6.18: Comparison of speedups for parallel matrix multiplication using OpenMP and SPC3 PM Concurrent (for 4 cores)

Figure-6.19: Comparison of speedups for parallel matrix multiplication using OpenMP and SPC3 PM Concurrent (for 8 cores)


Figure-6.20: Comparison of speedups for parallel matrix multiplication using OpenMP and SPC3 PM Concurrent (for 12 cores)

Figure-6.21: Comparison of speedups for parallel matrix multiplication using OpenMP and SPC3 PM Concurrent (for 24 cores)

6.6 Summary

In this chapter, a new parallel multi-threaded programming environment, SPC3 PM (Serial, Parallel and Concurrent Core to Core Programming Model), developed for multi-core processors, is discussed. The development of SPC3 PM was motivated by the need to equip a common programmer with a multi-core programming environment and tools for scientific and general purpose computing. SPC3 PM provides a set of rules for algorithm decomposition and a library of primitives that exploit thread-oriented task-level parallelism and concurrency on multi-core processors. It has many unique features that distinguish it from all other existing parallel programming models, such as support for both data and functional parallel programming, the capability of executing a program or part of a program in serial, parallel or concurrent fashion, and a processor-core interaction feature that enables a programmer to assign any task, or a number of tasks, to any core or set of cores.


CHAPTER 7

Solving Travelling Salesman Problem Using SPC3 PM

This chapter covers the performance analysis of SPC3 PM based on the Travelling Salesman Problem (TSP). To study the performance of SPC3 PM on a large and multi-dimensional irregular problem, the Travelling Salesman Problem is selected. A few reasons for its selection are its wide application spectrum, its representativeness, its generalized structure, and the fact that it is hard to solve and difficult to parallelize. The problem has also been graded as a classical problem of combinatorial optimization. Out of the many solutions for TSP, the Lin-Kernighan heuristic, which uses local search optimization meta-heuristics, is chosen for parallelization using SPC3 PM. This LK heuristic is generally considered to be one of the most effective methods for generating optimal or near-optimal solutions for the symmetric travelling salesman problem. The standard serial code for the LK heuristic, the LKH-2 software, is parallelized using the SPC3 PM task decomposition rules and the SPC3 PM library to make this serial code suitable for multi-core processors. In this chapter, an introduction to TSP, its different solution techniques and the selected algorithm, the Lin-Kernighan heuristic, are discussed. Later, the methodology involved in the parallelization of the code using SPC3 PM is presented. Finally, the performance of the parallelized code is analyzed on many standard TSP instances available in TSPLIB.

7.1 Travelling Salesman Problem (TSP)

The travelling salesman problem (TSP) is one of the most representative irregular problems in combinatorial optimization and has been graded as a classical problem of the field. Despite its simple formulation, TSP is hard to solve. The difficulty becomes apparent when one considers the number of possible tours. For a symmetric problem with n cities there are (n-1)!/2 possible tours. If n is 20, there are more than 10^16 tours. For the 7397-city problem in TSPLIB, there are more than 10^25,000 possible tours. In comparison, it may be noted that the number of elementary particles in the universe has been estimated to be 'only' 10^87 [246].

In TSP, we consider a salesman who has to visit n cities. The TSP asks for the shortest tour through all the cities such that no city is visited twice and the salesman returns at the end of the tour to the starting city. The shortest tour automatically implies the minimum cost and the minimum time.

Mathematically, let G = (V, E) be a graph, where V is a set of n nodes and E is a set of arcs. Let C = [c_ij] be a cost matrix associated with E, where c_ij represents the cost of going from city i to city j. The problem is to find a permutation (i_1, i_2, ..., i_n) of the integers from 1 through n that minimizes the quantity

    c_{i_1 i_2} + c_{i_2 i_3} + ... + c_{i_{n-1} i_n} + c_{i_n i_1}

Using the integer programming formulation, the TSP can be defined as

    minimize   ∑_i ∑_j c_ij x_ij

such that

    ∑_i x_ij = 1 for all j,        ∑_j x_ij = 1 for all i,

    ∑_{i∈S} ∑_{j∈S} x_ij ≤ |S| − 1 for every proper, non-empty subset S of V,

    x_ij ∈ {0, 1},

where x_ij = 1 if arc (i, j) is in the solution and 0 otherwise.

Properties of the cost matrix C are used to classify problems.

 If cij = cji for all i and j, the problem is said to be symmetric; otherwise, it is asymmetric.

 If the triangle inequality holds (c_ik ≤ c_ij + c_jk for all i, j and k), the problem is said to be metric.

 If the c_ij are Euclidean distances between points in the plane, the problem is said to be Euclidean. A Euclidean problem is both symmetric and metric.


7.1.1 TSP Applications

TSP has diversified application areas because of its generalized nature. It is used to address major problems in nearly all engineering disciplines, medicine and the computational sciences [245, 247, 250]. Major TSP-based applications include:

 Machine Scheduling

 VLSI Floorplan Optimization

 Drilling of printed circuit boards

 Overhauling of gas turbine engines

 Analysis of the structure of crystals

 Seismic Vessel Problem (SVP)

 Stacker Crane Problem (SCP)

 Cutting Stock Problems

 Gene Mapping

 DNA Analysis

 Protein Function Prediction

 Material handling in a warehouse

 Assignment of routes for planes of a specified fleet

 Vehicle Routing Problem (VRP)

 Astronomical theories

 Network Routing Algorithms and many more.

7.1.2 TSP Solutions

TSP is one of the most famous hard, irregular, combinatorial optimization problems. It has been proven that TSP is a member of the set of NP-complete problems, a class of difficult problems whose time complexity is believed to be exponential. In order to find the optimal solution of a TSP-based problem, a number of solution approaches have been proposed, which can be classified into three classes: exact, heuristic and meta-heuristic algorithms.


7.1.2.1 Exact Algorithms: These algorithms are used when an exact optimal solution is required. Every possible solution is identified and compared to find the optimum, so these algorithms are suitable only for small inputs. The brute-force method, the dynamic programming algorithm of Held and Karp, Branch-and-Bound and Branch-and-Cut are some of the famous algorithms of this class [245, 246].

7.1.2.2 TSP Heuristics: These heuristics are used when the problem size is large, time is limited or the data of the instance is not exact. Instead of finding all possible solutions of a given problem, a sub-optimal solution is identified. TSP heuristics can be roughly partitioned into two classes: 'constructive heuristics' and 'improvement heuristics'. Constructive heuristics build a tour from scratch and stop when one solution is produced. Improvement heuristics start from a tour, normally obtained using a construction heuristic, and iteratively improve it by changing some parts of it at each iteration. Improvement heuristics are typically much faster than the exact algorithms and often produce solutions very close to the optimal one. Greedy algorithms, Nearest Neighbor, Vertex Insertion, Random Insertion, Cheapest Insertion, Savings heuristics, Christofides' heuristic, Karp-Steele heuristics and the ejection-chain method are the well-known heuristic algorithms of this class [245, 246, 247].

7.1.2.3 Meta-Heuristics: These are intelligent heuristic algorithms having the ability to find their way out of local optima. Meta-heuristic approaches combine the first two classes: they contain implicit intelligent algorithms, the ability to escape local optima, and the possibility of numerous variants and hybrids. These heuristics are relatively more challenging to parallelize. For these reasons, meta-heuristic approaches have drawn the attention of many researchers. Some of the well-known established meta-heuristics are random optimization, local search optimization, greedy and hill-climbing algorithms, best-first search, genetic algorithms, simulated annealing, tabu search, ant colony optimization, particle swarm optimization, gravitational search, stochastic diffusion search, harmony search, variable neighborhood search, glowworm swarm optimization (GSO) and the artificial bee colony algorithm. However, because of the nature of the problem, not all of these meta-heuristics can be used for solving TSP. Specific meta-heuristics for solving TSP include simulated annealing, genetic algorithms, neural networks, tabu search, ant colony optimization and local search optimization [245, 246, 247].

7.1.2.4 Hyper-Heuristics: This is an emerging direction in modern search technology. It is termed hyper-heuristic because it aims to raise the level of granularity at which an optimization system can operate. Hyper-heuristics are broadly concerned with intelligently choosing the right heuristic or algorithm for a given situation. A hyper-heuristic works at a higher level than the typical application of meta-heuristics to optimization problems, i.e., a hyper-heuristic could be a heuristic or meta-heuristic which operates on other low-level heuristics or meta-heuristics [245].

7.2 Lin-Kernighan Heuristic (LKH)

The Lin-Kernighan heuristic is an implementation of the local search optimization meta-heuristic [248, 249, 250, 251]. It is generally considered to be one of the most effective methods for generating optimal or near-optimal solutions for the symmetric travelling salesman problem. Computational experiments have shown that LKH is highly effective: even though the algorithm is approximate, optimal solutions are produced with an impressively high frequency. LKH has produced optimal solutions for all solved problems, including an 85,900-city instance in TSPLIB. Furthermore, this algorithm has improved the best known solutions for a series of large-scale instances with unknown optima, such as the 'World TSP' instance of 1,904,711 cities. After the original algorithm (LK), two successive variants, LKH-1 and LKH-2, have been proposed with further improvements to the original algorithm [246, 248, 250, 251, 254].

7.2.1 Basic Lin–Kernighan Heuristic Algorithm (LKH)

The Lin–Kernighan algorithm belongs to the class of so-called local search algorithms [252, 253]. A local search algorithm starts at some location in the search space and subsequently moves from the present location to a neighboring location. The algorithm is specified in terms of exchanges (or moves) that can convert one candidate solution into another. Given a feasible TSP tour, the algorithm repeatedly performs exchanges that reduce the length of the current tour, until a tour is reached for which no exchange yields an improvement. This process may be repeated many times from initial tours generated in some randomized way. The details of the algorithm can be found in [250].

The Lin–Kernighan algorithm (LK) performs so-called k-opt moves on tours. A k-opt move changes a tour by replacing k of its edges with k new edges in such a way that a shorter tour is achieved. Let T be the current tour. At each iteration step the algorithm attempts to find two sets of edges, X = {x1, ..., xk} and Y = {y1, ..., yk}, such that, if the edges of X are deleted from T and replaced by the edges of Y, the result is a better tour. The edges of X are called out-edges; the edges of Y are called in-edges. The two sets X and Y are constructed element by element: initially X and Y are empty, and in step i a pair of edges, xi and yi, is added to X and Y respectively. Figure 7.1 illustrates a 3-opt move.

Figure-7.1: A 3-opt move. x1, x2, x3 are replaced by y1, y2, y3 [ Ref. 250]
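For intuition, the simplest such move, a 2-opt move, deletes two tour edges and reverses the segment between them. A generic sketch (not taken from the LKH-2 source) is:

    /* Perform a 2-opt move on a tour stored as an array of city indices:
       delete edges (tour[i], tour[i+1]) and (tour[j], tour[j+1]),
       then reverse the segment tour[i+1 .. j]. */
    void two_opt_move(int *tour, int i, int j)
    {
        int lo = i + 1, hi = j;
        while (lo < hi) {
            int tmp  = tour[lo];
            tour[lo] = tour[hi];
            tour[hi] = tmp;
            lo++;
            hi--;
        }
    }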

In order to achieve a sufficiently efficient algorithm, only edges that fulfill the following criteria may enter X or Y.

a) The sequential exchange criterion: xi and yi must share an endpoint, and so must yi and xi+1. If t1 denotes one of the two endpoints of x1, we have in general xi = (t2i−1, t2i), yi = (t2i, t2i+1) and xi+1 = (t2i+1, t2i+2) for i ≥ 1, as shown in Figure 7.2. The sequence (x1, y1, x2, y2, x3, ..., xk, yk) constitutes a chain of adjoining edges. A necessary (but not sufficient) condition for the exchange of the edges X with the edges Y to result in a tour is that the chain is closed, i.e., yk = (t2k, t1). Such an exchange is called sequential. For such an exchange the chain of edges forms a cycle along which edges from X and Y appear alternately, a so-called alternating cycle, as shown in Figure 7.3.


Figure-7.2: Restricting the choice of xi , yi , xi+1, and Figure-7.3: Alternating cycle (x1, y1, x2, y2, x3, y3, x4, y4) yi+1 [ Ref. 250] [ Ref. 250]

Generally, an improvement of a tour may be achieved as a sequential exchange by a suitable numbering of the affected edges. However, this is not always the case: Figure 7.4 shows an example where a sequential exchange is not possible. Note that all 2- and 3-opt moves are sequential. The simplest non-sequential move is the 4-opt move shown in Figure 7.4, the so-called double-bridge move.

Figure-7.4: Non-sequential exchange (k = 4) [ Ref. 250]

b) The feasibility criterion: It is required that xi = (t2i−1, t2i) is chosen so that, if t2i is joined to t1, the resulting configuration is a tour. This feasibility criterion is used for i ≥ 3 and guarantees that it is possible to close up to a tour. The criterion was included in the algorithm both to reduce running time and to simplify the coding. It restricts the set of moves to be explored to those k-opt moves that can be performed by a 2- or 3-opt move followed by a sequence of 2-opt moves. In each of the subsequent 2-opt moves the first edge to be deleted is the last added edge in the previous move (the close-up edge). Figure 7.5 shows a sequential 4-opt move performed by a 2-opt move followed by two 2-opt moves.


Figure-7.5: Sequential 4-opt move performed by three 2-opt moves. Close-up edges are shown by dashed lines [ Ref. 250]

c) The positive gain criterion: It is required that yi is always chosen so that the cumulative gain, Gi, from the proposed set of exchanges is positive. Suppose gi = c(xi) − c(yi) is the gain from exchanging xi with yi. Then Gi is the sum g1 + g2 + ... + gi. This stop criterion plays a major role in the efficiency of the algorithm (a sketch of the stop rule follows the list of criteria below).

d) The disjunctivity criterion: It is required that the sets X and Y are disjoint. This simplifies the coding, reduces running time, and gives an effective stop criterion.

e) The candidate set criterion: The search for an edge to enter the tour, yi =(t2i , t2i+1), is limited to the five nearest neighbors to t2i .
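As a sketch of how the positive gain criterion acts as a stop rule during move construction (illustrative only; c(), x[] and y[] are hypothetical names, not the LKH-2 source):

    /* Abort construction of a k-opt move as soon as the cumulative
       gain G_i = g_1 + ... + g_i stops being positive. */
    double G = 0.0;
    for (int i = 0; i < k; i++) {
        G += c(x[i]) - c(y[i]);  /* g_i = c(x_i) - c(y_i) */
        if (G <= 0.0)
            break;               /* positive gain criterion violated */
    }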

7.2.2 Modified Lin–Kernighan Heuristic Algorithm (LKH-1)

The original LKH algorithm was able to find solutions that are 1–2% above the optimum. To improve its results further, some modifications were introduced by revising the edge selection criteria, and this modified LKH was named LKH-1. LKH-1 showed better and often optimal solutions compared with LKH. The details of the modified algorithm can be found in [251]. The revised criteria are described briefly below.

a) The sequential exchange criterion: This criterion has been relaxed a little. When a tour can no longer be improved by sequential moves, attempts are made to improve the tour by non-sequential 4- and 5-opt moves.

b) The feasibility criterion: A sequential 5-opt move is used as the basic sub-move. For i ≥ 1 it is required that x5i = (t10i−1, t10i) is chosen so that if t10i is joined to t1, the resulting configuration is a tour. Thus, the moves considered by the algorithm are sequences of one or more 5-opt moves. However, the construction of a move is stopped immediately if it is discovered that closing up to a tour results in a tour improvement. Using a 5-opt move as the basic sub-move instead of 2- or 3-opt moves broadens the search and increases the algorithm's ability to find good tours, at the expense of increased running times.

c) The disjunctivity criterion: The sets X and Y need no longer be disjoint. In order to prevent an infinite chain of sub-moves, the last edge deleted in a 5-opt move must not have been added previously in the current chain of 5-opt moves.

d) The candidate set criterion: The cost of an edge is replaced by a new measure called the α-measure. Given the cost of a minimum 1-tree, the α-value of an edge is the increase of this cost when the minimum 1-tree is required to contain the edge. Using α-nearness it is often possible to restrict the search to relatively few of the α-nearest neighbors of a node and still obtain optimal tours.

7.2.3 Lin–Kernighan Heuristic Algorithm with General k-opt Sub-moves (LKH-2)

LKH-2 eliminated many of the limitations and shortcomings of LKH-1. This variant extends the previous one with new algorithms and data structures for solving very large instances, and with facilities for obtaining solutions of even higher quality. The details of the LKH-2 algorithm can be found in [248]. A brief description of the main features of LKH-2 is given below.

General k-opt sub-moves: LKH-2 introduced the use of general k-opt moves for tour improvement instead of the 2- or 3-opt moves of LKH-1, where k is any chosen integer greater than or equal to 2 and less than the number of cities.

Partitioning: In LKH-2 the concept of partitioning a large-scale problem into sub-problems was added. Each sub-problem is solved separately, and its solution is used to improve a given overall tour, T. The set of nodes is partitioned into subsets of a given maximum size. Each subset, S, induces a sub-problem consisting of all nodes of S, with edges fixed between nodes that are connected by segments of T whose interior nodes do not belong to S.

Tour merging: LKH-2 provides an improved tour merging procedure that attempts to produce the best possible tour from two or more given tours using local optimization.

Iterative partial transcription: Iterative partial transcription was added in LKH-2. It is a general procedure for improving the performance of a local-search-based heuristic algorithm. It attempts to improve two individual solutions by replacing certain parts of either solution with the related parts of the other solution.

7.3 LKH-2 Software

The LKH-2 software provides an effective serial implementation of the Lin-Kernighan heuristic algorithm with general k-opt sub-moves for solving the travelling salesman problem. It is written in Visual C++. Computational experiments have shown that the LKH-2 software is highly effective for solving TSP. It has produced optimal solutions for all solved problems we have been able to obtain, including an 85,900-city instance available in TSPLIB. Furthermore, it has improved the best known solutions for a series of large-scale instances with unknown optima, among these the 1,904,711-city instance commonly known as the World TSP. The LKH-2 software also currently holds the record for all instances with unknown optima provided in the DIMACS TSP Challenge, which provides many benchmark instances ranging from 1,000 to 10,000,000 cities. Six versions, from 2.0.0 to 2.0.5, have been released. For our study we have used the latest version, 2.0.5, released in 2010. The software can be downloaded free from [254].

7.3.1 Execution of LKH-2 software

The LKH-2 software is written in a functional programming style. Its complex computation is divided into ninety-eight functions which can be called from the main program accordingly, with the help of thirteen header files. After analyzing the code and working of LKH-2, it has been found that its operation can be broken into seven basic stages, as discussed below. A flow chart representing these stages of the LKH-2 software is shown in section 7.3.2.

Stage 1: Read parameter file

This is the first step in the LKH-2 software. A function is called to open the parameter file and to read the specified problem parameters from the file. The details regarding the parameter file are presented in Appendix A.
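For illustration, an LKH-2 parameter file consists of simple 'KEYWORD = value' lines. A minimal example might look as follows (the keyword names follow the LKH distribution; the values shown are only an example):

    PROBLEM_FILE = pr2392.tsp
    OPTIMUM = 378032
    MOVE_TYPE = 5
    RUNS = 10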

Stage 2: Read Problem file

In the next step, the specified problem file is read. In the TSP library, every instance and its related information is placed in an individual file using a standard format; this file is known as the problem file. The 'ReadProblem' function in the LKH-2 software reads the problem data in TSPLIB format for further processing.
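As an illustration of the TSPLIB format parsed at this stage, the header and first coordinate lines of a Euclidean instance look like the following abridged excerpt:

    NAME : berlin52
    TYPE : TSP
    DIMENSION : 52
    EDGE_WEIGHT_TYPE : EUC_2D
    NODE_COORD_SECTION
    1 565.0 575.0
    2 25.0 185.0
    ...
    EOF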

Stage 3: Partitioning of the problem

After reading the problem, a large problem may be divided into a number of sub-problems, as defined in the parameter file by the parameter 'sub-problem size'. If the sub-problem size is zero, no partitioning of the problem is done. Otherwise, by default, the sub-problems are determined by sub-dividing the tour into segments of equal size. The LKH-2 software also provides five other techniques to partition the problem: Delaunay partitioning, Karp partitioning, K-means partitioning, Rohe partitioning, and Moore or Sierpinski partitioning.

Stage 4: Initialization of data structures and statistics variables

After reading the problem, and partitioning it if required, the related data structures and statistics variables are initialized. The major statistical variables include the minimum and maximum trials, the total number of trials and the number of successful trials, the minimum, maximum and total cost, and the minimum, maximum and total time.


Stage 5: Generation of Initial Candidate Set

The 'CreateCandidateSet' function and its sub-functions determine a set of incident candidate edges for each node. If the penalties (the Pi-values in the parameter file) are not defined, the 'Ascent' function is called to determine a lower bound on the optimal tour using sub-gradient optimization. Otherwise the penalties are read from the file, and the lower bound is computed from a minimum 1-tree. The function 'GenerateCandidates' is then called to compute the α-values, and a set of incident candidate edges is associated with each node.

Stage 6: Find Optimal Tour

This is the main processing step, where the optimal tour is found. After the creation of the candidate set, the 'FindTour' function is called a predetermined number of times (runs). FindTour performs a number of trials, where in each trial it attempts to improve a chosen initial tour using the modified Lin-Kernighan edge exchange heuristics. If the tour found is better than the existing tour, the tour and its time are recorded.

Stage 7: Update Statistics

This step is invoked at two levels. First, it is processed after every individual call of the 'FindTour' function to update the respective statistical variables. Finally, it is processed at the end of the total runs of 'FindTour' to calculate and report the average statistics.


7.3.2 Flow Chart for LKH-2 Software Processing

[Flow chart: Start -> Read Parameter File -> Read Problem File -> (if the problem is to be decomposed) Partitioning of the Problem into Sub-problems as defined in the Parameter File -> Initialization of Data Structures -> Initialization of Statistics Variables -> Generation of Initial Candidate Set -> Find Optimal Tour -> Update Statistics Variables -> repeat while runs remain -> Outcome of Program Execution]

Figure-7.6: Stages in the original serial LKH-2 software


7.4 Parallelization of LKH-2 Software using SPC3 PM

The LKH-2 software is a serial code and cannot make the most of multi-core processors unless modified accordingly. The code can be made suitable for multi-core processors by introducing parallelism and concurrency into it; here this is done using SPC3 PM. As discussed in chapter 6, SPC3 PM (Serial, Parallel and Concurrent Core to Core Programming Model) provides an environment to decompose an application into tasks using its task decomposition rules and then execute these tasks in serial, parallel and concurrent fashion. As the LKH-2 software is written in a functional style, only some parts of the code have to be restructured to make it suitable for SPC3 PM.

The working of the LKH-2 software can be decomposed into seven stages, as discussed in the previous section. Of the seven, the most important and computationally intensive stages are the sixth and seventh: finding the optimal tour using the LKH-2 algorithm and updating the statistics accordingly. The other related time-consuming step is executing this stage multiple times, as defined in the parameter file (runs). The remaining stages do not demand much time or computation and can be executed serially.

The LKH-2 software is parallelized by converting its tour-finding and related routines (the sixth stage) into tasks according to the SPC3 PM task decomposition rules and executing them in parallel using the Parallel function of the SPC3 PM library. To execute this stage multiple times, as defined in the parameter file (runs), the Concurrent function of SPC3 PM is used; this enables the stage to execute concurrently on the available cores. This approach of decomposition and execution makes the LKH-2 software suitable for parallel execution on multi-core processors.

The two-level parallel and concurrent execution of the stages also makes the LKH-2 software scalable with respect to multi-core processors. The available cores are divided into sets, one per run of stage 6. Each set executes the stage concurrently, and the cores within each set execute the single task of finding the optimal tour in parallel. The number of sets and the number of cores in each set are calculated using the following simple relations:

    S = Number of runs of stage 6                       (7.1)

    C = Number of available cores / S                   (7.2)

For example, on a 24-core processor with 8 runs of the 'finding the tour' task, a total of 8 sets with 3 cores each are created. Each individual execution of the task is performed on one set concurrently, while the 3 cores within each set execute the task in parallel.
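In code form, relations (7.1) and (7.2) reduce to two lines (integer division, as suggested by the 'nearly 2 cores per set' case discussed in section 7.5.2):

    int S = runs;             /* number of concurrent sets, eq. (7.1) */
    int C = total_cores / S;  /* cores per set, eq. (7.2)             */

The new parallel program structure for the LKH-2 software using SPC3 PM is given below.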

// Define Tasks according to the proposed partitioning

Task1   // Read parameter file
Task2   // Read problem file
Task3   // Partitioning of the problem
Task4   // Initialization of data structures and statistics variables
Task5   // Generation of initial candidate set
Task6   // Find tour using LKH
Task7   // Update statistics
Task8   // Find tour in parallel
{
    parallel (Task6, C);   // parallel execution of Task6 on 'C' cores
    serial (Task7);
}

void main (void)   // Start of the main program
{
    serial (Task1);
    serial (Task2);
    if (sub_problem_size > 0)
        serial (Task3);
    serial (Task4);
    serial (Task5);
    concurrent (Task8, Task8, Task8, ..., Task8);   // concurrent execution of Task8, one per set (S)
    serial (Task7);
}


7.4.1 Flow Chart for LKH-2 Software Processing Parallelized using SPC3 PM

[Flow chart: Start -> Read Parameter File -> Read Problem File -> (if the problem is to be decomposed) Partitioning of the Problem into Sub-problems as defined in the Parameter File -> Initialization of Data Structures -> Initialization of Statistics Variables -> Generation of Initial Candidate Set -> concurrent execution of several 'Find Optimal Tour in Parallel' / 'Update Statistics Variables' branches -> Report Minimum, Maximum and Average Statistics]

Figure-7.7: Stages in the parallelized LKH-2 software using SPC3 PM


7.5 Performance Evaluation

This section discusses the performance comparison of the LKH-2 software parallelized using SPC3 PM and the original LKH-2 software, version 2.0.5, on various instances of the TSP library [246, 255, 256]. The LKH-2 software is parallelized using the SPC3 PM task decomposition rules and library to make this serial code suitable for multi-core processors, as discussed in section 7.4.

All the computational tests reported in this chapter, for both the original and the parallelized LKH-2 code on different instances of TSPLIB, have been made using the default values of the parameters defined in the LKH-2 parameter file. The default values have proven to be adequate in many applications [246].

For the execution of the algorithms, an Intel server 1500ALU with two six-core hyper-threaded Intel Xeon 5670 processors running the 64-bit Windows Server 2008 operating system is used. The total number of hardware threads that can execute in parallel is thus 2 x 6 x 2 = 24.

7.5.1 TSP Library (TSPLIB)

TSPLIB is a collection of more than 100 standard problem instances of the TSP. It was created by Reinelt in 1990 and is the most widely studied problem set in computational work on the TSP. Any proposed TSP algorithm may be evaluated by running it on any instance of TSPLIB and comparing its result with the best known optimal solution. The TSP instances range from 14 cities to 85,900 cities, covering a variety of industrial applications, geographic point sets and academic challenge problems [246, 255, 256]. On the basis of their sizes, these problem instances can be divided into the following four classes.

a) Small TSP Instances: This class contains all the instances having fewer than 1,000 cities. These instances are readily solved by any TSP software within a minute on an ordinary present-day computer such as a P4 or Core 2 Duo. There are exactly 78 TSP test instances in this class. A list of all instances in this class with their optimal values is given in Appendix A.


b) Medium Size TSP Instances: All the test instances having between 1,000 and 2,500 cities fall in this class. These instances are also easily computable within an hour or two on an ordinary present-day computer. A total of 21 TSP instances are included in this class. A list of all instances in this class with their optimal values is given in Appendix A.

c) Large TSP Instances: This class consists of all TSP examples having between 2,501 and 10,000 cities. These instances call for round-the-clock parallel computation; solved serially, they require days or even weeks to reach their optimal value. There are 6 such TSP instances in this class. A list of all instances in this class with their optimal values is given in Appendix A.

d) Very Large TSP Instances: Lastly, there are seven test TSP instances in which the number of cities exceeds 10,000. These problem instances are the real challenges for modern-day computing: they require months or years on a single processor, and very specific algorithms, for their solutions. A list of all instances in this class with their optimal values is given in Appendix A.

7.5.2 Result Analysis

For our study we have selected the medium size and large size TSP instances of TSPLIB. The instances of the medium and large size classes are executed for their optimal solutions with both the original (serial) and the modified (parallel) LKH-2 code, with the default parameter file and ten runs for each TSP instance. Table 7.1 shows the minimum, average and total run time of the original serial LKH-2 software for each medium size TSP instance. Table 7.2 shows the same for the LKH-2 software parallelized using SPC3 PM.


Table-7.1: Minimum, Average and Total run time for the original serial LKH-2 software for each medium size TSP instance

TSP Instance    Optimal Value    Average Root Gap    Minimum Time (Sec)    Average Time (Sec)    Total Time (Sec)

pr1002 259045 0.00% 1 1 12

si1032 92650 0.00% 5 7 74

u1060 224094 0.01% 54 103 1026

vm1084 239297 0.02% 30 42 420

pcb1173 56892 0.00% 0 3 30

d1291 50801 0.00% 3 4 43

rl1304 252948 0.16% 14 14 140

rl1323 270199 0.02% 2 12 117

nrw1379 56638 0.01% 14 16 158

fl1400 20127 0.18% 3663 3906 39061

u1432 152970 0.00% 3 3 33

fl1577 [22204,22249] 0.24% 1218 2189 21888

d1655 62128 0.00% 2 4 39

vm1748 336556 0.00% 20 22 220

u1817 57201 0.09% 68 119 1188

rl1889 316536 0.00% 65 135 1348

d2103 [79952,80450] 0.63% 146 162 1624

gr2121 2707 0.00% 25 30 303

u2319 234256 0.00% 1 1 10

pr2392 378032 0.00% 1 1 10


Table-7.2: Minimum, Average and Total run time for Parallelized LKH-2 software using SPC3 PM for each medium size TSP instances

TSP Instance    Optimal Value    Average Root Gap    Minimum Time (Sec)    Average Time (Sec)    Total Time (Sec)

pr1002 259045 0.00% 0 1 9

si1032 92650 0.00% 2 5 51

u1060 224094 0.01% 34 67 673

vm1084 239297 0.02% 15 27 271

pcb1173 56892 0.00% 0 2 20

d1291 50801 0.00% 2 3 31

rl1304 252948 0.16% 8 10 98

rl1323 270199 0.02% 2 8 77

nrw1379 56638 0.01% 8 11 112

fl1400 20127 0.18% 1883 2370 23695

u1432 152970 0.00% 2 2 22

fl1577 [22204,22249] 0.24% 809 1422 14222

d1655 62128 0.00% 2 3 28

vm1748 336556 0.00% 11 16 159

u1817 57201 0.09% 44 81 811

rl1889 316536 0.00% 43 100 1001

d2103 [79952,80450] 0.63% 106 137 1368

gr2121 2707 0.00% 15 22 219

u2319 234256 0.00% 1 1 7

pr2392 378032 0.00% 1 1 7


The following figures 7.8, 7.9 and 7.10 are based on tables 7.1 and 7.2. Figure 7.8 shows the comparison of minimum run time between the original serial LKH-2 software and the LKH-2 software parallelized using SPC3 PM for the medium size TSP instances. Similarly, figures 7.9 and 7.10 show the comparison of average and total run time between the two versions for 10 runs of each medium size TSP instance.

[Bar chart: Minimum Time (Time in Sec) vs. TSPLIB Instances; series: Original Serial LKH-2, Parallelized LKH-2]

Figure-7.8: Comparison of minimum run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the medium size TSP instances

[Bar chart: Average Time (Time in Sec) vs. TSPLIB Instances; series: Original Serial LKH-2, Parallelized LKH-2]

Figure-7.9: Comparison of average run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the medium size TSP instances


[Bar chart: Total Time (Time in Sec) vs. TSPLIB Instances; series: Original Serial LKH-2, Parallelized LKH-2]

Figure-7.10: Comparison of total run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the medium size TSP instances

From figure 7.8 it may clearly be observed that, for minimum execution time, the LKH-2 software parallelized using SPC3 PM requires much less time than the original LKH-2 software. This is because the main function of finding the optimal tour using the LKH algorithm is executed in parallel on the available cores, as defined by equation (7.2). In this case, the 10 runs of each instance are executed concurrently on 20 of the 24 cores; that is, each run has a set of 2 cores for its parallel execution (C = 24/10 = 2 after integer division, leaving 4 cores unused). The speedup obtained ranges from 1.5 to 1.7, which is quite near the ideal speedup of 2 for this configuration.

Similarly, from figure 7.9, the same observation can be made for the average execution time: the LKH-2 software parallelized using SPC3 PM requires much less time than the original LKH-2 software, because all the required runs of an instance execute in parallel on their respective allocated sets of 2 cores.

For the total execution time required for the 10 runs of each instance, the parallelized LKH-2 code shows a much greater performance gain in comparison with the original LKH-2 code. This is because of the concurrent execution of all the required runs on the available cores. In this case, as defined by equation (7.1), a total of 10 sets are created, and each set is responsible for executing one run of a given instance. Thus all the runs are executed concurrently on the 24-core machine, making the most of the multi-core processor and reducing the total execution time remarkably. In the serial execution of the original LKH-2 software, by contrast, the next run of a TSP instance is executed only after the completion of the previous run.

Table 7.3 shows the minimum, average and total time of the original serial LKH-2 software for each large size TSP instance, while table 7.4 shows the same for the LKH-2 software parallelized using SPC3 PM. All the computational tests reported here are taken with the default parameter file and ten runs for each TSP instance.

Table-7.3: Minimum, Average and Total run time for original serial LKH-2 software for each large size TSP instances

TSP Instance    Optimal Value    Average Root Gap    Minimum Time (Sec)    Average Time (Sec)    Total Time (Sec)

pcb3038 137694 0.00% 430 499 4993

fl3795 [28723,28772] 0.31% 5114 6473 64725

fnl4461 182566 0.09% 2460 2759 27594

rl5915 [565040,565530] 0.37% 3220 3329 33286

pla7397 23260728 0.00% 1280 1544 15440

Table-7.4: Minimum, Average and Total run time for Parallelized LKH-2 software using SPC3 PM for each large size TSP instances

TSP Instance    Optimal Value    Average Root Gap    Minimum Time (Sec)    Average Time (Sec)    Total Time (Sec)

pcb3038 137694 0.00% 279 365 540

fl3795 [28723,28772] 0.31% 3299 4474 7109

fnl4461 182566 0.09% 1507 1873 2675

rl5915 [565040,565530] 0.37% 1849 2420 3201

pla7397 23260728 0.00% 762 1107 1525

The following figure 7.11, based on tables 7.3 and 7.4, shows the comparison of minimum time between the original serial LKH-2 software and the LKH-2 software parallelized using SPC3 PM for the large size TSP instances. Similarly, figures 7.12 and 7.13 show the comparison of average and total time between the two versions for the large size TSP instances.


[Bar chart: Minimum Time (Time in Sec) vs. TSPLIB Instances pcb3038, fl3795, fnl4461, rl5915, pla7397; series: Original Serial LKH-2, Parallelized LKH-2]

Figure-7.11: Comparison of minimum run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the large size TSP instances

[Bar chart: Average Time (Time in Sec) vs. TSPLIB Instances pcb3038, fl3795, fnl4461, rl5915, pla7397; series: Original Serial LKH-2, Parallelized LKH-2]

Figure-7.12: Comparison of average run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the large size TSP instances

[Bar chart: Total Time (Time in Sec) vs. TSPLIB Instances pcb3038, fl3795, fnl4461, rl5915, pla7397; series: Original Serial LKH-2, Parallelized LKH-2]

Figure-7.13: Comparison of total run time between original serial LKH-2 software and parallelized LKH-2 software using SPC3 PM for the large size TSP instances


Figures 7.11, 7.12 and 7.13 give the same comparison for the large size TSP instances as was made for the medium size instances. The minimum, average and total execution times of the LKH-2 software parallelized using SPC3 PM are again found to be lower than those of the original serial LKH-2 software.

7.6 Summary

The results from this study show that SPC3 PM (Serial, Parallel and Concurrent Core to Core Programming Model) provides a simple, effective and scalable way to parallelize a given code, especially for irregular algorithms, and to make it suitable for multi-core processors. With the Concurrent and Parallel functions of SPC3 PM, the programmer can transform a given serial code into a parallel and concurrent executable form that makes the most of multi-core processors.

The Lin-Kernighan heuristic (LKH-2) for solving the Travelling Salesman Problem, which is generally considered to be one of the most effective methods for generating optimal or near-optimal solutions for the symmetric travelling salesman problem, is made even more effective and less time consuming by introducing parallelism and concurrency into the algorithm with the help of SPC3 PM.

Moreover, the new parallel and concurrent implementation of the algorithm was found to be much more scalable and suitable for multi-core processors.


CHAPTER 8

Conclusions and Future Work

In this chapter the findings of our PhD research are summarized. Based on these results and the various ideas that developed during the course of the PhD work, directions for future work are also specified.

8.1 Summary

Multi-core processors are becoming common, and they have built-in parallel computational power which can be fully utilized only if the program in execution is written accordingly. Multi-core processors differ from traditional parallel architectures in two respects: software design and architecture. Present parallel programming models and tools cannot be used directly for multi-core processors because of many considerations, such as memory and cache design, core interconnections, the level of parallelism supported, the design of threads, decomposition techniques, programming patterns, task and thread scheduling, operating system support, etc. Most software today proves inefficient on multi-core processors because it cannot take advantage of multiple cores due to insufficient parallelism and concurrency. Breaking up an application into a few tasks is not a long-term solution. In order to make the most of multi-core processors, one needs either a great deal of parallelism for efficient execution of a program on a large number of cores, or concurrent execution of multiple programs on multiple cores, or a hybrid of the two approaches. In short, multi-core processors can be utilized only with programming models designed specifically with the multi-core's architectural and software considerations in mind.

This thesis focuses on the development of a new parallel multi-threaded programming model for multi-core processors. This PhD research has contributed by proposing a new multi-level cache system for multi-core processors based on a binary tree data structure, in place of the present 3-level cache system, as an architectural improvement. This proposed cache system leads to two new cache models, the LogN and LogN+1 models, as discussed in chapter 3, and has also enabled us to earn a US patent approval. The models were first analysed and their performance compared with the existing 3-level cache models using a basic mathematical probabilistic approach. The results obtained indicated that, for higher numbers of cores, the proposed binary-tree-based multi-level cache system worked more efficiently, with a reduced overall average cache access time, than the present 3-level cache system, and that the performance gain increases as the number of cores increases. The proposed cache system also has less chance of being affected by the cache coherence problem because of its scalable and symmetric architecture: the cache load is well distributed, and no cache at any level is over-utilized. To further evaluate and compare the proposed LogN+1 and LogN models with the present 3-level cache system, queuing theory was used as a second analysis approach, as discussed in chapter 4. These models were analysed using the M/D/C/K-FIFO queuing model, and the related performance equations for the average access time of an individual cache and of the overall cache system, and the respective utilizations, were derived.

Besides, a queuing model for the present 3-level cache system was also developed. On comparison, the results obtained with the queuing model confirmed that the proposed cache models had lower access time, greater efficiency and better scalability than the present 3-level cache system.

A parallel trace-driven multi-level cache simulator was also developed as part of this PhD research, as discussed in chapter 5. This simulator is named MCSMC (Multi-level Cache Simulator for Multi-Cores). The simulator was validated against a standard cache simulator, CACTI, for the 3-level cache system, and then tested successfully for up to 2048 cores and 12 cache levels. The MCSMC simulator had to be developed in order to evaluate and analyse the proposed LogN+1 and LogN cache models in a realistic environment, since no suitable simulator was available to simulate such a large number of cores and cache levels. The simulation results obtained were again found to be in conformity with the results obtained using the first two approaches, confirming that the proposed cache models worked much better and had a much lower average access time than the 3-level cache system.


In addition to the above, a new parallel multi-threaded programming environment, SPC3 PM (Serial, Parallel and Concurrent Core to Core Programming Model) for multi-core processors, as discussed in chapter 6, has also been developed. SPC3 PM provides the common programmer with a multi-core programming environment and tools for scientific and general-purpose computing. It is a serial-like, task-oriented, multi-threaded parallel programming model for multi-core processors that enables developers to easily write new parallel code or convert existing code written for a single processor. The programmer can scale the code for a specified number of cores and ensure efficient task load balancing among them. SPC3 PM provides a set of rules for algorithm decomposition and a library of primitives that exploit parallelism and concurrency on multi-core processors. It also has many other unique features that distinguish it from existing parallel programming models. It supports both data and functional parallel programming, as well as nested parallelism, so one can easily build larger parallel components from smaller ones. A program written with SPC3 PM may be executed in serial, parallel or concurrent fashion using the Serial, Parallel and Concurrent functions of the SPC3 PM library, respectively. Besides, it exposes processor-core interaction to the programmer: using this feature, a programmer may assign any task, or a number of tasks, to any core or set of cores of a CMP.
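
The sketch below illustrates the programming style just described. The Task, Serial, Parallel and Concurrent names follow the thesis's terminology, but the signatures shown here are hypothetical assumptions for illustration, not the actual SPC3 PM API.

    #include <thread>
    #include <vector>
    #include <functional>
    #include <algorithm>

    // Hypothetical stand-ins for the Serial / Parallel / Concurrent calls.
    using Task = std::function<void(int /*core id*/)>;

    // Serial: run the task once on the calling thread.
    void Serial(const Task& t) { t(0); }

    // Parallel: run one instance of the task per available core.
    void Parallel(const Task& t) {
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned c = 0; c < n; ++c) pool.emplace_back(t, (int)c);
        for (auto& th : pool) th.join();
    }

    // Concurrent: launch the task toward a chosen core; a real
    // implementation would also set the thread's core affinity.
    std::thread Concurrent(const Task& t, int core) {
        return std::thread(t, core);
    }

    int main() {
        Task work = [](int core) { (void)core; /* decomposed task body */ };
        Serial(work);                          // serial execution
        Parallel(work);                        // one instance per core
        std::thread th = Concurrent(work, 2);  // assign the task to core 2
        th.join();
    }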

Performance and behavior of SPC3 PM have been analyzed on different classes of problems, including basic, complex, regular and irregular ones. Detailed results and performance analysis for two classes of problems are reported in this thesis: first, matrix multiplication, a basic and regular problem; and second, the Travelling Salesman Problem (TSP), a complex and irregular problem.

Performance of the SPC3 PM was first analyzed using the basic matrix multiplication algorithm. Matrix multiplication is considered a standard test algorithm because of its extensive computation and memory requirements and its broad use in all types of scientific and desktop applications. The basic matrix multiplication algorithm was coded using SPC3 PM rules and libraries. This code was executed in serial, parallel and concurrent fashion using the Serial, Parallel and Concurrent functions of the SPC3 PM library, and its performance was compared with related standard implementations (the JavaMath benchmark for serial execution and OpenMP for parallel and concurrent execution). Executed with the Serial function, the SPC3 PM code showed greater speedup than the C++ implementation of the algorithm in the JavaMath benchmark. With the Parallel function, the speedup obtained matched that of OpenMP; with concurrent execution, the speedup was much greater than that obtained from OpenMP. Besides, the matrix multiplication code written with SPC3 PM showed better scalability and load balancing on the available cores.
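
For reference, an OpenMP baseline for such comparisons typically amounts to parallelizing the outer loop of the classic triple-loop kernel. The snippet below is a generic example of that baseline, with the matrix size and loop order chosen for illustration; it is not the exact benchmark code used in the thesis.

    #include <omp.h>
    #include <vector>
    #include <cstdio>

    // Classic O(n^3) matrix multiplication with the outer loop split
    // across cores.  Each thread owns distinct rows of C, so no
    // synchronization is needed.
    void matmul(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int n) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {      // ikj order improves locality
                double a = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }

    int main() {
        int n = 512;                           // illustrative size
        std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
        matmul(A, B, C, n);
        std::printf("C[0] = %.1f\n", C[0]);    // expect 512.0
    }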

Performance of SPC3 PM was also evaluated on a large, multi-dimensional irregular problem. For this purpose, one of the classical problems of combinatorial optimization, the Travelling Salesman Problem (TSP), was selected, as discussed in chapter 7. Among the many solution methods for TSP, the Lin-Kernighan heuristic, which uses local-search optimization meta-heuristics, was chosen for parallelization with SPC3 PM. The LK heuristic is generally considered one of the most effective methods for generating optimal or near-optimal solutions for the symmetric travelling salesman problem. The standard serial code for the LK heuristic, the LKH-2 software, was parallelized using SPC3 PM rules and library to make it suitable for multi-core processors, and the result was compared with the original (serial) code. The results, based on medium and large instances from TSPLIB, showed that SPC3 PM provides a simple, effective and scalable way to parallelize a given code, especially for irregular algorithms, and to make it suitable for multi-core processors. The Lin-Kernighan heuristic (LKH-2) for solving TSP was made more effective and less time-consuming by introducing parallelism and concurrency into the algorithm with the help of SPC3 PM. Besides, the new parallel and concurrent implementations of the algorithm were found to be much more scalable and suitable for multi-core processors.
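
One common way to introduce concurrency into such a local-search heuristic is to run independent, differently seeded search trials on separate cores and keep the best tour found. The sketch below shows that generic pattern only; runLinKernighan is a hypothetical placeholder (here a trivial stub), and this is not the thesis's actual LKH-2 parallelization.

    #include <thread>
    #include <vector>
    #include <algorithm>
    #include <limits>
    #include <cstdio>

    struct Tour { double length = std::numeric_limits<double>::infinity(); };

    // Hypothetical placeholder for one seeded Lin-Kernighan search trial.
    Tour runLinKernighan(unsigned seed) {
        Tour t;
        t.length = 1000.0 + (seed * 37 % 100);   // stand-in for a real search
        return t;
    }

    // Run one independent trial per core and return the shortest tour.
    Tour bestOfConcurrentTrials(unsigned trials) {
        std::vector<Tour> results(trials);
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < trials; ++t)
            pool.emplace_back([&results, t] { results[t] = runLinKernighan(t); });
        for (auto& th : pool) th.join();
        return *std::min_element(results.begin(), results.end(),
            [](const Tour& a, const Tour& b) { return a.length < b.length; });
    }

    int main() {
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        Tour best = bestOfConcurrentTrials(n);
        std::printf("best tour length: %.1f\n", best.length);
    }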

8.2 Future Work

For the LogN+1 and LogN cache models, some performance enhancement may be envisaged because of their generalized and symmetric architecture. It can be further explored in future whether k-way, especially 2-way, set-associative cache mapping works better. For cache coherence, tree-based directory protocols such as full-map directories, limited directories and chained directories may be more efficient. The proposed cache system may also work much better for codes built around tree data structures. In addition, many of the efficient and intelligent techniques and algorithms developed for manipulating tree data structures, such as searching, insertion and deletion in a tree, may be applied, after slight modifications, to the respective operations in the proposed cache system.

The MCSMC produces its output as text files, and the results then have to be summarized manually; in future, a GUI may be developed to present the results in graphical form. Although MCSMC has been validated against CACTI, it may be further tuned against other standard cache simulators. MCSMC has been tested for up to 2048 cores and 12 cache levels; it may be upgraded for larger numbers of cores and cache levels. Besides, the simulator may be checked and tuned for other trace generators and cache replacement policies.

No programming model can be claimed complete and perfect, since programming languages, libraries, compilers and the related performance and debugging tools are continuously upgraded to meet new requirements. Similarly, SPC3 PM has been developed to meet user requirements for multi-core programming, and it may require upgrading as processor architectures and/or programming patterns change. Even in its current state, however, some additional synchronization tools may be added to SPC3 PM. Besides, the behaviour of SPC3 PM may be further evaluated on more complex and specialized problems.


Appendix A

A.1 Small TSP Instances (1-1000 cities)

Name        Number of cities   Type     Bounds
burma14     14                 GEO      3323
ulysses16   16                 GEO      6859
gr17        17                 MATRIX   2085
ulysses22   22                 GEO      7013
gr24        24                 MATRIX   1272
fri26       26                 MATRIX   937
bayg29      29                 GEO      1610
bays29      29                 GEO      2020
dantzig42   42                 MATRIX   699
swiss42     42                 MATRIX   1273
att48       48                 ATT      10628
gr48        48                 MATRIX   5046
hk48        48                 MATRIX   11461
eil51       51                 EUC 2D   426
berlin52    52                 EUC 2D   7542
brazil58    58                 MATRIX   25395
st70        70                 EUC 2D   675
eil76       76                 EUC 2D   538
pr76        76                 EUC 2D   108159
gr96        96                 GEO      55209
rat99       99                 EUC 2D   1211
kroA100     100                EUC 2D   21282
kroB100     100                EUC 2D   22141
kroC100     100                EUC 2D   20749
kroD100     100                EUC 2D   21294
kroE100     100                EUC 2D   22068
rd100       100                EUC 2D   7910
eil101      101                EUC 2D   629
lin105      105                EUC 2D   14379
pr107       107                EUC 2D   44303
gr120       120                MATRIX   6942
pr124       124                EUC 2D   59030
bier127     127                EUC 2D   118282
ch130       130                EUC 2D   6110
pr136       136                EUC 2D   96772
gr137       137                GEO      69853
pr144       144                EUC 2D   58537
ch150       150                EUC 2D   6528
kroA150     150                EUC 2D   26524
kroB150     150                EUC 2D   26130
pr152       152                EUC 2D   73682
u159        159                EUC 2D   42080
si175       175                MATRIX   21407
brg180      180                MATRIX   1950
rat195      195                EUC 2D   2323
d198        198                EUC 2D   15780
kroA200     200                EUC 2D   29368
kroB200     200                EUC 2D   29437
gr202       202                GEO      40160
ts225       225                EUC 2D   126643
tsp225      225                EUC 2D   3919
pr226       226                EUC 2D   80369
gr229       229                GEO      134602
gil262      262                EUC 2D   2378
pr264       264                EUC 2D   49135
a280        280                EUC 2D   2579
pr299       299                EUC 2D   48191
lin318      318                EUC 2D   42029
linhp318    318                EUC 2D   41345
rd400       400                EUC 2D   15281
fl417       417                EUC 2D   11861
gr431       431                GEO      171414
pr439       439                EUC 2D   107217
pcb442      442                EUC 2D   50778
d493        493                EUC 2D   35002
att532      532                ATT      27686
ali535      535                GEO      202310
si535       535                MATRIX   48450
pa561       561                MATRIX   2763
u574        574                EUC 2D   36905
rat575      575                EUC 2D   6773
p654        654                EUC 2D   34643
d657        657                EUC 2D   48912
gr666       666                GEO      294358
u724        724                EUC 2D   41910
rat783      783                EUC 2D   8806
dsj1000     1000               CEIL     18659688


A.2 Mid-sized TSP Instances (1000-2500 cities)

Name        Number of cities   Type     Bounds
pr1002      1002               EUC 2D   259045
si1032      1032               MATRIX   92650
u1060       1060               EUC 2D   224094
vm1084      1084               EUC 2D   239297
pcb1173     1173               EUC 2D   56892
d1291       1291               EUC 2D   50801
rl1304      1304               EUC 2D   252948
rl1323      1323               EUC 2D   270199
nrw1379     1379               EUC 2D   56638
fl1400      1400               EUC 2D   20127
u1432       1432               EUC 2D   152970
fl1577      1577               EUC 2D   [22204,22249]
d1655       1655               EUC 2D   62128
vm1748      1748               EUC 2D   336556
u1817       1817               EUC 2D   57201
rl1889      1889               EUC 2D   316536
d2103       2103               EUC 2D   [79952,80450]
gr2121      2121               MATRIX   2707
u2319       2319               EUC 2D   234256
pr2392      2392               EUC 2D   378032

A.3 Large TSP Instances (2501-10000 cities)

Name        Number of cities   Type     Bounds
pcb3038     3038               EUC 2D   137694
fl3795      3795               EUC 2D   [28723,28772]
fnl4461     4461               EUC 2D   182566
rl5915      5915               EUC 2D   [565040,565530]
rl5934      5934               EUC 2D   [554070,556045]
pla7397     7397               CEIL     23260728

A.4 Very Large TSP Instances (10000+ cities)

Name        Number of cities   Type     Bounds
rl11849     11849              EUC 2D   [920847,923368]
usa13509    13509              EUC 2D   [19947008,19982889]
brd14051    14051              EUC 2D   [468942,469445]
d15112      15112              EUC 2D   [1564590,1573152]
d18512      18512              EUC 2D   [644650,645488]
pla33810    33810              CEIL     [65913275,66116530]
pla85900    85900              CEIL     [141904862,142487006]


References

[1] M. A. Ismail, S. H. Mirza, T. Altaf, "A Parallel and Concurrent Implementation of Lin-Kernighan Heuristic (LKH-2) for Solving Travelling Salesman Problem for Multi-Core Processors using SPC3 Programming Model", International Journal of Advanced Computer Science and Applications, USA, Vol. 2(6), 2011.

[2] M. A. Ismail, S. H. Mirza, T. Altaf, “Concurrent Matrix Multiplication on Multi-Core Processors”, International Journal of Computer Science and Security (IJCSS), Vol. 5 (2), 2011.

[3] M. A. Ismail, S. H. Mirza, T. Altaf, "Binary Tree Based Multi-Level Cache System for Multi-Core Processors", in Proc. of the International Conference on High Performance Computing, Networking and Communication Systems (HPCNCS'09), Orlando, USA, July 13-16, 2009.

[4] M. A. Ismail, S. H. Mirza, T. Altaf, "Design of a Cache Hierarchy for LogN and LogN+1 Model for Multi-Level Cache System for Multi-Core Processors", in Proc. of Frontiers of Information Technology (FIT'09), ACM, CIIT, Abbottabad, Pakistan, 2009.

[5] C. G. Cassandras, S. Lafortune, “Introduction to Discrete Event Systems”, Springer, 2007.

[6] C. Kachris, C. Kulkarni, “Configurable Transactional Memory”, International Symposium on Field-Programmable Custom Computing Machines, 2007, pp 65-72.

[7] E. Cordeiro, S. Stefani, I. G. A. Soares, T. Martins, “DCMSim: Didactic Cache Memory Simulator”, Frontiers in Education Conference – FIE 2003, 2003, pp. F1C14-F1C19.

[8] D. Bader, V. Kanade, K. Madduri, "SWARM: A parallel programming framework for multi core processors", IPDPS 2007, IEEE International, 2007.

[9] D. Gross, J.F. Shortle, J.M. Thompson, “Fundamentals of queuing theory”, John Wiley & Sons, 2008.

[10] D. Spinellis, “Code Quality: The Open Source Perspective”, Addison Wesley, 2007.

[11] D. Thiebaut, J. L. Wolf, and H. S. Stone, “Synthetic Traces for Trace-Driven Simulation of Cache Memories”, IEEE Transactions on Computers.1992.

[12] E. Berg, E. Hagersten, S. Cache, “A Probabilistic Approach to Efficient and Accurate Data Locality Analysis”, IEEE Symposium on Performance Analysis of Systems and Software, 2004.

[13] E. Sorenson, J. K. Flanagan, “Evaluating Synthetic Trace Models using Locality Surfaces”, IEEE Workshop on Workload Characterization, 2002.

[14] G. Bolch, S.Greiner, H. Meer, K.S. Trivedi, “Queueing Networks and Markov Chains”, Chapter 7, John Wiley & Sons, 2007.

[15] G. Giambene, “Queuing Theory and Telecommunications: Networks and Applications”, Chapter 6, Springer, 2005.

[16] G. S. Brodal, R. Fagerberg, R. Jacob, "Cache Oblivious Search Trees via Binary Trees of Small Height", in Proc. of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002.


[17] G. Smit, A.Kokkeler, P. Wolkotte, “Multi core architectures and streaming applications”, SLIP’08, Newcastle UK, 2008.

[18] J. Gibson, R. Kunz, D. Ofelt, M. Horowitz, J. Hennessy, M. Heinrich, “FLASH vs. (simulated) FLASH: closing the simulation loop”, in proc. of 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2000, pp. 49–58.

[19] H. El-Rewini, M. Abd-El-Barr, "Advanced Computer Architecture and Parallel Processing", Wiley, 2005.

[20] H. Wang, D. Tang, Xiang Gao, “An Enhanced Hyper Transport Controller with Cache Coherence Support for Multiple-CMP”, in proc. of international Conference on Networking, Architecture, and Storage, IEEE Computer Society, 2009.

[21] Dinero IV: M. D. Hill, Dinero IV trace-driven cache simulator, http://www.cs.wisc.edu/~markhill/DineroIV/ [Nov, 2009].

[22] Bayesian Theorem: http://en.wikipedia.org/wiki/baysian [Nov, 2009].

[23] SiCortex: http://sicortex.com/products [Nov, 2009].

[24] Ambric: http://www.ambric.com/products_am2045_overview.php [Nov, 2009].

[25] AMD: http://www.amd.com/uk/products/Pages/graphics.aspx [Jan, 2011].

[26] AMD: http://www.amd.com/uk/products/Pages/processors.aspx [Jan, 2009].

[27] Azulsystems: http://www.azulsystems.com/products/compute_appliance.htm [Nov, 2009].

[28] Cavium Networks: http://www.caviumnetworks.com/Table.html [Nov, 2009].

[29] FreeScale: http://www.freescale.com/webapp/sps/site/homepage.jsp [Nov, 2009].

[30] Intel: http://www.intel.com/products [Jan, 2011].

[31] Intellasys: http://www.intellasys.net/index.php [Nov, 2009].

[32] NVidia: http://www.nvidia.com/page/products.html [Nov, 2009].

[33] Picochip: http://www.picochip.com/products_and_technology/multicore_dsp [Nov, 2009].

[34] Plurality: http://www.plurality.com/products.html [Nov, 2009].

[35] Sun Systems: http://www.sun.com/servers/index.jsp [Nov, 2009].

[36] Tilera: http://www.tilera.com/products/processors.php [Nov, 2009].

[37] IBM: http://www-03.ibm.com/press/us/en/pressrelease/19508.wss [Nov, 2009].

[38] IBM: http://www-03.ibm.com/systems/power/ [Nov, 2009].

[39] J. Chang and G. S. Sohi. “Cooperative cache partitioning for chip multiprocessors”, in proc. of ICS’07. 2007.

[40] J. L. Hennessy, D. A. Patterson, “Computer Architecture: A Quantitative Approach”, Morgan Kauffman Publishers. 2003.


[41] J. Lin, Q. Lu, X. Ding, Z. Zhang, and P. Sadayappan., “Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems”, in proc. of Int. Symp. on High Performance Computer Architecture, , Salt Lake City, UT, USA. 2008, pp. 367-378

[42] J. Spirn, “Program Behavior Models and Measurements”, Elsevier, 1977.

[43] J. Tao, M. Kunze, F. Nowak, R. Buchty, "Performance Advantage of Reconfigurable Cache Design on Multicore Processor Systems", International Journal of Parallel Programming, vol. 36, Springer, 2008, pp. 346-360.

[44] J. Tao, M. Kunze, W. Karl, “Evaluating the Cache Architecture of Multicore Processors”, in proc. of 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, IEEE Computer Society, 2008.

[45] J. Yun, W. Zang, “Hybrid multi core architecture for boosting single threaded performance”, ACM SIGARCH Computer Architecture, vol. 35, 2007.

[46] K. Grimsrud, J. Archibald, R. Frost, B. Nelson, “On The Accuracy of Memory Reference Models”, in proc. of Seventh International Conference on Modeling Techniques and Tools for Computer Performance Evaluation. 1997.

[47] L. Barroso, K. Gharachorloo, R. McNamara,”A scalable architecture based on single-chip multiprocessing”, SIGARCH Computer Architecture, vol. 28(2), 2000, pp. 282–293.

[48] L. Eeckhout, K. De Bosschere, H. Neefs, “Performance Analysis Through Synthetic Trace Generation”, in proc. of IEEE Symposium on Performance Analysis of Systems and Software, IEEE. 2000.

[49] M. N. Luiza, J. Leandro, D. Mendes, S. Martins, “MSCSim –Multilevel and Split Cache Simulator”, in proc. of 36th ASEE/IEEE Frontiers in Education Conference, IEEE, San Diego, CA, 2006.

[50] A. Agarwal, M. Horowitz, and J. Hennessy, “An Analytical Cache Model”, ACM Transactions on Computer Systems, 1989.

[51] T. Austin, E. Larson, D. Ernst, “SimpleScalar: an infrastructure for computer system modeling”. Computer 35(2), 2002, pp. 59–67.

[52] M. Brehob and R. Enbody, “An Analytical Model of Locality and Caching”, Technical Report, Michigan State University, 1999.

[53] M. Kämpe, F. Dahlgren, “Exploration of the Spatial Locality on Emerging Applications and the Consequences for Cache Performance”, in proc. of 14th International Parallel and Distributed Processing Symposium, IPDPS, Cancun, 2000.

[54] M. Pericas, A.Cristal and F. Cazorla, “A flexible Heterogeneous multi core architecture”, in proc. of 16th international conference of parallel architecture and compilation techniques, 2007.

[55] P.S. Magnusson, B. Werner, “Efficient Memory Simulation in SimICS”, in proc. of 8th Annual Simulation Symposium. Phoenix, Arizona, USA, 1995.

[56] M. D. Marino, “32-core CMP with multi-sliced L2: 2 and 4 cores sharing a L2 slice”, in proc. of Computer Architecture and High Performance Computing, SBAC-PAD '06, Ouro Preto. 2006.


[57] P. Denning, S. Schwartz, "Properties of the Working-Set Model", Communications of the ACM, 1972.

[58] P. Kongetira, K. Aingaran, K. Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor", IEEE Micro, vol. 25(2), 2005.

[59] R. Hassan, A. Harris, N. Topham, “Synthetic Trace-Driven Simulation of Cache Memory”, in proc. of 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07), 2007.

[60] R. L. Mattson, J. Gecsei, D. R. Slutz, I. L. Traiger, “Evaluation Techniques for Storage Hierarchies”, IBM System Journal, 1970.

[61] R. Ladner, J. Fix, and A. LaMarca, “The cache performance of traversals and random accesses”. in proc. of 10th Ann. Symp. Discrete Algorithms (SODA-99), Baltimore, MD, ACM-SIAM, 1999, pp. 613–622.

[62] S. A. Herrod, “Using Complete Machine Simulation to Understand Computer System Behavior”. Ph.D. thesis, Stanford University, 1998.

[63] S. Cho and L. Jin. “Managing distributed shared L2 caches through OS-level page allocation”. in proc. of MICRO’06, 2006, pp. 455–468.

[64] S. Jose, “From single core to multi-core: preparation for a new exponential”, in proc. of ICCAD’06, CA. 2006.

[65] S. K. Moore, “Multicore is bad news for supercomputers”, IEEE spectrum, 2008.

[66] T. Li, D. Baumberger, S. Hahn, “Efficient operating systems scheduling for performance asymmetric multi core architectures”, in proc. of SC’07 Reno, USA, 2007.

[67] T. Mattson, G. Henry, “An overview of the Intel TFLOPS super computer”, Intel Technology journal, 2008.

[68] “The core of the issue: multi-core and you”, Linux magazine November, 2007.

[69] Tomasevic, “A study of snoopy cache coherence protocols”, in proc. of Twenty-Fifth International Conference on System Sciences, Hawaii, 1992.

[70] R. A. Uhlig, “Trace-Driven Memory Simulation: A Survey”, Intel Microcomputer Research Lab, University of Michigan, Ann Arbor, MI, 2008.

[71] Cachegrind: a cache-miss profiler, part of the Valgrind tool suite, http://valgrind.org/.

[72] X. Zhang, S. Dwarkadas, Kai Shen, “Towards practical page coloring-based multicore cache management”, in proc. of 4th ACM European conference on Computer systems, Germany, 2009, pp. 89-102.

[73] NEDUET: www.neduet.edu.pk\cise\research [May, 2011].

[74] Malcache: http://pages.cs.wisc.edu/~arch/www/ [Mar, 2010].

[75] SMP Cache: http://arco.unex.es/smpcache/ [Mar, 2010].

[76] Intel Software: http://software.intel.com/en-us/ [Nov, 2010].


[77] M. Feldman, "Intel Flexes Parallel Programming Muscles ", HPC Wire, Sept.2, 2010

[78] A. Ghuloum, "Ct: channelling NeSL and SISAL in C++", in Proc. of the 4th ACM SIGPLAN Workshop on Commercial Users of Functional Programming (CUFP'07), 2007.

[79] Intel Research: http://techresearch.intel.com/ResearchAreaDetails.aspx [Nov, 2010].

[80] Intel Software: http://software.intel.com/en-us/articles/intel-cilk-plus [Nov, 2010].

[81] Intel Software: http://software.intel.com/en-us/articles/intel-array-building-blocks [Nov, 2010].

[82] Intel TBB: http://threadingbuildingblocks.org [Nov, 2010].

[83] Intel Software: http://software.intel.com/en-us/articles/intel-parallel-studio-home/ [Nov, 2010].

[84] A. Chandramowlishwaran, K. Knobe, R. Vuduc, "Performance evaluation of concurrent collections on high-performance multicore computing systems", in Proc. IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010, Atlanta, GA.

[85] C. E. Leiserson, "The Cilk++ concurrency platform", Journal of Supercomputing, Vol. 51(3), 2010, pp. 244-257.

[86] W. Lei,C. Han, "Application of parallel ant colony algorithm based on TBB and Cilk++ in path optimization", Journal of Computer Applications, Vol. 10, 2010.

[87] J. Reinders, "Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism", O'Reilly, 2007.

[88] A. Robison, M. Voss, A. Kukanov, "Optimization via Reflection on Work Stealing in TBB", in Proc. IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2008,Miami, FL .

[89] W. Kim, M. Voss, "Multicore Desktop Programming with Intel Threading Building Blocks," IEEE Software, vol. 28, no. 1, pp. 23-31, January/February, 2011.

[90] M. M.T. Chakravarty, G. Keller, S. Lee, "Accelerating Haskell array codes with multicore GPUs", In Proc. of sixth workshop on Declarative aspects of multicore programming, DAMP '11, Jan 2010. Austin USA.

[91] Intel Software: http://software.intel.com/en-us/articles/intel-parallel-amplifier [Nov, 2010].

[92] Intel Software: http://software.intel.com/en-us/articles/intel-parallel-inspector [Nov, 2010].

[93] Intel Software: http://software.intel.com/en-us/articles/intel-parallel-composer [Nov, 2010].

[94] Intel Software: http://software.intel.com/en-us/articles//intel-parallel-advisor [Nov, 2010].

[95] Microsoft: http://www.microsoft.com/en-us/default.aspx [Nov, 2010].

[96] Microsoft: http://msdn.microsoft.com/en-us/vstudio/default [Nov, 2010].

[97] L. Issam, "Concurrency Runtime (CRT): The Task Scheduler", Dr. Dobbs Journal, September 20, 2010.

[98] Microsoft: http://msdn.microsoft.com/en-us/library/dd504870.aspx [Nov, 2010].


[99] Microsoft: Axum Programmer's Guide, Microsoft. [Nov, 2010].

[100] Microsoft: http://msdn.microsoft.com/en-us/devlabs/dd795202 [Dec, 2010].

[101] Microsoft: http://msdn.microsoft.com/en-us/library/dd460688.aspx [Dec, 2010].

[102] M. Frigo, C. E. Leiserson, and K. H. Randall, "The Implementation of the Cilk-5 Multithreaded Language", in Proc. ACM SIGPLAN 1998 Conf. on Programming Language Design and Implementation (PLDI '98), ACM Press, 1998, pp. 212-223.

[103] M. Frigo et al., “Reducers and Other Cilk++ Hyperobjects,” in Proc. 21st Ann. Symp. Parallelism in Algorithms and Architectures, ACM Press, 2009, pp. 79–90.

[104] “A Quick, Easy and Reliable Way to Improve Threaded Performance: Intel Cilk Plus”, Intel, 2010; http://software.intel.com/en-us/articles/intel-cilk-plus.

[105] “Sophisticated Library for Vector Parallelism: Intel Array Building Blocks”, Intel, 2010; http://software.intel.com/en-us/articles/intel-array-building-blocks.

[106] A. Ghuloum et al., “Future-Proof Data Parallel Algorithms and Software on Intel Multi- core Architecture” , Intel Technology J., vol. 11, no. 4, 2007, pp. 333–347.

[107] P. Krill, "Multicore: New Chips Mean New Challenges", Infoworld Nov 5, 2008, http://pcworld.about.net/od/softwareservices/Multicore-New-Chips-Mean-New.htm

[108] Sun Microsystems: https://www.sun.com [Nov, 2010].

[109] S. Halloway, "Programming Clojure", Pragmatic Bookshelf, 1st ed., 2009.

[110] Clojure: http://clojure.org/ [Nov, 2010].

[111] M. Odersky, L. Spoon, B. Venners, "Programming in Scala", Artima, 2nd ed., 2008.

[112] http://www.scala-lang.org/ [Nov, 2010].

[113] B. Q. Brode and C. R.Warber, "DEEP: a development environment for parallel programs", citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.36.1185&rep

[114] http://www.xconomy.com/boston/2007/11/09/cilk-arts-commercializes-mits-approach-to- parallel-programming/ [Nov, 2010].

[115] Coverity: www.coverity.com [Nov, 2010].

[116] Fortify: https://www.fortify.com/ [Nov, 2010].

[117] R. Dolbeau, S. Bihan, F. Bodin,"HMPP: A Hybrid Multi-core Parallel Programming Environment",www.caps-entreprise.com/.../caps-hmpp-gpgpu-Boston-Workshop-Oct-2007. pdf. [Nov, 2010].

[118] HMPP: http://www.pathscale.com/HMPP [Nov, 2010].

[119] SureLogic: http://www.surelogic.com/ [Nov, 2010].

[120] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, "Introduction to the Cell multiprocessor", IBM Journal of Research and Development, July 2005, Vol. 49, Issue 4.5, pp. 589-604.


[121] IBM, https://www-01.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine [Nov, 2010].

[122] Nvidia: http://www.nvidia.com/object/cuda_opencl_new.html [Nov, 2010].

[123] R. Tsuchiyama, T. Nakamura, T. Iizuka, A. Asahara, S. Mik, The OpenCL Programming Book, Fixstars Corporation, 2010

[124] Tilera: http://www.tilera.com/ [Nov, 2010].

[125] Plurality: http://www.plurality.com/programmingModel.html [Nov, 2010].

[126] Quick Threads: http://www.quickthreadprogramming.com/ [Nov, 2010].

[127] A Comparative Analysis between Quick Thread and Intel® Threading Building Blocks (TBB) http://www.quickthreadprogramming.com/ComparativeanalysisbetweenQuickThreadandInt elThreadingBuildingBlocks20009.htm

[128] Comparison between QuickThread and OpenMP 3.0 under system load ; http://www.quickthreadprogramming.com/ComparisonbetweenQuickThreadandOpenMP3.0 undersystemloads.htm.

[129] D. A. Bader, V. Kanade, K. Madduri, "SWARM: A Parallel Programming Framework for Multicore Processors", IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007, p. 491.

[130] D. Bader. ''SWARM: A parallel programming framework for multicore processors'', https://sourceforge.net/projects/multicore-swarm, 2006.

[131] SWARM: http://multicore-swarm.sourceforge.net/#introduction [Nov, 2010].

[132] D. A. Bader and J. JáJá, "SIMPLE: A methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs)", Journal of Parallel and Distributed Computing, vol. 58(1), 1999, pp. 92-108.

[133] M. Fluet, L. Bergstrom, "Programming in Manticore, a Heterogeneous Parallel Functional Language", Lecture Notes in Computer Science, Volume 6299/2010, 2010, pp. 94-145.

[134] M. Fluet, M. Rainey, J. H. Reppy, A. Shaw, Y. Xiao, "Manticore: a heterogeneous parallel language", In Proc. of the Workshop on Declarative Aspects of Multicore Programming (DAMP 2007), January 2007, Nice, France, 37-44.

[135] M. Fluet, M. Rainey, J.H. Reppy, A. Shaw, "Implicitly-threaded parallelism in Manticore", In Proc. of the 13th ACM SIGPLAN International Conference on Functional Programming (ICFP 2008), September 2008.Victoria, BC, Canada, pp. 119-130.

[136] L. Bergstrom, J. Reppy, "Arity Raising in Manticore.", In International Symposia on Implementation and Application of Functional Languages (IFL 2009), Volume 6041 of Lecture Notes in Computer Science, pages 90-106, New York, NY, 2009.

[137] PARMA: http://www.parma-itea2.org/ [Nov, 2010].

[138] SMOKE: http:// www.gamasutra.com/view/feature/3861/performance_scaling_with_cores_ .php [Nov, 2010].

[139] J. Armstrong, "A history of Erlang", in Proc. HOPL III, the third ACM SIGPLAN Conference on History of Programming Languages, ACM, New York, NY, USA, 2007.

[140] J. Armstrong, ''Programming Erlang: Software for a Concurrent World''. The Pragmatic Bookshelf, Raleigh, NC , 2007

[141] J. Armstrong , ''Erlang'', Communications of the ACM, Vol. 53(9), September 2010 .

[142] J. Zhang, Characterizing the Scalability of Erlang VM on Many-core Processors, January 20, 2011

[143] R. Raghuraman, "n-Core Encore?", IEEE Potentials, Vol. 29(6), 2010, pp. 39-41.

[144] F. Garcia, J. Fernandez, ''POSIX Thread Libraries'', Linux Journal, Vol. 2000 (70es), 2000.

[145] Ian K.T. Tan, I. Chai, P. K. Hoong, " Pthreads Performance Characteristics on Shared Cache CMP, Private Cache CMP and SMP," in Proc. Second International Conference on Computer Engineering and Applications, ICCEA, vol. 1, 2010, pp.186-191.

[146] M. Aldinucci, M. Meneghin, M. Torquati, "Efficient Smith-Waterman on Multi-core with FastFlow," in Proc. of 18th Euromicro Conference on Parallel, Distributed and Network- based Processing, 2010, pp.195-199.

[147] N. Geoffray, G. Thomas, J. Lawall, G. Muller, B. Folliot , ''VMKit: a substrate for managed runtime environments'' , in Proc. of the 6th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, ACM, NY, USA, 2010.

[148] M. Schindewolf, O. Mattes, W. Karl, ''Thread Creation for Self-aware Parallel Systems'', Facing the Multicore-Challenge, Lecture Notes in Computer Science, Vol. 6310, 2011, pp. 42-53.

[149] C. Terboven, D. Mey, S. Sarholz," OpenMP on Multicore Architectures",In Proc. of 3rd International Workshop on OpenMP, IWOMP 2007, Beijing, China, June 3-7, 2007. pp. 54- 64.

[150] J. Tao, K. D. Hoàng W. Karl, "CMP Cache Architecture and the OpenMP Performance", In Proc. of 3rd International Workshop on OpenMP, IWOMP 2007, Beijing, China, June 3-7, 2007. pp 77-88.

[151] F. Broquedis, F. Diakhaté, S.Thibault, et al.,"Scheduling Dynamic OpenMP Applications over Multicore Architectures", In Proc. of 4th International Workshop, IWOMP 2008 West Lafayette, IN, USA, May 12-14, 2008, pp. 170-180.

[152] A.Duran, J. M. Perez, E. Ayguadé, R. M. Badia , J. Labarta, "Extending the OpenMP Tasking Model to Allow Dependent Tasks", In Proc. of 4th International Workshop, IWOMP 2008 West Lafayette, IN, USA, May 12-14, 2008, pp. 111-121

[153] T. Hanawa, M. Sato, J. Lee, T. Imada, H. Kimura, et al., "Evaluation of Multicore Processors for Embedded Systems by Parallel Benchmark Program Using OpenMP", in Proc. of 5th International Workshop on OpenMP, IWOMP 2009, Dresden, Germany, June 3-5, 2009, pp. 15-27.

[154] C. Liao, D. J. Quinlan, J. J. Willcock, T. Panas, "Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore", in Proc. of 5th International Workshop on OpenMP, IWOMP 2009, Dresden, Germany, June 3-5, 2009, pp. 28-41.

[155] K. Fürlinger, D. Skinner, "Performance Profiling for OpenMP Tasks", In Proc. of 5th International Workshop on OpenMP, IWOMP 2009 Dresden, Germany, June 3-5, 2009. pp. 132-139.

[156] E. Ayguade, R.M. Badia, D. Cabrera, A. Duran, M. Gonzalez, et al., "A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures", In Proc. of 5th International Workshop on OpenMP, IWOMP 2009 Dresden, Germany, June 3-5, 2009. pp. 157-167.

[157] P. Carribault, M.Pérache, H. Jourdren, "Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC", In Proc. of 6th International Workshop on OpenMP, IWOMP 2010, Tsukuba, Japan, June 14-16, 2010, pp. 14-21.

[158] S. Ohshima, S. Hirasawa H. Honda, "OMPCUDA : OpenMP Execution Framework for CUDA Based on Omni OpenMP Compiler", In Proc. of 6th International Workshop on OpenMP, IWOMP 2010, Tsukuba, Japan, June 14-16, 2010, pp. 161-173.

[159] J. Yan, J. He, W. Han, W. Chen, W. Zheng, "How OpenMP Applications Get More Benefit from Many-Core Era", In Proc. of 6th International Workshop on OpenMP, IWOMP 2010, Tsukuba, Japan, June 14-16, 2010, pp. 83-95.

[160] ''OpenMP Application Program Interface Ver. 3.0'', OpenMP Architecture Review Board, May 2008; www.openmp.org/mp-documents/spec30.pdf.

[161] S. Vinoski, "Concurrency with Erlang", IEEE Internet Computing, IEEE, 2007, pp. 90-94.

[162] MCA: http://www.multicore-association.org/home.php [Mar, 2011].

[163] L. Markus, M. Thomas, ''Embedded Multicore Processors and Systems'', Micro, IEEE, vol. 29(3), 2009, pp. 7 - 9.

[164] C. Elwakil , Z. Yang , ''Debugging support tool for MCAPI applications'', in proc. of 8th Workshop on Parallel and Distributed Systems, PADTAD '10: Testing, Analysis, and Debugging, ACM, NY, USA, 2010.

[165] S. Gal-On, M. Levy, "Measuring Multicore Performance", IEEE Computer, Vol. 41(11), 2008, pp. 99-102.

[166] M. Elwakil, Z. Yang, L. Wang, ''CRI: Symbolic Debugger for MCAPI Applications'', Lecture Notes in Computer Science, vol. 6252, 2010, pp. 353-358.

[167] OpenMP: http://www.openmp.org/ [Mar, 2011].

[168] T. El-Ghazawi, W. Carlson, T. Sterling, K. Yelick, ''UPC: Distributed Shared Memory Programming'', Hoboken, NJ: Wiley-Interscience, 2005.

[169] Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan, “Sequoia: Programming the memory hierarchy,” in Proc. Supercomputing 2006, Tampa Bay, FL, Nov. 2006.

[170] Sequoia: http://sequoia.stanford.edu [Nov, 2010].

[171] R. W. Numrich, J. Reid, "Co-arrays in the next Fortran standard", ACM SIGPLAN Fortran Forum, vol. 24(2), 2005, pp. 4-17.

[172] K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, A. Aiken, “Titanium: A high performance Java dialect”, Concurrency Pract. Exp., vol. 10(13), 1998, pp. 825–836.

[173] W. Thies, M. Karczmarek, and S. Amarsinghe, “StreamIt: A language for streaming applications,” in Proc. 11th Int. Conf. Compiler Construction, Grenoble, France, 2002, pp. 179–196.

[174] B. L. Chamberlain, D. Callahan, and H. P. Zima, “Parallel programmability and the Chapel language,” Int. J. High Perform. Comput. Applicat., vol. 21(3), 2007, pp. 291–312.

[175] E. Allen, D. Chase, C. Flood, V. Luchangco, J. Maessen, S. Ryu, and G. L. Steele, “Project Fortress: A multicore language for multicore processors,” Linux Mag., 2007, pp. 38–43.

[176] P. Charles, C. Donowa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, and V. Sarkar, “X10: An object-oriented approach to non-uniform cluster computing,” in Proc. 20th Conf. Object Oriented Programming Systems Languages and Applications, San Diego, CA, 2005, pp. 519–538.

[177] R. A. Sciampacone, V. Sundaresan, D. Maier, T. Gray-Donald, "Exploitation of multicore systems in a Java virtual machine'', IBM Journal of Research and Development, Vol. 54(5), 2008.

[178] C. G. Baker, M. A. Heroux, H. Carter, A. B. Williams, ''A Light-weight API for Portable Multicore Programming'', in proc. of 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, 2010.

[179] G. Blake, R. G. Dreslinski, T. Mudge, ''A Survey of Multicore Processors'', IEEE SIGNAL PROCESSING MAGAZINE , 2009, pp. 25-37.

[180] J. L. Manferdelli, N.K. Govindaraju, C. Crall, ''Challenges and Opportunities in Many-Core Computing'', in Proc. of the IEEE, Vol. 96(5), 2008, pp. 808-816.

[181] D. Geer, ''Multicore Chips Mean Multiple Challenges'', IEEE Computer Society, Sep. 2007.

[182] M. Mehrara, T. Jablin, D. Upton, D. August, K. Hazelwood, S. Mahlke, "Multicore Compilation Strategies and Challenges", IEEE Signal Processing Magazine, Nov. 2009.

[183] J. Larson, ''Erlang for concurrent programming'', Commun. ACM, 2009.

[184] J. Larus, ''Spending Moore’s dividend'', Commun. ACM , 2009.

[185] R. Brightwell, M. Heroux, Z. Wen, J. Wu, ''Parallel Phase Model: A Programming Model for High-end Parallel Machines with Manycores'', in proc. of International Conference on Parallel Processing, 2009.

[186] M.D. Hill and M.R. Marty, ‘‘Amdahl’s Law in the Multicore Era,’’ Computer, vol. 41(7), 2008, pp. 33-38.

[187] M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, D. I. August, "Revisiting the Sequential Programming Model for Multi-Core", in Proc. of MICRO 40, the 40th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, Washington, DC, USA, 2007.

[188] G. Ottoni, R. Rangan, A. Stoler, D. I. August, ''Automatic thread extraction with decoupled software pipelining'', In proc. of 38th International Symposium on Micro-architecture, 2005.

[189] G. Ottoni, R. Rangan, A. Stoler, M. J. Bridges, and D. I. August, ''From sequential programs to concurrent threads'', IEEE Computer Architecture Letters, 4, June 2005.

[190] M. K. Prabhu and K. Olukotun, ''Exposing speculative thread parallelism in SPEC2000''. In proc. of Symposium on Principles and Practice of Parallel Programming, 2005.

[191] N. Vachharajani, Y. Zhang and T. Jablin," Revisiting the sequential programming model for the multicore era", IEEE MICRO, JANUARY–FEBRUARY 2008.

[192] M.J. Bridges et al., ‘‘Revisiting the Sequential Programming Model for Multi-Core’’, in proc. of Int’l Symp. Microarchitecture (MICRO 07), IEEE CS Press, 2007, pp. 69-81.

[193] M. D. McCool, “Scalable Programming Models for Massively Multicore Processors", in proc. of the IEEE, Vol. 96(5), 2008.

[194] A. Agarwal, S. Brehmer, M. Domeika, P. Griffin, F. Schirrmeister, "Software Standards for the Multicore Era", Embedded Multicore Processors and Systems, IEEE Micro, 2009.

[195] Diego Andrade, Basilio B. Fraguela, ''Task-parallel versus data-parallel library-based programming in multicore systems'', in proc. of Parallel, Distributed and Network-based Processing, 2009

[196] E. Ayguade, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, G. Zhang, ''The Design of OpenMP Tasks'', IEEE transactions on parallel and distributed systems, vol. 20(3), 2009.

[197] “Intel 64 and IA-32 Architectures Software Developer’s Manual,” Intel Developer Manuals, vol. 3A, Nov. 2008.

[198] Advanced Micro Devices Inc., "Software optimization guide for AMD family 10h processors", AMD White Papers and Technical Documents, Nov. 2008 [Online]. Available: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf [Nov, 2010].

[199] J. Larus and H. Sutter, "Software and the concurrency revolution", ACM Queue, vol. 3, no. 7, pp. 54-62, Sep. 2005.

[200] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, "The landscape of parallel computing research: A view from Berkeley", Tech. Rep. UCB/EECS-2006-183, 2006.

[201] M. Azimi, N. Cherukuri, D. N. Jayasimha, A. Kumar, P. Kundu, S. Park, I. Schoinas, and A. S. Vaidya, "Integration challenges and tradeoffs for terascale architectures", Intel Technol. J., vol. 11, no. 3, 2007.

[202] D. B. Skillicorn and D. Talia, "Models and languages for parallel computation", ACM Comput. Surv., vol. 30, no. 2, pp. 123-169, Jun. 1998.

[203] M.-Y. Wu and W. Shu, "MIMD programs on SIMD architectures", in Proc. IEEE 6th Symp. Frontiers Massively Parallel Comput. (FRONTIERS '96), Washington, DC, 1996, p. 162.

[204] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using data parallelism to program GPUs for general-purpose uses", in Proc. ACM Conf. Architect. Support Program. Lang. Oper. Syst., Oct. 2006; Microsoft Tech. Rep. 2005-184.

[205] M. D. McCool, "Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform", in Proc. GSPx Multicore Applicat. Conf., Oct.-Nov. 2006.

[206] A. Ghuloum, E. Sprangle, and J. Fang, "Flexible parallel programming for tera-scale architectures with Ct", Intel White Paper, Apr. 26, 2007.

[207] M. Creeger, "Multicore CPUs for the Masses", ACM Queue, vol. 3, no. 7, 2005, pp. 63-64.

[208] A. C. McKellar and E. G. Coffman, Jr., "Organizing matrices and matrix operations for paged memory systems", Commun. ACM, vol. 12, no. 3, pp. 153-165, 1969.

[209] S. Akhter, J. Roberts, "Multi-Core Programming: Increasing Performance through Software Multi-threading", Intel Press, 2006.

[210] B. Chapman, G. Jost, R. van der Pas, "Using OpenMP, portable Shared Memory Parallel Programming", MIT Press, 2008.

[211] S. Robinson, "Toward an Optimal Algorithm for Matrix Multiplication", SIAM News, vol. 38(9), 2005. http://www.siam.org/pdf/news/174.pdf [Nov, 2010].

[212] Netlib: http://www.netlib.org/blas/ [Nov, 2010].

[213] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, "Introduction to Algorithms", Second Edition, MIT Press and McGraw-Hill, 2001, ISBN 0-262-03293-7, Chapter 28, Section 28.2: Strassen's algorithm for matrix multiplication, pp. 735-741.

[214] H. Cohn, R. Kleinberg, B. Szegedy, C. Umans. ‘’Group-theoretic Algorithms for Matrix Multiplication’’, in proc. of the 46th Annual Symposium on Foundations of Computer Science, 23–25 October 2005, Pittsburgh, PA, IEEE Computer Society, pp. 379–388.

[215] Microsoft: http://msdn.microsoft.com/en-us/library/dd492418.aspx [Nov, 2010].

[216] UPCRC: http://www.upcrc.illinois.edu/ [Nov, 2010].

[217] HIPEAC: http://www.hipeac.net/ [Nov, 2010].

[218] UPMARC: http://www.it.uu.se/research/upmarc [Nov, 2010].

[219] G. Goumas, "Performance evaluation of the sparse matrix-vector multiplication on modern architectures", Journal of Supercomputing, pp. 1-42, Nov. 2008.

[220] R. Vuduc, H. Moon, "Fast sparse matrix-vector multiplication by exploiting variable block structure", Lecture notes in computer science, vol. 3726, pp. 807-816, 2005.

[221] H. T. Kung, C. E. Leiserson, "Algorithms for VLSI processor arrays", in "Introduction to VLSI Systems", Addison-Wesley, 1979.

[222] G. C. Fox, S. W. Otto and A. J. G. Hey, "Matrix algorithms on a hypercube I: Matrix multiplication", Parallel Computing, vol. 4(1), pp. 17-31, 1987.

[223] R. A. van de Geijn, J. Watts, "SUMMA: Scalable Universal Matrix Multiplication Algorithm", Tech. Rep., 1997.

[224] A. Ziad, M. Alqadi and M. M. El Emary, “Performance Analysis and Evaluation of Parallel Matrix Multiplication Algorithms”, World Applied Sciences Journal, vol. 5 (2), pp. 211-214, 2008.

[225] Z. Alqadi and A. Abu-Jazzar, "Analysis of program methods used for optimizing matrix multiplication", Journal of Engineering, vol. 15(1), pp. 73-78, 2005.

[226] J. Choi, “Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers” in Proceeding of 11th International Symposium on Parallel Processing IPPS '97 IEEE, 1997.

[227] P. Alonso, R. Reddy, A. Lastovetsky, “Experimental Study of Six Different Implementations of Parallel Matrix Multiplication on Heterogeneous Computational Clusters of Multi-core Processors” in Proceedings of Parallel, Distributed and Network-Based Processing (PDP), Pisa, Feb. 2010.

[228] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, P. Palkar, “A three-dimensional approach to parallel matrix multiplication“, IBM Journal of Research and Development, vol. 39(5). pp. 575, 1995.

[229] A. Buluc, J. R. Gilbert, “Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication”, in proceedings of 37th International Conference on Parallel Processing, ICPP '08, Portland, Sep 2008.

[230] L. Buatois, G. Caumon, B. Lévy, "Concurrent number cruncher: An efficient sparse linear solver on the GPU", in Proceedings of the High-Performance Computation Conference (HPCC), Springer LNCS, 2007.

[231] S. Sengupta, M. Harris, Y. Zhang, J.D. Owens, “Scan primitives for GPU computing”. In proceedings of Graphics Hardware, Aug. 2007.

[232] J. A. Stratton, S. S. Stone, W.-M. W. Hwu, "M-CUDA: An efficient implementation of CUDA kernels on multicores", IMPACT Technical Report 08-01, University of Illinois at Urbana-Champaign, 2008.

[233] K. Fatahalian, J. Sugerman, P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication” in Proceeding of the conference on Graphics hardware ACM SIGGRAPH/EUROGRAPHICS HWWS '04, 2004

[234] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, "Sparse matrix solvers on the GPU: conjugate gradients and multigrid", ACM Transactions on Graphics (TOG), vol. 22(3), 2003.

[235] S. Ohshima, K. Kise, T. Katagiri and T. Yuba, “Parallel Processing of Matrix Multiplication in a CPU and GPU Heterogeneous Environment”, High Performance Computing for Computational Science – VECPAR, 2006

[236] JavaMath: http://math.nist.gov/scimark2/ [Jan, 2011].

[237] BLAS: http://www.netlib.org/blas/ [Jan, 2011].

[238] ATLAS: http://math-atlas.sourceforge.net/ [Jan, 2011].

[239] LAPACK: http://www.netlib.org/lapack/ [Jan, 2011].


[240] MKL: http://software.intel.com/en-us/articles/intel-mkl/ [Jan, 2011].

[241] .NET Matrix Library: http://www.bluebit.gr/net/ [Jan, 2011].

[242] H. Kim, R. Bond, ''Multicore Software Technologies'' IEEE SIGNAL PROCESSING MAGAZINE, Nov. 2009

[243] N. Vachharajani, Y. Zhang and T. Jablin, "Revisiting the sequential programming model for the multicore era", IEEE MICRO, Jan - Feb 2008.

[244] M. D. McCool, "Scalable programming models for massively multicore processors", Proceedings of the IEEE, vol. 96(5), 2008.

[245] F. Glover, G.A. Kochenberger, "Handbook of Metaheuristics", Kluwer’s international series, 2003, pp. 475-514.

[246] D. L. Applegate, R. Bixby, V. Chvatal, W. J. Cook, "The Travelling Salesman Problem", Princeton University Press, 2006, pp. 29, 59-78, 103, 425-469, 489-524.

[247] E. Alba, "Parallel Metaheuristics: A New Class of Algorithms", Wiley, 2006.

[248] K. Helsgaun, “General k-opt submoves for the Lin–Kernighan TSP heuristic”, Math. Prog. Comp., vol. 1, pp. 119–163, 2009.

[249] E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, D. B. Shmoys, "The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization", Wiley, New York, 1985.

[250] S. Lin, B. W. Kernighan, "An effective heuristic algorithm for the traveling-salesman problem", Oper. Res., vol. 21, pp. 498-516, 1973.

[251] K. Helsgaun, "An effective implementation of the Lin-Kernighan traveling salesman heuristic", European Journal of Operational Research, vol. 126, pp. 106-130, 2000.

[252] H.H. Hoos, T. Stützle, "Stochastic Local Search: Foundations and Applications". Morgan Kaufmann, Menlo Park , 2004.

[253] D. S. Johnson, “Local optimization and the traveling salesman problem”, LNCS, vol. 442, pp. 446–461, 1990.

[254] LKH: http://www.akira.ruc.dk/~keld/research/LKH/ [Feb, 2011].

[255] TSPLIB: http://www.tsp.gatech.edu/data/index.html [Feb, 2011].

[256] TSPLIB: http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/ [Feb, 2011].

[257] CACTI: http://quid.hpl.hp.com:9081/cacti/index.y [June, 2010].

[258] CACTI: http://www.hpl.hp.com/research/cacti [June, 2010].

[259] CACTI: http://www.cs.wisc.edu/arch/www/tools.html [June, 2010].

[260] B. Chandramouli, S. Iyer, "A performance study of snoopy and directory based cache-coherence protocols", 2006.

[261] B. Saha, A.-R. Adl-Tabatabai, Q. Jacobson, "Architectural Support for Software Transactional Memory", 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06), 2006.


[262] D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Pearson Education, 2004.

[263] M. Herlihy, N. Shavit, "The Art of Multiprocessor Programming", Morgan Kaufmann, 2008.

[264] X. Wu, "Performance Evaluation, Prediction and Visualization of Parallel Systems", Kluwer Academic Publishers, 2000.

[265] M. Domeika, "Software Development for Embedded Multi-core Systems", Newnes, 2008.

[266] A. Grama, A. Gupta, G. Karypis, "Introduction to Parallel Computing", Pearson Education, 2004.

[267] A. A. Jerraya, W. Wolf, "Multiprocessor Systems-on-Chips", Morgan Kaufmann, 2005.

[268] J. Dongarra, I. Foster, G. Fox, "Sourcebook of Parallel Computing", Morgan Kaufmann, 2003.

[269] Y. N. Srikant, P. Shankar, "The Compiler Design Handbook", CRC Press, 2008.
