Characterizing the Impact of Context-Variables in Software Performance Factors: a Domain-Specific Perspective

Master Thesis Document

Presented in partial fulfillment of the requirements for the degree of Magister in Informatics and Telecommunications

by

Juan Manuel Ropero L.

Advisor: Gabriel Tamura, PhD

Department of Information and Communication Technologies
Faculty of Engineering

2017

Contents

1 Introduction
  1.1 Motivation and Background
  1.2 Problem Statement
  1.3 Challenges
  1.4 Research Objectives
    1.4.1 General
    1.4.2 Specific
  1.5 Methodology
  1.6 Contributions

2 State-of-the-Art Background
  2.1 Self-Adaptive Software (SAS) Systems
    2.1.1 The MAPE-K Loop
    2.1.2 Self-* Properties
  2.2 Service Component Architecture (SCA)
  2.3 The Performance Quality Attributes and Performance Factors
    2.3.1 Performance Factor Definitions
  2.4 Context-Aware Computing
    2.4.1 Context-Variable Definitions
    2.4.2 Design Patterns for Software Architecture
  2.5 Chapter Summary

3 Problem Modeling
  3.1 The Problem Context
  3.2 The Need for an Experiments Design
  3.3 Identification of Performance Factors
    3.3.1 Search and Identification Process
    3.3.2 Distribution Results of Performance Factors
    3.3.3 Throughput - Special Consideration
  3.4 Identification of Context Variables
    3.4.1 Search and Identification Process
    3.4.2 Context-variables Categorization
    3.4.3 Context Variables - Special Considerations
  3.5 Identification of Domain-Specific Design Patterns for Performance
    3.5.1 Overview of the Research Methodology

  3.6 Domain-Specific Design Patterns for Performance
    3.6.1 The Design Pattern Template
    3.6.2 Random Access Parser Design Pattern
    3.6.3 Reactor Design Pattern
    3.6.4 State-based Design Pattern
    3.6.5 Pool Design Pattern
    3.6.6 Master-Worker Design Pattern
    3.6.7 Separable Dependencies Design Pattern (Variant of 3.6.6)
    3.6.8 Fork / Join Design Pattern
    3.6.9 Producer-Consumer Design Pattern
    3.6.10 Sender Released
    3.6.11 Leader and Followers
    3.6.12 Half-Sync / Half-Async
    3.6.13 Sayl
    3.6.14 MapReduce
  3.7 Problem Model Instantiation
    3.7.1 Experiment Case Studies
    3.7.2 Domain-Specific Design Patterns for Performance - Selection
    3.7.3 Performance Factors - Selection
    3.7.4 Context Variables - Selection
  3.8 Chapter Summary

4 Experiments Design
  4.1 Experiment Environment Configuration
    4.1.1 Software Technologies
    4.1.2 The Warming Up Process
    4.1.3 Pilot Experiments for Special System Variables
    4.1.4 Hardware and Network Architecture
    4.1.5 Software Architecture for The Sorting Case
    4.1.6 Memory Architecture for The Sorting Case
    4.1.7 Variable Configuration: Pilot Experiments
  4.2 Experiments Design
    4.2.1 The Sorting Case - Experiments Design
    4.2.2 The Large XML Processing Case - Experiments Design
  4.3 Chapter Summary

5 Analysis of Experiment Results
  5.1 Impact of Context-Variables on Latency for the Sorting Case
    5.1.1 Design Patterns and Memory Structure Variations
    5.1.2 Network Bandwidth
    5.1.3 Communication Time
    5.1.4 RAM Memory Usage
    5.1.5 Buffer Size
    5.1.6 Batch Time Span
    5.1.7 File Length
    5.1.8 Task Granularity: Batch-Size

    5.1.9 Number of Available Distributed Task Processors
    5.1.10 Number of Available Task Processors + Task Granularity
    5.1.11 Number of Available Task Processors + File Length
    5.1.12 Task Granularity + File Length
    5.1.13 Number of Task Processors + Task Granularity + File Length
    5.1.14 An Isolated Experiment for The UMA-NORMA Memory Structure
  5.2 Context-Variables Impact on Throughput for The Sorting Case
    5.2.1 Number of Service Requests
    5.2.2 Design Patterns and File Size
    5.2.3 Number of Available Distributed Task Processors
    5.2.4 The Impact of Two-Mergers
  5.3 A Comparative Analysis between Latency and Throughput for The Sorting Case
  5.4 A Complementary Evaluation of The XML Processing Case
    5.4.1 Summary of the Throughput Analysis from "Processing Large Volumes of Data Physically Stored in XML Files"
    5.4.2 Latency Performance Analysis
    5.4.3 Batch Time Span Evaluation in the XML Processing Case
  5.5 Best Performance Combinations
    5.5.1 Sorting Case
    5.5.2 Best Latency Results
    5.5.3 Best Throughput Results
    5.5.4 XML Processing Case
  5.6 Chapter Summary

6 Summary and Conclusions

List of Tables

3.1 Context-Variables Categorization

4.1 JVM Heap Size Experiment
4.2 Hardware Technical Specifications
4.3 Network Switch Hardware Specifications
4.4 Latency Results for the NORMA Configuration of the Experimental Pilot for Batch Size Tests
4.5 Latency Results for the UMA Configuration of the Experimental Pilot for Batch Size Tests
4.6 Latency Distribution Behavior Test for Master-Worker Under Different Architecture Configurations. All values are given in milliseconds
4.7 Latency Distribution Behavior Test for Producer-Consumer Under Different Architecture Configurations. All values are given in milliseconds
4.8 Latency Distribution Behavior Test for Sayl Under Different Architecture Configurations. All values are given in milliseconds
4.9 Producer-Consumer NORMA latency results using synchronous and asynchronous methods for triggering tasks. All latency data is presented in milliseconds.
4.10 Producer-Consumer UMA latency results using synchronous and asynchronous methods for triggering tasks. All latency data is presented in milliseconds.
4.11 Sayl NORMA latency results using synchronous and asynchronous methods for triggering tasks. All latency data is presented in milliseconds.
4.12 Sayl UMA latency results using synchronous and asynchronous methods for triggering tasks. All latency data is presented in milliseconds.
4.13 Experiments for the sorting case. Compendium of the variable variations for Latency Experiments.
4.14 Experiments for the sorting case. Compendium of the variable variations for Throughput Experiments.
4.15 Experiments for the large XML processing case. Compendium of variable variations for the complementary Latency Experiments.

5.1 Average latency and times for 1 Gbps and 100 Mbps experiments detailed by main sorting algorithm stages and memory structure. Results are shown in milliseconds.
5.2 Latency average with 1 Gbps and 100 Mbps detailed by memory structure and design patterns. All results are shown in milliseconds.
5.3 Communication Time Analysis

5.4 Average Latency for Large File Sizes in NORMA at 1 Gbps. All results are shown in milliseconds
5.5 Average Latency for Large File Sizes in UMA at 1 Gbps. All results are shown in milliseconds
5.6 Number of Service Requests Experiment
5.7 Results Comparison between Latency and Throughput for The Sorting Case
5.8 Summary of the principal Throughput results analyzed in the "Processing Large Volumes of Data Physically Stored in XML Files" Document [9]
5.9 Average Latency Results for the Selected XML Processing Case
5.10 Average Latency for the XML Processing Case with different Batch Time Span configurations
5.11 Best Average Latency Combinations for the Evaluated File Sizes for the Sorting case at 1 Gbps
5.12 Best Average Throughput Combinations for the Evaluated File Sizes for the Sorting case at 1 Gbps
5.13 Best Average Latency Combinations for the Evaluated File Sizes for the XML Processing case at 1 Gbps

List of Figures

2.1 The MAPE-K reference model [36]

3.1 Distribution Results of Performance Factors Literature
3.2 Filter-Process Stage Results
3.3 Structure Diagram - Random Access Parser Design Pattern
3.4 Processing Sequence Diagram - Random Access Parser Design Pattern
3.5 Structure Diagram - Reactor Design Pattern
3.6 Request Processing Sequence Diagram - Reactor
3.7 Structure Diagram - State-based Pipeline Design Pattern
3.8 Processing Sequence Diagram - State-based Pipeline Design Pattern
3.9 Structure Diagram - Thread Pool Design Pattern
3.10 Request Processing Sequence Diagram - Thread Pool
3.11 Structure Diagram - Master-Worker Design Pattern
3.12 Processing Sequence Diagram - Master-Worker Design Pattern
3.13 Structure Diagram - Separable Dependencies Design Pattern
3.14 Processing Sequence Diagram - Separable Dependencies Design Pattern
3.15 Structure Diagram - Fork / Join Design Pattern
3.16 Request Processing Sequence Diagram - Fork / Join
3.17 Structure Diagram - Producer-Consumer Design Pattern
3.18 Processing Sequence Diagram - Producer-Consumer Design Pattern
3.19 Structure SOA Diagram - Sender Released Design Pattern
3.20 Processing Sequence Diagram - Sender Released Design Pattern
3.21 Structure Diagram - Leader and Followers Design Pattern
3.22 Request Processing Sequence Diagram - Leader and Followers
3.23 Structure Diagram - Half-Sync / Half-Async Design Pattern
3.24 Request Processing Sequence Diagram - Half-Sync / Half-Async
3.25 Structure Diagram - Sayl Design Pattern
3.26 Processing Sequence Diagram - Sayl Design Pattern
3.27 Structure Diagram - MapReduce Design Pattern
3.28 Processing Sequence Diagram - MapReduce Design Pattern

4.1 Impact of a Force Garbage Collection
4.2 Software Architecture using the Master-Worker Design Pattern
4.3 Software Architecture using the Producer-Consumer Design Pattern
4.4 Software Architecture using the Sayl Design Pattern

5.1 Master-Worker behavior with a 14 million lines file using medium-granularity with 4-Nodes in NORMA at 1 Gbps
5.2 Producer-Consumer behavior with a 14 million lines file using medium-granularity with 4-Nodes in NORMA at 1 Gbps
5.3 Sayl behavior with a 14 million lines file using medium-granularity with 4-Nodes in NORMA at 1 Gbps
5.4 Master-Worker behavior with a 14 million lines file using medium-granularity with 4-Nodes in UMA at 1 Gbps
5.5 Producer-Consumer behavior with a 14 million lines file using medium-granularity with 4-Nodes in UMA at 1 Gbps
5.6 Sayl behavior with a 14 million lines file using medium-granularity with 4-Nodes in UMA at 1 Gbps
5.7 Average RAM Usage vs Latency by File Length and Design Pattern in NORMA at 1 Gbps
5.8 Average RAM Usage vs Latency by File Length and Design Pattern in UMA at 1 Gbps
5.9 Average RAM Usage by Components of the Software Architecture of each Design Pattern in NORMA at 1 Gbps
5.10 Average RAM Usage by Components of the Software Architecture of each Design Pattern in UMA at 1 Gbps
5.11 Average Maximum Number of Tasks in Queue by Design Pattern in NORMA at 1 Gbps
5.12 Average Maximum Number of Tasks in Queue by Design Pattern in UMA at 1 Gbps
5.13 Batch Time Span Results for NORMA Configuration Part I
5.14 Batch Time Span Results for NORMA Configuration Part II
5.15 Batch Time Span Results for UMA Configuration Part I
5.16 Batch Time Span Results for UMA Configuration Part II
5.17 Average Latency by File Length in NORMA at 1 Gbps
5.18 Average Latency by File Length in UMA at 1 Gbps
5.19 Average Latency by File Length in NORMA at 100 Mbps
5.20 Average Latency by File Length in UMA at 100 Mbps
5.21 Latency by Task Granularity in NORMA at 1 Gbps
5.22 Latency by Task Granularity in UMA at 1 Gbps
5.23 Latency by Task Granularity in NORMA at 100 Mbps
5.24 Latency by Task Granularity in UMA at 100 Mbps
5.25 Latency by Number of Available Distributed Task Processors Configuration in NORMA at 1 Gbps
5.26 Latency by Number of Available Distributed Task Processors Configuration in UMA at 1 Gbps
5.27 Latency by Number of Available Distributed Task Processors Configuration in NORMA at 100 Mbps
5.28 Latency by Number of Available Distributed Task Processors Configuration in UMA at 100 Mbps
5.29 Average Latency by Number of Available Task Processors and Task Granularity in NORMA at 1 Gbps

5.30 Average Latency by Number of Available Task Processors and Task Granularity in UMA at 1 Gbps
5.31 Average Latency by Number of Available Task Processors and Task Granularity in NORMA at 100 Mbps
5.32 Average Latency by Number of Available Task Processors and Task Granularity in UMA at 100 Mbps
5.33 Average Latency in Master-Worker by Task Processors Number and File Length in NORMA at 1 Gbps
5.34 Average Latency in Producer-Consumer by Task Processors Number and File Length in NORMA at 1 Gbps
5.35 Average Latency in Sayl by Task Processors Number and File Length in NORMA at 1 Gbps
5.36 Average Latency in Master-Worker by Task Processors Number and File Length in UMA at 1 Gbps
5.37 Average Latency in Producer-Consumer by Task Processors Number and File Length in UMA at 1 Gbps
5.38 Average Latency in Sayl by Task Processors Number and File Length in UMA at 1 Gbps
5.39 Average Latency in Master-Worker by Task Granularity and File Length in NORMA at 1 Gbps
5.40 Average Latency in Producer-Consumer by Task Granularity and File Length in NORMA at 1 Gbps
5.41 Average Latency in Sayl by Task Granularity and File Length in NORMA at 1 Gbps
5.42 Average Latency in Master-Worker by Task Granularity and File Length in UMA at 1 Gbps
5.43 Average Latency in Producer-Consumer by Task Granularity and File Length in UMA at 1 Gbps
5.44 Average Latency in Sayl by Task Granularity and File Length in UMA at 1 Gbps
5.45 Average Latency in Master-Worker by Task Granularity, File Length, and Number of Available Task Processors in NORMA at 1 Gbps
5.46 Average Latency in Producer-Consumer by Task Granularity, File Length, and Number of Available Task Processors in NORMA at 1 Gbps
5.47 Average Latency in Sayl by Task Granularity, File Length, and Number of Available Task Processors in NORMA at 1 Gbps
5.48 Average Latency in Master-Worker by Task Granularity, File Length, and Number of Available Task Processors in UMA at 1 Gbps
5.49 Average Latency in Producer-Consumer by Task Granularity, File Length, and Number of Available Task Processors in UMA at 1 Gbps
5.50 Average Latency in Sayl by Task Granularity, File Length, and Number of Available Task Processors in UMA at 1 Gbps
5.51 Average latency by memory structure and design pattern at 1 Gbps
5.52 Average Throughput by File Length in UMA at 1 Gbps
5.53 Average Throughput by Number of Available Distributed Task Processors for the Master-Worker in UMA at 1 Gbps
5.54 Average Throughput by Number of Available Distributed Task Processors for the Producer-Consumer in UMA at 1 Gbps

5.55 Average Throughput by Number of Available Distributed Task Processors for the Sayl in UMA at 1 Gbps
5.56 Average Throughput by Number of Available Distributed Task Processors for All Design Patterns in UMA at 1 Gbps
5.57 Average Throughput with One and Two Mergers with UMA at 1 Gbps in Master-Worker
5.58 Average Throughput with One and Two Mergers with UMA at 1 Gbps in Producer-Consumer
5.59 Average Throughput with One and Two Mergers with UMA at 1 Gbps in Sayl
5.60 Linear Regression for the Master-Worker Results
5.61 Linear Regression for the Producer-Consumer Results
5.62 Linear Regression for the Sayl Results

Abstract

Nowadays, computing systems are continuously exposed to unpredictable situations that may change their operational contexts, environments, and system requirements. In light of this, the software engineering and IT operations communities are working on viable solutions to adapt software systems at runtime. One of the quality attributes most affected by these context changes is performance. Performance is an important runtime quality attribute of software systems that usually must be satisfied under Service Level Agreements, which are fundamental in real-world business. In order to satisfy this kind of Service Level Agreement, and specifically the performance of software systems, we study the impact of implementing performance design patterns under different context scenarios from a quantitative viewpoint. We therefore focus our efforts on the study of the relationship between the application of design patterns and system performance, since design patterns have been shown to impact this particular quality attribute. In order to address this relationship, we must (i) select a set of performance-domain design patterns; (ii) select a set of context-variables, significant for the performance quality attribute, to study their impact on the performance of target systems; (iii) select meaningful performance factors to evaluate the performance impact on the target systems; (iv) configure the test environments to evaluate the system response to changes in context variables, considering the combinations of values for the selected variables of study; and (v) analyze the inter-relationship of the involved variables of study. In this thesis, we address these challenges through (i) a Systematic Literature Review, which provides reliable information on the state of the art of design patterns, performance factors, and context-variables; (ii) an experiments design, which defines the test environments of interest for this thesis project in order to measure the impact that a context-variable variation and a design pattern have on a system's performance; (iii) the data gathered from the execution of the designed experiments; and, finally, (iv) a set of analyses that yield reliable conclusions about the system performance behavior. This information is valuable because it constitutes initial but significant experimental data for the self-adaptive software engineering and IT operations communities, enabling them to determine how to fulfill performance goals under changing context conditions at execution time.

Chapter 1

Introduction

1.1 Motivation and Background

Different families of design patterns have been proposed in software engineering since the first catalog was published by Gamma et al. in 1994 [12]. Design patterns are general, reusable solutions to commonly recurring problems within a given context in software design. Thus, a design pattern helps software engineers in the construction of better software systems, since the solution it proposes is usually well-proven and documented. Some researchers have studied the impact of domain-specific design patterns on software quality, confirming that design patterns can impact the quality attributes of a system, such as performance, security, availability, and maintainability, both positively and negatively [8][7][2][31][19][27]. One of the most important quality attributes in systems developed under the Service Component Architecture (SCA) [42], targeted in this thesis project, is performance. In this kind of system, agreed Quality of Service (QoS) levels must be guaranteed to customers. Usually, QoS expectations are defined in terms of Service Level Agreements (SLAs) [45][46]. SLAs are contracts between a service provider and customers that pay for a service with an expected quality, where most of the agreements and policies are described in terms of performance, availability, and reliability [36][45]. Performance is an important runtime quality attribute of software systems that characterizes the timeliness of the services delivered by the system [42]. Therefore, the continuous satisfaction of SLAs is extremely important in real-world business, because a single failure of these services may compromise significant amounts of money. However, computing systems are continuously exposed to unpredictable situations that may change their operational contexts, environments, and system requirements, where context is any variable used to characterize the state of an entity linked to the system [40]. These situations certainly hinder the achievement of SLAs. Such changes are usually induced, for example, by external disturbances, but also by the usual modification of the business goals of software systems. Thus, keeping SLAs satisfied is a critical, complex, and tedious task for software administrators at operation time, even more so given the ever-increasing complexity of current software systems. In light of this, adapting software systems at runtime to overcome these unanticipated situations becomes a formidable challenge but, at the same time, a viable solution for the software engineering and IT operations community. Self-Adaptive Software (SAS) systems —systems that are able to modify their behavior and/or structure in response to their perception of the environment and the system itself, and their high-level objectives— have become an important research topic for

software engineering [11][10][39][41][24][37]. In this project, we focus our efforts on the satisfaction of extra-functional requirements in software systems based on SCA. Specifically, we study the impact of implementing design patterns under different context scenarios on system performance. As mentioned before, even though it has been argued that particular design patterns can determine specific levels of software quality attributes [7], there are few studies addressing the quantitative relationship between the application of design patterns and system performance, as well as the use of design patterns as architectural solutions to dynamically satisfy performance on a quantitative basis. Our intent is also to explore this relationship in order to provide initial but significant experimental data to SAS designers that allow the community to determine how to fulfill performance goals under changing context conditions at execution time. To achieve these goals, we carefully selected suitable design patterns that presumably improve performance. However, in order to select the adequate design pattern for any architectural reconfiguration, we first face the problem of modeling the relationship between the system's context variables and design patterns, as well as the relationship between performance factors and design patterns. By characterizing these relationships, the goal is to make it possible to estimate the system's performance behavior in response to the inception of these patterns into the target system. In this thesis, we focus only on the problem of characterizing this relationship, while taking into account the wider context of applicability in the design of self-adaptive software systems.

1.2 Problem Statement

Considering the background described previously, we state the main problem addressed in this thesis project as follows: Given particular service-component (SCA) software applications, subject to given context conditions of execution and a given set of domain-specific design patterns for performance, characterize the response behavior of the software applications to variations in the context conditions in terms of their performance.

1.3 Challenges

The problem statement is scoped by the following main challenge:

• To model the relationship between combinations of significant context-variables, including domain-specific design patterns, and the software performance response behavior in terms of performance factors. In other words, this is to model the expected performance impact that variations of context variables have on system performance.

The challenges associated with the main challenge are:

• To select a set of performance-domain design patterns.

• To select a set of context-variables, significant for the performance quality attribute, to study their impact on the performance of target systems.

• To select meaningful performance factors to evaluate the performance impact on the target systems.

• To configure the test environments to evaluate the system response to changes in context variables, considering the combinations of values for the selected variables of study.

• To analyze the inter-relationship of the involved variables of study.

1.4 Research Objectives

1.4.1 General

To characterize the system response of given software systems, in terms of the latency and throughput performance factors, to variations of context-variables and, in particular, to the inception of different domain-specific design patterns in these systems.

1.4.2 Specific

• To select a subset of suitable domain-specific design patterns from those proposed in the literature for performance improvement.

• To select a subset of significant context-variables that directly affect system performance.

• To define a set of relevant case studies to evaluate the impact of context-variables and selected design patterns under different system configurations.

• To design and implement the base components required for realizing the selected design patterns in the given software systems.

• To establish appropriate values for the selected context-variables to experiment with.

• To measure the impact of context-variables variation and domain-specific design patterns on the performance factors of the system.

• To determine the design pattern and system configuration combinations that produce the best performance response.

1.5 Methodology

This thesis project studies the relationship between the context conditions of execution of a software system and how the implementation of a specific design pattern in those contexts affects the system performance. Therefore, we are interested in determining the resulting behavior of the system after applying a design pattern and varying the context conditions. To explore this relationship, we take advantage of a mixed-method approach, which combines qualitative and quantitative research. On the one hand, quantitative research explains a phenomenon by collecting numerical data that are analyzed using mathematical methods; for instance, we measure and evaluate the resulting performance level of a system after introducing a variation in the value of selected context-variables, with a design pattern implemented in the system. On the other hand, qualitative research explains a phenomenon from theoretical frameworks and exploratory observations without necessarily resorting to precise mathematical methods. We analytically explore the conditions and constraints that govern the relationship between design patterns and the performance

of software systems.

In order to achieve our specific goals, we define the following methodological steps:

Methodology Details

The steps considered to achieve the enunciated goals are listed below.

1.5.1. Elaborate a Systematic Literature Review (SLR) to identify domain-specific design patterns that have been proposed for performance improvement.

1.5.2. From the SLR results, conduct an exploratory search to identify and define a key set of context-variables that impact performance, including infrastructure variables. Some examples of context-variables are: RAM memory usage, operating system, memory structure, size of the processing batch, and buffer size, among others.

1.5.3. Define a formal experiments design with the goal of producing systematic and reliable data results of the variables of interest.

1.5.4. Execute the experiments design plan.

1.5.5. Process and prepare the data gathered from the experiments according to the context-variables.

1.5.6. Analyze the relation between design patterns and the context-variables in terms of latency and throughput. The analysis will be based on the observation of average values and trend lines of the results.

1.5.7. Evaluate the context-variables that turned out to be most significant for determining system performance behavior in terms of latency and throughput, as well as their combined impact.

1.5.8. Evaluate and select the different execution contexts and configurations that produced the best system performance.

1.6 Contributions

The contributions of this project are the following:

1.6.1. A systematic literature review of domain-specific design patterns that have been proposed for performance improvement.

1.6.2. A characterization of the impact that design patterns and context-variables have on the system performance behavior, in terms of latency and throughput response, which allows estimating the system performance when context-variables change.

1.6.3. A set of relevant considerations for the implementation of the selected design patterns for effectively improving performance.

1.6.4. An evaluation and selection of the context conditions that produced the best system performance levels, based on the experiments on the evaluated target systems.

Chapter 2

State-of-the-Art Background

In this chapter we present the basic concepts and state of the art in the fundamental knowledge areas required to solve the problem stated in this thesis, including the areas related to domain-specific design patterns, performance factors, and context-aware computing.

2.1 Self-Adaptive Software (SAS) Systems

We place special emphasis on this topic because, as we stated in the Motivation and Background section (cf. Section 1.1), this thesis aims at contributing to the construction of the knowledge base component required by this kind of software system, as explained in the following.

“Self-adaptive software modifies its own behavior in response to changes in its operating environment. By operating environment, we mean anything measurable by the software system, such as end-user input, external hardware devices and sensors, or program instrumentation” [29]. The adaptation process must occur dynamically and at runtime, and it is normally guided by high-level policies or system requirements defined by the system administrators. Therefore, in order to keep high-level policies and requirements satisfied, the software system must be reconfigured, for instance, by augmenting or changing the system components and services at runtime [36]. The basis for the automation of self-adaptation in computing and software engineering is the feedback loop, also called closed loop, which originates from classic control theory. This model is used to automate the control of dynamic systems [43]. Based on feedback loops, IBM researchers defined the autonomic element, a software artifact whose purpose is to manage itself by controlling its internal behavior and relationships in accordance with a set of high-level policies [18]. For this purpose, the autonomic element introduces a feedback loop model in the form of the Monitoring-Analysis-Planning-Execution and Shared Knowledge Base (MAPE-K) loop to adapt the managed element, that is, the so-called target system. Since the introduction of the MAPE-K reference model, different approaches have been proposed in the literature for applying feedback loops to SAS systems, such as Kramer and Magee's Three Layer Architecture [21], IBM's Autonomic Computing Reference Architecture (ACRA) [15], and Villegas and Tamura's DYNAMICO reference model [43]. Furthermore, several approaches for SAS systems have been proposed from very diverse disciplines [28][22][26], such as artificial intelligence, networking, and biologically-inspired systems.

2.1.1 The MAPE-K Loop

The elements and functionalities of the MAPE-K loop for achieving self-adaptive behavior are described below [15][18], with reference to Figure 2.1:

Figure 2.1: The MAPE-K reference model [36]

Monitor. The monitor gathers information from the system (i.e., the managed element) and its context, and reports relevant events in the form of control symptoms to the analyzer. The relevant events are those determined by the high-level policies or system requirements.

Analyzer. The analyzer, based on the high-level requirements and the reported context events, determines whether changes need to be made. For example, if the performance policy is no longer satisfied, a system modification may be required; the analyzer is responsible for determining whether this modification is required and, in that case, it produces a change request that is passed to the planner.

Planner. Using the information passed by the analyzer and reviewing the shared knowledge base, the planner selects a suitable plan to be performed by the executor in the managed software system. The selected plan represents a set of desirable changes for the managed system.

Executor. The executor translates the plan determined by the planner into a series of reconfiguration steps to be performed on the managed software system.

Knowledge Manager. The knowledge manager provides access to relevant knowledge about the managed software system in the form of data types with architected syntax and semantics, such as symptoms, policies, change requests, and change plans. This knowledge is shared among all the MAPE-K loop elements previously described.
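As a concrete illustration of the separation of concerns among these elements, the following Java sketch outlines one possible set of interfaces for the MAPE-K roles. It is a minimal sketch for discussion only: all type and method names (Symptom, ChangeRequest, AdaptationPlan, iterate, and so on) are illustrative assumptions and do not come from any reference implementation.

// Minimal, illustrative Java sketch of the MAPE-K roles described above; all type and
// method names are hypothetical and do not come from any reference implementation.
interface Symptom {}                // relevant event reported by the monitor
interface ChangeRequest {}          // produced by the analyzer when a policy is violated
interface AdaptationPlan {}         // ordered set of reconfiguration actions
interface KnowledgeManager {}       // shared knowledge: policies, symptoms, plans (omitted for brevity)

interface Monitor   { Iterable<Symptom> sense(); }                  // observes the managed element and its context
interface Analyzer  { ChangeRequest analyze(Iterable<Symptom> s); } // returns null when no change is needed
interface Planner   { AdaptationPlan plan(ChangeRequest request); } // selects a suitable plan
interface Executor  { void execute(AdaptationPlan plan); }          // applies the plan to the managed element

final class MapeKLoop {
    // One iteration of the control loop over the managed element.
    static void iterate(Monitor m, Analyzer a, Planner p, Executor e) {
        ChangeRequest request = a.analyze(m.sense());
        if (request != null) {
            e.execute(p.plan(request));
        }
    }
}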

2.1.2 Self-* Properties

Self-* properties are the properties to be maintained by adaptation. They were introduced by Kephart and Chess in their vision of autonomic computing [18] in order to specify the aspects or attributes of self-managed systems [32], and are defined as follows:

Self-Configuring. The system will configure itself dynamically in accordance with high-level policies that define an expected behavior.

Self-Optimization. The system will automatically monitor and tune itself in order to improve its operation according to end-user or business needs.

Self-Healing. The system will detect, diagnose, and repair software malfunctions by itself without disrupting its execution.

Self-Protection. The system will anticipate, detect, identify, and protect itself against threats that arise from its internal context or environment.

In terms of the self-* properties, our project contributes to the achievement of the Self-Configuring property, since the characterization of the system response, in terms of the latency and throughput performance factors, to design patterns and context-variables may guide the construction of the knowledge base component of SAS systems.

2.2 Service Component Architecture (SCA)

Service Component Architecture (SCA) is a set of specifications intended for the development of applications using a Service-Oriented Architecture (SOA) [33]. SCA allows developers to create services and assemble them into composite applications, that is, applications based on components that implement the business logic [25][6]. As defined by Szyperski in [35], a software component is a unit of composition with contractually specified interfaces and explicit dependencies. A software component can be deployed independently and is subject to composition by third parties. This means that software components provide the flexibility and extensibility that programmers require to assemble software pieces depending only on the contracted services, without considering any implementation details.
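To make the component model concrete, the following minimal sketch shows a hypothetical SCA component in Java, assuming the OASIS SCA Java annotations (org.oasisopen.sca.annotation) are available on the classpath. The SortService and SortComponent names are invented for illustration and do not correspond to the systems evaluated later in this thesis.

import org.oasisopen.sca.annotation.Reference;
import org.oasisopen.sca.annotation.Service;

// Contractually specified interface: clients depend only on this contract.
interface SortService {
    int[] sort(int[] data);
}

// Component implementation; its dependency is declared, not hard-coded, so the
// wiring can be changed in the composite descriptor without touching this code.
@Service(SortService.class)
class SortComponent implements SortService {

    @Reference                    // resolved by the SCA runtime to another component
    protected SortService worker;

    @Override
    public int[] sort(int[] data) {
        return worker.sort(data); // delegate to the wired component
    }
}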

2.3 The Performance Quality Attributes and Performance Factors

A Software Quality Attribute is defined as the degree to which a property of the software system is achieved, as perceived by its users [4].

“Performance is a pervasive quality attribute of software systems; everything affects it, from the software itself to all underlying layers, such as operating system, middleware, hardware resources, network resources, among others” [44]. In general, performance can be understood as the time required for the system to respond to or process events. Nevertheless, performance is usually described in terms of the factors that affect it: latency (i.e., the time it takes to respond to a specific event), throughput (i.e., the number of events that can be completed in a given interval of time), and capacity (i.e., a measure of the amount of work that the system can perform) [42][4]. A Performance Factor is a concrete dimension of the quality attribute by which the system is specified, measured, and evaluated.

2.3.1 Performance Factor Definitions

Throughput

Throughput refers to the number of events that have been completely processed by a given software system over a given observational time frame. However, it is worth noting that the number of events completed in a single observational time frame may not describe the global behavior. For instance, a throughput of 60 requests processed in one minute does not imply that one request is processed every second; therefore, one or more observational time frames should be specified in order to obtain a more detailed view of the behavior.

Latency

Latency refers to the time interval in which a request is completely processed. Since this time varies, latency actually refers to the time interval within which an event is expected to be completely processed. The minimum and the maximum of this interval represent the minimum and the maximum latency, respectively [4].

Capacity

Capacity refers to the maximum amount of work that the system can perform with its available resources at a particular moment. Usually, capacity is measured in terms of the maximum system throughput achievable without violating a specified maximum latency [4].

Resource Utilization

Resource utilization refers to the usage percentage of a computational resource within a time interval. Some computational resources are: CPU, RAM memory, network bandwidth, and hard disk space, among others [4].

Success Rate

Success rate refers to the percentage of events processed successfully out of a given total number of events within a time interval.

Jitter

Jitter refers to the variation across a set of latency measurements. For this reason, jitter is usually treated as a complementary measurement of latency and is not analyzed independently.
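To make the factors used most in this thesis concrete, the following Java sketch shows one way latency, throughput, and jitter can be derived from per-request timing samples. The class and method names, and the choice of the standard deviation as the jitter measure, are illustrative assumptions.

import java.util.List;

// Illustrative derivation of latency, throughput, and jitter from timing samples.
final class PerformanceFactors {

    // Latency of one request: completion time minus arrival time, in milliseconds.
    static long latencyMs(long arrivalMs, long completionMs) {
        return completionMs - arrivalMs;
    }

    // Throughput: completed requests per second over an observation window.
    static double throughputPerSecond(int completedRequests, long windowMs) {
        return completedRequests / (windowMs / 1000.0);
    }

    // Jitter: here taken as the standard deviation of a set of latency samples (ms).
    static double jitterMs(List<Long> latenciesMs) {
        double mean = latenciesMs.stream().mapToLong(Long::longValue).average().orElse(0.0);
        double variance = latenciesMs.stream()
                .mapToDouble(l -> (l - mean) * (l - mean))
                .average().orElse(0.0);
        return Math.sqrt(variance);
    }
}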

2.4 Context-Aware Computing

As stated by Villegas in [40], software systems are highly affected by unpredictable situations that occur in the system's execution environment, as well as by modifications to the system requirements and the software itself. This uncertainty is usually generated by (i) environmental uncertainty, produced by changing environmental conditions (e.g., hardware and network load fluctuations), and (ii) system uncertainty, generated by system requirements and changes in the system behavior (e.g., a changing number of users). Context-aware computing

is usually defined as “software applications that dynamically change or adapt their behavior based on the context of the application and the user” [1], where context must be understood as any variable used to characterize the state of an entity, and an entity can be a person, place, or object [1][40].

2.4.1 Context-Variable Definitions

This section presents the definitions of the most relevant context-variables that we found for this thesis project. We also present the definitions of some variables that are relatively unknown but were found in the literature. It is worth noting that some context variables do not directly affect the system performance; however, they must be monitored to evaluate the real state of the system and its performance.

Distributed Task Granularity

It refers to a quantitative measure of the ratio of computation time versus communication time [5], that is, the amount of real work that a parallel task performs before it has to be synchronized with other task processors, and the amount of data required to perform the synchronization operation. If the granularity is too fine, the performance may be affected by the communication overhead. If the granularity is too coarse, the performance may be affected by load imbalance [47].

How Is It Measured?

Task granularity is generally classified into three relative values: fine, medium, or coarse. However, it is worth noting that a program may contain many different levels of granularity or grain size. Therefore, the goal is to determine the right granularity for parallel and distributed tasks, while avoiding load imbalance and communication overhead, to achieve the best overall performance [47].

• Fine-grain. The software is decomposed into a large number of small tasks. Fine-grain facilitates load balancing. However, it implies high communication overhead and less opportunity for performance improvement. In some cases, if the granularity is too fine, the computation time can be less than the communication time, which, in general, negatively affects the overall performance [47][13][5].

• Coarse-grain. The software is decomposed into a small number of large tasks. Coarse-grain implies more opportunity for performance improvement because it can reduce management overhead. However, there are fewer opportunities for load balancing [47][13][5].

• Medium-grain. The software is decomposed into tasks of relative size between the fine-grain and coarse-grain levels.

Batch Size

Related to task granularity, the batch size defines the different task-granularity levels for a batch of work to be processed by each of the available distributed processors or even CPU cores. A batch must be understood as the minimum unit of data to be processed in a distributed software system. It affects the performance in the same manner as task granularity.

How Is It Measured?

Similarly to task granularity, the batch size is classified into the same three relative values: (i) fine, (ii) medium, and (iii) coarse batch size.
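As an illustration of how batch size realizes task granularity in practice, the following Java sketch splits the lines of a file into batches of a configurable size: small batches correspond to fine granularity (many tasks, more communication), and large batches to coarse granularity. The method name and the example batch sizes are assumptions made for illustration only.

import java.util.ArrayList;
import java.util.List;

// Illustrative partitioning of a file's lines into batches of a configurable size.
final class Batching {

    // Splits the input lines into consecutive batches of at most batchSize elements.
    static List<List<String>> toBatches(List<String> lines, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += batchSize) {
            batches.add(lines.subList(i, Math.min(i + batchSize, lines.size())));
        }
        return batches;
    }
}
// For a 14-million-line file, a batch size of 1,000 lines yields many small (fine-grain)
// tasks, whereas a batch size of 1,000,000 lines yields few large (coarse-grain) tasks.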

Batch Time-Span

It refers to the time span that must separate two batches processed by the same task processor [30]. The batch time span is expected to impact the system performance, since it might help to control the data flow in the program.

How Is It Measured?

This variable is measured in units of time.

Memory Architecture

It refers to the different strategies to organize the shared memory structure of multiprocessors; it can also be understood as the memory architecture of the software. The memory architecture can directly impact the system performance, since each associated configuration may bring improvements or losses of performance, creating a trade-off decision. Additionally, the memory architecture usually implies particular software and hardware configurations that impact the system performance.

How Is It Measured?

Shared memory architecture is generally classified into three different categories:

• Uniform Memory Access (UMA). In UMA, a physical unit of memory is configured to be shared among the set of distributed processors. This memory architecture may raise scalability problems: when there is a considerable number of processors, the shared memory becomes a bottleneck, decreasing the overall performance.

• Non-Uniform Memory Access (NUMA). To avoid the bottleneck problems of shared physical memories, NUMA proposes an architecture where the memory, and the data to be stored, are divided and distributed among the processors. However, the whole memory is still addressed with a single set of logical addresses. Therefore, this memory architecture is logically shared and physically distributed.

• No Remote Memory Access (NORMA). In NORMA, accesses to remote memory (i.e., the memory associated with another processor) are only possible through the interconnection of processors that share a communication network. That is, this architecture does not allow direct remote access to memory.

Buffer Size

According to the Java documentation, a queue is a collection designed for holding elements prior to processing1. Queues typically store elements to be processed in a FIFO (first-in-first-out) manner.

1https://docs.oracle.com/javase/7/docs/api/java/util/Queue.html

The buffer size then refers to the capacity of the queue, which prevents it from overrunning. The buffer size can also be associated with other data structures. The buffer size introduces a balance factor between the rate at which tasks are generated and the rate at which they are processed.

How Is It Measured?

The buffer size is measured as the number of elements the buffer can hold. A large task buffer can be counterproductive because it can consume more resources than expected to guarantee space reservation, whereas a small task buffer can cause loss of data and negatively affect the rate of processed tasks.
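The following minimal Java sketch illustrates the buffer-size trade-off, assuming the task buffer is realized with the standard java.util.concurrent.ArrayBlockingQueue: the capacity bounds the memory reserved for pending tasks, and a full buffer blocks producers instead of dropping tasks. The capacity value shown is only an illustrative default, not the value used in the experiments.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded task buffer: the capacity (buffer size) is a context variable of interest.
final class TaskBuffer {

    private final BlockingQueue<Runnable> tasks = new ArrayBlockingQueue<>(1000);

    void submit(Runnable task) throws InterruptedException {
        tasks.put(task);     // blocks the producer when the buffer is full (back-pressure)
    }

    Runnable next() throws InterruptedException {
        return tasks.take(); // blocks the consumer when the buffer is empty
    }
}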

Number of Available Distributed Task Processors

It refers to the number of distributed processing nodes (i.e., computational resources not in the same physical CPU) available to perform the tasks required to execute an application concurrently. The primary assumption is that a distributed program will run faster the more distributed processing nodes are available, within certain limits.

How Is It Measured?

The number of available task processors can be configured before the program starts, in order to take advantage of them and register relevant metrics, or it can be monitored dynamically.

Network Bandwidth

It refers to the amount of data that can be transported from one distributed task processor to another in a given period of time. The network bandwidth is usually measured in megabits per second (Mbps).

How Is It Measured?

The network bandwidth must be parameterized in the execution environment as part of the network configuration. This configuration involves both the network card on the task-processor side and the network switch or corresponding interconnection device.

Number of Concurrent Service Requests

It refers to the number of concurrent service requests that a program must attend to. An increased number of service requests can negatively affect the system performance.

How Is It Measured?

This variable is measured by counting the number of service requests that arrive to be serviced.

RAM Memory Usage

It refers to the amount of RAM memory required for a program to be executed. RAM consumption is an important aspect that impacts performance, since an overloaded RAM memory implies less space for each execution, thus negatively impacting performance. Programs developed in Java are executed by the Java Virtual Machine (JVM), which interprets and executes Java

binary code and administers the available hardware RAM memory. This variable can also be configured within the limits of the physical RAM available.

How Is It Measured?

The RAM memory usage is measured in megabytes (MB).
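As a brief illustration, the heap available to a Java program is bounded with the standard -Xmx option at start-up, and the RAM actually used can be sampled through java.lang.Runtime, as in the following sketch; the class name is an assumption made for illustration.

// Start the JVM with an explicit heap limit, for example:
//   java -Xms512m -Xmx2048m MyApplication

// Samples the heap currently used by this JVM, in megabytes.
final class RamSampler {
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }
}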

Communication Time

It refers to the time spent in data transportation between distributed task processors. Communication time is needed to compute the ratio of computation to communication of a program, and it allows determining how much of the total program execution time was spent in data transportation. Although communication time is affected by multiple variables, this ratio, and in general the measurement of communication time per program stage, makes it possible to identify potential bottlenecks and optimization points.

How Is It Measured?

It is the sum of all the times registered whenever a data structure is sent or received by any processor through the network during the software execution.
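A simple way to accumulate this measure is to time every remote send and receive, as in the following sketch; the accumulator shown is a simplified illustration, not the instrumentation actually used in the experiments of this thesis.

import java.util.concurrent.atomic.AtomicLong;

// Accumulates the time spent in network send/receive operations.
final class CommunicationTimer {

    private final AtomicLong totalNanos = new AtomicLong();

    // Times a single send or receive operation and adds it to the running total.
    void timed(Runnable networkOperation) {
        long start = System.nanoTime();
        networkOperation.run();
        totalNanos.addAndGet(System.nanoTime() - start);
    }

    long totalMillis() {
        return totalNanos.get() / 1_000_000;
    }
}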

Network Usage

Related to the network bandwidth, it refers to the actual amount of data transported over the network in a given period of time, relative to the network's bandwidth. It represents the percentage of utilization of the network.

How Is It Measured?

It is monitored by the operating system and can be registered through a middleware framework. It is measured in Kbps or Mbps.

Power Consumption

It refers to the electrical power consumed to perform a task. Computing at maximum allowable speed and capacity requires a CPU to consume more power.

How Is It Measured?

The consumption is measured in kilowatt-hours (kWh).

Shared Resources

It refers to any significant computational resource that has to be shared to perform a task. It involves software, hardware, and network resources. Shared resources are potential bottlenecks in systems where performance is critical. Some examples of common shared resources are: CPU processors, network bandwidth, data, and storage devices.

How Is It Measured?

Every shared resource has its own way of being measured. Additionally, depending on the system, some shared resources might impact the performance to a greater or lesser extent.

2.4.2 Design Patterns for Software Architecture

Buschmann et al. provide the following definition of a pattern for software architecture [8]: “A pattern for software architecture describes a particular recurring design problem that arises in specific design contexts, and presents a well-proven generic scheme for its solution. The solution scheme is specified by describing its constituent components, their responsibilities and relationships, and the ways in which they collaborate.”

Categories of Design Patterns

According to their scale and abstraction level, patterns for software architecture can be classified into three different categories [8]:

• Architectural Pattern, which “expresses a fundamental structural organization schema for software systems. It provides a set of predefined subsystems, specifies their responsibilities, and includes rules and guidelines for organizing the relationships between them” [8].

• Design Pattern, which “provides a scheme for refining the subsystems or components of a software system, or the relationships between them. It describes a commonly recurring structure of communicating components that solves a general design problem within a particular context” [8].

• Idiom, which “describes how to implement particular aspects of components or the relationships between them using the features of a given language” [8]. Idioms are the only category restricted to a specific programming language.

Architectural patterns and design patterns, by their implicit definitions, provide adequate levels of abstraction to be used for Service Component Architecture (SCA) specification. However, the difference between these two categories lies in their respective scope: architectural patterns significantly determine the global structure of the system design, while design patterns have only a localized influence. Design patterns, in contrast with the other two categories (i.e., architectural patterns and idioms), provide an abstraction level where solutions may be partially given in terms of components; that is, they may affect only the necessary components or subsystems and the relations between them to address an expected behavior, without incorporating unnecessary components or subsystems that may overload the software system, and without resorting to specific technologies or programming languages that limit the scope of the solution. However, most of the design patterns presented in the literature were created targeting the object-oriented paradigm. Even so, most of the time it is possible to abstract them in order to adapt them to the component paradigm, which is the scope of this thesis.

Domain-Specific Design Patterns

Gamma et al. documented a set of design patterns in [12]. These patterns were proposed independently of their application domain. That is, they provide solutions for a wide variety of design

concerns. However, this broad variety makes it difficult to define their optimal application area. In contrast, domain-specific design patterns were proposed later to address specific design concerns that pertain to particular application domains and quality attributes, such as performance, security, availability, and maintainability, among others. In this direction, it has been strongly argued that design patterns can determine specific levels of software quality attributes [7][2][31][19]. Thus, domain-specific design patterns provide solutions for a well-defined application domain in which each pattern provides its proposed fit.

Domain-Specific Design Patterns for Performance

This master thesis focuses on the study of domain-specific design patterns that have been proposed in the literature to improve the performance quality attribute, as well as the different factors through which it is measured and expressed, such as latency, throughput, and capacity. Thus, we can evaluate the system performance response, providing enough information to build an initial knowledge base for practitioners, to be used in MAPE-K implementations to select an adequate design pattern according to the system's context.

Design-Pattern Elements

In general, a design pattern can be defined in terms of four essential elements according to Gamma et al. [12]:

2.4.1. Pattern name is an identifier that describes in a word or two the design problem, its solutions, and consequences.

2.4.2. Problem describes when to apply the pattern. That is, it describes the problem that arises repeatedly in the given context. Sometimes the problem description includes a set of constraints, so-called forces, that should be considered in order to apply the pattern.

2.4.3. Solution describes how to solve the problem in terms of a specified set of design elements, their relationships, responsibilities, and collaborations. Every solution is expressed mainly through two essential aspects: (i) the structure, i.e., the pattern elements and their inter-relationships, and (ii) the runtime behavior and interactions of those elements.

2.4.4. Consequences refers to the results and trade-offs of applying the pattern. Usually, these consequences are described as the impact on different quality attributes, such as performance, availability, and portability, among others.

2.5 Chapter Summary

In this chapter we have presented the state-of-the-art background of this master thesis. We introduced in depth the fundamental concepts of the knowledge areas required to develop and fulfill its goals: performance quality attributes, design patterns, and context-aware computing. We also introduced Self-Adaptive Software (SAS) systems, that is, systems that are able to modify their own behavior in response to changes in their environment. We expect the information provided in this thesis project to serve as a guide for the evaluation of the performance

of other domain-specific design patterns not evaluated here. In this way, we can construct an initial knowledge base for SAS systems that are focused on the achievement of self-configuring properties. Finally, we also introduced the Service Component Architecture as the architecture of choice for the development of this kind of systems. In the next chapter, we present how the introduced concepts and knowledge areas are considered in this thesis project, and how we work with them, establishing the basis for our contribution.

Chapter 3

Problem Modeling

This chapter describes the context of the problem that concerns this master thesis project and how this problem was modeled. It also describes the work carried out to establish and define the project scope in terms of the domain-specific design patterns, performance factors, and context variables considered in this project.

3.1 The Problem Context

The main concern of this thesis is the relation between the application of design patterns and system performance under variations of the context-variables; to the extent of our knowledge, there is not enough literature that supports this relation from a quantitative point of view. Therefore, practitioners and researchers do not have sufficient tools that provide designers with relevant information to make decisions about the selection of the most suitable design pattern to provide an expected system performance level according to the current system environment. More concretely, this thesis project aims to characterize the system response, in terms of the latency and throughput performance factors, to variations of the context-variables and the inception of different domain-specific design patterns in the system. As we stated in Chapter 2 (State-of-the-Art Background), it has been illustrated that design patterns impact the quality attributes of software systems. However, there is not enough information about how large that impact is. Similarly, there is no information on how context-variables affect the system performance. This lack of quantitative information raises some other questions involving our topic of interest:

• Which design patterns have been proposed to improve system performance?

• How much can a design pattern improve the performance with respect to another one?

• Which context-variables significantly impact the system performance?

• How much can a variable configuration impact the system performance with respect to another variable, or to another value of the same variable?

• How much can a context-variable and a design pattern configuration improve the performance of a software system?

• What performance, in terms of latency or throughput, can we expect from a system configuration under a given context?

However, these are not trivial questions. To answer them and determine the quantitative relationship between domain-specific design patterns, context-variables, and performance factors, we must perform multiple experiments that allow us to draw valid conclusions. Considering this, we determined that the best way to address the questions and objectives stated in this thesis is through a design of experiments (DoE).

3.2 The Need for an Experiments Design

Considering the questions formulated above, there is a large number of variables involved in this thesis project that influence the system performance. In order to formulate valid conclusions, we must define a suitable and systematic strategy to consider the multiple possible values of the several variables involved. The best strategy is to use a design of experiments methodology [3]. In an experiment design, we deliberately make changes in the input variables, also called factors, and then we observe how the response (or outcome) varies accordingly. Since there is a lack of quantitative information on how the context-variables and design patterns impact the performance of a system, we must consider that not all variables will affect the system performance in the same manner. Some may have a strong impact while others may have no significant impact at all. Therefore, our objective is to identify and plan the experiments design according to those variables and system configurations that have a significant impact on performance. However, if we identify in the early stages of the experimentation a variable with minimal impact, we will document its observed behavior but the variable will not be analyzed in depth.

Design of Experiments refers to the process of planning, designing and analyzing an experiment so that valid and objective conclusions can be drawn effectively and efficiently [3]. Current experiment methodologies that follow the DoE principles do not require testing all possible values of a variable. However, due to the lack of information on the behavior of the variables involved in this project, we decided to use the classic method, where experiments are conducted by changing one factor at a time. We selected a practical methodology to develop the experiments design that is divided into four phases: (i) planning, (ii) design, (iii) execution, and (iv) analysis. These phases are explained in detail in the following sections.

Considering that measurements are subject to experiment environment variations and measurement uncertainty, we decided to use the replication principle of DoE to improve the reliability and validity of the experimental results. The replication principle consists in the repetition of an entire experiment under specific conditions. This allows us to obtain a more precise estimate of the impact of the involved variables and to increase the confidence in the conclusions obtained from the analysis of the results.

3.3 Identification of Performance Factors

3.3.1 Search and Identification Process We establish a methodological process in order to identify a set of relevant performance factors through which we satisfy the objectives of this thesis project:

• Perform an exploratory search of performance factors in the literature.

• Summarize the number of occurrences in which a performance factor is used in the literature, and sort them in descending order. From the resulting list, establish the most relevant performance factors to analyze.

• Evaluate and filter the prioritized performance factors according to their compatibility with the selected design patterns.

The Exploratory Search We define the following conditions to perform the exploratory search of performance factors.

Key Words

Context variable; Performance factor; Performance or System performance; Changing Condi- tions; Self-managed systems; Impact, affect, determine; Design pattern; Performance prediction; Variables to model performance; Performance characteristics, performance properties; Software performance engineering; Performance variables.

Selected Data Bases

• Google Scholar

• ACM

• IEEE Computer Society

• Elsevier

• Springer

After an iterative process, we decided to use the following search string for the definition of the performance factors to analyze:

• (performance factor*) AND (software systems). Papers published from 2000 to 2015.

3.3.2 Distribution Results of Performance Factors Figure 3.1 summarizes the percentage of incidences of each performance factor found in the literature according to the exploratory search process. In total, there were 22 occurrences of performance factors found with the exploratory search.

3.3.3 Throughput - Special Consideration In this thesis, we distinguish between two different categories of throughput: global throughput and immediate throughput. Unlike global throughput, which matches the formal definition given in section 2.3.1, the immediate throughput refers to the number of events that have been completely processed at a particular moment within the observational time frame. Nevertheless, we will measure only global throughput in our experiments.

Figure 3.1: Distribution Results of Performance Factors Literature

3.4 Identification of Context Variables

This thesis project aims to characterize the execution context of a software system under a set of context-variables in order to provide suitable information for the system to perform a possibly required adaptation. A context variable is an entity that is modeled or controlled inside the software system or its execution environment, and whose impact on the software’s performance we expect to evaluate. In this exploratory search process we noted that some variables are named differently depending on the author, even though they present similar definitions. In light of this, we could decide either to adopt one of the names or to define a new one.

3.4.1 Search and Identification Process We establish a methodological process in order to identify a set of relevant context variables that pertain to the objectives of this thesis project. Although there is a huge number of context-variables in the literature, we are only interested in those that may affect the system performance in the context of the selected design patterns, according to their definition.

• Perform a formal exploratory search of context-variables in the literature. We search for context-variable definitions that allow us to model the system performance response of a design pattern, that is, we need to find a set of relevant context-variables from which a resulting performance behavior can be explained and determined.

• Evaluate if the context-variables are measurable and reasonably controllable in an experiment with the selected design patterns.

The Exploratory Search Similar to the process carried out for the definition of performance factors, we define the following conditions to perform the exploratory search of context-variables. Some aspects of the exploratory search were replicated from the exercise on the performance factors search.

Key Words

Context variable; Performance factor; Performance or System performance; Changing Condi- tions; Self-managed systems; Impact, affect, determine; Design pattern; Performance prediction; Variables to model performance; Performance characteristics, performance properties; Software performance engineering; Performance variables.

Selected Data Bases

• Google Scholar

• ACM

• IEEE Computer Society

• Elsevier

• Springer

After an iterative process, we decided to use the following search string for the definition of the context-variables to analyze:

• (context variable*) AND (impact performance OR affect performance) AND (software sys- tems). Papers published from 2000 to 2015.

Variables’ Categorization To help in the selection of suitable context-variables coherent with this thesis’ goals, we define the following categorization of the variables found.

• Internal and External Context-Variables: internal variables are those whose values are changed or configured during normal system operation or system programming, respectively, such as batch size and memory structure, among others. Contrary to internal variables, external variables are those that are out of the scope of the operating system or software, but whose variations are able to affect the software, such as the number of available CPU cores and the bus width, among others.

• Classification: we classify each variable depending on what causes its variation. We identify the following causes: (i) software, like the number of concurrent users and the task buffer size, (ii) hardware, like the cache configuration and power consumption, (iii) network, like the communication time, and (iv) memory, like the memory structure.

3.4.2 Context-variables Categorization From the search process we identified 28 potential context-variables for this thesis project. The list and categorization of the variables are presented in Table 3.1.

3.4.3 Context Variables - Special Considerations Buffer Size For the purpose of this thesis project, the buffer size is associated only with queue collections. We will not set the size of the queues in the case studies; instead, we will use collections that grow dynamically, and we will monitor the queue growth.

Number of Distributed Task Processors This thesis explores the hypothesis enunciated in the variable’s definition, namely that a distributed program will be faster the more distributed processing nodes it has available, within certain limits. We will determine the validity of this hypothesis and its associated impact.

RAM Memory Usage As the case studies are developed in the Java programming language, we also evaluate the best configuration for the JVM in the experiment environment.

Table 3.1: Context-Variables Categorization

Id | Context Variable | Internal or External | Category
1 | Number of available CPU cores | External | Hardware
2 | Number of available distributed task processors | External | Hardware
3 | Power consumption | External | Hardware
4 | Network usage | Internal / External | Software / Network
5 | Core's frequency | External | Hardware
6 | Number of service requests | External | Software
7 | Number of concurrent users | External | Software
8 | Bus width | External | Hardware
9 | Cache configuration | External | Hardware
10 | Operating system | External | Software
11 | Processor usage | Internal / External | Software / Hardware
12 | RAM memory usage | Internal / External | Software / Hardware
13 | CPU processor temperature | External | Hardware
14 | Network bandwidth | External | Network
15 | Task buffer size (Queue) | Internal | Software
16 | Batch time span | Internal | Software
17 | Task dependency | Internal | Software
18 | Synchronous or asynchronous communication | Internal | Software
19 | Synchronous or asynchronous task generation | Internal | Software
20 | Shared resources | Internal / External | Software / Hardware / Network
21 | Batch size (coarse, fine, and medium) | Internal | Software
22 | Centralized or distributed control mechanism | Internal | Software
23 | Heterogeneous or homogeneous tasks | Internal | Software
24 | Number of task generators | Internal | Software
25 | Multiple buffers | Internal | Software
26 | Memory structure (UMA, NUMA, NORMA) | Internal | Memory / Hardware / Network
27 | Communication protocol between components | Internal | Software
28 | Communication time | Internal | Software / Network

3.5 Identification of Domain-Specific Design Patterns for Performance

This section describes the process followed to identify the design patterns for performance that have been published in the literature. To accomplish this task we decided to perform a Systematic Literature Review (SLR). As stated in [20], an SLR is a methodologically rigorous review of research results. The aim of an SLR is not just to aggregate all existing evidence on a research question; it is also intended to support the development of evidence-based guidelines for practitioners. For the research methodology used to conduct the SLR, we first defined the research questions together with the research protocol. Then, an extensive literature search was conducted and the candidate literature was selected. Based on the resulting data set (i.e., the set of papers), we defined a filter to identify the relevant literature. Later, we defined a data extraction and synthesis template, to finally define the criteria to select the set of domain-specific design patterns to consider in this project.

3.5.1 Overview of the Research Methodology This section illustrates the most relevant aspects (i.e., a summary) of the methodology enunciated above. If more details are required, we invite the reader to study our related technical report [38]. For information on how to perform an SLR, we recommend the technical report Guidelines for performing Systematic Literature Reviews in Software Engineering [17].

Research Questions The research questions addressed in the SLR document were:

3.5.1. Which are the domain-specific design patterns that have been proposed to improve the performance of software systems?

3.5.2. To what extent do domain-specific design patterns applied to the system design enhance the software performance?

3.5.3. Which are the metrics or methods used to evaluate the improvement in the performance of software systems obtained by the implementation of domain-specific design patterns?

Keywords and General Search String Keywords Performance, Throughput, Scalability, Design pattern, Load balance, Quality attributes, Quality improvement, performance metrics, quality metrics, concurrency, distributed systems, distributed processing, parallel systems, quality attributes assessment, quality attributes evaluation, reference models (for performance), architectural styles (for performance).

General String (i.e., without editorial format) (performance OR concurrency OR throughput OR scalability OR "load balanc*") AND ("design pattern*") AND (distribut* OR parallel OR quality OR improvement OR measur* OR evaluat* OR assessment) AND publication-year ≥ 2006

Inclusion Criteria 3.5.1. Published papers since 2006; we decided to take papers from the last 10 years because we consider this a suitable period to evaluate the advances in software engineering.

3.5.2. Studies that contain experiments regarding domain-specific design patterns that improve the performance of software systems.

3.5.3. Studies with theoretical models about the measurement of the performance in a software system.

Quality Assurance Process To determine whether a study is relevant for this project, it should meet some of the following characteristics:

3.5.1. The article must present at least one design pattern or strategy that improves performance.

3.5.2. The article should present experiments that support the performance improvement achieved when the design patterns or strategies are implemented.

3.5.3. The article should present a mathematical model that allows predicting the performance improvement achieved when the design patterns or strategies are implemented. Additionally, the article should present experiments that support the actual performance improvement.

3.5.4. The patterns or strategies presented are focused on improving the performance directly.

3.5.5. More than one article presents the same pattern or strategy.

The Filter Process The filter process consists of selecting and refining the results obtained from the search strings used in the corresponding databases. The general aim of this process is to filter out the irrelevant papers from the search. The process started with the 248 papers resulting from the search process. Subsequently, the papers were evaluated based on the content of their abstracts; if an abstract fulfilled the inclusion criteria, the paper was accepted. Finally, the papers were studied in depth with the purpose of finding relevant design pattern descriptions that fulfill the inclusion criteria and the quality assurance statements. From the resulting papers of the final process, a snowball process was conducted to select the most relevant references of the papers, which may emphasize or provide extra information for a better understanding of them. Figure 3.2 summarizes the results of the filter process for each database considered.

3.6 Domain-Specific Design Patterns for Performance

Even though this section could have been included in the state-of-the-art chapter, the work involved in describing these domain-specific design patterns in a standardized and technically detailed way made us consider it more a contribution than a state-of-the-art description. This is especially true for the analysis that we had to perform on the papers and for the synthesis of the characterization of the structure and behavior of the patterns.

Figure 3.2: Filter-Process Stage Results (papers per database at the initial search / abstract review / full-text review stages: IEEE Xplore 62 / 16 / 6; ACM 41 / 15 / 7; Springer 107 / 24 / 6; Elsevier 38 / 4 / 1; snowball papers 15 found / 6 accepted; relevant studies in total: 26)

3.6.1 The Design Pattern Template This section describes the template used in the next sections to characterize the design patterns that were considered relevant for this thesis project according to section 3.5. We placed special emphasis on the construction of this template, since the selected articles present the design patterns in a non-standardized way regarding structure and behavior; this thesis project standardizes the structure and behavior in UML class diagrams and UML sequence diagrams, respectively. This template is based on the design pattern elements described in chapter 2.

Intent The intent summarizes the intention of the design pattern, that is, it summarizes the problem and context in which it is applicable and gives a brief understanding of the design pattern.

Problem The problem describes when to apply the pattern. That is, it describes the problem that arises recurrently in the given context and for which the design pattern was created.

Context The context indicates the situations and conditions that must hold within the target problem and that maximize the pattern applicability.

Forces The forces describe the constraints that govern the problem and its possible solutions, in light of the pattern. These must be strictly considered in order to apply the design pattern and its variations.

Structure The structure describes the solution offered by the design pattern to the recurrent problem in terms of a set of specific design elements, their relationships, responsibilities, and collaborations. The structure is presented through a UML class diagram.

Behavior The behavior presents the runtime behavior and interactions of the elements introduced in the structure section. The behavior is presented through a UML sequence diagram.

3.6.2 Random Access Parser Design Pattern Intent The Random Access Parser design pattern creates a navigable data structure from large plain files that follow a certain format, making it easier to access data records in both directions, backward and forward, while operating over them. Navigating the data structure produces per-record snapshots that are written to disk, which reduces reading time and memory consumption.

Problem Usually, processing a large file of structured data (e.g., XML or JSON) requires different kinds of operations, for example, to insert, remove, and update data records in a database management system. The structured data can follow a standard format, typically some conceptual variation of the WebRowSet format. The WebRowSet format comprises three parts: properties, metadata, and data. In the context of the previous example, the properties section contains details regarding the Relational Database Management System (e.g., synchronization provider, isolation level, and rowset type). The metadata section contains information about the database structure (e.g., column numbers, their names, and types). Finally, the data section contains the application data. For the processing, data records might need to be accessed randomly and may refer to other records, possibly located far before or after the current one. Another example arises when a file can be processed per table; this requires reading not only the concrete data but also the associated metadata and properties. In case the metadata or properties are modified, it is likely that the data also needs to be modified; performing such modifications implies moving forward and backward in the file, and therefore can introduce performance issues. These issues commonly arise because an efficient processing would imply loading the whole file into memory, which is not possible.

Context Typical processing of data sets requires iterating over the data records. Even though these data sets may be very large, a common approach is to load all the data into main memory. However, this negatively affects the overall application performance, and in some cases it is not even possible to load the whole file into memory. For this reason, on-demand reading is desired and sometimes required. Moreover, this strategy allows the processing to behave in a scalable and stable manner, in terms of time and memory consumption.

Figure 3.3: Structure Diagram - Random Access Parser Design Pattern

Forces Efficient random access to a large structured data file. This design pattern proposes the use of two lookup tables. On the one hand, one table is used to remember the data records already read, updated, inserted, or deleted, through parser snapshots using the Memento design pattern. On the other hand, the other table is used to maintain the original data structure. These snapshots allow an efficient and manageable random access and avoid the memory overflow caused by the processing of large files.

Structure Diagram - Figure 3.3 Participants

Domain Specific Parser: An abstract class that contains the abstract method parse. This method allows to transform the contents of a file into a data structure, for instance, the corresponding data structure for the XML or JSON formats.

Concrete Parser: A concrete implementation of the DomainSpecificParser class. It is responsible for implementing the parse method. Some well-known implementations of the ConcreteParser class can be found in XML-parsing libraries such as SAX, DOM, and XPath.

Lookup Table: This component allows navigating among the memento snapshots.

Table Element: This component is responsible for encapsulating and identifying each memento element.

Memento: Memento is a design pattern designed to save a specific object state. Given that the object states are saved, it can roll back to a previous state.

Random Access Parser: This is the main component. It is responsible for reading a file, transforming it into a data structure by means of a ConcreteParser, performing operations (i.e., update, delete, insert) over the data structure, and storing and restoring states of the data structure.

Figure 3.4: Processing Sequence Diagram - Random Access Parser Design Pattern

Behavior Processing Scenario - Figure 3.4. This scenario describes the normal processing behavior of the RandomAccessParser design pattern.

• When any record is requested, no matter its location, the RandomAccessParser searches its LookUpTable for the required element.

• The LookUpTable searches among its TableElements for the right element according to the elementId.

• Once the correct TableElement is found, it gets its associated Memento object, and the memento is returned to the RandomAccessParser, which returns the DataStructureRecord.
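The following minimal Java sketch illustrates the lookup-table-of-mementos mechanism described above. The class and method names (RandomAccessParserSketch, RecordMemento, and the trivial parsing logic) are illustrative assumptions used only for exposition; they are not the implementation evaluated in this thesis.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the Random Access Parser idea: a lookup table of
// memento snapshots allows random access to records of a large file
// without keeping the whole parsed structure in memory at once.
public class RandomAccessParserSketch {

    // Memento: immutable snapshot of one parsed record.
    static final class RecordMemento {
        final int recordId;
        final String state;
        RecordMemento(int recordId, String state) {
            this.recordId = recordId;
            this.state = state;
        }
    }

    // Lookup table: maps record ids to their memento snapshots.
    private final Map<Integer, RecordMemento> lookUpTable = new HashMap<>();

    // Domain-specific parsing of one raw record (assumed trivial here).
    private RecordMemento parse(int recordId, String rawRecord) {
        return new RecordMemento(recordId, rawRecord.trim());
    }

    // get(): return the snapshot if already known, otherwise parse on demand.
    public RecordMemento get(int recordId, String rawRecord) {
        return lookUpTable.computeIfAbsent(recordId, id -> parse(id, rawRecord));
    }

    // update(): store a new state for a record, replacing its snapshot.
    public void update(int recordId, String newState) {
        lookUpTable.put(recordId, new RecordMemento(recordId, newState));
    }

    public static void main(String[] args) {
        RandomAccessParserSketch parser = new RandomAccessParserSketch();
        RecordMemento first = parser.get(1, " <row id='1'>A</row> ");
        parser.update(1, "<row id='1'>B</row>");
        System.out.println(first.state + " -> " + parser.get(1, "").state);
    }
}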

3.6.3 Reactor Design Pattern Intent The Reactor design pattern handles different types of concurrent service requests that are delivered by one or several clients. Service requests are received by a service-specific event handler, working separately from the service implementation. These event handlers are registered into a dispatcher, which is in charge of executing the corresponding services.

Context A server application concurrently serves several types of service requests from one or more distributed clients. These requests are internally handled as events by event handlers.

Problem In distributed environments, servers offer different services, and a single server can receive different request types. Processing a request can imply locks at the request arrival point while client requests are serviced. These locks can negatively impact the system performance. Additionally, different request types require different handlers, hence it is necessary to select the adequate ones; performing this selection dynamically increases the time to respond to each request. A typical solution is to use a thread for each request; however, this solution implies system overhead when threads finish processing and become idle, especially when requests are not uniformly distributed among request types.

Figure 3.5: Structure Diagram - Reactor Design Pattern

Forces Favor service availability. This pattern uses a dispatcher to redirect requests to the corre- sponding handlers, therefore the server does not block while attending a single request.

Increase server efficiency. By using a request dispatcher, and by avoiding idle threads in request processing, this pattern aims at reducing unnecessary CPU usage, minimizing the latency of service requests, and maximizing throughput.

Ease service adaptability. Given that handlers are registered with a single request dispatcher, this pattern eases modifying or adding handlers for different types of service requests.

Structure Diagram - Figure 3.5 Participants

Request Processor: Represents the request reception point. This class contains the necessary information to redirect the request to the corresponding handler to process the service request.

Request Handler Interface: An interface specifying the required service to handle requests. The service specified by this interface must be implemented by concrete request handlers.

Concrete Request Handler: A concrete class implementing the request handler interface. Concrete request handlers are registered with the initiation dispatcher. When a request corresponding to the request handler type arrives, the handler is called by the initiation dispatcher.

Synchronous Request Demultiplexer: Waits for a request to occur, and returns a handler without blocking the initiation dispatcher, thus allowing the system to continue serving requests.

Initiation Dispatcher: Responsible for both registering and removing request handlers, and for dispatching service requests. It waits for requests to be processed by the synchronous request demultiplexer component, and calls the concrete request handler component to process each request.

Behavior

Figure 3.6: Request Processing Sequence Diagram - Reactor

Processing Scenario - Figure 3.6.

• When an Event handler is registered in the Initiation dispatcher, a type of event is specified by the application registering the handler; when an event of this type occurs on the associated Handle, the Event handler will be notified.

• The Initiation dispatcher gets the associated Handle from each Event handler once it is registered.

• After all the Event handlers are registered, the event loop is started in the Initiation dispatcher by an application. Then, the Synchronous event demultiplexer is executed and waits for events.

• When a new event arrives (i.e., a Handle becomes “ready”), the Synchronous event demultiplexer notifies the Initiation dispatcher.

• When the Initiation dispatcher is notified about a new event, it calls the corresponding Event handler callback method. The Initiation dispatcher uses the Handles to find the appropriate Event handler callback method.

• The Initiation dispatcher calls back the handle event hook method of the Event handler to perform application-specific functionality in response to the event.
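The following minimal Java sketch illustrates the dispatching mechanism described above: handlers are registered per request type and the dispatcher routes each incoming request to its handler. The event loop and the synchronous demultiplexer are omitted for brevity, and all names (ReactorSketch, RequestType, registerHandler) are illustrative assumptions rather than the thesis' implementation.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Minimal sketch of the Reactor idea: service-specific handlers are
// registered with a single dispatcher, which routes each request to the
// handler of its type without blocking on other request types.
public class ReactorSketch {

    enum RequestType { READ, WRITE }

    static final class InitiationDispatcher {
        private final Map<RequestType, Consumer<String>> handlers = new HashMap<>();

        // register_handler: associate a handler with a request type.
        void registerHandler(RequestType type, Consumer<String> handler) {
            handlers.put(type, handler);
        }

        // handle_request: dispatch the request to its registered handler.
        void handleRequest(RequestType type, String request) {
            Consumer<String> handler = handlers.get(type);
            if (handler != null) {
                handler.accept(request);
            }
        }
    }

    public static void main(String[] args) {
        InitiationDispatcher dispatcher = new InitiationDispatcher();
        dispatcher.registerHandler(RequestType.READ,
                req -> System.out.println("read handler got: " + req));
        dispatcher.registerHandler(RequestType.WRITE,
                req -> System.out.println("write handler got: " + req));

        // Requests of different types are served by their specific handlers.
        dispatcher.handleRequest(RequestType.READ, "GET /status");
        dispatcher.handleRequest(RequestType.WRITE, "PUT /config");
    }
}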

3.6.4 State-based Pipeline Design Pattern Intent The State-based Pipeline design pattern takes advantage of the simplicity of the Pipeline design pattern and of the efficient execution of a master/slave structure. This variant of the original Pipeline design pattern reduces prominent downsides that coarse-grained applications face in achieving good performance, namely: first, when the pipeline is not full (i.e., at the beginning and end of the processing) stages are idle; second, load balancing is crucial to achieve good performance, as expensive stages cause less expensive stages to stay idle; and third, it is difficult to incrementally add more processors to an existing pipeline, given that concurrency in a pipeline is tightly coupled with the set of stages.

Problem The pipeline structure has proven to be a useful way to solve many practical problems in parallel programming; however, it presents three serious problems, especially for coarse-grained applications.

• Processors are idle at the beginning and the end of the pipeline processing.

• Traditional pipeline implementations are sensitive to load imbalance, meaning that some pipeline stages are more time-consuming than others. The slowest stage becomes a bottleneck, decreasing the performance.

• Traditional pipeline implementations imply a static assignment of the stages to the nodes, making it difficult to take advantage of new nodes.

Context A pipeline consists of a set of ordered stages, where each stage receives data from its predecessor, transforms that data, and finally sends it to the next stage. Usually, for data transference it is necessary to place buffers between the stages. The paramount characteristic of the pipeline is that each stage is independent; that is, each stage can perform different computations on different parts of the data, simultaneously. However, the Pipeline pattern presents serious performance problems, such as ramp-up and ramp-down times and load imbalance. The State-based Pipeline should be used when:

• Load balance must be guaranteed.

• New processors or stages could become available at any time and should be taken advantage of.

Forces Decouple the concurrency from the pipeline structure. By using the State-based Pipeline design pattern, stages (i.e., state transitions) are independent of the execution thread, improving load balancing between pipeline stages and reducing idle times.

Clarify transitions between stages. There should be a clear agreement on the order in which the transformations occur from one stage to another. Thus, if required, it is simple to add, modify, or reorder states in the pipeline.

Structure Diagram - Figure 3.7 Participants

AbstractStage An abstract representation of the stages, containing an abstract method to transform one stage into another.

ConcreteStage Represents a stage within the pipeline. Its transform method returns a concrete stage (the subsequent stage).

Pipeline This class contains the configured concrete stages, the queues storing intermediate states (i.e., instances of ConcreteStage), and the slave threads consuming objects from the queues.

Figure 3.7: Structure Diagram - State-based Pipeline Design Pattern

Figure 3.8: Processing Sequence Diagram - State-based Pipeline Design Pattern

Behavior Processing Scenario - Figure 3.8

• Put the request objects into the input buffer.

• An idle thread from the thread pool will find any request object in any buffer placed between stages and execute its transform() method, in order to obtain the next state object.

• The state object resulting from the transform() method is placed into an output buffer corresponding to its runtime type.

• The final buffer holds the resulting output objects from the pipeline execution.
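A minimal Java sketch of this behavior is shown below, assuming a two-stage pipeline (parse, then square) chosen only for illustration: each request object implements a transform() method that returns the next state, and a small thread pool advances whichever pending object is available, so expensive stages do not leave other workers idle.

import java.util.concurrent.*;

// Minimal sketch of the State-based Pipeline idea: requests are state objects
// that know how to transform themselves into the next state, and any idle
// worker advances any pending object by one stage.
public class StateBasedPipelineSketch {

    interface Stage { Stage transform(); }

    static final BlockingQueue<Stage> pending = new LinkedBlockingQueue<>();
    static final BlockingQueue<Integer> results = new LinkedBlockingQueue<>();

    // Stage 1: parse raw text into an integer.
    static final class RawInput implements Stage {
        private final String text;
        RawInput(String text) { this.text = text; }
        public Stage transform() { return new Parsed(Integer.parseInt(text.trim())); }
    }

    // Stage 2: square the parsed value and publish the result (end of pipeline).
    static final class Parsed implements Stage {
        private final int value;
        Parsed(int value) { this.value = value; }
        public Stage transform() { results.add(value * value); return null; }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 1; i <= 4; i++) pending.add(new RawInput(" " + i));

        ExecutorService workers = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 2; i++) {
            workers.execute(() -> {
                // Stages are decoupled from the executing threads: a worker
                // takes whatever object is pending and advances it one stage.
                Stage s;
                while ((s = pending.poll()) != null) {
                    Stage next = s.transform();
                    if (next != null) pending.add(next);
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("results: " + results);
    }
}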

3.6.5 Thread Pool Design Pattern Intent The Thread Pool design pattern facilitates thread management. In parallel environments, where some tasks can be executed at the same time, perhaps with different instantiations, in the same device, threads should be used. Each thread executes its own task and shares the device resources. When many threads are executed concurrently in the same device, resources may not be enough and the device capacity can be overrun.

Problem Concurrency is usually realized by having several threads executing in a computing device. However, each of these threads consumes device resources, and eventually many threads might overload the device. Managing these threads is a problem in itself. For example: can threads be reused? What would be the maximum or minimum number of threads for the device?

Context In general, each time a problem solution requires the concurrent execution of threads, this pattern can be used. Threads are used to execute several independent computing tasks at the same time; however, these threads imply costs in time (i.e., time to start a thread) and resources (memory and CPU). This pattern proposes a solution that balances the implications of using threads. Furthermore, many other patterns that imply the use of threads implement this pattern as part of their solution. This is the case, for instance, of the Leader/Followers and Master/Workers design patterns. Although this pattern allows taking advantage of the multi-core environment of a CPU, which can improve the system performance, it has limitations, such as the difficulty of scaling the pattern implementation to a distributed environment (i.e., applying this pattern using many processing nodes).

Forces All tasks should be independent. Each of the tasks should be fully and independently executable in one thread. If there are dependencies, deadlocks can happen. For instance, if all the pool's threads are waiting for another task, that task will not be executed and thus the threads could enter a deadlock.

Thread creation cost is relatively high. Creating a thread for each task should be costlier than maintaining and waiting for idle threads, both in terms of time and resources.

Optimal number of threads. The designer should configure the optimal number of threads depending on the device's resource availability and execution capability. This relation should be maintained dynamically, as the device load evolves.

Threads not reusable. If a task lasts indefinitely, the thread that executes this task is not reusable because it may never finish executing the task. This type of task should be executed by a thread outside the pool.

Figure 3.9: Structure Diagram - Thread Pool Design Pattern

Structure Diagram - Figure 3.9 Participants

Worker: Responsible for executing tasks of the ThreadPool. It gets Runnable objects from its ThreadPool and executes their run method.

Runnable: Represents the interface implemented by tasks that ThreadPool must process. This interface specifies the run method, which is responsible for defining the task of the Runnable object.

ThreadPool: The ThreadPool class is responsible for managing the workers and tasks and their execution. This class assigns one task to each worker. When a worker finishes a task execution, it can ask the ThreadPool for a new task.

Executor: Represents the interface implemented by ThreadPool class. This interface specifies the execute method, which is responsible for executing the task of the Runnable object.

Behavior Processing Scenario - Figure 3.10. Every time a request arrives, the pool verifies whether idle threads exist; if not, and if the maximum number of threads has not been exceeded, a new thread is created. If there is at least one thread available to process the request, the pool sends the request to be executed; otherwise, the request must wait until a thread becomes idle. Finally, the thread returns to the pool as an idle thread when the processing of the request is finished.

Figure 3.10: Request Processing Sequence Diagram - Thread Pool
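Java's standard java.util.concurrent library already provides this pattern through the ExecutorService abstraction. The following minimal sketch shows the behavior described above with a fixed-size pool; the task bodies and the pool size are illustrative assumptions.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the Thread Pool pattern: a fixed pool of worker threads
// executes independent Runnable tasks, reusing threads instead of creating
// one thread per task.
public class ThreadPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // Fixed-size pool: at most 4 worker threads are ever created.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 10; i++) {
            final int taskId = i;
            // Tasks beyond the pool size wait in the pool's internal queue
            // until a worker thread becomes idle.
            pool.execute(() -> System.out.println(
                    "task " + taskId + " on " + Thread.currentThread().getName()));
        }

        pool.shutdown();                       // stop accepting new tasks
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}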

3.6.6 Master-Worker Design Pattern Also known as:

• The Embarrassingly Parallel Pattern

• Task Queue

• Master-Slave

Intent The Master-Worker design pattern describes how to execute a collection of independent tasks (i.e., tasks that can be executed concurrently) on a group of available processors, named workers. Furthermore, the distribution of tasks among processors can be performed statically or dynamically in order to promote a balanced computational load.

Problem Many computational problems are solved by splitting them into independent subproblems, such that their solution implementations can be executed independently and concurrently. The independence among subproblems implies that the tasks associated with each solution implementation do not share read and write data, and that tasks must not wait for other tasks' results. Practitioners should take advantage of the available computational resources and the inherent concurrency without incurring unnecessary overhead. Given this situation, a program should be designed to balance the load among the available processing units. To exemplify this type of problem, take the vector-addition problem. Given vectors A, B, and C where C = A + B, each element of C is given by adding the corresponding elements of A and B, Ci = Ai + Bi. Each element of C can therefore be calculated concurrently, as a set of subproblems of the vector-addition problem.

Context Problems whose solutions can be split into independent tasks can take advantage of this design pattern. However, some particularities should be evaluated before implementing it: (i) the cost to initialize workers (including, e.g., data transmission) must be lower than the task cost, (ii) the number of tasks must be greater than the number of available processing units, and (iii) the distribution should be dynamic if the load of each task is unknown or varies unpredictably, or when the load supported by each worker is unknown.

Figure 3.11: Structure Diagram - Master-Worker Design Pattern

Forces Tasks must be independent. The Master-Worker design pattern applies when tasks do not have dependencies among them. Otherwise, the tasks must be redesigned to eliminate these dependencies.

Tradeoffs between data communication and load. To apply this design pattern, the task size must be optimal according to the tradeoff between distribution overhead and the load implied by the task, given that the size of tasks in Master-Worker can vary from one task to another.

Unpredictable number of tasks and processor nodes. Most of the time, explicit predictions of the hardware and software runtime environment are not possible. However, Master-Worker attempts to achieve load balancing even under uncertain environments, by allocating tasks to idle processor nodes. This scenario corresponds to the dynamic version of Master-Worker and also to the Fork-Join design pattern.

Number of tasks and processor nodes are known. When the number of tasks and the load of the processor nodes are known prior to execution, practitioners can program the software to statically assign tasks to the most suitable processor node, guaranteeing load balancing. This scenario corresponds to the static version of Master-Worker1.

Structure Diagram - Figure 3.11 Participants

Master The Master contains the shared collection of tasks, usually a queue, where tasks are stored after being split. It has a second shared queue where the results of worker computations are stored. In summary, the Master holds the registered tasks, launches the processors (workers), and collects the workers' results to produce the final result.

Worker Workers request a task from the shared collection of tasks registered in the master's queue and process it. Finally, workers return the partial results of their computation to the master.

1 Some authors do not consider the behavior described by the static version of the Master-Worker design pattern as part of this design pattern; instead, they prefer to describe this behavior as a whole new design pattern.

Figure 3.12: Processing Sequence Diagram - Master-Worker Design Pattern

Behavior Processing Scenario - Figure 3.12 This scenario describes the usual processing behavior of the Master / Worker design pattern.

• When a master launches workers, they request tasks from the master.

• When a task is assigned to a worker, the worker processes the task and returns the result to the master.

• When the worker finishes processing its current task, it requests a new task from the master, until all tasks have been processed and the master shuts down all workers.

Finishing - Figure 3.12 This scenario describes how the Master / Worker design pattern is finished.

• When all tasks have been processed, the master processes the collected partial results and returns the final result; if necessary, the master shuts down all workers.

Special circumstances and variations.

• Usually the problem's tasks return their results to the master; however, this pattern can use a shared data structure to accumulate the partial results.

• The termination condition is usually met when all tasks are completed; however, there are problems whose final result can be obtained before all tasks are completed. For instance, a search in a database where each worker has an independent search space finishes as soon as the first worker finds the searched element.

• In some cases, not all tasks are known initially; that is, new tasks are generated while other tasks are in execution. In these cases, it is very important to ensure a termination condition.
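The following minimal Java sketch illustrates the dynamic Master-Worker behavior using the vector-addition example given earlier: the master places index ranges in a shared queue and the workers repeatedly pull and process them. The chunk size, number of workers, and class names are illustrative assumptions.

import java.util.concurrent.*;

// Minimal sketch of the dynamic Master-Worker pattern applied to C = A + B:
// the master splits the index range into independent tasks in a shared queue,
// and each worker repeatedly requests a task and computes its slice.
public class MasterWorkerSketch {

    static final int N = 1_000, CHUNK = 100, WORKERS = 4;

    public static void main(String[] args) throws InterruptedException {
        double[] a = new double[N], b = new double[N], c = new double[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        // Master: create tasks (start indices of each chunk) in a shared queue.
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        for (int start = 0; start < N; start += CHUNK) tasks.add(start);

        // Workers: repeatedly request a task and process it until none remain.
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        for (int w = 0; w < WORKERS; w++) {
            pool.execute(() -> {
                Integer start;
                while ((start = tasks.poll()) != null) {
                    for (int i = start; i < Math.min(start + CHUNK, N); i++) {
                        c[i] = a[i] + b[i];   // independent: each index written once
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        System.out.println("c[0]=" + c[0] + ", c[N-1]=" + c[N - 1]);
    }
}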

3.6.7 Separable Dependencies Design Pattern (Variant of 3.6.6) Intent Usually, complex tasks can be split into simpler tasks. This partition must be performed based on a dependency analysis of the tasks and the shared data. This pattern eases the decomposition by eliminating dependencies among the simple tasks through (i) replication of the global data, and (ii) merging of the individual task results into global computations.

Problem One way to address concurrency is through task-based algorithms. However, these algorithms pose two main challenges: (i) distributing tasks among processing nodes, and (ii) managing dependencies among tasks (including resource use). The Separable Dependencies design pattern supports problems where these challenges can be addressed separately, that is, where dependencies can be factored out of the set of concurrent tasks, allowing the solution to take advantage of concurrency.

Context This pattern should be used when the problem can be solved with a set of concurrent tasks where (i) only one or none of the tasks modifies the global data, while the other tasks need only its initial value (replicated data), and (ii) the final result can be constructed by combining the results of the independent tasks.

Forces Removal of dependencies among tasks. Separating the dependencies among tasks and resolving how to share the required data allows taking advantage of concurrency. First, tasks are classified according to their dependencies: the ones that can be executed at the same time, and the others that need to wait for other tasks to finish. Second, a mechanism is defined that allows sharing data among concurrent tasks (in this case, through replication). Of course, not all problems admit solutions with these characteristics.

Structure Diagram - Figure 3.13 Participants

Master This class has the same responsibilities as the master in the Master / Worker design pattern (Section 3.6.6): "The Master contains the shared collection of tasks, usually a queue, where tasks are stored after being split. It has a second shared queue where results of worker computations are stored. In summary, the Master holds registered tasks, launches the processors (workers) and collects the workers' results to produce the final result." In this pattern, this class has an additional process to carry out: the Master must create the new tasks with the replicated data.

Figure 3.13: Structure Diagram - Separable Dependencies Design Pattern

Worker This class has the same responsibilities as the worker in the Master / Worker design pattern (Section 3.6.6): "Workers request a task from the shared collection of tasks registered in the master's queue, and process it. Finally, workers return partial results of the computation to the master."

Task It is responsible for defining and creating the independent tasks, based on the replication of the data needed to carry out each of them, as previously defined by the Master. This is the main difference between the Master / Worker design pattern and this variation.

Behavior Figure 3.14 - Separable Dependencies Design Pattern

The scenario described for the Master/Worker design pattern is almost the same for this pattern (Section 3.6.6). The main difference lies in the Task class and the Master class. At creation time, each task must perform two jobs: (i) define the task functionality and (ii) replicate the required data. The Master defines the data for each task.

Figure 3.14: Processing Sequence Diagram - Separable Dependencies Design Pattern
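A minimal Java sketch of this idea is shown below: every task receives its own copy of a piece of global data (a hypothetical weight vector used only for illustration), so the tasks share nothing at runtime, and their partial results are merged at the end.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

// Minimal sketch of the Separable Dependencies idea: the master gives every
// task a private copy of the global data (replication), so tasks are fully
// independent, and the partial results are merged into the global result.
public class SeparableDependenciesSketch {

    public static void main(String[] args) throws Exception {
        double[] globalWeights = {0.2, 0.3, 0.5};
        double[][] inputs = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};

        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<Future<Double>> partials = new ArrayList<>();

        for (double[] input : inputs) {
            // Dependency separated by replication: each task works on its own
            // copy of the global data.
            double[] weightsCopy = Arrays.copyOf(globalWeights, globalWeights.length);
            partials.add(pool.submit(() -> {
                double dot = 0;
                for (int i = 0; i < input.length; i++) dot += input[i] * weightsCopy[i];
                return dot;
            }));
        }

        // Merge the independent partial results into the global result.
        double total = 0;
        for (Future<Double> partial : partials) total += partial.get();
        pool.shutdown();
        System.out.println("merged result = " + total);
    }
}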

3.6.8 Fork / Join Design Pattern Intent The Fork/Join pattern is designed to take advantage of concurrency in problems that can be split into several independent tasks, with the particularity that these tasks are created dynamically.

Problem Designers always try to take advantage of concurrent independent tasks. However, these tasks could be known prior to system execution, allowing a static assignment, or they could be known only at runtime, thus requiring a dynamic assignment. Usually, the dynamic assignment uses iterative and recursive loops, task queues, or division of functions in order to vary the number of concurrent tasks according to the computing needs. On the one hand, iterative loops and task queues are handled by the Master/Worker pattern. On the other hand, other dynamic assignment strategies, for example recursive loops, need an efficient way to be handled. For instance, the mergesort problem can be split in a recursive way: the original file to sort is split in halves until reaching a threshold (i.e., the maximum size that is worth sorting on a single processing node). Therefore, in this case, the number of sorting tasks is only known when the splitting phase is finished, or when the amount of items to be sorted is known. However, while the splitting phase is still running, processing units can start the execution of tasks that have already reached the threshold. After the sort phase finishes, a merge phase is needed to merge the partial sort results.

Figure 3.15: Structure Diagram - Fork / Join Design Pattern

Context This pattern should be used when the problem can be split into a set of independent tasks to take advantage of concurrency, but the number of tasks is usually unknown before execution, making it difficult to use simple control structures to manage them. The dynamic process of task creation is named Fork, and the process of joining a finished task with its parent task, or with other tasks created by the same fork, is named Join.

Forces Relationship among generated tasks. Due to the nature of the addressed problems, complex or recursive relationships between tasks are created. Hence, it is very important to ensure that all tasks will finish and that deadlocks are not generated.

Processing units address the forking of tasks. Traditionally, tasks are mapped one-to-one onto processing units; however, in the context of multi-core processors, it is important to consider load and capacity.

Creation, destruction, and assignment of tasks to processing units can be costly. If too many tasks are created, the overall system performance could be affected due to the use of unnecessary resources; however, if too few are created, resources can be underutilized.

Structure Diagram - Figure 3.15 Participants

ForkJoinMaster: This class is responsible for managing a thread pool where tasks can be executed. Additionally, it starts the execution through the invoke method.

ForkJoinTask: It is responsible for evaluating when a task should be forked, joined, or executed. Thus, this class is responsible for avoiding the overhead, and the implied performance loss, caused by over- or under-forking.

Figure 3.16: Request Processing Sequence Diagram - Fork / Join

ThreadPool: This class has the same responsibilities described in the ThreadPool design pattern (see section 3.6.5).

Thread: It is responsible for processing ForkJoinTasks assigned by the ThreadPool.

Behavior Processing Scenario - Figure 3.16.

Once the invoke method is called, the ThreadPool assigns the task to one thread, which executes the compute method of the ForkJoinTask. This method evaluates whether the task should be processed, forked, or joined. If the task is forked, it is split according to the logic determined by the programmer and the new tasks are executed in threads assigned by the thread pool (aiming at reusing threads). This point is especially critical because the programmer must determine the point up to which tasks must be forked to ensure efficiency.
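Java provides this pattern directly in the java.util.concurrent framework through ForkJoinPool and RecursiveTask. The following minimal sketch applies it to a recursive array sum; the summing task and the threshold value are illustrative assumptions, chosen instead of the mergesort example for brevity.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal sketch of the Fork/Join pattern: a task recursively forks itself in
// halves until the range is below a threshold, computes small ranges directly,
// and joins the partial results.
public class ForkJoinSketch {

    static final class SumTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 1_000;
        private final long[] data;
        private final int from, to;

        SumTask(long[] data, int from, int to) {
            this.data = data; this.from = from; this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {          // small enough: process directly
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }
            int mid = (from + to) / 2;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                            // fork: run left half asynchronously
            long rightSum = right.compute();        // compute right half in this thread
            return rightSum + left.join();          // join: wait for the forked half
        }
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        ForkJoinPool pool = new ForkJoinPool();     // the master's thread pool
        long total = pool.invoke(new SumTask(data, 0, data.length));
        System.out.println("sum = " + total);       // expected: 49995000
    }
}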

3.6.9 Producer-Consumer Design Pattern Intent This pattern generalizes a solution for the producer-consumer problem. This problem exposes the need to guarantee synchronization in systems where many concurrent processes share a common resource (e.g., a fixed-size buffer or queue). The pattern allows coordinating the production and consumption of information that is generated and processed asynchronously.

Problem In a wide range of situations, processing requests can be performed in a concurrent and asynchronous way, which raises several questions. First, the arrival point should avoid the loss of requests. Second, the production and consumption of requests should be coordinated in some way. For example, suppose a restaurant where you order your food according to arrival order and there is more than one cook to attend you. Any number of clients may arrive and order at any given time. If you arrive and there is at least one available cook, you will be attended immediately. However, if there are no available cooks, you will be put in a queue, where you will wait to be attended by the first cook who becomes available. In this way, requests are not lost, and they are attended as soon as possible.

Figure 3.17: Structure Diagram - Producer-Consumer Design Pattern

Context This pattern should be used when the problem can be separated into three parts: first, the generation of requests, which can be concurrent and asynchronous; second, the processing or attendance of requests, for which there can be more than one consumer; and finally, a queue that is responsible for coordinating the first two parts.

Forces Objects/Data are produced and consumed in an asynchronous way.

Objects/Data may be produced even when there are no available consumers to process them. Objects are produced at any time. These objects are stored in a common shared data structure where they wait to be processed by any available consumer.

Structure Diagram - Figure 3.17 Participants

Figure 3.18: Processing Sequence Diagram - Producer-Consumer Design Pattern

Producer The entities (i.e., threads) that produce the objects/data to be processed, asynchronously. Sometimes, objects are produced when all consumers are busy processing other objects. In those cases, the produced objects are stored in a queue while the client that generates them continues with its normal execution.

Queue Stores the objects/data produced by the Producers until a Consumer object dequeues them for processing. If the queue reaches its maximum size, it can force the producer thread to wait until a consumer thread dequeues an object from the queue.

Consumer Dequeues objects from the queue to process them. If the queue is empty, the available Consumer objects must wait until a Producer enqueues an object.

Behavior Processing Scenario - Figure 3.18 Producers and consumers can work concurrently and synchronously. The Producer enqueues tasks in the queue, with the only restriction being the maximum queue size that limits the storage of tasks; a producer might wait until there is free space in the queue to enqueue new tasks. On the other hand, the Consumer dequeues tasks from the queue and processes them. If there are no tasks in the queue, consumers wait for tasks.
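The following minimal Java sketch shows this coordination using the standard ArrayBlockingQueue, whose put() and take() operations block exactly as described above when the queue is full or empty; the task contents and the poison-pill termination signal are illustrative assumptions.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of the Producer-Consumer pattern on a bounded queue:
// put() blocks the producer when the queue is full, and take() blocks the
// consumer when it is empty.
public class ProducerConsumerSketch {

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(5);   // bounded buffer

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 10; i++) {
                    queue.put("task-" + i);       // waits if the queue is full
                }
                queue.put("EOF");                 // poison pill: signals termination
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                String task;
                while (!(task = queue.take()).equals("EOF")) {   // waits if empty
                    System.out.println("processing " + task);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}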

3.6.10 Sender Released Intent The Sender Released pattern is designed to improve the performance and reliability of Service-Oriented Architecture (SOA) applications. This pattern is based on two well-known message-oriented SOA design patterns: Reliable Messaging and Asynchronous Queueing. Sender Released is a useful mechanism to ensure the delivery of messages without overloading the sender, which translates into increased robustness and performance.

NOTE: This is the only design pattern analyzed in this thesis that exclusively targets a SOA environment. Therefore, its structure is presented as a deployment diagram. However, this pattern can be implemented independently of SOA.

Problem Inter-service message exchange is an important functionality in SOA systems to guarantee robustness. The Reliable Messaging pattern addresses this functionality with robustness when service communication cannot be guaranteed due to the presence of unreliable environments. An unreliable environment refers to hardware and system configurations where software reliability is not completely guaranteed. However, using reliable SOA frameworks with the Reliable Messaging pattern does not guarantee reliability; SOA systems are still subject to failures that can crash reliable functionalities. Additionally, Reliable Messaging introduces a processing overhead that affects service performance. The problem addressed by Sender Released is to guarantee that messages are delivered reliably to their destination with minimum overhead.

Context This pattern should be used when performance and reliability are the most important concerns in a SOA application. The integration of heterogeneous applications (i.e., the communication and interoperation of different software components) is a challenging problem for software architecture design and development. The purpose of this pattern is to guarantee service communication in a SOA application when services are implemented in unreliable environments, avoiding performance overhead.

Forces Unreliable environments overhead. SOA presents a model to solve problems related to the communication and interoperability of components of heterogeneous applications. Nevertheless, those components are usually deployed in unreliable environments, which causes an overhead in the service activity. The overhead arises because the sender must hold and deliver the same message more than once in case of failure.

The sender requires to be released. Once it sends a message, the sender must be free of any traceability function (i.e., it must not implement the reliability function that guarantees a successful message delivery), so that it can process other tasks or send new messages to other components.

Figure 3.19: Structure SOA Diagram - Sender Released Design Pattern

Structure Diagram - Figure 3.19 Participants

Service Consumer Represents the element that requests services.

Service Agent Intercepts the message (service request), and sends it to the queue.

Service Represents the particular service or collection of services to be provided.

Back-up Store Stores the messages when the buffer (queue) is full.

Queue Transmits the messages to the service, and manages the retransmission of the message if there is no response from the service.

Behavior Processing Scenario - Figure 3.20 The Service Consumer sends a message to the Service, which is intercepted by the Service Agent. In order to release the sender (i.e., the Service Consumer) from the waiting cost, the Service Agent sends the message simultaneously to both the Queue and the Service. Nevertheless, if the Queue is full, it sends the message to the Back-Up Store to guarantee its persistence. When the Service receives the message, it sends back an ACK message to the Service Agent, which re-transmits it to the Service Consumer. This is the scenario where the message is received correctly.

Figure 3.20: Processing Sequence Diagram - Sender Released Design Pattern

However, in the scenario where the Service does not receive the message, the Service Agent will attempt to re-transmit the message (N times if required) with the help of the intermediary buffer (i.e., the Queue and/or the Back-Up Store) on behalf of the sender. In case the Service Consumer does not receive an ACK after an established period of time, it will consider that the message was not successfully delivered.
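As an illustration only, the following Java sketch (all class and method names are hypothetical; the pattern is normally realized with SOA middleware rather than in-process objects) shows the core idea: the agent releases the sender immediately, buffers the message in a bounded queue or in a back-up store when the queue is full, and retransmits until the service acknowledges or a retry limit N is reached:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SenderReleasedSketch {
    /** Stand-in for the remote service; a true return value plays the role of the ACK. */
    interface Service { boolean deliver(String message); }

    static final int QUEUE_CAPACITY = 100;
    static final int MAX_RETRIES = 3; // the "N times if required" retransmissions

    private final Queue<String> queue = new ArrayDeque<>();       // temporary storage
    private final Queue<String> backupStore = new ArrayDeque<>(); // stand-in for persistent storage
    private final Service service;

    SenderReleasedSketch(Service service) { this.service = service; }

    /** Called by the service consumer; returns immediately, releasing the sender. */
    public void send(String message) {
        if (queue.size() < QUEUE_CAPACITY) {
            queue.add(message);
        } else {
            backupStore.add(message); // queue full: persist the message instead of losing it
        }
    }

    /** Executed by the agent in the background: retransmit buffered messages until acknowledged. */
    public void flush() {
        drain(queue);
        drain(backupStore);
    }

    private void drain(Queue<String> source) {
        while (!source.isEmpty()) {
            String message = source.poll();
            boolean acknowledged = false;
            for (int attempt = 0; attempt <= MAX_RETRIES && !acknowledged; attempt++) {
                acknowledged = service.deliver(message); // retransmit on behalf of the released sender
            }
            if (!acknowledged) {
                System.out.println("delivery failed after retries: " + message);
            }
        }
    }
}
```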

3.6.11 Leader and Followers Intent The Leader/Followers design pattern allows implementing high-performance multithreaded applications in which a set of threads attends multiple and diverse incoming service requests stored in a shared queue, later deciding the appropriate request handler for each request.

Problem Implementing high-performance applications able to process multiple types of events or service requests concurrently can be hard even using multi-threading, due to race conditions, deadlocks, and synchronization overhead. For example, suppose a scenario where multiple service requests of types A and B arrive at the application at any time, and there is a restricted number of available processors (i.e., threads) to process those requests. How can all requests be processed efficiently without incurring the previously mentioned concurrency-related issues?

Figure 3.21: Structure Diagram - Leader and Followers Design Pattern

Context An application where multiple and diverse service requests occur at any time. These must be processed efficiently by a defined set of threads that share a common queue where incoming requests are stored.

Forces Demultiplex and process requests efficiently through a thread set. The Leader/Followers pattern improves the demultiplexing of requests that arrive and are stored in a shared resource (queue). Given that these requests can be of different types, this pattern proposes that the thread set appoints a leader thread and follower threads in order to manage the processing of requests. To process requests, only the leader thread accesses the shared queue and selects a request. Then, it selects the adequate request handler to attend the request. This way of processing requests allows demultiplexing the associations between requests and their handlers.

Avoid overhead caused by concurrency in the thread set. Each request is completely processed by only one thread (i.e., the current leader). This strategy avoids context switching, synchronization, and cache-coherence management, since a single thread processes each request from start to finish. Therefore, the overhead related to concurrency management is reduced.

Avoid race conditions caused by the shared request set. The leader thread is responsible for: (i) monitoring the shared requests queue, (ii) promoting a new leader before processing a request, and (iii) processing a request completely. Given that the current leader thread is the only one that accesses the shared requests queue and it attends each request completely, race conditions are avoided.

Structure Diagram - Figure 3.21 Participants

Figure 3.22: Request Processing Sequence Diagram - Leader and Followers

Request Represents the objects to be processed concurrently using the appropriate request handler.

Request Set The shared collection (usually a queue) used to store request objects.

Request Handler The interface that exposes the valid set of operations available to process requests.

Concrete Request Handler Implements the specific behavior that the application exposes through the request handler interfaces. A concrete request handler is associated with a request of the request set in order to process it.

Thread Pool A pool of threads that share a synchronization method, such as a semaphore or condition variable, to coordinate their transition between three different roles (i.e., leader, follower, and processing thread). Follower threads are queued in the thread pool waiting to become the leader thread. The leader thread waits for a request in the request set or selects one if there are pending requests in the queue. When a request is selected to be processed, the following activities occur:

• The current leader thread promotes a follower thread to become the new leader.

• The original leader starts to play the processing role (processing thread), which takes the request and associates it with the appropriate request handler in order to process the request.

• After the request is processed, the processing thread returns to play the follower role and waits on the thread pool synchronizer for its turn to become the leader again.

Behavior Processing Scenario - Figure 3.22 This scenario shows how the pattern processes the requests.

• The leader thread waits for incoming requests that arrive at the requestSet. When a request arrives, the leader thread is notified.

• When the leader thread is notified that a request has arrived, it promotes a new leader through the threadPool. Next, the thread selects the concrete handler to process the request. Finally, after processing the request, the thread pool assigns the thread as a follower.

• The new leader thread waits for new incoming requests.
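A minimal Java sketch of this promotion protocol is shown below (class and method names are illustrative, not the thesis implementation): follower threads wait for leadership, the leader alone takes a request from the shared set, promotes a new leader, and only then processes the request completely:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LeaderFollowersSketch {
    private final BlockingQueue<Runnable> requestSet = new LinkedBlockingQueue<>();
    private final Object leaderLock = new Object();
    private Thread leader; // at most one leader at a time

    public void submit(Runnable request) throws InterruptedException {
        requestSet.put(request); // requests may arrive at any time
    }

    /** Body executed by every thread of the pool. */
    public void work() {
        try {
            while (true) {
                synchronized (leaderLock) {
                    // Followers wait here until there is no leader, then one of them takes the role.
                    while (leader != null) {
                        leaderLock.wait();
                    }
                    leader = Thread.currentThread();
                }
                Runnable request = requestSet.take(); // only the leader touches the shared request set
                synchronized (leaderLock) {
                    // Promote a new leader before processing, so requests keep being dispatched.
                    leader = null;
                    leaderLock.notify();
                }
                request.run(); // process the request completely, then rejoin as a follower
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LeaderFollowersSketch pool = new LeaderFollowersSketch();
        for (int i = 0; i < 3; i++) {
            new Thread(pool::work).start();
        }
        for (int i = 0; i < 10; i++) {
            final int id = i;
            pool.submit(() ->
                System.out.println("request " + id + " handled by " + Thread.currentThread().getName()));
        }
    }
}
```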

3.6.12 Half-Sync / Half-Async Intent Concurrency implies managing synchrony and asynchrony. The Half-Sync / Half-Async design pattern can receive asynchronous requests and process them in a synchronous way. This pattern uses a queue to communicate the asynchronous and synchronous layers.

Problem Managing synchrony and asynchrony in a software system is not a trivial task, due to the differences between these programming models. Asynchrony implies that it is not necessary to wait until a task is finished to start another, while in synchrony it is. These models have implications on resources (e.g., in asynchronous models many tasks could access memory at the same time, while in a synchronous model only one task accesses memory at a time) and on dependency management (i.e., in asynchronous models there should be no dependencies among tasks, while in synchronous models it may be necessary to wait until a task has finished its execution or is suspended before starting another task). Half-Sync / Half-Async supports both programming models, synchronous and asynchronous, to leverage concurrency in an efficient way.

Context This pattern should be used when the system must perform tasks in response to asynchronous events, it is inefficient to dedicate one synchronous thread to each event, and performing tasks in synchronous threads simplifies task execution. Another scenario where this pattern can be applied is when one task must run in a control thread, while other tasks can run in a multi-threaded environment.

Forces Balance the simplicity of synchronous programming with the efficiency of asynchronous models. Programming asynchronous models can be complex because input and output operations are triggered by events or interrupts. This kind of triggering can cause scheduling problems and race conditions when the current control thread is interrupted. Additionally, debugging asynchronous programs is hard, given that events occur at different moments and points of execution. However, this model can increase the efficiency of a program by allowing communication and computation to proceed simultaneously. The synchronous model, on the other hand, reduces the information needed to maintain program status. Thus, there are characteristics that can be leveraged from both models.

Figure 3.23: Structure Diagram - Half-Sync / Half-Async Design Pattern

Structure Diagram - Figure 3.23 Participants

AsyncThread This is the only thread responsible for attending event requests and for enqueuing events to be processed by synchronous threads.

Queue Maintains the messages from the asynchronous thread to be processed by the synchronous threads; that is, it is a bridge that communicates asynchrony and synchrony for processing events.

SyncThread It is responsible for processing events. The synchronous side of this pattern can use the ThreadPool design pattern to manage synchronous threads efficiently.

Behavior Processing Scenario - Figure 3.24 On the one hand, when an event arrives at the asynchronous thread through the notification method, the asynchronous thread calls its own run method, which is responsible for enqueuing the event. On the other hand, synchronous threads are always requesting events from the queue to process. Thus, when an event arrives at the queue, a synchronous thread dequeues it and processes it completely. If there are no synchronous threads available to process it, the request waits in the queue until a synchronous thread is available.
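The following minimal Java sketch (names are illustrative, not the thesis implementation) shows the two layers communicating through a blocking queue: the asynchronous side enqueues events and returns immediately, while synchronous worker threads dequeue and process them completely:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class HalfSyncHalfAsyncSketch {
    // The queue is the bridge between the asynchronous and synchronous layers.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** Asynchronous layer: called from event sources (e.g., I/O callbacks); never blocks on processing. */
    public void notification(String event) {
        queue.offer(event); // enqueue and return immediately
    }

    /** Synchronous layer: each worker thread dequeues events and processes them completely. */
    public void runSyncWorker() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String event = queue.take(); // waits if no events are available
                process(event);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(String event) {
        System.out.println(Thread.currentThread().getName() + " processed " + event);
    }

    public static void main(String[] args) {
        HalfSyncHalfAsyncSketch pattern = new HalfSyncHalfAsyncSketch();
        for (int i = 0; i < 2; i++) {
            new Thread(pattern::runSyncWorker, "sync-worker-" + i).start();
        }
        // Simulated asynchronous notifications arriving at arbitrary times.
        for (int i = 0; i < 5; i++) {
            pattern.notification("event-" + i);
        }
    }
}
```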

3.6.13 Sayl Intent The Sayl design pattern describes a way to turn a sequential application into a concurrent version in order to improve performance. Sayl exploits the benefits of some well-known techniques and design patterns, such as dynamic task graphs, data-flow dependencies, and task parallelism, to achieve this goal.

Figure 3.24: Request Processing Sequence Diagram - Half-Sync / Half-Async

Problem Sequential applications that suffer from slow performance can sometimes be turned into concurrent applications, thus leveraging available distributed computing power.

Context This pattern should be used in sequential applications with performance problems that can be broken down into a collection of independent tasks, and whose major characteristics are:

• Heterogeneity of tasks and their interdependencies: tasks performing different operations with variations in their completion times and functionalities, and possibly depending on different computational resources.

• Dynamic set of active tasks: different subsets of tasks may run in different iterations and computing nodes during the execution of the program.

Forces Tasks can only be spawned when their dependencies are fulfilled. In order to reduce the overhead of common polling, tasks are only spawned when their dependencies are fulfilled.

Tasks are only allocated in memory when they are prepared to be executed. Tasks are only scheduled for execution as soon as their parameters and required resources become available, making efficient use of memory.

The order of queuing or executing tasks does not affect the program correctness. Sayl allows using any task container (i.e., data structure for the task collection) designed to achieve high performance. This implies that the tasks must not have functional inter-dependencies.

Figure 3.25: Structure Diagram - Sayl Design Pattern

Structure Diagram - Figure 3.25 Participants

Task What needs to be processed. A task may or may not have parameters and required resources for its execution. This pattern considers that all the parameters required for executing a task may not become available at the same time as the required resources.

Task Dependency Once a task has been created, the parameters are added to the task through the task dependencies.

Prepare Container Stores the tasks and their parameters until all of them become available. This helps to reduce the overhead associated with common polling. The Prepare Container requests the resources required to execute the tasks, if they are needed.

Ready Container When all parameters and resources of a task are available, the task is moved to this container to proceed with its execution when a thread is available.

Worker Manages the pool of workers (i.e., available threads) that execute the tasks. A worker must be able to process any task.

Resource Some tasks may require critical resources for their execution. A Resource must be understood as any computational device required for the task execution. To avoid resource deadlocks, this pattern proposes a way to guarantee resource availability to perform a task. When a task is stored in the Prepare Container or the Ready Container (if only one or no parameters are required), a request for the resources is enqueued. Unlike the task parameters, the resources required for a task execution are requested only once. This pattern considers a special resource management that differs from the resource management of the operating system.

Behavior Processing Scenario - Figure 3.26 There are two distinguishable scenarios for processing requests: first, the task has more than one parameter; and second, the task has only one or no parameters. In the first scenario, the task is created and added to the Prepare Container, waiting for all of its required parameters to be available. Once the parameters are available, the task is removed from the Prepare Container and added to the Ready Container, where an idle worker will process it. In the second scenario, the task is added directly to the Ready Container, where it waits for a worker to execute it.
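The container logic can be sketched in Java as follows (a simplified illustration with hypothetical names; resource management and the thesis-specific components are omitted): tasks with pending parameters wait in the prepare container and are moved to the ready container, from which workers take them, only when all parameters have arrived:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class SaylContainersSketch {
    /** A task with a fixed number of parameters that may arrive at different moments. */
    static class Task {
        final String id;
        final int paramsCount;
        final Map<Integer, Object> params = new HashMap<>();
        Task(String id, int paramsCount) { this.id = id; this.paramsCount = paramsCount; }
        boolean isReady() { return params.size() == paramsCount; }
        void run() { System.out.println("executing task " + id + " with " + params.size() + " params"); }
    }

    private final Map<String, Task> prepareContainer = new HashMap<>();            // waiting for parameters
    private final BlockingQueue<Task> readyContainer = new LinkedBlockingQueue<>(); // ready to execute
    private final ExecutorService workers = Executors.newFixedThreadPool(4);       // worker pool

    /** Tasks with at most one (already available) parameter go straight to the ready container. */
    public synchronized void addTask(Task task) {
        if (task.isReady()) {
            readyContainer.add(task);
        } else {
            prepareContainer.put(task.id, task);
        }
    }

    /** Parameters arrive through the task dependency; the last one moves the task to ready. */
    public synchronized void addTaskParam(String taskId, int paramId, Object value) {
        Task task = prepareContainer.get(taskId); // assumes the task was added with pending parameters
        task.params.put(paramId, value);
        if (task.isReady()) {
            prepareContainer.remove(taskId);
            readyContainer.add(task);
        }
    }

    /** Workers only ever see tasks whose dependencies are fulfilled. */
    public void startWorkers() {
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        readyContainer.take().run();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }
}
```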

3.6.14 MapReduce Intent MapReduce can be seen as a variant of the Master-Worker pattern. It considers a special type of problem that unfolds in two distinct phases: (i) a large collection of independent and concrete computations, and (ii) the synthesis and summarization of the independent results. MapReduce solves this problem by hiding the details of parallelism and distributed computing from the programmer.

Problem Many problems require the processing of large amounts of data. Problems where data can be processed in a concurrent way and then collected and summarized to produce the expected results can be solved efficiently with the MapReduce pattern. MapReduce improves the performance of programs that solve this kind of problem by taking advantage of concurrency and the available computational resources.

Context Problems that can be split into a set of tasks, where an independent function is applied by each task and then, in a subsequent phase, a summary is built from the results. The first operation of applying the independent function corresponds to the context of the Master-Worker design pattern (of section 3.6.6). However, the second operation is loosely synchronous, since in general it must be applied to the whole set of results obtained by the first operation. It is worth noting that the reduce operation requires, in general, high communication usage in order to produce the summary results.

Forces Number of tasks vs. communication time. This is the most important tradeoff in the MapReduce design pattern. On the one hand, there must be a large number of tasks to keep all processing nodes busy; however, too many tasks can significantly increase the communication times. A common variant to overcome the communication problem is not to send the data but the software code of the task, and execute it where the data is.

The Map function and the Reduce function must be clearly defined. MapReduce is not an option for problems that cannot be split into the two functions, map and reduce, clearly applied in two phases.

Figure 3.26: Processing Sequence Diagram - Sayl Design Pattern

Figure 3.27: Structure Diagram - MapReduce Design Pattern

Structure Diagram - Figure 3.27 Participants

Master The Master contains the shared collection of tasks, usually a queue, where tasks are stored after being split. The Master also launches the processors (workers) and collects the workers’ results to produce the final result (if required). Unlike the Master of the Master-Worker design pattern, the consolidation of results may or may not be executed by the Master after a reduce phase, since the Reduce Workers might be more appropriate to execute this operation.

Map Worker and Reduce Worker We distinguish between two kinds of workers, Map Workers and Reduce Workers, since each of them has specific responsibilities. Both workers request tasks from the shared collection of tasks registered in the master's queue and process them by performing their particular function (i.e., map or reduce). Both kinds of workers can be deployed on the same computational resource.

Task Defines the task functionality and it is also responsible for splitting the required data.

Behavior Processing Scenario - Figure 3.28 The master enqueues the different tasks to be processed. The available Map Workers ask for a task to process, get the task from the Master's task repository, and start to execute the defined Map() function. Once a worker finishes its work, it can ask for another task, since workers may have a local repository of results. Once they all finish, they send the results to the master. Once the master has the results from all of the Map Workers, the Reduce Workers can execute the Reduce() function over the mapped data and (i) produce an output or (ii) send the results to the master, who performs a data consolidation process before producing the final output.
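As an illustration (a word-count example with hypothetical names, not the thesis implementation), the following Java sketch shows the master enqueuing tasks, map workers pulling tasks and emitting partial results, and a reduce step summarizing them once all map work has finished:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;

public class MapReduceSketch {
    public static void main(String[] args) throws InterruptedException {
        // Master: split the input and enqueue the map tasks.
        ConcurrentLinkedQueue<String> taskQueue = new ConcurrentLinkedQueue<>(
                List.of("a b a", "b c", "c c a"));
        ConcurrentLinkedQueue<Map.Entry<String, Integer>> partialResults = new ConcurrentLinkedQueue<>();

        // Map workers: repeatedly request a task and emit (word, 1) pairs.
        Runnable mapWorker = () -> {
            String task;
            while ((task = taskQueue.poll()) != null) {
                for (String word : task.split("\\s+")) {
                    partialResults.add(Map.entry(word, 1));
                }
            }
        };
        Thread w1 = new Thread(mapWorker);
        Thread w2 = new Thread(mapWorker);
        w1.start(); w2.start();
        w1.join(); w2.join(); // the master waits for all map results before the reduce phase

        // Reduce step: summarize the mapped data into the final result.
        Map<String, Integer> result = partialResults.stream()
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
        System.out.println(result); // e.g. {a=3, b=2, c=3}
    }
}
```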

Figure 3.28: Processing Sequence Diagram - MapReduce Design Pattern

Variant In-Mapper Combiner This variant of the MapReduce design pattern introduces a new element, which could be understood as an additional phase in the structure of the pattern, named combiner. The combiner is responsible for local aggregation (i.e., a partial reduce on the Map function's output). Combiners help to reduce the communication time since they reduce the amount of data that must be sent across the network. However, two considerations must be noted with the use of this variant. First, combiners might not be useful in all mappers, depending on the data; and second, although combiners reduce the amount of data sent across the network, they do not reduce the number of key-value pairs generated by the mapper (Map Worker) in the first place.

Variant's Behavior Each time a key-value pair is processed, an entry is added to the map; if the key already exists, the values are combined and updated in the map. Once the mapper has finished mapping all the key-value pairs, they are written to disk.
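A minimal Java sketch of the in-mapper combiner (again using the classic word-count example; names are illustrative) aggregates counts in a local map while the mapper runs, so only one pair per distinct key leaves the mapper:

```java
import java.util.HashMap;
import java.util.Map;

public class InMapperCombinerSketch {
    /**
     * Word-count style mapper with local aggregation: instead of emitting one
     * (key, 1) pair per word, counts are combined in an in-memory map and
     * emitted only once per distinct key when the mapper finishes.
     */
    public static Map<String, Integer> map(Iterable<String> lines) {
        Map<String, Integer> localAggregate = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    // Combine on the fly: update the existing entry if the key was already seen.
                    localAggregate.merge(word, 1, Integer::sum);
                }
            }
        }
        // At this point the map would be written to disk / sent to the reduce workers:
        // fewer key-value pairs cross the network, although the mapper still generated
        // one logical pair per word occurrence.
        return localAggregate;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = map(java.util.List.of("a b a", "b c"));
        System.out.println(result); // e.g. {a=2, b=2, c=1}
    }
}
```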

3.7 Problem Model Instantiation

Our experiments involve a set of three different areas of study: (i) domain-specific design patterns for performance, (ii) context-variables, and (iii) performance factors. A deep analysis of these areas and the variables that compose them was presented in chapter 2. In this section we describe the process followed to define the limits of the experiments design required to achieve the proposed thesis goals.

3.7.1 Experiment Case Studies This section presents the selected case studies and their related information. The case studies used in this thesis project were selected by searching for common theoretical and real-world problems where several design patterns seem to be applicable; these case studies also have the particularity that performance is a critical concern. Therefore, they can obtain a potential benefit (i.e., in terms of performance) through the correct implementation of the selected domain-specific design patterns.

Sorting Case The sorting case (hereinafter sorting) consists of placing the elements of a collection in some kind of order. Most of the design patterns found and introduced in section 3.6 try to take advantage of the concurrency and parallelism of concurrent systems in order to improve the performance of software systems. In consequence, we selected one of the best-known sorting algorithms that allows us to take advantage of these characteristics, namely merge sort. Merge sort is an accurate sorting algorithm with O(n · log2(n)) time complexity, based on the divide-and-conquer method [23]. Given that the merge sort algorithm is based on the divide-and-conquer method, it becomes a suitable case study that allows us to split the unsorted collection into a set of tasks that will be processed (i.e., sorted) concurrently by the different processors, taking advantage of concurrency and parallelism.

Functioning Layout This section presents an overview of the normal process (i.e., without considering the behavior given by the selected design pattern) that a text file follows in order to be sorted.

• Input The program reads a text file that contains a specified number of lines without a given order, each line is a combination of twenty alphanumeric characters.

• Process The program splits the file into a set of blocks that contain a determined number of lines. Each block is processed by a specific node that sorts the lines contained within the block. Once all blocks have been processed, a merge function is called to merge the set of ordered blocks (see the sketch after this list).

• Output The program prints a new text file that corresponds to the sorted version of the file given in the input.
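The following single-process Java sketch (illustrative only; in the experiments the blocks are sorted by distributed nodes under the selected design patterns) summarizes the split, sort, and merge steps of this layout:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortingCaseSketch {
    /** Splits the unsorted lines into blocks of at most blockSize lines (the distributable tasks). */
    static List<List<String>> split(List<String> lines, int blockSize) {
        List<List<String>> blocks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += blockSize) {
            blocks.add(new ArrayList<>(lines.subList(i, Math.min(i + blockSize, lines.size()))));
        }
        return blocks;
    }

    /** Each block would be sorted by a different processing node; here sequentially for brevity. */
    static void sortBlocks(List<List<String>> blocks) {
        blocks.forEach(Collections::sort);
    }

    /** Merges two already-sorted blocks, the core step of merge sort. */
    static List<String> merge(List<String> left, List<String> right) {
        List<String> merged = new ArrayList<>(left.size() + right.size());
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            merged.add(left.get(i).compareTo(right.get(j)) <= 0 ? left.get(i++) : right.get(j++));
        }
        merged.addAll(left.subList(i, left.size()));
        merged.addAll(right.subList(j, right.size()));
        return merged;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("delta", "alpha", "echo", "charlie", "bravo");
        List<List<String>> blocks = split(new ArrayList<>(lines), 2);
        sortBlocks(blocks);
        List<String> result = blocks.stream().reduce(new ArrayList<>(), SortingCaseSketch::merge);
        System.out.println(result); // [alpha, bravo, charlie, delta, echo]
    }
}
```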

Replication In order to improve and guarantee the validity of the results, we define a replication mechanism to perform the experiments. This replication process is required since the runtime environment is not totally stable. The replication process consists of:

• Deploy the software system.

• Execute a warming-up process for the latency tests.

• Perform the sorting of an unordered text file.

• Register the metrics associated with the recently ordered file.

• Each size of text file (i.e., each quantity of string lines) is repeated 10 times for the latency tests.

• Considering that there is no warming-up process for the throughput tests, the replication process consists of repeating the throughput experiment 3 times with each size of text file.

Note: The performance factors of interest in this thesis project are latency and throughput; the process followed for their selection is described in section 3.7.3.

Large XML Processing Case The large XML processing case is a real-world business problem of a multinational company with headquarters in Colombia. This multinational company has a business model where multiple clients send their data for processing and storage to centralized servers, and the company returns it, usually to mobile devices. In this case study, we understand a large XML file as any file larger than 1 MB. The information is transmitted through individual XML files that not only represent data to be processed but are also used as representations of data models hosted on client devices; this process is known as synchronization. The main objective of this company is to optimize and improve the performance of the synchronization process, since the current performance is not the one expected by clients. This is a relevant case study since the main objective is to improve the performance of the system by means of an architectural mechanism. For more information about this case study, please refer to the related bachelor graduation project document developed by Cordoba and Mejia [9].

Functioning Layout This section presents an overview of the normal process (i.e., without considering the behavior given by the selected design pattern) that an XML file follows in order to be synchronized on the company servers.

• Input The program receives from a mobile device a ZIP file that contains one or multiple compressed XML files. The ZIP file is decompressed on the server, which starts the processing of each XML file individually.

• Process The XML processing consists of reading the XML parts and queuing the associated operations or processes required for the data read. These operations are usually stored procedures that have been translated to a high-level programming language where database communication is limited to basic DB operations (i.e., select, insert, delete, and update).

• Output When the file is completely processed the server sends a confirmation to the client and mobile device that started the synchronization.

Replication The replication process of this case study is similar to the replication process of the sorting case. The replication process consists of:

• Deploy the software system.

• Simulate the sending of the file by the client.

• Start the processing of the file on the server and the DB server.

• Register the metrics associated with the recently processed file.

• Ten files of same size are sent to the server.

• The process is repeated for multiple file sizes.

Note: The information related to this case study is divided into two different documents: the bachelor graduation document of Cordoba and Mejia [9] and the present master graduation project. In this document we complete the work started by Cordoba and Mejia, which was focused on the improvement of the throughput of this case study; in it, throughput measurements were taken under specific experiment design guidelines. The present document completes this work by measuring the latency of the most relevant evaluated experiments.

3.7.2 Domain-Specific Design Patterns for Performance - Selection After the research process described in section 3.5, we identified a set of 14 domain-specific design patterns for performance. These patterns were developed with the primary concern of improving system performance. However, not all of the found patterns fit the selected case studies.

Criteria Selection We define the following criteria for the selection of design patterns for the sorting and large XML processing cases:

• The design pattern must allow taking advantage of the concurrency of individual tasks (i.e., the sort tasks of the merge sort algorithm, and the stored procedures translated to high-level programming in the large XML processing case).

• If the design pattern was developed in terms of parallelism, it has to be able to be abstracted into a concurrency environment.

• The context of the design pattern should be similar to the problem context related to each case study.

• The application of the design pattern must not change the basic logic of the merge sort solution algorithm or business logic established by Carvajal T&S.

Selected Patterns According to the criteria defined above, the design patterns that match all criteria for the sorting case study are:

• Master-Worker

• Producer-Consumer

• Sayl

On the other hand, the design patterns that match the criteria for the large XML processing case study are:

• Reactor

• Producer-Consumer

Discarded Patterns In this section, we document the main reasons to discard some design patterns that, according to their context and the defined selection criteria, seem at first sight suitable for application to the case studies.

Half-Sync/Half-Async for Sorting Case This design pattern presents an arrangement of two different groups of threads (also called layers), one for synchronous services and another for asynchronous services. Both layers communicate through a queue in a producer-consumer way. The behavior of this pattern establishes that when an asynchronous event occurs, it triggers a notification. Then, the monitor thread of asynchronous tasks reads the request and optionally processes it before queuing it. Finally, the worker threads pull tasks synchronously from the queue, that is, the synchronous threads read and process requests. Another important consideration of this pattern is that the tasks of higher levels are processed by the synchronous layer, while the tasks of lower levels are processed by the asynchronous layer. From the description of the case studies, we know that the sorting problem presents two important tasks or stages (i.e., sort and merge) that communicate with each other. There are some issues that prevent the application of the Half-Sync/Half-Async design pattern to this particular problem. First, both stages can be considered high-level tasks; this means that they should be processed by the synchronous layer. However, the design pattern only establishes one synchronous layer. Moreover, neither sorter tasks nor merger tasks can be modeled as asynchronous tasks, since their control flow is always controlled by another component of the software system (e.g., the controller or the communication queue).

Leader and Followers (LF) for Sorting Case The LF design pattern specifies a thread pool where any thread of the pool is able to handle any (asynchronous) event. These events have to be processed in a synchronous way to avoid issues like race conditions and concurrency overhead. To achieve this behavior, a thread of the pool is

promoted as leader. The leader is responsible for attending and handling the event, but before processing it, the leader must promote as the new leader a thread that until that moment was a follower. The sorting case study does not have an asynchronous generation of events from a set of handlers. In this case study, although there is a separation of the work to take advantage of distributed processing, the ”events” generated from this separation are synchronous and known. Therefore, it makes no sense to implement this pattern for the mentioned case study, since events would have to wait in a container from the beginning of the execution for a thread to become leader only to process them.

The State-Based Pipeline for Sorting Case This design pattern specifies a set of ordered stages, where each stage receives data from its predecessor. The stages have the particularity of being independent, that is, stages can be performing computations simultaneously, each of them with its own data. Initially, the sorting case study could be thought of in terms of two ordered stages (i.e., sort and merge, in that respective order); however, this prevents taking advantage of distributed computing, since the merge stage would require many computations of the sort stage before it could be started, and additionally a new behavior would have to be implemented in order to get a solution using this pattern. This behavior is certainly far from the one enunciated in the design pattern description and does not match one of the defined criteria.

Sayl for Large XML Processing Case The Sayl design pattern specifies that a set of independent tasks could be generated by the program dynamically, and those tasks might perform different operations and possibly depend on different computational resources. Taking this fact into consideration, the design pattern drives a special behavior where tasks are only spawned when their dependencies are completely fulfilled. Although in the large XML processing case tasks are produced dynamically as the XML is read, those tasks already have the information and dependencies required to be executed immediately; therefore, although the design pattern might be applicable, it would be underutilized and would probably decrease the performance of the system, since more computations are required according to the pattern behavior.

3.7.3 Performance Factors - Selection From the research process described in section 3.3, we identified a set of six performance factors. In the following sections we describe the process to select an appropriate set of performance factors to be monitored and measured.

Criteria Selection We define the following criteria for the selection of the performance factors:

• Relevance of the factor: according to the definitions given in section 2.3.1, we determined the relevance of each factor by the amount of information it can provide about the current performance state of the software system.

• Ease of measuring the factor: in order to register the measurements associated with the performance factors, we are obliged to monitor and intervene in the software system.

• Representation in the literature: from the research, we found a clear distribution of the performance factors in the literature.

• Expert judgment of the advisor.

Selected Performance Factors Considering the criteria defined above, the performance factors selected to be studied in this thesis project are:

• Throughput

• Latency

Additionally, we decided to also monitor the RAM memory utilization during the test executions.

3.7.4 Context Variables - Selection In the research process carried out and described in section 3.4, we identified a set of 28 context-variables that we should take into consideration to understand the performance behavior of a software system. However, due to time constraints and the scope of this master thesis, we must limit the number of context-variables to study. In the following sections we document the criteria used to select a germane set of context-variables for this thesis.

Criteria Selection We identified a set of 28 different context-variables that may affect the performance of software systems. Therefore, we defined the following criteria to select the variables that will be analyzed in our thesis.

• The variable must be controllable in the test environment (LIASOn Lab).

• The variable must be able to be measured accurately (Accuracy).

• The variable must be relevant for the selected design patterns (Relevance).

• The variable’s variation must affect significantly the performance of the software system.

• Expert judgment of the advisor.

Selected Context-Variables Considering the criteria defined above, the context-variables selected for this thesis project are:

• Number of available distributed task processors

• Number of service requests

• RAM memory usage

• Network Bandwidth

• Task buffer size (Queue)

• Batch time span

• Batch size

• Memory structure

• Communication time

The context-variables listed above were all considered for the sorting case. However, according to the nature of the large XML processing case, it is not possible to evaluate some of these variables. Therefore, the context-variables evaluated for the large XML processing case are:

• Number of available distributed task processors

• Number of service requests

• RAM memory usage

• Batch time span

• Communication time

3.8 Chapter Summary

In this chapter we have presented the context of the problem of this master thesis and how we modeled it. We identified that, in order to analyze the relation between the application of design patterns and system performance through variations of the context-variables, we must develop and perform a design of experiments in which we deliberately make changes to the input variables to subsequently observe how the output varies accordingly. We also described the exploratory research performed to identify the performance factors and context-variables that would later be selected for evaluation in this thesis. We introduced the formal steps carried out to elaborate an SLR document of the domain-specific design patterns that have been proposed in the literature for improving the performance quality attribute. Finally, we presented the theoretical and real-world business case studies on which we evaluated our experiments. In the next chapter, we present how the experiment designs were defined and executed, as well as how the experiment data was measured and recorded.

Chapter 4

Experiments Design

In the previous chapter we stated the need for an experiments design in order to fulfill the research goals stated in this thesis project. We started the definition of the experiments design with the identification and selection of the different variables that will be involved in the experiments. This chapter describes the process followed to design the experiments and their execution; it also describes how the experiment data is measured and recorded. Our expected solution for this project is to provide enough information to quantitatively characterize the existing relationship between design patterns and context-variables in terms of latency and throughput. Note: This chapter focuses principally on the experiments design elaborated for the sorting case, since the complementary document of Cordoba and Mejia [9] details the experiments design of the large XML processing case.

4.1 Experiment Environment Configuration

In this section we document the processes followed to: (i) ensure the controllability of significant variables of the environment in which the experiments are to be executed, (ii) ensure the adequacy of measurement gathering, (iii) select significant values for the selected variables, and (iv) select the different architectures to be used for the experiments.

4.1.1 Software Technologies This section describes briefly the software technologies required for the correct execution of the experiments.

FraSCAti ”OW2 FraSCAti is a component framework providing runtime support for the Service Component Architecture (have a look at SCA specifications). The OW2 FraSCAti runtime supports SCA composite definitions which are conform to the SCA Assembly Model V1.0 specification, Java component implementation (SCA Java Component Implementation V1.0 and SCA Java Common Annotations and APIs V1.0), remote component bindings using Web Services (Soap or RESTful) and Java RMI protocols”1. All implementations of this thesis project were realized using the FraSCAti framework.

1http://frascati.ow2.org/doc/1.4/ch01.html


PASCANI ”PASCANI is a component-based and statically-typed Domain Specific Language for specifying dynamic performance monitors for component-based software systems. It is tightly integrated with the Java type system, which allows the integration of already existing libraries into the language [...]. From Pascani specifications, the language implementation generates the artifacts for the Dynamic Monitoring Infrastructure, including their deployment specifications, [...]. The generated artifacts are composed of elements from the Pascani runtime library and the SCA library,[...]. To complete the automation of the dynamic performance monitoring, the deployment specifications are executed by our second DSL, AMELIA” [16]. We use the PASCANI language to implement the monitors required to measure the different performance factors and track the values and their changes in context-variables.

AMELIA ”AMELIA is a declarative and rule-based Domain Specific Language for automating the deployment of distributed component-based software systems. It is [...] a build automation tool, providing commands to facilitate the execution of deployment tasks across multiple computing nodes. From Amelia specifications, the language implementation generates executable deployment artifacts that perform the tasks required to transfer, install, and configure the software components to deploy on each of the specified processing nodes, using the SSH and SFTP protocols, and executing the commands specified in AMELIA rules” [16]. Although we did not use the AMELIA language itself, we made use of the library that supports this language in order to automatically deploy the experiment tests.

Precision Time Protocol (PTP) ”The Precision Time Protocol (PTP) is a protocol used to synchronize clocks throughout a computer network. On a local area network, it achieves clock accuracy in the sub-microsecond range, making it suitable for measurement and control systems”2. We make use of the Linux PTP project, which is an implementation of the protocol according to the IEEE 1588 standard.

4.1.2 The Warming Up Process As it is stated in [34], ”Java-based systems deliver great performance when running compiled and optimized code. The JVM needs time to warm up, or optimize frequently-used code, so the application can run at top speed. Why does this happen? Java was designed to start up quickly, then improve performance over time based upon actual usage. The JVMs just-in-time (JIT) compilers depend upon profile data that describes which parts of the application are called the most (the hot code). JIT compilation allows the JVM to optimize performance, but it takes time”. Therefore, in order to obtain reliable results from the experiments, we conduct a warming-up process consisting of 100 iterations over a text file of smaller size, executed prior to the execution of the text files that are actually evaluated. It is worth noting that the warming-up process is only executed for the latency experiments of the sorting case. In the large XML processing case, it is not possible to conduct a warming-up process, since the program requires database operations that might be affected by the execution of previous XML files.
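A minimal sketch of this warming-up step is shown below (the Runnable is a stand-in; in the experiments the warm-up consists of sorting a smaller text file through the deployed architecture):

```java
public class WarmUpSketch {
    static final int WARM_UP_ITERATIONS = 100;

    /** Runs the workload repeatedly on a small input so the JIT compiler optimizes the hot code paths. */
    static void warmUp(Runnable sortSmallFile) {
        for (int i = 0; i < WARM_UP_ITERATIONS; i++) {
            sortSmallFile.run(); // results are discarded; only the JIT profiling effect matters
        }
    }

    public static void main(String[] args) {
        Runnable sortSmallFile = () -> {
            int[] data = new java.util.Random(42).ints(10_000).toArray();
            java.util.Arrays.sort(data); // stand-in for sorting the small warm-up text file
        };

        warmUp(sortSmallFile);

        // Only after warming up are the real, measured executions performed.
        long start = System.nanoTime();
        sortSmallFile.run();
        System.out.printf("measured run: %.2f ms%n", (System.nanoTime() - start) / 1e6);
    }
}
```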

2https://en.wikipedia.org/wiki/Precision_Time_Protocol

Table 4.1: JVM Heap Size Experiment

JVM Heap Size    Total RAM Memory Consumption    Operational Time
6 GB             4,650 MB                        11,061 ms
5 GB             4,659 MB                        15,920 ms
5.5 GB           4,650 MB                        24,000 ms

4.1.3 Pilot Experiments for Special System Variables The Java Virtual Machine (JVM) Tuning ”The Java heap is where the objects of a Java program live. It is a repository for live objects, dead objects, and free memory. When an object can no longer be reached from any pointer in the running program, it is considered ”garbage” and ready for collection. [...] The JVM heap size determines how often and how long the VM spends collecting garbage. An acceptable rate for garbage collection is application-specific and should be adjusted after analyzing the actual time and frequency of garbage collections. If you set a large heap size, full garbage collection is slower, but it occurs less frequently. If you set your heap size in accordance with your memory needs, full garbage collection is faster, but occurs more frequently”3. We performed a pilot experimental test in order to set a suitable value for the heap size of the JVM of the processing nodes used to execute the experiments. This experiment was performed with processing nodes having 8 GB of RAM. The experiment consisted of creating a set of Java objects and storing them in a Java Collection, in this case an ArrayList. The objective of the test is to evaluate how many objects can be stored in the ArrayList before the JVM crashes because of an exhausted-memory exception, without affecting the operating system performance, and how much time this store operation takes. The results allow us to define the best JVM heap size configuration for our experiments, considering that both case studies (section 3.7.1) involve the creation of several Java objects. Table 4.1 shows the results of the experiment described above. According to this, the best alternative for the JVM is to configure the heap size at 6 GB in order to obtain the best performance. This value was set as both the minimum and maximum Java heap size, in such a way that the Java garbage collector minimizes its executions. It is worth noting that we performed the experiment with Java heap size configurations of more than 6 GB; however, those configurations crashed the operating system. Therefore, we could not register the corresponding results.
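The pilot can be sketched as follows (an illustrative approximation: the payload objects and the reporting details are assumptions, and the fixed heap is set with the standard -Xms/-Xmx JVM flags as described above):

```java
import java.util.ArrayList;
import java.util.List;

public class HeapSizePilotSketch {
    /**
     * Launch with fixed minimum and maximum heap so the garbage collector
     * runs as rarely as possible, e.g.:  java -Xms6g -Xmx6g HeapSizePilotSketch
     */
    public static void main(String[] args) {
        List<byte[]> store = new ArrayList<>();
        long start = System.currentTimeMillis();
        long storedBytes = 0;
        try {
            while (true) {
                byte[] object = new byte[1024 * 1024]; // 1 MB objects as a simple payload
                store.add(object);
                storedBytes += object.length;
            }
        } catch (OutOfMemoryError exhausted) {
            store.clear(); // free the references so the report below can run
            long elapsed = System.currentTimeMillis() - start;
            System.out.printf("stored ~%d MB in %d ms before the heap was exhausted%n",
                    storedBytes / (1024 * 1024), elapsed);
        }
    }
}
```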

Impact of Forcing Garbage Collection Considering the analysis of the sub-section above (4.1.3), we conducted a quick experiment based on the sorting case to observe whether the forced use of the garbage collector may negatively affect the system performance (in terms of the latency factor). The experiment consisted of sorting files of different sizes under two different configurations, one of them forcing the garbage collection at some critical points in the code. Results are depicted in Figure 4.1, and they show that it is preferable to leave the garbage collection decision exclusively to the JVM.

3https://docs.oracle.com/cd/E15523_01/web.1111/e13814/jvm_tuning.htm#PERFM156

Figure 4.1: Impact of a Forced Garbage Collection. Latency (ms) vs. file size for the 6 GB heap configuration: without forced GC, 14,082, 15,014, and 15,685 ms; with forced GC, 16,619, 17,703, and 18,557 ms, for file sizes of 6,500,000, 7,000,000, and 7,500,000, respectively.

Table 4.2: Hardware Technical Specifications

Processing Node    Commercial Reference     Processor                       Operative System                            RAM Memory    HDD
Computer Node      DELL OPTIPLEX 7010       Intel Core i7-3770 @3.40GHz     Fedora 21 64 bits                           16 GB         320 GB
NAS                Dell PowerVault NX400    Intel Xeon E5-2403 Quad Core    Microsoft Windows Storage Server 2012 R2    18 GB         4 TB

4.1.4 Hardware and Network Architecture Our experiments were conducted in the LIASOn Laboratory of Icesi University. The LIASOn laboratory provides students with a homogeneous architecture and a local area network, which allow us to perform controlled experiments. The most relevant technical specifications of the processing nodes of the laboratory are shown in Table 4.2. Note: At the beginning of this master thesis project, the LIASOn laboratory was provided with the processing nodes described in Table 4.2 but with a RAM memory of 8 GB; some pilot experiments were performed under this hardware configuration. However, prior to the execution of the experiments design, the processing nodes were updated to the configuration shown. The network topology adopted in the laboratory is a star topology, which is achieved by means of a network switch with the specifications detailed in Table 4.3. Processing nodes are linked to the network at 100 or 1,000 Mbps depending on the experiment. On the other hand, the NAS is always linked at 10,000 Mbps.

Table 4.3: Network Switch Hardware Specifications

Commercial Reference            Number of Ports    Port Attributes
Dell Networking N4000 Series    24                 10/100/1,000/10,000 Mb/s

4.1.5 Software Architecture for The Sorting Case The following deployment diagrams depict the software architecture defined for the experiments, considering each one of the selected design patterns. These diagrams depict the NORMA memory version of the patterns. However, since the UMA memory version does not differ significantly from it, we do not depict the respective diagrams for that version. The UMA deployment only differs from the NORMA deployment in the inclusion of an extra computational device, namely a Network Attached Storage (NAS), to which each processing node has access. We detail the deployment aspects related to memory structures in the next section. As stated before, in this thesis project all design pattern implementations were developed using the FraSCAti framework. Therefore, we decided to take advantage of the tools offered by it; for example, for the remote bindings of components, we used the native communication protocols supported by FraSCAti, in our case RMI. Since the communication protocol was not a context-variable selected for analysis in this thesis, we did not evaluate its impact on performance.

Note: Deployment diagrams only show one sorter node; however, according to the context-variable Number of Available Distributed Task Processors, the number of sorter nodes can vary from four to ten. For more information please refer to section 4.2.

Note: Throughout this document we refer to the base architecture as the one that uses only one component of each type required for the execution of the algorithm. The base architectures correspond to the deployment diagrams shown in figures 4.2, 4.3, and 4.4.

Master-Worker Architecture Figure 4.2 shows the software architecture deployed in the experiments using the Master-Worker design pattern. The Master-Worker implementation used in this thesis project makes use of three principal components: (i) control, (ii) sorter, and (iii) merger. The control component plays the Master role: it divides the file into a determined number of tasks according to the configured batch size and the size of the file; these tasks are sent to a bag of tasks. Sorter components (i.e., worker components) take tasks from the bag if they are not busy sorting another task (i.e., one task at a time). When a task is finally sorted, it is returned to the control and saved in a temporary collection. When the whole set of tasks is completely sorted, the control sends the set of tasks to the merger (another worker component) to merge all of the tasks and produce a final output that contains the content of the original file completely sorted. The Master-Worker architecture will replicate the sorter node. The number of nodes of this kind in the architecture will depend on the configuration of the rest of the evaluated context-variables. The rest of the nodes will have only one instance each.


Figure 4.2: Software Architecture using the Master-Worker Design Pattern

Producer-Consumer Architecture Figure 4.3 shows the software architecture deployed in the experiments using the Producer-Consumer design pattern. The Producer-Consumer implementation makes use of multiple components to achieve the expected behavior. The principal components involved in this architecture are: (i) control, (ii) queue, (iii) intermediate-control, (iv) sorter, and (v) merger. Since the sorting algorithm requires two clearly defined stages (i.e., sort and merge), we observe the need for two Producer-Consumer instances. The first instance is given between the control component (the Producer) and the sorter component (the Consumer). Here the control divides the file into a determined number of tasks according to the configured batch size and the size of the file; then the control sends these tasks to the first queue component, which stores the tasks until they are requested by the sorter. The sorter sorts the content of the task and puts it into a second queue component (i.e., the sortedQueue). At this point, the second instance of the Producer-Consumer is started: the sorter plays the role of Producer, and the merger, which requests tasks from the second queue, is the Consumer. The merger waits until all tasks have been sorted and then merges the content of the tasks to produce the final output that has the content of the original file completely sorted. Therefore, the sorter component plays both roles, producer and consumer. Finally, the intermediate-control is a component that helps to control the launching of the sorter and merger components. The Producer-Consumer architecture will replicate the sorter node. The number of nodes of this kind in the architecture will depend on the configuration of the rest of the evaluated context-variables. The rest of the nodes will have only one instance each.


Figure 4.3: Software Architecture using the Producer-Consumer Design Pattern

Sayl Architecture

Figure 4.4 shows the software architecture deployed in the experiments using the Sayl design pattern. The Sayl implementation makes use of five important components in order to achieve the expected behavior. The principal components involved in this architecture are: (i) control, (ii) container, (iii) worker-pool, (iv) sorter, and (v) merger. Similar to the previously described design patterns, the control component divides the file into a determined number of tasks according to the configured batch size and the size of the file; then the control starts to send these tasks to the container component, specifically to the Ready Container. Immediately, the worker-pool assigns an idle worker to perform each task. Given that the task (at this point of the algorithm execution) is a sort task, a sorter is assigned to execute it. After the sorter sorts the content of the task, it puts a new task in the Prepare Container of the container component. The Prepare Container waits until it has all its parameters (i.e., all required tasks of the original file); when all tasks are in the Prepare Container, a new task is created and moved to the Ready Container, where the merger component (a worker) is waiting to process this type of task. The merger merges the content of the tasks to produce the final output that contains the content of the original file completely sorted. The Sayl architecture replicates the sorter node; the number of nodes of this kind depends on the configuration of the rest of the evaluated context-variables. Each of the remaining nodes has a single instance.
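The following minimal Java sketch illustrates the Sayl flow just described (a Ready container of executable tasks, a worker pool, and a Prepare container that accumulates the parameters of the final merge task). All names are illustrative assumptions and the merge is simplified to a final sort, so this is a sketch of the idea rather than the project's implementation.

import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the Sayl flow: sort tasks enter the Ready container, their results
// accumulate in the Prepare container, and the merge task becomes ready once complete.
public class SaylSketch {

    record ReadyTask(String kind, int id, List<String> lines) {}

    public static void main(String[] args) throws Exception {
        List<String> file = List.of("delta", "alpha", "echo", "charlie");
        int batchSize = 2;
        int totalTasks = (file.size() + batchSize - 1) / batchSize;

        BlockingQueue<ReadyTask> ready = new LinkedBlockingQueue<>();                 // Ready Container
        List<List<String>> prepare = Collections.synchronizedList(new ArrayList<>()); // Prepare Container
        AtomicInteger pendingSorts = new AtomicInteger(totalTasks);

        // Control: split the file and publish the sort tasks in the Ready Container.
        for (int i = 0, id = 0; i < file.size(); i += batchSize, id++) {
            ready.put(new ReadyTask("sort", id,
                    new ArrayList<>(file.subList(i, Math.min(i + batchSize, file.size())))));
        }

        // Worker pool: every task that becomes ready is assigned to an idle worker.
        ExecutorService workerPool = Executors.newFixedThreadPool(2);
        List<String> output = new ArrayList<>();
        for (int processed = 0; processed <= totalTasks; processed++) {   // totalTasks sorts + 1 merge
            ReadyTask task = ready.take();
            if (task.kind().equals("sort")) {
                workerPool.submit(() -> {
                    Collections.sort(task.lines());
                    prepare.add(task.lines());                       // parameter of the future merge task
                    if (pendingSorts.decrementAndGet() == 0) {       // all parameters are now available:
                        ready.add(new ReadyTask("merge", -1, null)); // move the merge task to Ready
                    }
                });
            } else {
                // Merge task: executed once every sorted batch reached the Prepare Container.
                prepare.forEach(output::addAll);
                Collections.sort(output); // merge step, simplified to a final sort
            }
        }
        workerPool.shutdown();
        System.out.println(output); // [alpha, charlie, delta, echo]
    }
}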


Figure 4.4: Software Architecture using the Sayl Design Pattern

The Monolithic Architecture

The monolithic architecture corresponds to the one that uses only one component of each type required for the execution of the algorithm, with the restriction that only one processing node is used to deploy the components. However, it must be clarified that additional processing nodes can be used to deploy components that are not directly related to the algorithm execution (e.g., memory monitors).

The Distributed Architecture

The distributed architecture corresponds to the one that uses the complete set of available distributed task processors to perform the execution of the sorting algorithm.

4.1.6 Memory Architecture for The Sorting Case

In this section, we briefly describe the behavior of the memory architectures used; this complements the description of the software architectures and the design-pattern behavior. The definitions of each memory architecture were presented in section 2.4.1.

NORMA

The NORMA memory architecture makes use of the standardizer component (refer to the deployment diagrams presented in section 4.1.5) to load into memory the data read from the text files. The loaded data is transmitted to the control component, which has the business logic to start the behavior of the corresponding design pattern. Between the different types of components of a design-pattern implementation, the data is always transmitted over the network. That is the reason why the communication time is a prominent context-variable of this memory architecture scheme. When the program finishes the sorting process, the sorted data is transmitted back to the standardizer component, which writes the output file.

UMA

The UMA memory architecture also makes use of the standardizer component, which reads the text file. However, according to the value set for the batch size context-variable, the original text file is split to create a set of new, smaller text files that are sorted separately until all of them are completely sorted. In this memory architecture, the data transmitted between components is not the data to be sorted; instead, it is the path of the file to be sorted. Additionally, in this memory architecture scheme, the sort and merge components are able to perform read and write operations over the set of files. The writing of the sorted output file is delegated to the merge component.

UMA-NORMA

We explore a variation of the UMA memory architecture that, for the purpose of this thesis project, we call UMA-NORMA. UMA-NORMA consists of a combination of behaviors from the memory architectures described above. In it, the same load and split behavior of the UMA memory is carried out up to the sorter component. However, once the sorter sorts the data read from the file, the behavior starts to be similar to NORMA, that is, the sorted data is transmitted over the network. When the program finishes the sorting process, the merge component writes the output file as in the UMA behavior.
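As an illustration of the difference between these schemes, the following minimal Java sketch contrasts what a task carries in each memory architecture. The record names and the file path are hypothetical and do not correspond to the project's actual task classes.

import java.util.List;

// Illustrative contrast of what travels over the network in each memory architecture.
public class TaskPayloads {

    // Under NORMA, a task carries the data itself (it is serialized and sent over the network).
    record NormaTask(int id, List<String> lines) {}

    // Under UMA, a task carries only an identifier and the path of the file holding the data.
    record UmaTask(int id, String filePath) {}

    public static void main(String[] args) {
        NormaTask norma = new NormaTask(0, List.of("delta", "alpha", "echo"));
        UmaTask uma = new UmaTask(0, "/shared/batches/batch-0.txt"); // hypothetical path
        System.out.println(norma);
        System.out.println(uma);
    }
}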

4.1.7 Variable Configuration: Pilot Experiments

In this section we present a set of tests performed with the purpose of defining suitable values for the selected variables of analysis, as well as defining the candidate architecture configuration for each of the selected design patterns.

Batch Size Value Definition for The Sorting Case

For the batch size variable we need to define three different values associated with representative task granularities, as mentioned in section 2.4.1. In order to define these values, we performed the process described below. First, we selected a file size (i.e., a determined number of lines) that was neither too large nor too small for the tests (this definition was made based on exploratory latency tests). Second, we varied the batch size from 500,000 to 4,000,000 lines in steps of 500,000 lines. According to the selected file size (i.e., 4,000,000 lines), we defined a range associated with each granularity, so the associated values are:

• Fine granularity: From 500,000 to 1,000,000 lines.

• Medium granularity: From 1,500,000 to 2,500,000 lines.

• Coarse granularity: From 3,000,000 to 4,000,000 lines.

Third, we conducted the experimental latency tests for each memory architecture (i.e., NORMA and UMA) and design pattern combination; the latency results from these pilot tests are depicted in tables 4.4 and 4.5. Tables 4.4 and 4.5 show, for every batch size configuration, its respective obtained latency. This allows us to identify how much the granularities, and the ranges defined for them, impact this particular performance factor. In addition, the tables provide detailed data on how every latency result is composed under each of the memory architectures. Latency times in NORMA architectures are defined by: (i) the time it takes to transport the data (i.e., the communication time), (ii) the time it takes to sort all the batches (tasks), and (iii) the time it takes to merge all sorted tasks. In the UMA architectures, we additionally have (iv) the time it takes to read the file associated with each task (i.e., sort or merge), and (v) the time it takes to write the output of each task to a file. For a better comprehension, please review sections 4.1.6, 4.1.5, and 3.6. To increase the magnitude of the impact and determine whether this variable has a real impact on the system performance, we decided to select the best, worst, and average latency values from the experiment data presented in the tables, since they match the defined granularity ranges (although not in that precise order). However, an exception in the range was made for the Producer-Consumer in the UMA version, where the fine range is extended by 500,000 lines. These values will be used in the set of experiments. From tables 4.4 and 4.5 we observe that almost all combinations of design pattern and memory architecture for this case study provide the best latency results with fine-granularity configurations. However, for Master-Worker and Sayl with UMA memory, the best latency results are obtained with medium-granularity configurations.

Table 4.4: Latency Results for the NORMA Configuration of the Experimental Pilot for Batch Size Tests

Design Pattern | Selected Batch Size | Batch Size | Latency (ms) | Communication Time (ms) | Sort (Sum) (ms) | Merge (ms)
Master-Worker | | 500,000 | 9,090 | 5,647 | 2,124 | 1,180
Master-Worker | Fine | 1,000,000 | 9,091 | 6,193 | 2,184 | 602
Master-Worker | | 1,500,000 | 9,457 | 6,671 | 2,220 | 452
Master-Worker | | 2,000,000 | 9,682 | 7,002 | 2,257 | 316
Master-Worker | Medium | 2,500,000 | 9,912 | 6,878 | 2,238 | 302
Master-Worker | | 3,000,000 | 10,153 | 7,413 | 2,337 | 288
Master-Worker | | 3,500,000 | 10,297 | 7,544 | 2,363 | 274
Master-Worker | Coarse | 4,000,000 | 10,612 | 7,879 | 2,445 | 169
Producer-Consumer | Fine | 500,000 | 11,487 | 11,825 | 2,036 | 1,194
Producer-Consumer | | 1,000,000 | 12,430 | 12,917 | 2,205 | 618
Producer-Consumer | | 1,500,000 | 12,910 | 13,550 | 2,132 | 455
Producer-Consumer | Medium | 2,000,000 | 14,292 | 14,302 | 2,241 | 317
Producer-Consumer | | 2,500,000 | 14,363 | 13,858 | 2,292 | 307
Producer-Consumer | | 3,000,000 | 14,334 | 14,150 | 2,316 | 298
Producer-Consumer | | 3,500,000 | 14,476 | 13,225 | 2,410 | 284
Producer-Consumer | Coarse | 4,000,000 | 16,678 | 14,036 | 2,474 | 160
Sayl | Fine | 500,000 | 11,559 | 11,175 | 2,123 | 1,199
Sayl | | 1,000,000 | 12,128 | 11,898 | 2,276 | 618
Sayl | | 1,500,000 | 13,152 | 12,484 | 2,303 | 456
Sayl | Medium | 2,000,000 | 13,839 | 13,157 | 2,320 | 318
Sayl | | 2,500,000 | 14,385 | 13,113 | 2,338 | 302
Sayl | | 3,000,000 | 15,108 | 13,330 | 2,441 | 303
Sayl | | 3,500,000 | 16,201 | 13,762 | 2,626 | 293
Sayl | Coarse | 4,000,000 | 16,456 | 13,847 | 2,447 | 152

Table 4.5: Latency Results for the UMA Configuration of the Experimental Pilot for Batch Size Tests

Design Pattern | Selected Batch Size | Batch Size | Latency (ms) | Communication Time (ms) | Sort (Sum) (ms) | Merge (ms) | Reading (ms) | Writing (ms)
Master-Worker | Fine | 500,000 | 7,590 | 16 | 1,966 | 1,122 | 1,909 | 2,540
Master-Worker | | 1,000,000 | 6,851 | 9 | 1,962 | 560 | 1,821 | 2,481
Master-Worker | | 1,500,000 | 6,794 | 7 | 2,059 | 406 | 1,907 | 2,399
Master-Worker | | 2,000,000 | 6,854 | 5 | 2,132 | 276 | 2,058 | 2,373
Master-Worker | Medium | 2,500,000 | 6,723 | 6 | 2,169 | 272 | 1,865 | 2,401
Master-Worker | | 3,000,000 | 6,842 | 6 | 2,224 | 262 | 2,024 | 2,316
Master-Worker | Coarse | 3,500,000 | 7,174 | 7 | 2,326 | 255 | 2,228 | 2,348
Master-Worker | | 4,000,000 | 7,448 | 4 | 2,394 | 104 | 2,617 | 2,324
Producer-Consumer | | 500,000 | 7,533 | 44 | 1,917 | 1,131 | 1,915 | 2,503
Producer-Consumer | | 1,000,000 | 6,892 | 24 | 1,973 | 560 | 1,929 | 2,396
Producer-Consumer | Fine | 1,500,000 | 6,788 | 18 | 2,045 | 400 | 1,887 | 2,430
Producer-Consumer | | 2,000,000 | 7,267 | 96 | 2,151 | 275 | 2,302 | 2,436
Producer-Consumer | Medium | 2,500,000 | 7,170 | 96 | 2,165 | 283 | 2,186 | 2,434
Producer-Consumer | | 3,000,000 | 7,099 | 13 | 2,184 | 269 | 2,153 | 2,474
Producer-Consumer | | 3,500,000 | 7,023 | 12 | 2,297 | 253 | 2,146 | 2,308
Producer-Consumer | Coarse | 4,000,000 | 7,588 | 8 | 2,396 | 104 | 2,722 | 2,356
Sayl | Fine | 500,000 | 7,703 | 32 | 2,060 | 1,127 | 1,976 | 2,496
Sayl | | 1,000,000 | 6,904 | 18 | 2,076 | 561 | 1,779 | 2,461
Sayl | Medium | 1,500,000 | 6,837 | 16 | 2,156 | 407 | 1,846 | 2,407
Sayl | | 2,000,000 | 7,016 | 11 | 2,201 | 266 | 2,145 | 2,387
Sayl | | 2,500,000 | 6,886 | 11 | 2,267 | 272 | 1,990 | 2,341
Sayl | | 3,000,000 | 7,091 | 11 | 2,342 | 259 | 2,104 | 2,371
Sayl | Coarse | 3,500,000 | 7,351 | 12 | 2,539 | 245 | 2,162 | 2,389
Sayl | | 4,000,000 | 7,367 | 7 | 2,402 | 105 | 2,506 | 2,342

Table 4.6: Latency Distribution Behavior Test for Master-Worker Under Different Architecture Configurations. All values are given in milliseconds

Master-Worker
File Size | NORMA Base | NORMA Monolithic | NORMA 4 Nodes | NORMA Best Latency | UMA Base | UMA Monolithic | UMA 4 Nodes | UMA Best Latency
500,000 | 917 | 1,011 | - | - | 999 | 898 | - | -
1,000,000 | 2,047 | 1,889 | 1,977 | Monolithic | 2,093 | 1,929 | 1,580 | 4 Nodes
1,500,000 | 3,002 | 2,843 | 2,920 | Monolithic | 3,130 | 2,746 | 2,195 | 4 Nodes
2,000,000 | 4,135 | 3,823 | 3,908 | Monolithic | 4,129 | 3,831 | 2,805 | 4 Nodes
2,500,000 | 5,374 | 5,114 | 5,078 | 4 Nodes | 4,370 | 4,454 | 3,684 | 4 Nodes
3,000,000 | 6,442 | 6,202 | 5,697 | 4 Nodes | 5,348 | 5,252 | 4,015 | 4 Nodes
3,500,000 | 7,776 | 7,439 | 6,561 | 4 Nodes | 6,336 | 6,253 | 4,625 | 4 Nodes
4,000,000 | 8,871 | 8,515 | 7,266 | 4 Nodes | 7,259 | 7,249 | 5,018 | 4 Nodes
4,500,000 | 10,395 | 9,992 | 8,132 | 4 Nodes | 8,321 | 8,394 | 5,491 | 4 Nodes
5,000,000 | 11,844 | 11,364 | 8,990 | 4 Nodes | 9,573 | 9,549 | 5,884 | 4 Nodes
5,500,000 | 13,218 | 12,416 | 9,545 | 4 Nodes | 11,812 | 10,604 | 6,303 | 4 Nodes
6,000,000 | 14,889 | 14,010 | 10,001 | 4 Nodes | 12,940 | 11,784 | 6,731 | 4 Nodes
6,500,000 | 16,355 | 15,502 | 10,360 | 4 Nodes | 15,291 | 13,120 | 7,148 | 4 Nodes
7,000,000 | 18,281 | 16,831 | 10,864 | 4 Nodes | 17,518 | 14,430 | 7,725 | 4 Nodes
7,500,000 | 19,752 | 18,311 | 11,456 | 4 Nodes | 17,843 | 15,794 | 8,120 | 4 Nodes
8,000,000 | 21,221 | 19,666 | 12,303 | 4 Nodes | 18,433 | 17,100 | 8,678 | 4 Nodes

Distribution Curve for The Sorting Case

We performed some pilot experiments to explore the behavior of each design pattern under the different memory architectures selected for the experiments. These experiments allow us to compare the latency obtained by the system when the monolithic, base, and 4-node distributed architectures are used. Tables 4.6, 4.7, and 4.8 show the latency obtained with the different architecture configurations (the architectures were described in section 4.1.5); the purpose of these tables is to make it easy to compare the best architecture configuration for a specific range of file sizes (i.e., a determined number of lines). Additionally, we intend to show when the monolithic architecture ends up being worse than a distributed architecture; surprisingly, the monolithic architecture is only better than a distributed one in a few cases, for files of minimal length. For the Producer-Consumer and Sayl architectures, it was not even possible to perform the monolithic experiments for the larger files due to lack of resources. The monolithic architecture presents poor latency performance due to the load of hosting all the required components.

Experiment Settings

For the distribution curve experiments, the network bandwidth was set to 1 Gbps. The distributed architecture uses 4 processing nodes for the sorting task. Finally, these experiments were conducted with processing nodes having 16 GB of RAM memory. All results are shown in milliseconds.

In general, we observed that the 4-node distributed architecture presents the best latency results in almost all combinations (i.e., from 1,000,000 lines onwards), with a few exceptions:

Table 4.7: Latency Distribution Behavior Test for Producer-Consumer Under Different Architecture Configurations. All values are given in milliseconds

Producer-Consumer
File Size | NORMA Base | NORMA Monolithic | NORMA 4 Nodes | NORMA Best Latency | UMA Base | UMA Monolithic | UMA 4 Nodes | UMA Best Latency
500,000 | 1,536 | 1,481 | - | - | 896 | 896 | - | -
1,000,000 | 2,879 | 2,721 | 2,664 | 4 Nodes | 1,878 | 1,878 | 1,720 | 4 Nodes
1,500,000 | 4,220 | 3,999 | 3,953 | 4 Nodes | 2,753 | 2,753 | 2,549 | 4 Nodes
2,000,000 | 5,532 | 5,281 | 5,087 | 4 Nodes | 3,744 | 3,744 | 3,114 | 4 Nodes
2,500,000 | 7,076 | 6,679 | 6,715 | Monolithic | 4,351 | 4,311 | 3,771 | 4 Nodes
3,000,000 | 8,416 | 8,084 | 7,998 | 4 Nodes | 5,351 | 5,170 | 4,189 | 4 Nodes
3,500,000 | 9,839 | 9,350 | 9,334 | 4 Nodes | 6,229 | 6,219 | 4,622 | 4 Nodes
4,000,000 | 11,331 | 10,975 | 11,010 | Monolithic | 7,376 | 7,174 | 5,122 | 4 Nodes
4,500,000 | 13,302 | 39,811 | 12,237 | 4 Nodes | 8,419 | 8,183 | 5,522 | 4 Nodes
5,000,000 | 14,884 | 100,374 | 12,732 | 4 Nodes | 9,585 | 9,352 | 5,862 | 4 Nodes
5,500,000 | Monolithic Limit | - | - | - | 11,090 | 10,473 | 6,308 | 4 Nodes
6,000,000 | Monolithic Limit | - | - | - | 12,019 | 11,738 | 6,576 | 4 Nodes
6,500,000 | Monolithic Limit | - | - | - | 13,944 | 13,280 | 7,031 | 4 Nodes
7,000,000 | Monolithic Limit | - | - | - | 14,933 | 14,739 | 7,733 | 4 Nodes
7,500,000 | Monolithic Limit | - | - | - | 16,220 | 15,992 | 8,030 | 4 Nodes
8,000,000 | Monolithic Limit | - | - | - | 17,631 | 17,546 | 8,772 | 4 Nodes

Table 4.8: Latency Distribution Behavior Test for Sayl Under Different Architecture Configurations. All values are given in milliseconds

Sayl
File Size | NORMA Base | NORMA Monolithic | NORMA 4 Nodes | NORMA Best Latency | UMA Base | UMA Monolithic | UMA 4 Nodes | UMA Best Latency
500,000 | 1,536 | 1,481 | - | - | 834 | 910 | - | -
1,000,000 | 2,879 | 2,721 | 2,951 | Monolithic | 1,665 | 1,856 | 1,598 | 4 Nodes
1,500,000 | 4,220 | 3,999 | 4,532 | Monolithic | 2,510 | 2,611 | 2,263 | 4 Nodes
2,000,000 | 5,532 | 5,281 | 6,134 | Monolithic | 3,450 | 3,572 | 2,724 | 4 Nodes
2,500,000 | 7,276 | 7,325 | 7,576 | Base | 4,351 | 4,001 | 3,362 | 4 Nodes
3,000,000 | 8,756 | 8,724 | 8,669 | 4 Nodes | 5,370 | 4,885 | 3,898 | 4 Nodes
3,500,000 | 10,241 | 10,288 | 9,844 | 4 Nodes | 6,300 | 5,869 | 4,441 | 4 Nodes
4,000,000 | 11,653 | 11,950 | 11,312 | 4 Nodes | 7,416 | 6,804 | 4,701 | 4 Nodes
4,500,000 | 13,770 | 45,497 | 12,803 | 4 Nodes | 8,452 | 7,918 | 5,411 | 4 Nodes
5,000,000 | Monolithic Limit | - | - | - | 9,593 | 8,977 | 5,702 | 4 Nodes
5,500,000 | Monolithic Limit | - | - | - | 10,756 | 10,176 | 5,981 | 4 Nodes
6,000,000 | Monolithic Limit | - | - | - | 11,816 | 11,357 | 6,577 | 4 Nodes
6,500,000 | Monolithic Limit | - | - | - | 13,131 | 12,510 | 7,110 | 4 Nodes
7,000,000 | Monolithic Limit | - | - | - | 14,489 | 13,832 | 8,134 | 4 Nodes
7,500,000 | Monolithic Limit | - | - | - | 15,672 | 14,972 | 8,687 | 4 Nodes
8,000,000 | Monolithic Limit | - | - | - | 17,373 | 16,458 | 8,495 | 4 Nodes

• In the Master-Worker NORMA, the distributed version presents performance improvements over the monolithic one from 2,500,000 lines onwards.

• In the Sayl NORMA, the improvement of the distributed version over the monolithic version is only observed from 2,000,000 lines onwards.

Synchronous Vs Asynchronous Queue Filling for The Sorting Case

In the Producer-Consumer and Sayl design patterns we faced a design decision not explicitly identified in their descriptions. This design decision is related to the operational behavior of the request queue in order to improve the system performance. For this purpose, we developed some pilot experimental tests that allowed us to make a decision with respect to how the tasks must be pushed to the queue (i.e., synchronously or asynchronously). The pilot tests consist of setting a standard configuration for the selected context-variables and evaluating the latency behavior obtained with each design pattern using threads for the asynchronous behavior, or no threads for the synchronous behavior (i.e., waiting for the queue response when an enqueue-task operation occurs). Tables 4.9, 4.10, 4.11, and 4.12 show the comparison between both strategies for the Producer-Consumer and Sayl patterns under the NORMA and UMA memory architectures, together with some relevant parameters of the configuration used for these tests.

Note: Experiments were performed at 1 Gbps. We used a set of ten different files of the same file size in these tests.

Results show that, for this case study and the selected design patterns, it is advisable to avoid the use of threads to enqueue tasks; this means that it is preferable to implement a synchronous behavior between the control and queue components in order to provide better system performance. The selection of the synchronous behavior does not impact the configuration or the selection of values for other context-variables in the experiments design. Prior to the decision of using threads or not, we evaluated two mechanisms to implement the asynchronous behavior, namely threads and @OneWay annotations. @OneWay is an SCA annotation which indicates that the method is non-blocking, allowing the communication with the service provider to buffer the requests and send them at some later time [14]. For the experimental tests performed with both mechanisms, we found a small improvement when using threads with respect to the annotation: in terms of average latency, threads are approximately 700 milliseconds faster than the SCA annotation.
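To illustrate the two enqueueing strategies compared in the pilot tests, the following is a minimal Java sketch. The QueueService interface, the Task record, and the in-memory queue are illustrative assumptions; they do not correspond to the actual SCA components nor to the @OneWay-based variant.

import java.util.concurrent.*;

// Illustrative contrast between synchronous queue filling (the control waits for each
// enqueue to complete) and asynchronous filling (a thread is spawned per enqueue).
public class EnqueueStrategies {

    record Task(int id, String payload) {}

    interface QueueService { void enqueue(Task t); }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Task> backing = new LinkedBlockingQueue<>();
        QueueService queue = t -> {                      // stand-in for the remote queue component
            try { backing.put(t); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        };

        // Synchronous filling: the control blocks until each enqueue operation returns.
        for (int i = 0; i < 3; i++) {
            queue.enqueue(new Task(i, "batch-" + i));
        }

        // Asynchronous filling: each enqueue is delegated to a separate thread.
        ExecutorService async = Executors.newCachedThreadPool();
        for (int i = 3; i < 6; i++) {
            final int id = i;
            async.submit(() -> queue.enqueue(new Task(id, "batch-" + id)));
        }
        async.shutdown();
        async.awaitTermination(10, TimeUnit.SECONDS);

        System.out.println("Enqueued tasks: " + backing.size()); // 6
    }
}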

4.2 Experiments Design

This section presents the definition of the different experiments that will be considered in this thesis project in order to accomplish the proposed research goals.

4.2.1 The Sorting Case - Experiments Design

Tables 4.13 and 4.14 show the different variables and values selected for this specific case study for the two evaluated performance factors. The performance factors were detailed in section 2.3.1.

Table 4.9: Producer-Consumer NORMA latency results using synchronous and asynchronous methods for triggering tasks. All latency data is presented in milliseconds.

File Size | Nodes | Batch Size | Latency using Threads | Latency without Threads | Best Latency Result
10,000,000 | 8 | 500,000 | 18,001 | 15,446 | No Threads
10,000,000 | 8 | 500,000 | 17,961 | 16,379 | No Threads
10,000,000 | 8 | 500,000 | 19,157 | 15,898 | No Threads
10,000,000 | 8 | 500,000 | 19,349 | 15,770 | No Threads
10,000,000 | 8 | 500,000 | 19,093 | 18,212 | No Threads
10,000,000 | 8 | 500,000 | 17,710 | 15,999 | No Threads
10,000,000 | 8 | 500,000 | 20,389 | 17,500 | No Threads
10,000,000 | 8 | 500,000 | 19,996 | 17,423 | No Threads
10,000,000 | 8 | 500,000 | 19,718 | 16,735 | No Threads
10,000,000 | 8 | 500,000 | 20,410 | 17,250 | No Threads
13,000,000 | 6 | 500,000 | 23,110 | 22,097 | No Threads
13,000,000 | 6 | 500,000 | 26,300 | 22,031 | No Threads
13,000,000 | 6 | 500,000 | 27,340 | 25,672 | No Threads
13,000,000 | 6 | 500,000 | 26,785 | 21,893 | No Threads
13,000,000 | 6 | 500,000 | 27,839 | 23,488 | No Threads
13,000,000 | 6 | 500,000 | 24,677 | 23,521 | No Threads
13,000,000 | 6 | 500,000 | 28,643 | 24,613 | No Threads
13,000,000 | 6 | 500,000 | 26,463 | 21,840 | No Threads
13,000,000 | 6 | 500,000 | 26,279 | 25,260 | No Threads
13,000,000 | 6 | 500,000 | 27,511 | 22,220 | No Threads

Table 4.10: Producer-Consumer UMA latency results using synchronous and asynchronous methods for triggering tasks. All latency data is presented in milliseconds.

File Size | Nodes | Batch Size | Latency using Threads | Latency without Threads | Best Latency Result
10,000,000 | 8 | 1,500,000 | 11,630 | 11,367 | No Threads
10,000,000 | 8 | 1,500,000 | 9,684 | 9,620 | No Threads
10,000,000 | 8 | 1,500,000 | 9,386 | 9,282 | No Threads
10,000,000 | 8 | 1,500,000 | 9,025 | 8,886 | No Threads
10,000,000 | 8 | 1,500,000 | 10,046 | 8,856 | No Threads
10,000,000 | 8 | 1,500,000 | 9,832 | 9,849 | Threads
10,000,000 | 8 | 1,500,000 | 10,685 | 9,948 | No Threads
10,000,000 | 8 | 1,500,000 | 10,051 | 9,506 | No Threads
10,000,000 | 8 | 1,500,000 | 9,411 | 8,554 | No Threads
10,000,000 | 8 | 1,500,000 | 8,996 | 8,712 | No Threads
13,000,000 | 6 | 1,500,000 | 16,242 | 15,101 | No Threads
13,000,000 | 6 | 1,500,000 | 14,750 | 13,869 | No Threads
13,000,000 | 6 | 1,500,000 | 14,371 | 13,316 | No Threads
13,000,000 | 6 | 1,500,000 | 17,454 | 15,881 | No Threads
13,000,000 | 6 | 1,500,000 | 13,897 | 12,727 | No Threads
13,000,000 | 6 | 1,500,000 | 17,102 | 15,511 | No Threads
13,000,000 | 6 | 1,500,000 | 14,084 | 13,204 | No Threads
13,000,000 | 6 | 1,500,000 | 14,297 | 15,038 | Threads
13,000,000 | 6 | 1,500,000 | 14,681 | 13,907 | No Threads
13,000,000 | 6 | 1,500,000 | 14,202 | 15,590 | Threads

Table 4.11: Sayl NORMA latency results using synchronous and asynchronous methods for triggering tasks. All latency data is presented in milliseconds.

File Size | Nodes | Batch Size | Latency using Threads | Latency without Threads | Best Latency Result
10,000,000 | 8 | 500,000 | 21,708 | 20,470 | No Threads
10,000,000 | 8 | 500,000 | 21,993 | 21,133 | No Threads
10,000,000 | 8 | 500,000 | 23,653 | 21,144 | No Threads
10,000,000 | 8 | 500,000 | 22,227 | 22,486 | Threads
10,000,000 | 8 | 500,000 | 23,899 | 23,635 | No Threads
10,000,000 | 8 | 500,000 | 21,543 | 22,241 | Threads
10,000,000 | 8 | 500,000 | 25,448 | 24,233 | No Threads
10,000,000 | 8 | 500,000 | 23,949 | 21,258 | No Threads
10,000,000 | 8 | 500,000 | 23,656 | 23,228 | No Threads
10,000,000 | 8 | 500,000 | 24,251 | 24,312 | Threads
13,000,000 | 6 | 500,000 | 28,400 | 28,587 | Threads
13,000,000 | 6 | 500,000 | 31,458 | 28,845 | No Threads
13,000,000 | 6 | 500,000 | 34,474 | 33,977 | No Threads
13,000,000 | 6 | 500,000 | 32,186 | 28,839 | No Threads
13,000,000 | 6 | 500,000 | 32,967 | 33,915 | Threads
13,000,000 | 6 | 500,000 | 31,361 | 29,164 | No Threads
13,000,000 | 6 | 500,000 | 33,997 | 33,388 | No Threads
13,000,000 | 6 | 500,000 | 32,221 | 31,594 | No Threads
13,000,000 | 6 | 500,000 | 33,151 | 31,780 | No Threads
13,000,000 | 6 | 500,000 | 31,069 | 30,755 | No Threads

Table 4.12: Sayl UMA latency results using synchronous and asynchronous methods for triggering tasks. All latency data is presented in milliseconds.

File Size | Nodes | Batch Size | Latency using Threads | Latency without Threads | Best Latency Result
10,000,000 | 8 | 500,000 | 10,827 | 10,034 | No Threads
10,000,000 | 8 | 500,000 | 11,777 | 9,790 | No Threads
10,000,000 | 8 | 500,000 | 10,826 | 10,037 | No Threads
10,000,000 | 8 | 500,000 | 12,074 | 9,764 | No Threads
10,000,000 | 8 | 500,000 | 10,985 | 10,454 | No Threads
10,000,000 | 8 | 500,000 | 12,994 | 9,792 | No Threads
10,000,000 | 8 | 500,000 | 11,364 | 11,091 | No Threads
10,000,000 | 8 | 500,000 | 11,528 | 9,756 | No Threads
10,000,000 | 8 | 500,000 | 12,800 | 10,032 | No Threads
13,000,000 | 6 | 500,000 | 17,548 | 26,436 | Threads
13,000,000 | 6 | 500,000 | 17,117 | 16,727 | No Threads
13,000,000 | 6 | 500,000 | 17,568 | 16,277 | No Threads
13,000,000 | 6 | 500,000 | 20,286 | 16,884 | No Threads
13,000,000 | 6 | 500,000 | 15,034 | 17,588 | Threads
13,000,000 | 6 | 500,000 | 19,846 | 16,707 | No Threads
13,000,000 | 6 | 500,000 | 16,349 | 18,456 | Threads
13,000,000 | 6 | 500,000 | 19,455 | 16,317 | No Threads
13,000,000 | 6 | 500,000 | 15,332 | 19,445 | Threads
13,000,000 | 6 | 500,000 | 18,033 | 16,307 | No Threads

The tables depict more variables than those listed in section 3.7.4 because they include the whole set of variables involved in the experiments rather than just the context-variables.

Table 4.13: Experiments for the sorting case. Compendium of the variables variations for Latency Experiments.

Variables | Number of Variations | Variation Values | Imply Architectural Modifications
Number of available distributed task processors | 4 | Start in 4 nodes, up to 10, varying by 2 | Yes
RAM Memory usage | 1 | We will monitor the behavior of this variable through the different planned executions. | No
Network Bandwidth | 2 | 100 Mbps and 1000 Mbps | No
Buffer size (Queue) | 1 | We use a dynamic queue (LinkedBlockingQueue); we will monitor its growth. | No
Batch Time Span | 3 | Defined by experiment | No
Batch Size | 3 | Defined by experiment | No
Memory Structure | 2 | UMA and NORMA | Yes
Communication Time | 1 | We will monitor the behavior of this variable through the different planned executions. | No
Size of File to Sort | 30 | From 1 million to 20 million lines; each line has 20 alphanumeric characters. From 1 million to 10 million lines, variations are made every 500,000 lines; from 10 million lines onwards, every 1 million lines. | No
Design Patterns | 3 | Producer-Consumer, Sayl, and Master-Worker | Yes
Performance Factor | 1 | Latency | Yes

Latency Experiments

According to the data in table 4.13, we must perform a total of 25,920 experiments. However, considering the replication method described in section 3.7.1 to provide validity to the experiment results, we must theoretically perform a total of 259,200 experiments. Nevertheless, given the progress made in the execution of the experiments and their preliminary analysis, several of these experiments could lose relevance, and it will not be necessary to execute all of them. This happens because of two principal facts: (i) the combination of the context-variables and their values; for example, a file that is split into four different tasks will be sorted in very similar times under the combinations with 4, 6, 8, and 10 task processors, so only the combination with 4 task processors is required for the experiments in order to save time; and (ii) some of the defined context-variables had to be evaluated within the experiment environment because the isolated experiments did not provide enough information, and some of these variables were determined to be not relevant for further experiments since they always impact the performance negatively. Further sections of chapter 5 present the corresponding analysis.

Throughput Experiments

The throughput experiments will be guided by the execution and preliminary results of the latency experiments. As we noted before, it is possible that some variables do not have a significant impact on the system performance; therefore, they will be eliminated from future experiments. Given the number of experiments required to evaluate the latency of the system and the time required for their execution, we will select for the throughput experiments a more restricted set of variables and variable values. The variables and their values selected for the throughput experiments after the preliminary latency experiments are depicted in table 4.14.

4.2.2 The Large XML Processing Case - Experiments Design

As we stated in section 3.7.1, we performed a complementary evaluation for the bachelor graduation project of Córdoba and Mejía [9].4 Table 4.15 shows the specific context-variables and their values for the latency performance factor. This complementary evaluation is based on the results presented in the mentioned document. Therefore, the experiment design is well-bounded.

4.3 Chapter Summary

In this chapter we have presented how the experiments design was defined in order to fulfill the research goals of this master thesis. We presented the experiment environment configuration: the software technologies used, the hardware and network architectures, the software architectures, the modeling of the memory structure under the different software architectures given by the design patterns, and some important pilot experiments performed to evaluate particular behaviors of the systems under a restrictive group of conditions. Finally, we presented in detail the experiments design formulated for each case study.

4In their graduation project, “large” refers to the relative size of the data files to be updated in the centralized data base. Usually, in the context of the organization that posed the problem, these files are in the order of kilobytes, and in this case the files are in the order of megabytes.

Table 4.14: Experiments for the sorting case. Compendium of the variables variations for Throughput Experiments.

Variables | Variation Values
Number of available distributed task processors | Start in 4 nodes, up to 10, varying by 2
Number of service requests | 8 for 5 million, 5 for 8 million, 3 for 11 million, 3 for 14 million, 2 for 17 million, and 2 for 20 million lines
RAM Memory usage | We will monitor the behavior of this variable through the different planned executions.
Network Bandwidth | 1000 Mbps
Buffer size (Queue) | We use a dynamic queue (LinkedBlockingQueue); we will monitor its growth.
Batch Size | Medium granularities
Memory Structure | UMA
Communication Time | We will monitor the behavior of this variable through the different planned executions.
Size of File to Sort | 5, 8, 11, 14, 17, and 20 million lines
Design Patterns | Producer-Consumer, Sayl, and Master-Worker
Performance Factor | Throughput

Table 4.15: Experiments for the large XML processing case. Compendium of variables variations for the complementary Latency Experiments.

Variables | Variation Values
Number of available distributed task processors | 4, 8, and 12
Number of components by task processor (i.e., number of consumers per node) | 1, 6, and 12
File Size | 1 MB and 5 MB files
Design Patterns | Producer-Consumer and Reactor

In the next chapter, we present the analysis performed over the experiments to observe the behavior of the performance factors under the variation of the context-variables and the domain-specific design patterns.

Chapter 5

Analysis of Experiment Results

In the previous chapter, we presented the design of experiments we conceived to characterize the relationships among the different variables that are significant for improving the performance of software systems, mainly from a domain-specific design patterns perspective. In turn, in this chapter we present the analysis of the results obtained from the execution of that experiments design. More concretely, we analyze the system response in terms of the latency and throughput performance factors, based on the data gathered from the experiments execution, applied to the two specific case studies described previously. The analysis of results is presented as follows. First, we consider the sorting case study in terms of its latency and throughput performance factors. For each of these performance factors, an analysis is performed over the variation of its respective context-variables, according to the experiment design defined for this case study. The respective variation values were presented in Tables 4.13 and 4.14 of chapter 4. Second, once both performance factors have been analyzed, we perform a comparative analysis between them. Third, we evaluate the large XML-file processing case study, fundamentally in terms of its latency, as defined in Table 4.15. A throughput analysis for this case study is summarized from our collaborative work presented in [9]. Finally, we present the best performance combinations for each case study.1 It is worth noting that for the sorting case study, our most intensive experiment design, we decided to address an analysis that evaluates the impact of each variable independently (i.e., ignoring the impact of other context-variables) and, subsequently, we evaluated the impact of several variables together. This allows us to understand how a particular variable, or a combination of them, impacts the performance of the software system. In addition, according to the selected variables for this thesis project, all analyses are presented from the point of view of a design pattern, subject to a memory structure configuration. These two variables are the ones that impact the architecture of the application most significantly, making it difficult to analyze them separately. On the other hand, it was not possible to address a similar analysis for the large XML processing case, since most of the selected variables were not applicable to it. This condition is described in further sections.

1All the relevant data and analyses performed in this chapter were registered in three main Excel files: (i) Latency Analysis Sorting.xlsx, (ii) Throughput Analysis Sorting.xlsx, and (iii) Throughput Analysis XML Processing.xlsx. All files are available at https://gforge.icesi.edu.co/docman/?group_id=14&view=listfile&dirid=51 and contain a set of sheets named after the corresponding analysis, which are easily interpretable.

5.1 Impact of Context-Variables on Latency for the Sorting Case

This section presents the analysis of the results obtained from the execution of experiments designed to evaluate the performance factor of latency for the Sorting case. As we stated before, we present initially the impact of the context-variables independently, and subsequently, an analysis of combinations of them. The selected design patterns, as well as the other context-variables involved in this experiment design, are illustrated in Table 4.13 of Chapter 4. Nonetheless, given the interrelationship among variables, the difficulty of analyzing their isolated effects, and in general, the complexity of the experiment execution, we chose the two most significant variables, design patterns and memory structure, as the main references to group, organize and present the analysis on the variations of the other variables. Therefore, this subsection is organized as follows: we first present the description of these two most significant context-variables, along which we perform the analysis of the other variables. Then, we present the analysis of the system response behavior under the variation of the other variables, sorted in descending order of impact on the system latency. Finally, we analyze the impact of combinations of context-variables, and present the global summary by design pattern and memory structure. The resulting presentation order of the context-variables analysis, after describing the two main variables in subsection 5.1.1, is the following:

• Network Bandwidth

• Communication Time

• RAM Memory Usage

• Buffer Size

• Batch Time Span

• File Length

• Task Granularity: Batch-Size

• Number of Available Distributed Task Processors

• Number of Available Task Processors + Task Granularity

• Number of Available Task Processors + File Length

• Task Granularity + File Length

• Number of Task Processors + Task Granularity + File Length

5.1.1 Design Patterns and Memory Structure Variations

Design patterns and memory structures are the two main variables that govern the analysis in the whole chapter. These variables significantly affect the architecture of the application under analysis, making them difficult to analyze completely independently of the others. These variables are present throughout all sections in order to characterize the behavior of the latency in the sorting case. As defined in chapter 4, the domain-specific design patterns analyzed in this thesis project were: (i) Master-Worker, (ii) Producer-Consumer, and (iii) Sayl. A formal description of these patterns is presented in section 3.6 of chapter 3. On the other hand, the evaluated memory structures were: (i) UMA and (ii) NORMA. A formal description of these is presented in section 2.4.1 of chapter 2, and how they were implemented for the experiment design is presented in section 4.1.6 of chapter 4.

5.1.2 Network Bandwidth

We analyzed the impact of two network bandwidth configurations in this thesis project (1 Gbps and 100 Mbps). In theoretical terms, there is a differential factor of 10 between both network bandwidths. However, how much difference can be observed in the experiments in terms of latency? The obtained results show that (ignoring the impact of other context-variable variations), on average, an application running at 100 Mbps would be just 3.57 times slower than the same application running on a 1 Gbps network under a NORMA memory structure, and 4.85 times slower under a UMA memory structure. Table 5.1 presents the average latency and some time records for both network bandwidth configurations, detailed by memory structure, which help to guide the performance analysis.

• Communication Time: The total time spent in communication. This total is the sum of communication times among all software components involved in the execution of the program; that is, the sum of all communication times even if they are concurrent.2 This is the reason why this value is greater than the total-time.

• Sort(Sum): The total time spent sorting by all sort distributed components.

• Reading Time (Sum): The total time spent reading a sort or merge task. That is, the time spent loading the required data to perform a sort or merge task. This total is the sum of reading times among all software components involved in the execution of the program even if they are concurrent.

• Writing Time (Sum): The total time spent writing a sort or merge task. That is, the time spent writing the required data after performing a sort or merge task. This total is the sum of writing times among all software components involved in the execution of the program even if they are concurrent.

• Merge: The total time spent merging by the centralized merge component.

• Total-Time: The total time elapsed to sort completely a file.

According to the results shown in table 5.1, we observed, as expected, that the processing times of the sorting and merging tasks were not affected by the network bandwidth. However, the communication time is clearly and directly impacted by the network bandwidth, but not in its theoretical measure (i.e., by a factor of 10). In NORMA, the differential factor in the communication time between an application running at 100 Mbps and at 1 Gbps is 5.32, and for UMA it is 2.20. However, in UMA the communication times are negligible.3 In UMA, the reading and writing times are directly impacted by the network bandwidth, since these times require that the data be transmitted over the network. The differential factors between both network bandwidths are 6.24 for the reading time and 5.64 for the writing time.

2In contrast to the measured times for the other variables in this experiment, the communication time is aggregated as occurring sequentially, even when it actually occurred concurrently. This is, of course, because in this experiment this is exactly the variable of analysis.
3A further analysis of this statement is presented in the section dedicated to the communication time variable.

Table 5.1: Average latency and times for 1 Gbps and 100 Mbps experiments detailed by main sorting algorithm stages and memory structure. Results are shown in milliseconds.

Stage | NORMA 1 Gbps | NORMA 100 Mbps | NORMA Rate | UMA 1 Gbps | UMA 100 Mbps | UMA Rate
Communication-Time | 21,174 | 112,706 | 5.32 | 20 | 44 | 2.20
Sort (Sum) | 3,780 | 3,948 | 1.04 | 3,453 | 3,523 | 1.02
Merge | 1,003 | 1,022 | 1.02 | 793 | 804 | 1.01
Reading Time (Sum) | 0 | 0 | 0 | 3,776 | 23,565 | 6.24
Writing Time (Sum) | 0 | 0 | 0 | 4,353 | 24,550 | 5.64
Total-Time | 14,085 | 50,311 | 3.57 | 7,137 | 34,616 | 4.85

In table 5.2, we present the summary of the difference between both network bandwidth configurations by memory structure and design pattern. The analysis by design pattern shows that Sayl makes a better utilization of the network resources under UMA memory, followed by Producer-Consumer and, lastly, Master-Worker. On the other hand, Master-Worker makes a better utilization of the network resources under NORMA memory, followed by Producer-Consumer and Sayl. The evidence shows that the network bandwidth is a context-variable that significantly impacts the latency of a program. However, the improvement between two different network bandwidths is not linear with respect to its theoretical differential factor: even though the experiment environment was dedicated to the execution of the experiments design defined in this thesis project, the maximum differential factor evidenced was 6.24. We found that the memory structure, together with the network bandwidth, is determinant in obtaining better latency performance.

Table 5.2: Latency average with 1 Gbps and 100 Mbps detailed by memory structure and design patterns. All results are shown in milliseconds.

Memory Structure / Design Pattern | 1 Gbps | 100 Mbps | Rate
NORMA | 14,085 | 50,311 | 3.57
NORMA - Master-Worker | 10,482 | 39,627 | 3.78
NORMA - Producer-Consumer | 14,638 | 53,074 | 3.63
NORMA - Sayl | 17,134 | 58,233 | 3.40
UMA | 7,137 | 34,616 | 4.85
UMA - Master-Worker | 7,237 | 34,686 | 4.79
UMA - Producer-Consumer | 7,109 | 35,468 | 4.99
UMA - Sayl | 7,064 | 33,694 | 4.77

Note: During the preliminary analysis of results for the network bandwidth, we observed that the impact between both configurations was stable. That, combined with the time required to perform all experiments, led us to the decision of conducting the 100 Mbps experiments only with files from 1 million to 13 million lines. Results presented in tables 5.1 and 5.2 therefore report the average latency for both network bandwidths with files of up to 13 million lines.

5.1.3 Communication Time

The communication time is an only-monitored variable selected for analysis in this project (i.e., we did not vary this variable; we only measured and registered its values along the execution of the software). For the analysis of the only-monitored variables we decided to work only with the values registered in the experiments performed at 1 Gbps, given the results obtained in the previous section. We must note that the analysis for this variable was initially conceived in a way that, after a second analysis, was found to be not totally correct. Therefore, in this section, and given the time constraints to finish the thesis and this document, instead of presenting the complete analysis of results, we can only present the correct rules and steps that should be followed in order to perform a correct analysis of the communication-time variable. Also, we present an analysis for a particular case following the introduced rules and evaluate the communication time for that case. Although it is not a generalizable analysis, and this is not what was planned, it helps to understand some part of the communication-time behavior. Finally, it is worth noting that the experiments were performed correctly; the problem was in the pre-processing algorithms used to aggregate the results for analysis, which were discovered to be not completely correct.

A Set of Rules or Steps to Perform a Correct Analysis of the Communication Time

The following logic is defined to process a raw latency file of the performed experiments in order to analyze the behavior of the communication-time variable, since the analysis performed so far does not provide enough information to determine the net communication time of an execution (i.e., we cannot determine exactly the proportion of the communication time with respect to the total time it takes to sort a file).

• A raw latency execution file must be read (these are reported in TXT format).

• Every file contains a set of records, each composed of the following fields:

– Stage: It indicates what the reported time refers to. Options are: (i) sort, (ii) merge, (iii) communication time, (iv) reading time (only for UMA), (v) writing time (only for UMA), and (vi) total-time.
– Processor: It indicates the name of the component involved in the time report. If the reported time is a communication time, this name is composed of the "origin-destination" of the communication.
– Moment: It defines the direction of the communication between the involved components. Options are: (i) send, or (ii) return. This field only applies to communication times.
– Time: It is the time that the stage takes (i.e., the reported time).
– Start: It is the time stamp at the exact moment of the start of a stage. The stamp is reported in milliseconds using Java's System.currentTimeMillis().
– End: It is the time stamp at the exact moment of the end of a stage. The stamp is reported in milliseconds using Java's System.currentTimeMillis().

• To understand the reported times in a time-line and analyze the behavior of a particular file execution, we must identify the relative processing times. To do this we must:

– Identify the minimum value of the Start field. By rule, the minimum value will always be the Start of the Total-Time stage. This is our zero-time. The minimum value must be stored temporarily during the execution of this whole process.
– Then, for every reported record of the latency report (i.e., for every reported stage), take its Start time and subtract the minimum value identified in the previous step from it. This gives us the relative start of the stage with respect to the zero-time.
– Using the duration of each stage reported in the Time field, we can now build a time-line with the relative start and relative end of each reported stage. This information is enough to build a graphical representation of the time-line and evaluate visually the behavior of any latency report execution.

• Using the relative times of the previous step, we identify the net time spent in processing stages (i.e., sort and merge). To do this, we must follow these rules (a Java sketch of the whole computation is provided after this list):

– Filter the relative times, keeping only the processing stages.
– Order the relative start times of the processing stages in ascending order.
– Define a variable to hold the partial results of the sum of processing times; hereinafter, proc_time_var. After ordering, the time of the first record is added to proc_time_var.
– Go to the next record.
– From here on, apply the following pseudo-algorithm. Abbreviations used in the pseudo-algorithm:
- relative end of previous record: R.E.P.R
- relative start of current record: R.S.C.R
- relative end of current record: R.E.C.R
- relative start of previous record: R.S.P.R

Pseudo-algorithm:

do {
    if (R.E.P.R <= R.S.C.R) then
        proc_time_var += (R.E.C.R - R.S.C.R)
    else
        proc_time_var += (R.E.C.R - R.S.P.R)
} while (there are processing records)

– Once the repetitive structure ends, proc_time_var will hold the value of the net processing time of a latency report execution.
– To obtain the net communication time, a simple difference must be applied: NetCommunicationTime = TotalTime - proc_time_var.
– The net communication time obtained must be registered in an Excel file with extra data, such as the file size, the memory structure, the design pattern, the number of nodes, and the batchSize.
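The following is a minimal Java sketch of the computation described by the rules above, under two assumptions: the record values are illustrative (they do not come from a real latency report), and in the overlapping case the sketch adds only the portion of the current record that extends beyond the interval already counted (i.e., it accumulates the union of the processing intervals), which we take to be the intent of the else branch of the pseudo-algorithm.

import java.util.*;

// Sketch of the net processing / net communication time computation described above.
public class NetCommunicationTime {

    record StageRecord(String stage, long start, long end) {} // relative times in ms

    public static void main(String[] args) {
        long totalTime = 24_571; // Total-Time stage of the report (illustrative value)

        // Processing stages (sort and merge) with their relative start/end stamps.
        List<StageRecord> processing = new ArrayList<>(List.of(
                new StageRecord("sort", 1_000, 4_000),
                new StageRecord("sort", 3_500, 6_000),   // overlaps the previous sort
                new StageRecord("merge", 20_000, 23_000)));

        // Order the processing records by relative start time.
        processing.sort(Comparator.comparingLong(StageRecord::start));

        // Accumulate the union of the processing intervals (proc_time_var in the rules).
        long procTimeVar = 0;
        long prevEnd = Long.MIN_VALUE;
        for (StageRecord r : processing) {
            if (prevEnd <= r.start()) {
                procTimeVar += r.end() - r.start();   // no overlap: add the whole record
            } else if (r.end() > prevEnd) {
                procTimeVar += r.end() - prevEnd;     // overlap: add only the uncovered tail
            }
            prevEnd = Math.max(prevEnd, r.end());
        }

        long netCommunicationTime = totalTime - procTimeVar;
        System.out.println("Net processing time:    " + procTimeVar + " ms");
        System.out.println("Net communication time: " + netCommunicationTime + " ms");
    }
}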

Table 5.3: Communication Time Analysis

Design Pattern | Memory Structure | Total-Time (ms) | Net Communication-Time (ms) | Rate
Master-Worker | NORMA | 24,571 | 18,919 | 77.00%
Producer-Consumer | NORMA | 27,036 | 17,413 | 64.41%
Sayl | NORMA | 32,541 | 23,165 | 71.19%
Master-Worker | UMA | 16,717 | 12,514 | 74.86%
Producer-Consumer | UMA | 17,642 | 12,505 | 70.88%
Sayl | UMA | 17,635 | 11,288 | 64.01%

A Particular Analysis of the Communication Time

Figures 5.1, 5.2, 5.3, 5.4, 5.5, and 5.6 present the behavior of the selected design patterns under the NORMA and UMA memory structures for the same file of 14 million lines; all figures correspond to a 1 Gbps, medium-granularity, 4-task-processor configuration. In order to understand these figures, some considerations must be taken into account:

• The Y-axis represents the Processor-Component name that executes the action.

• In the Y-axis, a Processor-Component name followed by a dash and another Processor-Component name refers to a communication time (origin-destination).

• The X-axis is a time-line of the latency expressed in milliseconds. In this axis, it is possible to see the duration of the process of a component.

We analyzed the source data of the depicted figures 5.1, 5.2, 5.3, 5.4, 5.5, and 5.6 according to the set of rules established in the previous section in order to analyze the net communication time of these particular file executions. Table 5.3 presents the values of the total-time of the evaluated file under all design patterns and memory structures, and its corresponding net communication time, calculated through the execution of the steps mentioned before. As a result, we are able to calculate the rate of the communication time over the time it takes to sort the whole file. We observe that the time spent in communication is relatively high (i.e., between 64% and 77% of the total-time). Therefore, the communication time, which is principally linked to the network bandwidth, has a significant impact on the performance of the system.


Figure 5.1: Master-Worker behavior with a 14 million lines file using medium-granularity with 4-Nodes in NORMA at 1 Gbps

Figure 5.2: Producer-Consumer behavior with a 14 million lines file using medium-granularity with 4-Nodes in NORMA at 1 Gbps

Figure 5.3: Sayl behavior with a 14 million lines file using medium-granularity with 4-Nodes in NORMA at 1 Gbps

Figure 5.4: Master-Worker behavior with a 14 million lines file using medium-granularity with 4-Nodes in UMA at 1 Gbps

Figure 5.5: Producer-Consumer behavior with a 14 million lines file using medium-granularity with 4-Nodes in UMA at 1 Gbps

Figure 5.6: Sayl behavior with a 14 million lines file using medium-granularity with 4-Nodes in UMA at 1 Gbps

5.1.4 RAM Memory Usage

The RAM memory usage is also an only-monitored variable. Results shown in Figures 5.7 and 5.8 allow us to determine that the behavior of the system performance is directly related to the RAM memory usage of the system's components. In these figures, we observe that the latency behavior over the different file-lengths4 presents a polynomial growth. The RAM memory usage, in contrast, shows in NORMA a high consumption for files from 1 million to 11 million lines; after that, the RAM memory usage seems to grow linearly. The RAM memory usage that we describe in figures 5.7 and 5.8 refers to the average RAM usage over all components involved in the execution of the algorithm under each design pattern. In UMA, however, the RAM memory presents a similar usage behavior through all file lengths. It is worth remembering from section 4.1.3 that we set the JVM heap size to 6 GB; this is the reason why figures 5.7 and 5.8 are bounded by 6,000 MB. In NORMA, the RAM memory is used up to 5,200 MB with the Master-Worker design pattern. On the other hand, in UMA, the RAM is used only up to 3,200 MB, also with the Master-Worker pattern. Our hypothesis for this behavior in NORMA with respect to the UMA behavior is that the data transportation requires too much RAM, producing an overload in every component of the solution in comparison with the RAM consumption of the components under the UMA structure. As noted, in NORMA the Master-Worker presents high RAM memory usage; even so, it presents the best latency results. In UMA, Producer-Consumer has the best latency results; however, it is the second most RAM-consuming design pattern. In general, the RAM memory usage is 40% lower in UMA than in NORMA. We also evaluated the behavior of the RAM memory for each of the components required by the design patterns; results are shown in figures 5.9 and 5.10. In general, we found that all components tend to use more memory when running the Master-Worker design pattern (i.e., comparing the common components between design patterns). The sort components present similar RAM memory usage across the design patterns under both memory structures (i.e., NORMA and UMA), with an average of 3,000 MB. The Merger, Standardizer, Separated-queue (Producer-Consumer), Separated-container (Sayl), and Intermediate-controller (Producer-Consumer and Sayl) components present high RAM memory usage in the NORMA architecture, that is, more than 90% of the RAM memory available to the JVM. These components present such a level of RAM memory usage because, in comparison with the sorter components, they have to keep in memory more than one batch of data at a time. However, in UMA, of those components, only the Merger presents high memory usage. The Merger is the only component that has to keep in memory the whole data of the file at a given moment in order to merge the previously sorted tasks.

4The File-Length variable is analyzed in depth in further sections.
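The per-component figures above come from monitoring the JVM heap of each component during execution. The following minimal Java sketch illustrates one way such sampling can be performed; the class name, the sampling period, and the use of the -Xmx flag to bound the heap at 6 GB are illustrative assumptions, not the exact instrumentation used in this thesis.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: periodically samples the used JVM heap of a component and
// keeps the running maximum. With the heap bounded (e.g., java -Xmx6g, see Section
// 4.1.3), samples are naturally capped at about 6,000 MB, as in Figures 5.7-5.10.
public final class HeapSampler {

    private static final long SAMPLE_PERIOD_MS = 500; // assumed sampling period
    private final AtomicLong maxUsedBytes = new AtomicLong();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            Runtime rt = Runtime.getRuntime();
            long used = rt.totalMemory() - rt.freeMemory(); // bytes currently used in the heap
            maxUsedBytes.accumulateAndGet(used, Math::max);
        }, 0, SAMPLE_PERIOD_MS, TimeUnit.MILLISECONDS);
    }

    public long maxUsedMegabytes() {
        return maxUsedBytes.get() / (1024 * 1024);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}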


Figure 5.7: Average RAM Usage vs Latency by File Length and Design Pattern in NORMA at 1 Gbps


Figure 5.8: Average RAM Usage vs Latency by File Length and Design Pattern in UMA at 1 Gbps


Average RAM Usage (MB):

Component                  Master-Worker   Producer-Consumer   Sayl
S1                         3,551           2,813               2,697
S2                         3,550           2,746               2,647
S3                         3,495           3,331               3,373
S4                         3,540           3,379               3,409
S5                         3,126           3,009               2,988
S6                         3,238           3,011               3,007
S7                         3,037           2,856               2,822
S8                         2,956           2,891               2,847
S9                         2,824           2,812               3,700
S10                        2,745           2,789               2,725
Merger                     5,954           5,927               5,847
Control                    5,932           4,048               3,987
Standardizer               5,861           5,512               5,465
Separated Queue            -               5,634               -
Separated Container        -               -                   5,983
Intermediate-Controller    -               5,889               5,823

Figure 5.9: Average RAM Usage by Components of the Software Architecture of each Design Pattern in NORMA at 1 Gbps


Average RAM Usage (MB):

Component                  Master-Worker   Producer-Consumer   Sayl
S1                         2,343           2,419               2,249
S2                         2,258           2,326               2,149
S3                         2,898           2,934               2,857
S4                         2,926           3,001               2,878
S5                         2,806           2,898               2,790
S6                         2,829           2,898               2,794
S7                         2,723           2,742               2,675
S8                         2,750           2,795               2,718
S9                         2,698           2,818               2,761
S10                        2,620           2,785               2,687
Merger                     5,154           5,107               4,984
Control                    1,436           1,624               1,436
Standardizer               1,949           1,941               1,956
Separated Queue            -               1,702               -
Separated Container        -               -                   1,589
Intermediate-Controller    -               1,420               1,494

Figure 5.10: Average RAM Usage by Components of the Software Architecture of each Design Pattern in UMA at 1 Gbps

5.1.5 Buffer Size

Another only-monitored variable to be analyzed is the buffer size of the queues involved in the architectures given by the design patterns; these queues temporarily store the tasks that must be processed by the system's components. It is worth remembering that, under the NORMA structure, tasks contain the data itself, whereas under the UMA structure tasks contain only the path to the file where the data resides. In the experiments, we monitored the buffer size to evaluate how much these queues grow while the program is running, and whether that growth impacts the system performance. As established in the buffer-size definition (Section 2.4.1), the buffer size reflects the balance between the rate at which tasks are generated and the rate at which they are processed.

Figures 5.11 and 5.12 show the average of the maximum number of tasks queued in the program before being processed. The NORMA results show that the Master-Worker and Producer-Consumer design patterns do not produce a relatively large number of queued tasks, in contrast to the Sayl design pattern, which enqueues a large number of tasks in its prepared container. Master-Worker enqueues at most five tasks for the largest evaluated file size (i.e., 20 million lines) of the experiments design, and Producer-Consumer at most six for the same file. Across the different file sizes, a growth in the maximum number of queued tasks is observable. Given that in NORMA tasks are stored in RAM, the growth in RAM usage (Section 5.1.4) can be associated with the number of queued tasks. The Sayl pattern, on the other hand, shows a maximum of 18 tasks enqueued in the prepared container for the largest file. From the Sayl template definition (Section 3.6.13), we know that this design pattern makes explicit the need for a queue that stores tasks whose required parameters are not yet ready for execution (the prepare container); the number of tasks enqueued in the ready container, however, is relatively low. Even though Sayl queues the greatest number of tasks, its RAM usage is similar to that of Producer-Consumer and Master-Worker.

From the experiments execution we know that some design patterns performed better than others5. However, it is not possible to associate the buffer-size behavior with better or worse latency, since there is not enough evidence that it affects either the RAM usage or the task-processing rate for our particular sorting case study; more case studies and experimentation would be required to produce generalizable conclusions. The task-processing rate is determined by the software architecture associated with each design pattern. Nevertheless, the low number of tasks in the queues indicates that tasks do not wait long in the queue before their execution.

In the UMA results, Master-Worker and Sayl present a relatively large number of queued tasks, whereas Producer-Consumer reports low numbers. We conclude that this behavior (i.e., a greater number of queued tasks than in NORMA) arises because, in the UMA memory structure, tasks consist only of an id and a file path.
Therefore, task production is significantly faster than in NORMA, increasing the number of tasks that must be held in the queue. Producer-Consumer, on the other hand, keeps fewer tasks in the queue due to its software architecture, in which consumers that are not busy are always ready to dequeue a task immediately.

5This statement is analyzed in detail in further sections.
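To make the difference in task contents concrete, the sketch below contrasts a NORMA-style task, which carries the batch data itself, with a UMA-style task, which carries only a reference into the shared file, and shows how a queue's maximum depth can be tracked as in Figures 5.11 and 5.12. The type and field names are illustrative assumptions, not the thesis implementation.

import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// NORMA-style task: the lines to sort travel with the task (heavier to produce and enqueue).
record NormaSortTask(int taskId, List<String> lines) { }

// UMA-style task: only an id and the location of the batch inside the shared file.
record UmaSortTask(int taskId, String filePath, long firstLine, long lineCount) { }

// A queue wrapper that records the maximum number of tasks ever waiting,
// mirroring the buffer-size monitoring reported in Figures 5.11 and 5.12.
final class MonitoredQueue<T> {
    private final LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final AtomicInteger maxDepth = new AtomicInteger();

    void put(T task) throws InterruptedException {
        queue.put(task);
        maxDepth.accumulateAndGet(queue.size(), Math::max);
    }

    T take() throws InterruptedException {
        return queue.take();
    }

    int maxObservedDepth() {
        return maxDepth.get();
    }
}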


Figure 5.11: Average Maximum Number of Tasks in Queue by Design Pattern in NORMA at 1 Gbps


Figure 5.12: Average Maximum Number of Tasks in Queue by Design Pattern in UMA at 1 Gbps

5.1.6 Batch Time Span

In the experiments design, we stated the need to evaluate the impact of the batch time span on the selected performance factors, as well as its viability for the experiments design. In summary, we found that this context-variable has no positive impact on any of the selected performance factors (i.e., there is no improvement in latency or throughput when a value greater than zero is set for this variable). This section presents the main findings on the impact of the batch time span on latency. Figures 5.13, 5.14, 5.15, and 5.16 show the results of the experiments execution.

• These tests were executed with the batch-size configurations defined for each design pattern, so a large number of tests ran under the 4-million-line batch-size configuration. In those cases it is not possible to evaluate the impact of the batch time span, because the files used in these experiments were also 4 million lines long, that is, a single batch; we executed them only for completeness. Some variations can be seen in the depicted graphs, but they are attributed to minor fluctuations in the stability of the execution environment.

• The times registered for the sort, merge, reading, and writing tasks were not affected by the batch time span variation, since the batch time span only delays the launch of the tasks (see the sketch after this list).

• Although the communication time is not affected by the batch time span, in a great number of tests it shows a significant variation compared to the variation of the other measured times. Please refer to the Communication Time analysis in Section 5.1.3.

• The latency is affected to a greater or lesser degree in each experiment depending on the batch-size variable: the smaller the batch size, the greater the negative impact on latency. The batch-size analysis is detailed in further sections.

• In the Master-Worker NORMA, Master-Worker UMA, and Producer-Consumer UMA configurations, the latency grows quasi-linearly with the value of the variable. In the UMA configurations, an apparent latency "improvement" can be seen for some batch time span values; however, this false improvement is due to time reductions in other variables such as the reading time.

• In the Sayl NORMA, Sayl UMA, and Producer-Consumer NORMA configurations, no behavior pattern caused by the batch time span can be distinguished; moreover, under those configurations the latency never improves.
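As a concrete illustration of the semantics assumed above, the following sketch inserts the batch time span as a fixed pause between consecutive batch submissions, so a value of zero leaves the launch schedule unchanged; the dispatcher and its parameters are illustrative assumptions rather than the exact thesis code.

import java.util.List;
import java.util.concurrent.ExecutorService;

// Illustrative dispatcher: submits one sort task per batch and sleeps for the
// configured batch time span between submissions (a value of zero disables the pause).
final class BatchDispatcher {
    private final ExecutorService workers;
    private final long batchTimeSpanMs;

    BatchDispatcher(ExecutorService workers, long batchTimeSpanMs) {
        this.workers = workers;
        this.batchTimeSpanMs = batchTimeSpanMs;
    }

    void dispatch(List<Runnable> batchTasks) throws InterruptedException {
        for (Runnable task : batchTasks) {
            workers.submit(task);
            if (batchTimeSpanMs > 0) {
                Thread.sleep(batchTimeSpanMs); // the only effect is to delay the next launch
            }
        }
    }
}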

5.1.7 File Length

We evaluated the behavior of the context-variables over several files of different lengths (i.e., number of lines in the file). Continuing with the base analysis of memory structure and design patterns under the two network bandwidth configurations, we obtained the results presented in Figures 5.17, 5.18, 5.19, and 5.20. Trend lines were added to the charts to facilitate their reading.

(a) Master-Worker Batch-Size: 1,000,000 (b) Master-Worker Batch-Size: 2,500,000

(c) Master-Worker Batch-Size: 4,000,000 (d) Sayl Batch-Size: 500,000

(e) Sayl Batch-Size: 2,000,000 (f) Sayl Batch-Size: 4,000,000

Figure 5.13: Batch Time Span Results for NORMA Configuration Part I

(a) Producer-Consumer Batch-Size: 500,000 (b) Producer-Consumer Batch-Size: 2,000,000

(c) Producer-Consumer Batch-Size: 4,000,000

Figure 5.14: Batch Time Span Results for NORMA Configuration Part II

(a) Master-Worker Batch-Size: 500,000 (b) Master-Worker Batch-Size: 2,500,000

(c) Master-Worker Batch-Size: 3,500,000 (d) Sayl Batch-Size: 500,000

(e) Sayl Batch-Size: 1,500,000 (f) Sayl Batch-Size: 3,500,000

Figure 5.15: Batch Time Span Results for UMA Configuration Part I

(a) Producer-Consumer Batch-Size: 1,000,000 (b) Producer-Consumer Batch-Size: 2,000,000

(c) Producer-Consumer Batch-Size: 4,000,000

Figure 5.16: Batch Time Span Results for UMA Configuration Part II


Figure 5.17: Average Latency by File Length in NORMA at 1 Gbps

For 1 Gbps (Figures 5.17 and 5.18), a polynomial behavior of order two, close to linear, is observed for both the NORMA and UMA architectures. The behavior is approximately linear up to 13 million lines; from that point onwards the latency adopts a polynomial growth. In NORMA, Master-Worker always obtains the best latency results, followed by Producer-Consumer and Sayl. In UMA, however, Producer-Consumer presents the best latency results starting from the distribution point (i.e., five million lines), with significant latency differences from 11 million lines onwards, while Master-Worker and Sayl present similar behaviors over all file lengths. At 100 Mbps (Figures 5.19 and 5.20), the behavior is largely linear; it is worth remembering that under this configuration files were tested only up to 13 million lines (as stated in Section 5.1.2). Similar to the 1 Gbps results, in NORMA Master-Worker obtains the best latency results, but unlike at 1 Gbps the improvement over the other two design patterns is significantly higher (between 16% and 41% with respect to the closest design pattern, compared with 6% to 34% at 1 Gbps). In UMA, all three patterns obtain very similar results.
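The order-two trend mentioned above can be summarized, under the assumption of a simple polynomial fit of the average latency against the number of lines n, as

latency(n) ≈ a·n² + b·n + c,  with a > 0,

where the linear term dominates up to roughly 13 million lines and the quadratic term becomes noticeable beyond that point; the coefficients a, b, and c are fit parameters that are not reported explicitly in this document.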

File Length Processing Limit

We performed some extra experiments at 1 Gbps to determine the largest file size that can be processed under the different memory structures considered in this project. Tables 5.4 and 5.5 show the latency results for the largest file sizes in NORMA and UMA, respectively. In NORMA, the largest file that could be processed has 23 million lines; larger files produced a Java Out of


Figure 5.18: Average Latency by File Length in UMA at 1 Gbps


Figure 5.19: Average Latency by File Length in NORMA at 100 Mbps


Figure 5.20: Average Latency by File Length in UMA at 100 Mbps

Table 5.4: Average Latency for Large File Sizes in NORMA at 1 Gbps. All results are shown in milliseconds

File Length     Master-Worker   Producer-Consumer   Sayl
20,000,000      41,699          44,460              55,923
22,000,000      50,957          46,812              59,554
23,000,000      59,755          49,588              63,776

Memory Error6, meaning that the JVM reached its Java heap size limit. In UMA, on the other hand, the largest file that could be processed has 40 million lines. In both memory structures, NORMA and UMA, all design patterns reach the Java heap size limit at the same file length. In general, Producer-Consumer presents the best latency results under the different memory structures.

5.1.8 Task Granularity: Batch-Size

We analyzed the impact of the task granularity through three different batch-size values (for details of the value selection and its representation in terms of number of lines to sort, please refer to Section 4.1.7). Figures 5.21, 5.22, 5.23, and 5.24 summarize the principal results for the 1 Gbps and 100 Mbps network configurations, respectively. In overview, the evidence shows that: (i) the task granularity has a remarkable impact on the latency of the system, (ii) for this specific case study it is preferable to work with finer batch sizes, and (iii) in the worst case the performance can degrade by up to a factor of two with respect to the best case due to a poor combination of batch size and design pattern.

6https://docs.oracle.com/javase/7/docs/api/java/lang/OutOfMemoryError.html
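The three granularity levels translate directly into the number of sort tasks generated per file: a file of L lines split into batches of B lines yields ceil(L/B) tasks. For example, with a batch size of 2,500,000 lines (one of the Master-Worker configurations shown in Figure 5.13), a 14-million-line file produces ceil(14,000,000 / 2,500,000) = 6 tasks, which in turn bounds how many task processors can actually work concurrently on that file.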

Table 5.5: Average Latency for Large File Sizes in UMA at 1 Gbps. All results are shown in milliseconds

File Length     Master-Worker   Producer-Consumer   Sayl
20,000,000      29,324          26,896              29,436
22,000,000      30,738          29,187              29,484
23,000,000      42,553          31,004              30,695
24,000,000      34,250          32,640              32,024
25,000,000      45,344          33,106              32,863
30,000,000      50,670          40,372              40,267
35,000,000      67,207          53,786              53,749
40,000,000      81,647          67,000              66,198


Latency (ms):

            Master-Worker   Producer-Consumer   Sayl
Fine        14,134          16,250              20,931
Medium      14,975          19,010              23,157
Coarse      16,714          24,613              27,917

Figure 5.21: Latency by Task Granularity in NORMA at 1 Gbps

The differential factor in latency within each design pattern due to the different task granularities ranges between 1.18 and 1.51 for both the NORMA and UMA configurations. Additionally, according to the general results, latency shows acceptable performance when tasks are distributed with fine or medium granularity, whereas coarse granularity negatively affects the performance results. In Section 4.1.7, a particular behavior was detected in the batch-size selection for the Master-Worker and Sayl patterns under the UMA memory structure: for these, fine granularity shows the worst latency results. As Figure 5.22 depicts, that behavior was observed across all experiments performed on the 1 Gbps network; however, it is not observed in the 100 Mbps experiments, as Figure 5.24 shows. The Master-Worker pattern presents the best latency results under the evaluated configurations, with the single exception of the 1 Gbps network and UMA memory combination where, as noted above, the Producer-Consumer pattern obtains better results. Among the design patterns, Master-Worker proved to be the least affected by the variation of the task granularity.

5.1.9 Number of Available Distributed Task Processors

We performed experiments varying the number of available distributed task processors (i.e., the number of nodes available for the sort task) from four to ten, in steps of two. In summary, we confirm that


Latency (ms):

            Master-Worker   Producer-Consumer   Sayl
Fine        11,959          9,860               12,126
Medium      10,107          10,281              10,137
Coarse      10,887          11,173              10,759

Figure 5.22: Latency by Task Granularity in UMA at 1 Gbps


Latency (ms):

            Master-Worker   Producer-Consumer   Sayl
Fine        38,223          48,351              53,425
Medium      39,317          49,644              55,375
Coarse      40,233          60,078              64,570

Figure 5.23: Latency by Task Granularity in NORMA at 100 Mbps


Latency (ms):

            Master-Worker   Producer-Consumer   Sayl
Fine        30,984          31,847              30,832
Medium      34,621          34,561              31,819
Coarse      37,646          39,285              37,621

Figure 5.24: Latency by Task Granularity in UMA at 100 Mbps

there is a performance improvement when the number of task processors is increased; however, the improvement given by a larger number of task processors is only appreciable when the file is large enough for all task processors to work concurrently. For a better understanding, please refer to the definition of task granularity in Section 2.4.1. The task granularity in the sorting experiments is understood as the batch size, which defines the size of the batch of work (i.e., number of lines per task) to be processed by each of the available task processors. Therefore, depending on the file sizes and task granularities analyzed in this thesis project, some task processors may have been idle. The impact of these variables together with the number of task processors is analyzed further in this chapter.

The charts in Figures 5.25 and 5.26 summarize the results obtained for the experiments performed at 1 Gbps, broken down by memory structure and design pattern. According to the NORMA results, Master-Worker is the design pattern that best exploits the increment in the number of task processors, comparing the latency obtained with 4 nodes against that obtained with 10 nodes, with a total improvement of 495 ms; it is followed by Producer-Consumer with 396 ms, and finally Sayl with a surprising performance decrease of 117 ms. In NORMA, Master-Worker presents the best average latency results compared with the other configurations. For the UMA results, Sayl takes better advantage of the processors, with a latency improvement of 833 ms, followed by Producer-Consumer with 807 ms and Master-Worker with 564 ms. In UMA, Producer-Consumer presents the best latency results.

The charts in Figures 5.27 and 5.28 summarize the results obtained for the 100 Mbps network bandwidth. Under the NORMA memory structure, Producer-Consumer presents the best total improvement with 1,118 ms, followed by Sayl with 208 ms, while Master-Worker shows a performance decrease of 922 ms; despite this, Master-Worker still presents the best latency results. Under the UMA memory structure, all three design patterns present similar latency results; however, Sayl takes better advantage of the processors with 2,281 ms, followed by Master-Worker with 1,484 ms and Producer-Consumer with 1,345 ms. As has been noted, an increment in the number of task processors is better exploited under the UMA memory structure than under NORMA. Similarly, a better utilization of the task processors is noticed at 100 Mbps than at 1 Gbps.


Latency (ms):

            Master-Worker   Producer-Consumer   Sayl
4 Nodes     15,599          20,254              23,999
6 Nodes     15,233          19,847              23,982
8 Nodes     15,135          19,863              24,064
10 Nodes    15,104          19,859              24,117

Figure 5.25: Latency by Number of Available Distributed Task Processors Configuration in NORMA at 1 Gbps
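The improvements quoted in the preceding paragraph are simple differences between the 4-node and the 10-node averages; for Master-Worker in NORMA at 1 Gbps, for example, 15,599 ms − 15,104 ms = 495 ms.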

As mentioned before, these results are based on the average latency over all experiments performed for the sorting case; in this section we analyze only the impact of the number of task processors, without yet considering the impact of other variables. For this reason, a performance improvement cannot be properly visualized by considering this variable alone, given that in a great number of experiment cases not all task processors can be used under the conditions established for the experiment. As stated at the beginning of this section, the impact of the number of task processors is better observed when analyzed together with other variables such as task granularity and file size.

5.1.10 Number of Available Task Processors + Task Granularity

So far we have analyzed the impact that different context-variables independently have on the latency performance factor of the sorting case study. We now present the analysis of the impact of multiple context-variables combined, in order to accomplish our characterization goal. In this section, we evaluate how the combination of the number of available task processors and the task granularity impacts the latency of the system. Figures 5.29, 5.30, 5.31, and 5.32 show the results for the selected design patterns in the NORMA and UMA memory structures. The behavior described in the individual analyses of the related context-variables (Sections 5.1.9 and 5.1.8) extends to this analysis: the behavior determined by the task granularity is reproduced across the four task-processor configurations. However, this analysis starts to reveal that the best latency results are obtained with the 8-node configuration. In conclusion, contrary to what might be supposed, a greater number of task processors does not necessarily mean better performance. In general, the best latency results are obtained with 8 nodes and fine granularity for the evaluated design patterns.

5.1.11 Number of Available Task Processors + File Length

According to Section 5.1.9, the number of available task processors does not produce significant differences in the latency of the system. In this analysis, we observe that the behavior of the


Latency (ms):

            Master-Worker   Producer-Consumer   Sayl
4 Nodes     11,452          10,965              11,533
6 Nodes     10,983          10,466              10,997
8 Nodes     10,690          10,187              10,761
10 Nodes    10,887          10,159              10,700

Figure 5.26: Latency by Number of Available Distributed Task Processors Configuration in UMA at 1 Gbps


Latency (ms):

            Master-Worker   Producer-Consumer   Sayl
4 Nodes     38,686          53,362              57,930
6 Nodes     39,270          52,756              57,767
8 Nodes     39,468          52,402              57,740
10 Nodes    39,608          52,244              57,722

Figure 5.27: Latency by Number of Available Distributed Task Processors Configuration in NORMA at 100 Mbps


Latency (ms):

            Master-Worker   Producer-Consumer   Sayl
4 Nodes     35,358          36,114              34,827
6 Nodes     34,372          35,207              33,502
8 Nodes     34,065          34,835              32,821
10 Nodes    33,874          34,769              32,546

Figure 5.28: Latency by Number of Available Distributed Task Processors Configuration in UMA at 100 Mbps

Figure 5.29: Average Latency by Number of Available Task Processors and Task Granularity in NORMA at 1 Gbps

Figure 5.30: Average Latency by Number of Available Task Processors and Task Granularity in UMA at 1 Gbps

Figure 5.31: Average Latency by Number of Available Task Processors and Task Granularity in NORMA at 100 Mbps

Figure 5.32: Average Latency by Number of Available Task Processors and Task Granularity in UMA at 100 Mbps

number of task processors, detailed by file length, is similar to the behavior described in Section 5.1.7 (File Length). However, for files with a large number of lines (i.e., more than 11 million lines), an improvement in the latency performance factor is evidenced when there is a larger number of task processors, especially in the configurations with UMA memory. Figures 5.33, 5.34, 5.35, 5.36, 5.37, and 5.38 depict the results of the experiments performed at 1 Gbps in both memory structures. We decided not to show charts for the 100 Mbps experiments, considering that their behavior can be inferred from the corresponding results of the related variables, as seen for the 1 Gbps configuration.

For the UMA memory structure, we conclude that the files evaluated in the experiments design for the sorting case might be too small to take advantage of the number of task processors defined; that is, not all task processors are always used in the execution of the sorting algorithm. For this reason, we only observed a latency improvement from having more task processors for the largest files. For the NORMA memory structure, on the other hand, there are no appreciable latency improvements with a larger number of task processors, not even with the largest files. A possible explanation is that the impact of communication times in this memory structure is much higher than the relative improvement provided by the additional task processors.

5.1.12 Task Granularity + File Length

In this section we analyze the impact on latency of the task granularity across files of different lengths, in order to provide more guidance for selecting a particular task granularity level for a specific file length. Figures 5.39, 5.40, 5.41, 5.42, 5.43, and 5.44 depict the results of the experiments performed at 1 Gbps in both memory structures. It can be observed that Producer-Consumer (Figure 5.40) and Sayl (Figure 5.41) present similar


Figure 5.33: Average Latency in Master-Worker by Task Processors Number and File Length in NORMA at 1 Gbps


Figure 5.34: Average Latency in Producer-Consumer by Task Processors Number and File Length in NORMA at 1 Gbps


Figure 5.35: Average Latency in Sayl by Task Processors Number and File Length in NORMA at 1 Gbps


Figure 5.36: Average Latency in Master-Worker by Task Processors Number and File Length in UMA at 1 Gbps


Figure 5.37: Average Latency in Producer-Consumer by Task Processors Number and File Length in UMA at 1 Gbps


Figure 5.38: Average Latency in Sayl by Task Processors Number and File Length in UMA at 1 Gbps


Figure 5.39: Average Latency in Master-Worker by Task Granularity and File Length in NORMA at 1 Gbps

behaviors in the NORMA memory: for these patterns, fine granularity provides the best latency results for files of up to approximately 16 million lines, while for bigger files medium granularity shows better latency performance; coarse granularity provides poor latency results for these design patterns. In Master-Worker (Figure 5.39), fine granularity provides the best results for files of up to 9 million lines; for bigger files, all task granularities behave similarly, with a slight advantage for medium granularity. In the UMA memory, on the other hand, as noticed in Section 5.1.8, a significant performance degradation is observed in Master-Worker and Sayl when fine granularity is used; Figures 5.42 and 5.44 show that this behavior occurs for files of 11 million lines onwards, while the other two granularities (medium and coarse) yield similar results. For Producer-Consumer (Figure 5.43), we observe the same behavior as for Master-Worker under the NORMA memory structure.

According to the behavior observed in the analysis of these two context-variables (task granularity and file length), it is reasonable to expect that coarse granularity will present the best latency results for files larger than those evaluated in this thesis project. There seems to be a point where the fine granularity becomes "too fine", increasing the communication times and consequently decreasing the performance. Task granularities must therefore be defined according to the size of the evaluation subject.

5.1.13 Number of Task Processors + Task Granularity + File Length

In this section, we analyze the impact on latency of the three context-variables that we have evaluated individually and in pairs. Our purpose with this analysis is to observe whether the conclusions previously drawn in Sections 5.1.9, 5.1.8, 5.1.7, 5.1.10, 5.1.11, and 5.1.12 remain valid when the variables are combined, and whether new conclusions can be drawn about the latency behavior.


Figure 5.40: Average Latency in Producer-Consumer by Task Granularity and File Length in NORMA at 1 Gbps


Figure 5.41: Average Latency in Sayl by Task Granularity and File Length in NORMA at 1 Gbps


Figure 5.42: Average Latency in Master-Worker by Task Granularity and File Length in UMA at 1 Gbps


Figure 5.43: Average Latency in Producer-Consumer by Task Granularity and File Length in UMA at 1 Gbps


Figure 5.44: Average Latency in Sayl by Task Granularity and File Length in UMA at 1 Gbps

From the behavior observed in the design pattern results under the NORMA memory, we conclude that:

• The coarse granularity always produces poor latency results in comparison with the other task granularities, for all selected design patterns and numbers of task processors.

• Master-Worker is the design pattern least affected by the task-granularity variable; that is, the variations in latency between the different granularities are minimal. In contrast, Producer-Consumer and Sayl are highly impacted by the task granularity. Our hypothesis is that the overhead associated with the larger number of components required by these design patterns, and therefore the extra data-transportation processes and RAM consumption7, is determinant for the impact of the task granularity.

• For files with more than 13 million lines, it is preferable to use medium granularity in order to obtain better latency performance; however, this choice depends on other variables such as the file size.

• The performance improvement related to the number of task processors is not significant for this memory structure and the evaluated file sizes. Given the behavior presented in the UMA memory structure, however, we suspect that the files used in these experiments are not large enough to appreciate an improvement related to a greater number of available task processors. For the evaluated file sizes, the performance improvement is determined by the task granularity.

• Although each combination of task granularity and number of available task processors has its own behavior, the behavior described in Section 5.1.7 (File Length) is in general preserved by

7Communication Time and RAM Memory Usage are analyzed in further sections.

the fine and medium granularities under all task-processor configurations in the design patterns.

From the behavior observed in the design pattern results under the UMA memory, we conclude that:

• The Master-Worker and Sayl design patterns present similar latency distributions. From 1 million to approximately 10 million lines, the best configuration is given by the maximum number of available task processors in the experiments (i.e., 10 nodes) with fine or medium granularity. From 10 million lines onwards, the best configuration is given by medium or coarse granularity, also with 10 task processors. In these two design patterns, a fine granularity considerably degrades the system performance. As stated in Sections 5.1.11 and 5.1.12, our hypothesis is that the improvement related to the number of task processors depends on other variables such as the file length and the task granularity, because the benefit of having more task processors is not reflected if some nodes remain idle.

• The Producer-Consumer design pattern does not vary as much in latency when the task granularity changes as Master-Worker or Sayl do. It presents good latency results with fine and medium granularities and 10 task processors for the set of considered file sizes; coarse granularity presents average latency results.

• Design patterns take more advantage of the available task processors under the UMA memory structure than under NORMA. However, the improvement obtained from additional task processors is not considerably significant; therefore, the trade-off between the cost of the extra processing nodes and the obtained improvement must be taken into account.


Figure 5.45: Average Latency in Master-Worker by Task Granularity, File Length, and Number of Available Task Processors in NORMA at 1 Gbps


Figure 5.46: Average Latency in Producer-Consumer by Task Granularity, File Length, and Number of Available Task Processors in NORMA at 1 Gbps


Figure 5.47: Average Latency in Sayl by Task Granularity, File Length, and Number of Available Task Processors in NORMA at 1 Gbps


Figure 5.48: Average Latency in Master-Worker by Task Granularity, File Length, and Number of Available Task Processors in UMA at 1 Gbps


Figure 5.49: Average Latency in Producer-Consumer by Task Granularity, File Length, and Number of Available Task Processors in UMA at 1 Gbps


Figure 5.50: Average Latency in Sayl by Task Granularity, File Length, and Number of Available Task Processors in UMA at 1 Gbps

Figure 5.51: Average latency by memory structure and design pattern at 1 Gbps

5.1.14 An Isolated Experiment for The UMA-NORMA Memory Structure

As stated in Section 4.1.6, we explored a variation of the UMA memory structure. Our purpose with this isolated experiment is to observe the behavior of this memory structure and compare it with the memory structures evaluated in the experiments design. Figure 5.51 shows a comparative chart of the latencies obtained with the different memory structures for the design patterns. As expected (i.e., considering the behavior established for this memory structure), the UMA-NORMA latency results fall between the NORMA and UMA latency results. It is observed, however, that the latency behavior is closer to that of the NORMA results than to that of the UMA results. Our hypothesis is that the load-and-split process carried out at the beginning of the UMA behavior is relatively short in comparison with the long times spent in the communication process when NORMA is used.

Note: The UMA-NORMA experiments were conducted only with files of 8, 9, 10, and 13 million lines and with the 1 Gbps network configuration. To make the results comparable, the comparison with the other memory structures (UMA and NORMA) depicted in Figure 5.51 considers only the latency results obtained for those file sizes.

5.2 Context-Variables Impact on Throughput for The Sorting Case

This section presents the analysis of the experiments defined to evaluate the throughput performance factor. It is worth remembering from Chapter 4 (Experiments Design) that, given the preliminary analysis performed for the latency experiments, the throughput experiments were conducted under a more restricted set of context-variable values. Unlike the latency experiments, the throughput experiments were conducted only for the combinations that presented the best performance results, namely:

• The UMA memory structure.

• 1 Gbps for network bandwidth.

• File sizes of 5, 8, 11, 14, 17, and 20 million lines.

• Medium granularity.

This section follows an analysis schema similar to that of the latency section (5.1) to present the results of the experiments.

5.2.1 Number of Service Requests

We performed some experiments to define an initial value for the number of service requests for the throughput performance factor in the sorting case. For this, we observed the behavior of the Master-Worker and Producer-Consumer design patterns under different numbers of service requests, that is, the number of concurrent files that arrive at the software to be sorted. Given that our software architecture is limited to a single merge component, we assume that there is a specific number of service requests under which all design pattern implementations provide their best throughput results. For the same reason, we expect the throughput results not to differ too much from the latency results; we confirm this hypothesis in Section 5.3. These experiments also aim to evaluate the impact of additional service requests on the system performance.

File Size | Design Pattern | Batch Size | Nodes | Number of Service Requests | Processed Files | Total Time (ms) | ms to Process One File
5,000,000 | Producer-Consumer | 1,500,000 | 10 | 20 | 216 | 1,281,102 | 5,931
5,000,000 | Producer-Consumer | 1,500,000 | 10 | 8 | 216 | 1,336,955 | 6,190
17,000,000 | Master-Worker | 2,500,000 | 10 | 3 | 60 | 1,322,101 | 22,035
17,000,000 | Master-Worker | 2,500,000 | 10 | 4 | 60 | 1,367,268 | 22,788

Table 5.6: Number of Service Requests Experiment

Table 5.6 shows that a higher number of service requests, which can be interpreted as user requests in real-world applications, does not necessarily affect the throughput of the software system negatively. In the experiments performed with the Producer-Consumer design pattern using files of five million lines, the results show a non-significant difference of 259 ms between 8 and 20 service requests. In the experiments with Master-Worker, we evaluated the impact of one single additional service request with large files (i.e., 17 million lines); the results show a difference of 753 ms. Therefore, we conclude that the impact on throughput is determined more by the control of the data flow (i.e., how the software was programmed) than by the load produced by extra user requests. In this particular case, having a single merge component determines the whole throughput rate; a two-merger version of this case study is studied in Section 5.2.4. The numbers of service requests used in the throughput experiments (defined in Section 4.2.1) were established by calculating the maximum load of concurrent files that the system can process without risk of overload, according to the file-length processing limit identified in the latency experiments.
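The per-file throughput values in Table 5.6 follow directly from dividing the total processing time by the number of processed files; for the first Producer-Consumer row, for instance, 1,281,102 ms / 216 files ≈ 5,931 ms to process one file.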

5.2.2 Design Patterns and File Size

In this section, we analyze the throughput of the sorting case under the selected design patterns and six different file sizes. Figure 5.52 shows that Producer-Consumer and Master-Worker provide the best throughput results for the different evaluated file sizes; however, Sayl also presents acceptable throughput results in comparison with them.


Figure 5.52: Average Throughput by File Length in UMA at 1 Gbps

5.2.3 Number of Available Distributed Task Processors

In this section, we analyze the impact of the number of available task processors on the throughput of the system. Figures 5.53, 5.54, and 5.55 show the throughput measures for the experiments summarized in Table 4.14 of Chapter 4. The results show no significant difference in the throughput obtained for the different configurations of nodes and design patterns. However, similar to the conclusions we reached for latency in Section 5.1.9 and its related sections, a throughput improvement was evidenced with a larger number of task processors, especially for the largest files. Figure 5.56 presents a comparison between the evaluated design patterns under the different node configurations: Master-Worker with eight and ten nodes, and Producer-Consumer with eight nodes, present the best throughput results.


Figure 5.53: Average Throughput by Number of Available Distributed Task Processors for the Master- Worker in UMA at 1 Gbps


Figure 5.54: Average Throughput by Number of Available Distributed Task Processors for the Producer- Consumer in UMA at 1 Gbps


Figure 5.55: Average Throughput by Number of Available Distributed Task Processors for the Sayl in UMA at 1 Gbps


Figure 5.56: Average Throughput by Number of Available Distributed Task Processors for All Design Patterns in UMA at 1 Gbps


Figure 5.57: Average Throughput with One and Two Mergers with UMA at 1 Gbps in Master-Worker

5.2.4 The Impact of Two-Mergers

We performed an extra experiment to evaluate the impact of having more than one merge component on the throughput of the sorting case. Figures 5.57, 5.58, and 5.59 show the throughput obtained using two merge components and its comparison with the one-merger results. These experiments were conducted using a 10-node deployment. The results show a significant improvement in throughput with the introduction of the extra merge component; as noted before, the merge component can turn into a bottleneck for the throughput results. The results show that the inclusion of the extra merge component can improve the throughput of the system by between 10% and 27%. In this case, Sayl is the pattern that takes better advantage of the extra component, followed by Master-Worker and Producer-Consumer. Based on these results, we assume that additional merge components can further improve the system performance; however, the number of merge components must not exceed the number of concurrent files or the number of available sorter processors, because otherwise they will be idle a considerable part of the time.
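As an illustration of why a second merger relieves the bottleneck, the sketch below distributes incoming files across a small pool of merger workers in round-robin fashion, so concurrent service requests no longer wait on the same merge component. The class and method names are illustrative assumptions and do not reproduce the thesis implementation.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

// Illustrative sketch: each merger owns a single-threaded executor, and whole
// files are assigned to mergers in round-robin order.
final class MergerPool {
    private final List<ExecutorService> mergers;
    private final AtomicInteger next = new AtomicInteger();

    MergerPool(int mergerCount) {
        this.mergers = IntStream.range(0, mergerCount)
                .mapToObj(i -> Executors.newSingleThreadExecutor())
                .toList();
    }

    // Submits the merge of one file's sorted batches to the next merger in the pool.
    void submitMerge(Runnable mergeWholeFile) {
        int index = Math.floorMod(next.getAndIncrement(), mergers.size());
        mergers.get(index).submit(mergeWholeFile);
    }

    void shutdown() {
        mergers.forEach(ExecutorService::shutdown);
    }
}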

5.3 A Comparative Analysis between Latency and Throughput for The Sorting Case

One of the goals of this thesis project is to evaluate whether the latency results can be taken as a reference to estimate the throughput of a software system, assuming the same deployment. Table 5.7 shows a comparison between the latency and throughput results obtained under a specific set of configurations. The correlation coefficient between the latency and throughput data shown in Table 5.7 is 0.9820. This high positive correlation indicates that the latency results can be taken as a reference for a throughput estimation as long as the deployment configuration is preserved. In addition, Figures 5.60, 5.61, and 5.62 show the linear regression by design pattern for the data depicted in Table 5.7. This confirms the relation between the latency and throughput performance factors


Figure 5.58: Average Throughput with One and Two Mergers with UMA at 1 Gbps in Producer-Consumer


Figure 5.59: Average Throughput with One and Two Mergers with UMA at 1 Gbps in Sayl

for this case study.
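The reported value of 0.9820 is the Pearson correlation coefficient between the latency values x_i and the throughput values y_i (ms to process one file) listed in Table 5.7, and the regression lines in Figures 5.60 to 5.62 are least-squares fits of throughput on latency, of the same form as the trend-line annotation visible in those charts (y = 1.2428x + 284.44, R² = 0.975):

r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i}(y_i - \bar{y})^{2}}}, \qquad \hat{y} = \beta_1 x + \beta_0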

File Size | Design Pattern | Batch-Size | Nodes | Throughput (ms to Process One File) | Latency (ms) | Throughput difference with respect to Latency
5,000,000 | Master-Worker | 2,500,000 | 4 | 5,824 | 5,846 | (22)
5,000,000 | Master-Worker | 2,500,000 | 6 | 5,997 | 5,796 | 202
5,000,000 | Master-Worker | 2,500,000 | 8 | 6,089 | 5,849 | 240
5,000,000 | Master-Worker | 2,500,000 | 10 | 6,150 | 5,849 | 301
5,000,000 | Producer-Consumer | 1,500,000 | 4 | 6,091 | 4,594 | 1,496
5,000,000 | Producer-Consumer | 1,500,000 | 6 | 6,284 | 4,701 | 1,583
5,000,000 | Producer-Consumer | 1,500,000 | 8 | 6,276 | 4,701 | 1,575
5,000,000 | Producer-Consumer | 1,500,000 | 10 | 6,279 | 4,701 | 1,578
5,000,000 | Sayl | 1,500,000 | 4 | 7,453 | 4,938 | 2,515
5,000,000 | Sayl | 1,500,000 | 6 | 7,261 | 5,102 | 2,160
5,000,000 | Sayl | 1,500,000 | 8 | 7,192 | 5,102 | 2,090
5,000,000 | Sayl | 1,500,000 | 10 | 7,207 | 5,102 | 2,105
8,000,000 | Master-Worker | 2,500,000 | 4 | 10,640 | 8,067 | 2,573
8,000,000 | Master-Worker | 2,500,000 | 6 | 10,782 | 8,621 | 2,162
8,000,000 | Master-Worker | 2,500,000 | 8 | 10,707 | 8,099 | 2,609
8,000,000 | Master-Worker | 2,500,000 | 10 | 10,455 | 8,099 | 2,356
8,000,000 | Producer-Consumer | 1,500,000 | 4 | 10,050 | 8,193 | 1,857
8,000,000 | Producer-Consumer | 1,500,000 | 6 | 9,850 | 7,069 | 2,781
8,000,000 | Producer-Consumer | 1,500,000 | 8 | 9,905 | 7,410 | 2,494
8,000,000 | Producer-Consumer | 1,500,000 | 10 | 10,172 | 7,410 | 2,761
8,000,000 | Sayl | 1,500,000 | 4 | 12,343 | 8,023 | 4,320
8,000,000 | Sayl | 1,500,000 | 6 | 11,599 | 7,509 | 4,090
8,000,000 | Sayl | 1,500,000 | 8 | 11,624 | 7,509 | 4,115
8,000,000 | Sayl | 1,500,000 | 10 | 12,179 | 7,509 | 4,671
11,000,000 | Master-Worker | 2,500,000 | 4 | 16,056 | 12,074 | 3,982
11,000,000 | Master-Worker | 2,500,000 | 6 | 15,441 | 11,081 | 4,360
11,000,000 | Master-Worker | 2,500,000 | 8 | 15,622 | 11,148 | 4,474
11,000,000 | Master-Worker | 2,500,000 | 10 | 15,547 | 10,564 | 4,984
11,000,000 | Producer-Consumer | 1,500,000 | 4 | 14,093 | 11,439 | 2,654
11,000,000 | Producer-Consumer | 1,500,000 | 6 | 14,199 | 11,107 | 3,093
11,000,000 | Producer-Consumer | 1,500,000 | 8 | 14,432 | 10,128 | 4,304
11,000,000 | Producer-Consumer | 1,500,000 | 10 | 14,567 | 9,754 | 4,813
11,000,000 | Sayl | 1,500,000 | 4 | 17,731 | 11,043 | 6,689
11,000,000 | Sayl | 1,500,000 | 6 | 17,081 | 11,488 | 5,592
11,000,000 | Sayl | 1,500,000 | 8 | 16,575 | 10,643 | 5,932
11,000,000 | Sayl | 1,500,000 | 10 | 16,725 | 10,643 | 6,082
14,000,000 | Master-Worker | 2,500,000 | 4 | 20,226 | 16,637 | 3,589
14,000,000 | Master-Worker | 2,500,000 | 6 | 19,715 | 14,281 | 5,434
14,000,000 | Master-Worker | 2,500,000 | 8 | 19,541 | 14,281 | 5,259
14,000,000 | Master-Worker | 2,500,000 | 10 | 19,578 | 14,281 | 5,296
14,000,000 | Producer-Consumer | 1,500,000 | 4 | 18,842 | 18,360 | 482
14,000,000 | Producer-Consumer | 1,500,000 | 6 | 19,316 | 16,858 | 2,458
14,000,000 | Producer-Consumer | 1,500,000 | 8 | 19,526 | 17,204 | 2,322
14,000,000 | Producer-Consumer | 1,500,000 | 10 | 19,607 | 15,376 | 4,231
14,000,000 | Sayl | 1,500,000 | 4 | 23,007 | 17,787 | 5,220
14,000,000 | Sayl | 1,500,000 | 6 | 22,022 | 17,079 | 4,944
14,000,000 | Sayl | 1,500,000 | 8 | 21,623 | 15,545 | 6,078
14,000,000 | Sayl | 1,500,000 | 10 | 21,189 | 14,811 | 6,379
17,000,000 | Master-Worker | 2,500,000 | 4 | 23,981 | 20,452 | 3,529
17,000,000 | Master-Worker | 2,500,000 | 6 | 23,695 | 20,929 | 2,766
17,000,000 | Master-Worker | 2,500,000 | 8 | 23,015 | 18,779 | 4,235
17,000,000 | Master-Worker | 2,500,000 | 10 | 23,356 | 18,779 | 4,577
17,000,000 | Producer-Consumer | 1,500,000 | 4 | 23,586 | 22,303 | 1,283
17,000,000 | Producer-Consumer | 1,500,000 | 6 | 24,067 | 20,489 | 3,579
17,000,000 | Producer-Consumer | 1,500,000 | 8 | 23,922 | 18,395 | 5,527
17,000,000 | Producer-Consumer | 1,500,000 | 10 | 23,824 | 19,879 | 3,945
17,000,000 | Sayl | 1,500,000 | 4 | 26,993 | 21,626 | 5,367
17,000,000 | Sayl | 1,500,000 | 6 | 26,041 | 20,973 | 5,068
17,000,000 | Sayl | 1,500,000 | 8 | 25,727 | 20,707 | 5,020
17,000,000 | Sayl | 1,500,000 | 10 | 25,677 | 20,517 | 5,160
20,000,000 | Master-Worker | 2,500,000 | 4 | 32,293 | 25,925 | 6,368
20,000,000 | Master-Worker | 2,500,000 | 6 | 31,808 | 26,436 | 5,373
20,000,000 | Master-Worker | 2,500,000 | 8 | 31,111 | 23,680 | 7,431
20,000,000 | Master-Worker | 2,500,000 | 10 | 30,783 | 23,680 | 7,103
20,000,000 | Producer-Consumer | 1,500,000 | 4 | 30,760 | 30,517 | 243
20,000,000 | Producer-Consumer | 1,500,000 | 6 | 31,989 | 28,064 | 3,925
20,000,000 | Producer-Consumer | 1,500,000 | 8 | 31,827 | 25,890 | 5,937
20,000,000 | Producer-Consumer | 1,500,000 | 10 | 31,675 | 26,249 | 5,426
20,000,000 | Sayl | 1,500,000 | 4 | 36,104 | 29,814 | 6,290
20,000,000 | Sayl | 1,500,000 | 6 | 34,104 | 27,757 | 6,347
20,000,000 | Sayl | 1,500,000 | 8 | 33,733 | 27,685 | 6,048
20,000,000 | Sayl | 1,500,000 | 10 | 33,488 | 26,463 | 7,024

Table 5.7: Results Comparison between Latency and Throughput for The Sorting Case

5.4 A Complementary Evaluation of The XML Processing Case

As we stated in Section 3.7.1, we analyzed a second case study in this master thesis. This case study was originally developed in coordination with Mejia and Cordoba as part of their bachelor graduation project [9]. They evaluated the throughput performance factor for multiple architectural configurations, with the goal of improving the processing time of XML files in a multinational company in Colombia. In this thesis project, we complemented that work by exploring the behavior of the


Figure 5.60: Linear Regression for the Master-Worker Results

Figure 5.61: Linear Regression for the Producer-Consumer Results


Figure 5.62: Linear Regression for the Sayl Results

latency performance factor under some of the most relevant architectural configurations identified in the bachelor graduation project.

5.4.1 Summary of the Throughput Analysis from "Processing Large Volumes of Data Physically Stored in XML Files"

In coordinating the development of this bachelor graduation project, we faced multiple challenges, such as understanding the company needs, researching design patterns, selecting suitable design patterns for this case study, researching different XML parsers and programming languages, designing different architectures to evaluate their performance, coding the design patterns, deploying the resulting system configurations, and executing the defined experiments design, among others. As stated before, this bachelor graduation project analyzes the throughput performance factor of the real-world large XML processing case. Table 5.8 summarizes the most relevant throughput results for the experiments performed for this case study.

Monolithic Deployments

Configuration | Consumer Components | Avg. Number of Tables (1 MB file) | Throughput (1 MB) | Avg. Number of Tables (5 MB file) | Throughput (5 MB) | Avg. Number of Tables (10 MB file) | Throughput (10 MB)
Frascati-Producer/Consumer | 1 | 66,880 | 1 file/53.54 s | 30,246 | 1 file/187.52 s | 71,051 | 1 file/449.1 s
Frascati-Producer/Consumer | 12 | 26,749 | 1 file/20.81 s | - | - | - | -
Frascati-Reactor | 12 | 26,325 | 1 file/40.9 s | 32,432 | 1 file/128 s | - | -
Ice-Java-Producer/Consumer | 1 | 66,880 | 1 file/49.94 s | 30,246 | 1 file/187.16 s | 71,051 | 1 file/450.9 s
Ice-Java-Producer/Consumer | 12 | 26,752 | 1 file/11.72 s | - | - | - | -
Ice-Java-Reactor | 12 | 26,752 | 1 file/20.57 s | 30,246 | 1 file/79.87 s | 399,058 | 1 file/195.26 s
Ice-C#-Producer/Consumer | 12 | 26,752 | 1 file/13.10 s | - | - | 710,780 | 1 file/114.31 s
Ice-C#-Reactor | 12 | 26,752 | 1 file/24.14 s | 30,246 | 1 file/81.01 s | 71,051 | 1 file/191.11 s
WCF-Producer/Consumer | 1 | 66,880 | 1 file/49.82 s | 30,246 | 1 file/180.36 s | 71,051 | 1 file/449.5 s
WCF-Producer/Consumer | 12 | 26,752 | 1 file/12.48 s | - | - | 710,780 | 1 file/113.97 s
WCF-Reactor | 12 | 26,752 | 1 file/29.24 s | 30,246 | 1 file/79.96 s | 71,051 | 1 file/189 s
Java-SpBaseDatos | 1 | 26,752 | 1 file/32.3 s | - | - | - | -
Java-SpBaseDatos | 12 | 26,752 | 1 file/6.016 s | 109,071 | 1 file/148.92 s | - | -
Prueba Base | 0 | 26,752 | 1 file/31.83 s | - | - | - | -

Distributed Deployments

Configuration | Consumer Components | Avg. Number of Tables (1 MB file) | Throughput (1 MB) | Avg. Number of Tables (5 MB file) | Throughput (5 MB) | Avg. Number of Tables (10 MB file) | Throughput (10 MB)
Frascati-Producer/Consumer | 12 | 26,752 | 1 file/11.35 s | 29,644 | 1 file/51.08 s | 710,780 | 1 file/132.92 s
Frascati-Producer/Consumer | 48 | 26,752 | 1 file/6.9 s | - | - | - | -
Frascati-Producer/Consumer | 96 | 26,752 | 1 file/10.67 s | 512,764 | 1 file/29.88 s | - | -
Frascati-Reactor | 12 | 26,752 | 1 file/21.95 s | 30,306 | 1 file/80.48 s | 71,306 | 1 file/189.48 s
Ice-Java-Producer/Consumer | 12 | 26,752 | 1 file/11.51 s | - | - | - | -
Ice-Java-Producer/Consumer | 96 | 26,752 | 1 file/6.1 s | 217,936 | 1 file/44.34 s | - | -
Ice-Java-Reactor | 12 | 26,752 | 1 file/20.51 s | 30,246 | 1 file/78.04 s | 413,304 | 1 file/186.23 s
Ice-C#-Producer/Consumer | 12 | 26,752 | 1 file/12.77 s | - | - | - | -
Ice-C#-Producer/Consumer | 96 | 26,752 | 1 file/6 s | 512,764 | 1 file/29.89 s | - | -
Ice-C#-Reactor | 12 | 26,752 | 1 file/20.45 s | - | - | - | -
WCF-Producer/Consumer | 12 | 26,752 | 1 file/12.48 s | - | - | 710,780 | 1 file/113.97 s
WCF-Producer/Consumer | 96 | 26,752 | 1 file/9.6 s | 512,764 | 1 file/30.21 s | - | -
WCF-Reactor | 12 | 26,752 | 1 file/20.48 s | - | - | - | -

Table 5.8: Summary of the principal throughput results analyzed in the "Processing Large Volumes of Data Physically Stored in XML Files" document [9]

5.4.2 Latency Performance Analysis

Table 5.9 summarizes the latency results of the experiments defined for the Large XML Processing case study in table 4.15 of chapter 4. These experiments combine three context-variables (i.e., number of available task processors, file size, and number of components per task processor) and two design patterns (i.e., Producer-Consumer and Reactor).

Results for the Producer-Consumer show that, in general, a larger number of consumers (i.e., the total of consumer components in the system) tends to produce a better latency performance. However, results also show that it is always preferable to have a large number of task processors rather than a large number of components per task processor, even in those configurations where the total number of components per task processor is larger than the number of task processors. Therefore, this becomes a relevant trade-off decision.

Contrary to expectations, results for the Reactor experiments with 1 MB files show a better latency performance in those configurations with fewer components per task processor, whereas the 5 MB experiments show better latency results with more consumer components. In summary, the Reactor does not show significant improvements as the number of processing components grows. Additionally, the Reactor latency results are always worse than the Producer-Consumer results.

Finally, we analyzed some comparable results between the latency and throughput for the Producer-Consumer pattern in this case study. We found a correlation coefficient of 0.9696, which means that the latency is positively related with the throughput. This conclusion is similar to the one obtained in the sorting case study.

In this section, we do not explore the impact of other variables previously mentioned throughout this thesis document, such as the task granularity and the buffer size, because they were not applicable to this case study. Some other variables are restricted to the complementary experiments defined, or there was not enough information to draw valid conclusions. Additionally, given the time constraints for this thesis project, and considering the work developed in coordination with Mejía and Córdoba in [9], we decided not to do a more in-depth analysis.
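As a reproducible illustration of how such a correlation coefficient can be computed from paired averages, the following minimal Java sketch implements the Pearson correlation. The sample arrays are hypothetical values for illustration only, not the actual experiment data, and the class name is ours.

// Minimal sketch: Pearson correlation between paired latency and
// throughput-time averages (hypothetical sample values).
public final class LatencyThroughputCorrelation {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt(n * sumX2 - sumX * sumX)
                           * Math.sqrt(n * sumY2 - sumY * sumY);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical paired averages in milliseconds.
        double[] latency        = {2_800, 7_600, 12_300, 27_000, 35_300};
        double[] timePerFile    = {3_100, 8_200, 13_900, 29_500, 38_800};
        System.out.printf("Pearson correlation: %.4f%n", pearson(latency, timePerFile));
    }
}

A coefficient close to 1, as reported above, indicates that configurations with lower latency also process files in less time.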

5.4.3 Batch Time Span Evaluation in the XML Processing Case

Given the results obtained in section 5.1.6 (Batch Time Span in the Sorting Case), we decided to explore the impact of the batch time span in a software system that has multiple processing components deployed on the same node. The experiment was performed using three different values for this variable: 2, 4, and 6 seconds of batch time span. For the experiment, we used a system configuration of eight available task processors (nodes) and six components per node, and a Producer-Consumer-based software architecture. We used 1 MB files for the test; on average, the reference processing time for a 1 MB XML file under the mentioned architecture was 7,598 ms. Results depicted in table 5.10 show that the batch time span decreases the system performance in terms of latency in all evaluated cases.
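To make the evaluated mechanism concrete, the following Java sketch approximates the batch-time-span behavior: items are buffered and only dispatched every batchTimeSpanMs milliseconds. This is an illustrative approximation under our own assumptions (hypothetical BatchingSender class), not the implementation used in the experiments.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Illustrative sketch of the batch-time-span variable: submitted items wait
// in a buffer and are only dispatched at fixed intervals, which explains the
// latency penalty observed when the span grows.
public final class BatchingSender<T> {

    private final List<T> buffer = new ArrayList<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public BatchingSender(long batchTimeSpanMs, Consumer<List<T>> dispatch) {
        scheduler.scheduleAtFixedRate(() -> flush(dispatch),
                batchTimeSpanMs, batchTimeSpanMs, TimeUnit.MILLISECONDS);
    }

    public synchronized void submit(T item) {
        buffer.add(item); // the item waits here until the next flush
    }

    private synchronized void flush(Consumer<List<T>> dispatch) {
        if (buffer.isEmpty()) return;
        dispatch.accept(new ArrayList<>(buffer)); // send the accumulated batch
        buffer.clear();
    }

    public void shutdown() {
        scheduler.shutdown();
    }
}

Without batching, items are dispatched immediately; with a span of 2,000 to 6,000 ms each item waits, on average, about half a span before it is even sent, which is consistent with the latency degradation reported in table 5.10.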

5.5 Best Performance Combinations

This section presents a summary of the best performance results obtained in the executed experiments of this thesis project. Results are divided first by case study and then by performance factor. The purpose of this section is also to show the architectural configurations that produced the best performance results.

Table 5.9: Average Latency Results for the Selected XML Processing Case

Design Pattern | Nodes | File Size | Components by Node | Total Components | Latency (ms)
Producer-Consumer | 12N | 1M | 12C | 144 | 2,875
Producer-Consumer | 8N | 1M | 6C | 48 | 7,598
Producer-Consumer | 8N | 1M | 12C | 96 | 12,336
Producer-Consumer | 12N | 1M | 6C | 72 | 12,353
Producer-Consumer | 4N | 1M | 12C | 48 | 15,834
Producer-Consumer | 4N | 1M | 6C | 24 | 18,141
Producer-Consumer | 12N | 1M | 1C | 12 | 22,508
Producer-Consumer | 8N | 1M | 1C | 8 | 25,122
Producer-Consumer | 12N | 5M | 12C | 144 | 26,962
Reactor | 12N | 1M | 1C | 12 | 29,679
Reactor | 12N | 1M | 6C | 72 | 31,464
Reactor | 12N | 1M | 12C | 144 | 34,619
Producer-Consumer | 8N | 5M | 6C | 48 | 35,322
Producer-Consumer | 12N | 5M | 6C | 72 | 35,360
Producer-Consumer | 8N | 5M | 12C | 96 | 35,892
Producer-Consumer | 4N | 5M | 12C | 48 | 38,072
Producer-Consumer | 12N | 5M | 1C | 12 | 49,290
Producer-Consumer | 8N | 5M | 1C | 8 | 53,229
Reactor | 12N | 5M | 12C | 144 | 89,540
Reactor | 12N | 5M | 6C | 72 | 92,337

Table 5.10: Average Latency for the XML Processing Case with different Batch Time Span configurations

Batch Time Span (ms) | Average Latency (ms)
0 | 7,598
2,000 | 58,127
4,000 | 280,673
6,000 | 165,891

5.5.1 Sorting Case

We present the best performance configurations (i.e., the architectural configurations that provided the best performance metrics) sorted by file size, considering that the file size is the main input for the execution of the experiments design.

5.5.2 Best Latency Results

Table 5.11 shows the latency results for the sorting case. The best latency refers to those architectural configurations that provide the minimum latency for a given file size. In conclusion, the best results for the performed experiments are obtained with the UMA memory structure, the Master-Worker and Producer-Consumer patterns, fine and medium granularities, and a number of task processors determined by the file size: the larger the file size, the more task processors are required to obtain a better performance. Our hypothesis is that Master-Worker and Producer-Consumer provide better results than Sayl because in these patterns the workers or consumers, respectively, do not have to wait for the coordination of a worker pool to get a task from the respective queue.
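The following minimal Java sketch illustrates the direct consumption style behind this hypothesis: each worker blocks on the shared queue itself and takes the next task as soon as it is free, with no intermediate worker-pool coordination step. The Task type and class name are hypothetical; this is not the thesis implementation.

import java.util.concurrent.BlockingQueue;

// Minimal sketch of direct task consumption: a free worker obtains the next
// task straight from the shared queue, without an intermediate coordinator.
public final class DirectConsumer implements Runnable {

    /** Hypothetical task abstraction; in the sorting case a task would be a batch of records. */
    public interface Task { void process(); }

    private final BlockingQueue<Task> taskQueue;

    public DirectConsumer(BlockingQueue<Task> taskQueue) {
        this.taskQueue = taskQueue;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Task task = taskQueue.take(); // blocks only while the queue is empty
                task.process();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop cleanly when interrupted
        }
    }
}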

5.5.3 Best Throughput Results

Table 5.12 shows the throughput results for the sorting case. The best throughput refers to those architectural configurations that provide the minimum time to process one file, expressed in milliseconds per file, for a given file size. It is worth remembering that the throughput experiments for the sorting case were conducted under a more restricted set of context-variable values in comparison with the latency experiments. In conclusion, the best results for the performed experiments are obtained with the UMA memory structure, the Master-Worker and Producer-Consumer patterns, medium granularities, and a number of task processors determined by the file size: the larger the file size, the more task processors are required to obtain a better performance. As expected from the latency results and the correlation between latency and throughput, the throughput results confirm that Master-Worker and Producer-Consumer provide better results than Sayl.

5.5.4 XML Processing Case

Table 5.13 shows the best latency performance configurations, ordered by file size and design pattern, for the XML Processing case study. This table summarizes the best combinations introduced in the previous table 5.9. We decided to show the best results by design pattern to evidence how much the latency behavior of a real-world software system can change with the choice of design pattern. The Producer-Consumer provides roughly 90% better latency than the Reactor. The Reactor shows a poor performance because of the specialization and segmentation of the processors per stored procedure: at specific moments, this design pattern does not take advantage of the whole set of processors available for stored procedures. Conversely, the Producer-Consumer does not present this behavior, given that a consumer is able to process any stored procedure involved in the XML file [9].
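The following Java sketch only illustrates this routing difference; the class, method, and parameter names are ours and do not reproduce the implementation from [9]. A Reactor-like dispatcher binds each stored-procedure type to a dedicated pool (which may sit idle while another pool is saturated), whereas a Producer-Consumer-like dispatcher sends every request to a single shared pool.

import java.util.Map;
import java.util.concurrent.ExecutorService;

// Illustration of the two routing styles discussed above (hypothetical names).
public final class DispatchStyles {

    // Reactor-like routing: each stored-procedure type has its own dedicated pool,
    // so one pool may be idle while another is overloaded.
    static void reactorDispatch(Map<String, ExecutorService> poolsByType,
                                String procedureType, Runnable call) {
        poolsByType.get(procedureType).submit(call);
    }

    // Producer-Consumer-like routing: every call goes to one shared pool,
    // so any free consumer can process any stored procedure.
    static void sharedDispatch(ExecutorService sharedPool, Runnable call) {
        sharedPool.submit(call);
    }
}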

5.6 Chapter Summary

In this chapter we analyzed the system response in terms of its latency and throughput performance factors, based on the data gathered from the experiments execution in the two specific case studies defined for this thesis project: (i) the sorting case and (ii) the large XML processing case. We presented the analysis by performance factor and case study, and for each of these combinations, we analyzed the impact on performance of the variation of the involved context-variables. We presented a comparative analysis between latency and throughput, and finally we introduced the best performance combinations (i.e., the best system configurations) under which it is possible to obtain the best performance measures for the evaluated case studies.

Table 5.11: Best Average Latency Combinations for the Evaluated File Sizes for the Sorting case at 1 Gbps

File-Size | Memory Architecture | Design Pattern | BatchSize | Nodes | Average Latency (ms)
1,000,000 | UMA | Master-Worker | 500,000 | 4 | 1,217
1,500,000 | UMA | Master-Worker | 500,000 | 4 | 1,519
2,000,000 | UMA | Master-Worker | 500,000 | 4 | 1,779
2,500,000 | UMA | Master-Worker | 500,000 | 8 | 2,107
3,000,000 | UMA | Master-Worker | 500,000 | 8 | 2,594
3,500,000 | UMA | Master-Worker | 500,000 | 8 | 3,006
4,000,000 | UMA | Master-Worker | 500,000 | 8 | 3,475
4,500,000 | UMA | Master-Worker | 500,000 | 8 | 4,017
5,000,000 | UMA | Master-Worker | 500,000 | 8 | 4,425
5,500,000 | UMA | Producer-Consumer | 1,500,000 | 6 | 5,062
6,000,000 | UMA | Producer-Consumer | 1,500,000 | 4 | 5,423
6,500,000 | UMA | Producer-Consumer | 1,500,000 | 8 | 5,768
7,000,000 | UMA | Producer-Consumer | 1,500,000 | 8 | 6,304
7,500,000 | UMA | Producer-Consumer | 1,500,000 | 6 | 6,783
8,000,000 | UMA | Producer-Consumer | 1,500,000 | 6 | 7,069
8,500,000 | UMA | Producer-Consumer | 1,500,000 | 6 | 7,760
9,000,000 | UMA | Producer-Consumer | 1,500,000 | 6 | 8,076
9,500,000 | UMA | Producer-Consumer | 1,500,000 | 10 | 8,444
10,000,000 | UMA | Producer-Consumer | 1,500,000 | 10 | 8,663
11,000,000 | UMA | Producer-Consumer | 1,500,000 | 10 | 9,754
12,000,000 | UMA | Producer-Consumer | 1,500,000 | 10 | 10,413
13,000,000 | UMA | Master-Worker | 2,500,000 | 10 | 12,255
14,000,000 | UMA | Master-Worker | 2,500,000 | 6 | 14,281
15,000,000 | UMA | Producer-Consumer | 1,500,000 | 10 | 15,275
16,000,000 | UMA | Producer-Consumer | 1,500,000 | 8 | 17,502
17,000,000 | UMA | Producer-Consumer | 1,500,000 | 8 | 18,395
18,000,000 | UMA | Master-Worker | 2,500,000 | 8 | 20,603
19,000,000 | UMA | Master-Worker | 3,500,000 | 6 | 21,811
20,000,000 | UMA | Master-Worker | 2,500,000 | 8 | 23,680
22,000,000 | UMA | Master-Worker | 2,500,000 | 10 | 25,654
24,000,000 | UMA | Master-Worker | 2,500,000 | 10 | 28,572
23,000,000 | UMA | Sayl | 1,500,000 | 10 | 30,695
26,000,000 | UMA | Master-Worker | 2,500,000 | 10 | 32,256
25,000,000 | UMA | Sayl | 1,500,000 | 10 | 32,863
30,000,000 | UMA | Producer-Consumer | 2,500,000 | 10 | 37,450
35,000,000 | UMA | Sayl | 1,500,000 | 10 | 53,749
40,000,000 | UMA | Sayl | 2,500,000 | 10 | 63,086

Table 5.12: Best Average Throughput Combinations for the Evaluated File Sizes for the Sorting case at 1 Gbps

File Size | Design Pattern | BatchSize | Nodes | Processed Files | Average Throughput (ms to Process One File)
5,000,000 | Master-Worker | 2,500,000 | 4 | 216 | 5,618
8,000,000 | Producer-Consumer | 1,500,000 | 10 | 110 | 9,683
11,000,000 | Producer-Consumer | 1,500,000 | 4 | 81 | 13,870
14,000,000 | Master-Worker | 2,500,000 | 8 | 60 | 18,288
17,000,000 | Master-Worker | 2,500,000 | 8 | 60 | 22,183
20,000,000 | Master-Worker | 2,500,000 | 10 | 43 | 30,129

Table 5.13: Best Average Latency Combinations for the Evaluated File Sizes for the XML Processing case at 1 Gbps

File-Length | Design Pattern | Nodes | Components by Node | Average Latency (ms)
1M | Producer-Consumer | 12N | 12C | 2,875
5M | Producer-Consumer | 12N | 12C | 26,962
1M | Reactor | 12N | 1C | 29,679
5M | Reactor | 12N | 12C | 89,540

Chapter 6

Summary and Conclusions

From a domain-specific design patterns perspective, in this thesis we have explored the impact that context-variables have on software systems in two particular performance factors: latency and throughput. Even though some domain-specific design patterns have proven that they can influence the achievement of software quality attributes, this relation has not been explored quantitatively in the literature. Therefore, there is not enough information to predict the impact of a particular design pattern on the satisfaction of extra-functional requirements in a software system, including performance, which was targeted in this thesis.

In order to obtain an initial understanding of the quantitative relationship between the application of design patterns and system performance, as well as the use of design patterns as architectural solutions to dynamically satisfy performance on a quantitative basis, we established a set of specific objectives whose fulfillment would allow us to understand the behavior of the aforementioned relationship from a quantitative point of view.

The first of our objectives was to select a subset of suitable domain-specific design patterns from those proposed in the literature for performance improvement. To achieve this, we conducted a systematic literature review (SLR) of domain-specific design patterns that presumably target the performance of systems. Based on this SLR, we elaborated a technical report of performance-domain design patterns, which is useful as a catalog for practitioners and researchers. The difficulty of producing this catalog lies in the fact that we had to understand and consolidate each of these design patterns in order to put them in a standardized design pattern template.

The second objective was to select a subset of significant context-variables that directly affect system performance. Based on the SLR information, we extracted a subset of significant context-variables that might impact the system performance. We again faced the task of understanding and consolidating each of the context-variables presented in the SLR, since authors tend to name the same variable (i.e., based on its definition) in multiple ways.

Concerning the third objective, to define a set of relevant case studies to evaluate the impact of context-variables and selected design patterns under different system configurations, in this thesis we addressed two different case studies: the sorting case and the large XML processing case. Our purpose was to evaluate a well-known theoretical case study (the sorting case) and observe how the involved variables impact the system's performance. Then, we evaluated a real-world business problem (the large XML processing case) of a multinational company to observe its behavior and determine whether we could extrapolate what was found with the theoretical case. However, given the difficulty of analyzing the large XML processing case, we had to study it in coordination with a separate but related bachelor graduation project. Taking into account the case studies, we performed an evaluation to select the domain-specific design patterns that are applicable to them,

selecting the Master-Worker, Producer-Consumer, and Sayl patterns for the sorting case, and the Producer-Consumer and Reactor patterns for the large XML processing case.

Fourth, to design and implement the base components required for realizing the selected design patterns in the given software systems, we developed in Java and in the FraSCAti middleware the base components required for realizing the design patterns. However, we faced multiple problems given the programming constraints of the middleware, as well as difficulties in coding an OOP design pattern in SCA. The SCA components developed in FraSCAti can be used in future work when a complete MAPE-K system is developed.

Fifth, to establish appropriate values for the selected context-variables to experiment with: once we had defined the case studies and design patterns, we had to select the context-variables that would be the subject of study in this thesis, making assumptions about which context-variables would probably cause more impact on the system's performance. We selected the following context-variables for study: (i) Number of Distributed Task Processors, (ii) Network Bandwidth, (iii) Buffer Size of the Queue, (iv) Memory Structure, (v) Task-Granularity, (vi) Size of the File, (vii) Communication-Time, (viii) Batch Time Span, (ix) Number of Components per Task Processor, and (x) RAM Memory Usage. However, in order to establish appropriate values for these context-variables, we had to perform multiple isolated experiments that gave us an idea of their behavior and confirmed the viability of exploring each of these variables. The isolation of experiments was carried out by controlling the experiments environment and varying some defined context-variable values according to expert judgment.

Sixth, to measure the impact of context-variable variation and domain-specific design patterns on the performance factors of the system, we conceived a design of experiments strategy that allowed us to deliberately make changes in the input variables and then observe how the outcome (i.e., the response) varies accordingly. The design of experiments defines the test environments of interest for this thesis project in order to obtain a measure of the impact that a context-variable variation and a design pattern have on a system's performance. This was certainly the most difficult objective to fulfill, given the time required to perform and record the complete set of experiments.

Seventh, to determine the design pattern and system configuration combinations that produce the best performance response: once we performed the complete set of experiments for both case studies, we were able to perform individual context-variable analyses and study their impact on the system performance from a domain-specific design pattern perspective. In a second phase, we performed an analysis of multiple context-variables together. All of this allowed us to determine the best performance configurations. These analyses constitute an effort to provide initial but significant experimental data to SAS designers, allowing the community to determine how to fulfill performance goals under changing context conditions at execution time.

The analyses performed in this thesis confirm that design patterns are a powerful design tool to improve the performance of a software system. A bad selection of a design pattern or a poorly defined architecture can significantly decrease the performance of a system.
We also confirm that there are context-variables that have a greater or lesser impact on the system's performance. The selection of the values of a context-variable is also a critical design decision, since the correct selection of these values may have a considerable impact on the performance. From our analyses, we identified that the influence or impact of a context-variable on the system's performance is usually maintained when it is combined with several other context-variables. Although there are multiple context-variables applicable to software systems, we measured and defined the most relevant ones (i.e., the variables that have a significant impact on the performance) implied by the objectives of this thesis. These variables are: (i) Number of Distributed Task Processors, (ii) Network Bandwidth, (iii) Memory Structure, (iv) Task-Granularity, (v) Size of the

File, (vi) Communication-Time, and (vii) Number of Components per Task Processor. Each of these variables was analyzed in depth in chapter 5. Our main conclusions on these variables are:

• For our experiments, we found that the number of task processors is crucial in distributed architectures: a larger number of task processors usually means a better performance, as long as there are no idle task processors. However, to our knowledge, even if there are no idle processors, a bad assignment or an overload of tasks on the task processors might decrease the performance.

• For the network bandwidth, we confirm that the greater the network bandwidth, the better the expected performance.

• The UMA memory structure presents the best results in all evaluated cases, since the data is not transmitted between software components of the application.

• The task granularity has a significant impact on the performance, and the definition of a granularity size is determinant for the magnitude of that impact. According to our results, it is recommendable to use fine or medium granularities, which usually allow better use of the available task processors (nodes). However, the task granularity is directly linked to the size of the file that is going to be processed: for example, a coarse granularity for a given file size can become a fine granularity for a file of greater size, as illustrated in the sketch after this list. The size of the file is the direct input to the case studies evaluated in this thesis.

• The communication time is a context-variable linked principally to the network bandwidth. Our preliminary analysis allows us to determine that it can represent at least 70% of the total time it takes to execute an algorithm.

• Finally, for the number of components per task processor, we concluded that there is no general rule to determine this variable's impact on performance; it depends entirely on the selected design pattern.
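As a worked illustration of the relationship between task granularity and file size noted in the list above, the following Java sketch computes the number of batches (tasks) produced by a given batch size. The helper class is hypothetical; the file lengths and batch size are example values in the range used in the experiments, not new results.

// Worked illustration: the same batch size yields very different task counts
// depending on the file length, which is why a "coarse" granularity for one
// file size behaves like a "fine" granularity for a larger one.
public final class GranularityExample {

    static long numberOfBatches(long fileLength, long batchSize) {
        return (fileLength + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        long batchSize = 2_500_000;
        // 2 tasks: too coarse to keep 8-10 task processors busy.
        System.out.println(numberOfBatches(5_000_000, batchSize));
        // 16 tasks: the same batch size now behaves as a finer granularity.
        System.out.println(numberOfBatches(40_000_000, batchSize));
    }
}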

For the selected domain-specific design patterns involved in this thesis, and considering the conclusions presented for the context-variables, we determined that Master-Worker and Producer-Consumer present the best latency and throughput results for the sorting case, whereas for the large XML processing case the Producer-Consumer presents the best latency and throughput results. The conclusions and information gathered from the performed experiment designs and the SLR elaborated for this thesis project constitute a valuable source of information for future work on SAS systems, especially those that use the MAPE-K model to address their reconfiguration when performance is one of the objectives to improve or maintain.

Bibliography

[1] G. D. Abowd, A. K. Dey, P. J. Brown, N. Davies, M. Smith, and P. Steggles. Towards a better understanding of context and context-awareness. In International Symposium on Handheld and Ubiquitous Computing, pages 304–307. Springer, 1999.

[2] M. Ali and M. O. Elish. A comparative literature survey of design patterns impact on software quality. In 2013 International Conference on Information Science and Applications (ICISA), pages 1–7. IEEE, 2013.

[3] J. Antony. Design of experiments for engineers and scientists. Elsevier, 2014.

[4] M. Barbacci, M. H. Klein, T. A. Longstaff, and C. B. Weinstock. Quality attributes. Technical Report CMU/SEI-95-TR-021, CMU/SEI, 1995.

[5] B. Barney et al. Introduction to parallel computing.

[6] M. Beisiegel, H. Blohm, D. Booz, J.-J. Dubray, A. C. Interface21, M. Edwards, D. Ferguson, J. Mischkinsky, M. Nally, and G. Pavlik. Service component architecture. Building systems using a Service Oriented Architecture. BEA, IBM, Interface21, IONA, Oracle, SAP, Siebel, Sybase, white paper, version, 9, 2007.

[7] F. Buschmann, K. Henney, and D. Schmidt. Pattern-oriented Software Architecture: On Patterns and Pattern Languages, volume 5. John Wiley & Sons, 2007.

[8] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, M. Stal, P. Sommerlad, and M. Stal. Pattern-oriented software architecture, volume 1: A system of patterns, 1996.

[9] F. Córdoba and R. Mejía. Procesamiento de Grandes Volúmenes de Datos Almacenados Físicamente en Archivos XML. Icesi University, 2016. Software Systems Engineering Graduation Project.

[10] R. de Lemos, D. Garlan, C. Ghezzi, H. Giese, J. Andersson, M. Litoiu, B. Schmerl, D. Weyns, L. Baresi, N. Bencomo, Y. Brun, J. Cámara, R. Calinescu, M. B. Cohen, A. Gorla, V. Grassi, L. Grunske, P. Inverardi, J.-M. Jézéquel, S. Malek, R. Mirandola, M. Mori, H. A. Müller, R. Rouvoy, C. M. F. Rubira, E. Rutten, M. Shaw, G. Tamburrelli, G. Tamura, N. M. Villegas, T. Vogel, and F. Zambonelli. Software engineering for self-adaptive systems: Research challenges in the provision of assurances. In R. de Lemos, D. Garlan, C. Ghezzi, and H. Giese, editors, Software Engineering for Self-Adaptive Systems III, volume 9640 of Lecture Notes in Computer Science (LNCS). Springer, 2017. In Press.

[11] R. De Lemos, H. Giese, H. A. Müller, M. Shaw, J. Andersson, M. Litoiu, B. Schmerl, G. Tamura, N. M. Villegas, T. Vogel, et al. Software engineering for self-adaptive systems: A second research roadmap. In Software Engineering for Self-Adaptive Systems II, pages 1–32. Springer, 2013.

[12] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design patterns: elements of reusable object-oriented software. Pearson Education, 1994.

[13] A. Grama. Introduction to Parallel Computing. Pearson Education. Addison-Wesley, 2003.

[14] http://frascati.ow2.org/. Annotation type oneway.

[15] IBM Corporation. An Architectural Blueprint for Autonomic Computing. June 2006.

[16] M. Jimenez. A framework for generating and deploying dynamic performance monitors for self-adaptive software systems. Master's thesis, Icesi University, July 2016.

[17] S. Keele. Guidelines for performing systematic literature reviews in software engineering. Technical Report Ver. 2.3, EBSE, 2007.

[18] J. Kephart and D. Chess. The vision of autonomic computing. Computer, 36(1):41–50, 2003.

[19] F. Khomh and Y.-G. Guéhéneuc. Do design patterns impact software quality positively? In Software Maintenance and Reengineering, 2008. CSMR 2008. 12th European Conference on, pages 274–278. IEEE, 2008.

[20] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey, and S. Linkman. Systematic literature reviews in software engineering–a systematic literature review. Information and software technology, 51(1):7–15, 2009.

[21] J. Kramer and J. Magee. Self-managed systems: an architectural challenge. In Future of Software Engineering, 2007. FOSE’07, pages 259–268. IEEE, 2007.

[22] C. Krupitzer, F. M. Roth, S. VanSyckel, G. Schiele, and C. Becker. A survey on engineering approaches for self-adaptive systems. Pervasive and Mobile Computing, 2014.

[23] A. Kumar, A. Dutt, and G. Saini. Merge sort algorithm. International Journal of Research, 1(11):16–21, 2014.

[24] M. Litoiu, M. Shaw, G. Tamura, N. M. Villegas, H. A. Müller, H. Giese, R. Rouvoy, and E. Rutten. What Can Control Theory Teach Us About Assurances in Self-Adaptive Software Systems? In R. de Lemos, D. Garlan, C. Ghezzi, and H. Giese, editors, Software Engineering for Self-Adaptive Systems III, volume 9640 of Lecture Notes in Computer Science (LNCS). Springer, 2017. In Press.

[25] J. Marino and M. Rowley. Understanding SCA (Service Component Architecture). Pearson Education, 2009.

[26] R. Mazo, J. C. Muñoz-Fernández, L. Rincón, C. Salinesi, and G. Tamura. VariaMos: an Extensible Tool for Engineering (dynamic) Product Lines. In Procs. of 19th Intl. Conf. Software Product Lines (SPLC), pages 374–379. ACM, 2015.

[27] J. C. Muñoz, G. Tamura, N. M. Villegas, and H. A. Müller. Surprise: User-controlled Granular Privacy and Security for Personal Data in SmarterContext. In Procs. of 2012 Conf. of the Center for Advanced Studies on Collaborative Research, CASCON '12, pages 131–145. IBM Corp., 2012.

[28] J. C. Muñoz-Fernández, G. Tamura, I. Raicu, R. Mazo, and C. Salinesi. REFAS: a PLE approach for simulation of self-adaptive systems requirements. In Procs. of 19th Intl. Conf. Software Product Lines (SPLC), pages 121–125. ACM, 2015.

[29] P. Oreizy, M. M. Gorlick, R. N. Taylor, D. Heimbigner, G. Johnson, N. Medvidovic, A. Quilici, D. S. Rosenblum, and A. L. Wolf. An architecture-based approach to self-adaptive software. IEEE Intelligent systems, 14(3):54–62, 1999.

[30] F. Paterna, A. Acquaviva, A. Caprara, F. Papariello, G. Desoli, and L. Benini. Variability- aware task allocation for energy-efficient quality of service provisioning in embedded streaming multimedia applications. IEEE Transactions on Computers, 61(7):939–953, 2012.

[31] J. Rudzki. How design patterns affect application performance: a case of a multi-tier J2EE application. In International Workshop on Scientific Engineering of Distributed Java Applications, pages 12–23. Springer, 2004.

[32] M. Salehie and L. Tahvildari. Self-adaptive software: Landscape and research challenges. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 4(2):14, 2009.

[33] L. Seinturier, P. Merle, R. Rouvoy, D. Romero, V. Schiavoni, and J.-B. Stefani. A component- based middleware platform for reconfigurable service-oriented architectures. Software: Practice and Experience, 42(5):559–583, 2012.

[34] A. Systems. Behind the Java warm-up problem.

[35] C. Szyperski. Component software: beyond object-oriented programming. Pearson Education, 2002.

[36] G. Tamura. QoS-CARE: A Reliable System for Preserving QoS Contracts through Dynamic Reconfiguration. PhD thesis, Université des Sciences et Technologies de Lille - Lille I (France); Universidad de Los Andes (Colombia), 2012.

[37] G. Tamura, R. Casallas, A. Cleve, and L. Duchien. QoS Contract Preservation through Dy- namic Reconfiguration: A Formal Semantics Approach. Science of Computer Programming (SCP), 94(3):307–332, 2014.

[38] G. Tamura, M. Jimenez, K. Lara, and J. Ropero. Domain-specific design patterns for perfor- mance in software systems: A systematic literature review. Technical report, Icesi University, 2014.

[39] N. Villegas, G. Tamura, and H. Müller. Architecting software systems for runtime self-adaptation: Concepts, models, and challenges. In I. Mistrik, N. Ali, R. Kazman, J. Grundy, and B. Schmerl, editors, Managing Trade-Offs in Adaptable Software Architectures, pages 17–43. Morgan Kaufmann, Boston, 2017.

[40] N. M. Villegas. Context Management and Self-Adaptivity for Situation-Aware Smart Software Systems. PhD thesis, University of Victoria, 2013.

[41] N. M. Villegas, H. A. Müller, and G. Tamura. On Designing Self-Adaptive Software Systems. Sistemas & Telemática, 9(18):29–51, 2011.

[42] N. M. Villegas, H. A. Müller, G. Tamura, L. Duchien, and R. Casallas. A framework for evaluating quality-driven self-adaptive software systems. In Proceedings of the 6th international symposium on Software engineering for adaptive and self-managing systems, pages 80–89. ACM, 2011.

[43] N. M. Villegas, G. Tamura, H. A. Müller, L. Duchien, and R. Casallas. DYNAMICO: A reference model for governing control objectives and context relevance in self-adaptive software systems. In Software Engineering for Self-Adaptive Systems II, pages 265–293. Springer, 2013.

[44] M. Woodside, G. Franks, and D. C. Petriu. The future of software performance engineering. In Future of Software Engineering, 2007. FOSE’07, pages 171–187. IEEE, 2007.

[45] L. Wu and R. Buyya. Service level agreement (sla) in utility computing systems. IGI Global, 2012.

[46] L. Wu, S. K. Garg, and R. Buyya. Service level agreement (sla) based saas cloud manage- ment system. In Parallel and Distributed Systems (ICPADS), 2015 IEEE 21st International Conference on, pages 440–447. IEEE, 2015.

[47] Intel Developer Zone. Granularity and parallel performance, 2012.
