
Eindhoven University of Technology

MASTER

Multi-core datapath contention modelling

Tang, X.

Award date: 2017


Disclaimer This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.

Multi-core Datapath Contention Modelling

Master thesis project (5T746) of Xinjue Tang, student ID 0898843

Host Company: Océ-Technologies B.V., Research & Development Department, St. Urbanusweg 43, 5914 CA Venlo, The Netherlands
Supervisor Océ: dr. Lou Somers
Supervisors TNO: dr. Tjerk Bijlsma, dr. Martijn Hendriks
Supervisor TU/e: prof. Twan Basten

GLOSSARY

Platform: The operating system and computer hardware.

CPU: An abbreviation of Central Processing Unit. It may contain one or more cores, and includes interconnection and caches.

Core: The basic computation unit of a CPU.

Multiprocessor: The use of two or more CPUs within a single computer system. A multiprocessing system includes multiple complete processing units.

Multithreading: A software concept. The ability of a CPU or a single core to execute multiple processes or threads concurrently, supported by the operating system.

Hardware threads: Refers to hyper-threading, which is Intel's proprietary simultaneous multithreading. The ability of a CPU or a single core to execute multiple processes or threads at the same time, supported by the hardware. A core supporting 2 hardware threads means 2 threads can be executed simultaneously.

Retired instructions: Refers to instructions that are actually executed and completed by a CPU. Modern processors execute many more instructions than the program flow needs (but results are "stored" only for retired instructions); this is called speculative execution. The instructions that are "proven" to be indeed needed by the program flow are the retired instructions.


ABBREVIATIONS

DES    Discrete Event Simulation
DVFS   Dynamic Voltage and Frequency Scaling
HRT    Hard Real Time
WCRT   Worst-Case Response Time
SRT    Soft Real Time
NRT    Not Real Time
AuE    Application under Evaluation
LLC    Last Level Cache
SO     Static Ordering
LJF    Longest Job First
NGMP   Next-Generation Multi-core Processor
VTune  Intel® VTune™ Amplifier 2013
CPI    Cycles Per Instruction
NUMA   Non-Uniform Memory Access
ALU    Arithmetic Logic Unit


TABLE OF CONTENTS

Glossary
Abbreviations
Table of Contents
I. Abstract
II. Introduction
III. System Description
IV. Performance Prediction: the State of The Art
V. Modeling Approaches: Analytical Model vs. DES Model
VI. Development and Requirements
VII. Test-Sets and Platforms for F-path
VIII. Introduction on the DES Model Components
IX. The F-path Model without a Penalty Model
X. F-path DES Model with the Static Penalty Model
   A. Predict the processing time of bitmaps with increasing number of cores
   B. Predict the processing time of different bitmaps
   C. Predict the performance on different platforms
XI. F-path DES Model with the Dynamic Penalty Model
   A. Hypothesis and Validation
      1) No Hyper-threaded mode
      2) Hyper-threaded mode
   B. Model Realization
      1) The dynamic penalty model developed for capability 1
      2) The dynamic penalty model developed for capability 2
      3) The dynamic penalty model developed for capability 3
XII. Comparisons to the Analytical Model
XIII. Combined Datapath DES Model with the Dynamic Penalty Model
XIV. Conclusion
XV. References


I. ABSTRACT
One printer currently developed by Océ comprises two platforms, each of which executes one of the pipelines used to generate the firing patterns for the printer. To save cost on manufacturing this printer, it is natural to consider implementing both pipelines on the same platform. At the same time, Océ is not willing to sacrifice throughput. Hence, inspecting the possibility of implementing the pipelines on one platform without violating the throughput constraints is the research topic Océ proposes. Modeling is one way to inspect the performance of systems. From the perspective of Océ, this model should be able to estimate the processing time of the printer on different inputs. Currently, two models have been accomplished to estimate the speed of each pipeline running on its own platform. One model is developed by discrete event simulation, called the DES model, while the other is derived by regression analysis, called the analytical model. Although these two models have been validated against the measured processing time on the available test-sets, this report shows that in some extreme cases, predictions made by the analytical model are less accurate. The report also argues that the processing time of the proposed implementation cannot be convincingly predicted by just adding the predicted processing times of the two models. Thus, a new model using discrete event simulation is built here as an alternative to the analytical model, which helps to address the two problems raised above. Besides, parts of this new DES model, such as the resource models, are reusable, and are shared with the existing P-path DES model when the models are combined. Another advantage of a DES model is its ability to easily inspect the influence of different scheduling algorithms, which helps to find new algorithms that can increase the throughput of the printer. In summary, the DES model proposed in this project has more capabilities than the analytical model.

II. INTRODUCTION
In the printing industry, throughput is one of the major criteria to indicate the quality of a printer, and of its datapath, which is responsible for the image processing. Amdahl's law [13] states that the speedup obtained by parallelizing software depends on the proportion of the software component that can run in parallel; exploiting this parallelism further enhances the throughput. Hence, increasing the number of cores is a trend in manufacturing to meet hard constraints on throughput. While manufacturers try to apply more cores in their products, they realize that the expected gain in performance often cannot be achieved. It is concluded in [1] that design decisions made before implementation have a great influence on the total development cost of a product. This leads to the need for performance estimation in the design phase, in which critical decisions such as the type of processor and operating system are made; these decisions call for a model for general performance prediction.

In this project, the target for performance prediction is a datapath system that contains two pipelines, shown as two white rectangles in Figure 1. The first pipeline, called P-path, converts input files into bitmaps; the following one, called F-path, translates the bitmaps into firing patterns for the inkjet print heads. These two paths are realized on two different platforms in the current implementation. In between, there is a hard drive isolating the two pipelines. It stores the bitmaps converted by P-path, and F-path can then send requests asking for these bitmaps. The printer starts to operate the inkjet print heads after it receives firing patterns from F-path. The idea proposed is to make these two paths share the same platform to save cost, while the throughput constraints should still be met. Hence, before implementation, as mentioned above, an early design model should be made to inspect this possibility. Currently, a DES model of P-path and an analytical model of F-path have been

accomplished. Besides, there is also a combined model, obtained by appending the analytical model of F-path to the DES model of P-path.

Figure 1. Current Datapath System: application with its hardware deployment

A question is raised whether the current models are accurate enough to predict the throughput of the datapath system. Actually, these two models have proved their ability to predict the throughput accurately in previous work. However, the accuracy of the combined model cannot be confirmed, because there is no accomplished implementation of the combined system that can be used for model validation. Chapter V in this report analyzes the two modeling approaches – regression analysis and DES; it shows the necessity of building a DES model of F-path to achieve a more convincing combined datapath model than the current one.

Figure 2. Performance Model in an iterative V-model development process

This raises another question, namely how to build such a DES model. Figure 2, taken from [2], indicates a development process to build a predictive performance model. The prerequisite is a validated model of an existing system. From this built model, a new model can be developed according to the analysis of the performance-related requirements, shown as step ① in Figure 2. Chapter VI describes this performance model in more detail and gradually imposes the requirements for developing the F-path DES model. In [14], a framework is proposed to develop a DES model, and a model of the computational resource – the multi-core CPU – is provided in this framework. However, the abstract piecewise linear algorithm used to resolve the contention in this multi-core CPU may not be the one used in the datapath system. Moreover, besides

the contention, there are some other causes that affect the utilization of the processor, for example hyper-threading and DVFS techniques. Besides, there are probably some shared resources, not yet modeled, for which contention can influence the performance greatly. Therefore, the goal here is to 1) modify the abstract multi-core CPU model in this framework, and 2) investigate the shared resources on which contention could become a performance bottleneck, and realize these resources in the models. To make the DES model sensitive to various environments, a model called the penalty model is built to scale the utilization of all modeled computational resources. In this report, two design approaches are applied in the design phase shown in the V-model to build such a penalty model. One is the top-down approach, and the other one is the bottom-up approach. This leads to two different DES models created in the whole project for the F-path system. Chapters X and XI show how to apply these two design approaches, and validations are performed on the resulting DES models. As will be discussed in Chapter V, the DES model may not be more accurate than the analytical model, so Chapter XII compares the prediction errors of the DES model built in this report and the analytical model of the F-path system. The penalty model derived from the top-down approach, however, fails to capture some important characteristics of the system, and is not able to scale the resources' utilization reasonably when combining the DES models of the two pipelines. Hence, in Chapter XIII, the model derived from the bottom-up approach is applied when combining the models. Chapter XIII introduces an upper bound, based on experience, when scaling down the resources' utilization to investigate the influence of possible contention between P-path and F-path. The processing time predicted by this combined model indicates the possibility of an implementation on the same platform. Chapter XIV concludes this report and proposes some interesting questions for future research.

III. SYSTEM DESCRIPTION
As said in the introduction, the datapath consists of one P-path and one F-path pipeline. The task graph of P-path is shown in Figure 3. At the beginning, a scheduler assigns each page of a PDF input to a client sequence, named DP client in this report. A DP client converts the page input into a bitmap, and currently, three DP clients can exist at the same time. Each task in the DP client shown in Figure 3 runs in a separate thread. Hence, tasks in the DP clients can be pipelined to increase the throughput. At the end of each DP client, a translated bitmap of one page is stored on a drive for F-path to retrieve.

Figure 3. High-level model of the current P-path application


Figure 4 shows the image processing steps of the F-path application. As seen in Figure 4, there is a task named 'Schedule and Dispatch' collecting data from the drive. The inputs it collects are the bitmaps processed by the P-path system, represented in bands. It then sends these bands to different handlers for processing. These parallelized threads perform the same operations on their respective bands, and forward the processed data to an output generator – the assembler. F-path processes one page band by band until all bands of the current page are done; then the bands of the next page can be processed. In the F-path application, the number of threads that can be generated is based on the available cores and hardware threads in the platform. The number of threads cannot exceed the number of processing units (number of cores × hardware threads per core). Bands have the same number of pixels, but because of the compression operation in the P-path application, the number of bytes of different bands differs. This results in varying processing times for the bands. Because of this characteristic, to increase the CPU utilization when the two pipelines are mapped onto the same platform, the cores free from processing F-path are used for P-path processing.

Figure 4. High-level model of the current F-path application

Figure 5 gives an example of how this system works. Assume that the platform has four single-threaded cores; yellow bars represent tasks from P-path and green bars tasks from F-path. From Figure 5, it can be seen that the system starts with P-path, and F-path processes the bands of the bitmaps generated by P-path. As soon as either of these two yellow bars has generated a bitmap for one page, F-path begins, and any running task from P-path is pre-empted. When at least one of the F-path tasks finishes processing the issued bands of the same page, the interrupted P-path task processing a next page continues. P-path can also start a new thread to make use of a free core. P-path can generate at most four tasks from the beginning in this example. In general, F-path processes one bitmap band by band, and as the bands' sizes within one page differ, some idle cores can be spared for processing the P-path tasks. This mechanism increases the CPU utilization, while causing contention in the cores, storage devices, and interconnect. Hence, the problem here is to determine the contention between P-path and F-path. As the contention in the cores is solved by setting priorities as described, the problem now turns to how to define and model the contention in the shared storage devices and interconnect. However, the influence of such contention is hard to determine when there is no accomplished system; it might have minor or major effects on the final performance. By simply adding the two predictive results from the models of each pipeline, the current combined model has difficulty capturing the influence of the contention on the performance. This is because the analytical model is derived from regression analysis, which establishes a relationship between selected variables; hence, its capability for prediction is limited to these dependent variables. In case the variables are not sufficient to represent some important changes of the environment, additional variables need to be considered and a new regression analysis needs to

be performed. The current analytical model does not consider the situation in which the two pipelines share the same resources. Apparently, this leads to some missing variables when establishing the regression equations. It is possible that the contention on the shared resources has only a minor influence on the performance, in which case the current combined model can truly predict the throughput of the combined datapath system, but there is no knowledge to support this assumption. Because of this uncertainty, a new analytical model with more dependent variables that represent the changes in the environment is desired. However, there are no input-output pairs available to perform the regression analysis for a new analytical model that considers the contention between the two pipelines. Moreover, a modification of the current P-path model also has to be carried out separately, as the contention affects both pipelines. A DES model with detailed representations of the shared resources is more flexible. When combining two DES models, the models of the resources are shared between the two pipelines, so no additional effort is needed to build new resource models for the P-path system. And the influences can be inspected by simply scaling the utilization of the modeled shared resources up and down.

Figure 5. An example of interaction between tasks from P-path and F-path

IV. PERFORMANCE PREDICTION: THE STATE OF THE ART
A variety of approaches has arisen to predict the time-related performance of applications running on a multi-core platform with co-runners. According to the time-criticality of the applications they serve, these approaches can be classified into two groups.
1) Approaches serving HRT applications. In these approaches, a tight upper bound on the WCRT of the AuE that shares the resources with its co-runners is given.
2) Approaches serving S/NRT applications. These approaches mainly focus on the average performance of AuEs. They also serve applications to be predicted in an early design phase, as usually no absolute accuracy needs to be promised in that early phase.
Approaches used to predict HRT applications are also applicable to S/NRT applications, but they predict the WCRT, which may not be necessary or useful for S/NRT applications.


Techniques applied in approaches coping with HRT applications can be classified as application-centric or platform-centric. Application-centric techniques require a detailed static analysis of the applications. [8] has analyzed and summarized the pros and cons of this kind of technique. A precise prediction of the performance can be guaranteed, but with a huge complexity of $\binom{k}{n}$, where k is the number of tasks¹ running on the platform and n is the number of cores assigned to execute these tasks. Such complexity increases the cost (time and money) of the prediction, which is not desirable in the early design phase in our case. A variation of the application-centric techniques is introduced in [9], providing a relatively simple and accurate prediction in the early design phase. In [9], the predicted latency $et_{muc}$ is composed of the estimated execution time of the application running without co-runners, $et_{solo}$, and the interference from its co-runners, $\Delta t$ (some other techniques, including platform-centric techniques like [6], also apply this model, but they differ in how $et_{solo}$ and $\Delta t$ are generated). The approach in [9] analyzes each AuE (which can be multithreaded), like other application-centric techniques, but does so individually to generate a profile containing the usage of the resources. The complexity is reduced, as k then only counts the tasks of a single application. The profile contains no distribution of the accesses to the shared resources over time, but only accumulated numbers; for instance, 50% of the accesses are sent 1 cycle after the previous one, and the other 50% 7 cycles after (example taken from [9]). From this profile, $\Delta t$ can be derived according to the scheduling algorithms. Although this approach reduces some complexity, it is not practical for our datapath application, as for each data input and each change in the number of threads to be used, a new profile is required.
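As a purely illustrative sketch of this composition (not the actual derivation in [9], which uses the full access profile and the platform's arbitration), the following toy function assumes that every conflicting access to a shared resource adds a fixed stall; all names and numbers are made up.

```python
# Toy sketch of et_muc = et_solo + delta_t. Illustrative only: the real delta_t in [9]
# is derived from the access profile and the scheduling of the shared resources.

def predict_latency(et_solo_cycles, aue_accesses, corunner_accesses, stall_per_conflict=10):
    # Pessimistic assumption: every AuE access that can collide with a co-runner
    # access is delayed by a fixed number of cycles.
    conflicts = min(aue_accesses, corunner_accesses)
    delta_t = conflicts * stall_per_conflict
    return et_solo_cycles + delta_t

# Example: 1,000,000 cycles alone, 20,000 shared accesses, co-runner issues 15,000.
print(predict_latency(1_000_000, 20_000, 15_000))  # 1_150_000 cycles
```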

Platform-centric techniques require the platform to be composable and predictable. Composability ensures that applications cannot influence each other, and predictability enables a tight upper-bound prediction for each individual application. To simplify the prediction of each application, most of the platform-centric techniques [3][4][5][6] regard the tasks that constitute each application as smaller applications. These tasks are scheduled and then predicted according to the platform's scheduling algorithm. As a result, the performance of each application is a superposition of the performance predictions of its tasks. [7] provides a framework that decouples the applications and the platform. In this way, applications can apply scheduling algorithms different from the one used by the platform, so developers can develop their own applications without the cooperation of the platform's supplier. Although these platform-centric techniques indeed predict the performance accurately, unfortunately, the platforms used in the datapath application guarantee neither composability nor predictability.

Applying a stressmark can also be used to predict the performance of S/NRT applications. A stressmark is an application that runs concurrently with the AuE and tries to maximize its accesses to the shared resources, so that the processing time of the AuE can be measured as the performance prediction. This approach usually gives a more pessimistic prediction than reality, as the maximum number of accesses usually does not occur. Despite this fact, the prediction result is not an upper bound on the WCRT, as the maximum accesses the stressmark generates may not actually correspond to the worst-case situation. As a result, it is too pessimistic for S/NRT applications but also not capable of predicting the WCRT. [10] builds a model of cache contention by applying a stressmark able to modify the bandwidth of accesses to the shared cache. A reuse distance histogram of the AuE is generated with a bandwidth-adjustable stressmark as its co-runner to predict the precise effective cache size of the AuE. The reuse distance histogram is a function of the number of retired

¹ A task is one unit of an application that runs in a single thread.

instructions, the number of LLC accesses, and the number of LLC misses. These values differ when the number of threads in the AuE varies, so this technique is also not applicable in our case.
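To make the reuse-distance notion concrete, the sketch below computes a reuse distance histogram from a small synthetic access trace. It only illustrates the statistic itself, not the bandwidth-controlled measurement setup of [10].

```python
from collections import Counter

def reuse_distance_histogram(trace):
    """Reuse distance of an access = number of distinct addresses touched since
    the previous access to the same address ('inf' for first-time accesses)."""
    last_pos = {}                      # address -> index of its previous access
    hist = Counter()
    for i, addr in enumerate(trace):
        if addr in last_pos:
            distance = len(set(trace[last_pos[addr] + 1:i]))
            hist[distance] += 1
        else:
            hist["inf"] += 1
        last_pos[addr] = i
    return hist

trace = ["A", "B", "C", "A", "B", "B", "D", "A"]
print(reuse_distance_histogram(trace))   # Counter({2: 3, 'inf': 4, 0: 1})
# Accesses with a reuse distance smaller than the cache space left to the AuE are
# likely hits; how this share shrinks under a stressmark reveals the effective size.
```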

The techniques surveyed above are not suitable for predicting the performance of this datapath S/NRT application. The regression analysis mentioned earlier and the discrete event simulation in [2][14] are two techniques considered able to build a predictive model for this project. Later in this report, a DES model is proposed for this streaming application. Once the model is built, processing time can be predicted without further time measurements when the environment changes. The model is not intended for general-purpose applications; it only serves this datapath application.

V. MODELING APPROACHES: ANALYTICAL MODEL VS. DES MODEL
At the moment, there is an analytical model that can predict the latency of the F-path system. It uses Amdahl's law [11] and regression analysis. In this regression analysis, the processing time of the F-path application is related to variables representing the inputs and the platform characteristics. These characteristics have to be quantified into variables. Inputs can simply be represented by their size, while it is harder to find a good representation of the characteristics of the platform. The CPU PassMark [12] score, derived from a performance benchmark, proved to be a better representation than some others, such as the CPU frequency. With these quantified variables, the resulting regression model shows an excellent prediction of the performance of the current F-path application. For most cases, the differences between the prediction and the real execution time are around 2%. However, for some other cases, the differences are approximately 6% to 8%. The error can come from three aspects: 1) not enough variables are included in the regression equations; 2) the characteristics of the platform are not represented properly; 3) the same holds for the input characteristics.

A DES model can address all three problems above if it is modeled properly. It can model all the states of the resources. And unlike regression analysis, not all characteristics to be modeled have to be represented by quantified variables. For example, the behavior of the multi-core processors on the platform, such as how the cores resolve contention, can be simulated separately, instead of using variables that summarize all the characteristics of the whole platform. This can definitely improve the accuracy of the predicted results, but it lowers the abstraction level of the model and increases its complexity greatly. It also costs much more time, energy, and money to create such a model than the analytical model. Such a trade-off is not worthwhile if at most an 8% accuracy improvement can ideally be achieved. From the discussion in the last paragraph, the likely reason for the relatively large errors is the simple quantification of the inputs. Unlike for the other two aspects, for which different attempts were made to derive reasonable representations, the inputs are only represented by the entire size sent to the F-path application. With such a representation, the analytical model by default assumes an equal workload distribution over the handlers shown in Figure 4. But actually, those handlers do not necessarily process the same amount of data, which results in different processing times. Most cases using the available test-sets distribute a rather equal workload to each handler. This is the reason that in most situations the estimated errors are around 2% compared to the measured time. The DES model for the F-path application is proposed mainly to solve this problem. It takes the size of the workload each handler needs to process into account, instead of the accumulated workload size, and predicts the elapsed time of each task. Although the DES model improves the representation of the inputs, it may not guarantee the same accuracy as the analytical model does. Depending on the abstraction level of the DES model, more error could be introduced if the DES model does not capture the platform correctly. To obtain a relatively accurate

prediction with an abstract and simple DES model, the bottlenecks of the F-path application are studied in Chapter XI, and only the resources where these bottlenecks occur are modeled in detail.
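The following sketch illustrates why the per-band workload matters. The band sizes are invented and the dispatcher is a simple greedy one, so it only shows that an equal-split estimate (as the analytical model implicitly assumes) and a makespan based on the actual band distribution can diverge when the bands are unbalanced.

```python
# Hypothetical single-core band times (ms) of one bitmap.
bands = [9.0, 8.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6]
n_handlers = 4

# Equal-split view: total work divided evenly over the handlers.
equal_split_estimate = sum(bands) / n_handlers

# Per-band view: dispatch each band to the handler that becomes free first;
# the bitmap is finished when the last handler finishes.
finish = [0.0] * n_handlers
for band in bands:
    idx = finish.index(min(finish))
    finish[idx] += band
per_band_estimate = max(finish)

print(equal_split_estimate)  # 5.675 ms
print(per_band_estimate)     # 9.0 ms: one long band dominates the makespan
```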

Another advantage of the DES model is its ability to present the detailed progress of an application with the assistance of a visualization tool. Further analysis can then be carried out to inspect the performance of an application. For instance, different scheduling algorithms can be investigated to look for the one giving a better performance. TRACE [13], a Gantt chart viewer tool, is coupled to the DES model to create the timing points of each task, so the entire processing of the application can be seen directly. Figure 6 shows an example of how TRACE works with the DES model and helps the analysis.

(a) tasks scheduled by SO with 12 cores

(b) tasks scheduled by LJF with 12 cores
Figure 6. Gantt chart generated from the DES model simulation

Each colored bar in Figure 6 is the elapsed time of a task as predicted by the DES model. The DES model only predicts the elapsed time of each task, not its start and end time. But with an auxiliary model (called a listener) embedded, the DES model can record the start and end time of each task by appending the elapsed time to the end time of the previous task that claims the same resource. Once it finishes the prediction of one task, it communicates with TRACE. TRACE arranges the start and end points of the received tasks on the timelines of the resources the tasks claim. It can be seen from the Gantt chart in Figure 6(a) that, although 12 cores execute the application, nine of the cores have to wait for more than 50% of the processing time because of the poor scheduling. So a different scheduling algorithm can be applied in the DES model to improve the results. Figure 6(b) is the resulting Gantt chart

after substituting the SO scheduling algorithm in the DES model with the LJF scheduling algorithm. Like SO, LJF is an off-line scheduling of tasks, but the tasks are sorted by their workload (execution time in this example) before they are distributed. The two green bars pointed at by the red arrows in the figures belong to the same task. In (a), the task can only be processed after the task indicated by the blue bar finishes, but in (b) it can be processed at the very beginning by a different core. Compared to the processing time shown in Figure 6(a), LJF reduces the processing time by more than half compared to SO (70 ms vs. 150 ms). Together with the advantage mentioned in Chapter II, this motivates developing a DES model of F-path in this project, from which a DES model of the entire datapath system can then be derived.
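The effect of the two schedules can be reproduced with a few lines of list scheduling. The task durations and core count below are invented and the dispatcher is a simplified greedy one, so this only mirrors the idea behind Figure 6, not the exact model.

```python
def makespan(durations, n_cores):
    """Greedy list schedule: each task goes to the core that becomes free first."""
    free_at = [0.0] * n_cores
    for d in durations:
        core = free_at.index(min(free_at))
        free_at[core] += d
    return max(free_at)

tasks = [2, 2, 3, 2, 2, 3, 15, 2]            # hypothetical execution times (ms)

so_order = tasks                              # Static Ordering: tasks in issue order
ljf_order = sorted(tasks, reverse=True)       # Longest Job First: longest tasks first

print(makespan(so_order, 3))   # 19 ms: the long task starts late and dominates the tail
print(makespan(ljf_order, 3))  # 15 ms: the long task starts first, short ones fill gaps
```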

VI. DEVELOPMENT PROCESS AND REQUIREMENTS
Two metrics are used to determine the value of a model: one is the model accuracy, the other one is its complexity. As discussed in the previous chapter, accuracy is generally gained at the cost of complexity, and bottlenecks are analyzed to balance the two. To assess the accuracy, the model should be validated and then calibrated. However, it is not possible to validate a model in the design phase when the targeted system has not yet been realized. Section 2 in [2] states that development with gradual iterative steps on an existing model, as shown in Figure 2, can establish a certain amount of trust, determined by the cost-to-validation. A model that can be easily validated has a low cost-to-validation; otherwise, it has a high cost-to-validation.

In this project, the single-threaded F-path application running on a multi-core platform is the existing system that should be modeled by DES and validated at the very beginning. According to the requirements of the project, this model is to be gradually enhanced to enable it to predict the performance of the system in various environments. For the existing F-path system, validation of this model can be done easily, and the measurements used for validation can also be used to calibrate the model to increase its accuracy. Adding parameters to the built model eases later modifications of the model. The requirements for the F-path model are to predict the processing time in environments where:

1. more than one core is applied to process the handler tasks in Figure 4
2. different bitmaps are imported to the application
3. the application runs on a different platform

The combined model can then be constructed from the validated P-path and F-path DES models, but it has a really high cost-to-validation, as there is no accomplished combined system yet. Section 2 in [2] claims that a certain amount of trust should be given to the predicted results from a model with a high cost-to-validation. This means that if this combined model indicates the throughput cannot be met, then the combined system should not be developed. If the predictions state that the throughput can be satisfied, then after the realization of the system the combined model can be further validated and calibrated, to be used for the next design iterations. Actually, the trust level given to this non-validated model comes from two aspects. One is the accuracy of the models constituting this model, which should be validated before combining them. The other aspect is a reasonable explanation of the influence when two systems share the same resources. This explanation is based on the understanding of the system, and mainly answers two questions: firstly, whether there is an influence on the performance, and if yes, how this happens; secondly, how much it affects the performance. Unfortunately, it is not possible to accurately quantify the influence without the existence of a system. Hence, the performance can only be inspected

by setting a possible scaling interval for the shared resources based on the developer's experience. In this report, an upper bound on the utilization of shared resources is set to adjust the influence on the processing time of each pipeline. This upper bound is derived from the observation and analysis of F-path in Chapter XI. With this variation of the influences, the answer to the question whether the throughput meets the constraints may not be just yes or no, but a probability distribution.

VII. TEST-SETS AND PLATFORMS FOR F-PATH
To validate the F-path DES model, the processing time of the F-path application on bitmaps should be measured. The current F-path application has the ability to record the start time and end time of the tasks shown in Figure 4, which makes the validation and calibration of the F-path model possible. Currently, two platforms are available to run the F-path application – an Intel Core i7-6700 and an Intel Xeon E5-2650. The i7-6700 has 4 cores in one CPU, while the Xeon E5-2650 has 8 cores in one CPU and two CPUs packaged. Both platforms have a hyper-threaded mode; in that mode, the number of hardware threads in every core is doubled. Hence, measurements are done on the i7-6700 from 1 core to 3 cores with and without hyper-threaded mode (1 hardware thread to 6 threads), and from 1 thread to 30 threads on the Xeon E5-2650. One core is assigned to deal with the master thread and other OS activities. To ensure the accuracy of the measured data (without the interference of other activities), not all available cores are assigned to the bands' processing task.

The architecture of the current platforms helps to derive the possible contention in the system. For instance, if the L2 cache is shared between cores, then contention happens when cores ask for accesses to that cache at the same time. Figure 7 provides the block diagram of the NGMP, which reflects the overall architecture of the current platforms. Any platform explored in the future can be assumed to follow this NGMP architecture. From this figure, it can be seen that on any NGMP-like platform, the contention between cores concentrates on the interconnect and the shared LLC.

Figure 7. Overview Architecture of the targeted platforms

Figure 8, derived from the descriptions in [16] and [17], provides insight into the connections between the cores and the LLC by a ring bus, which is the interconnect applied in both the i7-6700 and the Xeon E5-2650 mentioned above. Figure 8(a) shows how the ring bus connects the resources in a four-core

processor (i7-6700). It can be seen that there are red blocks located on the ring bus, which are the stops connecting each resource. Information transfer can be pipelined with the help of these stops. For instance, while core 0 is talking to core 1, core 2 can at the same time transfer data to the LLC over the ring bus. Figure 8(a) also shows that there are four stops to each LLC slice, which increases the number of pipeline stages and helps to increase the throughput. To make the processors scalable, when there are more than four cores in a processor, the cores are arranged as in Figure 8(b), which takes an 8-core CPU (Xeon E5-2650) as the example. It can be seen that there are still 8 stops on the ring bus helping to transfer information between cores and LLC. Each core in this architecture has a direct connection to one LLC slice through the same stop located on the bus, so it should be slightly quicker when a core talks to its own slice than to slices belonging to other cores. Such an understanding of the processors helps to model the specifications of the interconnect in the DES model.
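As an illustration of how this ring-bus insight could be encoded in the interconnect model, the sketch below charges a latency that grows with the number of ring stops between a core and an LLC slice. The hop counts and cycle costs are invented, not Intel's actual figures.

```python
def ring_hops(src_stop, dst_stop, n_stops):
    """Hops on a bidirectional ring: take the shorter of the two directions."""
    clockwise = (dst_stop - src_stop) % n_stops
    return min(clockwise, n_stops - clockwise)

def llc_access_latency(core, slice_idx, n_stops=8, base_cycles=30, cycles_per_hop=2):
    """Core i and LLC slice i share the same ring stop, so accessing the local slice
    adds no hops, while remote slices add a few cycles per hop travelled."""
    return base_cycles + cycles_per_hop * ring_hops(core, slice_idx, n_stops)

print(llc_access_latency(0, 0))   # local slice: 30 cycles
print(llc_access_latency(0, 4))   # farthest slice on an 8-stop ring: 38 cycles
```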


Figure 8. Pipelined bi-directional ring bus connecting four cores (i7-6700) and eight cores (Xeon E5-2650)

There are five test-sets used by the software team to test the F-path application, and they will also be applied to validate the DES model. The five test-sets are listed as set-B, set-G, set-M, set-S, and set-W. These test-sets contain a number of bitmaps obtained after performing the P-path application. As the handlers in the F-path application can run in parallel to increase the throughput, these bitmaps are divided into bands and dispatched to different threads. In our experiments, each bitmap is divided into 23 bands, which is used as the reference. Other divisions are also possible, but are not presented in this report. Further research could address predicting the processing time when bitmaps are divided into a different number of bands.

VIII. INTRODUCTION ON THE DES MODEL COMPONENTS
In this project, the F-path DES model to be developed follows a framework in which both a runtime library and abstract models for tasks and platforms are provided. Figure 9 shows this framework, taken from Fig. 3 in [14]. There are two layers in this framework: one layer is untimed, where an application and its platform are modeled; the other layer is timed, where a series of states helps to determine the progress

of the application. In the untimed layer, the application block contains the task graph (also known as the data flow) of the F-path shown in Figure 4. The drive is simulated by a task that generates the bands' loads of each bitmap. Each task has a state engine which is driven by the task dynamics, and knows the resources to which it is mapped. The platform block provides the specification of resources, like the speed of the CPU or the size of the memory. When the simulation of this model starts, the task dynamics checks which tasks are ready for execution, and allows them to be issued to the corresponding resources. The internal scheduling algorithm embedded in each resource of the platform block decides the resource utilization each ready task can claim. For instance, a task obtaining full utilization of one core can be processed at the maximum speed of the CPU, while a task getting zero utilization cannot be processed yet. The resource dynamics block determines the resources' states at any time, such as the available space and speed, in order to translate the model's abstract time into real-time latency. To be more specific, the untimed layer in Figure 9 is static and has no relationship with the global time clock. Any modeled task and resource in this layer only uses abstract units. For example, the speed of a computational resource – the multi-core CPU – is represented in abstract_time_unit/load_unit instead of a global time unit, such as seconds per load_unit. As this model simulates the true latency, the resource dynamics translates such model abstract time into global time. Hence, the latency to process each band derived in this model is generated by

workload (in load_unit)
× abstract time / workload (in abstract_time_unit / load_unit)
× global time / abstract time (in seconds / abstract_time_unit)          Formula (1)

In this formula, the scheduling algorithm determines the abstract time / workload factor of each issued task, and the resource dynamics determines the global time / abstract time factor of the resource.

Figure 9. DES model framework
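The sketch below shows Formula (1) as it could look inside the resource dynamics; the function and parameter names are placeholders, not the actual framework API of [14]. The penalty factor anticipates the scaling introduced in the next paragraphs.

```python
def band_latency(workload_load_unit,
                 abstract_time_per_load=1.0,    # determined by the scheduling algorithm
                 global_time_per_abstract=1.0,  # determined by the resource dynamics
                 penalty_factor=1.0):           # scaling added by the penalty model
    """Formula (1): workload x (abstract time / workload) x (global time / abstract time).
    A penalty factor > 1 makes the resource effectively slower in the modelled environment."""
    return (workload_load_unit
            * abstract_time_per_load
            * global_time_per_abstract
            * penalty_factor)

# Raw F-path model: both conversion factors are 1, so the predicted latency simply
# equals the measured single-thread band time imported as workload.
print(band_latency(4.2))                        # 4.2
print(band_latency(4.2, penalty_factor=1.15))   # about 4.83: band slowed by 15% contention
```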


The explosion block shown in Figure 9 is the introduced penalty model, which refines the resource dynamics described above. The current resource dynamics only tracks and adjusts the CPU frequency with varying voltage, as DVFS is applied in the CPU, and only two internal states can be reached in this resource dynamics: one with high CPU frequency and high energy consumption, the other with low frequency and low energy consumption. Although such an automaton may be coarse, DVFS is not what the datapath system focuses on. As described in Chapter VI, the model should predict the time to process different bitmaps in different environments. Changes in voltage, however, are not among the requirements, and they are not the cause of the processing time variation in the environments under estimation. So a penalty model is derived and added to the resource dynamics to capture the possible contention and other penalties occurring in this datapath system, by scaling the global time (in seconds) / abstract time (in abstract_time_unit) factor in Formula (1). Although it is called a penalty model, according to requirement 3 in Chapter VI it should also assist the resource dynamics in translating the properties of the modeled resources, e.g., the maximum speed of the multi-core CPU in the platform block.

In this report, two design approaches – top-down and bottom-up – are used to generate this penalty model. In the top-down approach, an overview of the penalty model is investigated. As the resulting penalty model is realized by a lookup table, it is called the static penalty model. In the bottom-up approach, the detailed factors that affect the system are simulated; by integrating these influences, a penalty model, called the dynamic penalty model, can be built to scale the resource utilization properly. The static penalty model is easier to realize than the dynamic penalty model, as all the factors can remain hidden and do not necessarily need to be analyzed. However, since the static penalty model directly relates each output (the measured processing time here) to its corresponding input, the same measurement process has to be performed for each different environment to rebuild that model. This increases the time cost of development. And it is not possible to build such a model when there is no existing system, as no measurements can be done to fill the lookup table. The dynamic penalty model is more flexible towards all kinds of environments: once it is built based on limited data, no more measurements need to be done. It is also a better option when there is no existing system to validate the model against. The weak point of this choice is the complexity of analyzing the influence of each possible factor, which makes its construction difficult. In this report, both penalty models for the F-path application are built, but for the later combined datapath modeling, the dynamic penalty model is preferred due to its reusability.
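A minimal structural sketch of the two resulting penalty models (class names, environment keys, and factor values are invented for illustration): the static model is essentially a lookup table keyed by the environment, while the dynamic model composes its factor from modelled effects and can therefore extrapolate to environments that were never measured.

```python
class StaticPenaltyModel:
    """Top-down: one measured scale factor per environment (lookup table)."""
    def __init__(self, table):
        self.table = table                      # (n_cores, hyper_threading) -> factor
    def factor(self, n_cores, ht):
        return self.table[(n_cores, ht)]        # fails for environments never measured

class DynamicPenaltyModel:
    """Bottom-up: the factor is composed from modelled effects (illustrative numbers)."""
    def __init__(self, llc_penalty_per_core=0.01, ht_penalty=0.12):
        self.llc_penalty_per_core = llc_penalty_per_core
        self.ht_penalty = ht_penalty
    def factor(self, n_cores, ht):
        f = 1.0 + self.llc_penalty_per_core * (n_cores - 1)  # contention grows with cores
        if ht:
            f *= 1.0 + self.ht_penalty                       # two threads sharing one core
        return f

static = StaticPenaltyModel({(1, False): 1.00, (8, False): 1.09, (8, True): 1.22})
dynamic = DynamicPenaltyModel()
print(static.factor(8, True))    # 1.22, but only environments present in the table work
print(dynamic.factor(12, True))  # extrapolates to an unmeasured 12-core HT setup
```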

IX. THE F-PATH MODEL WITHOUT A PENALTY MODEL
As described in Chapter VI, development in small iterative steps making use of a validated model helps to establish a certain accuracy. Hence, the first step leading to the desired F-path DES model is creating a DES model which simulates the single-threaded F-path application running on one platform, the Xeon E5-2650. It follows a simplified V-model development process:

1. Analyze the requirements. For this model, this means estimating the latency of the F-path system running in a single thread.
2. Design a DES model and implement it. As the framework of the DES model (Figure 9) is given, in this phase only the task graph of the F-path system has to be modeled. The scheduling algorithm embedded in the single processing resource of the framework (the model of the multi-core CPU) is an abstract piece-wise linear relationship to the number of tasks. That is, as the number of tasks issued to the same core in the processor grows, the utilization of this core spared to each task decreases


linearly. This is not what happens in the F-path system: when there are multiple tasks asking for the same core concurrently, only one task gets access to this core, and the other tasks have to wait until it finishes. Hence, the scheduling algorithm inside the provided multi-core CPU model should be replaced (a small sketch of the difference follows this list).
3. Validate and calibrate this DES model. By measuring the processing time of the F-path system for the available bitmaps, it is easy to compare the predicted latency to the measured latency. The model can be calibrated if large errors occur between these values.
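The sketch below contrasts the two core-sharing policies mentioned in step 2 for the simplest case of tasks started together on one core; it is a stand-alone illustration, not the framework code of [14].

```python
def finish_times_shared(durations):
    """Abstract piece-wise linear sharing: n unfinished tasks each get 1/n of the core."""
    remaining = list(durations)
    finished = [None] * len(durations)
    t = 0.0
    while any(f is None for f in finished):
        active = [i for i, f in enumerate(finished) if f is None]
        share = 1.0 / len(active)
        step = min(remaining[i] for i in active) / share   # time until the next completion
        t += step
        for i in active:
            remaining[i] -= step * share
            if remaining[i] <= 1e-9:
                finished[i] = t
    return finished

def finish_times_exclusive(durations):
    """F-path behaviour: one task owns the core; the others wait in FCFS order."""
    t, finished = 0.0, []
    for d in durations:
        t += d
        finished.append(t)
    return finished

print(finish_times_shared([4.0, 4.0]))     # [8.0, 8.0]: both tasks crawl along together
print(finish_times_exclusive([4.0, 4.0]))  # [4.0, 8.0]: the first task is not slowed down
```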

From Formula (1), three separate factors need to be determined to estimate the latency of each task. However, it is hard to quantify the workload and the resource utilization in both abstract and concrete units. Hence, when implementing the model, the measured latency of each band acts as the input workload to this model, with the same quantified value but expressed in load_unit. The other two factors are simplified to 1 abstract_time_unit/load_unit and 1 millisecond/abstract_time_unit. So only one corresponding parameter is included in the DES model – the task load; the other two speed-related properties are set to the constant value one. While the penalty model mainly focuses on scaling the resource capacity to simulate the influences when the environment changes, it is not necessary here, as this model only predicts the performance in one single-threaded environment; all penalties in this environment are already included in the measured bands' processing times. On the other hand, from this derivation it can also be seen that the contention/penalties when changing the environment are hidden inside the bands' processing times. Therefore, even without a penalty model, this model can be validated in a different environment, as long as the workload is imported correctly; it is therefore called the raw F-path model. In this way, however, the model predicts the entire processing time of the bitmaps only with the help of measurements of the bands' processing times. This is not desirable for a predictive model, as too many measurements have to be performed to obtain the correct workload, and the model cannot predict anything when the bands' processing times are unknown. As a result, this model is built mainly to check the general structure of the F-path DES model, i.e., whether by enhancing this model (adding a penalty model) the ability to predict the processing time of bitmaps consisting of bands with known processing times can be achieved.

As discussed above, step 3 now aims to validate the structure of this general DES model. To enable such validation, the bands' processing times of the five test-sets are measured on the Xeon platform, for different numbers of cores applied to process the bands, and used as workload input. Two parameters are added, indicating the number of cores used and whether the processor is in hyper-threaded mode or not. But before importing the measured times, it is necessary to check whether there is a large deviation between different runs within the same environment. If there is a large deviation, it is not possible to summarize the performance of the F-path application in one model. To determine this, the bands' processing times running on a single core, with and without hyper-threading, on the Xeon are investigated in depth.

When all the bands are processed in sequence by one single thread, this core is fully utilized by the task that processes the bands. No contention happens between cores on shared resources (with regard to the F-path application). Hence, processing time variation can only be caused by the cache misses and replacements due to this core's own operations and by bad speculation, which are influenced significantly by the application. Measurements are carried out four times to compare the bands' processing times among different runs in this single-thread situation. Table 1 shows the number of bands with different time differences compared to the first test. It can be seen that between the four measured results, at most 1%

of the total of 10465 bands have a time deviation larger than 4%, which is quite minor considering the uncertainty of the speculation and the cache behavior. The same research is done when using one hyper-threaded core, and the results are shown in Table 2. It can be seen that the measured times show more deviation between runs than the results in Table 1. This is reasonable because in the hyper-threaded mode, two hardware threads are available in the same core, so bands can be processed in two parallel threads, and these two threads share the same memory hierarchy. Contention can happen between these two threads at any memory level and on the bus used for transportation. This explains the larger variation in the bands' processing times than in the no hyper-threaded mode. Despite this effect, most of the processing times of different runs are quite similar, as indicated in Table 2. Only around 3% of the bands show differences greater than 5%, and only around 1% of the bands show differences over 8%. No further research is performed on other environments, but a glance at the measured bands' processing times of different runs in those environments shows that the processing time of the same band deviates little between runs. Hence, it can be concluded that within the same environment the performance of the F-path application is roughly the same, and any run of the application provides representative processing times in both hyper-threaded and no hyper-threaded mode.

Table 1. The number of Bands with Time Differences within one single thread

diff.   test 2 vs. test 1   test 3 vs. test 1   test 4 vs. test 1
<=2%    9279 (88.67%)       9370 (89.54%)       9365 (89.49%)
<=3%    10101 (96.52%)      10145 (96.94%)      10159 (97.08%)
<=4%    10386 (99.25%)      10378 (99.17%)      10422 (99.59%)
<=5%    10440 (99.76%)      10428 (99.65%)      10434 (99.70%)
<=6%    10452 (99.88%)      10431 (99.68%)      10437 (99.73%)
<=7%    10456 (99.91%)      10435 (99.71%)      10442 (99.78%)
(cells: number of bands whose processing time differs from test 1 by at most the given percentage)

Table 2. The number of Bands with Time Differences within one hyper-threaded core

diff.   test 2 vs. test 1   test 3 vs. test 1   test 4 vs. test 1
<=2%    8987 (85.88%)       8045 (76.88%)       7926 (75.74%)
<=3%    9769 (93.35%)       9254 (88.43%)       9114 (87.09%)
<=4%    10138 (96.88%)      9865 (94.27%)       9801 (93.66%)
<=5%    10276 (98.19%)      10155 (97.04%)      10147 (96.96%)
<=6%    10332 (98.73%)      10291 (98.34%)      10282 (98.25%)
<=7%    10352 (98.92%)      10343 (98.83%)      10322 (98.63%)
<=8%    10365 (99.04%)      10362 (99.02%)      10342 (98.82%)
<=9%    10383 (99.22%)      10373 (99.12%)      10356 (98.82%)
<=10%   10395 (99.33%)      10391 (99.29%)      10371 (99.10%)
<=11%   10406 (99.44%)      10402 (99.40%)      10385 (99.24%)
(cells: number of bands whose processing time differs from test 1 by at most the given percentage)
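Tables 1 and 2 can be produced with a straightforward counting script; the sketch below assumes the per-band times of a reference run and a comparison run are available as equal-length lists (hypothetical values shown).

```python
def deviation_counts(reference_ms, other_ms, thresholds=(0.02, 0.03, 0.04, 0.05, 0.06, 0.07)):
    """Cumulative counts of bands whose relative time difference to the reference
    run stays within each threshold, as in Tables 1 and 2."""
    rel_diff = [abs(o - r) / r for r, o in zip(reference_ms, other_ms)]
    total = len(rel_diff)
    result = {}
    for t in thresholds:
        within = sum(d <= t for d in rel_diff)
        result[t] = (within, 100.0 * within / total)
    return result

# Hypothetical band times (ms) of test 1 (reference) and test 2 for the same bands.
test1 = [4.1, 3.9, 5.2, 4.8, 6.0]
test2 = [4.2, 3.9, 5.3, 4.9, 6.5]

for threshold, (count, pct) in deviation_counts(test1, test2).items():
    print(f"<={threshold:.0%}: {count} bands ({pct:.2f}%)")
```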

In addition to the time to process the bands by the handler tasks, the entire latency consists of the pre-processing time of the schedule&dispatch task and the post-processing time of the assembler task for each bitmap, shown in Figure 4. Figure 10 shows the average pre-processing time per bitmap and its

deviation range when applying a different number of cores on the Xeon. From this figure, it can be seen that the time spent on schedule&dispatch is nearly the same when applying a different number of cores. This makes sense, as it is done at the very beginning by the master thread, which cannot be affected by tasks executed by its forked threads, since these have not yet been created at that time. This pre-processing time is only affected by the capacity of the platform it runs on, such as the speed of the CPU. So a constant value of 5.1 load_unit is applied as the workload for the schedule&dispatch task in any environment on the same platform. The assembler task in Figure 4 actually consists of tasks that configure the connected printer and correct the firing patterns injected into the printer. The configuration task contains activities such as setting the paper size and ink color, which are properties set when printing files. The time spent on configuration should hence be similar no matter how many cores are used in the application, while the time to correct the firing patterns depends on the complexity of each input, which is hard to represent. Besides, the time spent on the assembler task, including configuration and correction, is usually much less than the time spent on the bands themselves. Considering the complexity of building a predictive model of the correction task and the small gain in accuracy for the entire F-path application, it is not worthwhile to build a predictive model for the post-processing time. In the DES model, it is represented as 1 load_unit for each bitmap. As a result, the times predicted by the following DES models cover the period from the schedule&dispatch task to the latest finish time of the handlers shown in Figure 4.


Figure 10. Average pre-processing time and its deviation range for different numbers of cores used, with no hyper-threading (left) and with hyper-threading (right) applied

Figures 11 and 12 show the average error compared to the measured F-path application running time over all five test-sets, in no hyper-threaded and hyper-threaded mode respectively. The middle blue lines in Figures 11 and 12 represent the average differences; the dots below and above these lines indicate the standard deviation range. It can be seen that although the deviation range is relatively wide at some points, this raw F-path model still proves its accuracy with a quite minor average error (0.71% in no hyper-threaded mode and 0.62% in hyper-threaded mode). To be more convincing: the worst-case prediction error is 2.6% in no hyper-threaded mode when using 14 cores, and 1.37% in HT mode when 7 cores are applied, which is quite accurate for the early design phase. Hence, this generic model is validated and does not need further calibration.


To further enable the ability to predict, a penalty model is added to the resource dynamics to scale the utilization of the computation resources (the millisecond/abstract_time_unit factor) in the platform. As discussed, the accuracy of this generic model depends on the accuracy of the imported workload. From Formula (1), changes in the workload can be reflected by scaling the resource utilization, either the abstract or the concrete one. So actually, the purpose of scaling this utilization is to predict the workload in different environments without measuring the bands' processing times in each environment.


Figure 11. Average error of the raw F-path model compared to the measurement time in no HT mode


Figure 12. Average error of the raw F-path model compared to the measurement time in HT mode


X. F-PATH DES MODEL WITH THE STATIC PENALTY MODEL
As indicated in the last chapter, the penalty model used to scale the resource utilization reflects the changes in the workload. As seen in Figure 13, the bands' processing times differ greatly when applying a different number of cores. It is not possible to get an accurate prediction when only the bands' processing times in one single-thread environment are measured. In the following chapters, the bands' processing times in this single-thread environment are used as the baseline input. With this input and the built penalty model, the DES model of F-path is expected to predict the processing time of bitmaps in the various environments mentioned in Chapter VI.

A top-down design approach is used in this chapter to obtain a static penalty model. In this design approach, the static penalty model acts as a black box: its input is the change in the model parameters representing the environment, and its output is the scale factor for the concrete speed of the computational resources. As the scale factor is derived from (band's processing time in a different environment) / (band's processing time in the baseline), and the bands' processing times can be seen as the workloads (same value, in load_unit) of the tasks, the static penalty model can be simplified to a black box with the baseline as its input and the scaled band's workload in a different environment as its output. The resulting scale factor for each environment is used to scale the resource utilization, either in the abstract or in the concrete unit.
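A minimal sketch of how such a scale factor could be derived and applied (the band times below are invented; as described next, the actual model averages over all bands of the five test-sets):

```python
# Measured band times (ms): single-core baseline vs. some target environment.
baseline    = {"band_0": 4.0, "band_1": 6.0, "band_2": 2.5}
environment = {"band_0": 4.4, "band_1": 6.9, "band_2": 2.7}   # e.g. 8 cores, no HT

# Per-band scale factors: time in the environment / time in the baseline.
factors = {b: environment[b] / baseline[b] for b in baseline}

# Static penalty model entry for this environment: one averaged factor.
avg_factor = sum(factors.values()) / len(factors)

# Prediction for a band that was only measured in the baseline environment.
predicted = baseline["band_2"] * avg_factor

print({b: round(f, 3) for b, f in factors.items()})  # {'band_0': 1.1, 'band_1': 1.15, 'band_2': 1.08}
print(round(avg_factor, 3), round(predicted, 3))     # 1.11 2.775
```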

The simplest way to obtain this penalty model is applying a lookup table, by analyzing the growth trend of the different bands' processing times. The growth trend here is the relationship between the processing time of the same band and the increasing number of cores applied on the Xeon E5-2650. As there is a large amount of measured data (there are 10,465 bands in the 5 test-sets and 15 usable cores in the Xeon E5-2650, with and without the hyper-threaded option, giving more than 300,000 points in total), it is hard and a waste of time to analyze the time increase trend of each band individually. Averaging the target data helps to simplify the analysis of the growth trend.

In Figure 13, each dot on the blue lines represents the average difference of all the bands in the test-sets with respect to the single-core baseline. The dots below and above the blue lines mark the corresponding standard deviation range. It can be seen that, in general, the time to process bands increases when more cores are applied. This is as expected, because some resources, like the LLC, have a constant capacity and are shared between the applied cores, and replacements in the LLC can become more frequent as the number of cores used increases. Figure 13 also shows that the increase has deviations that cannot be neglected, especially when using 15 cores in no hyper-threaded mode (marked by the red ellipse). This means that the average increase cannot represent the trend of all the bands' processing time. Consider the example given in Figure 5. Between timelines C and D, four bands (green bars) are processed in parallel. If the cores processing these bands request access to the shared bus or LLC at the same time, these shared resources have to schedule the requests, so some requests are handled earlier than others. Besides, a miss in the LLC means that the required data has been evicted by other data, so an LLC miss for one core can coincide with an LLC hit for another core. Moreover, as seen in Figure 5, around half of the band processed by core No. 3 overlaps with other bands, so within the same band the processing time may increase for only part of the band while the rest keeps the same processing time. All these factors lead to the deviation shown in Figure 13. However, the deviation range at 15 cores shown in Figure 13 (left) is much wider than the deviation range at other numbers of cores. It is observed that every bitmap contains one band whose processing time has increased by more than 75% compared to the baseline. The standard deviation at this point decreases to 2.75% when this band of each bitmap is excluded. Despite these deviations, the average increase with respect to the single core is still filled into a lookup table to model the penalty as a start. This is because longer and shorter predicted bands' processing times may compensate each other and thus result in an accurate prediction. Figure 14 provides an example of how this works.
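Before turning to that example, a minimal sketch of such a lookup-table penalty model may help; the table values and function names below are placeholders, not the measured averages of Figure 13, and only the no hyper-threaded case is shown.

```python
# Hypothetical static penalty model: a lookup table mapping the number of
# cores used to the average increase factor of a band's processing time
# relative to the 1-core baseline (placeholder values).
AVG_INCREASE_NO_HT = {1: 0.00, 2: 0.01, 3: 0.05, 4: 0.07, 6: 0.13}

def scale_band_time(baseline_ms: float, n_cores: int) -> float:
    """Scale a band's baseline (1-core) processing time to the target
    environment using the averaged increase factor from the lookup table."""
    return baseline_ms * (1.0 + AVG_INCREASE_NO_HT[n_cores])

# A bitmap is predicted band by band; the scaled times are then fed to the
# DES model, which schedules them over the available cores.
predicted = [scale_band_time(t, 6) for t in (12.0, 8.5, 10.2)]
```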

[Figure 13: average difference (%) vs. number of cores used (1–15); left: no HT, right: HT]
Figure 13. Average error compared to single-core bands' processing time in no HT (left) and HT (right)

[Figure 14: panels (a), (b), (c)]
Figure 14. An example of how the average increase factor compensates the processing time

Figure 14 (a) shows a bitmap containing 4 bands of different workloads, with one core used to process this bitmap as the baseline. Figure 14 (b) shows the imaginary result when 2 cores are applied to process it. The bands from left to right in Figure 14 (a) then have increments of 10%, 12%, 13% and 11% respectively. By averaging these increase factors, each band gets an estimated increment of 11.5%. Figure 14 (c) shows the processing time predicted by applying this increase factor to all four bands. The estimated processing time of this bitmap on a two-core environment is 5.575 units of time, while the true processing time indicated in Figure 14 (b) is 5.62 units of time; the error is only 0.8% in this example. The key to this approach is that most of the increase factors of the bands in the bitmaps lie within the standard deviation range, and the increments are approximately proportional to the workloads. These two prerequisites hold, based on the measured processing times, in most environments with different numbers of cores (the full data set is too large to show here). And even if there are extreme values inside the set of increase factors, as long as they do not influence the average value of the set much, the predicted time should still be quite accurate, which is the case for the 15-core environment in no hyper-threaded mode.
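The compensation effect of the Figure 14 example can be reproduced in a few lines of code; the individual baseline band times below are assumed (they are not given in the text) and are chosen only so that the quoted totals of 5.62 and 5.575 come out.

```python
# Hypothetical baseline band times (units of time); chosen so the totals
# match the quoted Figure 14 values.
baseline = [0.5, 0.5, 3.5, 0.5]
increments = [0.10, 0.12, 0.13, 0.11]          # per-band increase on 2 cores

true_total = sum(b * (1 + i) for b, i in zip(baseline, increments))   # 5.62
avg_factor = sum(increments) / len(increments)                        # 0.115
estimated_total = sum(baseline) * (1 + avg_factor)                    # 5.575

rel_error = abs(true_total - estimated_total) / true_total            # ~0.8%
print(f"true={true_total:.3f}, estimated={estimated_total:.3f}, "
      f"error={rel_error:.2%}")
```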

A. Predict the processing time of bitmaps with increasing number of cores

The charts in Figure 15 and Figure 16 indicate the average differences between the processing time of F-path predicted using the static penalty model derived from Figure 13 and the measured processing time. It can be seen in Figure 15 that the average errors lie between 0.54% and 3.03%, with the largest error on test set-B when 11 cores are used. In the hyper-threaded mode, as seen in Figure 16, the average differences vary between 0.26% and 1.63%, with the largest individual error being 3.36%, on test set-W when 6 cores are applied. From these results, it can be claimed that using the average differences shown in Figure 13 accurately predicts the processing time of the test-sets when a different number of cores is used on the Xeon platform.

[Figure 15: average differences of the whole test sets, difference (%) vs. number of cores used]
Figure 15. Average differences compared to the measured processing time in no HT mode


[Figure 16: average differences of the whole test sets, differences (%) vs. number of cores used]
Figure 16. Average differences compared to the measured processing time in HT mode

B. Predict the processing time of different bitmaps

The static penalty model above, built as a lookup table, is derived by studying the average increment of all five test-sets with an increasing number of cores used. It cannot guarantee the prediction accuracy when used for bitmaps that are not in the test-sets, so it is natural to think that not all bitmaps' processing times can be predicted with the above penalty model. However, previous work² has declared that the available test-sets are representative of most bitmaps. So if the model can accurately estimate the processing time of the bitmaps in the test-sets, it can be expected to be accurate for most bitmaps outside the test-sets as well.

C. Predict the performance on different platforms

In the static penalty model implemented as a lookup table, the processing time of each band is obtained by scaling the workload directly. From the measured processing time it is difficult to derive any further details of the system; this overview is the highest and only layer in the top-down approach. To translate the predicted latency from the Xeon E5-2650 to other platforms, it is necessary to scale the scale factors in the lookup table, for instance using the ratio between the base frequencies of the Xeon and the i7 to scale all the scale factors in the lookup table derived above. This factor can be set to any reasonable value, as long as it is deduced from available information, like frequency or PASSMARK score, and results in an accurate prediction. Besides, the same increasing-trend pattern should be observed on the other platform as on the Xeon; otherwise it is not possible to get accurate results with only one scale factor. The processing time of F-path on the i7-6700 is hence measured to find such a factor. Figure 17 compares the increase factors with respect to the 1-core baseline environment for the i7 (blue bars) and the Xeon (orange bars). It can be seen that the increase pattern on the i7 is totally different from the increase pattern on the Xeon when applying two and three cores to run the F-path application. As discussed, the accuracy of the model depends on the accuracy of the predicted workload. This indicates that for each mode (hyper-threaded and no hyper-threaded) and for each number of cores, a separate factor would be needed to translate the scale factors in the lookup table generated for the Xeon into a new table for the i7 in order to keep the prediction accurate. There is thus no single factor that scales the entire lookup table, and these per-environment factors cannot be derived from any available platform information. So there is no guarantee that the model can predict the performance of other platforms, as the scale factors are too specific to their own environment (same platform, same number of cores applied). Hence, a lookup table with scale factors is not able to make an accurate prediction of the performance of a non-existing system, which is one of the requirements. This is the reason for building a dynamic penalty model. With the detailed analysis of the bottom-up approach, more accurate results may be derived. This approach is discussed in the next chapter.

² All five test-sets have been used in multiple projects, and these test-sets contain best-, worst- and average-case bitmaps.
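A sketch of this (ultimately insufficient) single-factor translation is given below; the factor value, the table entries and the function name are placeholders, not measured data.

```python
# Hypothetical single-factor translation of the Xeon lookup table to another
# platform. 'factor' would be deduced from available platform information
# (e.g. the ratio of base frequencies or PASSMARK scores); the value used
# here is a placeholder.
def translate_lookup_table(xeon_increase: dict[int, float],
                           factor: float) -> dict[int, float]:
    """Scale every per-core-count increase factor by one platform factor."""
    return {cores: inc * factor for cores, inc in xeon_increase.items()}

i7_estimate = translate_lookup_table({2: 0.01, 3: 0.05, 6: 0.13}, factor=0.8)
# Figure 17 shows, however, that the i7 increase pattern differs per mode and
# per core count, so no single factor reproduces the measured i7 behaviour.
```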

[Figure 17: bar chart of average difference (%) per mode and core count for the i7-6700 and the Xeon E5-2650]
Figure 17. Average differences of the bands' processing time when two and three cores are applied, compared to the 1-core baseline environment, on the i7-6700 and the Xeon E5-2650

XI. F-PATH DES MODEL WITH THE DYNAMIC PENALTY MODEL

Two sections are included in this chapter. In Section A, the bottlenecks that occur when applying more than one core to process the application are analyzed in both hyper-threaded and no hyper-threaded mode on the Xeon platform. As the F-path application only records the processing time of bitmaps and of each band inside the bitmaps, it is difficult to inspect the bottlenecks that cause the increase shown in Figure 13. Therefore VTune is used. VTune is a tool that monitors the performance of any application by means of a number of hardware events, such as a counter that records the clock cycles of all LLC misses/hits. So from VTune we can get a deeper insight into the application, beyond just the processing time. The processing time measured when running the F-path application via VTune is not consistent with the processing time measured by just running the application, because it also includes the time to record the hardware events inserted by VTune. Despite this overhead, VTune has the ability to filter activities (threads, functions, call stacks). Hence, the tasks that process bands – the handlers – can be isolated from other tasks or unrelated activities to obtain the hardware events of the handlers. The analysis of the bottlenecks can then be performed based on the collected hardware events instead of just the bands' processing time. Besides, it should also be verified that there is not much variation between the clock cycles of different runs. As described in Chapter IX, the bands' processing time between different runs should be quite similar. While VTune does not record the actual processing time, the processing time can be represented by clock cycles. Although in general the processing time can be calculated as (clock cycles)/(processor frequency), Intel® uses a technology called Turbo Boost that accelerates the frequency of the processor, so the processing time usually cannot be derived from the processor's base frequency. Hence, if the clock cycles from different runs are similar, it is usually safe to state that the processing times via VTune are also similar between runs, and the hardware events collected from VTune can be analyzed. In Section B, the penalty model is realized with the bottom-up approach. This penalty model is then embedded into the basic DES model derived in Chapter IX, and the DES model with the penalty model is validated. In the bottom-up approach, the penalty model is designed in several layers. Sub-models in the lowest layer are derived from the bottleneck analysis, where the influences of these bottlenecks are quantified. Sub-models in the upper layers then gradually integrate the sub-models in the lower layers, finally leading to one penalty model that scales the resource utilization.

A. Hypothesis and Validation

1) No Hyper-threaded mode:

As said in the introduction of this chapter, the clock cycle counts of different runs should be quite similar to make the analysis of the hardware events possible. And as discussed in Chapter X-B, the available test-sets contain all the representative bitmaps. To speed up the experiments and ease the analysis, only one of the five test-sets – the bitmaps in set-B – is used as input to the F-path application. This means that all the experiments done via VTune in this chapter are based only on set-B. Table 3 presents the clock cycles of four different runs of the whole set-B when a different number of cores is used on the Xeon platform. It shows similar clock cycle counts across different runs when the same number of cores is applied. Hence, the data collected from VTune can be used to analyze the bottlenecks in the F-path application.

Table 3. Clock Cycles of different Runs provided by VTune

No. | 1 core applied | 2 cores applied | 3 cores applied | 4 cores applied | 6 cores applied
1   | 80,452,000,000 | 79,676,000,000  | 85,254,000,000  | 88,874,000,000  | 97,602,000,000
2   | 80,478,000,000 | 79,710,000,000  | 85,072,000,000  | 89,080,000,000  | 97,294,000,000
3   | 80,420,000,000 | 79,768,000,000  | 84,974,000,000  | 88,576,000,000  | 97,496,000,000
4   | 80,310,000,000 | 79,740,000,000  | 85,216,000,000  | 88,816,000,000  | 97,386,000,000

From Figure 13, it can be seen that there are two jumps in both the hyper-threaded and the no hyper-threaded mode: one from two cores to three cores, and the other from five cores to six cores applied to the F-path application. To find the bottlenecks occurring in these setups, experiments with VTune are performed. In the following analysis, experiments are performed only in the 1-core, 3-core and 6-core environments on the Xeon platform, as these are the points of interest. Before doing the experiments, the workload issued to each used core is inspected, as the clock cycles spent on each core directly relate to the workload issued to it. Table 4 summarizes the band size in bytes that each core should process according to the current distribution algorithm. The core identifier aligns with the identifier in VTune for easy comparison (it may not be the same as the physical id). From this table, it can be seen that with up to five cores applied, the workload issued to each core is so similar that the workload distribution can hardly cause processing time variation. When using more than five cores, the workload distribution is not that even. Hence, workload distribution can be taken into consideration when analyzing the performance in the 6-core environment, although it probably has little influence on the total execution time given the small deviation in the issued workload.

Table 4. Size of bands distributed to each core with increase in the number of cores

bands' size distribution when applying a different number of cores (Set-B), unit: bytes

      | 1 core   | 2 cores  | 3 cores  | 4 cores  | 5 cores  | 6 cores  | 7 cores  | 8 cores
core6 |          | 3,02E+08 | 2,06E+08 | 1,53E+08 | 1,24E+08 | 1,01E+08 | 82834340 | 74429168
core5 |          | 3,02E+08 | 1,99E+08 | 1,55E+08 | 1,23E+08 | 1,03E+08 | 82789230 | 82017685
core4 |          |          | 1,99E+08 | 1,48E+08 | 1,18E+08 | 1,05E+08 | 89362466 | 84278095
core3 |          |          |          | 1,47E+08 | 1,21E+08 | 1,05E+08 | 92656032 | 82387427
core2 |          |          |          |          | 1,18E+08 | 96040371 | 90111884 | 78988484
core1 |          |          |          |          |          | 94280860 | 85473236 | 72903041
core0 |          |          |          |          |          |          | 80894505 | 64139739
core7 | 6,04E+08 |          |          |          |          |          |          | 64978054

Figure 18 (a) to (c) show the general exploration of the F-path application with one core, three cores and six cores applied; these figures only represent the exploration of the band-processing tasks ("handler" in Figure 2). Attention is now paid to the fourth column in Figure 18, where the CPI rates are indicated. In the case of similar retired-instruction counts, growth in the CPI rate can explain increments in the bands' processing time (the processing time can approximately be derived as CPI * retired instructions / CPU frequency). However, the CPI rate increments do not align with the growth in Figure 13. On average, the CPI rates in the 3-core and 6-core environments are 0.756 and 0.774 cycles per instruction. Compared to the CPI rate of 0.754 cycles per instruction in the 1-core environment, the growth in CPI rate is clearly not as high as shown in Figure 13. By accumulating the retired instructions shown in the third column of Figure 18, it can be derived that on average 5.68% and 17.86% more retired instructions are executed in the 3-core and 6-core environments respectively, compared to the 1-core environment. These extra retired instructions can come from three sources. The first is the overhead of multithreaded programming. Second, instructions loaded into the caches may be evicted before they can be executed, so they are fetched and decoded multiple times. Third, instructions are used to ensure cache coherency in the multicore environment (in the Sandy Bridge architecture, snooping is applied to solve this problem). It is important to figure out which of these is the main reason for the larger number of retired instructions when more cores are used.

From experience, the overhead of multithreaded programming should not be that large. To make sure, the 2-core experiment is also performed with VTune, shown in Figure 18 (d). The average number of retired instructions (averaged over different runs) is 106,474,300,000 in the 2-core environment, while for one core it is 106,709,500,000. The two numbers are almost the same (0.22% fewer for two cores than for one core). Multithreaded programming should impose a similar overhead on each used core, unless a different instruction library is used for environments with a different number of cores (which is not the case). We therefore conclude that the overhead of multithreaded programming is hardly the main source. From this decrease in the number of retired instructions, which aligns with the trend shown in Figure 13, it can also be concluded that fewer retired instructions are evicted in the two-core environment than in the one-core environment. This is because additional cache-coherency (snooping) traffic occurs cross-core when more cores are applied, so the retired instruction count would be expected to increase in the two-core environment when considering only cache coherency. However, the data derived from VTune show the opposite. The only explanation is that fewer evictions happen with two cores than with one core, and this outweighs the influence of cross-core cache coherency. From the above analysis, instruction evictions and cache coherency are the two main sources of variation in the number of retired instructions.

(a) one core used

(b) three cores used

(c) six cores used

(d) two cores used

Figure 18. General exploration from VTune

Not only instructions are evicted; loaded data can also be evicted from the caches. Evicted information that has not yet been used must be reloaded, so part of the increased processing time consists of the time spent reloading this information. Fortunately, the ring bus of the two available platforms has a separate ring for executing snooping (for cache coherency). In general, when information in the LLC is evicted, snooping between cores and reloading are performed simultaneously using different rings. Snooping stays within the same die, while information has to be reloaded from DRAM in case of LLC misses, so snooping takes much less time than reloading. As a result, cache coherency hardly influences the processing time, and evictions in the caches, shortened to "cache evictions", are the main reason for the increasing processing time.

Moreover, as seen in Figure 18 (b) and (c), there are two columns marked red by VTune, which indicate loads blocking stores. This occurs when a prior store in flight is writing data that a load wants. The load then blocks the store from forwarding to the lower memory hierarchy and retrieves the data. However, as the store cannot forward until the load finishes, there is a possibility that the load wants much more data than the store is writing. In this case, the store operation is pending until the load operation has read all the data, which may be located in remote DRAM. This causes additional time for store operations. To see whether this is a real bottleneck that influences the performance, Table 5 is derived, which indicates the clock cycles spent on loads blocking stores. It can be seen that with an increase in the number of cores applied, these clock cycles (time) indeed increase in general. This is expected, as more instructions are loaded with more cores. However, it can also be seen that the clock cycles of different runs deviate significantly, especially for the 4-core and 7-core environments. From the processing time measured by the F-path application, there is little deviation in the processing time between different runs when using the same number of cores on the same platform. So although the loads-block-stores time increases with the number of cores, it does not contribute significantly to the processing time of the F-path application; otherwise, with such deviation in the loads-block-stores values in the same environment, the processing time would also deviate a lot, which is not the case. Hence, it can be concluded that the loads-block-stores factor is not a bottleneck for the F-path application.

Table 5. Clock Cycles *1,000,000 that Loads block Stores takes on different runs

        | 1st run | 2nd run | 3rd run | 4th run
1 core  | 3380    | 3380    | 3298    | 3131
2 cores | 3440    | 3427    | 3147    | 3670
3 cores | 5854    | 5927    | 5551    | 6164
4 cores | 6309    | 5344    | 6443    | 5194
6 cores | 7711    | 7589    | 7361    | 7499
7 cores | 6302    | 8445    | 7484    | 7373
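To make the run-to-run deviation quantitative, the coefficient of variation of the Table 5 rows can be computed; this is an illustrative check added here, not an analysis performed in the thesis.

```python
from statistics import mean, pstdev

# Clock cycles (x 1,000,000) that loads-block-stores takes, from Table 5.
loads_block_stores = {
    "1 core":  [3380, 3380, 3298, 3131],
    "2 cores": [3440, 3427, 3147, 3670],
    "3 cores": [5854, 5927, 5551, 6164],
    "4 cores": [6309, 5344, 6443, 5194],
    "6 cores": [7711, 7589, 7361, 7499],
    "7 cores": [6302, 8445, 7484, 7373],
}

# Coefficient of variation (std/mean) per environment; the 4-core and 7-core
# rows show roughly 10% spread, far larger than the run-to-run spread of the
# measured band processing times.
for env, runs in loads_block_stores.items():
    print(env, f"{pstdev(runs) / mean(runs):.1%}")
```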

From these analyses, the increase in the bands' processing time is caused by instruction and data evictions, which happen more frequently when more cores are applied, as contention in the caches becomes more frequent. Cache misses lead to requests for loading instructions or data from a lower memory level (cache or DRAM), which in turn leads to contention on the interconnect. As a result, the increased part of the bands' processing time in no hyper-threaded mode comes from reloading information when cache evictions happen.

2) Hyper-threaded mode:

In the hyper-threaded mode, two hardware threads are available on the same core, so bands can be processed in two parallel threads, in theory doubling the throughput. However, each single hardware thread cannot use the full core (the computation resources are limited), which increases the time to process each band. Besides, these two threads share the same memory hierarchy, so contention between them can happen at any memory level and on the transport bus. To understand the influence of the hyper-threading technique on the application, experiments are performed on a single core with and without hyper-threaded mode, to exclude the effects of contention between cores. Two screenshots taken from VTune represent a general exploration of the threads that process bands. As mentioned, test set-B is used as input in the experiments, and only tasks that process bands are selected in VTune. The clock cycles shown in Figure 19 (a) and (b) represent the clock cycles to process all the bands issued to each thread. From Table 4, it is known that when using two cores in no hyper-threaded mode, the workload issued to each core is almost the same. This also holds for the 1-core environment in hyper-threaded mode: each hardware thread has to process almost the same number of bands. The clock cycles taken to process half the data by one hardware thread in no hyper-threaded mode should then be around 40,226,000,000 (80,452,000,000/2), as seen in Figure 19 (a). However, in hyper-threaded mode, the clock cycles needed in no hyper-threaded mode are on average only about 67.1% of the clock cycles taken by one hardware thread for the same amount of data (40,226,000,000/((59,944,000,000+59,942,000,000)/2)). If this factor derived from test set-B is applicable to the processing time of all bands in the whole test-sets, then by applying it to the measured bands' processing time in no hyper-threaded mode (dividing the processing time by this factor), the estimated bands' processing time in the 1-core hyper-threaded environment should be quite close. Comparing the measured bands' processing time in hyper-threaded mode with the time estimated by this operation gives on average only a 0.3% difference, with a 0.73% standard deviation. So without a deeper investigation of the possible bottlenecks, a fairly accurate estimate of the bands' processing time in the 1-core hyper-threaded mode can be given by just measuring the bands' processing time in 1-core no hyper-threaded mode, which is in fact the input to the DES model. However, it should be kept in mind that this factor includes all the detailed effects, like the core utilization of each thread and inner-core contention. When applying more cores to the F-path application, contention on shared resources can become more intense in hyper-threaded mode, as more threads then compete for them than in no hyper-threaded mode.

(a) one core applied in no HT mode

(b) one core applied in HT mode

Figure 19. General Exploration by VTune when one core is applied


Hence, in the hyper-threaded mode, the reasons for the increase in the bands' processing time are the same as in no hyper-threaded mode, but they now have more influence. In addition, the lower core utilization of each hardware thread in hyper-threaded mode compared to no hyper-threaded mode is another reason for the growth in the bands' processing time. When modeling, it should be ensured that the same influences are not duplicated in different forms, which would increase the prediction error.

B. Model Realization

In the bottom-up design approach, sub-models of detailed sub-systems are built that together constitute one model which captures the characteristics of the entire system and is able to predict its performance. In the previous section, the bottlenecks that could influence the system's processing time were inspected in order to avoid unnecessary complexity when building the sub-models. In other words, these bottlenecks are the targets to be simulated by the sub-models. The bottlenecks that have been discussed are summarized here:

1. Instruction and data evictions because of contention in the caches
2. Lower core utilization of each hardware thread in hyper-threaded mode than in no hyper-threaded mode
3. More evictions in hyper-threaded mode than in no hyper-threaded mode when the same number of cores is applied

The problem now is how to realize them as sub-models. Simulating all the techniques applied in the resources where bottlenecks happen is not realistic, and not meaningful. For example, the LLC could be modeled by its capacity, principle of locality, replacement policy and all other related techniques, with no required data available in the LLC at the beginning. To obtain the instruction and data evictions, in other words the LLC misses/hits, it is necessary to know how much data each thread asks for at a time, which is determined by the application, and the location of the required data in DRAM, which is determined by the operating system. Hence, to get an accurate number of LLC misses/hits, not only the cache itself but also the amount of data loaded each time and the memory-allocation techniques of the operating system would need to be simulated. This is labor intensive and not worth the cost. Besides, not all techniques used in the cache and the operating system are visible to customers. In this subsection, these bottlenecks are therefore represented by functions derived from statistical analysis which capture their main characteristics.

The time spent processing any application consists of two parts: 1) the computation time when cores execute instructions; 2) the communication time for loading and storing instructions and data. The bottlenecks listed above therefore result in two types of models that scale the utilization of the affected resources, based on which kind of time they influence. In the hyper-threaded mode, the computation time increases because of the lower utilization of the computation units in the cores, such as ALUs, by each hardware thread compared to no hyper-threaded mode (contention on the shared cores). So one sub-model scales the utilization of these computation resources. The information evictions result in an increase in the communication time, as evicted information needs to be reloaded. The interconnect is the medium for reloading this information; hence, the increase in communication time is actually the time spent transferring the information over the interconnect. This can be modeled as a change in the interconnect utilization: with lower interconnect utilization, more time is spent on communication. Figure 20 shows the sub-models based on the type of time the bottlenecks affect. They are categorized as computation-time-related and communication-time-related. These sub-models are organized in two layers: the lower layer, layer 0, contains the models representing the bottlenecks, and they lead to two models in the upper layer used to scale the utilization of the shared resources. In the computation-time-related zone, the computation resources' utilization of each thread relates to the hyper-threading technique, as discussed. There is only one sub-model in the upper layer of this zone, as the hyper-threading technique is not modeled itself, but represented as one factor in this computation resources' utilization sub-model. In the communication-time-related zone, there are two sub-models in layer 0. In principle, the communication time is a function of the distance from source to destination, the communication speed, and the size of the transferred information. When taking contention on the shared interconnect into consideration, the waiting time for access to the bus is also part of the communication time. As any miss in the first two cache levels that hits in the LLC costs much less time than a miss in the LLC, the influence of cache evictions mainly indicates the influence of LLC misses. The cache eviction model in Figure 20 simulates the increment in the number of cache misses in environments with a different number of cores compared to the 1-core environment. The waiting distance model indicates the time penalty of LLC misses, including the waiting penalty of these LLC misses (waiting for LLC misses requested by other cores at the same time). The interconnect utilization model is then a function of its constituent sub-models, which finally affects the communication time. As the purpose of the penalty model is to scale the bands' processing time in the 1-core environment to the time in different environments, the sub-models can be integrated into one model that generates the scale factor of the processing time. In the implementation, they are integrated into the existing multicore CPU model.

Figure 20. Penalty model built from bottom-up approach

The dynamic penalty model is built gradually in iterative steps, as indicated in Figure 2. In each iterative step, the DES model embedded with the developing penalty model enables a new capability based on the analysis of the requirements. The new capability expected to be achieved in each step is the capability to estimate the time to process bitmaps:

1. when a different number of cores in no hyper-threaded mode is used on one platform (discussed in subsection 1 of this section);
2. when a different number of cores in hyper-threaded mode is used on the same platform (discussed in subsection 2 of this section);
3. when a different platform is used to process the F-path application (discussed in subsection 3 of this section).

Since many derivations of formulas are discussed in the following subsections, before continuing, a table, Table 6, is given to show the meaning of symbols in the following paragraphs:

Table 6. Meanings of the symbols used in the following paragraphs

symbols | meaning
U   | utilization
#   | the number of
M   | cache misses
S   | snooping
L   | latency
I   | increment to the baseline
PA  | penalty of LLC misses
D   | penalty of LLC misses because of the waiting latency
BA  | bands
k   | scaling factor

superscripts | meaning
T   | target environment
B   | 1-core environment in the no hyper-threaded mode
HT  | hyper-threaded mode
P   | target platform

subscripts | meaning
i         | retired instructions
t         | processing time
cross     | cross-core
LLC       | LLC misses
waiting   | LLC misses waiting
c         | computation resources
inter     | interconnect resource
processor | multicore CPU resource
cores     | cores used

Symbols can be combined; for instance, $\#M^B$ represents the number of cache misses in the 1-core environment, and $U^{B-HT}_{inter}$ represents the utilization of the interconnect in the 1-core hyper-threaded environment. In any case, if B is used together with HT, it means the 1-core environment in the hyper-threaded mode.

1) The dynamic penalty model developed for capability 1

Model Analysis and Realization:

From Figure 20, it can be seen that if the hyper-threaded mode is not enabled, the sub-model for the computation resources' utilization does not need to be built. As discussed earlier, it is hard to model the real number of cache misses in the cache eviction model because of many constraints. So instead, the variation of the cache misses when applying a different number of cores, compared to the baseline in no hyper-threaded mode (the cache misses in the 1-core environment without hyper-threading), is simulated here.

From the bottleneck analysis in the previous section (Section A), the trend of cache misses can be derived from the increment of retired instructions taken from VTune. The number of retired instructions is affected by cache evictions and cache coherency. The number of snoops for cache coherency relates to the cache evictions, since snooping happens when data in the cache is replaced. So the variation trend of the number of snoops is assumed to be the same as the variation of the cache evictions, up to a certain scaling factor. Then the increment in the number of retired instructions indicates the trend of the cache evictions, in other words the increment in the number of cache misses, as seen in the following formula.

$$\frac{\#M^T + \#S^T - (\#M^B + \#S^B)}{\#M^B + \#S^B} \approx \frac{\#M^T + k\cdot\#M^T - \#M^B - k\cdot\#M^B}{\#M^B + k\cdot\#M^B} = \frac{(1+k)\cdot\#M^T - (1+k)\cdot\#M^B}{(1+k)\cdot\#M^B} = \frac{\#M^T - \#M^B}{\#M^B} = I^T_i \qquad \text{Formula (2)}$$

Cache evictions can be categorized into inner-core and cross-core cache evictions. Cross-core evictions happen in the LLC when different cores access the same location in the LLC, while inner-core evictions can happen at any cache level when data is evicted by other data loaded by the same core. Hence, the total number of cache misses is the sum of 1) the sum of the inner-core cache misses of each core in the private caches (L1+L2), 2) the sum of the inner-core cache misses of each core in the LLC, and 3) the cross-core cache misses in the LLC as a whole. The problem now turns into modeling these three factors. From Formula (2), it can be derived that $\#M^T = (I^T_i + 1)\cdot\#M^B$. As the variation trend of the cache misses relative to the baseline in no hyper-threaded mode needs to be found, $\#M^B$ is set to 1 (an abstract value), so that $\#M^T = I^T_i + 1$. And since there are no cross-core cache evictions in the baseline environment, this $\#M^B$ is composed of inner-core misses in the private caches and inner-core misses in the LLC. Each factor is initialized to cause half of the total cache misses as a start (0.5 for each type of cache miss). This distribution can be modified later if it yields a higher accuracy of the predicted cache eviction trend than the equal distribution.

With an increase in the number of cores applied, the number of bands to be processed by each core decreases, and so do the inner-core cache misses in the private caches. Hence, the function that describes the inner-core cache misses on L1+L2 can be $\frac{0.5}{\#cores^T}$. The second column in Table 7 applies this function to represent the influence of inner-core cache evictions on L1+L2. For the same reason, the inner-core cache evictions in the LLC should also decrease with an increasing number of cores. But because each core has its private caches to hold data, the inner-core cache evictions in the LLC decrease further; the function that formulates this decrease can be $\frac{0.5}{(\#cores^T)^2}$, which is used in the third column of Table 7. $I^T_i$ is calculated from the formula $I^T_i = \frac{\#i^T - \#i^B}{\#i^B}$, where $\#i$ has been measured by VTune for every environment with a different number of cores. The resulting $I^T_i$ is shown in the fourth column of Table 7. As said above, the total number of cache misses contains three factors. With two of them formulated, the cross-core cache evictions can be calculated through the equation:

$$\frac{\left(\frac{0.5}{\#cores^T} + \frac{0.5}{(\#cores^T)^2}\right)\cdot\#cores^T + \#M^T_{cross} - \#M^B}{\#M^B} = I^T_i \;\Rightarrow\; \#M^T_{cross} = I^T_i\cdot\#M^B + \#M^B - \left(\frac{0.5}{\#cores^T} + \frac{0.5}{(\#cores^T)^2}\right)\cdot\#cores^T \qquad \text{Formula (3)}$$

For example, the cross-core cache misses in the 6-core environment can be obtained from Formula 3 with all variables substituted by the values found in Table 7: $17.86\%\cdot 1 + 1 - \left(\frac{0.5}{6} + \frac{0.5}{6^2}\right)\cdot 6 = 0.595267$. The resulting cross-core cache evictions in the other environments with different numbers of cores are shown in the fifth column of Table 7, and the left plot in Figure 21 shows the increasing trend of the first four environments. It seems that two kinds of functions can represent the curve shown in Figure 21 (left): one is the parabola $y^2 = 2px$, with $(p/2, 0)$ as its focal point; the other is the logarithmic function $y = a\cdot\log_b x$, with $a$ as its scaling coefficient. The function that represents this curve could be found through regression analysis; however, as this is outside the author's experience and would take time to learn, several experiments are done instead to inspect which function fits the curve better, and it is found that the logarithmic function fits better than the parabola. The coefficient $a$ in the logarithmic function equals $\frac{y}{\log_b x}$, where $y$ is the calculated cross-core cache misses and $x$ the number of cores applied, so the value of the base of the logarithm now needs to be chosen. The following derivation shows that the base $b$ can be set to any value greater than 1.

Table 7. Table used to deduce the function of cross-core evictions (* indicates the data comes from measurements, all in abstract units)

#cores | inner-core L1+L2 misses | inner-core LLC misses | increment of instructions* | calculated cross-core cache misses | calculated coefficient | predicted cross-core cache misses | predicted increment of instructions | errors
1      | 0.5 (init) | 0.5 (init) | 0.00%  | 0 (init) | –        | 0        | 0.00%  | 0.00%
2      | 0.25       | 0.125      | -0.22% | 0.2478   | 0.2478   | 0.238001 | -1.20% | 0.99%
3      | 0.166667   | 0.055556   | 5.68%  | 0.390133 | 0.246147 | 0.377223 | 4.39%  | 1.29%
4      | 0.125      | 0.03125    | 9.68%  | 0.4718   | 0.2359   | 0.476003 | 10.10% | -0.42%
6      | 0.083333   | 0.013889   | 17.86% | 0.595267 | 0.230281 | 0.615224 | 19.86% | -2.00%
7      | 0.071429   | 0.010204   | 21.65% | 0.645071 | 0.229779 | 0.668154 | 23.96% | -2.31%
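The "calculated cross-core cache misses" column of Table 7 follows directly from Formula (3); the snippet below is an illustrative recomputation using the measured instruction increments listed in the table.

```python
import math

# Measured increment of retired instructions per core count (Table 7, col. 4).
instr_increment = {1: 0.0, 2: -0.0022, 3: 0.0568, 4: 0.0968, 6: 0.1786, 7: 0.2165}

def cross_core_misses(n_cores: int, split=(0.5, 0.5)) -> float:
    """Formula (3): back-calculate the cross-core LLC evictions from the
    measured instruction increment, with #M^B = 1 (abstract unit) and the
    given initial split between inner-core L1+L2 and inner-core LLC misses."""
    l12, llc = split
    inner = (l12 / n_cores + llc / n_cores**2) * n_cores
    return instr_increment[n_cores] * 1.0 + 1.0 - inner

print(cross_core_misses(6))                    # ~0.5953, matching Table 7
print(cross_core_misses(6) / math.log2(6))     # ~0.2303, the coefficient column
```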

Assume that $a_1$ is the coefficient when $b_1$ is the base, and likewise for the pair $(a_2, b_2)$. The ratio between $a_1$ and $a_2$ can be represented by $\frac{a_1}{a_2} = \frac{\log_{b_2}x}{\log_{b_1}x} = \frac{\log_2 b_1}{\log_2 b_2}$, so $a_1 = \frac{\log_2 b_1}{\log_2 b_2}\cdot a_2$. Substituting this for $a_1$ in the formula $a_1\cdot\log_{b_1}x$, it can be derived that $y = \frac{\log_2 b_1}{\log_2 b_2}\cdot a_2\cdot\log_{b_1}x = a_2\cdot\frac{\log_2 x}{\log_2 b_2} = a_2\cdot\log_{b_2}x = y$. So the pair $(a_1, b_1)$ results in the same cross-core misses in each environment of different cores as the pair $(a_2, b_2)$. The base of the logarithm is therefore set to 2, and the corresponding coefficient is calculated by the function $\frac{\#M^T_{cross}}{\log_2(\#cores^T)}$. The sixth column in Table 7 shows the calculated coefficient for different numbers of cores. It can be seen that there is little variation between the coefficients. To fill in the logarithmic formula, they are averaged to 0.237981, and the formula that represents the cross-core cache misses becomes $0.237981\cdot\log_2(\#cores^T)$. The right plot in Figure 21 shows the calculated cross-core misses from the fifth column of Table 7 and the cross-core cache misses estimated by this formula; the two curves largely overlap. The last column in Table 7 shows the errors between the measured increment of instructions and the increments predicted using Formula (2) with $\#M^T = 0.237981\cdot\log_2(\#cores^T) + \left(\frac{0.5}{\#cores^T} + \frac{0.5}{(\#cores^T)^2}\right)\cdot\#cores^T$. The average difference is 1.168%, as indicated by the errors column. By adjusting the initial fraction between inner-core cache evictions on L1+L2 and on the LLC, smaller differences can be achieved. Table 8 presents the results when setting the initial distribution ratio to 0.6:0.4; the coefficient then changes to 0.199515 using the same derivation process, and the average difference becomes 0.261%. To conclude, the following three formulas describe the number of cache misses:

inner-core L1+L2 cache misses in each core: $\dfrac{0.6}{\#cores^T}$

inner-core LLC misses in each core: $\dfrac{0.4}{(\#cores^T)^2}$

cross-core LLC misses in total: $\#M^T_{cross} = 0.199515\cdot\log_2(\#cores^T)$
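These three formulas can be sketched as a small forward model; the snippet below recomputes the Table 8 columns for a given core count (illustrative only).

```python
import math

COEFF_CROSS = 0.199515            # averaged coefficient from Table 8
SPLIT_L12, SPLIT_LLC = 0.6, 0.4   # initial distribution of the baseline misses

def predicted_misses(n_cores: int) -> tuple[float, float, float]:
    """Return (inner-core L1+L2, inner-core LLC, cross-core LLC) misses in
    abstract units, relative to the 1-core baseline #M^B = 1."""
    inner_l12 = SPLIT_L12 / n_cores
    inner_llc = SPLIT_LLC / n_cores**2
    cross_llc = COEFF_CROSS * math.log2(n_cores)
    return inner_l12, inner_llc, cross_llc

def predicted_instruction_increment(n_cores: int) -> float:
    """Formula (2) applied in the forward direction: the predicted relative
    increment of retired instructions for the target environment."""
    l12, llc, cross = predicted_misses(n_cores)
    total = (l12 + llc) * n_cores + cross
    return total - 1.0            # #M^B = 1

print(predicted_instruction_increment(6))   # ~0.182, close to 18.24% in Table 8
```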

Figure 21. The plot of the calculated cross-core evictions from 1 core to 4 cores based on the measured retired instructions (left) and calculated and estimated cross-core evictions (right)

Table 8. Predicted cache evictions when different number of cores applied (* indicates the data comes from measurements, all in abstract unit)

#cores | inner-core L1+L2 misses | inner-core LLC misses | calculated cross-core cache misses | predicted cross-core cache misses | predicted increment of instructions | errors
1      | 0.6 (init) | 0.4 (init) | 0 (init)  | 0        | 0.00%  | 0.00%
2      | 0.3        | 0.1        | 0.1978    | 0.199515 | -0.05% | 0.17%
3      | 0.2        | 0.044444   | 0.323467  | 0.316224 | 4.96%  | -0.72%
4      | 0.15       | 0.025      | 0.3968    | 0.39907  | 9.90%  | 0.22%
6      | 0.1        | 0.011111   | 0.5119333 | 0.515739 | 18.24% | 0.38%
7      | 0.085714   | 0.008163   | 0.5593571 | 0.560109 | 21.73% | 0.08%

Any cache miss in the private caches that hits in the LLC has less influence than a miss in the LLC; normally, large time variations are caused by misses in the LLC. Hence, to reduce the modeling complexity without losing much accuracy, only the influence of cache misses on the LLC is considered in this dynamic penalty model. From Figure 8, it can be seen that transfers within the same die can be pipelined, which means that transfers to different cores can be performed concurrently. So when data is prepared in the memory controller, transfers to one core do not need to wait for other transfers to finish. For instance, in Figure 8(a), if core 2 and core 3 send requests to the memory controller simultaneously, the request from core 3 follows the route 3->2->1->0->memory controller, while the request from core 2 follows the route 2->1->0->memory controller; at any point in time there is no overlapping routing segment between the two requests. However, the time cost on the connection between the memory controller and DRAM varies.


Modern multiprocessors use a NUMA memory design, where the time to access memory depends on the location of the memory relative to the processor [15]. Each cache miss on the LLC has a different access/loading time: loading data from remote memory certainly costs more time than loading from local memory. On the Xeon platform there are two CPU packages and each CPU has its local memory; each CPU's local memory is remote memory for the other CPU. When the main thread loads data into DRAM at the beginning, it first uses the space of its local memory, and then its remote memory. It is hard to determine how much data is loaded into each memory, and which cache miss requires data from remote memory, but these effects can be hidden in the waiting distance model. The waiting distance model models the abstract time for each core to load data from DRAM in case of an LLC miss; this time includes the time to load from local and from remote memory.

Table 9 is used to find an estimated function describing the waiting distance from the cores to DRAM, and this table relates to the cache misses of each core. For instance, to get the inner-core LLC misses of one core in the 6-core environment, the total inner-core LLC misses are first calculated by the formula $\frac{0.4}{6^2}\cdot 6$. This is then distributed over the cores according to the workload ratio, here the ratio of the number of bands issued to one core to the whole 23 bands. So the inner-core LLC misses of one core roughly equal $\frac{0.4}{6^2}\cdot 6\cdot\frac{4}{23}$ in the 6-core environment. The same procedure can be performed to get the cross-core LLC misses, i.e. $0.199515\cdot\log_2(6)\cdot\frac{4}{23}$ in this example. If the processing time of each band in the 1-core environment is known, the scale ratio can be the processing time of the bands issued to a core divided by the processing time of all bands. The third and fourth columns are filled by applying these two formulas, and the fifth column gives the sum of the LLC misses of one core in each environment with a different number of cores.

Table 9. Table used to predict the waiting distance model (* indicates the data comes from measurements, all in abstract unit)

#cores | #bands issued to the core | inner-core LLC misses | predicted cross-core evictions | total LLC misses | increment of processing time* | calculated LLC latency | waiting latency
1  | 23 | 0.4      | 0        | 0.4      | 0.00%  | 0.4      | 0
2  | 12 | 0.104348 | 0.104095 | 0.208443 | -0.31% | 0.39876  | 0.190317
3  | 8  | 0.046377 | 0.109991 | 0.156368 | 7.39%  | 0.42956  | 0.273192
4  | 6  | 0.026087 | 0.104095 | 0.130182 | 7.48%  | 0.42992  | 0.299738
5  | 5  | 0.017391 | 0.100709 | 0.1181   | 8.19%  | 0.43276  | 0.31466
6  | 4  | 0.011594 | 0.089694 | 0.101288 | 13.14% | 0.45256  | 0.351272
7  | 4  | 0.009938 | 0.09741  | 0.107348 | 13.88% | 0.45552  | 0.348172
8  | 3  | 0.006522 | 0.078071 | 0.084593 | 12.26% | 0.449041 | 0.364448
9  | 3  | 0.005797 | 0.082493 | 0.08829  | 11.83% | 0.447309 | 0.359019
10 | 3  | 0.005217 | 0.086449 | 0.091666 | 12.70% | 0.4508   | 0.359134
11 | 3  | 0.004743 | 0.090027 | 0.09477  | 12.63% | 0.45051  | 0.35574
12 | 2  | 0.002899 | 0.062196 | 0.065095 | 14.40% | 0.457614 | 0.392519
13 | 2  | 0.002676 | 0.064199 | 0.066875 | 15.60% | 0.460927 | 0.394052
14 | 2  | 0.002484 | 0.066054 | 0.068539 | 15.64% | 0.462572 | 0.394033
15 | 2  | 0.002319 | 0.067781 | 0.0701   | 18.69% | 0.474746 | 0.404646
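The per-core miss columns of Table 9 follow directly from the formulas above; the snippet below recomputes the 6-core row as an illustration (the band counts are taken from the table).

```python
import math

BANDS_TOTAL = 23                  # bands in one bitmap of set-B
COEFF_CROSS = 0.199515

def per_core_llc_misses(n_cores: int, bands_on_core: int) -> float:
    """Sum of inner-core and cross-core LLC misses attributed to one core,
    distributed by the workload ratio (here: bands issued / total bands)."""
    ratio = bands_on_core / BANDS_TOTAL
    inner = 0.4 / n_cores**2 * n_cores * ratio
    cross = COEFF_CROSS * math.log2(n_cores) * ratio
    return inner + cross

print(per_core_llc_misses(6, 4))  # ~0.1013, matching the 6-core row of Table 9
```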


The increments of the processing time are the data labels of the average difference in the left plot of Figure 13. Although this is the increment of the bands' processing time, it is used to derive the waiting distance model of each core, as the bands are processed by the cores. From the perspective of each core, the models derived from the increment in the processing time of each core are approximately the same models used to scale the bands' processing time. It can be seen from Figure 20 that in no hyper-threaded mode the interconnect utilization sub-model directly determines the utilization of the CPU and is used to scale the processing time of each band. Hence, the increment of the bands' processing time also indicates the increment of the overhead latency caused by LLC misses, which is $\frac{L^T_{LLC} - L^B_{LLC}}{L^B_{LLC}} = I^T_t$. $L_{LLC}$ equals $\#M_{cross}$ multiplied by the penalty of these LLC misses, $PA$. The following formula can then be derived to relate the penalty of LLC misses in the environments with different numbers of cores. As can be seen in Formula (4), the penalty in the target environment is proportional to the penalty in the baseline environment. $PA^B$ is hence set to 1 for computational convenience, and then $L^B_{LLC}$ equals 0.4, which is the baseline LLC miss latency.

$$\frac{L^T_{LLC} - L^B_{LLC}}{L^B_{LLC}} = \frac{\#M^T_{cross}\cdot PA^T - \#M^B_{cross}\cdot PA^B}{\#M^B_{cross}\cdot PA^B} = I^T_t \;\Rightarrow\; \frac{PA^T}{PA^B} = \frac{(I^T_t + 1)\cdot\#M^B_{cross}}{\#M^T_{cross}} \qquad \text{Formula (4)}$$

Assume that each LLC miss from one core is not delayed when accessing the interconnect, as in the baseline (where no other cores exist). With the penalty of each LLC miss in this situation set to 1, like the LLC miss penalty in the baseline, $L^T_{LLC}$ would just equal $\#M^T_{cross}$. However, as seen from the seventh column in Table 9, which calculates $L^T_{LLC}$ by the formula $I^T_t\cdot L^B_{LLC} + L^B_{LLC}$, the calculated $L^T_{LLC}$ does not equal $\#M^T_{cross}$. So there must be LLC misses from different cores that require the interconnect at the same time. This penalty due to concurrent accesses to the interconnect is simulated by the waiting distance sub-model, and then $L^T_{LLC} = \#M^T_{cross}\cdot(1 + D^T_{LLC})$, where 1 indicates the penalty without influence of the other cores and $D^T_{LLC}$ represents the penalty given by the waiting distance sub-model. The last column in Table 9 shows the waiting latency $L^T_{waiting}$ caused by the waiting distance penalty: $L^T_{waiting} = \#M^T_{cross}\cdot D^T_{LLC} = L^T_{LLC} - \#M^T_{cross}\cdot 1$. Figure 22 plots this waiting latency $L^T_{waiting}$ from the eighth column of Table 9. Currently there is no nice closed-form formula that fits the curve shown in Figure 22; hence, in the implementation, each point of the curve is filled into a lookup table.
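A small sketch of how the waiting-latency lookup table is derived from the measured processing-time increments (Table 9, columns 6–8); the values shown reproduce the 6-core row.

```python
L_LLC_BASE = 0.4          # baseline LLC miss latency (PA^B = 1, #M^B = 0.4)

def waiting_latency(i_t: float, m_cross: float) -> float:
    """Derive the waiting latency for one environment.

    i_t:     measured relative increment of the processing time (Figure 13)
    m_cross: total LLC misses of one core in that environment (Table 9)
    """
    l_llc = i_t * L_LLC_BASE + L_LLC_BASE   # calculated LLC latency
    return l_llc - m_cross * 1.0            # subtract the un-delayed penalty

# 6-core row of Table 9: increment 13.14%, total LLC misses 0.101288.
print(waiting_latency(0.1314, 0.101288))    # ~0.3513, the lookup-table entry
```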

As a result, the sub-models in the communication-time-related domain can be realized as shown in Figure 23. The cache eviction ($\#M^T_{cross}$) is a function of the workload ratio and the number of cores applied in the environment: $\#M^T_{cross} = \left(\frac{0.4}{(\#cores^T)^2}\cdot\#cores^T + 0.199515\cdot\log_2(\#cores^T)\right)\cdot workload\ ratio$. As the bands' processing time is not shown in this report, the workload ratio is here defined as $\frac{\#bands\ processed\ in\ the\ core}{\#bands\ in\ one\ bitmap}$; in the implementation, the workload ratio can be the ratio of the baseline bands' processing time in each core to the entire baseline processing time of the 23 bands. The waiting distance $D^T_{LLC}$ is a function derived from a lookup table of $L^T_{waiting}$, which represents the points in Figure 22, divided by $\#M^T_{cross}$. From Figure 20, the interconnect utilization sub-model is used, in no hyper-threaded mode, to scale the multicore processor utilization (millisecond/abstract_time_unit), which is set to 1 in Chapter IX, and to estimate the bands' processing time in the target environment, denoted by $BA^T_t$, from the known baseline bands' processing time, denoted by $BA^B_t$. The target bands' processing time $BA^T_t$ equals $BA^B_t\cdot(I^T_t + 1) = BA^B_t\cdot\frac{1}{1/(I^T_t+1)}$. Hence, the formula for the interconnect utilization sub-model equals $\frac{1}{I^T_t+1}$. From Formula (4), $I^T_t = \frac{\#M^T_{cross}\cdot PA^T - \#M^B_{cross}\cdot PA^B}{\#M^B_{cross}\cdot PA^B} = \frac{\#M^T_{cross}\cdot(1+D^T_{LLC}) - \#M^B_{cross}}{\#M^B_{cross}} = \frac{\#M^T_{cross}\cdot\left(1+\frac{L^T_{waiting}}{\#M^T_{cross}}\right) - \#M^B_{cross}}{\#M^B_{cross}} = \frac{\#M^T_{cross} + L^T_{waiting} - \#M^B_{cross}}{\#M^B_{cross}}$. The waiting distance sub-model can therefore be replaced by the waiting latency model in the implementation.
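Putting the pieces together, the interconnect utilization factor for the no hyper-threaded case can be sketched as below; the waiting-latency lookup table is abbreviated to a few entries from Table 9.

```python
import math

M_CROSS_BASE = 0.4
COEFF_CROSS = 0.199515
# Waiting-latency lookup table (abbreviated; the full table = Figure 22 points).
L_WAITING = {1: 0.0, 2: 0.190317, 6: 0.351272, 15: 0.404646}

def interconnect_utilization(n_cores: int, workload_ratio: float) -> float:
    """Scale factor applied to the multicore processor utilization in
    no hyper-threaded mode: 1 / (I_t + 1)."""
    m_cross = (0.4 / n_cores**2 * n_cores
               + COEFF_CROSS * math.log2(n_cores)) * workload_ratio
    i_t = (m_cross + L_WAITING[n_cores] - M_CROSS_BASE) / M_CROSS_BASE
    return 1.0 / (i_t + 1.0)

# Predicted band time = baseline band time / utilization.
print(interconnect_utilization(6, 4 / 23))   # ~0.884 -> bands ~13% slower
```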

[Figure 22: waiting latency vs. number of cores applied (1–15)]
Figure 22. The plot of the calculated waiting latency

Figure 23. The penalty model on the Xeon platform without the hyper-threaded mode enabled

Model Validation:

This dynamic penalty model is then built into the DES model made in Chapter IX. The inputs to this model are the bands' processing times from the 1-core Xeon environment in no hyper-threaded mode. This DES model with the dynamic penalty model is used to predict the bitmap processing time when a different number of cores is used on the Xeon platform. Figure 24 shows the average differences between the time predicted using this model and the measured processing time of the bitmaps. As in the figures used to validate the models in Chapters IX and X, the middle blue line indicates the average errors over all five available test-sets, and the dots above and below this line represent the standard deviation range in each environment of different cores. Figure 24 shows that the maximum average error is 3.06% in the 11-core environment, but the maximum difference, 3.91%, occurs in the 7-core environment when predicting the bitmaps' processing time of set-W. This can be expected, as the deviation ranges at 7 cores, as well as at 6, 8, 14 and 15 cores, are wider than in the other environments, as seen in Figure 24. This is not surprising, because the penalty model estimates the average increase of the bands' processing time, which has large deviation ranges, as seen in Figure 13. The standard deviations in these environments are all around 1.2%, which is not a value to be concerned about. Compared to the predicted results of the static model shown in Figure 15, this DES model with the dynamic penalty model gives almost the same results. This indicates that if the model is only used to predict the processing time of bitmaps for different numbers of cores on one known platform, the DES model with the static penalty model is more suitable than the one with the dynamic model, as it costs much less effort to build.

[Figure 24: difference (%) vs. number of cores applied (1–15); series: average differences]

Figure 24. Average differences in no hyper-threaded mode between the predicted and the measured bitmap processing times

2) The dynamic penalty model developed for capability 2

Model Analysis and Realization:

As indicated in the bottleneck list, when the hyper-threaded mode is enabled, the utilization of resources in both the computation and the communication domain is influenced. As $BA^{T-HT}_t = \frac{BA^{B-HT}_t}{1/(I^{T-HT}_t+1)}$, it is necessary, before getting the value of $BA^{T-HT}_t$, to first know the value of $BA^{B-HT}_t$, where $BA^{B-HT}_t = \frac{BA^B_t}{U^{B-HT}_t}$, with $U^{B-HT}_t$ the utilization of the multicore processor in the 1-core hyper-threaded environment. $BA^{B-HT}_t$ can be seen as the result of first scaling down by the computation resources' utilization ($U^{B-HT}_c$) and then scaling down by the utilization of the interconnect resources ($U^{B-HT}_{inter}$). Hence, $BA^{B-HT}_t = \frac{BA^B_t}{U^{B-HT}_t} = \frac{BA^B_t}{U^{B-HT}_c\cdot U^{B-HT}_{inter}} = \frac{BA^B_t}{U^{B-HT}_c\cdot(I^{B-HT}_{inter}+1)}$, with $I^{B-HT}_{inter} = \frac{L^{B-HT}_{LLC} - L^B_{LLC}}{L^B_{LLC}}$. As mentioned in Section A-(2) of this chapter, the utilization of the multicore processor in the 1-core hyper-threaded environment is 0.671, and $L^B_{LLC}$ equals $\#M^B_{cross}\cdot 1 = 0.4$. The scale-down factor of the computation resources then equals $U^{B-HT}_c = 0.671/\left(\frac{L^{B-HT}_{LLC}-0.4}{0.4} + 1\right)$. So the latency of the LLC misses in the 1-core environment with hyper-threaded mode first needs to be derived.

Table 10. Table used to inspect the influence on the waiting distance when hyper-threaded mode is enabled (* indicates the data comes from measurements, all in abstract units)

#cores | total LLC misses | increment of processing time* | calculated LLC latency | waiting latency | ratio between the waiting latency in HT mode and the waiting latency in no HT mode
1  | 0.31279  | 0.00%  | 0.31279  | 0        | –
2  | 0.156269 | 3.85%  | 0.324845 | 0.168576 | 0.885764
3  | 0.112882 | 12.89% | 0.353096 | 0.240214 | 0.879284
4  | 0.091115 | 13.35% | 0.354534 | 0.26342  | 0.878832
5  | 0.096884 | 14.27% | 0.357416 | 0.260533 | 0.827982
6  | 0.067993 | 18.55% | 0.370825 | 0.302832 | 0.862102
7  | 0.071023 | 18.40% | 0.37033  | 0.299307 | 0.859654
8  | 0.073744 | 17.11% | 0.366324 | 0.29258  | 0.802802
9  | 0.076209 | 16.70% | 0.365028 | 0.288819 | 0.804467
10 | 0.07846  | 17.63% | 0.367921 | 0.289461 | 0.805998
11 | 0.080529 | 18.69% | 0.371244 | 0.290715 | 0.817212
12 | 0.041222 | 19.58% | 0.374031 | 0.332809 | 0.847879
13 | 0.042112 | 20.35% | 0.376435 | 0.334323 | 0.845264
14 | 0.042944 | 19.69% | 0.374385 | 0.331441 | 0.84115
15 | 0.043725 | 20.33% | 0.376379 | 0.332655 | 0.822087

From the perspective of the core, when 2 hardware threads run on the same core, the number of bands to be processed is the same as in no hyper-threaded mode. So the formula for the inner-core LLC evictions, $\frac{0.4}{(\#cores^T)^2}\cdot\#cores^T\cdot workload\ ratio$, still holds in the hyper-threaded mode, but the workload ratio differs from the ratio in no hyper-threaded mode, as the number of bands issued to each thread is different. For example, the workload ratio in the 1-core environment with hyper-threaded mode is 12/23 instead of 23/23. And when 2 threads compete for access to the interconnect, contention happens. The formula for the cross-core LLC misses of each thread therefore becomes $0.199515\cdot\log_2(\#cores^T\cdot 2)\cdot workload\ ratio$, as the number of threads is double the number of cores. The second column in Table 10 calculates the total LLC misses $\#M^{T-HT}_{cross}$ with the formula $\#M^{T-HT}_{cross} = \left(\frac{0.4}{(\#cores^T)^2}\cdot\#cores^T + 0.199515\cdot\log_2(\#cores^T\cdot 2)\right)\cdot workload\ ratio$. The same procedure as in subsection 1) is then performed to get the waiting latency $L^{T-HT}_{waiting}$; the fifth column in Table 10 shows the results. Figure 25 shows the waiting latency in no hyper-threaded mode and in hyper-threaded mode. It can be seen that the two waiting latency curves have a similar increase pattern. The decrease of the waiting latency from no hyper-threaded mode to hyper-threaded mode is expected, as the number of LLC misses per thread also decreases. One factor is needed to scale the waiting latency $L^T_{waiting}$ to get $L^{T-HT}_{waiting}$. The last column in Table 10 shows the ratio between $L^{T-HT}_{waiting}$ and $L^T_{waiting}$; on average this ratio is 0.841463. The utilization of the computation resources in hyper-threaded mode now equals $\frac{0.671}{\frac{L^{B-HT}_{LLC}-0.4}{0.4}+1} = \frac{0.671}{\frac{\#M^{B-HT}_{cross}\cdot 1 - 0.4}{0.4}+1} = 0.8581 = 85.81\%$. To conclude, the utilization of the multicore processor $U^T_t$ equals

$$0.8581\cdot\left(\frac{0.31279-0.4}{0.4}+1\right)\cdot\frac{1}{I^{T-HT}_t+1} = 0.8581\cdot\left(\frac{0.31279-0.4}{0.4}+1\right)\cdot\frac{1}{\frac{\#M^{T-HT}_{cross}+L^{T-HT}_{waiting}-\#M^{B-HT}_{cross}}{\#M^{B-HT}_{cross}}+1} = 0.8581\cdot\left(\frac{0.31279-0.4}{0.4}+1\right)\cdot\frac{1}{\frac{\#M^{T-HT}_{cross}+0.841463\cdot L^{T}_{waiting}-\#M^{B-HT}_{cross}}{\#M^{B-HT}_{cross}}+1}.$$
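This hyper-threaded utilization chain can be turned into a small calculation; the snippet below reproduces the 85.81% computation-resource factor and sketches the full processor utilization formula (the waiting-latency lookup entries are abbreviated, and the helper names are illustrative).

```python
import math

HT_CYCLE_FACTOR = 0.671      # 1-core HT utilization from the VTune comparison
M_CROSS_B = 0.4              # baseline (1-core, no HT) LLC misses
M_CROSS_B_HT = 0.31279       # 1-core HT LLC misses per thread (Table 10)
WAIT_SCALE_HT = 0.841463     # average ratio L_waiting(HT) / L_waiting(no HT)
COEFF_CROSS = 0.199515
L_WAITING_NO_HT = {1: 0.0, 2: 0.190317, 6: 0.351272}   # abbreviated lookup

# Computation-resource utilization of each hardware thread (~0.8581).
U_C_HT = HT_CYCLE_FACTOR / ((M_CROSS_B_HT - M_CROSS_B) / M_CROSS_B + 1)

def processor_utilization_ht(n_cores: int, workload_ratio: float) -> float:
    """Overall multicore processor utilization in hyper-threaded mode."""
    m_cross_ht = (0.4 / n_cores**2 * n_cores
                  + COEFF_CROSS * math.log2(n_cores * 2)) * workload_ratio
    i_t_ht = (m_cross_ht + WAIT_SCALE_HT * L_WAITING_NO_HT[n_cores]
              - M_CROSS_B_HT) / M_CROSS_B_HT
    return U_C_HT * ((M_CROSS_B_HT - M_CROSS_B) / M_CROSS_B + 1) / (i_t_ht + 1)

print(round(U_C_HT, 4))                         # 0.8581
print(processor_utilization_ht(1, 12 / 23))     # ~0.671 for the 1-core HT case
```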


Figure 25. The calculated waiting distance in no hyper-threaded mode and hyper-threaded mode

Model Validation:

This model is built on top of the dynamic penalty model developed in subsection 1) above, and Figure 26 shows the average errors of the predicted bitmaps' processing time compared to the measured time in hyper-threaded mode. As seen in this figure, the average differences in the 15 different environments vary between 1.05% and 3.77%, and the maximum difference is 5.14%, which happens in the 3-core environment of set-W. These errors have two reasons. One is the same as for the other DES models, namely that all the developed penalty models focus on the average increment in each environment of different cores. Hence, the resulting times show some variation, as not all bands have increments close to the average increment. Another reason is the difference between the waiting latency calculated for hyper-threaded mode and the waiting latency predicted by applying a coefficient to the waiting latency model of the no hyper-threaded mode. Figure 27 shows the differences between the calculated waiting latency, indicated in the fifth column of Table 10, and the predicted waiting latency of this scaled model. It can be seen that the two curves almost overlap, but there are still small differences at the points indicated by the red ellipses. These deviations, together with the deviations caused by the first reason, result in varying errors in the different environments of the different test-sets. The average error range of 1.05% to 3.77% over all five test-sets is acceptable and still quite accurate for predicting the bitmaps' processing time in hyper-threaded mode. Compared to the results shown in Figure 16, which use the static penalty model, this DES model with the dynamic penalty model is less accurate. However, for the static penalty model, the increments with and without hyper-threaded mode have to be filled into two lookup tables. These increments are so specific that for each different environment, measurements have to be done to obtain them. That is the main reason why it cannot be used to predict the performance of the F-path application on a different platform if there are no measured bands' processing times. In this dynamic penalty model, the scale-down factor used to scale the waiting latency model is derived from the ratio between the calculated waiting latency in hyper-threaded and no hyper-threaded mode, which also uses the measured increments shown in Figure 13. However, if this waiting latency in hyper-threaded mode is still applicable on other platforms, then it can be argued that the DES model with the dynamic penalty model is more useful for prediction than the one with the static penalty model.


Figure 26. Average differences in hyper-threaded mode between the predicted and the measured bitmaps' processing time



Figure 27. Calculated waiting latency and predicted waiting latency in hyper-threaded mode

3) The dynamic penalty model developed for capability 3

Model Realization:

From the previous derivations, it can be seen that the key to an accurate prediction is an accurate estimation of #M_cross and L_waiting, where L_waiting also relates to #M_cross. Hence, to translate between platforms, #M_cross on a different platform needs to be predicted. As discussed, the cache misses relate to the amount of data loaded each time, memory allocation techniques like locality, and the cache replacement policy. On a different platform, the caches, the communication speed, and the type of the remote memory may be different, which may lead to a totally different influence on the number of cache misses and their effects. These influences are hard to inspect, and inspecting them requires a large amount of work and would increase the complexity of the modeling. Here, only the cache associativity is considered to be an effect that influences the cache misses and their latency significantly.

In the Xeon E5-2650, the LLC is a 20-way cache, while in the i7-6700, the LLC is a 16-way cache. The associativity is inversely proportional to the number of cache misses: usually, the higher the associativity, the fewer cache misses occur. Hence, the ratio of the associativities is used to scale the function of LLC misses from the previous sub-sections. L_waiting relates to #M_cross, but no function has been found so far that represents the relation between L_waiting and #M_cross. So this subsection assumes that L_waiting is linearly proportional to #M_cross, although this might not be true. The associativity ratio factor is then also applied to L_waiting^T when a different platform is used.

As a result, the predicted bands' processing time on a different platform equals BA_t^{B-P} · (#M_cross^{T-P} + R_a · L_waiting^T − #M_cross^{B-P}) / #M_cross^{B-P}, where the superscript P is a platform indication, so BA_t^{B-P} on the i7-6700 platform is BA_t^{B-i7}; and R_a is the ratio of the cache associativity of the Xeon E5-2650 platform to the cache associativity of the i7-6700 platform, i.e. R_a = (cache associativity in Xeon E5-2650) / (cache associativity in i7-6700) = 20/16 = 1.25 in this case.

And #M_cross^{T-P} is modelled by the same function, #M_cross^T = ((0.4 / (#cores)^T) · 2 · (#cores) + 0.199515 · log2(#cores^T)) · workload ratio, but additionally multiplied by R_a. The full picture of the dynamic penalty model can be shown now, and Figure 28 demonstrates this penalty model.
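
A minimal sketch of this translation step is given below; the function name is illustrative, and the waiting-latency scaling follows the linearity assumption stated above:

```python
R_A = 20 / 16   # LLC associativity ratio: Xeon E5-2650 (20-way) / i7-6700 (16-way) = 1.25

def translate_to_other_platform(m_cross_xeon, l_waiting_xeon, r_a=R_A):
    """Scale the cache-eviction figure and the waiting-latency lookup value by R_a."""
    return m_cross_xeon * r_a, l_waiting_xeon * r_a
```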

Figure 28. The dynamic penalty model to get utilization of the shared resources

Factors in this lowest layer are the parameters used when building a DES model, and this penalty model finally generates the utilization of the multicore CPU, which is used to scale the bands' processing time. The following list summarizes the sub-models shown in Figure 28:

• computation resources utilization sub-model: A constant factor which differs between hyper-threaded and no hyper-threaded mode. The model is represented by the conditional formula:
  U_c = no hyper-threaded mode ? 1 : 0.8581

• cache eviction sub-model: A model used to predict the relative number of LLC misses in a different environment. The model is represented by the function:
  #M_cross^T = ((0.4 / (#cores)^T) · 2 · (#cores) + 0.199515 · log2(#threads^T)) · workload ratio · R_a = #M_cross^{T-HT}

• waiting distance sub-model: A model used to estimate the influence of each LLC miss. Currently, no function has been found to represent this sub-model, so the waiting latency L_waiting model is realized as a lookup table. When the hyper-threaded mode is on, a scale-down factor of 0.841463 is applied to this lookup table.

• interconnect utilization sub-model: A model used to integrate the cache eviction sub-model and the waiting distance sub-model:
  U_inter^T = 1 / ((#M_cross^T + R_a · L_waiting^T − #M_cross^B) / #M_cross^B + 1)
  U_inter^{T-HT} = 1 / ((#M_cross^{T-HT} + R_a · L_waiting^{T-HT} − #M_cross^{B-HT}) / #M_cross^{B-HT} + 1)

• dynamic penalty model: A model of the utilization of the multicore CPU, an integration of its sub-models in layer 1, represented by the formula:
  U_processor^T = (no hyper-threaded mode ? 1 : 0.8581) · (((no hyper-threaded mode ? 0.31279 : 0.4) − 0.4) / 0.4 + 1) · (no hyper-threaded mode ? U_inter^T : U_inter^{T-HT})

Model Validation:

As mentioned in Chapter IX, the pre-processing time for each bitmap is almost the same on the same platform – the Xeon E5-2650. But this pre-processing time on a different platform – the i7 – should not be the same, as the working frequencies of the two platforms are totally different. Like the bands' processing time, this pre-processing time includes computation time and communication time. While the computation time can be estimated by the ratio of the working frequencies, it is hard to predict the communication time, especially since the cache behavior of the schedule & dispatch task has not yet been studied as it has for the handler tasks. Since the pre-processing time occupies only a few percent of the whole bitmap's processing time (milliseconds out of hundreds of milliseconds), it is safe to estimate the pre-processing time on the i7 by scaling the pre-processing time on the Xeon by the working frequency ratio, which means that only the computation time is considered to be scaled. Without considering over-clocking techniques (Turbo frequency), only the ratio of the given base working frequencies is used. The pre-processing time on the Xeon at its 2.0 GHz base frequency is around 5.1 load_units, and the processing time is inversely proportional to the frequency, so the estimated pre-processing time of each bitmap on the i7 at its 3.4 GHz base frequency is set to 3 load_units.
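
As a quick check of this scaling (the 2.0 GHz and 3.4 GHz base frequencies of the Xeon E5-2650 and the i7-6700 are the assumed inputs here, and the result is rounded to the value used in the model):

t_pre_i7 ≈ t_pre_Xeon · (f_Xeon / f_i7) = 5.1 · (2.0 / 3.4) = 3.0 load_units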

The inputs required by the DES model are the bands' processing times of the bands in the five test-sets on the i7 when one thread (the 1-core environment without hyper-threaded mode) is used to process the F-path application. In the previous work where the analytical model was built, a formula, called the linear processing time to band size formula, is used to translate the bands' processing time in the single-thread environment from one platform to another platform, and it has proved its accuracy. So the bands' processing time calculated with this formula can be used as the input of our model. But since this formula is confidential, the bands' processing time in single-thread mode on the i7 is measured and used as the input for validation. Table 11 shows the average errors between the processing time predicted by this final DES model and the measured processing time on the i7 instead of the Xeon.

It can be seen that the errors in the 1-core and 2-core environments on the i7 in no hyper-threaded mode are insignificant: the maximum error, indicated by max. in Table 11, is 1.10% in the 1- and 2-core environments. But the average error becomes 7.46% in the 3-core environment, with the maximum error being 8.25%. The maximum error in hyper-threaded mode is 7.67%, also in the 3-core environment. The reason for these relatively large errors is the simple relation used to translate the cache eviction model and the waiting latency model. From the analysis in this subsection, the ratio of cache associativity is used to translate both models. But actually, as discussed in the introduction of this section B, the cache evictions relate to many aspects, and it is not possible to model all of them. The ratio of cache associativity may not be that representative, but it is the easiest way to translate the cache eviction model. And as said before, the waiting latency model is realized as a lookup table, and its relation to the cache misses at each core has not yet been found. So scaling this waiting latency model with that ratio may be too rough, but it is a realistic way to get a waiting latency model for a different platform. An interesting observation is that scaling the waiting latency model in no hyper-threaded mode on the i7 by the factor 0.841463 derived in subsection 2) is applicable to estimate the waiting latency in hyper-threaded mode on the i7, as indicated by Figure 29. This means that the factor 0.841463 can be applied to translate the waiting latency model in no hyper-threaded mode to the one in hyper-threaded mode on a different platform. Although this DES model with the dynamic penalty model is less accurate when predicting the bitmaps' processing time on the i7 in the 3-core environment than in the other environments, it is still applicable to estimate the performance of the F-path application on the i7 because of the relatively low complexity of the modeling and its relative accuracy (the maximum error does not exceed 10%). Although no measured data is available to validate this DES model on platforms other than the i7-6700 and the Xeon E5-2650, the model can still be used to predict the bitmaps' processing time there. This is because only the cache associativity is considered as a factor when translating the models, and this information can easily be found on the Internet. The measured inputs for a different platform can be calculated with the linear processing time to band size formula, which is available to colleagues at Océ.

Table 11. Average differences between the predicted and the measured bitmaps processing time on i7

      | no hyper-threaded mode                                         | hyper-threaded mode
      | average differences | standard deviation | max.  | min.  | average differences | standard deviation | max.  | min.
core1 | 0.13%               | 0.09%              | 0.27% | 0.02% | 1.54%               | 1.08%              | 3.36% | 0.01%
core2 | 0.51%               | 0.33%              | 1.10% | 0.22% | 1.86%               | 1.60%              | 5.01% | 0.63%
core3 | 7.46%               | 0.48%              | 8.25% | 6.75% | 5.60%               | 1.69%              | 7.67% | 2.93%


Figure 29. Calculated waiting latency and predicted waiting latency in hyper-threaded mode on i7

XII. COMPARISONS TO THE ANALYTICAL MODEL

As mentioned in Chapter V, a DES model may not be as accurate as an analytical model, so it is necessary to compare the accuracy of these two kinds of models. Figures 30 and 31 show the average differences on the Xeon platform of the three existing models – the analytical model, the DES model with the static penalty model and the DES model with the dynamic penalty model – in no hyper-threaded and hyper-threaded mode. From both figures, it can be seen that the bitmaps' processing time predicted by the analytical model has larger differences to the measured time than the two DES models. The difference in the 11-core environment is 26.33% in no hyper-threaded mode and 58.38% in hyper-threaded mode. This is not what was claimed by the previous work that built this analytical model. Such a contradiction arises because the accuracy claimed for the analytical model comes from 8-band bitmaps, while the DES models built in this report focus on 23-band bitmaps. So it can be stated that the analytical model is not that accurate when bitmaps are divided into 23 bands, and in some cases it cannot even predict the performance, as the errors are too significant. The reason why the analytical model is not suitable for these 23-band bitmaps is that the model regards the processing time of each band as the average time of all the bands. For instance, if the average bands' processing time in a bitmap is 15 milliseconds, then this value is used as the processing time for all the bands in this bitmap. However, there are cases in which a few of the bands in a bitmap have quite short processing times while some others take relatively long. Figure 32 demonstrates an example of such a case: it assumes a bitmap consisting of 5 bands, and four cores (also four threads) are used to process this bitmap. Figure 32(a) shows how these 5 bands with different loads are processed in a 4-core environment, and the time needed to process this 5-band bitmap is 11 time units. Figure 32(b) shows how the analytical model predicts the bitmap's processing time. It first predicts the average processing time of each band in this bitmap, which will be around 6.9 time units in this case. Then it arranges the bands on the available cores and estimates the bitmap's processing time. In the example, the predicted processing time for such a bitmap is 13.8 time units, while the real time to process the same bitmap is 11 time units. So in this example, the error is around 26.36%. The reason, intuitively indicated by Figure 32, is that the two bands processed by core No. 1 actually have much shorter processing times (1 and 2 time units) than the time predicted by the analytical model (both 6.9 time units). As a result, the bottleneck of the processing time changes from the band of 11 time units to these two bands, which are processed by the same core.
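
A small script reproducing the reasoning of the Figure 32 example is shown below. The individual band durations are assumed (only the 1, 2 and 11 time-unit bands and the 4-core setting are given in the text), so the script is illustrative rather than the exact data behind the figure:

```python
# Illustrative reconstruction of the Figure 32 example: 5 bands on 4 cores.
# The durations of the two unnamed bands are assumed; the rest follows the text.
band_times = [1, 2, 11, 10, 10.5]    # time units

# Actual schedule (Figure 32(a)): the two short bands share core No. 1,
# the three long bands each get their own core.
actual_makespan = max(1 + 2, 11, 10, 10.5)            # = 11 time units

# Analytical model (Figure 32(b)): every band is replaced by the average band time,
# and with 5 bands on 4 cores one core has to process two "average" bands.
average_band = sum(band_times) / len(band_times)      # = 6.9 time units
predicted_makespan = 2 * average_band                 # = 13.8 time units

error = (predicted_makespan - actual_makespan) / actual_makespan
print(f"actual={actual_makespan}, predicted={predicted_makespan}, error={error:.1%}")
# actual=11, predicted=13.8, error=25.5% (close to the ~26% quoted in the text)
```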

When a bitmap is divided into 23 bands, there is higher possibility that the first several bands and last several bands have much less workload than other bands than a bitmap of 8-bands (think of the margin of an A4 pdf). When these 23 bands are distributed by 11 cores in no hyper-threaded mode, 10 of the cores receive 2 bands to process, and one core receives 3 bands. This core owns the first band and the last band of the whole bitmaps, which have a high possibility with much less workload. Actually, the size of the last bands is around 1KB, while the size of other bands in most cases is around 700KB. So when the core needs to process one more band than other cores, and this extra band is the last band of the bitmap, the analytical model will overestimate the influences of this last band and predict the processing time not accurately. This is mainly the reason why large error happens on 11 core in both hyper- and no hyper- threaded modes (In hyper-threaded mode, 22 threads are used to process the 23 bands, and the first thread has to process the first and the last band). However, the analytical model is still valuable to estimate the processing time in an early design phase. Since it focuses on the performance when the workload is almost equally distributed, so comparisons between the analytical model and the DES models are unfair. But as the processing time of the F-path application in different environments has not been measured when bitmaps are divided into 8 bands, so comparisons between the analytical model and DES models for 8-bands bitmaps are not analyzed at the moment, but can be analyzed in the future.
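
The 23-band distribution described here can be illustrated with a short script; the band sizes and the in-order dealing of bands to cores are assumptions for illustration, not measured data:

```python
# Illustration of 23 bands spread over 11 cores (no hyper-threaded mode).
# Band sizes are assumed: ~700 KB for the regular bands and ~1 KB for the last band.
n_cores = 11
band_sizes_kb = [700] * 22 + [1]                       # 23 bands, tiny last band

# Deal the bands out over the cores in order: core 0 gets bands 1, 12 and 23.
per_core = [band_sizes_kb[i::n_cores] for i in range(n_cores)]

print([len(bands) for bands in per_core])   # [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
print(per_core[0])                          # [700, 700, 1] -> the extra band is tiny
```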



Figure 30. Average differences between predicted and measured bitmaps' processing time of the three models in no hyper-threaded mode


Figure 31. Average differences between predicted and measured bitmaps' processing time of the three models in hyper-threaded mode



Figure 32. An example showing how the analytical model averages the bands' processing time

The two DES models built in this report are much more accurate in predicting the processing time of 23-band bitmaps than the analytical model. The DES models do not predict the average bands' processing time, but the increments of the bands' processing time, as seen in the example given in Figure 14. So they are more robust than the analytical model when the sizes of the bands differ strongly. This confirms the reason why a DES model is required, as indicated in Chapter V.

In the validation of the dynamic model in the last chapter, comparisons between the two DES models, with the static penalty model and with the dynamic penalty model, are given. In general, the DES model with the static penalty model is more accurate in estimating the processing time of bitmaps on the Xeon platform. But it cannot estimate the bitmaps' processing time on other platforms if no measurements are done before building another static penalty model. The DES model with the dynamic penalty model does not improve on the accuracy of the one with the static penalty model, and has relatively large errors in some cases on the Xeon platform. However, it can be used to predict the processing time on other platforms without measuring the bands' processing time in all the environments of different cores on that platform. So the DES model with the dynamic penalty model is more powerful in prediction than the DES model with the static penalty model, although more effort is put into building this dynamic model. When combined with the DES model of the P-path application, the DES model with the dynamic penalty model is used.

XIII. COMBINED DATAPATH DES MODEL WITH THE DYNAMIC PENALTY MODEL

When combining the two DES models into one to estimate the performance of the entire datapath, the model of the multicore processor in the P-path application is replaced by the multicore processor model developed in the DES model of the F-path application, but the abstract piece-wise linear model of the multicore processor's utilization is kept for scaling the tasks from the P-path application. All other resources in the P-path DES model remain the same, and are not shared with the F-path application. As discussed in Chapter VI, this combined datapath model has a high validation cost. The trust level given to this non-validated model comes from two aspects. One is the accuracy of the models constituting it, which should be validated before combining the models. The other aspect is a reasonable explanation of the influences when two systems share the same resources. The accuracy of the DES model of the F-path application with the dynamic penalty model has been validated in Chapter XI, and that of the P-path application has been validated in previous work, so it is necessary to analyze the influences on the shared resources – the multicore processor in this case – that finally result in the variation of the processing time.

From the system description in Chapter III, tasks from the F-path application have a higher priority than tasks from the P-path application. The computation resources inside the multicore processor are not shared between the two applications. But the resources used for communication, the LLC and the interconnect in the multicore processor model, are shared between the applications. Hence, the sub-models shown in the communication-time-related zone in Figure 20, including the model of Figure 28, need to be enhanced to be able to simulate the possible corresponding performance when the P-path and the F-path application run on one platform.

The LLC eviction model derived in Chapter XI is a logarithmic function: as the number of cores increases, the overall number of LLC misses also increases logarithmically. One reason for this logarithmic growth may be the uneven distribution of the workload when different numbers of cores are used. Think of a bitmap with 23 bands. In the 2-core environment, one core processes 12 bands, of which the last band has much less workload than the other bands, and the other core processes 11 bands. But when 15 cores are used, for instance, 8 of the cores have to process 2 bands, while the remaining cores process only one band. In this way, when the workload of the middle bands is quite similar, only 7 cores (excluding the one that processes the last band) compete for access to the LLC and then to the interconnect while processing their respective second band. So in this 15-core environment, the function of LLC misses reflects LLC contention of all 15 cores when processing their respective first band and LLC misses of 7 cores when processing the second band. However, when the P-path application shares the LLC with the F-path application, the cores that are free from processing bands are used to process tasks from the P-path. As a result, the LLC misses no longer relate to the distribution of the bands' workload but to the elapsed time of the application, and the contention of all the applied cores on the LLC happens during the entire elapsed time of the datapath application. So a reasonable model for the growth of the LLC misses is a linear increase with the number of cores used. This can serve as an upper bound on the number of LLC misses: with more cores applied, the elapsed time of the datapath application becomes shorter, and so does the time during which contention on the LLC happens, so the LLC misses should always stay below this linear upper bound. Figure 33(a) shows this upper bound on the LLC misses for the entire datapath application compared with the LLC misses of the F-path application alone in no hyper-threaded mode. The curve indicating the upper bound is obtained by replacing the logarithmic part of the LLC-miss function with a linear increase in the number of cores, equal to 0.199515 · (#cores − 1).

Tasks from the P-path application can be seen as tasks of the F-path application but with a lower priority. If the interconnect resource can recognize the priority, then the waiting latency model for the F-path application should stay the same, and the waiting latency model for the P-path application needs to be modified based on the algorithm the interconnect uses to handle the priority. But if the interconnect resource does not recognize requests from tasks of different priority, then the waiting latency has no relationship to the application the requests belong to. In general, the more cores are used and the more LLC misses occur in one application, the more likely it is that requests to access the interconnect happen at the same time. The possible worst case is that all these requests happen at the same time. Although this could serve as an upper bound for the waiting latency model, it is too pessimistic. The waiting latency model shown in Figure 22 indicates that not all requests from LLC misses happen at the same time. But since no relation has been found between the waiting latency on the one hand and the LLC misses and the number of cores used on the other, it is hard to determine how many requests from LLC misses could happen concurrently in the entire datapath application. Hence, when combining the two DES models, the waiting latency can be any value between the best-case situation and the proposed worst-case situation. In the best-case situation, the waiting latency model can be the same as the one in Figure 22, implying that the tasks from the P-path application have a less significant influence. The upper bound of the waiting latency for the possible worst case is given in Figure 33(b). It uses the possible worst-case LLC misses shown by the blue curve in Figure 33(a). The average waiting distance for the LLC misses at each core can then be #cores applied · (#cores applied + 1) / 2, indicating that all these misses could send their requests to the interconnect at the same time. With the LLC misses of each core (0.199515 · (#cores − 1) · workload ratio) and the average distance that could occur in the worst case, the orange curve in Figure 33(b), representing the waiting latency, is derived by multiplying these two and subtracting the LLC misses.
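
The two pessimistic bounds can be written out as follows (a sketch; the function names and the workload-ratio argument are illustrative, the constants are the ones used above):

```python
# Pessimistic bounds for the combined datapath (constants as in the text).

def llc_misses_upper_bound(n_cores, workload_ratio):
    """Linear upper bound on the cross-core LLC misses per core when the P-path
    tasks keep all cores busy: the logarithmic term becomes linear in #cores."""
    return 0.199515 * (n_cores - 1) * workload_ratio

def waiting_latency_upper_bound(n_cores, workload_ratio):
    """Worst-case waiting latency: every miss of every core requests the interconnect
    at the same time, so the average waiting distance is #cores * (#cores + 1) / 2."""
    misses = llc_misses_upper_bound(n_cores, workload_ratio)
    avg_distance = n_cores * (n_cores + 1) / 2
    return misses * avg_distance - misses
```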


Figure 33. An upper bound for the LLC misses (left) and waiting latency when no priority based algorithm is applied in the interconnect (right) when the two applications are combined together

The combined model has not been built in the Java environment yet, so the predicted processing time for the whole datapath application is not shown here. But since this model cannot be validated anyway, as discussed, it is not that important to give the predicted time at this moment.

XIV. CONCLUSION

Two questions are answered in this report. The first one is whether the accuracy of the analytical model of the F-path application can be improved by other modeling techniques. The second question is whether there is a necessity to develop a DES model if it can improve the accuracy and, if so, how to build this DES model. It is discussed in the report that the accuracy of the analytical model can be improved when the predicted bands' processing time is not just the average value of all the bands' processing times in a bitmap, and building a DES model is a way to achieve this. Besides the possibility to increase the accuracy for the F-path application, another advantage of a DES model is its flexibility to inspect the influences of different scheduling algorithms. Although other scheduling algorithms have not yet been embedded in the two DES models developed in this report, it is easy to build these algorithms into the models and check their influences by inspecting the generated Gantt chart. Moreover, a DES model with detailed characteristics of the resources shared with the P-path application can help to investigate the influences when two applications are on the same platform. Hence, it is concluded in the report that building a DES model is quite useful to predict the system's performance.

Two approaches are applied in the report to develop the DES model. The DES model using the top-down design approach finds the increments for the environments of different cores in both hyper-threaded and no hyper-threaded mode and fills these increments into a lookup table. Although this model gives more accurate results than the other DES model within its capability to predict, it needs measurements of all the bands' processing times in all the different environments to find the increments in the lookup table. If no measured data is available, for instance the bands' processing time on a different platform, it cannot predict the bitmaps' processing time. So actually, the top-down model cannot really predict, as the results are known before the model simulates the application. The DES model using the bottom-up design approach is more complicated to build than the one using the top-down design approach. However, it can really predict the performance of a different environment without measurements being performed beforehand. The bitmaps' processing time predicted by this model on the target platform – the Xeon – differs by around 1% to 3% from the measured bitmaps' processing time during the validation. The error of this model when predicting the F-path application on a different platform can be larger than 3%. In the report, the i7 is used as another platform to validate the DES model built with the bottom-up approach. From the validation results, the maximum error is around 8% on the i7 platform. One reason for this relatively large error may be that the cache eviction sub-model used to estimate the LLC misses is not representative enough when only the cache associativity is considered. Another reason could be that the waiting latency sub-model, used to find the penalty of the LLC misses, is built as a lookup table, and its relation to the LLC misses at each core could not be identified. This results in the predictions being less accurate on a new platform. However, a maximum difference of 8%, within 10%, still indicates that the model is accurate enough for this application. Hence, it is concluded that this bottom-up DES model has the capability to predict the performance of the F-path application in different environments.

Both DES models are much more accurate in predicting the bitmaps' processing time than the analytical model when the bitmaps are divided into 23 bands. For 23-band bitmaps, there is a higher probability of an uneven workload distribution over the applied cores, and it is observed that the analytical model does not perform well when the workload per core varies a lot. The DES models compensate for this disadvantage by predicting the increments of the bands' processing time instead of the bands' processing time itself. But this also raises the question whether the DES models can predict the bitmaps' processing time for other band divisions. The answer is positive when the F-path application runs in an environment with no more threads than bands, since with more threads than bands there are threads that are not issued any workload during the elapsed time of the F-path application. When the number of bands is greater than or equal to the number of cores used, then the more bands in a bitmap, the fewer LLC misses can happen between cores while processing each band. But since this also increases the number of bands each thread has to process, in general the total LLC misses should differ only little from the total LLC misses of a 23-band division.

Two upper bounds to scale the models of the two shared resources – the cache eviction and waiting latency sub-models – are analyzed in the report for the case that the P-path and F-path application are executed on the same platform. They provide pessimistic predictions of the processing time of the entire datapath. This combination of the two DES models is not realized in the Java model, as no data is available for validation; so even when the model is implemented and the predicted results are collected, it is not possible to analyze its accuracy. Still, with the provided upper bounds of the two shared resources' models, the probability distribution of the performance can be estimated to see whether this combined datapath implementation can meet the throughput constraints.
