HSTREAM: A directive-based language extension for heterogeneous stream computing

Suejb Memeti and Sabri Pllana
Department of Computer Science, Linnaeus University, Växjö, Sweden
[email protected], [email protected]

Abstract—Big data streaming applications require utilization of heterogeneous parallel computing systems, which may comprise multiple multi-core CPUs and many-core accelerating devices such as NVIDIA GPUs and Intel Xeon Phis. Programming such systems requires advanced knowledge of several hardware architectures and device-specific programming models, including OpenMP and CUDA. In this paper, we present HSTREAM, a compiler directive-based language extension to support programming stream computing applications for heterogeneous parallel computing systems. The HSTREAM source-to-source compiler aims to increase programming productivity by enabling programmers to annotate the parallel regions for heterogeneous execution and to generate target specific code. The HSTREAM runtime automatically distributes the workload across CPUs and accelerating devices. We demonstrate the usefulness of the HSTREAM language extension with various applications from the STREAM benchmark. Experimental evaluation results show that HSTREAM can keep the same programming simplicity as OpenMP, and the generated code can deliver performance beyond what CPUs-only and GPUs-only executions can deliver.

Index Terms—stream computing, heterogeneous parallel computing systems, source-to-source compilation

I. INTRODUCTION

Nowadays, a huge amount of data is generated through various mechanisms, such as scientific measurements and experiments (including genetics, physics, and astronomy), social media (including Facebook and Twitter), and health-care [6, 10]. The current challenges of big data include storing and processing very large files.

However, in big data applications, not necessarily the entire data has to be processed at once. Furthermore, in most big-data applications the data may be streamed, which means flowing in real-time (for instance, data coming from different sensors in the Internet of Things), and therefore it may not be available entirely. In such cases, the data needs to be processed in chunks and continuously [13].

Heterogeneous parallel computing systems comprise multiple non-identical processing units (PU), including CPUs on the host and accelerating devices (such as GPUs, Intel Xeon Phis, and FPGAs). Most of the top supercomputers in the world [20] comprise multiple nodes with heterogeneous processing units. For instance, the nodes of the current number one supercomputer in the TOP500 list consist of two IBM POWER9 CPUs and six NVIDIA Volta V100 GPUs.

While the combination of such heterogeneous processing units may deliver high performance, scalability, and energy efficiency, programming and optimizing such systems is much more complex [3, 12, 4]. Different manufacturers of accelerating devices prefer different programming frameworks for offloading (that is, transferring the data and control from the host to the device). For instance, OpenMP is used to offload computations to Intel Xeon Phi accelerators, whereas CUDA and OpenCL are used for offloading computations to GPUs.

The programming complexity of such systems leads to system underutilization. For example, applications designed for multi-core processing are able to utilize the available resources on the host CPUs; however, while the host CPUs are performing the actual work, the accelerating devices remain idle. On the other hand, most applications designed for accelerating devices use the CPU resources just for performing the data transfers and the initiation of kernels, which is often performed by a single thread. Modern multi-core CPUs comprise a larger number of cores/threads, hence most of the CPU resources remain idle.

Researchers have proposed different techniques to address the challenges of heterogeneous parallel programming and big data. For instance, source-to-source compilation techniques have been proposed to ease programming of data-parallel applications [22, 8, 1, 18]. Similarly, approaches based on C++ templates are used for stream and data parallel computing systems [5, 7]. Pop and Cohen [16] propose a language extension to OpenMP for stream computing on multi-core architectures. To the best of our knowledge, there does not exist any source-to-source compiler that supports stream computing for heterogeneous parallel computing systems.

In this paper, we present HSTREAM, a compiler directive-based language extension that supports heterogeneous stream computing. HSTREAM aims to keep the same simplicity as programming with OpenMP and to enable programmers to easily utilize the available heterogeneous parallel computing resources on the host (CPU threads) and device (GPUs, or Intel Xeon Phis). The overview of the HSTREAM solution is depicted in Fig. 1.

The HSTREAM source-to-source compiler performs several analysis steps (including lexical, syntactical, and semantical) and generates target specific code from a given source code annotated with HSTREAM compiler directives and a PDL file that describes the hardware architecture. HSTREAM supports code generation for multi-core CPUs using OpenMP, GPUs using CUDA, and Intel Xeon Phis (also known as MIC) using Intel Language Extensions for Offloading (LEO). The HSTREAM runtime is responsible for scheduling the workload across the heterogeneous PUs.

Fig. 1: Overview of the proposed approach. [Figure: the HSTREAM compiler takes as input HSTREAM code and a PDL file, performs lexical, syntax, and semantic analysis followed by code generation, and outputs CPU, GPU, and MIC code that is executed by the HSTREAM runtime.]

We use the HSTREAM source-to-source compiler to generate the heterogeneous version of the STREAM benchmark [9]. We evaluate the generated heterogeneous STREAM benchmark with respect to programming productivity and performance. The experimental results show that HSTREAM keeps the same simplicity as OpenMP, and the code generated for execution on heterogeneous systems delivers higher performance compared to CPUs-only and GPUs-only execution.

Major contributions of this paper include:
• HSTREAM compiler - a source-to-source compiler for generating target specific code from high-level directive-based annotated source code.
• HSTREAM runtime - a runtime system for scheduling the workload across various non-identical processing units.
• Evaluation of the usefulness of HSTREAM using applications from the STREAM and STREAM2 benchmarks.

The rest of the paper is structured as follows. The design, execution model, and implementation aspects of HSTREAM are described in Section II. Section III describes the experimental environment (including the system configuration, STREAM benchmark, and evaluation metrics) and experimental results. We compare and contrast our work with the current state-of-the-art in Section IV. Section V concludes this paper and provides information about future work.

II. HSTREAM: LANGUAGE EXTENSION TO SUPPORT HETEROGENEOUS STREAM COMPUTING

In this section, we describe the design, the execution model, and the implementation aspects of HSTREAM.

A. Design

OpenMP 4.5 supports offloading of computations to accelerators. However, a single loop is usually offloaded to a single device, and while one device is performing some computations, the other PUs (including host CPUs and accelerators) remain idle. Distributing the data and computations of the same loop across multiple accelerating devices and host CPUs requires additional programming investment.

HSTREAM enables the automatic distribution of data and computations across different PUs. It enables programmers to easily exploit the available resources in heterogeneous parallel computing systems. The source-to-source code generation helps to reduce the programming time investment and the errors that may occur when explicitly handling communication and synchronization between various processing units.

While HSTREAM is designed for heterogeneous computing of stream applications, it can also be used for data-parallel applications. In the context of HSTREAM, data-parallel applications process data that is entirely available in memory, whereas stream computing processes data that is streamed from an I/O device. Streams are read in chunks, stored in local buffers first, processed by multiple heterogeneous PUs, and transformed back to a stream or stored somewhere in memory.

Listing 1 shows the syntax of the HSTREAM compiler directive, which starts with the #pragma keyword followed by the hstream keyword that stands for heterogeneous stream. Thereafter, multiple clauses can occur in any order. Details about each of the directive clauses are provided below.

Listing 1: An example of the HSTREAM compiler directive.

    #pragma hstream in(...) out(...) inout(...) device(...) scheduling(...)
    {
        // body
    }

In clause: The in clause is used to indicate the data that should be transferred to the accelerating devices for processing. The syntax for the in clause is inspired by the Intel LEO and looks as follows: in(variable_ref [, variable_ref ...]). The variable_ref can be a simple scalar, an array, or a stream. For example, in in(a), a may be either a simple data type or an array, whereas in in(a : double), a is a stream of data of type double. The in clause may accept multiple variables of different types. For instance, in in(a, b, c : int), the first variable (a) is an array, b is a scalar value, and c is a stream of integers. The in clause can be used multiple times within the same directive.
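To illustrate how these clauses compose, consider the following hypothetical snippet (our own illustration, not an example taken from the paper): it annotates a SCALE-like kernel so that a stream b of doubles is processed jointly by the PUs with ids 0 and 1 in uniform chunks of 4096 elements.

    // Hypothetical usage example (ours): scale the stream b into a.
    // PUs 0 and 1 cooperate; each claims uniform chunks of 4096 elements.
    #pragma hstream in(b : double) in(scalar) out(a) device(0, 1) scheduling(4096)
    {
        a = scalar * b;
    }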
For instance, in(a, b) in(c : int) is equivalent to the example above.

Out clause: The out clause is used to indicate the variables that need to be transferred from the accelerators to the host memory. The syntax for the out clause is similar to the in clause and looks as follows: out(variable_ref [, variable_ref ...]).

InOut clause: The inout clause is used to indicate the variables that need to be transferred to and back from the accelerating devices. The syntax looks as follows: inout(variable_ref [, variable_ref ...]). The inout clause combines the functionality of the in and out clauses. For example, inout(a) has the same functionality as using the in clause and the out clause separately (in(a) out(a)).

Device clause: In comparison to the existing OpenMP device clause, which is used to specify only one accelerating device id as the offloading target, the device clause of the HSTREAM language extension allows providing a list of PU ids that will collaboratively process the input data elements provided using the in, out, and inout clauses. The syntax of the device clause looks as follows: device(device_id [, device_id ...]). The following examples describe different scenarios of the use of the device clause: (1) device(*) is the default value of the device clause, which means that all PUs should be used, including the CPUs and accelerating devices; (2) device(1) means that only the PU with id 1 should be used; (3) device(0, 1, 2) means that the PUs with ids 0, 1, and 2 will work together to process the data. Details for each PU (such as id, number of cores, cache size, core frequency, and global memory) are extracted from the platform description file (.pdl), which needs to be provided as input to our compiler. The PDL is a platform description language for the explicit description of heterogeneous parallel computing systems [19].

Scheduling clause: The scheduling clause is used to determine the sizes of the data chunks for each PU. There are different scenarios for using the scheduling clause: (1) Device specific distribution, where the programmer explicitly sets the chunk size for each processing unit. The syntax looks as follows: scheduling(device_id : chunk_size [, device_id : chunk_size ...]); (2) Uniform distribution, where the programmer explicitly sets a constant uniform distribution for all PUs. The syntax looks as follows: scheduling(chunk_size); and (3) Automatic distribution, where the HSTREAM runtime system automatically determines the chunk sizes for each of the PUs. The syntax for automatic distribution looks as follows: scheduling(AUTO).

B. The execution model

Figure 2 depicts an overview of the HSTREAM execution model. The main components of our solution are the data producer, the data processor, and the data store. The data producer reads the data in batches from an input stream (which can be a file, data coming from the network, ...). The data processor stores the batches locally and then applies a specified function to each data item. Once the data is processed (consumed), the output is sent to the third component (named data store) to either write the data to a file or print it.

Fig. 2: The execution model of the HSTREAM solution for stream computing in heterogeneous parallel computing systems. [Figure: a pipeline of three components, Data Producer, Data Processor/Consumer, and Data Store.]

Please note that this process can be overlapped, which means that while the data producer is reading one batch, the data consumer can process another batch, and at the same time the data store can write another batch to a file. Figure 3 shows an example of overlapping the reading, processing, and writing of data. To achieve this, we need three separate threads: T1 is responsible for reading the data, T2 is responsible for initiating the data processing, and T3 writes the data. First, the data producer starts reading the first batch and then notifies T2 that there is data available for processing. While T2 reads data, T1 can continue reading the next batch. When T2 finishes processing the first batch, it will notify T3 that there is data ready to be written. Then, T3 will start writing the output data to a file. While T3 is writing the first batch, T2 may start processing the second batch (only if T1 has finished reading batch 2), and T1 can start reading batch 3.

Fig. 3: An example of overlapping data read, data process, and data write.

    T1: Read     batch_1  batch_2  batch_3  batch_4  ...
    T2: Process  idle     batch_1  batch_2  batch_3  batch_4  ...
    T3: Write    idle     idle     batch_1  batch_2  batch_3  batch_4  ...
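The three-stage overlap of Fig. 3 corresponds to a classic bounded-buffer pipeline. The following self-contained sketch (our own minimal illustration, not HSTREAM runtime code; the Batch type, the queue capacity of 4, and the processing function are assumptions) shows how T1, T2, and T3 can be connected with blocking queues so that reading, processing, and writing proceed concurrently:

    // Minimal sketch of the read/process/write overlap of Fig. 3 (ours).
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    using Batch = std::vector<double>;   // assumed batch representation

    class BoundedQueue {                 // blocking queue between stages
    public:
        void push(Batch b) {
            std::unique_lock<std::mutex> lk(m_);
            not_full_.wait(lk, [&] { return q_.size() < 4; });
            q_.push(std::move(b));
            not_empty_.notify_one();
        }
        bool pop(Batch& b) {             // returns false once the stream ends
            std::unique_lock<std::mutex> lk(m_);
            not_empty_.wait(lk, [&] { return !q_.empty() || done_; });
            if (q_.empty()) return false;
            b = std::move(q_.front());
            q_.pop();
            not_full_.notify_one();
            return true;
        }
        void close() {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
            not_empty_.notify_all();
        }
    private:
        std::queue<Batch> q_;
        std::mutex m_;
        std::condition_variable not_empty_, not_full_;
        bool done_ = false;
    };

    int main() {
        BoundedQueue to_process, to_write;
        std::thread t1([&] {             // T1: data producer (reads batches)
            for (int i = 0; i < 8; ++i)
                to_process.push(Batch(4096, i));
            to_process.close();
        });
        std::thread t2([&] {             // T2: data processor
            Batch b;
            while (to_process.pop(b)) {
                for (double& x : b) x *= 2.0;  // stand-in for the annotated body
                to_write.push(std::move(b));
            }
            to_write.close();
        });
        std::thread t3([&] {             // T3: data store (writes results)
            Batch b;
            while (to_write.pop(b))
                std::cout << "wrote batch of " << b.size() << " elements\n";
        });
        t1.join(); t2.join(); t3.join();
    }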
The data processor in Fig. 2 uses all the available PUs (or specific ones, depending on what the developer has provided in the scheduling clause) to process the input data. There is a separate CPU thread controlling each PU, including the (remaining) CPU threads, GPUs, and Intel Xeon Phis. Each CPU thread (except the one that controls the CPU threads) is responsible for transferring the corresponding data chunk from the host local storage to the memory of the accelerating device. The data will be processed on the accelerating devices, and then transferred back to the host. The thread that controls the remaining CPU threads does not need to do any explicit data transfer, because the controlling thread and the processing threads share the same DRAM. Please note that the same overlapping strategy used for reading, processing, and writing data can be used within the data processing component, such that the data transfer (host-to-device, and device-to-host) is overlapped with the data processing.
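To make the role of a GPU-controlling thread concrete, the following sketch (our illustration, not generated HSTREAM code; the helper name process_chunk_on_gpu is ours, and for brevity it omits overlapping transfers with computation via CUDA streams) shows the host-to-device transfer, kernel launch, and device-to-host transfer for one TRIAD-like chunk:

    // Sketch of one step of a GPU-controlling thread (ours).
    #include <cuda_runtime.h>

    __global__ void triad_chunk(double* a, const double* b, const double* c,
                                double scalar, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = b[i] + scalar * c[i];
    }

    // Called by the CPU thread that controls one GPU; h_* point into the
    // host buffers at the start of the chunk assigned to this device.
    void process_chunk_on_gpu(double* h_a, const double* h_b,
                              const double* h_c, double scalar, int n) {
        double *d_a, *d_b, *d_c;
        size_t bytes = n * sizeof(double);
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);  // host-to-device
        cudaMemcpy(d_c, h_c, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        triad_chunk<<<blocks, threads>>>(d_a, d_b, d_c, scalar, n);

        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // device-to-host
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }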
C. Implementation

In this section, we describe the tools and techniques used to implement our source-to-source compiler. Thereafter, through an example, we describe the transformation process from high-level C++ code annotated with HSTREAM compiler directives to C++ code with OpenMP directives for execution on host CPUs, Intel LEO for execution on Intel Xeon Phi coprocessors, and CUDA for execution on GPU accelerators.

1) Source-to-Source Compiler: Figure 4 depicts an overview of the implementation steps, including the HSTREAM language definition, the front-end, and the back-end.

Fig. 4: Overview of the implementation of our source-to-source compiler. [Figure: the HSTREAM language definition combines the C++14 grammar with the #pragma hstream directives to form the HSTREAM grammar; the front-end performs lexical analysis (identifies atomic language constructs, eliminates unnecessary constructs, returns a sequence of tokens), syntax analysis (analyses whether the program is correct according to the language definition, returns a parse tree that represents a syntactically correct program), and semantic analysis (type checking, checks variable declarations, checks whether variables are defined before they are used, returns an IR); the back-end generates OpenMP code, CUDA code, and OpenMP + offload code, returning a program that can be executed on heterogeneous parallel computing systems.]

We use ANTLR4 (Another Tool for Language Recognition) [14] to write the grammar for our compiler, which we call the HSTREAM grammar. As a basis for our work, we used the C++14 grammar from the ANTLR Project GitHub repository, a collection of ANTLR4 grammars [17]. We extend the C++14 grammar with the HSTREAM compiler directives to allow developers to annotate the parts of the code that need to be executed on heterogeneous parallel computing systems. Listing 2 shows an excerpt of the HSTREAM grammar written in ANTLR4.

Listing 2: An excerpt of the HSTREAM grammar.

    directives: directive* ;
    directive: PRAGMA HSTREAM clauses body ;
    clauses: clause* ;
    clause: inclause
          | outclause
          | inoutclause
          | deviceclause
          | schedulingclause
          ;

The ANTLR tool is used to generate the parser, which is used for lexical, syntax, and semantic analysis.
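For readers unfamiliar with ANTLR4, the generated lexer and parser are driven from ordinary C++ code via the ANTLR4 C++ runtime. A minimal driver sketch follows (our own illustration; the class names HstreamLexer and HstreamParser depend on the grammar name and are assumptions):

    // Minimal driver sketch for an ANTLR4-generated parser (ours);
    // HstreamLexer/HstreamParser are the classes ANTLR would emit for a
    // grammar named Hstream, and are assumptions here.
    #include <fstream>
    #include "antlr4-runtime.h"
    #include "HstreamLexer.h"
    #include "HstreamParser.h"

    int main(int argc, char** argv) {
        std::ifstream source(argv[1]);
        antlr4::ANTLRInputStream input(source);   // raw characters
        HstreamLexer lexer(&input);               // lexical analysis -> tokens
        antlr4::CommonTokenStream tokens(&lexer);
        HstreamParser parser(&tokens);            // syntax analysis -> parse tree
        antlr4::tree::ParseTree* tree = parser.directives();
        // The parse tree can now be walked for semantic analysis and
        // code generation.
        (void)tree;
        return 0;
    }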
The lexical analyzer takes the input file, identifies the atomic language constructs (with the help of regular expressions and pattern rules), removes any unnecessary constructs, such as comments and white spaces, and returns a sequence of tokens. In the case of illegal language constructs (tokens), errors will be generated.

The sequence of tokens is passed to the syntax analyzer (also known as the parser) to check (with the help of context-free grammars) whether the input source code is correct according to the language definition. The parser generates a parse tree, which represents a syntactically correct program.

The generated parse tree is used by the semantic analyzer to judge whether the syntax derives any meaning. Some of the main tasks associated with the semantic analysis include type checking, scope resolution, checking for variable declarations, and checking whether variables are defined before they are used. If there are semantic errors, appropriate feedback will be provided, such as type mismatch, undeclared variables, multiple variable declarations within a scope, and variable access out of scope. For instance, an HSTREAM directive with two or more scheduling or device clauses is syntactically correct, but semantically wrong, because only one such clause can be provided. If no semantic errors are identified, then an intermediate representation (IR) of the source code will be created, which may be used for optimization and translation.

We do not perform any optimization of the code during source-to-source compilation, but we use the IR to generate the target-specific code. We use the String Template (ST) library [15] to generate the target source code. A simple example of the ST library that is used to generate the CUDA memory transfer statements is shown in Listing 3.

Listing 3: An example of the string template library that is used to generate the CUDA memory transfer statements.

    cuda_memcpy_host_to_device(from, to, type) ::=
        "cudaCheckError(cudaMemcpy($from$, d_$to$,
         sizeof($type$)*myN, cudaMemcpyHostToDevice));"

Based on the type of the system architecture and the type of PUs provided in the device clause, the corresponding functions will be generated. At the time of writing this paper, the HSTREAM source-to-source compiler supports code generation for OpenMP, CUDA, and Intel LEO.

2) Code transformation example: In this section, through an example, we describe the transformation process from high-level C++ code annotated with HSTREAM directives to target specific source code. The source code for the TRIAD function written in C++ with HSTREAM annotations is shown in Listing 4. Line 1 shows the HSTREAM pragma directive, which includes the hstream keyword, the in and out clauses for the data movement, and the device and scheduling clauses for the scheduling of the workload. The TRIAD function takes three array variables and one scalar variable as input and outputs the result to one of the arrays. In this example, we want to use all available PUs (device(*)), and we want a uniform distribution of the workload for each of the PUs (scheduling(4096)).

Listing 4: HSTREAM TRIAD function of the STREAM benchmark.

    1  #pragma hstream in(b, c, a, scalar) out(a) device(*) scheduling(4096)
    2  {
    3      a = b + scalar * c;
    4  }

We assume that we have a heterogeneous platform that comprises CPUs, GPUs, and an Intel Xeon Phi co-processor. Please note that we describe our hardware architecture using the XML-based platform description language [19]. The HSTREAM compiler will generate three types of functions, each designed for execution on one of the PUs. For example, an OpenMP based function will be generated for execution on multi-core CPUs, a CUDA kernel will be generated for execution on GPU devices, and another pragma-based function that uses Intel LEO will be generated for execution on the Intel Xeon Phi.

Listing 5 shows the TRIAD code generated for execution on host CPUs.

Listing 5: TRIAD function of the STREAM benchmark designed for execution on multi-core CPUs.

    #pragma omp parallel for
    for (int i = start; i < finish; i++)
        a[i] = b[i] + scalar * c[i];

The start and finish positions of each data chunk are determined within a critical section, which means that while one thread is determining its start and finish positions, no other thread can do the same. Furthermore, thread-private copies of the start and finish variables are created, to enable other threads to pick other chunks of data while another one is processing its chunk of data. Once the start and finish positions are determined, the data chunk is processed.
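To make the chunk-claiming logic described above concrete, the following sketch (our own reconstruction under the stated assumptions; the shared counter next and the function name are ours) shows how each worker claims its thread-private start and finish positions inside a critical section and then processes the chunk outside of it:

    // Our reconstruction of the chunk-claiming logic described above;
    // triad_worker would be called by each worker thread, e.g. from an
    // omp parallel region, with next shared among all workers.
    #include <algorithm>

    void triad_worker(double* a, const double* b, const double* c,
                      double scalar, long total, long chunk,
                      long& next /* shared chunk counter */) {
        while (true) {
            long start, finish;            // thread-private positions
            #pragma omp critical
            {
                start = next;              // claim the next unprocessed chunk
                next += chunk;
            }
            if (start >= total) break;
            finish = std::min(start + chunk, total);
            for (long i = start; i < finish; i++)
                a[i] = b[i] + scalar * c[i];
        }
    }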

Fig. 5: The performance of the heterogeneous STREAM benchmark when using host CPUs only, 4 GPUs, and CPU+4GPUs on our DISA system. We vary the chunk size (CS) between 2, 8, and 32 MB. The input size used in this experiment is 8192 MB. [Figure: bar chart of memory bandwidth (MB/s, 0 to 30000) for the COPY, SCALE, ADD, TRIAD, FILL, and DAXPY kernels; for each chunk size, bars compare CPU, 4GPUs, and CPU+4GPUs.]

IV. RELATED WORK

Recent versions of OpenMP and OpenACC support offloading computations to accelerating devices [21]. Offloading to a single accelerating device can be easily achieved by using simple compiler directives, whereas offloading to multiple devices requires additional programming effort. HSTREAM enables developers to use high-level compiler directives (similar to OpenMP and OpenACC) to develop applications that can be executed on multiple heterogeneous devices, such as multi-core CPUs, NVIDIA GPUs, and Intel Xeon Phi accelerators.

Zhang and Mueller [24] present GStream, a framework for data-streaming on systems accelerated with GPUs. GStream provides an application programming interface that developers can use to express the parallelism of their streaming application without explicitly writing MPI messaging or CUDA memory copy instructions. Similar to our approach, the aim is to reduce the development time. Rather than exposing system developers to a new application programming interface, we have decided to extend a well-established programming model with compiler directives.

Pop and Cohen [16] propose a stream-computing extension to OpenMP in which programmers express tasks and instructions on how data should flow among these tasks. In comparison to their work, which targets multi-core architectures, we provide stream computing support for heterogeneous parallel computing systems that may comprise multiple multi-core processors and many-core accelerating devices.

Del Rio Astorga et al. [5] and Ernstsson et al. [7] propose pattern-based high-level application programming interfaces for stream and data computing on heterogeneous parallel computing systems. While template and skeleton based libraries may be helpful for generic applications, they are not recommended in case programmers want to use data structures or algorithms optimized for a particular type of problem.

Zhang et al. [23] propose an auto-tuning approach for stream applications running on systems accelerated with Intel Xeon Phi. The authors exploit the pipeline parallelism through temporal sharing, which means that computations are overlapped with communication (data transfer from host to device, and vice versa). We employ similar techniques in our solution, but in comparison we support heterogeneous systems accelerated with GPUs as well.

With the aim of alleviating the programming of heterogeneous systems, several source-to-source compilers [2, 8, 1, 18] have been proposed that can generate target specific code from a high-level representation. In comparison, our approach targets streaming applications. Furthermore, rather than introducing a new programming language, our solution extends a well-established programming model, such as OpenMP, to enable the acceleration of streaming applications.

V. CONCLUSION AND FUTURE WORK

We presented our HSTREAM language extension to support stream computing on heterogeneous parallel computing systems. The HSTREAM source-to-source compiler can automatically generate device-specific code, such as OpenMP for CPUs, CUDA for GPUs, and Intel Language Extensions for Offloading for the Intel Xeon Phi, from high-level source code annotated with OpenMP-like compiler directives. The HSTREAM runtime is responsible for distributing the workload across the engaged processing units accordingly. We have evaluated the usefulness of our HSTREAM solution for stream computing in heterogeneous parallel computing systems with the STREAM benchmark. We have observed that while HSTREAM keeps the same programming simplicity as OpenMP, the generated code outperforms the CPU-only and GPU-only versions of the code.

Future work may focus on extending the HSTREAM runtime to support dynamic and adaptive workload scheduling. Furthermore, we aim to extend the source-to-source compiler to support additional accelerating devices (such as FPGAs) and programming frameworks (such as OpenCL).

REFERENCES

[1] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. PetaBricks: a language and compiler for algorithmic choice, volume 44. ACM, 2009.
[2] Thomas Henry Beach and Nicholas J Avis. An intelligent semi-automatic application porting system for application accelerators. In Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, pages 7–10. ACM, 2009.
[3] Siegfried Benkner, Sabri Pllana, Jesper Larsson Träff, Philippas Tsigas, Andrew Richards, Raymond Namyst, Beverly Bachmayer, Christoph Kessler, David Moloney, and Peter Sanders. The PEPPHER Approach to Programmability and Performance Portability for Heterogeneous many-core Architectures. In ParCo, 2011.
[4] Paweł Czarnul. Benchmarking performance of a hybrid Intel Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors. International Journal of Parallel Programming, 45(5):1091–1107, Oct 2017. ISSN 1573-7640. doi: 10.1007/s10766-016-0455-0.
[5] David Del Rio Astorga, Manuel F Dolz, Javier Fernández, and J Daniel García. A generic parallel pattern interface for stream and data processing. Concurrency and Computation: Practice and Experience, 29(24), 2017.
[6] Ciprian Dobre and Fatos Xhafa. Parallel programming paradigms and frameworks in big data era. International Journal of Parallel Programming, 42(5):710–738, 10 2014. ISSN 0885-7458. doi: 10.1007/s10766-013-0272-7.
[7] August Ernstsson, Lu Li, and Christoph Kessler. SkePU 2: Flexible and type-safe skeleton programming for heterogeneous parallel systems. International Journal of Parallel Programming, 46(1):62–80, Feb 2018. ISSN 1573-7640. doi: 10.1007/s10766-017-0490-5.
[8] Alcides Fonseca and Bruno Cabral. ÆminiumGPU: An Intelligent Framework for GPU Programming. In Facing the Multicore-Challenge III, pages 96–107. Springer, 2013.
[9] John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, 12 1995.
[10] Suejb Memeti and Sabri Pllana. Analyzing large-scale DNA Sequences on Multi-core Architectures. In Computational Science and Engineering (CSE), 18th International Conference on, pages 208–215. IEEE, 2015.
[11] Suejb Memeti, Lu Li, Sabri Pllana, Joanna Kołodziej, and Christoph Kessler. Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: Programming Productivity, Performance, and Energy Consumption. In Proceedings of the 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing, ARMS-CC '17, pages 1–6, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5116-4. doi: 10.1145/3110355.3110356.
[12] Suejb Memeti, Sabri Pllana, Alécio Binotto, Joanna Kołodziej, and Ivona Brandic. Using meta-heuristics and machine learning for software optimization of parallel computing systems: a systematic literature review. Computing, Apr 2018. ISSN 1436-5057. doi: 10.1007/s00607-018-0614-9.
[13] Eric Mizell and Roger Biery. Introduction to GPUs for Data Analytics. O'Reilly, 2017.
[14] Terence Parr. The definitive ANTLR 4 reference. Pragmatic Bookshelf, 2013.
[15] Terence John Parr. Enforcing strict model-view separation in template engines. In Proceedings of the 13th international conference on World Wide Web, pages 224–233. ACM, 2004.
[16] Antoniu Pop and Albert Cohen. A stream-computing extension to OpenMP. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, pages 5–14. ACM, 2011.
[17] ANTLR Project. Grammars written for ANTLR v4. https://github.com/antlr/grammars-v4/wiki, 2018. [Online; accessed 06-June-2018].
[18] Christopher J Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 49–68. ACM, 2013.
[19] Martin Sandrieser, Siegfried Benkner, and Sabri Pllana. Explicit platform descriptions for heterogeneous many-core architectures. In Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on, pages 1292–1299. IEEE, 2011.
[20] TOP500. TOP500 Supercomputer Sites. http://www.top500.org/, 2016. Accessed: July 2018.
[21] R. Xu, S. Chandrasekaran, and B. Chapman. Exploring programming multi-GPUs using OpenMP and OpenACC-based hybrid model. In 2013 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum, pages 1169–1176, May 2013. doi: 10.1109/IPDPSW.2013.263.
[22] Yonghong Yan, Jiawen Liu, Kirk W Cameron, and Mariam Umar. HOMP: Automated distribution of parallel loops and data in highly parallel accelerator-based systems. In Parallel and Distributed Processing Symposium (IPDPS), IEEE International, pages 788–798. IEEE, 2017.
[23] Peng Zhang, Jianbin Fang, Tao Tang, Canqun Yang, and Zheng Wang. Auto-tuning streamed applications on Intel Xeon Phi. 2018.
[24] Yongpeng Zhang and Frank Mueller. GStream: A general-purpose data streaming framework on GPU clusters. In Parallel Processing (ICPP), 2011 International Conference on, pages 245–254. IEEE, 2011.